These are some Pandas tricks I use frequently; I hope they're just as useful to you!
Update: we'll now use iris for this (thanks Sam!):
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(
data = iris['data'],
columns = iris['feature_names']
)
df['species'] = iris['target_names'][iris['target']]
df.head(3)
Visual aids
TQDM
I absolutely love TQDM, partly because of how much I end up coding in IPython or a Jupyter environment. It's always helpful to know how far along I am in applying some function:
from tqdm import tqdm
tqdm.pandas()
def foo(z: float) -> int:
    """
    A pretty meaningless function
    """
    if z >= 5:
        return 1
    else:
        return 0
# foo is a function to apply on a value of the column
df['sepal length (cm)'].progress_apply(lambda z: foo(z)) # watch magic happen!
100%|██████████| 150/150 [00:00<00:00, 81930.67it/s]
0 1
1 0
2 0
3 0
4 1
..
145 1
146 1
147 1
148 1
149 1
Name: sepal length (cm), Length: 150, dtype: int64
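If you're in a notebook, I think the widget-based bar from tqdm.auto (which falls back to the plain bar outside Jupyter) is a bit easier on the eyes; the setup is otherwise the same, and this is just a sketch reusing foo from above:
from tqdm.auto import tqdm  # picks the notebook widget under Jupyter, a plain bar otherwise

tqdm.pandas()
df['sepal length (cm)'].progress_apply(foo)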
Plotting!
I don't think I take advantage of this feature enough, partly because libraries like Seaborn, plotly, plotnine and Altair all work natively with DataFrame objects. But if you just want something quick, these go a long way, too:
_ = df['sepal length (cm)'].hist()
_ = df.plot(x = 'sepal length (cm)', y = 'sepal width (cm)', kind = 'scatter')
Column manipulation
Strings as aggregation functions
There are loads of these, e.g. std, mean, first, etc…
df.groupby("species").agg({
"petal length (cm)": "mean",
"petal width (cm)": "std"
})
Getting unique values
Gone are the days of using something like set(df["column"]); behold, unique and nunique are your friends!
df['species'].unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
df['species'].nunique()
3
# this also works
df[df['sepal length (cm)'] > 6].agg(
{
"species": "unique"
}
)
species [versicolor, virginica]
dtype: object
Slightly more efficient CSV reading/handling
Only getting some columns
This is a three-stage process, but it saves memory and leads to faster reading too, which is a bonus! Notice how, without this, we end up reading in the whole (rather big) TSV:
PATH = "../gene_exp/E-MTAB-5214-query-results.tpms.tsv"
gtex = pd.read_csv(PATH, comment='#', sep='\t')
gtex
Let’s do this a bit better:
def get_col(column_name: str) -> bool:
    """
    Simple function to filter column names
    """
    if "gene" in column_name.lower() or "blood" in column_name.lower():
        return True
    else:
        return False
# Get the header by reading the first line
header = pd.read_csv(PATH, nrows = 1, sep='\t', comment='#').columns
# Filter the columns of interest
usecols = [c for c in header if get_col(c)]
# now read it
gtex = pd.read_csv(PATH, usecols = usecols, comment='#', sep = '\t')
gtex
Chunkifying data
I have to credit my old colleague Magda for this trick:
import re
pattern = re.compile(r"^IL1[A-Z]$")
def filter_interleukin1(chunk):
    """
    Apply a regex to keep only the IL1 family rows in a chunk
    """
    return chunk[
        chunk['Gene Name'].apply(lambda z: bool(pattern.findall(z)))
    ]
gtex = pd.concat(
    [filter_interleukin1(chunk)
     for chunk in pd.read_csv(PATH, sep='\t', comment='#', iterator=True, chunksize=1000)]
)
gtex
Memory efficiency
This isn't something I reach for immediately, but I find it sometimes has benefits, especially when memory is a bit precious and I have to make ends meet:
pd.to_numeric(df['numeric_column'], downcast='unsigned')  # only really works for non-negative integers
df['numeric_column'].astype('Sparse[int]')                # more effective with lots of 0s
df['column'].astype(bool)                                 # is your data full of 0s and 1s...?
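As a rough sketch of the kind of saving this gives (the columns here are made up purely for illustration):
import numpy as np

# hypothetical column of small, non-negative integers
counts = pd.Series(np.random.randint(0, 100, size=100_000))
counts.memory_usage(deep=True)                                       # int64: roughly 800 kB
pd.to_numeric(counts, downcast='unsigned').memory_usage(deep=True)   # uint8: roughly 100 kB

# mostly-zero column: a sparse dtype only stores the non-zero entries
mostly_zero = pd.Series(np.zeros(100_000, dtype=int))
mostly_zero.iloc[::1000] = 1
mostly_zero.astype('Sparse[int]').memory_usage(deep=True)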
SQL(?) for Pandas
Yes, you can call SQL via Pandas, e.g.
conn = sqlite3.connect("my_database.db")  # example path; or a sqlalchemy connection... etc.
pd.read_sql("""SELECT * FROM ... """, con = conn)
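As a self-contained sketch (the table name and in-memory database here are just for illustration), you can even round-trip the iris frame through sqlite:
import sqlite3

conn = sqlite3.connect(":memory:")       # throwaway in-memory database
df.to_sql("iris", conn, index=False)     # write the frame out as a table
pd.read_sql("SELECT * FROM iris WHERE species = 'setosa'", con=conn).head()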
But you can also write string queries for your Pandas data! Let's look at iris again:
df.head(5)
# you may know this as df[df['species'] == 'setosa']
df.query("species == 'setosa'").head(10)
I find this is slightly more readable, especially when there’s lots of conditions, such as:
# you may know this as df[(df['species'] == 'setosa')&(df['sepal width (cm)'] < 3.2)]
df.query("species == 'setosa' and `sepal width (cm)` < 3.2").head(10)
Serialisation
While TSV/CSV is the go-to format for many, I've found that working with them can be a bit of a pain, especially when my files get big. Some formats I've played with lately are Apache's Feather and Parquet formats. While they sometimes don't offer as much compression as the humble gzip, they're still much faster to read; remember to have pyarrow installed!
df.to_parquet("iris.parquet")   # the filename here is just an example
pd.read_parquet("iris.parquet")
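Feather works much the same way (again, the filename is just an example):
df.to_feather("iris.feather")
pd.read_feather("iris.feather")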
Next time I'll cover numba, which has been one of the most exciting things I've worked with lately.