These are some Pandas tricks I use frequently; I hope they're just as useful to you too!

Update: we’ll now use iris for this (thanks Sam!):

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(
    data = iris['data'],
    columns = iris['feature_names']
)
df['species'] = iris['target_names'][iris['target']]
df.head(3)

Visual aids

TQDM

I absolutely love TQDM, partly because of how much I end up coding in IPython or a Jupyter environment. It's always helpful to know how far along I am in applying some function:

from tqdm import tqdm
tqdm.pandas()

def foo(z: float) -> int:
    """
    A pretty meaningless function
    """
    if z >= 5:
        return 1
    else:
        return 0

# foo is applied to each value of the column
df['sepal length (cm)'].progress_apply(foo) # watch magic happen!
100%|██████████| 150/150 [00:00<00:00, 81930.67it/s]





0      1
1      0
2      0
3      0
4      1
      ..
145    1
146    1
147    1
148    1
149    1
Name: sepal length (cm), Length: 150, dtype: int64
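
As an aside, once tqdm.pandas() has been registered, groupby objects also get a progress_apply, so grouped operations get a bar too. A minimal sketch on the iris DataFrame (one tick per group):

# progress_apply is also patched onto groupby objects by tqdm.pandas()
df.groupby('species')['sepal length (cm)'].progress_apply(lambda s: s.mean())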

Plotting!

I don’t think I take advantage of this feature enough, partly because libraries like Seaborn, plotly, plotnine and Altair all work natively with DataFrame objects. But if you just want something quick, these go a long way, too:

_ = df['sepal length (cm)'].hist()
_ = df.plot(x = 'sepal length (cm)', y = 'sepal width (cm)', kind = 'scatter')

Column manipulation

Strings as aggregation functions

There are loads of these, e.g. std, mean, first, etc…

df.groupby("species").agg({
    "petal length (cm)": "mean",
    "petal width (cm)": "std"
})
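You can also hand over a list of these names to get several aggregations per column in one go; a quick sketch on the same grouping:

# a list of aggregation names gives one column per statistic
df.groupby("species").agg({
    "petal length (cm)": ["mean", "std"],
    "petal width (cm)": ["min", "max"]
})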

Getting unique values

Gone are the days of using something like set(df["column"]); behold, unique and nunique are your friends!

df['species'].unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
df['species'].nunique()
3
# this also works
df[df['sepal length (cm)'] > 6].agg(
    {
        "species": "unique"
    }
)
species    [versicolor, virginica]
dtype: object

Slightly more efficient CSV reading/handling

Only getting some columns

This is a three-stage process, but it saves memory and leads to faster reading too, which is a bonus! Notice how, without this, the whole TSV, which can be pretty big, gets read in:

PATH = "../gene_exp/E-MTAB-5214-query-results.tpms.tsv"
gtex = pd.read_csv(PATH, comment='#', sep='\t')
gtex

Let’s do this a bit better:

def get_col(column_name: str) -> bool:
    """
    Simple function to filter column names
    """
    if "gene" in column_name.lower() or "blood" in column_name.lower():
        return True
    else:
        return False


# Get the header by reading the first line
header = pd.read_csv(PATH, nrows = 1, sep='\t', comment='#').columns

# Filter the columns of interest
usecols = [c for c in header if get_col(c)]

# now read it
gtex = pd.read_csv(PATH, usecols = usecols, comment='#', sep = '\t') 
gtex
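As a bonus, read_csv also accepts a callable for usecols, evaluated against each column name, which lets you skip the separate header read entirely; a minimal sketch reusing get_col from above:

# columns are kept whenever the callable returns True
gtex = pd.read_csv(PATH, usecols=get_col, comment='#', sep='\t')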

Chunkifying data

I have to credit my old colleague Magda for this trick:

import re
pattern = re.compile(r"^IL1[A-Z]$")

def filter_interleukin1(chunk):
    """
    Keep only the rows of a chunk whose Gene Name matches the IL1 pattern
    """
    return chunk[
        chunk['Gene Name'].apply(lambda z: bool(pattern.match(z)))
    ]


gtex = pd.concat(
    [filter_interleukin1(chunk) for chunk in pd.read_csv("../gene_exp/E-MTAB-5214-query-results.tpms.tsv", 
                                    sep='\t', comment = '#',
                                    iterator=True, chunksize=1000)]
)
gtex

Memory efficiency

This is not something I do immediately, but I find it sometimes has benefits, especially when memory is a bit precious and I have to make ends meet:

pd.to_numeric(df['numeric_column'], downcast='unsigned') # only really works for positive integers
df['numeric_column'].astype('Sparse[int]') # a sparse dtype is more effective with lots of 0s
df['column'].astype(bool) # is your data full of 0s and 1s...?
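
To check whether any of this is actually paying off, memory_usage(deep=True) is handy. A small sketch on one of the iris columns, downcasting float64 to float32:

before = df['sepal length (cm)'].memory_usage(deep=True)
after = pd.to_numeric(df['sepal length (cm)'], downcast='float').memory_usage(deep=True)
print(f"{before} bytes -> {after} bytes")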

SQL(?) for Pandas

Yes, you can call SQL via Pandas, e.g.

import sqlite3
conn = sqlite3.connect(":memory:") # or a sqlalchemy connection... etc.
pd.read_sql("""SELECT * FROM ... """, con = conn)

but you can also write string queries for your Pandas data! Let’s look at iris again:

df.head(5)

# you may know this as df[df['species'] == 'setosa']
df.query("species == 'setosa'").head(10)

I find this is slightly more readable, especially when there are lots of conditions, such as:

# you may know this as df[(df['species'] == 'setosa')&(df['sepal width (cm)'] < 3.2)]
df.query("species == 'setosa' and `sepal width (cm)` < 3.2").head(10)
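query can also reference local Python variables with an @ prefix, which keeps thresholds out of the string itself; a small sketch:

max_width = 3.2  # an illustrative threshold
df.query("species == 'setosa' and `sepal width (cm)` < @max_width").head(10)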

Serialisation

While TSVs/CSVs are the go-to format for many, I’ve found that working with them can be a bit of a pain, especially when my files get big. Some formats I’ve played with lately are Apache’s Feather and Parquet formats.

While they sometimes don’t offer as much compression as the humble gzip, they’re still much faster to read; remember to have pyarrow installed!

df.to_parquet("iris.parquet")   # filename is just illustrative
pd.read_parquet("iris.parquet")
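
Feather works much the same way through to_feather/read_feather (also backed by pyarrow); a minimal sketch with a placeholder filename:

df.to_feather("iris.feather")       # filename is just a placeholder
df_back = pd.read_feather("iris.feather")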

Next time I’ll cover numba, which has been one of the most exciting things I’ve worked with lately.