Python Pandas and Pandera
- Description: Pandas, Series and DataFrame, indexing with .loc/.iloc, missing data, groupby, joins/merges, time series, I/O, and Pandera schemas for runtime DataFrame validation
- My Notion Note ID: K2A-D2-4
- Created: 2023-11-12
- Updated: 2026-05-11
- License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io
Table of Contents
- 1. What Pandas Gives You
- 2. Series and DataFrame
- 3. Selection: [], .loc, .iloc
- 4. Missing Data
- 5. Transform: assign, apply, Vectorized Ops
- 6. groupby
- 7. Joins and Concatenation
- 8. Reshape: pivot, melt, stack, unstack
- 9. Time Series
- 10. I/O
- 11. Performance and Gotchas
- 12. Pandera: Schemas and Validation
- 13. References
1. What Pandas Gives You
- Tabular-data workhorse for Python data analysis
- DataFrame ≈ in-memory SQL table: columns of typed arrays + a labelled index
- Built on NumPy (most numeric columns are ndarray)
- Rich API for filter / transform / group / join
When to use:
- Ad-hoc analysis
- CSV/Parquet ETL
- Prototyping feature pipelines
When NOT to:
- Streaming bigger-than-memory data → Polars, DuckDB, PySpark
- Microsecond-latency online serving
import pandas as pd
import numpy as np
2. Series and DataFrame
- Series: 1-D labelled array
- DataFrame: dict of Series sharing the same row index
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="x")
df = pd.DataFrame({
"name": ["Yu", "Lee", "Ada"],
"age": [30, 25, 40],
"team": ["A", "B", "A"],
})
df.shape # (3, 3)
df.dtypes # name: object, age: int64, team: object
df.columns # Index(['name', 'age', 'team'], dtype='object')
df.index # RangeIndex(start=0, stop=3, step=1)
df.head(2)
df.describe() # numeric summary
df.info() # types + memory
3. Selection: [], .loc, .iloc
| Form | Meaning |
|---|---|
| df["col"] | one column (Series) |
| df[["a", "b"]] | several columns (DataFrame) |
| df[bool_mask] | row filter |
| df.loc[row_label, col_label] | label-based |
| df.iloc[row_pos, col_pos] | position-based |
| df.at[row, col] / df.iat[i, j] | fast scalar access |
df[df["age"] >= 30]
df.loc[df["team"] == "A", ["name", "age"]]
df.iloc[0] # first row
df.iloc[:2, 1:] # first two rows, cols from idx 1
df.loc[:, "name":"age"] # label slice, INCLUSIVE on both ends
# Multiple conditions: bitwise & | ~ and PARENTHESIZE
df[(df["age"] >= 30) & (df["team"] == "A")]
- Combine booleans with &, |, ~, not and/or: Python's keyword operators don't broadcast
4. Missing Data
- NaN (numeric) or pd.NA (newer nullable dtypes)
- Both propagate through arithmetic
df["age"].isna() # mask
df.dropna() # drop rows with any NaN
df.dropna(subset=["age"])
df.fillna({"age": 0, "team": "?"})
df["age"].fillna(df["age"].mean())
# Forward/backward fill (time series)
ts.ffill()
ts.bfill()
- Never compare with == NaN; NaN != NaN, so the mask is all False. Use .isna() / .notna() (quick demo below)
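A minimal sketch of why equality checks find nothing (values here are illustrative; pd and np are imported at the top):
s = pd.Series([1.0, np.nan, 3.0])
np.nan == np.nan # False: NaN compares unequal to everything, itself included
(s == np.nan).any() # False, the mask matches nothing
s.isna() # [False, True, False], the correct mask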
5. Transform: assign, apply, Vectorized Ops
df = df.assign(
age_in_months = lambda d: d["age"] * 12,
senior = lambda d: d["age"] >= 30,
)
# Element-wise op (vectorized, fastest)
df["age"] = df["age"] + 1
# Row-wise with a Python function (slow; avoid if you can vectorize)
df["label"] = df.apply(lambda r: f"{r['name']} ({r['team']})", axis=1)
# Series.map for elementwise function over a Series
df["team_name"] = df["team"].map({"A": "Alpha", "B": "Bravo"})
- Vectorized ops run in C/NumPy, 10–1000× faster than .apply(axis=1)
- Reach for .apply only when no vectorized form exists
6. groupby
g = df.groupby("team")
g["age"].mean() # mean age per team
g.agg({"age": ["mean", "max"], "name": "count"})
# Multi-key
df.groupby(["team", df["age"] >= 30])["name"].count()
# Transform = same-shape result, broadcast back
df["age_z"] = df.groupby("team")["age"].transform(
lambda x: (x - x.mean()) / x.std()
)
# Filter groups
df.groupby("team").filter(lambda g: len(g) >= 2)
- Group keys become the result's index by default
- Pass as_index=False (or chain .reset_index()) to keep them as columns, as shown below
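A minimal sketch of the difference, reusing the df from section 2:
df.groupby("team")["age"].mean() # result is indexed by the team labels
df.groupby("team", as_index=False)["age"].mean() # "team" stays a regular column, handy before a merge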
7. Joins and Concatenation
# Vertical stack (rows)
pd.concat([df1, df2], ignore_index=True)
# Horizontal stack (cols), aligns on index
pd.concat([df1, df2], axis=1)
# SQL-style join
pd.merge(orders, users, on="user_id", how="left")
pd.merge(orders, users, left_on="uid", right_on="id", how="inner",
suffixes=("_o", "_u"))
how:"left","right","inner","outer","cross"indicator=Trueadds a_mergecolumn showing source side
8. Reshape: pivot, melt, stack, unstack
# Long → wide
df.pivot(index="date", columns="metric", values="value")
df.pivot_table(index="team", values="age", aggfunc="mean")
# Wide → long
wide.melt(id_vars=["id"], var_name="metric", value_name="value")
# MultiIndex pivot
df.set_index(["team", "date"]).unstack("team")
df_multiindex.stack() # one level → into the index
- pivot errors on duplicate (index, columns) pairs
- pivot_table aggregates them (default mean), as the sketch below shows
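A minimal sketch of the duplicate-pair behaviour, with hypothetical data:
dup = pd.DataFrame({
"date": ["2026-01-01", "2026-01-01", "2026-01-02"],
"metric": ["temp", "temp", "temp"],
"value": [1.0, 3.0, 2.0],
})
# dup.pivot(index="date", columns="metric", values="value") # ValueError: duplicate entries
dup.pivot_table(index="date", columns="metric", values="value") # 2026-01-01 temp → 2.0 (mean of 1.0 and 3.0)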
9. Time Series
df["t"] = pd.to_datetime(df["t"]) # parse strings
df = df.set_index("t").sort_index()
df["2026-01"] # all rows in Jan 2026 (string slice)
df.loc["2026-01-01":"2026-01-07"]
df.resample("D").mean() # daily average
df.resample("1h").agg({"x": "mean", "y": "sum"})
df.rolling(window="7D").mean() # 7-day rolling mean
df.shift(1) # lag by 1
df.tz_localize("UTC").tz_convert("America/Los_Angeles")
- Time zones bite: always localize naive timestamps before converting (see the sketch below)
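A sketch of the pitfall, assuming a freshly built naive index:
idx = pd.date_range("2026-01-01", periods=3, freq="D") # naive, no tz attached
# idx.tz_convert("America/Los_Angeles") # TypeError: tz-naive, localize first
idx.tz_localize("UTC").tz_convert("America/Los_Angeles") # the correct two-step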
10. I/O
df = pd.read_csv("data.csv", parse_dates=["date"], dtype={"id": "string"})
df = pd.read_parquet("data.parquet") # fastest columnar format
df = pd.read_json("data.json", lines=True) # ndjson
df = pd.read_sql("SELECT * FROM users", con=engine)
df.to_csv("out.csv", index=False)
df.to_parquet("out.parquet", compression="snappy")
df.to_json("out.json", orient="records", lines=True)
- For anything larger than a few hundred MB, prefer Parquet over CSV: often 10–100× smaller and far faster to read
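When even that won't fit in memory, a chunked CSV read is a stopgap before reaching for Polars/DuckDB; big.csv and its amount column are hypothetical:
total = 0
for chunk in pd.read_csv("big.csv", chunksize=100_000): # yields DataFrames of 100k rows
    total += chunk["amount"].sum() # reduce per chunk, never hold the whole file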
11. Performance and Gotchas
- Chained assignment (df[df.x > 0]["y"] = 1) is ambiguous and may not modify df. Use df.loc[mask, "y"] = 1 instead (sketch after this list). With pandas 3.0, copy-on-write is the default and chained inplace updates raise ChainedAssignmentError; SettingWithCopyWarning has been removed.
- SettingWithCopyWarning (pre-3.0) came from chained indexing; the same one-line .loc fix applies.
- object dtype for strings is slow and memory-heavy. Use "string" (nullable) or the pyarrow-backed dtype="string[pyarrow]" for big columns.
- .apply(axis=1) runs row-by-row in Python; try vectorized arithmetic, .map, or np.where first.
- Categoricals for low-cardinality strings: df["team"] = df["team"].astype("category") saves memory and speeds up groupby.
- inplace=True is discouraged in pandas 2.x (deprecation planned); always re-assign.
- Polars is 5–50× faster for many workloads and has a saner API. Worth learning when Pandas becomes the bottleneck.
12. Pandera: Schemas and Validation
- pip install pandera
- Runtime schema validation for DataFrames
- Catches "column missing", "wrong dtype", "negative price", "duplicate ID" at the boundary, with pinpointed offending rows
import pandera.pandas as pa
from pandera.pandas import Column, Check
schema = pa.DataFrameSchema(
{
"id": Column(int, Check.gt(0), unique=True),
"name": Column(str, Check.str_length(1, 100)),
"age": Column(int, Check.in_range(0, 120), nullable=True),
"email": Column(str, Check.str_matches(r"^[^@]+@[^@]+$")),
"team": Column(str, Check.isin(["A", "B", "C"])),
"joined": Column(pa.DateTime),
},
strict=True, # fail on unexpected columns
coerce=True, # cast columns to declared dtype where possible
)
clean = schema.validate(raw_df, lazy=True) # lazy: collect ALL errors before raising
- lazy=True aggregates every failure into one SchemaErrors instead of stopping at the first, as sketched below
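A sketch of catching the aggregated failures; schema and raw_df come from the block above, and SchemaErrors is imported from pandera's errors module:
from pandera.errors import SchemaErrors
try:
    clean = schema.validate(raw_df, lazy=True)
except SchemaErrors as err:
    print(err.failure_cases) # one row per offending value: column, check, failing value, index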
Class-style schemas (preferred: closer to dataclasses and friendlier to IDEs):
from pandera.typing import Series, DataFrame
import pandera.pandas as pa
class UserSchema(pa.DataFrameModel):
id: Series[int] = pa.Field(gt=0, unique=True)
name: Series[str] = pa.Field(str_length={"min_value": 1, "max_value": 100})
age: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 120}, nullable=True)
email: Series[str] = pa.Field(str_matches=r"^[^@]+@[^@]+$")
team: Series[str] = pa.Field(isin=["A", "B", "C"])
class Config:
strict = True
coerce = True
# Use as a decorator to validate inputs/outputs of a function
@pa.check_types
def clean_users(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
return df.assign(name=df["name"].str.strip())
Custom row-wise checks:
class OrderSchema(pa.DataFrameModel):
qty: Series[int]
price: Series[float]
total: Series[float]
@pa.dataframe_check
def total_matches(cls, df: pd.DataFrame) -> Series[bool]:
return df["total"] == df["qty"] * df["price"]
Where to put it in a pipeline:
- Validate at I/O boundaries: file → DataFrame, DataFrame → DB, function inputs/outputs
- For hot-path / huge frames: Config.strict = "filter" drops unknown columns, or sample rows for spot checks (sketch below)