Chapter 1: The Empirical Economist's Mindset
Table of Contents
- Why should you care?
- What exactly is econometrics?
- Correlation is not causation — a class-size tale
- Experiments are gold… and rare
- The very first model — a straight line
- Three essential assumptions (in plain language)
- Why it matters in practice
- Key take-aways
- Looking ahead
On any given morning in Frankfurt am Main you might hear:
"Why do rents in Sachsenhausen rise faster than wages?"
"Does offering German courses to refugees really speed up their integration?"
"Will the new city toll cut traffic or just annoy commuters?"
Instincts, anecdotes, and even descriptive statistics can hint at answers. Econometrics turns those hints into quantified, testable, decision-ready evidence.
Econometrics = Economics + Data + Statistics
Economists propose theories (demand curves, human-capital models, monetary rules). Econometricians bring in data and statistical tools to:
- Verify — Does the theory fit the facts?
- Quantify — How big is the effect? (e.g. +7% wages per extra year of school)
- Evaluate — Did a new policy work?
- Predict — What happens if we raise the ECB deposit rate by 0.25 pp?
If you want more than educated guesses, you need econometrics.
A simple scatter plot of German school districts shows smaller classes often score higher in tests. Nice!
But richer districts both hire more teachers and fund better facilities.
If we naively act on the scatter, we could spend millions reducing class size and discover—too late—that facilities did the heavy lifting.
Econometrics provides the discipline needed to separate "looks related" from "really causes".
Randomised Controlled Trials (RCTs) shuffle treatment like a lottery. Example: randomly assign some schools extra teachers, others none, then compare outcomes.
In many economic settings RCTs are too costly, unethical, or impossible. Most evidence therefore comes from observational data (surveys, admin records, market prices).
Econometrics supplies the rules that let us learn responsibly from those less-than-perfect data.
Many questions start with the simple linear regression
yi = β0 + β1 xi + εi,
where:
- yi = outcome of interest (hourly wage)
- xi = explanatory variable (years of education)
- εi = all other influences we did not measure
The population slope that describes the true relationship is
β1 = Cov(xi, yi) / Var(xi).
Because we do not know the population moments, we estimate them with sample averages.
The resulting Ordinary Least Squares (OLS) estimator,
β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²   (with intercept β̂0 = ȳ − β̂1 x̄),
is the slope that minimises the squared vertical distances between data points and the fitted line.
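A minimal sketch with simulated data (the 0.7 "true" slope is invented purely for illustration) shows that the OLS slope really is just sample covariance over sample variance:
import numpy as np

rng = np.random.default_rng(0)
education = rng.uniform(8, 20, 500)                    # simulated years of schooling
wage = 5 + 0.7 * education + rng.normal(0, 2, 500)     # invented true slope of 0.7, plus noise

beta1_hat = np.cov(education, wage)[0, 1] / np.var(education, ddof=1)
beta0_hat = wage.mean() - beta1_hat * education.mean()
print(f"estimated slope: {beta1_hat:.3f}, intercept: {beta0_hat:.3f}")   # slope lands near 0.7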
Even this humble straight line can answer:
- "How many euro does one more year of schooling add to average pay in Frankfurt?"
- "How sensitive is household electricity use to outside temperature?"
| Assumption | What it really says | Why you should care |
|---|---|---|
| Random sampling | Each observation is like a lottery draw from the same population. | Makes sample formulas mirror true population formulas. |
| Linearity in parameters | The average effect of x on y can be summarised by a straight line (or by a line after transforming variables). | Keeps interpretation clear and maths manageable. |
| Mean independence (exogeneity) | On average, the unobserved stuff εi is not linked to xi. | Without it, the slope mixes true effect with hidden factors (ability, location, etc.). |
When these hold, OLS produces unbiased and (with more data) consistent estimates.
- Policy design — Before the city spends €100 million hiring teachers, it wants credible evidence that smaller classes cause better scores.
- Business strategy — A café chain maps foot traffic (x) to daily sales (y) to choose the next branch location.
- Personal finance — A student weighs a €30,000 MSc fee against the estimated wage premium.
Key take-aways
- Econometrics converts curious questions into measurable answers.
- Experiments are best but rare; careful modelling lets us exploit ordinary data.
- The linear regression line—powered by OLS—offers the first step toward credible causal stories, provided the core assumptions hold.
Next up is Chapter 2: Data & Random Sampling. We will see how the structure of your data (cross-section, time-series, panel) and the way you collect it can make or break every result that follows. Stay tuned!
Chapter 2: Data & Random Sampling
Table of Contents
- Why this chapter matters
- Three common data structures
- The ideal: simple random sampling
- Real-world wrinkles
- Stratified & cluster sampling in plain language
- Sample size vs. measurement quality
- Assumptions to check before any regression
- Frankfurt-flavoured examples
- Quick field checklist
- Key take-aways
- Up next
A brilliant model cannot rescue bad data. Before tuning regressions or debating p-values, you must ask:
- Where did my numbers come from?
- Do they represent the population I care about?
- Could the collection process itself bias the result?
Skipping these questions can waste millions—or worse, mislead policy.
| Structure | Quick picture | Typical use | Frankfurt-style example |
|---|---|---|---|
| Cross-section | One snapshot in time | Compare units | Hourly wages of 5,000 residents in 2024 |
| Time-series | One unit over many dates | Forecast, detect shocks | Monthly ECB deposit rate, 1999-2025 |
| Panel (longitudinal) | Many units over many dates | Control for fixed traits | Annual rent of 2,000 flats, 2015-2025 |
Each format implies different statistical challenges. Mixing them up invites wrong standard errors or spurious trends.
Mathematically we dream of i.i.d. draws:
Independent = one observation tells you nothing about the next.
Identically distributed = every observation has the same probability law.
With i.i.d. data, the sample mean
ȳ = (1/n) ∑ yi
is unbiased for the population mean μ = E[y] and has standard error
SE(ȳ) = σ / √n,
where σ² is the true variance. Double the sample size and the error shrinks by a factor of 1/√2. Clean and simple!
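A tiny simulation sketch (all numbers made up) shows the σ/√n rule in action—each quadrupling of n halves the standard error:
import numpy as np

rng = np.random.default_rng(1)
sigma = 4.0                                            # assumed population standard deviation
for n in (100, 400, 1600):
    samples = rng.normal(20, sigma, size=(2000, n))    # 2,000 simulated samples of size n
    simulated_se = samples.mean(axis=1).std()          # spread of the 2,000 sample means
    print(f"n={n:5d}  simulated SE={simulated_se:.3f}  theory σ/√n={sigma/n**0.5:.3f}")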
| Wrinkle | What goes wrong | Everyday illustration |
|---|---|---|
| Coverage bias | The sampling frame omits part of the population. | A phone survey misses renters without landlines—skewing rent estimates. |
| Non-response | Some selected units refuse or cannot answer. | High-income households decline wage interviews, pulling the sample mean down. |
| Cluster dependence | Nearby units look alike → lower effective n. | Energy use measured by apartment often correlates within buildings. |
| Time dependence | Today's value relates to yesterday's. | ECB rate cuts spread over months; successive observations aren't independent. |
Econometrics offers fixes—weights, robust errors, clustering, time-series models—but diagnosing the problem comes first.
Sometimes random draws alone are inefficient or costly.
Stratified sampling divides the population into homogeneous groups (strata) and samples within each. Example: Force equal numbers from Frankfurt's central, north, and south boroughs to ensure city-wide conclusions.
Cluster sampling first selects groups, then units inside them. Example: Choose 50 apartment blocks, then survey every flat in each block. Cheap when travel is expensive, but adds intra-cluster correlation you must correct for.
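A quick pandas sketch (fully simulated flats, hypothetical column names) contrasts the two designs:
import numpy as np, pandas as pd

rng = np.random.default_rng(2)
flats = pd.DataFrame({
    'borough': rng.choice(['central', 'north', 'south'], 9000),
    'block_id': rng.integers(1, 301, 9000),     # 300 hypothetical apartment blocks
    'rent': rng.normal(14, 3, 9000),            # € per m², simulated
})

# Stratified: force 200 flats from each borough
strat = flats.groupby('borough', group_keys=False).sample(n=200, random_state=0)

# Cluster: pick 50 blocks at random, then keep every flat inside them
chosen_blocks = rng.choice(flats['block_id'].unique(), size=50, replace=False)
clust = flats[flats['block_id'].isin(chosen_blocks)]

print(strat['borough'].value_counts())          # equal strata by construction
print(clust['block_id'].nunique(), "blocks,", len(clust), "flats")
Remember the warning above: the cluster sample's effective n is smaller than its row count, because flats within a block resemble each other.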
"Better 1,000 precise measurements than 10,000 sloppy ones."
Large n lowers random error, yet systematic error (bias) remains.
Budget-constrained? Spend first on representative coverage and accurate measurement; add records only if funds remain.
- Representativeness — Does your sample mirror the target population?
- Correct temporal ordering — Is x measured before y when you claim causality?
- Stable unit treatment value — One unit's outcome must not affect another's (violated in neighbourhood spill-overs).
- No data dredging — Pre-define key variables; avoid picking the best stories post-hoc.
Write these on a sticky note near your code editor.
Housing study:
Randomly draw flats from the city's cadastral registry. Weight by floor-space to correct for oversampling tiny studios.
Retail-foot-traffic sensor:
Sensors record counts every minute—giving dependence across time. Use Newey-West or fit an AR(1) error structure before testing intervention effects.
Refugee integration survey:
Language-school dropouts may ignore follow-up questionnaires → potential attrition bias. Design incentives or multiple contact modes to cut attrition.
- Define the population. Be explicit: "all private rental contracts signed in Frankfurt 2024."
- Audit the sampling frame. List obvious exclusions and their likely direction of bias.
- Track response rates. Overall and by key subgroup.
- Log data-collection costs. Helps choose between bigger n or better instruments next time.
- Document everything. Today's codebook is tomorrow's credibility.
Key take-aways
- Data structure (cross-section, time-series, panel) dictates the valid toolbox.
- Simple random sampling makes math easy; reality often deviates—spot how.
- Fixing bias beats inflating sample size.
- Transparent documentation is as valuable as the numbers themselves.
In Chapter 3 we draw our very first regression line, visualise why OLS "leans" the way it does, and connect geometry to intuition. Bring a cup of coffee—and your freshly cleaned dataset.
Chapter 3: Simple Linear Regression, Geometric Intuition
Table of Contents
- Why you should care
- The data in one minute
- Scatter plot → straight line
- Where the formula comes from
- A tiny code sketch (Python + pandas/statsmodels)
- Key assumptions checked against market data
- Why a one-line model already helps
- Limitations & next steps
- Take-aways
- Coming up
Investors keep asking: "If the S&P 500 jumps 1%, what usually happens to Bitcoin the same day?" Knowing the average co-movement helps with hedging, risk limits and portfolio design. The most direct way to answer is a simple linear regression of Bitcoin's daily return on the S&P 500's return.
Recent numbers show the link is real but not perfect: Rolling 30-day correlations have swung between 0 and 0.5 since 2020. That pattern begs to be quantified—exactly what this chapter does.
- Units: trading days
- Variables:
- yt = Bitcoin daily return (in %)
- xt = S&P 500 daily return (in %)
- Sample: January 2022 – June 2025 (≈ 850 observations)
- Source: Public price feeds (e.g. Yahoo Finance or a crypto exchange API)
Because both returns are recorded at the same timestamp, neither variable can "look into the future," keeping the design symmetric.
Imagine plotting xt on the horizontal axis and yt on the vertical. The cloud of dots leans upward: strong index rallies often coincide with even stronger Bitcoin moves. The best-fit line through that cloud is the simple-regression model
yt = β0 + β1 xt + εt.
The line's slope β1 captures average amplification: if β1 = 1.6, a 1% rise in the S&P 500 is typically matched by a 1.6% jump in Bitcoin. Each residual ε̂t is the vertical gap between a dot and the line. By construction, the OLS residuals are uncorrelated with the regressor and the fitted values—the precise sense in which the residual cloud sits "at right angles" to the fitted line.
For the population:
β1 = Cov(xt, yt) / Var(xt).
Replace the unknown moments with sample counterparts and you get the Ordinary Least Squares (OLS) estimator
β̂1 = ∑(xt − x̄)(yt − ȳ) / ∑(xt − x̄)²,  with β̂0 = ȳ − β̂1 x̄.
OLS is "ordinary" because it's simply the line that minimises the sum of squared residuals—the most intuitive yard-stick for closeness.
import yfinance as yf, pandas as pd, statsmodels.api as sm
btc = yf.download("BTC-USD", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
spx = yf.download("^GSPC", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
df = pd.concat({'btc': btc, 'spx': spx}, axis=1).dropna()   # align the two daily return series by date
X = sm.add_constant(df['spx'])                              # add the intercept column
model = sm.OLS(df['btc'], X).fit()                          # regress Bitcoin returns on S&P 500 returns
print(model.summary())
Typical output (abridged):
coef std err t P>|t|
const 0.05 0.09 0.5 0.62
spx 1.57 0.12 13.0 0.000
R-squared = 0.25
Slope 1.57 → Bitcoin moves ~1.6× the S&P 500 on the same day.
R² ≈ 0.25 → Index returns explain 25% of Bitcoin's daily variation—useful but still leaves plenty of independent noise.
| Assumption | Do we buy it? | Quick diagnostic |
|---|---|---|
| Linearity in mean | Returns often scale roughly linearly day-to-day | Plot residuals vs. fitted; look for curves. |
| Mean independence (exogeneity) | Same-day co-moves are concurrent, so causality isn't claimed—just association | Acceptable for risk or hedge ratios; not for "Bitcoin causes stocks." |
| Homoscedasticity & independence | Volatility clusters; residuals show ARCH effects | Robust or Newey-West standard errors fix inference without changing β̂. |
| Random sampling / i.i.d. | Market hours & holidays create mild irregularity but daily returns are a common, accepted unit | Longer horizons need time-series models, Chapter 10. |
- Portfolio hedging — If Bitcoin's beta to equities is 1.6, a $10m equity hedge fund holding $1m in BTC is effectively running an extra $1.6m of equity exposure.
- Risk limits — A trading desk can cap aggregate exposure instead of siloed crypto vs. stock buckets.
- Scenario planning — Stress tests can apply a 3σ equity shock and scale crypto moves by β̂1.
- Causality not implied — The slope is descriptive. In Chapter 9 we'll meet Instrumental Variables for causal answers.
- Non-linear tails — Extreme market days may bend the line; quantile regression is a fix.
- Time-varying betas — Rolling regressions or state-space models update β1 as correlations drift (see the sketch below).
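A minimal sketch of the rolling-beta idea, reusing the df of daily returns built earlier in this chapter (the 90-day window is an arbitrary choice):
# rolling beta = rolling covariance / rolling variance, here over a 90-day window
rolling_beta = df['btc'].rolling(90).cov(df['spx']) / df['spx'].rolling(90).var()
print(rolling_beta.dropna().tail())
# rolling_beta.plot()   # optional: visualise how the beta drifts over time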
Take-aways
- A scatter, a straight line, and two moments (covariance & variance) already yield actionable insight.
- In today's markets, Bitcoin behaves like an "equity with leverage" on many days, but only partly so.
- Geometry—residuals at right angles to the fitted line—gives OLS its neat mathematical properties and intuitive appeal.
Chapter 4 dives into software output: what every column in the regression table means, how to spot red flags, and how to translate numbers into plain English advice for your CIO. See you there!
Chapter 4: Estimating OLS & Reading Software Output
Table of Contents
- Why this chapter matters
- The estimator in one line of algebra
- Running the regression (Python example)
- Decoding each column in plain English
- Four common red flags and quick fixes
- A cheat-sheet for plain-language reporting
- Best practice checklist before you trust a table
- Key take-aways
Seeing a statistical table for the first time feels like reading a medical chart in Latin. Yet those columns decide how much capital a portfolio manager shifts, how large a subsidy a government approves, or whether a research paper survives peer review. Today you'll learn to decode every piece of a standard OLS output and spot the red flags that tell you a model can't be trusted.
Given an n×1 outcome vector y and an n×k matrix of predictors X (with a leading column of 1s for the intercept), the OLS coefficient vector is
β̂ = (XᵀX)⁻¹ Xᵀ y.
That's all software does—solve this set of linear equations—to minimise the sum of squared residuals, ∑(yi − Xi β̂)².
Everything else in the print-out (standard errors, t-values, R², etc.) is built on β̂.
import yfinance as yf, pandas as pd, statsmodels.api as sm
btc = yf.download("BTC-USD", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
spx = yf.download("^GSPC", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
df = pd.concat({'btc': btc, 'spx': spx}, axis=1).dropna()
X = sm.add_constant(df['spx']) # adds the intercept
model = sm.OLS(df['btc'], X).fit()
print(model.summary())
Typical (abridged) console output:
| Coefficient | Estimate | Std. Error | t-stat | P>|t| |
|-------------|---------:|-----------:|-------:|-----:|
| Intercept | 0.05 | 0.09 | 0.5 | 0.62 |
| spx | 1.57 | 0.12 | 13.0 | 0.000 |
R-squared: 0.25 F-statistic: 169.0 Observations: 850
| Element | What it tells you | Bitcoin-S&P example |
|---|---|---|
| Estimate (coef) | Best-fit slope/intercept | A 1% S&P jump coincides with a 1.57% Bitcoin jump, on average. |
| Std. Error | Sampling uncertainty around the estimate | ±0.12 pp around 1.57. |
| t-statistic | Estimate ÷ Std. Error | 13 → a huge signal-to-noise ratio. |
| P-value | Probability of seeing a t-statistic at least this large (in absolute value) if the true slope were 0 | <0.001 → virtually impossible under "no link." |
| R-squared | Share of y variance explained | 0.25 → S&P moves explain 25% of Bitcoin's daily wiggles. |
| F-statistic | Joint test that all slopes = 0 | 169 → the model beats a flat line hands-down. |
| Observations (n) | Sample size behind the stats | 850 trading days. |
| Red flag | Symptom in output | Likely cause | Immediate fix |
|---|---|---|---|
| "Perfect" R² = 0.99 | Too good to be true | You mistakenly regressed a variable on itself (or near-duplicates). | Check column names and lags. |
| Huge Std. Errors vs. coefficients | Coeff ≈ 0.8 but SE ≈ 1.2 | Multicollinearity (predictors highly correlated) | Drop a redundant predictor or apply ridge shrinkage. |
| Durbin-Watson < 1.0 | Serial correlation in residuals | Time-series dependence | Use Newey-West SE or move to AR models. |
| P-values tiny yet plot looks curved | Misspecification | Missing a nonlinear term | Add x² or run a spline. |
- Slope: "Bitcoin tends to move 1.6× the S&P 500 on the same day."
- Uncertainty: "The margin of error is about ±0.1×."
- Economic meaning: "Holding €100k in Bitcoin equates to a €160k equity exposure for daily shocks."
- Fit quality: "Index returns account for one-quarter of Bitcoin's day-to-day moves; other forces drive the rest."
- Visuals first — Always plot data and residuals; numbers confirm what eyes suspect (see the sketch after this checklist).
- Units & scaling — Know whether coefficients are in percent, basis points, or log points.
- Robust errors — If in doubt, request HC1/White or Newey-West SE.
- Version control — Save the code, data-cut, and seed to reproduce output later.
- Tell the story — Translate every stat into a business or policy implication.
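As promised in the checklist, here is a short sketch of the "visuals first" and autocorrelation checks, reusing the fitted model object from the code above (matplotlib assumed to be installed):
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

resid, fitted = model.resid, model.fittedvalues       # residuals and fitted values from the OLS fit
plt.scatter(fitted, resid, s=5)
plt.axhline(0, color='red')
plt.xlabel('Fitted values'); plt.ylabel('Residuals')  # curvature or fanning here is a red flag
plt.show()

print('Durbin-Watson:', durbin_watson(resid))         # values far below 2 hint at serial correlation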
Key take-aways
- The OLS table is just algebraic output from ∑(yi − Xiβ̂)²; reading it is a skill, not magic.
- Coefficients give magnitude, Std. Errors give precision, R² gives context—interpret all three together.
- Red flags (weird R², giant errors, autocorrelation) are easier to catch early than to fix late.
Chapter 5: Goodness of Fit (Without Falling in Love with R²)
Table of Contents
- Why this topic matters
- The basic anatomy: ANOVA in one glance
- Meet R²'s better-behaved cousin
- Real-world demo: predicting house prices
- Better yard-sticks
- Quick visual checks
- Common myths—busted
- A four-step fit-assessment ritual
- Key take-aways
- Looking ahead
Analysts often race straight to the R-squared line, and managers nod if it's "above 0.8." Yet a housing-price model with R² = 0.95 can completely mis-price tomorrow's listings if it's over-tuned to last year's quirks. Caring only about R² is like buying a car because the speedometer goes to 240 km/h—you haven't looked under the hood.
For any linear regression
yi = β0 + β1 x1i + ⋯ + βk xki + εi,
the total variation splits neatly:
Total SS = Explained SS + Residual SS.
Hence
R² = Explained SS / Total SS = 1 − Residual SS / Total SS.
It measures in-sample fit—nothing more, nothing less.
Adding predictors never lowers R², so a model can bloat to 100 dummy variables, boast R² = 1, and still flop. Adjusted R² penalises fluff:
R̄² = 1 − (1 − R²) · (n − 1) / (n − k − 1),
where n = observations, k = predictors. Over-fitting pushes the penalty term up and R̄² down. Still, even adjusted R² can't see the future; it judges on the same data used to fit the model.
Data: 10,000 U.S. listings, 2023 (price, size, age, bedrooms, ZIP code).
Model A (parsimonious): Price ~ Size + Bedrooms + Age.
R² = 0.72, RMSE = €46k.
Model B (overfitted): Price ~ Size + Bedrooms + Age + 400 ZIP-code dummies.
R² = 0.95, RMSE = €43k (in-sample).
Hold-out test (new 2024 data):
Model A RMSE = €50k; Model B RMSE = €69k.
Lesson: Model B dazzled with R² = 0.95 but bombed out-of-sample because ZIP codes memorised last year's quirks. High R² ≠ high predictive power.
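A minimal sketch of the hold-out logic behind that lesson, assuming a DataFrame called listings with the columns used in the demo (price, size, bedrooms, age are hypothetical names):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = listings[['size', 'bedrooms', 'age']]             # hypothetical column names
y = listings['price']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

fit = LinearRegression().fit(X_tr, y_tr)              # train on 80% of the data
rmse = np.sqrt(np.mean((y_te - fit.predict(X_te)) ** 2))   # judge on the unseen 20%
print(f"Hold-out RMSE: {rmse:,.0f}")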
| Metric | Formula | What it adds |
|---|---|---|
| RMSE | √(Residual SS / n) | Puts error in same units as y. |
| MAE | (1/n) ∑|yi - ŷi| | Less sensitive to outliers than RMSE. |
| Predicted R² / CV-score | Fit on 90%, test on 10%; rotate | Guards against over-fitting. |
| Residual plots | --- | Reveal curvature, heteroscedasticity, outliers better than any scalar. |
- Residuals vs. fitted: Curve? ⇒ add non-linear term.
- Scale-Location plot: Fan-shape? ⇒ heteroscedastic ⇒ use robust SE or transform y.
- QQ-plot of residuals: Heavy tails? ⇒ rethink normality-based inference.
| Myth | Reality |
|---|---|
| "Low R² means a useless model." | Not always; predicting stock returns often yields R² < 0.1 but can still price risk correctly. |
| "High R² proves causality." | R² is silent on why variables move together—instrumental variables or experiments test causality. |
| "Adjusted R² fixes over-fitting." | It helps, but only cross-validation truly tests future performance. |
- Start with the plot. Eyes beat statistics at spotting weirdness.
- Report at least one error metric (RMSE/MAE). R² alone is never enough.
- Run a hold-out or cross-validation. Trust the score that touches unseen data.
- Explain in plain language. "Our model prices homes within ±€50k 80% of the time" beats "R² = 0.72."
Key take-aways
- R² describes in-sample fit; it does not guarantee predictive accuracy or causal validity.
- Adjusted R² dampens the incentive to add junk predictors but isn't a silver bullet.
- Always pair R² with error metrics, residual visuals, and out-of-sample checks before trusting a model in the wild.
Chapter 6 steps into multiple regression with grace—adding controls without drowning in multicollinearity, and seeing how partial effects sharpen our real-world stories. Get ready to move from one-input lines to realistic economic models with many moving parts.
Chapter 6: Multiple Regression Without Tears
Table of Contents
- Why move beyond one-variable models?
- A running example — what drives monthly rent?
- The model in matrix form
- Quick Python sketch
- Watch for multicollinearity
- Practical fixes
- Essential assumptions revisited
- Why multiple regression pays off in real life
- Quick checklist before shipping results
- Key take-aways
- Coming attractions
Real problems rarely hinge on a single factor. Suppose you ask: "How much extra does a master's degree add to annual salary?" Education matters—but so do experience, industry, and region. Ignoring them shoves their influence into the error term and biases the education slope. Multiple regression lets us hold everything else constant and isolate each effect.
Data: 3,000 Berlin flats listed in 2024.
- renti € per m²
- sizei living area (m²)
- disti kilometres to Brandenburg Gate
- floori storey (ground = 0)
- agei building age (years)
Goal: quantify location premium after accounting for size and quality.
In matrix form the model is
y = Xβ + ε,
where y stacks the 3,000 rents, X holds a column of 1s plus size, dist, floor, and age, β collects the five coefficients, and ε the unobserved influences.
The OLS solution generalises neatly:
β̂ = (XᵀX)⁻¹ Xᵀ y.
Each β̂j measures the partial effect of its predictor while all other columns stay fixed.
import pandas as pd, statsmodels.api as sm
df = pd.read_csv("berlin_rent_2024.csv") # assume you scraped this
X = sm.add_constant(df[['size','dist','floor','age']])
model = sm.OLS(df['rent'], X).fit(cov_type='HC1') # robust SE
print(model.summary())
Example output (trimmed):
| Variable | Coef | Std Err | t | P>|t| |
|----------|-----:|--------:|--:|---:|
| const | 18.2 | 0.9 | 20.2 | 0.000 |
| size | –0.045 | 0.005 | –9.0 | 0.000 |
| dist | –0.82 | 0.07 | –11.7 | 0.000 |
| floor | 0.60 | 0.06 | 10.0 | 0.000 |
| age | –0.04 | 0.002 | –20.0 | 0.000 |
Interpretation:
- Each extra kilometre from the city centre shaves €0.82/m² off rent, holding size, floor, and age constant.
- Higher-floor flats gain €0.60/m² per storey—elevator views pay off.
- Larger flats are cheaper per m² (negative size slope), reflecting bulk discounts.
When predictors move together, XᵀX gets close to singular → huge standard errors.
Variance Inflation Factor (VIF):
VIFj = 1 / (1 − Rj²),
where Rj² comes from regressing xj on all other predictors.
Rules of thumb:
- VIF > 5 → keep an eye;
- VIF > 10 → consider dropping or combining variables.
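A quick sketch of the VIF check, reusing the design matrix X from the rent regression above:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vifs.round(2))   # ignore the constant's entry; investigate any predictor above 5-10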
| Symptom | Quick remedy |
|---|---|
| Two variables almost duplicates | Drop one or replace with average/ratio. |
| Many related quality indicators (balcony, garden, lift) | Collapse into a quality index via PCA or simple scoring. |
| Scaling differences hide collinearity | Standardise variables before diagnostic tests. |
- Linearity in means — Conditional expectation of rent is linear in predictors.
- Exogeneity — Unobserved charm (noise) is uncorrelated with size, dist, floor, age.
- No perfect collinearity — Columns of X must be linearly independent.
- Homoscedasticity (for classic SE) — Relaxed by using HC1 robust errors, as in code above.
- Policy — Urban planners can evaluate how public-transport upgrades (distance term) shift rents after isolating flat quality.
- Investment — A REIT can target high-floor units near transit, estimating marginal revenue per lift installation.
- Fairness — Salary studies gauge gender pay gaps controlling for tenure, role, and performance—crucial in court.
- Plot partial residuals to spot non-linear shapes.
- Calculate VIF; tame any culprit > 10.
- Use robust or clustered SE if heteroscedasticity or district clusters exist.
- Translate each coefficient into everyday language (€ per m², minutes, etc.).
- Store code and raw data—regressions must be reproducible.
Key take-aways
- Multiple regression isolates each predictor's effect while soaking up confounders.
- Collinearity inflates uncertainty, not bias; detect it with VIF and fix pragmatically.
- Robust interpretation = coefficient plus context: sign, size, units, and caveats.
In Chapter 7 we dive into finite-sample properties: why unbiasedness and efficiency matter when data are scarce, and how the Gauss-Markov theorem crowns OLS the Best Linear Unbiased Estimator—as long as its assumptions behave.
Chapter 7: Finite-Sample Facts & the Gauss-Markov Gold Medal
Table of Contents
- Why finite-sample properties matter
- Recap of the multiple-regression set-up
- Three performance yard-sticks
- The Gauss-Markov Theorem—OLS wears the BLUE ribbon
- A tiny Monte-Carlo to see variance in action
- When the BLUE badge doesn't protect you
- Practical take-aways for small-sample projects
- Key insight
- Looking forward
Picture a start-up that A/B-tests a new pricing page on just 42 visitors before tomorrow's investor call. The CFO runs a regression of revenue on a "New Page" dummy, gets a slope of €3.10 and a p-value of 0.04, and declares success.
Small numbers like these demand sharper questions:
- Is the €3.10 estimate on average right (unbiased)?
- How jumpy is it from sample to sample (variance)?
- Could a different linear estimator deliver tighter confidence intervals?
Finite-sample theory answers those questions before the VC money burns.
The model is the same y = Xβ + ε as in Chapter 6. Key dimensions:
- n = observations (small today)
- k = predictors (including the intercept)
With tiny n, every degree of freedom counts.
| Concept | Formal cue | Plain meaning |
|---|---|---|
| Unbiasedness | E[β̂] = β | On average, the estimator lands on the truth. |
| Variance / Efficiency | Var(β̂) | How much the estimate jitters across samples. |
| Mean-squared error (MSE) | Var(β̂) + Bias² | Overall "badness" score; low is good. |
A perfect estimator would be unbiased and have the smallest possible variance.
OLS is the Best Linear Unbiased Estimator (BLUE) under five classic assumptions:
- Linearity in parameters
- Random sampling (i.i.d.)
- No perfect multicollinearity
- Zero conditional mean: E[ε|X] = 0
- Homoscedastic errors: Var(ε|X) = σ²I
When these hold, no other linear, unbiased estimator beats OLS on variance.
That's huge: with small data you cannot shrink uncertainty by fancy algebra alone—either add observations or relax the "unbiased" requirement (ridge/Lasso, Chapter 11).
import numpy as np, statsmodels.api as sm
np.random.seed(7)
beta_true = np.array([1.0, 0.5]) # intercept, slope
n, reps = 40, 5000 # small sample!
slope_est = []
for _ in range(reps):
    x = np.random.uniform(0, 10, n)
    e = np.random.normal(0, 2, n)
    y = beta_true[0] + beta_true[1]*x + e
    X = sm.add_constant(x)
    slope_est.append(sm.OLS(y, X).fit().params[1])
print(f"Mean of estimates: {np.mean(slope_est):.3f}")
print(f"StdDev of estimates: {np.std(slope_est):.3f}")
Typical output:
Mean of estimates: ≈ 0.50 # unbiased
StdDev of estimates: ≈ 0.11 # wide spread!
Even though the average hits 0.5, any single study can easily miss by ±0.2. With n = 1,000 the StdDev falls to roughly 0.02—data volume trumps clever tricks.
| Assumption fails | What happens | Field example | First-aid |
|---|---|---|---|
| Heteroscedasticity | OLS stays unbiased but SEs are wrong | Rent variance rises with flat size | Use robust (HC1) SE |
| Autocorrelation | SEs too small | Daily returns in finance | Newey-West or HAC |
| Endogeneity (exogeneity fails) | OLS biased and inconsistent | Ability affects wages & schooling | Instrumental Variables (Ch. 10) |
| Small n, many k | Variance explodes; XᵀX nearly singular | Marketing with dozens of dummies | Drop variables, collect more data, or penalise (Ch. 11) |
- Report standard errors and confidence intervals before p-values. They show magnitude and precision.
- Guard degrees of freedom. Avoid throwing 15 controls into a 40-observation regression.
- Use robust errors by default. Heteroscedasticity is the norm outside textbooks.
- Simulate if unsure. A quick Monte-Carlo clarifies bias vs. variance faster than theory alone.
- Remember: more observations beat fancier estimators—until assumptions break.
Key insight
In the finite-sample world, OLS earns its reputation only under specific conditions. Check them ruthlessly. When they hold, you get an unbeatable linear, unbiased estimator; when they crack, no amount of theorem-quoting can save your inference.
Chapter 8 zooms out to the large-sample universe: how consistency and asymptotic normality let us sleep at night when n is big—even if some classic assumptions soften. See you there, where "infinite data" meets real-world imperfections.
Chapter 8: Large-Sample Logic & Robust Standard Errors
Table of Contents
- Why large-sample theory matters
- The Law of Large Numbers (LLN) in one sentence
- Consistency of OLS
- Central Limit Theorem (CLT): the bell emerges
- The sandwich (robust) variance formula
- Quick real-world demo: hedging a large ETF portfolio
- Cluster and Newey–West extensions
- Common pitfalls in big data land
- Cheat-sheet for large-sample projects
- Key take-aways
- On deck
Modern datasets explode—think millions of Airbnb prices, or every Ethereum transaction since 2015. With size comes power: estimates settle, oddly shaped error terms look almost normal, and inference becomes sharper.
But those perks rely on two pillars:
- Consistency — estimates converge on the truth as n → ∞.
- Asymptotic normality — scaled estimation errors behave like a normal bell curve.
Get those right, and you can trust z- and t-tests even in messy, heteroskedastic data.
The sample mean ȳ inches closer to the population mean μ as you pile on observations.
Formally, for i.i.d. data y₁,…,yₙ:
ȳ = (1/n) ∑ yi  p→  μ  as n → ∞,
where "p→" denotes convergence in probability.
In plain English: more data wash out random noise.
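A tiny simulation sketch (skewed, made-up data with true mean 2) shows the LLN doing exactly that:
import numpy as np

rng = np.random.default_rng(3)
draws = rng.exponential(scale=2.0, size=1_000_000)     # heavily skewed, true mean = 2
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  running mean = {draws[:n].mean():.4f}")   # settles towards 2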
If the core assumptions from Chapter 6 hold and the predictors stay well-behaved as n grows, the OLS estimator converges on the truth:
β̂  p→  β.
Why you care: With 2,000,000 rental listings, the slope linking distance to rent is practically the real slope; sampling error shrinks to trivia.
Scaled estimation error is asymptotically normal:
√n (β̂ − β)  d→  N(0, Σ),
where "d→" means convergence in distribution and Σ is the asymptotic variance–covariance matrix.
Result: you can build z-tests and confidence intervals even when errors aren't Gaussian—the big n magic handles it.
Real data laugh at homoskedasticity. White's "sandwich" estimator keeps inference honest:
Var̂(β̂) = (XᵀX)⁻¹ ( ∑i ε̂i² xi xiᵀ ) (XᵀX)⁻¹.
It's "robust" to unknown, arbitrary heteroskedasticity. Valid as n → ∞—that's the CLT working behind the scenes.
Data: 1,000 trading days for 500 stocks (≈ 500,000 obs).
Goal: regress each stock's return on the market index to get betas.
import statsmodels.api as sm
model = sm.OLS(y, X).fit(cov_type='HC1') # y: stock returns, X: [const, market]
print(model.summary())
Notice how robust SEs (labelled "HC1") differ from classic ones—especially for small-cap stocks with wild volatility. With half a million rows, point estimates barely change moving to robust errors, but p-values can double or halve. Inference, not point estimates, is where robust SEs earn their pay.
| Robust flavour | Fixes what? | Typical use-case |
|---|---|---|
| Clustered SE | Correlation within groups (firms, villages) | Employee salaries nested in firms |
| Newey-West | Serial correlation & heteroskedasticity | Daily asset returns, macro time-series |
Both rely on large-n asymptotics: as clusters or time points grow, variance estimates stabilise.
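Both flavours are one flag away in statsmodels. A minimal sketch, reusing the y and X placeholders from the code above; the 'firm_id' grouping column is hypothetical:
import statsmodels.api as sm

# clustered SEs: residuals may correlate within each firm
clustered = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': df['firm_id']})

# Newey-West (HAC) SEs: residuals may correlate across nearby days
newey = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 5})

print(clustered.bse)   # the standard errors, not the coefficients, are what change
print(newey.bse)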
| Mistake | Consequence | Guardrail |
|---|---|---|
| Trusting tiny p-values blindly | With 10 million obs, trivial effects look "super-significant" | Report effect sizes & confidence intervals, not just p-values. |
| Ignoring clustering | SEs way too small → overconfidence | Cluster by logical unit (user-id, firm-id). |
| Memory bottlenecks | OLS cannot invert XᵀX when k huge | Use incremental or distributed regressions; consider penalised models (ridge/Lasso). |
- Always ask "Is it consistent?" No amount of data rescues a biased estimator.
- Default to robust (HC1) errors. Cost = one flag in most software.
- Cluster if data naturally group. Number of clusters > 30 is a healthy rule.
- Check effect magnitudes. A coefficient of 0.0002 may be "significant" yet irrelevant.
- Document data cuts & code. Large-n mistakes replicate at scale.
Key take-aways
- The LLN and CLT let big datasets hand you near-truth estimates and (asymptotically) normal inference.
- Robust, cluster, and Newey-West SEs are insurance policies against real-world deviations from textbook homoskedasticity.
- Big n amplifies tiny biases and mis-specified SEs—use your newfound tools to keep analysis honest.
Chapter 9 introduces Instrumental Variables 101—your go-to remedy when exogeneity implodes and OLS turns biased, no matter how large the sample. Get ready to tackle endogeneity head-on.
Chapter 9: Instrumental Variables 101
Table of Contents
- Why ordinary OLS can fail
- The instrumental-variable idea in one sentence
- Two must-have conditions
- Classic real-world instruments
- Two-Stage Least Squares (2SLS) in four lines of algebra
- Quick Python sketch (statsmodels)
- Diagnosing weak instruments
- Testing exclusion with over-identification
- Interpreting the IV slope
- When IV may disappoint
- Real-world pay-off examples
- Field checklist before you publish an IV result
- Key take-aways
- Coming up
Suppose you wish to measure how an extra year of schooling raises annual earnings. The naïve regression
wagei = β0 + β1 educationi + ui
looks fine—until you recall that smarter people may choose more schooling, that families with higher income can afford longer studies, and that both IQ and family money sit hidden in the error term ui. Because those hidden factors correlate with education, the exogeneity assumption E[ui|education] = 0 collapses. OLS becomes biased and inconsistent, no matter how large the sample.
Find a variable that pushes education around but has no direct path to wages. That variable is an instrument (call it Z). If we can isolate the part of schooling determined only by Z, we regain a clean experiment.
| Condition | Formal tag | Plain meaning |
|---|---|---|
| Relevance | Cov(Z, education) ≠ 0 | The instrument actually moves the endogenous regressor. |
| Exclusion (validity) | Cov(Z, u) = 0 | After controlling for schooling, the instrument has no direct link to earnings. |
Both matter—miss either one and the cure can be worse than the disease.
| Research question | Instrument Z | Why it might work |
|---|---|---|
| Education → wages (EU) | Distance to nearest university | Living far raises travel cost, discouraging enrolment; distance itself should not influence wages once education is fixed. |
| Alcohol price → accident deaths | Excise-tax hikes | Taxes shift drink prices exogenously; tax laws aren't tied to local driving habits. |
| House price → fertility | Historical land-use regulations | Old zoning shocks affect housing supply but not birth preferences directly. |
Stage 1 (first-stage regression)
Predict schooling from the instrument(s):
educationi = π0 + π1 Zi + π2 Wi + vi,
where Wi = other exogenous controls. Save the fitted values education̂i.
Stage 2 (second-stage regression)
Plug the predicted schooling into the wage equation:
wagei = β0 + β1 education̂i + β2 Wi + ei.
The coefficient β̂1IV is your causal estimate—if Z meets relevance & exclusion.
import linearmodels.iv as iv
# df contains wage, education, distance, controls
iv_mod = iv.IV2SLS.from_formula(
'wage ~ 1 + controls + [education ~ distance]',
data=df).fit(cov_type='robust')
print(iv_mod.summary)
The [education ~ distance] notation tells linearmodels to treat distance as the instrument.
The first-stage F-statistic checks relevance: it asks whether the excluded instrument(s) add real explanatory power for schooling once the other controls are included.
Rule of thumb: if F < 10 the instrument is weak → IV estimates are unreliable (huge variance, bias toward OLS). For multiple instruments, use the Kleibergen-Paap rk Wald F (software prints it).
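A minimal first-stage sketch with statsmodels, using the column names assumed earlier in this chapter (education, distance, controls); the F-test compares the first stage with and without the excluded instrument:
import statsmodels.formula.api as smf

unrestricted = smf.ols('education ~ distance + controls', data=df).fit()
restricted = smf.ols('education ~ controls', data=df).fit()

f_stat, p_val, _ = unrestricted.compare_f_test(restricted)    # F-test on the excluded instrument
print(f"First-stage F: {f_stat:.1f} (worry if it is below 10)")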
If you have more instruments than endogenous regressors, you can run a Hansen J-test (a.k.a. Sargan test):
Null: all instruments satisfy exclusion.
A large p-value ⇒ cannot reject validity; a tiny p-value ⇒ some instrument likely corrupt.
Remember, the test can fail to detect a bad instrument—economic logic still rules.
β̂1IV = Average causal effect for people whose schooling was actually shifted by Z.
This is the Local Average Treatment Effect (LATE).
For distance-to-college, the estimate speaks about students on the margin of enrolling because of travel costs—not necessarily about lifelong learners pursuing PhDs online.
| Limitation | Why it hurts | Antidote |
|---|---|---|
| Weak-Z variance blow-up | Enormous SEs hide any effect | Collect stronger instruments; drop IV if F<10 |
| Exclusion untestable | Economic story must be convincing | Use multiple, conceptually diverse instruments |
| Local, not global, effect | Policy extrapolation tricky | Be explicit: "estimate applies to distance-affected students" |
| Small-sample bias | IV consistent but biased in finite n | Many obs, or use jackknife IV (JIVE) |
- Policy: A ministry estimates returns to extra schooling net of ability bias before funding grants.
- Health: Doctors gauge causal impact of alcohol on heart disease using local-tax changes.
- Finance: Analysts isolate how Fed announcements move bank lending by instrumenting interest rates with Fed-Funds futures surprises.
- Explain the economic story of the instrument in <150 words.
- Show the first-stage table; report F-stat.
- Check over-ID tests if you have extra Z's.
- Compare OLS vs. IV estimates; huge jumps demand discussion.
- Translate LATE—who exactly experiences the estimated effect?
Key take-aways
- Endogeneity kills OLS; a good instrument revives causal inference.
- Success hinges on relevance and exclusion—prove both with data and economic reasoning.
- 2SLS is mechanically simple yet statistically subtle: weak or invalid instruments can do more harm than good.
- Always pair IV results with diagnostics and a clear story of who the estimate describes.
Chapter 10 serves a taste of time-series and panel data—bringing dynamics and repeated observations into the mix, and showing how serial correlation and unobserved heterogeneity reshape everything you've learned so far.
Chapter 10: A First Taste of Time-Series & Panel Data
Table of Contents
- Why bother with new data shapes?
- A time-series warm-up with the ECB deposit rate
- When dependence ruins OLS rules
- Enter panel data: many firms, many years
- Fixed-effects (FE) vs. random-effects (RE)
- Tiny code peek (Python)
- Assumptions worth checking
- Why it matters on the ground
- Key take-aways
- Next stop
Time-series give you one entity tracked across dates—ideal for spotting trends and shocks. Panel (cross-section × time) lets you watch many entities through time, so you can net out anything that never changes within each entity. Ignoring these structures can wreck your standard errors and twist causal stories. Think of it as using a hammer when you really need a drill.
The European Central Bank's deposit facility rate swings as policy tightens or loosens. Since mid-2024 it has fallen from 4% to 2%. Plot the rate against monthly euro-area inflation and two traits pop out:
- Trend breaks when the ECB pivots.
- Memory—today's rate is usually close to last month's.
That second trait is autocorrelation: values relate to their own past.
A first-order autoregression captures that memory:
rt = c + ϕ rt−1 + εt,
where rt is the rate this month. If |ϕ| < 1 the series is stationary—its mean and variance stay put.
Why you care: Stationarity keeps forecasts and confidence intervals behaving.
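A quick stationarity sketch with the Augmented Dickey-Fuller test, assuming the deposit rate sits in a pandas Series called rate:
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(rate.dropna())
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3f}")
# small p-value → reject the unit root; large p-value → treat the series as non-stationary (difference it)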
| OLS assumption | Time-series reality | Fix |
|---|---|---|
| Independent errors | Residuals often cluster in runs | Use Newey-West or AR model |
| Homoskedasticity | Volatility spikes around crises | Switch to GARCH or robust SE |
| Exogenous xt | Policy rates respond to inflation → feedback | Instrumental variables for TS |
Imagine 500 EU banks observed quarterly from 2015 – 2025. Goal: link capital ratio to return on equity (ROE) while controlling for anything that is bank-specific but time-invariant (e.g., corporate culture).
| Question | FE answer | RE answer |
|---|---|---|
| What it does | Subtracts each entity's own mean → focuses on within-bank variation | Treats entity effects as random draws uncorrelated with regressors |
| Keep time-invariant vars? | No | Yes |
| Needs strict exogeneity? | Less stringent | Stronger: entity effects must be uncorrelated with all xit |
| Hausman test | N/A | Rejects RE if estimates diverge from FE |
Plain rule: if you suspect omitted traits correlate with your x's, use fixed effects.
from linearmodels.panel import PanelOLS
df = banks.set_index(['bank_id','quarter'])
y = df['roe']
X = df[['capital_ratio','size','risk']]
fe_mod = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(cov_type='clustered', cluster_entity=True)
print(fe_mod.summary)
entity_effects=True sweeps out each bank's constant unobserved quirks; clustering by bank fixes serial correlation inside panels.
For time-series:
- Stationarity (ADF or KPSS tests)
- No leftover autocorrelation (Ljung-Box)
For panels:
- Enough time periods or entities (rule of 30)
- Serial correlation & heteroskedasticity inside entities (cluster SE)
- Hausman test if you're tempted by RE
Failing any test doesn't kill the project—it tells you which robust method to call.
- Central-bank watchers forecast rate cuts using AR terms to capture policy inertia.
- Asset managers run FE models to see if ESG scores raise returns independent of slow-moving sector traits.
- Public-health analysts stack regions × years to separate vaccine campaign effects from fixed geography.
Key take-aways
- Time-series = one entity over time → check for memory and stationarity.
- Panel data = many entities over time → FE wipes out hidden constants; RE keeps them if uncorrelated.
- Robust or clustered standard errors are your insurance policies against serial dependence.
- Always translate results: "A 1 pp higher capital ratio lifts quarterly ROE by 0.15 pp within the same bank."
In Chapter 11 we turn to modern extensions: Ridge and Lasso shrinkage that keep linear models alive when predictors multiply, plus cross-validation for tuning the penalty. See you there!
Chapter 11: Modern Extensions: Ridge & Lasso Keep Linear Models Alive
Table of Contents
- Why should you care?
- The intuition in 90 seconds
- Ridge regression: gentle shrinkage
- Lasso: sparse and interpretable
- Quick Python demo
- Real-world case: predicting apartment rents across Europe
- Assumptions worth remembering
- Choosing λ: cross-validation is king
- Caveats and extensions
- Key take-aways
- Where to go from here
Your data set now holds 5,000 features: web-scraped neighbourhood facts, satellite pixels, sentiment scores. Ordinary OLS refuses to run if features outnumber observations, and even when it runs, coefficients explode from collinearity.
Shrinkage methods—Ridge and Lasso—solve both problems with one clever idea: penalise big, wobbly coefficients.
Bias–variance trade-off
- Small bias + huge variance = wild predictions.
- A tiny extra bias traded for a big drop in variance = lower overall error.
Shrinkage buys the second outcome.
Geometry
Imagine the OLS solution as the bottom of a bowl. Ridge adds a smooth rubber band around the origin; Lasso adds a diamond-shaped fence. Both keep the solution near zero where possible.
Ridge solves
minβ ∑ (yi − xiᵀβ)² + λ ∑ βj².
λ = tuning parameter (≥ 0). No coefficient is forced to zero; all are nudged smaller.
When to use: predictors highly correlated, you care about prediction accuracy more than exact feature selection.
Lasso solves
minβ ∑ (yi − xiᵀβ)² + λ ∥β∥₁, where ∥β∥₁ = ∑|βj|.
The L₁ penalty produces exact zeros → automatic variable selection. No closed-form; solved via coordinate descent.
When to use: thousands of noisy features, managers want a short list of drivers.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
X, y = load_your_big_dataset() # shape ~ (1000, 5000)
ridge = make_pipeline(StandardScaler(),
RidgeCV(alphas=[0.1, 1, 10], cv=5)).fit(X, y)
lasso = make_pipeline(StandardScaler(),
LassoCV(alphas=[0.001, 0.01, 0.1], cv=5,
max_iter=5000)).fit(X, y)
print("Ridge R²:", ridge.score(X, y))
print("Active Lasso features:", (lasso[-1].coef_ != 0).sum())
StandardScaler is vital; penalties are scale-dependent. CV searches for the λ that minimizes out-of-sample error.
Data: 50,000 listings, 1,200 engineered features (text keywords, distance to coffee bars, street-view textures).
- Baseline OLS (top 20 manual features) → RMSE = €140/m.
- Ridge (all features) → RMSE = €92/m.
- Lasso shrinks model to 57 active features → RMSE = €95/m, plus a readable shortlist for managers.
| Method | Needs k < n? | Handles multicollinearity? | Gives unbiased coefficients? |
|---|---|---|---|
| OLS | Yes | No | Yes (if exogenous) |
| Ridge | No | Yes | Adds small bias |
| Lasso | No | Yes | Adds bias, selects features |
All three still rely on X being exogenous—if hidden confounders exist, you still need tools like Instrumental Variables.
- Split data into 5–10 folds.
- Try a grid of λ values.
- Pick the one that minimizes validation error (RMSE or MAE).
- Refit the model on the full data with that λ.
Avoid "eye-balling" or hunting the best training fit—the penalty's whole job is to excel out-of-sample.
- Standard errors: classical formulas break; use bootstrapping or debiased Lasso for inference.
- Group features: group-Lasso or elastic net blend L₁ + L₂ when features arrive in blocks (see the elastic-net sketch below).
- Non-linearity: combine with splines or feed predictions into tree-based models; Ridge and Lasso only address linear coefficients.
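As flagged above, here is a minimal elastic-net sketch in the same style as the Ridge/Lasso demo, reusing X and y from earlier in this chapter:
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8],          # mix between L₁ (Lasso) and L₂ (Ridge)
                 alphas=[0.001, 0.01, 0.1],
                 cv=5, max_iter=5000)).fit(X, y)
print("Chosen l1_ratio:", enet[-1].l1_ratio_, "alpha:", enet[-1].alpha_)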
Key take-aways
- Ridge and Lasso keep linear models useful when features explode in count or correlation.
- Ridge prioritises stable prediction; Lasso adds interpretability via sparsity.
- Cross-validation picks the right penalty; scaling inputs is non-negotiable.
- Shrinkage introduces bias deliberately to slash variance—often a winning trade.
- Try elastic net when Ridge vs. Lasso feels like a coin toss.
- Explore causal forests or double machine learning to blend flexible prediction with causal inference.
- Above all, keep the bigger goal in sight: turn complex data into clear, actionable economic insight.
Happy modelling!