Chapter 1: The Empirical Economist's Mindset
Table of Contents
- Why should you care?
- What exactly is econometrics?
- Correlation is not causation — a class-size tale
- Experiments are gold… and rare
- The very first model — a straight line
- Three essential assumptions (in plain language)
- Why it matters in practice
- Key take-aways
- Looking ahead
On any given morning in Frankfurt am Main you might hear:
"Why do rents in Sachsenhausen rise faster than wages?"
"Does offering German courses to refugees really speed up their integration?"
"Will the new city toll cut traffic or just annoy commuters?"
Instincts, anecdotes, and even descriptive statistics can hint at answers. Econometrics turns those hints into quantified, testable, decision-ready evidence.
Econometrics = Economics + Data + Statistics
Economists propose theories (demand curves, human-capital models, monetary rules). Econometricians bring in data and statistical tools to:
- Verify — Does the theory fit the facts?
- Quantify — How big is the effect? (e.g. +7% wages per extra year of school)
- Evaluate — Did a new policy work?
- Predict — What happens if we raise the ECB deposit rate by 0.25 pp?
If you want more than educated guesses, you need econometrics.
A simple scatter plot of German school districts shows smaller classes often score higher in tests. Nice!
But richer districts both hire more teachers and fund better facilities.
If we naively act on the scatter, we could spend millions reducing class size and discover—too late—that facilities did the heavy lifting.
Econometrics provides the discipline needed to separate "looks related" from "really causes".
Randomised Controlled Trials (RCTs) shuffle treatment like a lottery. Example: randomly assign some schools extra teachers, others none, then compare outcomes.
In many economic settings RCTs are too costly, unethical, or impossible. Most evidence therefore comes from observational data (surveys, admin records, market prices).
Econometrics supplies the rules that let us learn responsibly from those less-than-perfect data.
Many questions start with the simple linear regression
yi = β0 + β1 xi + εi,
where:
- yi = outcome of interest (hourly wage)
- xi = explanatory variable (years of education)
- εi = all other influences we did not measure
The population slope that describes the true relationship is
β1 = Cov(xi, yi) / Var(xi).
Because we do not know the population moments, we estimate them with sample averages.
The resulting Ordinary Least Squares (OLS) estimator,
β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²   (with intercept β̂0 = ȳ − β̂1 x̄),
is the slope that minimises the squared vertical distances between data points and the fitted line.
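A minimal sketch with simulated data (the 0.7 "true" slope is invented purely for illustration) shows that the OLS slope really is just sample covariance over sample variance:
import numpy as np

rng = np.random.default_rng(0)
education = rng.uniform(8, 20, 500)                    # simulated years of schooling
wage = 5 + 0.7 * education + rng.normal(0, 2, 500)     # invented true slope of 0.7, plus noise

beta1_hat = np.cov(education, wage)[0, 1] / np.var(education, ddof=1)
beta0_hat = wage.mean() - beta1_hat * education.mean()
print(f"estimated slope: {beta1_hat:.3f}, intercept: {beta0_hat:.3f}")   # slope lands near 0.7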
Even this humble straight line can answer:
- "How many euro does one more year of schooling add to average pay in Frankfurt?"
- "How sensitive is household electricity use to outside temperature?"
| Assumption | What it really says | Why you should care |
|---|---|---|
| Random sampling | Each observation is like a lottery draw from the same population. | Makes sample formulas mirror true population formulas. |
| Linearity in parameters | The average effect of x on y can be summarised by a straight line (or by a line after transforming variables). | Keeps interpretation clear and maths manageable. |
| Mean independence (exogeneity) | On average, the unobserved stuff εi is not linked to xi. | Without it, the slope mixes true effect with hidden factors (ability, location, etc.). |
When these hold, OLS produces unbiased and (with more data) consistent estimates.
- Policy design — Before the city spends €100 million hiring teachers, it wants credible evidence that smaller classes cause better scores.
- Business strategy — A café chain maps foot traffic (x) to daily sales (y) to choose the next branch location.
- Personal finance — A student weighs a €30,000 MSc fee against the estimated wage premium.
Key take-aways
- Econometrics converts curious questions into measurable answers.
- Experiments are best but rare; careful modelling lets us exploit ordinary data.
- The linear regression line—powered by OLS—offers the first step toward credible causal stories, provided the core assumptions hold.
Next up is Chapter 2: Data & Random Sampling. We will see how the structure of your data (cross-section, time-series, panel) and the way you collect it can make or break every result that follows. Stay tuned!
Chapter 2: Data & Random Sampling
Table of Contents
- Why this chapter matters
- Three common data structures
- The ideal: simple random sampling
- Real-world wrinkles
- Stratified & cluster sampling in plain language
- Sample size vs. measurement quality
- Assumptions to check before any regression
- Frankfurt-flavoured examples
- Quick field checklist
- Key take-aways
- Up next
A brilliant model cannot rescue bad data. Before tuning regressions or debating p-values, you must ask:
- Where did my numbers come from?
- Do they represent the population I care about?
- Could the collection process itself bias the result?
Skipping these questions can waste millions—or worse, mislead policy.
| Structure | Quick picture | Typical use | Frankfurt-style example |
|---|---|---|---|
| Cross-section | One snapshot in time | Compare units | Hourly wages of 5,000 residents in 2024 |
| Time-series | One unit over many dates | Forecast, detect shocks | Monthly ECB deposit rate, 1999-2025 |
| Panel (longitudinal) | Many units over many dates | Control for fixed traits | Annual rent of 2,000 flats, 2015-2025 |
Each format implies different statistical challenges. Mixing them up invites wrong standard errors or spurious trends.
Mathematically we dream of i.i.d. draws:
Independent = one observation tells you nothing about the next.
Identically distributed = every observation has the same probability law.
With i.i.d. data, the sample mean
ȳ = (1/n) ∑ yi
is unbiased for the population mean μ = E[y] and has standard error
SE(ȳ) = σ / √n,
where σ² is the true variance. Double the sample size and the error shrinks by a factor of 1/√2. Clean and simple!
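A tiny simulation sketch (all numbers made up) shows the σ/√n rule in action—each quadrupling of n halves the standard error:
import numpy as np

rng = np.random.default_rng(1)
sigma = 4.0                                            # assumed population standard deviation
for n in (100, 400, 1600):
    samples = rng.normal(20, sigma, size=(2000, n))    # 2,000 simulated samples of size n
    simulated_se = samples.mean(axis=1).std()          # spread of the 2,000 sample means
    print(f"n={n:5d}  simulated SE={simulated_se:.3f}  theory σ/√n={sigma/n**0.5:.3f}")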
| Wrinkle | What goes wrong | Everyday illustration |
|---|---|---|
| Coverage bias | The sampling frame omits part of the population. | A phone survey misses renters without landlines—skewing rent estimates. |
| Non-response | Some selected units refuse or cannot answer. | High-income households decline wage interviews, pulling the sample mean down. |
| Cluster dependence | Nearby units look alike → lower effective n. | Energy use measured by apartment often correlates within buildings. |
| Time dependence | Today's value relates to yesterday's. | ECB rate cuts spread over months; successive observations aren't independent. |
Econometrics offers fixes—weights, robust errors, clustering, time-series models—but diagnosing the problem comes first.
Sometimes random draws alone are inefficient or costly.
Stratified sampling divides the population into homogeneous groups (strata) and samples within each. Example: Force equal numbers from Frankfurt's central, north, and south boroughs to ensure city-wide conclusions.
Cluster sampling first selects groups, then units inside them. Example: Choose 50 apartment blocks, then survey every flat in each block. Cheap when travel is expensive, but adds intra-cluster correlation you must correct for.
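A quick pandas sketch (fully simulated flats, hypothetical column names) contrasts the two designs:
import numpy as np, pandas as pd

rng = np.random.default_rng(2)
flats = pd.DataFrame({
    'borough': rng.choice(['central', 'north', 'south'], 9000),
    'block_id': rng.integers(1, 301, 9000),     # 300 hypothetical apartment blocks
    'rent': rng.normal(14, 3, 9000),            # € per m², simulated
})

# Stratified: force 200 flats from each borough
strat = flats.groupby('borough', group_keys=False).sample(n=200, random_state=0)

# Cluster: pick 50 blocks at random, then keep every flat inside them
chosen_blocks = rng.choice(flats['block_id'].unique(), size=50, replace=False)
clust = flats[flats['block_id'].isin(chosen_blocks)]

print(strat['borough'].value_counts())          # equal strata by construction
print(clust['block_id'].nunique(), "blocks,", len(clust), "flats")
Remember the warning above: the cluster sample's effective n is smaller than its row count, because flats within a block resemble each other.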
"Better 1,000 precise measurements than 10,000 sloppy ones."
Large n lowers random error, yet systematic error (bias) remains.
Budget-constrained? Spend first on representative coverage and accurate measurement; add records only if funds remain.
- Representativeness — Does your sample mirror the target population?
- Correct temporal ordering — Is x measured before y when you claim causality?
- Stable unit treatment value — One unit's outcome must not affect another's (violated in neighbourhood spill-overs).
- No data dredging — Pre-define key variables; avoid picking the best stories post-hoc.
Write these on a sticky note near your code editor.
Housing study:
Randomly draw flats from the city's cadastral registry. Weight by floor-space to correct for oversampling tiny studios.
Retail-foot-traffic sensor:
Sensors record counts every minute—giving dependence across time. Use Newey-West or fit an AR(1) error structure before testing intervention effects.
Refugee integration survey:
Language-school dropouts may ignore follow-up questionnaires → potential attrition bias. Design incentives or multiple contact modes to cut attrition.
- Define the population. Be explicit: "all private rental contracts signed in Frankfurt 2024."
- Audit the sampling frame. List obvious exclusions and their likely direction of bias.
- Track response rates. Overall and by key subgroup.
- Log data-collection costs. Helps choose between bigger n or better instruments next time.
- Document everything. Today's codebook is tomorrow's credibility.
Key take-aways
- Data structure (cross-section, time-series, panel) dictates the valid toolbox.
- Simple random sampling makes math easy; reality often deviates—spot how.
- Fixing bias beats inflating sample size.
- Transparent documentation is as valuable as the numbers themselves.
In Chapter 3 we draw our very first regression line, visualise why OLS "leans" the way it does, and connect geometry to intuition. Bring a cup of coffee—and your freshly cleaned dataset.
Chapter 3: Simple Linear Regression, Geometric Intuition
Table of Contents
- Why you should care
- The data in one minute
- Scatter plot → straight line
- Where the formula comes from
- A tiny code sketch (Python + pandas/statsmodels)
- Key assumptions checked against market data
- Why a one-line model already helps
- Limitations & next steps
- Take-aways
- Coming up
Investors keep asking: "If the S&P 500 jumps 1%, what usually happens to Bitcoin the same day?" Knowing the average co-movement helps with hedging, risk limits and portfolio design. The most direct way to answer is a simple linear regression of Bitcoin's daily return on the S&P 500's return.
Recent numbers show the link is real but not perfect: Rolling 30-day correlations have swung between 0 and 0.5 since 2020. That pattern begs to be quantified—exactly what this chapter does.
- Units: trading days
- Variables:
- yt = Bitcoin daily return (in %)
- xt = S&P 500 daily return (in %)
- Sample: January 2022 – June 2025 (≈ 850 observations)
- Source: Public price feeds (e.g. Yahoo Finance or a crypto exchange API)
Because both returns are recorded at the same timestamp, neither variable can "look into the future," keeping the design symmetric.
Imagine plotting xt on the horizontal axis and yt on the vertical. The cloud of dots leans upward: strong index rallies often coincide with even stronger Bitcoin moves. The best-fit line through that cloud is the simple-regression model
yt = β0 + β1 xt + εt.
The line's slope β1 captures average amplification: if β1 = 1.6, a 1% rise in the S&P 500 is typically matched by a 1.6% jump in Bitcoin. Each residual ε̂t is the vertical gap between a dot and the line. By construction, the OLS residuals are uncorrelated with the regressor and the fitted values—the precise sense in which the residual cloud sits "at right angles" to the fitted line.
For the population:
β1 = Cov(xt, yt) / Var(xt).
Replace the unknown moments with sample counterparts and you get the Ordinary Least Squares (OLS) estimator
β̂1 = ∑(xt − x̄)(yt − ȳ) / ∑(xt − x̄)²,  with β̂0 = ȳ − β̂1 x̄.
OLS is "ordinary" because it's simply the line that minimises the sum of squared residuals—the most intuitive yard-stick for closeness.
import yfinance as yf, pandas as pd, statsmodels.api as sm
btc = yf.download("BTC-USD", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
spx = yf.download("^GSPC", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
df = pd.concat({'btc': btc, 'spx': spx}, axis=1).dropna()   # align the two daily return series by date
X = sm.add_constant(df['spx'])                              # add the intercept column
model = sm.OLS(df['btc'], X).fit()                          # regress Bitcoin returns on S&P 500 returns
print(model.summary())
Typical output (abridged):
coef std err t P>|t|
const 0.05 0.09 0.5 0.62
spx 1.57 0.12 13.0 0.000
R-squared = 0.25
Slope 1.57 → Bitcoin moves ~1.6× the S&P 500 on the same day.
R² ≈ 0.25 → Index returns explain 25% of Bitcoin's daily variation—useful but still leaves plenty of independent noise.
| Assumption | Do we buy it? | Quick diagnostic |
|---|---|---|
| Linearity in mean | Returns often scale roughly linearly day-to-day | Plot residuals vs. fitted; look for curves. |
| Mean independence (exogeneity) | Same-day co-moves are concurrent, so causality isn't claimed—just association | Acceptable for risk or hedge ratios; not for "Bitcoin causes stocks." |
| Homoscedasticity & independence | Volatility clusters; residuals show ARCH effects | Robust or Newey-West standard errors fix inference without changing β̂. |
| Random sampling / i.i.d. | Market hours & holidays create mild irregularity but daily returns are a common, accepted unit | Longer horizons need time-series models, Chapter 10. |
- Portfolio hedging — If Bitcoin's beta to equities is 1.6, a $10m equity hedge fund holding $1m in BTC is effectively running an extra $1.6m of equity exposure.
- Risk limits — A trading desk can cap aggregate exposure instead of siloed crypto vs. stock buckets.
- Scenario planning — Stress tests can apply a 3σ equity shock and scale crypto moves by β̂1.
- Causality not implied — The slope is descriptive. In Chapter 9 we'll meet Instrumental Variables for causal answers.
- Non-linear tails — Extreme market days may bend the line; quantile regression is a fix.
- Time-varying betas — Rolling regressions or state-space models update β1 as correlations drift (see the sketch below).
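A minimal sketch of the rolling-beta idea, reusing the df of daily returns built earlier in this chapter (the 90-day window is an arbitrary choice):
# rolling beta = rolling covariance / rolling variance, here over a 90-day window
rolling_beta = df['btc'].rolling(90).cov(df['spx']) / df['spx'].rolling(90).var()
print(rolling_beta.dropna().tail())
# rolling_beta.plot()   # optional: visualise how the beta drifts over time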
Take-aways
- A scatter, a straight line, and two moments (covariance & variance) already yield actionable insight.
- In today's markets, Bitcoin behaves like an "equity with leverage" on many days, but only partly so.
- Geometry—residuals at right angles to the fitted line—gives OLS its neat mathematical properties and intuitive appeal.
Chapter 4 dives into software output: what every column in the regression table means, how to spot red flags, and how to translate numbers into plain English advice for your CIO. See you there!
Chapter 4: Estimating OLS & Reading Software Output
Table of Contents
- Why this chapter matters
- The estimator in one line of algebra
- Running the regression (Python example)
- Decoding each column in plain English
- Four common red flags and quick fixes
- A cheat-sheet for plain-language reporting
- Best practice checklist before you trust a table
- Key take-aways
Seeing a statistical table for the first time feels like reading a medical chart in Latin. Yet those columns decide how much capital a portfolio manager shifts, how large a subsidy a government approves, or whether a research paper survives peer review. Today you'll learn to decode every piece of a standard OLS output and spot the red flags that tell you a model can't be trusted.
Given an n×1 outcome vector y and an n×k matrix of predictors X (with a leading column of 1s for the intercept), the OLS coefficient vector is
β̂ = (XᵀX)⁻¹ Xᵀ y.
That's all software does—solve this set of linear equations—to minimise the sum of squared residuals, ∑(yi − Xi β̂)².
Everything else in the print-out (standard errors, t-values, R², etc.) is built on β̂.
import yfinance as yf, pandas as pd, statsmodels.api as sm
btc = yf.download("BTC-USD", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
spx = yf.download("^GSPC", start="2022-01-01")['Adj Close'].pct_change().dropna()*100
df = pd.concat({'btc': btc, 'spx': spx}, axis=1).dropna()
X = sm.add_constant(df['spx']) # adds the intercept
model = sm.OLS(df['btc'], X).fit()
print(model.summary())
Typical (abridged) console output:
| Coefficient | Estimate | Std. Error | t-stat | P>|t| |
|-------------|---------:|-----------:|-------:|-----:|
| Intercept | 0.05 | 0.09 | 0.5 | 0.62 |
| spx | 1.57 | 0.12 | 13.0 | 0.000 |
R-squared: 0.25 F-statistic: 169.0 Observations: 850
| Element | What it tells you | Bitcoin-S&P example |
|---|---|---|
| Estimate (coef) | Best-fit slope/intercept | A 1% S&P jump coincides with a 1.57% Bitcoin jump, on average. |
| Std. Error | Sampling uncertainty around the estimate | ±0.12 pp around 1.57. |
| t-statistic | Estimate ÷ Std. Error | 13 → a huge signal-to-noise ratio. |
| P-value | Probability of seeing a t-statistic at least this large (in absolute value) if the true slope were 0 | <0.001 → virtually impossible under "no link." |
| R-squared | Share of y variance explained | 0.25 → S&P moves explain 25% of Bitcoin's daily wiggles. |
| F-statistic | Joint test that all slopes = 0 | 169 → the model beats a flat line hands-down. |
| Observations (n) | Sample size behind the stats | 850 trading days. |
| Red flag | Symptom in output | Likely cause | Immediate fix |
|---|---|---|---|
| "Perfect" R² = 0.99 | Too good to be true | You mistakenly regressed a variable on itself (or near-duplicates). | Check column names and lags. |
| Huge Std. Errors vs. coefficients | Coeff ≈ 0.8 but SE ≈ 1.2 | Multicollinearity (predictors highly correlated) | Drop a redundant predictor or apply ridge shrinkage. |
| Durbin-Watson < 1.0 | Serial correlation in residuals | Time-series dependence | Use Newey-West SE or move to AR models. |
| P-values tiny yet plot looks curved | Misspecification | Missing a nonlinear term | Add x² or run a spline. |
- Slope: "Bitcoin tends to move 1.6× the S&P 500 on the same day."
- Uncertainty: "The margin of error is about ±0.1×."
- Economic meaning: "Holding €100k in Bitcoin equates to a €160k equity exposure for daily shocks."
- Fit quality: "Index returns account for one-quarter of Bitcoin's day-to-day moves; other forces drive the rest."
- Visuals first — Always plot data and residuals; numbers confirm what eyes suspect (see the sketch after this checklist).
- Units & scaling — Know whether coefficients are in percent, basis points, or log points.
- Robust errors — If in doubt, request HC1/White or Newey-West SE.
- Version control — Save the code, data-cut, and seed to reproduce output later.
- Tell the story — Translate every stat into a business or policy implication.
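As promised in the checklist, here is a short sketch of the "visuals first" and autocorrelation checks, reusing the fitted model object from the code above (matplotlib assumed to be installed):
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

resid, fitted = model.resid, model.fittedvalues       # residuals and fitted values from the OLS fit
plt.scatter(fitted, resid, s=5)
plt.axhline(0, color='red')
plt.xlabel('Fitted values'); plt.ylabel('Residuals')  # curvature or fanning here is a red flag
plt.show()

print('Durbin-Watson:', durbin_watson(resid))         # values far below 2 hint at serial correlation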
Key take-aways
- The OLS table is just algebraic output from ∑(yi − Xiβ̂)²; reading it is a skill, not magic.
- Coefficients give magnitude, Std. Errors give precision, R² gives context—interpret all three together.
- Red flags (weird R², giant errors, autocorrelation) are easier to catch early than to fix late.
Chapter 5: Goodness of Fit (Without Falling in Love with R²)
Table of Contents
- Why this topic matters
- The basic anatomy: ANOVA in one glance
- Meet R²'s better-behaved cousin
- Real-world demo: predicting house prices
- Better yard-sticks
- Quick visual checks
- Common myths—busted
- A four-step fit-assessment ritual
- Key take-aways
- Looking ahead
Analysts often race straight to the R-squared line, and managers nod if it's "above 0.8." Yet a housing-price model with R² = 0.95 can completely mis-price tomorrow's listings if it's over-tuned to last year's quirks. Caring only about R² is like buying a car because the speedometer goes to 240 km/h—you haven't looked under the hood.
For any linear regression
yi = β0 + β1 x1i + ⋯ + βk xki + εi,
the total variation splits neatly:
Total SS = Explained SS + Residual SS.
Hence
R² = Explained SS / Total SS = 1 − Residual SS / Total SS.
It measures in-sample fit—nothing more, nothing less.
Adding predictors never lowers R², so a model can bloat to 100 dummy variables, boast R² = 1, and still flop. Adjusted R² penalises fluff:
R̄² = 1 − (1 − R²) · (n − 1) / (n − k − 1),
where n = observations, k = predictors. Over-fitting pushes the penalty term up and R̄² down. Still, even adjusted R² can't see the future; it judges on the same data used to fit the model.
Data: 10,000 U.S. listings, 2023 (price, size, age, bedrooms, ZIP code).
Model A (parsimonious): Price ~ Size + Bedrooms + Age.
R² = 0.72, RMSE = €46k.
Model B (overfitted): Price ~ Size + Bedrooms + Age + 400 ZIP-code dummies.
R² = 0.95, RMSE = €43k (in-sample).
Hold-out test (new 2024 data):
Model A RMSE = €50k; Model B RMSE = €69k.
Lesson: Model B dazzled with R² = 0.95 but bombed out-of-sample because ZIP codes memorised last year's quirks. High R² ≠ high predictive power.
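A minimal sketch of the hold-out logic behind that lesson, assuming a DataFrame called listings with the columns used in the demo (price, size, bedrooms, age are hypothetical names):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = listings[['size', 'bedrooms', 'age']]             # hypothetical column names
y = listings['price']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

fit = LinearRegression().fit(X_tr, y_tr)              # train on 80% of the data
rmse = np.sqrt(np.mean((y_te - fit.predict(X_te)) ** 2))   # judge on the unseen 20%
print(f"Hold-out RMSE: {rmse:,.0f}")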
| Metric | Formula | What it adds |
|---|---|---|
| RMSE | √(Residual SS / n) | Puts error in same units as y. |
| MAE | (1/n) ∑|yi - ŷi| | Less sensitive to outliers than RMSE. |
| Predicted R² / CV-score | Fit on 90%, test on 10%; rotate | Guards against over-fitting. |
| Residual plots | --- | Reveal curvature, heteroscedasticity, outliers better than any scalar. |
- Residuals vs. fitted: Curve? ⇒ add non-linear term.
- Scale-Location plot: Fan-shape? ⇒ heteroscedastic ⇒ use robust SE or transform y.
- QQ-plot of residuals: Heavy tails? ⇒ rethink normality-based inference.
| Myth | Reality |
|---|---|
| "Low R² means a useless model." | Not always; predicting stock returns often yields R² < 0.1 but can still price risk correctly. |
| "High R² proves causality." | R² is silent on why variables move together—instrumental variables or experiments test causality. |
| "Adjusted R² fixes over-fitting." | It helps, but only cross-validation truly tests future performance. |
- Start with the plot. Eyes beat statistics at spotting weirdness.
- Report at least one error metric (RMSE/MAE). R² alone is never enough.
- Run a hold-out or cross-validation. Trust the score that touches unseen data.
- Explain in plain language. "Our model prices homes within ±€50k 80% of the time" beats "R² = 0.72."
Key take-aways
- R² describes in-sample fit; it does not guarantee predictive accuracy or causal validity.
- Adjusted R² dampens the incentive to add junk predictors but isn't a silver bullet.
- Always pair R² with error metrics, residual visuals, and out-of-sample checks before trusting a model in the wild.
Chapter 6 steps into multiple regression with grace—adding controls without drowning in multicollinearity, and seeing how partial effects sharpen our real-world stories. Get ready to move from one-input lines to realistic economic models with many moving parts.
Chapter 6: Multiple Regression Without Tears
Table of Contents
- Why move beyond one-variable models?
- A running example — what drives monthly rent?
- The model in matrix form
- Quick Python sketch
- Watch for multicollinearity
- Practical fixes
- Essential assumptions revisited
- Why multiple regression pays off in real life
- Quick checklist before shipping results
- Key take-aways
- Coming attractions
Real problems rarely hinge on a single factor. Suppose you ask: "How much extra does a master's degree add to annual salary?" Education matters—but so do experience, industry, and region. Ignoring them shoves their influence into the error term and biases the education slope. Multiple regression lets us hold everything else constant and isolate each effect.
Data: 3,000 Berlin flats listed in 2024.
- renti € per m²
- sizei living area (m²)
- disti kilometres to Brandenburg Gate
- floori storey (ground = 0)
- agei building age (years)
Goal: quantify location premium after accounting for size and quality.
In matrix form the model is
y = Xβ + ε,
where y stacks the 3,000 rents, X holds a column of 1s plus size, dist, floor, and age, β collects the five coefficients, and ε the unobserved influences.
The OLS solution generalises neatly:
β̂ = (XᵀX)⁻¹ Xᵀ y.
Each β̂j measures the partial effect of its predictor while all other columns stay fixed.
import pandas as pd, statsmodels.api as sm
df = pd.read_csv("berlin_rent_2024.csv") # assume you scraped this
X = sm.add_constant(df[['size','dist','floor','age']])
model = sm.OLS(df['rent'], X).fit(cov_type='HC1') # robust SE
print(model.summary())
Example output (trimmed):
| Variable | Coef | Std Err | t | P>|t| |
|----------|-----:|--------:|--:|---:|
| const | 18.2 | 0.9 | 20.2 | 0.000 |
| size | –0.045 | 0.005 | –9.0 | 0.000 |
| dist | –0.82 | 0.07 | –11.7 | 0.000 |
| floor | 0.60 | 0.06 | 10.0 | 0.000 |
| age | –0.04 | 0.002 | –20.0 | 0.000 |
Interpretation:
- Each extra kilometre from the city centre shaves €0.82/m² off rent, holding size, floor, and age constant.
- Higher-floor flats gain €0.60/m² per storey—elevator views pay off.
- Larger flats are cheaper per m² (negative size slope), reflecting bulk discounts.
When predictors move together, XᵀX gets close to singular → huge standard errors.
Variance Inflation Factor (VIF):
VIFj = 1 / (1 − Rj²),
where Rj² comes from regressing xj on all other predictors.
Rules of thumb:
- VIF > 5 → keep an eye;
- VIF > 10 → consider dropping or combining variables.
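A quick sketch of the VIF check, reusing the design matrix X from the rent regression above:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vifs.round(2))   # ignore the constant's entry; investigate any predictor above 5-10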
| Symptom | Quick remedy |
|---|---|
| Two variables almost duplicates | Drop one or replace with average/ratio. |
| Many related quality indicators (balcony, garden, lift) | Collapse into a quality index via PCA or simple scoring. |
| Scaling differences hide collinearity | Standardise variables before diagnostic tests. |
- Linearity in means — Conditional expectation of rent is linear in predictors.
- Exogeneity — Unobserved charm (noise) is uncorrelated with size, dist, floor, age.
- No perfect collinearity — Columns of X must be linearly independent.
- Homoscedasticity (for classic SE) — Relaxed by using HC1 robust errors, as in code above.
- Policy — Urban planners can evaluate how public-transport upgrades (distance term) shift rents after isolating flat quality.
- Investment — A REIT can target high-floor units near transit, estimating marginal revenue per lift installation.
- Fairness — Salary studies gauge gender pay gaps controlling for tenure, role, and performance—crucial in court.
- Plot partial residuals to spot non-linear shapes.
- Calculate VIF; tame any culprit > 10.
- Use robust or clustered SE if heteroscedasticity or district clusters exist.
- Translate each coefficient into everyday language (€ per m², minutes, etc.).
- Store code and raw data—regressions must be reproducible.
Key take-aways
- Multiple regression isolates each predictor's effect while soaking up confounders.
- Collinearity inflates uncertainty, not bias; detect it with VIF and fix pragmatically.
- Robust interpretation = coefficient plus context: sign, size, units, and caveats.
In Chapter 7 we dive into finite-sample properties: why unbiasedness and efficiency matter when data are scarce, and how the Gauss-Markov theorem crowns OLS the Best Linear Unbiased Estimator—as long as its assumptions behave.
Chapter 7: Finite-Sample Facts & the Gauss-Markov Gold Medal
Table of Contents
- Why finite-sample properties matter
- Recap of the multiple-regression set-up
- Three performance yard-sticks
- The Gauss-Markov Theorem—OLS wears the BLUE ribbon
- A tiny Monte-Carlo to see variance in action
- When the BLUE badge doesn't protect you
- Practical take-aways for small-sample projects
- Key insight
- Looking forward
Picture a start-up that A/B-tests a new pricing page on just 42 visitors before tomorrow's investor call. The CFO runs a regression of revenue on a "New Page" dummy, gets a slope of €3.10 and a p-value of 0.04, and declares success.
Small numbers like these demand sharper questions:
- Is the €3.10 estimate on average right (unbiased)?
- How jumpy is it from sample to sample (variance)?
- Could a different linear estimator deliver tighter confidence intervals?
Finite-sample theory answers those questions before the VC money burns.
The model is the same y = Xβ + ε as in Chapter 6. Key dimensions:
- n = observations (small today)
- k = predictors (including the intercept)
With tiny n, every degree of freedom counts.
| Concept | Formal cue | Plain meaning |
|---|---|---|
| Unbiasedness | E[β̂] = β | On average, the estimator lands on the truth. |
| Variance / Efficiency | Var(β̂) | How much the estimate jitters across samples. |
| Mean-squared error (MSE) | Var(β̂) + Bias² | Overall "badness" score; low is good. |
A perfect estimator would be unbiased and have the smallest possible variance.
OLS is the Best Linear Unbiased Estimator (BLUE) under five classic assumptions:
- Linearity in parameters
- Random sampling (i.i.d.)
- No perfect multicollinearity
- Zero conditional mean: E[ε|X] = 0
- Homoscedastic errors: Var(ε|X) = σ²I
When these hold, no other linear, unbiased estimator beats OLS on variance.
That's huge: with small data you cannot shrink uncertainty by fancy algebra alone—either add observations or relax the "unbiased" requirement (ridge/Lasso, Chapter 11).
import numpy as np, statsmodels.api as sm
np.random.seed(7)
beta_true = np.array([1.0, 0.5]) # intercept, slope
n, reps = 40, 5000 # small sample!
slope_est = []
for _ in range(reps):
    x = np.random.uniform(0, 10, n)
    e = np.random.normal(0, 2, n)
    y = beta_true[0] + beta_true[1]*x + e
    X = sm.add_constant(x)
    slope_est.append(sm.OLS(y, X).fit().params[1])
print(f"Mean of estimates: {np.mean(slope_est):.3f}")
print(f"StdDev of estimates: {np.std(slope_est):.3f}")
Typical output:
Mean of estimates: ≈ 0.50 # unbiased
StdDev of estimates: ≈ 0.11 # wide spread!
Even though the average hits 0.5, any single study can easily miss by ±0.2. With n = 1,000 the StdDev falls to roughly 0.02—data volume trumps clever tricks.
| Assumption fails | What happens | Field example | First-aid |
|---|---|---|---|
| Heteroscedasticity | OLS stays unbiased but SEs are wrong | Rent variance rises with flat size | Use robust (HC1) SE |
| Autocorrelation | SEs too small | Daily returns in finance | Newey-West or HAC |
| Endogeneity (exogeneity fails) | OLS biased and inconsistent | Ability affects wages & schooling | Instrumental Variables (Ch. 10) |
| Small n, many k | Variance explodes; XᵀX nearly singular | Marketing with dozens of dummies | Drop variables, collect more data, or penalise (Ch. 11) |
- Report standard errors and confidence intervals before p-values. They show magnitude and precision.
- Guard degrees of freedom. Avoid throwing 15 controls into a 40-observation regression.
- Use robust errors by default. Heteroscedasticity is the norm outside textbooks.
- Simulate if unsure. A quick Monte-Carlo clarifies bias vs. variance faster than theory alone.
- Remember: more observations beat fancier estimators—until assumptions break.
Key insight
In the finite-sample world, OLS earns its reputation only under specific conditions. Check them ruthlessly. When they hold, you get an unbeatable linear, unbiased estimator; when they crack, no amount of theorem-quoting can save your inference.
Chapter 8 zooms out to the large-sample universe: how consistency and asymptotic normality let us sleep at night when n is big—even if some classic assumptions soften. See you there, where "infinite data" meets real-world imperfections.
Chapter 8: Large-Sample Logic & Robust Standard Errors
Table of Contents
- Why large-sample theory matters
- The Law of Large Numbers (LLN) in one sentence
- Consistency of OLS
- Central Limit Theorem (CLT): the bell emerges
- The sandwich (robust) variance formula
- Quick real-world demo: hedging a large ETF portfolio
- Cluster and Newey–West extensions
- Common pitfalls in big data land
- Cheat-sheet for large-sample projects
- Key take-aways
- On deck
Modern datasets explode—think millions of Airbnb prices, or every Ethereum transaction since 2015. With size comes power: estimates settle, oddly shaped error terms look almost normal, and inference becomes sharper.
But those perks rely on two pillars:
- Consistency — estimates converge on the truth as n → ∞.
- Asymptotic normality — scaled estimation errors behave like a normal bell curve.
Get those right, and you can trust z- and t-tests even in messy, heteroskedastic data.
The sample mean ȳ inches closer to the population mean μ as you pile on observations.
Formally, for i.i.d. data y₁,…,yₙ:
ȳ = (1/n) ∑ yi  p→  μ  as n → ∞,
where "p→" denotes convergence in probability.
In plain English: more data wash out random noise.
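A tiny simulation sketch (skewed, made-up data with true mean 2) shows the LLN doing exactly that:
import numpy as np

rng = np.random.default_rng(3)
draws = rng.exponential(scale=2.0, size=1_000_000)     # heavily skewed, true mean = 2
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  running mean = {draws[:n].mean():.4f}")   # settles towards 2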
If the core assumptions from Chapter 6 hold and the predictors stay well-behaved as n grows, the OLS estimator converges on the truth:
β̂  p→  β.
Why you care: With 2,000,000 rental listings, the slope linking distance to rent is practically the real slope; sampling error shrinks to trivia.
Scaled estimation error is asymptotically normal:
√n (β̂ − β)  d→  N(0, Σ),
where "d→" means convergence in distribution and Σ is the asymptotic variance–covariance matrix.
Result: you can build z-tests and confidence intervals even when errors aren't Gaussian—the big n magic handles it.
Real data laugh at homoskedasticity. White's "sandwich" estimator keeps inference honest:
Var̂(β̂) = (XᵀX)⁻¹ ( ∑i ε̂i² xi xiᵀ ) (XᵀX)⁻¹.
It's "robust" to unknown, arbitrary heteroskedasticity. Valid as n → ∞—that's the CLT working behind the scenes.
Data: 1,000 trading days for 500 stocks (≈ 500,000 obs).
Goal: regress each stock's return on the market index to get betas.
import statsmodels.api as sm
model = sm.OLS(y, X).fit(cov_type='HC1') # y: stock returns, X: [const, market]
print(model.summary())
Notice how robust SEs (labelled "HC1") differ from classic ones—especially for small-cap stocks with wild volatility. With half a million rows, point estimates barely change moving to robust errors, but p-values can double or halve. Inference, not point estimates, is where robust SEs earn their pay.
| Robust flavour | Fixes what? | Typical use-case |
|---|---|---|
| Clustered SE | Correlation within groups (firms, villages) | Employee salaries nested in firms |
| Newey-West | Serial correlation & heteroskedasticity | Daily asset returns, macro time-series |
Both rely on large-n asymptotics: as clusters or time points grow, variance estimates stabilise.
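Both flavours are one flag away in statsmodels. A minimal sketch, reusing the y and X placeholders from the code above; the 'firm_id' grouping column is hypothetical:
import statsmodels.api as sm

# clustered SEs: residuals may correlate within each firm
clustered = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': df['firm_id']})

# Newey-West (HAC) SEs: residuals may correlate across nearby days
newey = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 5})

print(clustered.bse)   # the standard errors, not the coefficients, are what change
print(newey.bse)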
| Mistake | Consequence | Guardrail |
|---|---|---|
| Trusting tiny p-values blindly | With 10 million obs, trivial effects look "super-significant" | Report effect sizes & confidence intervals, not just p-values. |
| Ignoring clustering | SEs way too small → overconfidence | Cluster by logical unit (user-id, firm-id). |
| Memory bottlenecks | OLS cannot invert XᵀX when k huge | Use incremental or distributed regressions; consider penalised models (ridge/Lasso). |
- Always ask "Is it consistent?" No amount of data rescues a biased estimator.
- Default to robust (HC1) errors. Cost = one flag in most software.
- Cluster if data naturally group. Number of clusters > 30 is a healthy rule.
- Check effect magnitudes. A coefficient of 0.0002 may be "significant" yet irrelevant.
- Document data cuts & code. Large-n mistakes replicate at scale.
Key take-aways
- The LLN and CLT let big datasets hand you near-truth estimates and (asymptotically) normal inference.
- Robust, cluster, and Newey-West SEs are insurance policies against real-world deviations from textbook homoskedasticity.
- Big n amplifies tiny biases and mis-specified SEs—use your newfound tools to keep analysis honest.
Chapter 9 introduces Instrumental Variables 101—your go-to remedy when exogeneity implodes and OLS turns biased, no matter how large the sample. Get ready to tackle endogeneity head-on.
Chapter 9: Instrumental Variables 101
Table of Contents
- Why ordinary OLS can fail
- The instrumental-variable idea in one sentence
- Two must-have conditions
- Classic real-world instruments
- Two-Stage Least Squares (2SLS) in four lines of algebra
- Quick Python sketch (statsmodels)
- Diagnosing weak instruments
- Testing exclusion with over-identification
- Interpreting the IV slope
- When IV may disappoint
- Real-world pay-off examples
- Field checklist before you publish an IV result
- Key take-aways
- Coming up
Suppose you wish to measure how an extra year of schooling raises annual earnings. The naïve regression
wagei = β0 + β1 educationi + ui
looks fine—until you recall that smarter people may choose more schooling, that families with higher income can afford longer studies, and that both IQ and family money sit hidden in the error term ui. Because those hidden factors correlate with education, the exogeneity assumption E[ui|education] = 0 collapses. OLS becomes biased and inconsistent, no matter how large the sample.
Find a variable that pushes education around but has no direct path to wages. That variable is an instrument (call it Z). If we can isolate the part of schooling determined only by Z, we regain a clean experiment.
| Condition | Formal tag | Plain meaning |
|---|---|---|
| Relevance | Cov(Z, education) ≠ 0 | The instrument actually moves the endogenous regressor. |
| Exclusion (validity) | Cov(Z, u) = 0 | After controlling for schooling, the instrument has no direct link to earnings. |
Both matter—miss either one and the cure can be worse than the disease.
| Research question | Instrument Z | Why it might work |
|---|---|---|
| Education → wages (EU) | Distance to nearest university | Living far raises travel cost, discouraging enrolment; distance itself should not influence wages once education is fixed. |
| Alcohol price → accident deaths | Excise-tax hikes | Taxes shift drink prices exogenously; tax laws aren't tied to local driving habits. |
| House price → fertility | Historical land-use regulations | Old zoning shocks affect housing supply but not birth preferences directly. |
Stage 1 (first-stage regression)
Predict schooling from the instrument(s):
educationi = π0 + π1 Zi + π2 Wi + vi,
where Wi = other exogenous controls. Save the fitted values education̂i.
Stage 2 (second-stage regression)
Plug the predicted schooling into the wage equation:
wagei = β0 + β1 education̂i + β2 Wi + ei.
The coefficient β̂1IV is your causal estimate—if Z meets relevance & exclusion.
import linearmodels.iv as iv
# df contains wage, education, distance, controls
iv_mod = iv.IV2SLS.from_formula(
'wage ~ 1 + controls + [education ~ distance]',
data=df).fit(cov_type='robust')
print(iv_mod.summary)
The [education ~ distance] notation tells linearmodels to treat distance as the instrument.
The first-stage F-statistic checks relevance: it asks whether the excluded instrument(s) add real explanatory power for schooling once the other controls are included.
Rule of thumb: if F < 10 the instrument is weak → IV estimates are unreliable (huge variance, bias toward OLS). For multiple instruments, use the Kleibergen-Paap rk Wald F (software prints it).
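A minimal first-stage sketch with statsmodels, using the column names assumed earlier in this chapter (education, distance, controls); the F-test compares the first stage with and without the excluded instrument:
import statsmodels.formula.api as smf

unrestricted = smf.ols('education ~ distance + controls', data=df).fit()
restricted = smf.ols('education ~ controls', data=df).fit()

f_stat, p_val, _ = unrestricted.compare_f_test(restricted)    # F-test on the excluded instrument
print(f"First-stage F: {f_stat:.1f} (worry if it is below 10)")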
If you have more instruments than endogenous regressors, you can run a Hansen J-test (a.k.a. Sargan test):
Null: all instruments satisfy exclusion.
A large p-value ⇒ cannot reject validity; a tiny p-value ⇒ some instrument likely corrupt.
Remember, the test can fail to detect a bad instrument—economic logic still rules.
β̂1IV = Average causal effect for people whose schooling was actually shifted by Z.
This is the Local Average Treatment Effect (LATE).
For distance-to-college, the estimate speaks about students on the margin of enrolling because of travel costs—not necessarily about lifelong learners pursuing PhDs online.
| Limitation | Why it hurts | Antidote |
|---|---|---|
| Weak-Z variance blow-up | Enormous SEs hide any effect | Collect stronger instruments; drop IV if F<10 |
| Exclusion untestable | Economic story must be convincing | Use multiple, conceptually diverse instruments |
| Local, not global, effect | Policy extrapolation tricky | Be explicit: "estimate applies to distance-affected students" |
| Small-sample bias | IV consistent but biased in finite n | Many obs, or use jackknife IV (JIVE) |
- Policy: A ministry estimates returns to extra schooling net of ability bias before funding grants.
- Health: Doctors gauge causal impact of alcohol on heart disease using local-tax changes.
- Finance: Analysts isolate how Fed announcements move bank lending by instrumenting interest rates with Fed-Funds futures surprises.
- Explain the economic story of the instrument in <150 words.
- Show the first-stage table; report F-stat.
- Check over-ID tests if you have extra Z's.
- Compare OLS vs. IV estimates; huge jumps demand discussion.
- Translate LATE—who exactly experiences the estimated effect?
Key take-aways
- Endogeneity kills OLS; a good instrument revives causal inference.
- Success hinges on relevance and exclusion—prove both with data and economic reasoning.
- 2SLS is mechanically simple yet statistically subtle: weak or invalid instruments can do more harm than good.
- Always pair IV results with diagnostics and a clear story of who the estimate describes.
Chapter 10 serves a taste of time-series and panel data—bringing dynamics and repeated observations into the mix, and showing how serial correlation and unobserved heterogeneity reshape everything you've learned so far.
Chapter 10: A First Taste of Time-Series & Panel Data
Table of Contents
- Why bother with new data shapes?
- A time-series warm-up with the ECB deposit rate
- When dependence ruins OLS rules
- Enter panel data: many firms, many years
- Fixed-effects (FE) vs. random-effects (RE)
- Tiny code peek (Python)
- Assumptions worth checking
- Why it matters on the ground
- Key take-aways
- Next stop
Time-series give you one entity tracked across dates—ideal for spotting trends and shocks. Panel (cross-section × time) lets you watch many entities through time, so you can net out anything that never changes within each entity. Ignoring these structures can wreck your standard errors and twist causal stories. Think of it as using a hammer when you really need a drill.
The European Central Bank's deposit facility rate swings as policy tightens or loosens. Since mid-2024 it has fallen from 4% to 2%. Plot the rate against monthly euro-area inflation and two traits pop out:
- Trend breaks when the ECB pivots.
- Memory—today's rate is usually close to last month's.
That second trait is autocorrelation: values relate to their own past.
A first-order autoregression captures that memory:
rt = c + ϕ rt−1 + εt,
where rt is the rate this month. If |ϕ| < 1 the series is stationary—its mean and variance stay put.
Why you care: Stationarity keeps forecasts and confidence intervals behaving.
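A quick stationarity sketch with the Augmented Dickey-Fuller test, assuming the deposit rate sits in a pandas Series called rate:
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(rate.dropna())
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3f}")
# small p-value → reject the unit root; large p-value → treat the series as non-stationary (difference it)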
| OLS assumption | Time-series reality | Fix |
|---|---|---|
| Independent errors | Residuals often cluster in runs | Use Newey-West or AR model |
| Homoskedasticity | Volatility spikes around crises | Switch to GARCH or robust SE |
| Exogenous xt | Policy rates respond to inflation → feedback | Instrumental variables for TS |
Imagine 500 EU banks observed quarterly from 2015 – 2025. Goal: link capital ratio to return on equity (ROE) while controlling for anything that is bank-specific but time-invariant (e.g., corporate culture).
| Question | FE answer | RE answer |
|---|---|---|
| What it does | Subtracts each entity's own mean → focuses on within-bank variation | Treats entity effects as random draws uncorrelated with regressors |
| Keep time-invariant vars? | No | Yes |
| Needs strict exogeneity? | Less stringent | Stronger: entity effects must be uncorrelated with all xit |
| Hausman test | N/A | Rejects RE if estimates diverge from FE |
Plain rule: if you suspect omitted traits correlate with your x's, use fixed effects.
from linearmodels.panel import PanelOLS
df = banks.set_index(['bank_id','quarter'])
y = df['roe']
X = df[['capital_ratio','size','risk']]
fe_mod = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(cov_type='clustered', cluster_entity=True)
print(fe_mod.summary)
entity_effects=True sweeps out each bank's constant unobserved quirks; clustering by bank fixes serial correlation inside panels.
For time-series:
- Stationarity (ADF or KPSS tests)
- No leftover autocorrelation (Ljung-Box)
For panels:
- Enough time periods or entities (rule of 30)
- Serial correlation & heteroskedasticity inside entities (cluster SE)
- Hausman test if you're tempted by RE
Failing any test doesn't kill the project—it tells you which robust method to call.
- Central-bank watchers forecast rate cuts using AR terms to capture policy inertia.
- Asset managers run FE models to see if ESG scores raise returns independent of slow-moving sector traits.
- Public-health analysts stack regions × years to separate vaccine campaign effects from fixed geography.
Key take-aways
- Time-series = one entity over time → check for memory and stationarity.
- Panel data = many entities over time → FE wipes out hidden constants; RE keeps them if uncorrelated.
- Robust or clustered standard errors are your insurance policies against serial dependence.
- Always translate results: "A 1 pp higher capital ratio lifts quarterly ROE by 0.15 pp within the same bank."
In Chapter 11 we turn to modern extensions: Ridge and Lasso shrinkage that keep linear models alive when predictors multiply, plus cross-validation for tuning the penalty. See you there!
Chapter 11: Modern Extensions: Ridge & Lasso Keep Linear Models Alive
Table of Contents
- Why should you care?
- The intuition in 90 seconds
- Ridge regression: gentle shrinkage
- Lasso: sparse and interpretable
- Quick Python demo
- Real-world case: predicting apartment rents across Europe
- Assumptions worth remembering
- Choosing λ: cross-validation is king
- Caveats and extensions
- Key take-aways
- Where to go from here
Your data set now holds 5,000 features: web-scraped neighbourhood facts, satellite pixels, sentiment scores. Ordinary OLS refuses to run if features outnumber observations, and even when it runs, coefficients explode from collinearity.
Shrinkage methods—Ridge and Lasso—solve both problems with one clever idea: penalise big, wobbly coefficients.
Bias–variance trade-off
- Small bias + huge variance = wild predictions.
- A tiny extra bias traded for a big drop in variance = lower overall error.
Shrinkage buys the second outcome.
Geometry
Imagine the OLS solution as the bottom of a bowl. Ridge adds a smooth rubber band around the origin; Lasso adds a diamond-shaped fence. Both keep the solution near zero where possible.
Ridge solves
minβ ∑ (yi − xiᵀβ)² + λ ∑ βj².
λ = tuning parameter (≥ 0). No coefficient is forced to zero; all are nudged smaller.
When to use: predictors highly correlated, you care about prediction accuracy more than exact feature selection.
Lasso solves
minβ ∑ (yi − xiᵀβ)² + λ ∥β∥₁, where ∥β∥₁ = ∑|βj|.
The L₁ penalty produces exact zeros → automatic variable selection. No closed-form; solved via coordinate descent.
When to use: thousands of noisy features, managers want a short list of drivers.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
X, y = load_your_big_dataset() # shape ~ (1000, 5000)
ridge = make_pipeline(StandardScaler(),
RidgeCV(alphas=[0.1, 1, 10], cv=5)).fit(X, y)
lasso = make_pipeline(StandardScaler(),
LassoCV(alphas=[0.001, 0.01, 0.1], cv=5,
max_iter=5000)).fit(X, y)
print("Ridge R²:", ridge.score(X, y))
print("Active Lasso features:", (lasso[-1].coef_ != 0).sum())
StandardScaler is vital; penalties are scale-dependent. CV searches for the λ that minimizes out-of-sample error.
Data: 50,000 listings, 1,200 engineered features (text keywords, distance to coffee bars, street-view textures).
- Baseline OLS (top 20 manual features) → RMSE = €140/m.
- Ridge (all features) → RMSE = €92/m.
- Lasso shrinks model to 57 active features → RMSE = €95/m, plus a readable shortlist for managers.
| Method | Needs k < n? | Handles multicollinearity? | Gives unbiased coefficients? |
|---|---|---|---|
| OLS | Yes | No | Yes (if exogenous) |
| Ridge | No | Yes | Adds small bias |
| Lasso | No | Yes | Adds bias, selects features |
All three still rely on X being exogenous—if hidden confounders exist, you still need tools like Instrumental Variables.
- Split data into 5–10 folds.
- Try a grid of λ values.
- Pick the one that minimizes validation error (RMSE or MAE).
- Refit the model on the full data with that λ.
Avoid "eye-balling" or hunting the best training fit—the penalty's whole job is to excel out-of-sample.
- Standard errors: classical formulas break; use bootstrapping or debiased Lasso for inference.
- Group features: group-Lasso or elastic net blend L₁ + L₂ when features arrive in blocks (see the elastic-net sketch below).
- Non-linearity: combine with splines or feed predictions into tree-based models; Ridge and Lasso only address linear coefficients.
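As flagged above, here is a minimal elastic-net sketch in the same style as the Ridge/Lasso demo, reusing X and y from earlier in this chapter:
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8],          # mix between L₁ (Lasso) and L₂ (Ridge)
                 alphas=[0.001, 0.01, 0.1],
                 cv=5, max_iter=5000)).fit(X, y)
print("Chosen l1_ratio:", enet[-1].l1_ratio_, "alpha:", enet[-1].alpha_)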
Key take-aways
- Ridge and Lasso keep linear models useful when features explode in count or correlation.
- Ridge prioritises stable prediction; Lasso adds interpretability via sparsity.
- Cross-validation picks the right penalty; scaling inputs is non-negotiable.
- Shrinkage introduces bias deliberately to slash variance—often a winning trade.
- Try elastic net when Ridge vs. Lasso feels like a coin toss.
- Explore causal forests or double machine learning to blend flexible prediction with causal inference.
- Above all, keep the bigger goal in sight: turn complex data into clear, actionable economic insight.
Happy modelling!