How we validate the 509 dataset: what we corrected, why, and what it changes

Methodology · White paper · Updated 2026-06-28

The ABA Standard 509 disclosures are the best public record of who law schools admit, what they charge, and where their graduates end up — but fifteen years of spreadsheets kept by more than two hundred schools leave the raw record with real errors. This is the full accounting of how Exhibit 509 reconciles that record: the one rule we follow, the audit that backs it, every class of correction we made and why, the year each field actually describes, and the limits we keep visible.

The one rule: each field is checked against its own source of truth

Not every number on a school's page comes from the same document, so not every number is validated the same way. The governing principle is simple: each field is reconciled against the authoritative ABA source for that field, and nowhere else.

Field group	Authoritative source
Admissions (LSAT & GPA 25/50/75, plus GRE and JD-NEXT percentiles where reported), enrollment & volume (applications, offers, JD enrollment), tuition, fees, scholarships & grants (the award-distribution shares and the grant-amount dollar percentiles), bar passage	The ABA Standard 509 PDF reports — 3,009 individual reports, 2011–2025
Employment outcomes (`emp_*`)	The ABA's separate employment-summary disclosure — an Excel workbook, not carried in the 509 PDF

This distinction matters and we state it plainly: the 509 PDF does not contain employment data. Anyone who claims a school's employment numbers are "validated against the 509 PDF" is mistaken — the PDF has no such section. Employment's only authoritative source is the ABA's separate employment workbook, which is what we reconcile those fields against. For that reason employment sits outside the PDF audit below, by construction, not by oversight.

The PDF-controlled audit

To check the PDF-sourced fields we ran a controlled audit. The control is the authoritative 509 PDFs themselves, and each report is identified by its URL root plus reference short-name. That key is deliberately independent of the name crosswalk that populates the dataset, so agreement between the two also validates the crosswalk. The audit compares 25,457 individual cells against the PDF they came from.

Field group	Cells checked	Agreed with the PDF
Fees	1,596	1,596
Tuition	2,762	2,760
Admissions percentiles (LSAT/GPA)	16,008	16,005
Enrollment & volume — 2017–2025	3,697	3,687
Enrollment & volume — 2011–2016	1,394	1,392
Total	25,457	25,440

The crosswalk audited clean: 0 of 210 school mappings diverged. The 17 cells that did not match are not data errors. Every one is a documented artifact of the audit's own block-parser or of the PDF itself, and the shipped value is the correct one:

Tuition (2): Florida State 2019 and South Carolina 2022 non-resident tuition — the audit parser misread the block; our values ($22,634 / $35,342) match the master compilation.
Volume (12): applications/offers off by 1–4 from parser rounding (Iowa 2025, San Diego 2023, Toledo 2025, South Dakota 2025) and enrollment off by 1–3 from parser drift (Penn State Dickinson 2023, Texas Tech 2023), all inside the JD1–4-vs-Total sum check's tolerance.
Percentiles (3): two legacy cells where the PDF reports Full-Time-vs-All differently (Akron 2013), and one where the PDF itself is internally impossible (South Dakota 2011) — our data is correct against the underlying figures.

Read the table as counts, not as a slogan. On the PDF-sourced fields the dataset matches the source documents cell-for-cell except for seventeen places where the audit tool, not the data, rounds or misreads. There is no category of known, uncorrected data error remaining in the PDF-sourced fields.
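
The totals in the audit table can be re-derived from its rows. The sketch below is only an arithmetic check on the published counts, not the audit tool itself; the dictionary layout and the `tally` helper are illustrative names, not part of the pipeline.

```python
# Cell counts copied from the audit table: (cells checked, cells that agreed).
AUDIT = {
    "Fees": (1_596, 1_596),
    "Tuition": (2_762, 2_760),
    "Admissions percentiles (LSAT/GPA)": (16_008, 16_005),
    "Enrollment & volume, 2017-2025": (3_697, 3_687),
    "Enrollment & volume, 2011-2016": (1_394, 1_392),
}


def tally(audit):
    """Return (checked, agreed, misses) summed across every field group."""
    checked = sum(c for c, _ in audit.values())
    agreed = sum(a for _, a in audit.values())
    return checked, agreed, checked - agreed


checked, agreed, misses = tally(AUDIT)
print(f"{agreed:,} of {checked:,} cells agree; {misses} do not")
print(f"agreement rate: {agreed / checked:.3%}")
```

The per-group misses (0, 2, 3, 10 and 2) add up to the 17 cells itemized above: 2 tuition, 12 volume, 3 percentile.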

A second layer: reconciliation across every metric

The PDF audit checks that a value matches its source document. A second, independent pass checks whether the values are internally consistent across 94 metrics, 218 schools and 2011–2025. It tests the relationships the numbers must obey: percentiles that must stay in order, shares that must sum to 100, and sub-buckets that must add up to their total. It also looks for single-year reversals. This is also where we scrutinize the fields the PDF audit doesn't touch, including employment composition and demographics.

The rule here is the opposite of silent cleanup: every anomaly is flagged for human review and never auto-corrected. A flag is not proof of error. The latest pass raised 128 flags after de-duplication — and most are real reporting events or structural features of the ABA forms, not defects. We keep them visible rather than smooth them away; the full log is public in validation-report.md.

Severity	What it means
`logical`	A hard contradiction (A > B where that is impossible). Highest priority.
`bounds` / `ordering`	A value outside its valid range, or a percentile out of order; usually a sign of a parsing or column-mapping problem.
`reconcile`	A group that should sum or match is off — often structural, verified case by case.
`spike`	A single-year reversal; kept as REAL when sibling buckets absorb it, flagged for review when they don't.
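
A minimal sketch of how checks of this kind might be written, one per severity class. It is not the production validator: the function names, the field names in the example, the 1-point sum tolerance and the 30-point spike threshold are all assumptions chosen for illustration. Each check only returns flags; nothing is corrected.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    severity: str  # "logical", "bounds", "ordering", "reconcile" or "spike"
    field: str
    detail: str


def check_not_greater(field, a, b):
    """logical: A may never exceed B."""
    return [Flag("logical", field, f"{a} > {b}")] if a > b else []


def check_bounds(field, value, lo, hi):
    """bounds: a value must sit inside its valid range."""
    if not lo <= value <= hi:
        return [Flag("bounds", field, f"{value} outside [{lo}, {hi}]")]
    return []


def check_order(field, p25, p50, p75):
    """ordering: percentiles must be non-decreasing."""
    if not p25 <= p50 <= p75:
        return [Flag("ordering", field, f"{p25}/{p50}/{p75} out of order")]
    return []


def check_sum(field, parts, expected, tol=1.0):
    """reconcile: parts should add up to the expected total (tol is assumed)."""
    total = sum(parts)
    if abs(total - expected) > tol:
        return [Flag("reconcile", field, f"sum {total:g}, expected {expected:g}")]
    return []


def check_spike(field, series, jump=30.0):
    """spike: one year departs from both neighbours in the same direction."""
    flags = []
    for i in range(1, len(series) - 1):
        d_prev = series[i] - series[i - 1]
        d_next = series[i] - series[i + 1]
        if d_prev * d_next > 0 and min(abs(d_prev), abs(d_next)) > jump:
            detail = f"{series[i - 1]} -> {series[i]} -> {series[i + 1]}"
            flags.append(Flag("spike", field, detail))
    return flags


# Hypothetical inputs, for illustration only.
flags = (
    check_order("lsat", 150, 155, 153)
    + check_sum("demographic_shares", [40.0, 30.0, 20.0], expected=100.0)
    + check_spike("grant_share", [50.0, 48.0, 0.0, 49.0, 51.0])
)
for flag in flags:
    print(flag.severity, flag.field, flag.detail)  # logged for human review
```

Returning a list of flags rather than raising keeps the pass non-destructive, which fits the rule that an anomaly is reviewed by a person and never auto-corrected.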

What the flags actually are, in order of how common they are:

Demographic shares that sum to roughly 9 to 12 points short of 100. This is the largest cluster. In the earlier ABA forms, nonresident and international students sit outside the race/ethnicity buckets, so the buckets legitimately don't sum to the whole class. Structural, not a parse error.
Employment firm-size buckets running a few points above the headline "law firm jobs" line — a rounding-and-definition seam in the employment form; logged, within a known tolerance.
Grant-policy "spikes" that reverse — a school that briefly stops awarding grants and resumes. These reconcile across the sibling buckets back to ~100, so they are real policy years, kept and annotated rather than smoothed.
The Marquette / Wisconsin bar series — flagged as internally impossible (it alternates 0/100), but real: diploma privilege means only the handful of graduates who sit an exam are counted, so the denominator is tiny. The small-sample note now appears on those school pages.

A short tail of flags points to genuine open questions we name rather than explain away. Four 2025 clinic entries report more seats filled than available: the figures match what the school disclosed — the University of Pennsylvania's Carey Law shows 140 filled against 132 available, with the same pattern at UNLV, Washburn and Widener Delaware — but the mechanism isn't recoverable from the form. It could be a mid-year clinic expansion, a course-selection or graduation-placement adjustment, or a reporting-input quirk. We treat it as a data-input question that can't be resolved without contacting the school directly, so the values stay exactly as disclosed and flagged rather than guessed at or quietly changed. One grant-mix reversal — Columbia's 2021 share of students on a less-than-half-tuition grant, which reads near 80% for that single year — is likewise kept as reported. We would rather show you an open question than paper over it.

What we corrected, and why

1. The modern enrollment off-by-one

This was the largest single fix. For 2018–2025, the curated JD-enrollment series had been aligned to the prior academic year's population (the denominator the grants section uses) rather than the present-year "J.D. Enrollment as of October [Y]" census on the front of the report. We re-sourced every modern enrollment cell from that present-year census — 1,383 cells across 201 schools — and confirmed it against the raw PDF (e.g. CUNY 2020 = 672) and an independent PDF-verified dataset (99.6% match, zero regressions). A build guard now fails the build if any 2018+ enrollment cell drifts off the present-year census, so the error cannot silently return.
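
A build guard of this kind can be sketched in a few lines. The version below is an assumption about its shape, not the project's actual build step: `enrollment_guard`, `EnrollmentDrift` and the `(school, year)` keying are hypothetical names, and only the CUNY 2020 figure comes from the text.

```python
class EnrollmentDrift(Exception):
    """Raised when a modern enrollment cell disagrees with the census."""


def enrollment_guard(dataset, census, first_year=2018):
    """Fail if any cell from first_year on differs from the present-year census.

    Both arguments map (school, year) -> JD enrollment. A cell with no census
    counterpart is treated as drift, which is a design choice of this sketch.
    """
    drift = []
    for (school, year), value in sorted(dataset.items()):
        if year < first_year:
            continue  # earlier years are outside this guard
        expected = census.get((school, year))
        if value != expected:
            drift.append(f"{school} {year}: dataset={value}, census={expected}")
    if drift:
        raise EnrollmentDrift("; ".join(drift))


census = {("CUNY", 2020): 672}
enrollment_guard({("CUNY", 2020): 672}, census)  # agrees, so nothing is raised
```

Raising an exception rather than logging a warning is what makes the guard a hard stop: a build that raises does not ship.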

2. The 2026-06-27 enrollment re-shift

Closely related: modern enrollment had been normalized to comp[Y−1] to line up with the grants section's denominator. It is now sourced from comp[Y] — the year-Y report's JD-enrollment grand total — which is the convention the live site uses and which matches the 2011–2017 regime. The re-shift touched 1,552 cells and filled no blanks; it only moved values onto the correct year.

3. Tuition reconciliation

The raw tuition record carries recurring, mechanical errors: a stray apostrophe that zeroes a cell, per-term figures reported where the annual figure belongs, and cells entered at 2× or ½. Each is reconciled against the master workbook and annualized to a consistent basis. The full field guide is its own piece: The holes in 15 years of 509 tuition data.
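
The three error patterns lend themselves to a ratio test against the master figure. The sketch below shows one way to label them and assumes a nonzero master value; `parse_cell`, `classify`, the 0.5% tolerance and the $30,000 master value are hypothetical, and a label is only a prompt for reconciliation, not a correction.

```python
def parse_cell(text):
    """Parse a raw tuition cell, tolerating a stray apostrophe, '$' and commas."""
    cleaned = text.strip().lstrip("'").replace("$", "").replace(",", "")
    return float(cleaned) if cleaned else None


def classify(raw, master, tol=0.005):
    """Label how a parsed raw value relates to the master annual figure."""
    if raw is None or raw == 0:
        return "zeroed"
    ratio = raw / master
    candidates = (("match", 1.0), ("doubled", 2.0), ("half_or_per_term", 0.5))
    for label, expected in candidates:
        if abs(ratio - expected) <= tol:
            return label
    return "review"  # anything else goes to a person


MASTER = 30_000.0  # hypothetical annual tuition from the master workbook
for cell in ("'30,000", "$15,000", "60000", "0", "41,250"):
    print(repr(cell), "->", classify(parse_cell(cell), MASTER))
```

A value at half the master figure is ambiguous between a per-semester entry and a ½ keying error, so the sketch gives both one label and leaves the distinction to review.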

4. Residual PDF-verified corrections

After the systematic passes, a short tail of high-confidence, individually PDF-verified cells remained: applications/offers, a few LSAT/GPA values and faculty counts, and two enrollment cells (St. Thomas (Miami) 2017, Missouri-Kansas City 2025). These were applied cell-by-cell. Every adjudicated change, with its before/after value and source, is logged in the public corrections ledger.

5. The crosswalk

Mapping more than two hundred shifting school names (mergers, renames, campus splits) to stable identities is where silent corruption usually hides. Because the audit keys on URL root, independent of the crosswalk, and still agreed, the mapping is validated as a side effect: 0 of 210 mappings diverged.

Which year each field actually describes

A 509 report has one year on its cover, but the fields inside describe different moments — we keep each one faithful to how the ABA structures the report rather than forcing them onto a single year. This is essential to reading a school's page correctly: the entering-class medians and the bar-passage number on the same row are not about the same cohort.

Field	Year it describes (in a year-Y report)
Enrollment, demographics, admissions, LSAT/GPA & GRE/JD-NEXT percentiles, faculty, tuition, fees	Report year Y — the Fall-Y census and Fall-Y matriculating class
Scholarships & grants — award-distribution shares (`schol_`) and grant-amount dollar percentiles (`grant_`)	Prior academic year — the Y−1→Y awards
Bar passage (`bar`, `bar_2yr`)	Prior graduating class — the class that graduated in spring Y−1
Employment (`emp_*`)	Prior graduating class — class-of-(Y−1) outcomes, from the separate employment workbook
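
The table reduces to a small lookup on the field prefix. The sketch below assumes the `schol_`, `grant_`, `bar` and `emp_` prefixes shown above; the `describes` helper and the example field names `grant_p50`, `emp_ft` and `lsat_p50` are hypothetical.

```python
def describes(field, report_year):
    """Return the period a field describes inside a year-Y report."""
    y = report_year
    if field.startswith(("schol_", "grant_")):
        return f"{y - 1}-{y} academic year"  # prior academic year's awards
    if field.startswith(("bar", "emp_")):
        return f"class of {y - 1}"  # prior graduating class
    return f"Fall {y}"  # census and matriculating class of the report year


for name in ("lsat_p50", "grant_p50", "bar_2yr", "emp_ft"):
    print(name, "->", describes(name, 2025))
```

For a 2025 report this places the LSAT median in Fall 2025 and the bar and employment fields with the class of 2024, which is why two numbers on the same row can describe different cohorts.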

The newer admissions signals: GRE, JD-NEXT, and grant amounts

Three field groups are recent additions to the 509 form, and the dataset carries them exactly as the PDFs report them — sparse where the reporting is sparse, never filled in. They come from the same 509 PDF sections as the rest of the admissions and grants data (the First-Year Class table and the Grants & Scholarships table), but because they are new and thinly reported they sit outside the headline LSAT/GPA percentile audit rather than inside it.

GRE percentiles (Verbal, Quantitative, Analytical Writing at 25/50/75) appear from 2020 on, only for schools with enough GRE matriculants to disclose a distribution. That is a small set: 11 schools reported a full GRE distribution in 2025, and 116 school-years carry one across 2020–2025. A separate GRE-applicant count is recorded far more widely. We show the distribution only where the school published it.
JD-NEXT percentiles are newer still — the test reached the 509 form in 2024, and the adopter pool is tiny: two schools (Arizona and Dayton) disclosed a JD-NEXT distribution in 2025, three in 2024. We carry it as an emerging signal, not yet something comparable across the field.
Grant amounts — the 25th, median, and 75th-percentile grant dollars awarded to recipients (2017 on) — are kept rigorously separate from the scholarship-distribution shares (the percentage of students in each award band). Dollars live in their own fields and are never mixed with counts or percentages; a median grant and a share of students are different quantities, and the dataset never conflates them.

Coverage

The audit draws on 3,009 PDF reports across the fifteen years 2011–2025, roughly 195–210 reports per year. Coverage is a function of how many schools were ABA-accredited and reporting in a given year, which falls as schools close or merge; blank recent-year cells reflect reports the ABA has not yet released, not data we are missing.

What this does to the data

Net effect: a reader looking at a school today sees admissions, tuition, and fees that match the source PDFs cell-for-cell; enrollment that reads the correct present-year October census instead of lagging a year; tuition normalized to a single annual basis; and bar passage tied to the right graduating cohort. Employment is reconciled against the only source that carries it. The corrections moved the data measurably closer to the documents of record without inventing a single value — every fix is a re-sourcing or a re-alignment, never a fabrication, and every blank stays blank.

Limits we keep visible

Employment is from a separate ABA workbook, not the 509 PDF, and is therefore outside the PDF audit above. We will not describe it as "PDF-validated," because it cannot be.
Legacy (2011–2017) enrollment is excluded from the direct PDF audit because the older reports use a different academic-year window; its fidelity is verified separately against the ABA enrollment-and-ethnicity compilations.
The 17 residual audit misses are parser artifacts, documented above; the shipped values are correct, but we name them rather than hide them.
2026 is partial — bar passage is loaded and verified, but the rest awaits the ABA's release for that cycle.