
Compensation benchmarking with Claude

Difficulty: intermediate
Setup time: 30 min
For: recruiter · compensation-analyst · hiring-manager
Category: Recruiting & TA

A Claude Skill that takes a role’s level, geography, and a comp-survey export (Radford, Pave, Carta), and produces a structured pay-band recommendation per component (base, equity, bonus / OTE) with named percentile, source-survey citation, and the calibration notes the recruiter brings to the offer call. Replaces the recruiter’s open-tab-spreadsheet juggle with a single document the hiring manager and finance approver can sign off on. Posts the public-facing range (NYC LL 32-A, CO/CA/WA pay-transparency compliant) as a separate output.
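A per-component line in the recommendation looks roughly like this (numbers and layout are illustrative, not the skill’s fixed output format):

```
Base:   $167,000 – $204,000    60th pct · Radford 2025-Q3 · SF Bay Area · n=42
Equity: $58,000/yr annualized  60th pct · same cell · converted at current strike
Bonus:  10% target             per the firm’s base/variable ratio for the function
```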

When to use

  • You’re posting a new role and need a public range that is defensibly sourced (not the vague “industry standard” framing, not “75th percentile” without naming the survey or the geography).
  • You’re preparing an offer and need the band the hiring manager can approve without a half-day finance back-and-forth.
  • You’re auditing existing comp bands quarterly and want a structured comparison of “what we pay” vs. “what the survey says” per role family.

When NOT to use

  • Unilateral comp decisions outside the approval chain. The skill produces a recommendation. The compensation philosophy and approval matrix are owned by People Ops / Finance / Comp Committee. The skill informs them; it does not replace them.
  • Equity comp at pre-Series-B startups. Equity benchmarking at very early stages is driven more by the firm’s specific cap table and dilution path than by market data. The survey numbers carry little signal there.
  • Negotiation script generation. The skill outputs a band; it does not author negotiation language. Auto-generated comp-negotiation language reads as cold and damages candidate experience.
  • Candidate-specific exception decisions. “Can we offer 15% above the band for this candidate?” is a question for the hiring manager and finance, not for the skill. The skill informs by surfacing the band; it does not approve exceptions.
  • Geographies where the survey has thin data. Surveys cover the US, EU, and major APAC markets well; emerging-market data (LatAm, Africa, smaller APAC) is thinner. The skill flags low-N geographies in the output.

Setup

  1. Drop the bundle. Place apps/web/public/artifacts/compensation-benchmark-skill/SKILL.md into your Claude Code skills directory.
  2. Configure the survey source. The skill reads exports from Radford, Pave, Carta, or a custom CSV. Per-source schema lives in references/1-survey-source-schemas.md. The skill does not call survey APIs directly — exports go through your comp-analyst’s approved access path.
  3. Set the firm’s compensation philosophy. What percentile does the firm pay at (50th, 60th, 75th)? Does base+equity sum to a target percentile, or is each component calibrated separately? The philosophy lives in references/2-comp-philosophy-template.md and is the input the skill calibrates against; a sketch follows this list.
  4. Configure approval-chain output. The skill emits the public-facing range as a separate output (NYC LL 32-A, CO/CA/WA pay-transparency compliant). Wire that output to your job-posting publication step (Greenhouse / Ashby job description), or copy by hand, depending on your team’s process.
  5. Dry-run on a closed offer. Benchmark a role you closed last quarter. Compare the skill’s band to what the offer actually was. If the divergence is large, either the survey export is off-cycle or the firm’s philosophy file doesn’t match how offers are actually being approved.
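For a concrete sense of what the philosophy file pins down, a sketch (field names are illustrative; the fillable template in references/2-comp-philosophy-template.md is the real interface):

```
target_percentile: 60          # the firm pays at the survey 60th
calibration: per-component     # or total-comp: base + equity sum to the target
base_band_width: ±10%
variable_split:
  sales: 50/50 base vs. variable (OTE)
  engineering: no variable
```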

What the skill actually does

Five steps. The order keeps the deterministic survey lookups before the LLM-driven calibration, because letting the model paraphrase survey numbers introduces drift the recruiter can’t audit.

  1. Validate the role definition. Check that the role’s level, geography, and function are present and match values in the survey export. Halt on missing or ambiguous fields (“Senior Engineer” without a level on the firm’s ladder is ambiguous).
  2. Look up survey percentiles. Deterministic lookup, not LLM. For each of base, equity (annualized), and bonus / OTE, pull the 25th / 50th / 60th / 75th / 90th percentiles from the survey export for the matched (level, geography, function) cell. If the cell has fewer respondents than the survey’s documented sample-size threshold (varies by survey: Radford typically 5+, Pave typically 10+), flag low-N and refuse to recommend a percentile-based band — fall back to broader (level, function) without geography or to expanded geography (e.g. “US-wide” instead of “Bay Area”). A code sketch of this lookup follows the list.
  3. Calibrate against firm philosophy. Read the firm’s comp philosophy. Apply the target percentile to the survey numbers. The output is a structured band per component:
    • Base: target_pct of survey, with a ±10% range to absorb candidate-level variation.
    • Equity: same; convert to dollar-value at the firm’s strike price for new grants, document the math.
    • Bonus / OTE: target_pct on the OTE; split base/variable per the firm’s ratio for the function.
  4. Compose the public-facing range. Per NYC LL 32-A and CO/CA/WA pay-transparency requirements, the public posting needs a base salary range. Default: “min of the band’s lower edge to max of the band’s upper edge, expressed as a single salary range.” If the role straddles US states with different transparency-law thresholds, the broadest range applies. The skill emits this as a separate output for direct use in the JD.
  5. Emit the recommendation report + audit record. The report has: per-component bands with cited percentile and source survey, calibration notes, low-N or thin-data warnings, and the public-facing range. The audit record is one JSONL line: role, geography, level, percentile targeted, survey source, survey export date, recommended band — for the firm’s pay-equity audit later in the year. An example line follows the list.
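A minimal sketch of the step-2 lookup and step-3 calibration, assuming a custom-CSV export with columns level,geography,function,component,p25,p50,p60,p75,p90,n (the real per-source schemas live in references/1-survey-source-schemas.md; every name below is illustrative, not the skill’s actual interface):

```python
import csv

PCTS = ("p25", "p50", "p60", "p75", "p90")

def load_cells(path):
    """Index survey rows by (level, geography, function, component)."""
    cells = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["level"], row["geography"], row["function"], row["component"])
            cells[key] = {p: float(row[p]) for p in PCTS} | {"n": int(row["n"])}
    return cells

def lookup(cells, level, geo, func, component, min_n=5):
    """Deterministic percentile lookup with the low-N fallback from step 2."""
    cell = cells.get((level, geo, func, component))
    if cell and cell["n"] >= min_n:
        return cell, None
    wide = cells.get((level, "US-wide", func, component))  # expanded-geography fallback
    if wide and wide["n"] >= min_n:
        return wide, f"low-N fallback: {geo} -> US-wide"
    return None, f"no cell meets n >= {min_n}: refuse a percentile-based band"

def calibrate(cell, target_pct="p60", band_width=0.10):
    """Step 3: apply the firm's target percentile, then a ±10% band."""
    mid = cell[target_pct]
    return {"low": round(mid * (1 - band_width)),
            "mid": round(mid),
            "high": round(mid * (1 + band_width))}
```

Keeping lookup() and calibrate() free of any model call is the point: every number in the report can be re-derived from the export, which is what makes the band auditable.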
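And the audit record from step 5, as one illustrative JSONL line (field names are a sketch, not a pinned schema):

```
{"role": "Senior Backend Engineer", "level": "L5", "geography": "SF Bay Area", "function": "engineering", "target_percentile": 60, "survey": "Radford", "export_date": "2025-09-30", "band": {"base": [167000, 204000], "equity_annualized": 58000, "bonus_pct": 10}}
```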

Cost reality

Per role benchmarked, on Claude Sonnet 4.6:

  • LLM tokens — typically 5-8k input (role definition + survey export rows + philosophy + skill instructions) and 1-2k output (structured report). Roughly $0.04-0.08 per role; a back-of-envelope check follows this list. Negligible.
  • Survey access cost — the survey subscriptions themselves are the binding cost (Radford, Pave, and Carta run from $15K to $80K+ annually, depending on coverage). The skill assumes the comp analyst already has access; it does not change that math.
  • Recruiter / comp-analyst time — the win. Hand-composing a comp recommendation is 30-90 minutes per role (survey lookup + spreadsheet juggle + philosophy application + writing the calibration note). The skill is 5-10 minutes including the dry-run sanity check.
  • Setup time — 30 minutes once for the philosophy file and survey-export integration. The philosophy file is rarely revised; survey exports refresh quarterly.
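The token figure checks out, assuming Sonnet-class list pricing of roughly $3 per million input tokens and $15 per million output tokens (an assumption; verify against current pricing):

```python
# Worst case of the ranges above, at assumed pricing: $3/M input, $15/M output.
input_cost = 8_000 / 1_000_000 * 3.00    # $0.024
output_cost = 2_000 / 1_000_000 * 15.00  # $0.030
print(f"~${input_cost + output_cost:.3f} per role")  # ~$0.054, inside $0.04-0.08
```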

Success metric

Track three numbers, quarterly:

  • Offer acceptance rate within 3 weeks — calibrated comp drives acceptance. Below 60% in your geography and you’re under-paying; above 90% you may be over-paying. Both directions matter; the right number depends on the firm’s compensation philosophy (candidates at high-equity startups accept a lower base; candidates at high-base mid-stage firms expect a higher one).
  • Comp-band edit rate post-skill — share of the skill’s recommended bands that the hiring manager or finance edits before approval. Should sit at 10-25%. Above 40% means the philosophy file doesn’t reflect actual approval behavior; below 5% means the panel is rubber-stamping (the failure mode the skill is designed against).
  • Pay-equity audit drift — at the annual pay-equity review, do the skill’s recommendations correlate with where actual offers landed? If the audit surfaces equity gaps the skill’s recommendations would have closed, the skill is doing its job; if the audit surfaces gaps the skill’s recommendations would have widened, the philosophy file or the calibration is biased.

vs alternatives

  • vs Pave / Carta / Radford / Mercer reports directly. The reports are the source data; the skill composes them into a per-role recommendation. Pick the reports alone if your comp analyst lives in them and the recruiter only consumes “tell me the 75th.” Pick the skill if the recruiter needs the calibration note + public range + audit record without the analyst in the loop for every role.
  • vs ChatGPT-style “what should I pay a senior engineer in NYC.” Generic chat returns paraphrased survey data with no audit trail and no version-pinned source — that’s not defensible at pay-equity audit time. The skill cites the survey export by name and date.
  • vs spreadsheet templates. Templates are fine until the firm’s philosophy changes or the survey export refreshes; then every saved template silently goes stale. The skill reads from current sources every run.
  • vs no benchmarking. The default at many smaller firms. Predictable failure mode: pay-equity gaps surface at the annual audit, and the recruiter gets blamed for individual offers that were inside the firm’s normal practice. Defensible benchmarking is the cheapest intervention against this.

Watch-outs

  • Survey-export staleness. Guard: the skill reads the export’s dated metadata and warns if the export is older than 6 months. Survey data shifts faster than an annual cycle; quarterly refresh is the floor.
  • Geography mis-mapping. Guard: the skill matches the role’s geography against the survey’s geography taxonomy explicitly (Pave’s “SF Bay Area” is not the same cell as Radford’s “San Francisco MSA”). If the match is ambiguous, the skill halts and asks the recruiter to disambiguate rather than picking a default.
  • Low-N cell. Guard: the skill refuses to recommend a percentile-based band when the survey cell has fewer respondents than the survey’s documented threshold. It falls back to a broader cell (broader function, broader geography) and notes the fallback.
  • Equity-comparison drift. Guard: equity values are annualized and converted at the firm’s current strike price. The conversion math is documented in the report. The audit record stores the raw and converted values so future audits can re-derive.
  • Public-facing range too tight. Guard: if the public range is so tight that it functions as a single number, the skill warns. Posting “$140K-$145K” violates the spirit (and arguably the letter) of NYC LL 32-A, which requires a “good faith” range. The skill enforces a minimum band width per geography; this guard and the staleness check are sketched after this list.
  • Bias propagation through historical comp. Guard: if the firm’s philosophy file is calibrated by “match what we’ve paid in this band before,” the skill propagates whatever pay gaps exist in historical data. The skill flags this when philosophy matching closely tracks historical pay rather than survey percentiles, and recommends the comp analyst run a separate pay-equity check.
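Two of these guards are simple enough to sketch (the 8% minimum width is an illustrative default, not a statutory figure; the skill sets it per geography):

```python
from datetime import date

def export_is_stale(export_date: date, today: date, max_age_days: int = 183) -> bool:
    """Staleness guard: warn when the survey export is older than ~6 months."""
    return (today - export_date).days > max_age_days

def range_too_tight(low: float, high: float, min_width: float = 0.08) -> bool:
    """Width guard: flag a public range so narrow it works as a single number."""
    return (high - low) / low < min_width
```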

Stack

The skill bundle lives at apps/web/public/artifacts/compensation-benchmark-skill/ and contains:

  • SKILL.md — the skill definition
  • references/1-survey-source-schemas.md — per-source export schemas (Radford, Pave, Carta, custom CSV)
  • references/2-comp-philosophy-template.md — fillable per-firm philosophy file

Tools the workflow assumes you use: Claude (the model), Ashby or Greenhouse (the ATS, for posting the public range).

Related concepts: recruiting funnel metrics, offer acceptance rate, candidate experience.
