claude-skill

Take-home assessment evaluator with Claude

Difficulty

intermediate

Setup time

40min

For

recruiter · hiring-manager · technical-screener

Recruiting & TA

A Claude Skill that scores a candidate's take-home submission against a rubric written by the hiring team, with line-by-line citations from the submitted code or documents, and produces a structured evaluation report, without ever automatically passing or failing the candidate. The interview panel uses the report to anchor the live debrief; the actual hire/no-hire decision happens in the panel discussion, not in the report. It replaces 60-90 disorganized minutes of reading per panelist ("I read this on Saturday morning and I think it was fine?") with a structured 15-minute review per panelist plus a calibrated 30-minute debrief.

When to use it

The role uses a take-home as a stage in the hiring loop (structured interviews are a prerequisite: without a written rubric, the skill has nothing to score).
You want consistent scoring across panelists. Take-home reviews are notoriously inconsistent because each panelist reads at a different time with a different level of attention; the rubric-anchored report is the leveling artifact.
The take-home is a coding exercise, a system-design write-up, a written exercise (PRD draft, mock sales-call write-up), or an integration build that produces inspectable artifacts.

When NOT to use it

Auto-pass / auto-fail in the loop. The skill produces a scored report. The hire decision happens in the panel debrief. Wiring the report's aggregate score to a stage transition triggers the same NYC LL 144 / EU AI Act exposure as auto-rejection at screening.
Live coding interviews. Different workflow (live observation of the process, not evaluation of an artifact). The interview debrief workflow covers that case.
Take-homes that require more than 4 hours of candidate work. Long take-homes are themselves a candidate-experience anti-pattern; the skill will not fix that.
Submissions where the candidate did not sign the AI-use disclosure. Rubric scoring is calibrated against a specific AI-use policy (e.g. "AI tools allowed for syntax help, not for solution generation"); without the disclosure, the skill cannot calibrate its detection of the "AI-only signal".
Plagiarism detection as the primary use. The skill flags suspicious patterns (verbatim matches with public repositories, suspiciously generic AI-generated boilerplate) but is not a forensic plagiarism-detection tool. Use a dedicated tool if you need defensible plagiarism findings.

Setup

Drop in the bundle. Place apps/web/public/artifacts/take-home-evaluator-claude-skill/SKILL.md in your Claude Code skills directory.
Write the rubric. For each take-home, write a JSON rubric with the dimensions you actually score (correctness, code quality, decision-making documented in comments / README, error handling, test coverage). Anchors per dimension at 1-5. The template is in references/1-take-home-rubric-template.md.
Configure the AI-use policy. The skill's prompt tells Claude explicitly which AI use was allowed ("syntax help only", "AI tools allowed throughout", "no AI tools", etc.). The setting must match the disclosure language in the take-home brief.
Set the panelist distribution mode. Either single-panelist mode (one report per submission) or per-panelist mode (each panelist receives the same submission, generates their own evaluation, and the skill aggregates the cross-panelist deltas). Per-panelist mode catches scoring drift but multiplies the model cost by the number of panelists.
Test on a closed take-home. Score a take-home from a candidate who was hired (or not) last quarter. Compare the skill's per-dimension scores to the panel's actual scores. Adjust the rubric anchors if the skill weights dimensions differently.

What the skill actually does

Six steps. Order matters: the deterministic checks (compile, run, file structure) happen before the LLM scores anything, because letting the model score a non-working submission produces a confident report on a broken artifact.

Validate the submission shape. Check that every deliverable named in the take-home brief exists (e.g. README.md, source files, test files). Missing deliverables → flag in the report; do NOT score those dimensions.
Run the deterministic checks. Compile the code. Run the test suite the candidate wrote. Capture the output. These are the auditable, reproducible results; the LLM does not re-litigate them.
Score per rubric dimension. For each dimension in the rubric, score 1-5 with verbatim citations from the candidate's submission (file path + line range + the code or text). Citations are required; without a citation, the score defaults to rubric anchor 1. The citation requirement keeps the model anchored in the actual submission rather than in generic feedback.
Detect the AI-use signal against the policy. Run pattern checks against the disclosed AI-use policy. Verbatim matches with public generic AI boilerplate, suspiciously consistent style across files of varying complexity, or generic comments that do not engage with the problem-specific decisions all surface ai-use-signal notes: not as a violation, only as a signal the panel discusses against the disclosed policy.
Compute the aggregate WITHOUT a hire/no-hire recommendation. Sum the per-dimension scores. Surface the aggregate as a number. Do NOT translate the aggregate into a recommendation. The skill explicitly returns "report; not a decision" rather than "pass / fail".
Emit a per-panelist or aggregated report. In single-panelist mode, the report goes to the calling panelist. In per-panelist mode, the skill aggregates across panelists, surfaces the cross-panelist deltas per dimension (and which panelist saw what differently), and emits a debrief-ready report.
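The cross-panelist deltas in the last step can be sketched as follows. This is a minimal illustration, not the skill's actual aggregation: the function name `cross_panelist_deltas` and the input shape (a mapping of panelist id to per-dimension scores) are assumptions made for the example.

```python
def cross_panelist_deltas(reports):
    """Summarize per-dimension disagreement across panelists.

    `reports` maps panelist_id -> {dimension_id: score}. This shape is a
    hypothetical one chosen for the sketch; non-integer scores such as
    "not assessed" are skipped.
    """
    dimensions = sorted({dim for scores in reports.values() for dim in scores})
    deltas = {}
    for dim in dimensions:
        scored = {
            panelist: scores[dim]
            for panelist, scores in reports.items()
            if isinstance(scores.get(dim), int)
        }
        if len(scored) < 2:
            continue  # nothing to compare with fewer than two scores
        low, high = min(scored.values()), max(scored.values())
        deltas[dim] = {
            "spread": high - low,
            "lowest": sorted(p for p, s in scored.items() if s == low),
            "highest": sorted(p for p, s in scored.items() if s == high),
        }
    return deltas


reports = {
    "panelist-a": {"correctness": 4, "error_handling": 2},
    "panelist-b": {"correctness": 4, "error_handling": 4},
}
deltas = cross_panelist_deltas(reports)
```

In the usage above, `error_handling` shows a spread of 2 with `panelist-a` at the low end, which is the kind of disagreement the debrief would start from.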

Real costs

Per take-home submission, on Claude Sonnet 4.6:

LLM tokens: typically 15,000-30,000 input tokens (rubric + submission code/text + skill instructions) and 3,000-5,000 output tokens (per-dimension scored report). About $0.15-0.25 per submission in single-panelist mode. Per-panelist mode (3-4 panelists) multiplies linearly.
CI / sandbox cost: running the candidate's test suite costs what your CI normally costs; generally negligible. Sandboxed execution (recommended: never run candidate code on a panelist's laptop) costs what your sandbox-runner provider charges.
Panelist time: this is the payoff. A panelist's first-pass review of a take-home takes 60-90 minutes when done well, less when done badly. Reviewing the skill's report and noting agreement/disagreement per dimension takes 15-25 minutes. Total panel time saved per take-home: 2-3 panelist-hours.
Setup time: 40 minutes once for the rubric and the AI-use policy mapping per take-home format. Reuse across roles in the same family is high.
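The token arithmetic above can be sketched as a small estimator. The function name `estimate_llm_cost` is hypothetical, and the per-million-token prices are deliberately left as inputs: take them from your provider's current price list rather than from this sketch.

```python
def estimate_llm_cost(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok,
                      panelists=1):
    """Estimate model cost in dollars for one submission.

    Prices are dollars per million tokens and are assumptions supplied by
    the caller. Per-panelist mode is modeled as one full report per
    panelist, so the cost scales linearly with `panelists`.
    """
    per_report = (
        input_tokens / 1_000_000 * input_price_per_mtok
        + output_tokens / 1_000_000 * output_price_per_mtok
    )
    return per_report * panelists
```

Calling it with the same token counts and `panelists=4` returns four times the single-panelist figure, which is the linear multiplication described above.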

Success metric

Track three things per take-home cycle:

Cross-panelist score variance: the variance between panelists' per-dimension scores. The skill should compress variance (panelists anchored on the same rubric and the same citations) without forcing artificial agreement. Variance below ~0.5 (on a 5-point scale) suggests panelists are rubber-stamping the skill's report; above ~1.5 suggests the rubric anchors are too vague for the take-home to discriminate.
Correlation between the hire/no-hire decision and the skill's aggregate: over a quarter, does the panel's hire decision correlate with the skill's aggregate? It should be positive but NOT 1.0; if it is 1.0, the panel is deferring automatically (the failure mode the skill is designed against), and if it is 0, the rubric or the skill is misaligned with what the panel actually values.
Take-home debrief duration: wall-clock time between "all panelists have submitted their reviews" and "decision recorded". It should drop from 1-2 days to under 4 hours, because the report is a shared anchor.
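The first two metrics can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: it uses population variance (the text does not say which variance is meant), encodes hire decisions as 1/0, and the function names and flag wording are illustrative only.

```python
from statistics import mean, pvariance


def variance_flag(scores, low=0.5, high=1.5):
    """Classify one dimension's panelist scores against the ~0.5 / ~1.5 bounds."""
    v = pvariance(scores)  # assumption: population variance
    if v < low:
        return v, "possible rubber-stamping"
    if v > high:
        return v, "anchors may be too vague"
    return v, "ok"


def pearson(xs, ys):
    """Pearson correlation; with 1/0 hire decisions this is point-biserial."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / (sx * sy)


# Hypothetical quarter: skill aggregates and the panel's decisions (1 = hire).
aggregates = [17, 21, 12, 19, 15]
decisions = [1, 1, 0, 0, 1]
r = pearson(aggregates, decisions)
```

A value of `r` between 0 and 1 is the healthy range described above; the two extremes are the ones worth investigating.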

Comparison with the alternatives

Versus CodeSignal code reports / HackerRank auto-grading. These products run the candidate's code against the platform's test cases and emit a score. Pick them if your take-home is structured as well-defined input to well-defined output (LeetCode style). Pick the skill if the take-home is a build (writing a small system, designing an API, drafting a PRD), where the rubric is the score and the score is the rubric. The two are complementary; CodeSignal can be the input to the skill's test-execution step.
Versus hand-graded take-homes. Hand grading is right for the highest-stakes hires (founding engineer, principal IC) where the panel's narrative judgment is the deliverable. The skill pays back its setup cost on the 80% of take-homes where consistent application of the rubric is what is missing.
Versus ChatGPT-style "review this code". Generic chat returns generic feedback. The skill is structurally different: it requires verbatim citations, runs the deterministic checks first, and refuses to produce a hire/no-hire recommendation.
Versus no take-home (live-only loop). A reasonable choice for senior roles where references and live rounds carry the weight. The skill is irrelevant if the loop has no take-home.

Watch-outs

Drift toward auto-pass / auto-fail. Guard: the skill's output ends with the per-dimension scores and the aggregate. There is no "pass" or "fail" string. The schema explicitly omits a recommendation field.
Generic-feedback hallucination. Guard: every dimension score requires a verbatim citation (file path + line range + content). Scores without citations default to 1.
Rubric bias inheritance. Guard: the rubric is upstream of this skill. Run the rubric through the diversity slate auditor's framing: does the rubric score on dimensions with known disparate impact (e.g. "uses obscure idioms", which often correlates with bootcamp versus CS-program background)?
AI-use detection false positive. Guard: AI-use signals are surfaced as notes, not as violations. The panel reviews them against the disclosed policy. Automatically flagging them as violations would be the wrong reading; legitimate use of AI tools (within the policy) is increasingly the norm.
Sandboxing failure on candidate code. Guard: the skill explicitly recommends sandboxed execution and warns if the calling environment runs the test suite directly on the panelist's machine. Never run unreviewed candidate code on a machine with access to the firm's secrets.
Submission size blow-up. Guard: if the submission exceeds ~50,000 LOC, the skill warns that scoring will be partial and asks the panelist to identify the parts to focus on. Take-homes that produce 50,000 LOC are themselves a sign that the brief was wrong.

Stack

The skill bundle lives in apps/web/public/artifacts/take-home-evaluator-claude-skill/ and contains:

SKILL.md: the skill definition
references/1-take-home-rubric-template.md: fillable rubric template
references/2-ai-use-policy-mapping.md: how the disclosed policy maps to the skill's pattern checks

Tools assumed: Claude (the model). Optional: CodeSignal or HackerRank for the deterministic-check leg; Ashby for the candidate record. Sandboxed execution is the recruiter's / hiring manager's choice (Docker containers, GitHub Actions, etc.).

Related concepts: structured interviews, behavioral interviews, candidate experience, quality of hire.

Files in this artifact

---
name: take-home-evaluator
description: Score a take-home submission against a rubric, with verbatim citations from the candidate's code or text, plus deterministic checks (compile, run tests). Output is a structured report with per-dimension scores and an aggregate — never a hire/no-hire recommendation. Detects AI-use signals against the disclosed policy as notes for the panel debrief.
---

# Take-home assessment evaluator

## When to invoke

Use this skill when a panelist has a candidate's take-home submission and wants a rubric-anchored evaluation report to bring to the panel debrief. Take the submission directory plus the role rubric as input and return a structured Markdown report.

Do NOT invoke this skill for:

- **Auto-pass / auto-fail in the loop.** This skill produces a scored report. The hire decision happens in the panel debrief, not in the report.
- **Live coding interviews.** Different workflow.
- **Submissions where the candidate did not sign the AI-use disclosure.** The skill calibrates against a specific use-of-AI policy; without the disclosure, there is nothing to calibrate against.
- **Plagiarism forensics.** The skill flags suspicious patterns but is not a defensible plagiarism tool.

## Inputs

- Required: `submission_dir` — path to the candidate's submission directory.
- Required: `rubric` — path to the take-home rubric file. See `references/1-take-home-rubric-template.md` for the shape.
- Required: `ai_use_policy` — string identifying the disclosed policy. One of `none-allowed`, `syntax-help-only`, `ai-tools-allowed`. The skill's pattern checks are calibrated against the policy, not against an absolute "AI-generated" detector.
- Optional: `panelist_id` — if running per-panelist mode, identify the panelist for cross-panelist aggregation.
- Optional: `sandboxed` — boolean. If `false`, the skill warns about running candidate code on the local machine and asks for confirmation before proceeding to step 2.
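
As a rough illustration of the input contract above, the sketch below checks the parameters before any step runs. `EvaluatorInputs` and its `validate` method are hypothetical names, not part of the skill bundle; only the field names and the three policy values come from the list above. Defaulting `sandboxed` to `False` is an assumption chosen so that an unspecified environment gets the warning.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

ALLOWED_POLICIES = {"none-allowed", "syntax-help-only", "ai-tools-allowed"}


@dataclass
class EvaluatorInputs:
    submission_dir: str
    rubric: str
    ai_use_policy: str
    panelist_id: Optional[str] = None  # set only in per-panelist mode
    sandboxed: bool = False            # assumption: unknown means not sandboxed

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means usable."""
        problems = []
        if not Path(self.submission_dir).is_dir():
            problems.append(f"submission_dir is not a directory: {self.submission_dir}")
        if not Path(self.rubric).is_file():
            problems.append(f"rubric file not found: {self.rubric}")
        if self.ai_use_policy not in ALLOWED_POLICIES:
            problems.append(
                f"ai_use_policy must be one of {sorted(ALLOWED_POLICIES)}, "
                f"got {self.ai_use_policy!r}"
            )
        return problems
```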

## Reference files

- `references/1-take-home-rubric-template.md` — the rubric shape the skill expects.
- `references/2-ai-use-policy-mapping.md` — how each `ai_use_policy` value maps to the pattern checks in step 4.

## Method

Six steps.

### 1. Validate the submission shape

Walk `submission_dir`. Compare against the deliverables named in the take-home brief (the rubric file's `expected_deliverables` field). For each missing deliverable, add to a `missing_deliverables` array on the report. Do NOT score dimensions that depend on missing files; their score is `not assessed` rather than `1`.
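
A minimal sketch of this walk, assuming the `expected_deliverables` entries are glob patterns relative to `submission_dir` (as in the rubric template). The function name `find_missing_deliverables` is illustrative, not something the skill defines.

```python
from pathlib import Path


def find_missing_deliverables(submission_dir, expected_deliverables):
    """Return the expected glob patterns that match no file in the submission."""
    root = Path(submission_dir)
    missing = []
    for pattern in expected_deliverables:
        # A pattern counts as delivered if at least one regular file matches it.
        if not any(path.is_file() for path in root.glob(pattern)):
            missing.append(pattern)
    return missing
```

The returned list can populate `missing_deliverables` directly; dimensions that depend on those files are then marked `not assessed`.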

### 2. Run deterministic checks

If the submission has a build/test command in `package.json`, `Makefile`, `pyproject.toml`, or `Cargo.toml`:

- Compile / install dependencies in a sandboxed environment. If the calling environment is not sandboxed (per `sandboxed: false`), warn and stop until the panelist confirms.
- Run the candidate's test suite. Capture pass/fail counts.
- Run linters / formatters in check mode (do NOT modify the candidate's code). Capture findings.

Record the deterministic results in a `deterministic_checks` block on the report. These are auditable — the LLM does not re-litigate them in step 3.
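
One way to sketch a single deterministic check, including the stop-until-confirmed behavior for unsandboxed environments. The function name `run_check`, the `confirmed` flag, the status strings other than passed/failed, and the 2,000-character output tail are all assumptions for illustration. The sketch does not create a sandbox itself; it only refuses to run when told the environment is not one.

```python
import shlex
import subprocess


def run_check(name, command, cwd, sandboxed, confirmed=False, timeout_s=600):
    """Run one build/test/lint command and return an auditable record."""
    record = {"name": name, "command": command}
    if not sandboxed and not confirmed:
        # Mirror the skill's rule: warn and stop until the panelist confirms.
        record.update(status="blocked",
                      detail="environment is not sandboxed; confirmation required")
        return record
    try:
        proc = subprocess.run(shlex.split(command), cwd=cwd,
                              capture_output=True, text=True, timeout=timeout_s)
    except FileNotFoundError:
        record.update(status="error", detail="command or working directory not found")
        return record
    except subprocess.TimeoutExpired:
        record.update(status="failed", detail=f"timed out after {timeout_s}s")
        return record
    record.update(
        status="passed" if proc.returncode == 0 else "failed",
        exit_code=proc.returncode,
        output_tail=(proc.stdout + proc.stderr)[-2000:],
    )
    return record
```

A list of such records, one per entry in the rubric's `build_commands`, is one plausible shape for the `deterministic_checks` block.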

### 3. Score per rubric dimension

For each dimension in the rubric:

- Read the rubric anchors (1-5).
- Find evidence in the submission. Evidence is a verbatim string from the code or text, with file path + line range.
- If you cannot find verbatim evidence for a score above 1, the score is 1 — no inference, no generic feedback.
- Tag each cited line with which dimension it supports.
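
The citation rule can also be enforced outside the model. The sketch below is one possible post-check, assuming each citation is a dict with `path`, `start`, `end` (1-indexed, inclusive) and `text`; that shape and both function names are hypothetical.

```python
from pathlib import Path


def citation_is_verbatim(submission_dir, citation):
    """True if the quoted text really appears in the cited line range."""
    path = Path(submission_dir) / citation["path"]
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    start, end = citation["start"], citation["end"]
    if start < 1 or end < start or end > len(lines):
        return False
    cited_span = "\n".join(lines[start - 1:end])
    quoted = citation["text"].strip()
    return quoted != "" and quoted in cited_span


def effective_score(proposed_score, citations, submission_dir):
    """Apply the rule: no verified verbatim evidence means the score is 1."""
    verified = [c for c in citations if citation_is_verbatim(submission_dir, c)]
    if proposed_score > 1 and not verified:
        return 1
    return proposed_score
```

Checking the quote against the file guards against a citation that looks specific but does not match what the candidate actually wrote.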

### 4. Detect AI-use signal against the policy

Run pattern checks per `references/2-ai-use-policy-mapping.md`:

- **Verbatim public matches** — does any chunk of the submission match a public AI-generated boilerplate exactly? Surface as `signal: verbatim_public_match`.
- **Style consistency vs. complexity** — is the style suspiciously consistent across files of varying complexity? Real candidates' style varies with the difficulty of the section. Surface as `signal: uniform_style`.
- **Generic comments without engagement** — comments that explain what the code does without engaging with the problem-specific decisions are a common AI-generated tell. Surface as `signal: generic_comments`.

Surface signals as NOTES, never as VIOLATIONS. The panel reviews against the disclosed policy. If `ai_use_policy: ai-tools-allowed`, the signals are informational only. If `ai_use_policy: none-allowed`, the panel discusses whether the signals warrant follow-up — the skill does not decide.
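
A small sketch of the "notes, never violations" rule. The signal names and policy values come from this file; the function name `build_signal_notes`, the `framing` labels, and the note shape are assumptions for illustration.

```python
KNOWN_SIGNALS = {"verbatim_public_match", "uniform_style", "generic_comments"}

# How each disclosed policy frames a signal for the panel (labels are illustrative).
POLICY_FRAMING = {
    "none-allowed": "discuss",
    "syntax-help-only": "discuss",
    "ai-tools-allowed": "informational",
}


def build_signal_notes(ai_use_policy, signals):
    """Turn raw pattern-check hits into report notes. There is no violation kind."""
    if ai_use_policy not in POLICY_FRAMING:
        raise ValueError(f"unknown ai_use_policy: {ai_use_policy!r}")
    notes = []
    for hit in signals:
        if hit["signal"] not in KNOWN_SIGNALS:
            raise ValueError(f"unknown signal: {hit['signal']!r}")
        notes.append({
            "kind": "note",  # by construction, never "violation"
            "signal": hit["signal"],
            "detail": hit.get("detail", ""),
            "framing": POLICY_FRAMING[ai_use_policy],
        })
    return notes
```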

### 5. Compute aggregate

Sum the per-dimension scores. Surface the aggregate as a number AND the per-dimension breakdown.

Do NOT translate the aggregate into a recommendation. The skill's schema explicitly omits a `recommendation` field. The report ends after the aggregate and the AI-use signals section.
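
The aggregate step is plain arithmetic, sketched below. The function name `compute_aggregate` is hypothetical, and excluding `not assessed` dimensions from both the sum and the maximum is an assumption; the skill text only says such dimensions are not scored as `1`.

```python
NOT_ASSESSED = "not assessed"


def compute_aggregate(dimension_scores):
    """Sum the 1-5 scores and keep the breakdown. No recommendation field exists."""
    assessed = {d: s for d, s in dimension_scores.items() if s != NOT_ASSESSED}
    for dim, score in assessed.items():
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score for {dim!r} must be an integer from 1 to 5")
    return {
        "aggregate": sum(assessed.values()),
        "max_possible": 5 * len(assessed),
        "not_assessed": sorted(d for d, s in dimension_scores.items() if s == NOT_ASSESSED),
        "per_dimension": dict(dimension_scores),
    }


result = compute_aggregate({
    "correctness": 4,
    "code_quality": 3,
    "decision_documentation": 4,
    "error_handling": 2,
    "test_coverage": 4,
})
# result["aggregate"] is 17 and result["max_possible"] is 25
```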

### 6. Emit report

Write the report to `report.md` in the submission directory or to stdout (depending on the calling environment). In per-panelist mode, write to `report-<panelist_id>.md`.

## Output format

```markdown
# Take-home evaluation — {Candidate name} — {Role}

Submission: `{submission_dir}` · Rubric: `{rubric_path}` (SHA `{short}`)
Generated: {ISO timestamp} · Skill v1.0 · Model: claude-sonnet-4-6
Panelist: {panelist_id or "single-panelist mode"}
AI-use policy: {ai_use_policy}

## Deterministic checks

- **Build:** {passed | failed | not applicable}
- **Tests:** {N/M passed} ({test command})
- **Linter:** {N findings}
- **Missing deliverables:** {list or "none"}

## Per-dimension scores

### Correctness — 4/5

> Evidence: `src/router.rs:42-57` — handles the request-routing edge case for retries with exponential backoff and jitter, including the cap at max 60s. Anchor 4 ("handles the named edge cases with explicit code paths") matches.

Counter-evidence (would have been 5): retry budget is hardcoded at 5 attempts; the rubric's anchor 5 names "configurable retry budget."

### Code quality — 3/5

> Evidence: `src/router.rs` — file is 800 lines with no module split. Anchor 3 ("readable but lacks structural decomposition") matches.

### Decision-making documented — 4/5

> Evidence: `README.md:25-40` — explains the choice of exponential-vs-fixed backoff with a reference to the failure mode it mitigates. Anchor 4 matches.

### Error handling — 2/5

> Evidence: `src/router.rs:120` — catches and re-raises the network error without distinguishing between transient and permanent failures. Anchor 2 ("error paths exist but do not differentiate") matches.

### Test coverage — 4/5

> Evidence: `tests/router_test.rs` — covers the happy path, three retry scenarios, and the timeout. Missing: the network-partition test the rubric anchor 5 names.

## Aggregate

17/25.

This is the per-dimension sum. The skill does NOT translate this into a hire/no-hire recommendation. The panel debrief is where the decision happens.

## AI-use signal notes

Disclosed policy: **syntax-help-only**.

- ⚠️ `signal: uniform_style` — `src/cache.rs` and `src/router.rs` use the same comment style and naming idioms despite the different complexity. May warrant a follow-up question in the panel debrief.
- ✓ No `verbatim_public_match` signals.
- ✓ No `generic_comments` signals beyond the documented threshold.

The panel discusses these against the disclosed policy. The skill does not decide.
```

## Watch-outs

- **Auto-pass / auto-fail drift.** *Guard:* the report ends after the aggregate. No recommendation field. The aggregate is a sum, not a verdict.
- **Generic-feedback hallucination.** *Guard:* every dimension score requires verbatim citation (file path + line range + content).
- **AI-use false positive.** *Guard:* signals are notes, not violations. The panel decides against the disclosed policy.
- **Unsandboxed candidate code.** *Guard:* skill warns before running in non-sandboxed environments.
- **Bias inheritance.** *Guard:* the rubric is upstream of the skill. Audit the rubric separately if the dimensions correlate with disparate impact.

# Take-home rubric template

The take-home evaluator scores a submission against this rubric shape. Copy the JSON below to your role's rubric file (one per take-home format) and fill in every field. The skill reads the rubric; without it, scoring has nothing to anchor against.

A complete rubric takes 30-90 minutes to author per take-home format. Reuse across roles in the same family is high — a senior-backend take-home rubric is largely the same across companies once you've written it once.

## JSON shape

```json
{
  "take_home_id": "senior-backend-router-rewrite-v3",
  "version": "2026-04-15",
  "expected_deliverables": [
    "README.md",
    "src/**/*.rs",
    "tests/**/*.rs",
    "Cargo.toml"
  ],
  "build_commands": {
    "build": "cargo build --release",
    "test": "cargo test --all",
    "lint": "cargo clippy -- -D warnings"
  },
  "ai_use_policy_match": "syntax-help-only",
  "dimensions": [
    {
      "id": "correctness",
      "label": "Correctness",
      "anchors": {
        "1": "Compiles but does not pass the candidate's own tests, or does not handle the named happy path.",
        "2": "Handles the happy path; ignores the named edge cases (retries, partial failure).",
        "3": "Handles the happy path and the obvious edge cases; misses the subtle ones (clock skew, partition recovery).",
        "4": "Handles the named edge cases with explicit code paths; minor gaps acceptable.",
        "5": "Handles all named edge cases AND demonstrates a configurable retry budget / timeout structure that the rubric explicitly calls for."
      }
    },
    {
      "id": "code_quality",
      "label": "Code quality and structural decomposition",
      "anchors": {
        "1": "Single file, no decomposition; difficult to read.",
        "2": "Decomposed but the decomposition does not follow domain boundaries.",
        "3": "Readable but lacks structural decomposition that would scale past prototype.",
        "4": "Clear module boundaries that follow the domain; idiomatic for the language.",
        "5": "All of 4, plus the structural choices are documented in the README with the alternatives considered."
      }
    },
    {
      "id": "decision_documentation",
      "label": "Decision-making documented",
      "anchors": {
        "1": "No README, or the README only repeats the take-home brief.",
        "2": "README describes what was built without naming the engineering choices.",
        "3": "README names some choices without naming the alternatives.",
        "4": "README names the choices AND explains why each was picked over the named alternatives.",
        "5": "All of 4, plus the README cites the failure modes each choice mitigates."
      }
    },
    {
      "id": "error_handling",
      "label": "Error handling",
      "anchors": {
        "1": "Errors are caught and silently swallowed.",
        "2": "Error paths exist but do not differentiate between transient and permanent failures.",
        "3": "Differentiates transient vs. permanent; lacks structured error types.",
        "4": "Structured error types; retry policy is explicit per error class.",
        "5": "All of 4, plus error paths have explicit observability (logging / metrics / traces) named in the code."
      }
    },
    {
      "id": "test_coverage",
      "label": "Test coverage",
      "anchors": {
        "1": "No tests, or tests do not run.",
        "2": "Tests cover the happy path only.",
        "3": "Tests cover the happy path and one or two edge cases.",
        "4": "Tests cover the happy path and multiple edge cases (timeout, retry, partial failure).",
        "5": "All of 4, plus the network-partition test the rubric explicitly calls for."
      }
    }
  ],
  "rubric_fairness_check": {
    "no_bootcamp_vs_cs_proxies": "Anchors must score on observable behavior in the submission, not on idioms that proxy for educational background. 'Uses obscure language idioms' is forbidden as a positive signal.",
    "no_native-english-only_proxies": "Anchors must NOT score on README writing fluency beyond the level required to communicate the engineering decisions.",
    "documented_in_brief": "The take-home brief shared with the candidate must describe the rubric dimensions and approximate weighting. Surprise dimensions are unfair."
  }
}
```

## Per-field notes

- `take_home_id` — stable identifier for the take-home format. Reused across candidates for the same role family.
- `version` — semver or date. Bumped when the rubric is edited; the skill captures the version in the report so re-scoring against an edited rubric is visible.
- `expected_deliverables` — globs the skill walks against the submission. Missing deliverables surface in the report.
- `build_commands` — the skill runs these in step 2 (deterministic checks). Sandboxed execution required.
- `ai_use_policy_match` — should match the disclosure language in the take-home brief. Mismatch means the candidate's policy understanding doesn't match what the skill calibrates against.
- `dimensions` — array. Each dimension has an `id`, a `label`, and 5 anchor strings. Anchors should be observable behavior, not adjectives.
- `rubric_fairness_check` — three named fairness checks the skill confirms before scoring. If the rubric anchors violate any of these, the skill emits a warning and asks the rubric author to revise. (The skill does not refuse to score on a fairness-check violation, because the rubric is upstream and revising it is the right intervention. But it surfaces the issue.)
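
Before handing a rubric to the skill, a structural check along these lines can catch missing fields early. This is a sketch, not the skill's own validation: `validate_rubric` is a hypothetical helper, and treating every top-level field in the JSON shape above as required is an assumption based on the instruction to fill in every field.

```python
REQUIRED_TOP_LEVEL = (
    "take_home_id", "version", "expected_deliverables", "build_commands",
    "ai_use_policy_match", "dimensions", "rubric_fairness_check",
)
ANCHOR_KEYS = {"1", "2", "3", "4", "5"}


def validate_rubric(rubric):
    """Return a list of structural problems found in a parsed rubric dict."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in rubric]
    seen_ids = set()
    for index, dim in enumerate(rubric.get("dimensions", [])):
        dim_id = dim.get("id") or f"#{index}"
        if "id" not in dim or "label" not in dim:
            problems.append(f"dimension {dim_id}: needs both id and label")
        if dim_id in seen_ids:
            problems.append(f"dimension {dim_id}: duplicate id")
        seen_ids.add(dim_id)
        anchors = dim.get("anchors", {})
        if set(anchors) != ANCHOR_KEYS:
            problems.append(f"dimension {dim_id}: anchors must be keyed exactly 1-5")
        elif any(not str(text).strip() for text in anchors.values()):
            problems.append(f"dimension {dim_id}: empty anchor text")
    return problems
```

Load the rubric file with `json.load` and pass the resulting dict in; an empty list means the shape matches the template, not that the anchors are well written.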

## Authoring a new dimension

To add a dimension to an existing rubric:

1. Pick observable behavior, not adjectives. "Has good error handling" is not a dimension; "error paths differentiate transient vs. permanent failure" is.
2. Write the 5 anchors as five distinct observable behaviors, each strictly more demanding than the last.
3. Test the dimension on a known submission. Can you score it from the anchors alone, without the original code in your head? If not, the anchors are too vague.
4. Bump the rubric version.

## Authoring a new rubric (for a net-new take-home)

1. Start from the take-home brief. What does the brief tell the candidate to deliver? Those are the `expected_deliverables`.
2. What is the brief asking the candidate to demonstrate? Those become the `dimensions`. Aim for 4-6 dimensions; more than 6 and the panelist can't hold them.
3. Write the 1-anchor first (the floor: what does an unsubmitted-effort look like?), then the 5-anchor (the ceiling: what does the strongest submission look like?), then fill 2-4 between.
4. Write the brief and the rubric in parallel. Anchors that don't show up in the brief are surprise dimensions; anchors in the brief that don't show up in the rubric are unscoreable promises.
5. Run the rubric on a known submission (a prior hire's submission, anonymized). Does it score them where you'd expect?

# AI-use policy mapping

The take-home evaluator runs pattern checks calibrated against the disclosed AI-use policy. The same submission produces different signal interpretation under different policies; this file documents the mapping.

The intent: surface signals to the panel debrief, not to make a determination. AI-use detection is well-known to be unreliable as a forensic tool; the right framing is "discuss with the candidate against the disclosed policy."

## Policy values

### `none-allowed`

The take-home brief told the candidate: "Do not use AI tools (Claude, ChatGPT, Copilot, Cursor, etc.) at any point during this assessment."

Pattern checks:

- **Verbatim public matches** — surface as `signal: verbatim_public_match` if any chunk ≥3 lines matches a known public AI-generated boilerplate exactly.
- **Style consistency vs. complexity** — surface as `signal: uniform_style` if comment style and naming idioms are suspiciously consistent across files of varying complexity.
- **Generic comments without engagement** — surface as `signal: generic_comments` if comments explain what the code does without engaging with the problem-specific decisions, beyond a per-file threshold (default: >40% of comments are generic).

Panel debrief framing: "Signals suggest possible AI use. Discuss with the candidate. The skill does not determine; the panel does, against the disclosed policy."
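
The "chunk ≥3 lines" check can be sketched as a sliding window over normalized lines. Several things here are assumptions: where the known boilerplate text comes from is not specified in this file, whitespace is collapsed before comparing, and windows containing blank lines are ignored. The function names are illustrative.

```python
def _normalize(line):
    """Collapse runs of whitespace so indentation differences do not hide a match."""
    return " ".join(line.split())


def find_verbatim_matches(submission_text, known_boilerplate, min_lines=3):
    """Return (start, end) 1-indexed line ranges in the submission whose
    `min_lines` consecutive lines also appear consecutively in the boilerplate."""
    sub = [_normalize(line) for line in submission_text.splitlines()]
    ref = [_normalize(line) for line in known_boilerplate.splitlines()]
    ref_windows = {
        tuple(ref[i:i + min_lines])
        for i in range(len(ref) - min_lines + 1)
        if all(ref[i:i + min_lines])  # skip windows that contain blank lines
    }
    hits = []
    for i in range(len(sub) - min_lines + 1):
        window = tuple(sub[i:i + min_lines])
        if all(window) and window in ref_windows:
            hits.append((i + 1, i + min_lines))
    return hits
```

Each hit would surface as a `signal: verbatim_public_match` note with the line range attached, for the panel to discuss rather than as a finding.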

### `syntax-help-only`

The take-home brief told the candidate: "AI tools are allowed for syntax help (looking up the right method name, checking a regex, formatting). They are NOT allowed for solution generation (asking 'how would I implement this?', 'write me the function for X')."

Pattern checks: same as `none-allowed`, but with different framing. AI-generated boilerplate at the function level is a signal worth discussing; AI-completed identifier names are not.

Panel debrief framing: "AI use was permitted within bounds. Signals indicate where the candidate may have crossed the bounds. Discuss with the candidate."

### `ai-tools-allowed`

The take-home brief told the candidate: "AI tools are allowed throughout. Tell us what tools you used and how, in the README."

Pattern checks: still run, but signals are surfaced as informational only.

Panel debrief framing: "AI use was permitted. The signals are informational. The panel evaluates the SUBMISSION against the rubric; the question is what the candidate built and why, not which tools they used."

In the `ai-tools-allowed` policy, the panel should be ready to evaluate "the candidate's prompting and tool-use judgment" as a positive dimension if the rubric calls it out — many roles in 2026 explicitly want to see how candidates work with AI tools.

## What the skill does NOT do

- **Run a third-party AI-detection model** (GPTZero, Originality, etc.). Those tools have well-documented false-positive rates that climb to 30-50% on technical writing. Their findings do not survive a panel debrief.
- **Output a confidence score for "this was AI-generated."** No such confidence can be honestly assigned; the patterns are signals, not proof.
- **Block the report on signals.** Signals appear in the report. The panel decides. If `none-allowed` and the signals are strong, the panel typically schedules a follow-up conversation rather than auto-rejecting.

## Calibration

The skill's pattern thresholds (e.g. "40% generic comments") are tunable per take-home format. If your team's take-home format produces a lot of boilerplate naturally (e.g. it asks for a CRUD API), the threshold should be raised; if your format produces little boilerplate (e.g. it asks for a custom algorithm), the threshold should be lowered.

The defaults are calibrated against general-engineering take-homes. Tune in `config.json` per take-home format. Document the tuning in the rubric file's notes section.
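
The tunable threshold amounts to a ratio test, sketched below. How a comment is judged generic is left to the skill; this sketch only takes the two counts as inputs. The function name `generic_comments_signal` and the returned dict shape are hypothetical; the 40% default comes from the pattern checks above.

```python
def generic_comments_signal(total_comments, generic_comments, threshold=0.40):
    """Return a signal note when the generic share exceeds the threshold, else None."""
    if total_comments < 0 or not 0 <= generic_comments <= total_comments:
        raise ValueError("counts must satisfy 0 <= generic_comments <= total_comments")
    if total_comments == 0:
        return None  # no comments in the file, nothing to flag
    ratio = generic_comments / total_comments
    if ratio > threshold:
        return {"signal": "generic_comments", "ratio": round(ratio, 2), "threshold": threshold}
    return None
```

For a boilerplate-heavy format such as a CRUD API, passing a higher `threshold` (for example `0.6`) implements the raise described above.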

## Why surfaced as notes, not as a verdict

1. **False-positive cost is asymmetric.** Auto-rejecting a candidate based on an AI-use signal that turns out to be wrong damages the firm's brand and risks a discrimination claim if the signal correlates with disparate impact. Surfacing for discussion costs nothing.
2. **The disclosed policy is the contract.** What matters is whether the candidate followed the policy they were told about. The signal helps the panel ask; it does not answer.
3. **AI-use detection is unreliable.** Even the best-known detectors have unacceptable error rates. The skill does not pretend otherwise.
4. **Hire decisions involve more than this one signal.** A candidate with a strong submission and a possible AI-use signal under `syntax-help-only` is a candidate to talk to, not a candidate to drop.