claude-skill

Take-home evaluator with Claude

Difficulty

intermediate

Setup time

40 min

For

recruiter · hiring-manager · technical-screener

Recruiting and TA

Stack

A Claude Skill that scores a candidate's take-home submission against a rubric written by the hiring team, with line-by-line citations from the submitted code or documents, and produces a structured evaluation report. It never passes or fails a candidate automatically. The hiring panel uses the report to anchor the live debrief; the actual hire/no-hire decision happens in the panel discussion, not in the report. It replaces the disorganized 60-90 minutes per panelist of “I read this on Saturday morning and I think it was fine?” with a structured 15-minute review per panelist plus a calibrated 30-minute debrief.

When to use it

The role uses a take-home as part of the loop (prerequisite: structured interviewing; without a written rubric the skill has nothing to score against).
You want consistent scoring across panelists. Take-home reviews are notoriously inconsistent because each panelist reads at a different time with a different level of attention; the rubric-anchored report is the leveling artifact.
The take-home is a coding exercise, a system design write-up, a written exercise (a draft PRD, a mock write-up of a sales call), or an integration build that produces inspectable artifacts.

When NOT to use it

Auto-pass / auto-fail in the loop. The skill produces a scored report. The hire decision happens in the panel debrief. Wiring the report's aggregate score to a stage transition triggers the same NYC LL 144 / EU AI Act exposure as auto-rejection at screening.
Live coding interviews. Different workflow (live observation of the process, not evaluation of the artifact). The interview-debrief workflow covers that case.
Take-homes longer than 4 hours of candidate work. Long take-homes are themselves a candidate-experience anti-pattern; the skill does not fix that.
Submissions where the candidate did not sign the AI-use disclosure. The rubric scoring is calibrated against a specific AI-use policy (e.g. “AI tools allowed for syntax help, not for solution generation”); without the disclosure, the skill cannot calibrate its detection of “AI-only signal”.
Plagiarism detection as the primary use. The skill flags suspicious patterns (verbatim matches against public repos, generic AI-generated boilerplate) but it is not a forensic plagiarism tool. Use a dedicated tool if you need defensible plagiarism findings.

Setup

Install the bundle. Place apps/web/public/artifacts/take-home-evaluator-claude-skill/SKILL.md in your Claude Code skills directory.
Draft the rubric. For each take-home, write a JSON rubric with the dimensions you actually score on (correctness, code quality, decision-making documented in comments / README, error handling, test coverage). Each dimension gets anchors from 1 to 5. The template lives in references/1-take-home-rubric-template.md.
Configure the AI-use policy. The skill's prompt tells Claude explicitly which AI use was permitted (“syntax help only”, “AI tools allowed throughout the exercise”, “no AI tools”, etc.). The setting maps to the disclosure language in the take-home brief; the two must match.
Define the per-panelist distribution mode. Either single-panelist mode (one report per submission) or per-panelist mode (each panelist receives the same submission, generates their own evaluation, and the skill aggregates the cross-panelist deltas). Per-panelist mode catches scoring drift but multiplies model cost by the number of panelists.
Dry-run on a closed take-home. Score a take-home from a candidate who was hired (or not) last quarter. Compare the skill's per-dimension scores against the panel's actual scores. Adjust the rubric anchors if the skill weighs the dimensions differently.

What the skill actually does

Six steps. The order matters: the deterministic checks (compile, run, file structure) happen before the LLM scores anything, because letting the model score a submission that does not run produces a confident report about a broken artifact.

Validate the submission shape. Check that every deliverable named in the take-home brief exists (e.g. README.md, source files, test files). Missing deliverables are flagged in the report; do NOT score the dimensions that depend on them.
Run deterministic checks. Compile the code. Run the test suite the candidate wrote. Capture the output. These are the auditable, reproducible results; the LLM does not re-litigate them.
Score per rubric dimension. For each dimension in the rubric, score from 1 to 5 with verbatim citations from the candidate's submission (file path + line range + the code or text). Citations are required; without a citation, the score falls to the rubric's anchor 1. The citation requirement keeps the model grounded in the actual submission rather than in generic feedback.
Detect AI-use signal against the policy. Run pattern checks against the declared AI-use policy. Verbatim matches with public AI-generated boilerplate, suspiciously consistent style across files of varying complexity, or generic comments that do not engage with the problem-specific decisions all appear as ai-use-signal notes: not as a violation, only as a signal for the panel to discuss against the declared policy.
Compute the aggregate WITHOUT a hire/no-hire recommendation. Sum the per-dimension scores. Surface the aggregate as a number. Do NOT translate the aggregate into a recommendation. The skill explicitly returns “report; not a decision” instead of “pass / fail”.
Emit a per-panelist or aggregated report. In single-panelist mode, the report goes to the panelist who invoked the skill. In per-panelist mode, the skill aggregates across panelists, surfaces cross-panelist deltas per dimension (and which panelist saw what differently), and emits a debrief-ready report.
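
The cross-panelist deltas in the last step can be pictured with a minimal Python sketch. The input shape (panelist id mapped to per-dimension scores) and the function name `cross_panelist_deltas` are assumptions made for illustration; the bundle does not specify how the aggregation is implemented.

```python
from typing import Dict

# Hypothetical input shape: panelist id -> {dimension id -> score (1-5)}.
PanelScores = Dict[str, Dict[str, int]]


def cross_panelist_deltas(scores: PanelScores) -> Dict[str, dict]:
    """For each dimension, report the score spread and who gave the lowest and highest scores."""
    dimensions = sorted({dim for per_dim in scores.values() for dim in per_dim})
    deltas = {}
    for dim in dimensions:
        # Only panelists who scored this dimension take part in its delta.
        given = {p: s[dim] for p, s in scores.items() if dim in s}
        low, high = min(given.values()), max(given.values())
        deltas[dim] = {
            "spread": high - low,
            "lowest": sorted(p for p, v in given.items() if v == low),
            "highest": sorted(p for p, v in given.items() if v == high),
        }
    return deltas
```

A spread of 0 means the panel agreed on that dimension; larger spreads are the ones worth opening the debrief with.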

Cost reality

Per take-home submission, on Claude Sonnet 4.6:

LLM tokens: typically 15-30k input (rubric + submission code/text + skill instructions) and 3-5k output (the scored per-dimension report). Roughly $0.15-0.25 per submission in single-panelist mode. Per-panelist mode (3-4 panelists) multiplies that linearly.
CI / sandbox cost: running the candidate's test suite costs whatever your CI normally costs; usually negligible. Sandboxed execution (recommended; never run candidate code on a panelist's laptop) costs whatever your sandboxed-runner provider charges.
Panelist time: this is the gain. A panelist's first-pass review of a take-home takes 60-90 minutes when done well, less when done badly. Reviewing the skill's report and noting agree/disagree per dimension takes 15-25 minutes. Aggregate panel time saved per take-home: 2-3 panelist-hours.
Setup time: 40 minutes once per take-home format for the rubric and the AI-use policy mapping. Reuse across roles in the same family is high.
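
As a quick worked example of the linear multiplication, the sketch below scales the single-panelist range quoted above by panel size. The function name is hypothetical, and the dollar figures are the estimates from this page, not measured prices.

```python
def per_panelist_cost_range(panelists: int,
                            low_usd: float = 0.15,
                            high_usd: float = 0.25) -> tuple:
    """Scale the single-panelist cost range linearly by panel size."""
    if panelists < 1:
        raise ValueError("panelists must be at least 1")
    return (round(low_usd * panelists, 2), round(high_usd * panelists, 2))


for n in (3, 4):
    low, high = per_panelist_cost_range(n)
    print(f"{n} panelists: ${low:.2f}-${high:.2f} per submission")
# 3 panelists: $0.45-$0.75 per submission
# 4 panelists: $0.60-$1.00 per submission
```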

Success metrics

Track three things per take-home cycle:

Cross-panelist score variance: the variance across panelists' per-dimension scores. The skill should compress variance (panelists anchored on the same rubric and the same citations) without forcing artificial agreement. Variance below ~0.5 (on the 5-point scale) suggests panelists are rubber-stamping the skill's report; above ~1.5 suggests the rubric anchors are too vague for the take-home to discriminate.
Correlation of hire vs. no-hire with the skill's aggregate: over a quarter, does the panel's hire decision correlate with the skill's aggregate? It should be positive but NOT 1.0; if it is 1.0, the panel is deferring automatically (the failure mode the skill is designed against), and if it is 0, the rubric or the skill is misaligned with what the panel actually values.
Take-home debrief duration: wall-clock time from “all panelists submitted reviews” to “decision recorded”. It should drop from 1-2 days to under 4 hours, because the report is a shared anchor.
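
The variance bands in the first metric can be checked with a few lines of Python. This is a sketch under assumptions: it uses population variance (`statistics.pvariance`) across the panel for each dimension, and the flag wording and function name are illustrative. The ~0.5 and ~1.5 thresholds are the approximate ones given above.

```python
from statistics import pvariance

# Thresholds taken from the text above; treat both as approximate.
RUBBER_STAMP_BELOW = 0.5
VAGUE_ANCHORS_ABOVE = 1.5


def variance_flags(scores_by_dimension):
    """Map each dimension to (variance, flag) from panelist scores on the 1-5 scale."""
    flags = {}
    for dim, scores in scores_by_dimension.items():
        var = pvariance(scores)  # assumption: population variance across the panel
        if var < RUBBER_STAMP_BELOW:
            flag = "possible rubber-stamping"
        elif var > VAGUE_ANCHORS_ABOVE:
            flag = "anchors may be too vague"
        else:
            flag = "ok"
        flags[dim] = (round(var, 2), flag)
    return flags
```

With three panelists, scores of 4, 4, 4 give a variance of 0 and land in the first band, while 1, 3, 5 give about 2.67 and land in the second. A small panel makes the statistic noisy, so it is better read as a trend over several cycles than per submission.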

vs. alternatives

vs. CodeSignal Coding Reports / HackerRank automated grading. Those products run the candidate's code against the platform's test cases and emit a score. Choose them if your take-home is a well-defined-input-to-well-defined-output exercise (LeetCode style). Choose the skill if the take-home is a build (write a small system, design an API, write a PRD), where the rubric is the score and the score is the rubric. The two are complementary; CodeSignal can be the input to the skill's run-tests step.
vs. hand-graded take-homes. Hand grading is right for the highest-stakes hires (founding engineer, principal IC) where the panel's narrative judgment is the deliverable. The skill pays back its setup cost on the 80% of take-homes where what is missing is consistent application of the rubric.
vs. ChatGPT-style “review this code”. Generic chat returns generic feedback. The skill is structurally different: it requires verbatim citations, runs deterministic checks first, and refuses to produce a hire/no-hire recommendation.
vs. no take-home (live-only loops). A reasonable choice for senior roles where references and live rounds carry the weight. The skill is irrelevant if the loop has no take-home.

Watch-outs

Auto-pass / auto-fail drift. Guardrail: the skill's output ends with the per-dimension scores and the aggregate. There is no “pass” or “fail” string. The schema explicitly omits a recommendation field.
Generic-feedback hallucination. Guardrail: every dimension score requires a verbatim citation (file path + line range + content). Scores without citations fall to 1.
Bias inherited from the rubric. Guardrail: the rubric is upstream of this skill. Run the rubric through the diversity slate auditor's framing: does the rubric score on dimensions with known disparate impact (e.g. “uses obscure idioms”, which frequently correlates with bootcamp vs. CS-degree background)?
AI-use detection false positive. Guardrail: AI-use signals are surfaced as notes, not as violations. The panel reviews them against the declared policy. Automatically flagging a violation would be the wrong reading; legitimate use of AI tools (within the policy) is increasingly the norm.
Sandboxing failure on candidate code. Guardrail: the skill explicitly recommends sandboxed execution and warns if the calling environment runs the test suite directly on a panelist's machine. Never run unreviewed candidate code on a machine with access to the firm's secrets.
Submission-size blowup. Guardrail: if the submission exceeds ~50K LOC, the skill warns that scoring will be partial and asks the panelist to identify the parts to focus on. Take-homes that produce 50K LOC are themselves a sign that the brief was wrong.

Stack

The skill bundle lives in apps/web/public/artifacts/take-home-evaluator-claude-skill/ and contains:

SKILL.md: the skill definition
references/1-take-home-rubric-template.md: rubric template to fill in
references/2-ai-use-policy-mapping.md: how the declared policy maps to the skill's pattern checks

Tools the workflow assumes you use: Claude (the model). Optional: CodeSignal or HackerRank for the deterministic-checks leg; Ashby for the candidate record. Sandboxed execution is the recruiter's / hiring manager's choice (Docker containers, GitHub Actions, etc.).

Related concepts: structured interviewing, behavioral interviewing, candidate experience, quality of hire.

Files in this artifact

---
name: take-home-evaluator
description: Score a take-home submission against a rubric, with verbatim citations from the candidate's code or text, plus deterministic checks (compile, run tests). Output is a structured report with per-dimension scores and an aggregate — never a hire/no-hire recommendation. Detects AI-use signals against the disclosed policy as notes for the panel debrief.
---

# Take-home assessment evaluator

## When to invoke

Use this skill when a panelist has a candidate's take-home submission and wants a rubric-anchored evaluation report to bring to the panel debrief. Take the submission directory plus the role rubric as input and return a structured Markdown report.

Do NOT invoke this skill for:

- **Auto-pass / auto-fail in the loop.** This skill produces a scored report. The hire decision happens in the panel debrief, not in the report.
- **Live coding interviews.** Different workflow.
- **Submissions where the candidate did not sign the AI-use disclosure.** The skill calibrates against a specific use-of-AI policy; without the disclosure, there is nothing to calibrate against.
- **Plagiarism forensics.** The skill flags suspicious patterns but is not a defensible plagiarism tool.

## Inputs

- Required: `submission_dir` — path to the candidate's submission directory.
- Required: `rubric` — path to the take-home rubric file. See `references/1-take-home-rubric-template.md` for the shape.
- Required: `ai_use_policy` — string identifying the disclosed policy. One of `none-allowed`, `syntax-help-only`, `ai-tools-allowed`. The skill's pattern checks are calibrated against the policy, not against an absolute "AI-generated" detector.
- Optional: `panelist_id` — if running per-panelist mode, identify the panelist for cross-panelist aggregation.
- Optional: `sandboxed` — boolean. If `false`, the skill warns about running candidate code on the local machine and asks for confirmation before proceeding to step 2.

## Reference files

- `references/1-take-home-rubric-template.md` — the rubric shape the skill expects.
- `references/2-ai-use-policy-mapping.md` — how each `ai_use_policy` value maps to the pattern checks in step 4.

## Method

Six steps.

### 1. Validate the submission shape

Walk `submission_dir`. Compare against the deliverables named in the take-home brief (the rubric file's `expected_deliverables` field). For each missing deliverable, add to a `missing_deliverables` array on the report. Do NOT score dimensions that depend on missing files; their score is `not assessed` rather than `1`.
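
A minimal Python sketch of this step, assuming `expected_deliverables` holds glob patterns relative to `submission_dir` (as in the rubric template). The function name `find_missing_deliverables` is hypothetical; it only illustrates how a `missing_deliverables` array could be built.

```python
from pathlib import Path


def find_missing_deliverables(submission_dir, expected_deliverables):
    """Return the glob patterns from the rubric that match no file in the submission."""
    root = Path(submission_dir)
    missing = []
    for pattern in expected_deliverables:
        # Path.glob supports "**" for recursive matching, e.g. "src/**/*.rs".
        if not any(p.is_file() for p in root.glob(pattern)):
            missing.append(pattern)
    return missing
```

Which dimensions depend on which deliverables is not encoded in the rubric shape, so mapping a missing file to a `not assessed` dimension is left to the scoring step.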

### 2. Run deterministic checks

If the submission has a build/test command in `package.json`, `Makefile`, `pyproject.toml`, or `Cargo.toml`:

- Compile / install dependencies in a sandboxed environment. If the calling environment is not sandboxed (per `sandboxed: false`), warn and stop until the panelist confirms.
- Run the candidate's test suite. Capture pass/fail counts.
- Run linters / formatters in check mode (do NOT modify the candidate's code). Capture findings.

Record the deterministic results in a `deterministic_checks` block on the report. These are auditable — the LLM does not re-litigate them in step 3.
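
One way to picture this step is the Python sketch below. It assumes the commands come from the rubric's `build_commands` field and that a `confirmed` flag stands in for the panelist's confirmation; both the function name and that flag are illustrative, not part of the skill definition. It does not provide sandboxing itself and should be run inside whatever sandbox you use.

```python
import shlex
import subprocess


def run_deterministic_checks(build_commands, cwd, sandboxed, confirmed=False, timeout=600):
    """Run each rubric command in order and record exit status plus captured output."""
    if not sandboxed and not confirmed:
        # Mirrors the guard in the text: stop until the panelist confirms.
        raise PermissionError("environment is not sandboxed; panelist confirmation required")
    results = {}
    for name, command in build_commands.items():
        proc = subprocess.run(
            shlex.split(command),  # no shell, so the command string is not shell-interpreted
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        results[name] = {
            "command": command,
            "passed": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    return results
```

The returned dictionary has the shape a `deterministic_checks` block could be rendered from. A missing executable raises `FileNotFoundError` and a hung build raises `subprocess.TimeoutExpired`; a fuller version would record those as failed checks rather than let them propagate.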

### 3. Score per rubric dimension

For each dimension in the rubric:

- Read the rubric anchors (1-5).
- Find evidence in the submission. Evidence is a verbatim string from the code or text, with file path + line range.
- If you cannot find verbatim evidence for a score above 1, the score is 1 — no inference, no generic feedback.
- Tag each cited line with which dimension it supports.
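
The citation rule can also be enforced mechanically after the model proposes a score. The Python sketch below is an assumption about how such a check might look: the `Citation` type and both function names are hypothetical, and "verbatim" is interpreted as the quoted text appearing as a substring of the cited line range.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Citation:
    path: str   # file path relative to the submission directory
    start: int  # first cited line, 1-indexed
    end: int    # last cited line, inclusive
    quote: str  # verbatim text claimed to appear in that range


def citation_is_verbatim(submission_dir, citation):
    """True only if the quoted text really appears within the cited line range."""
    file_path = Path(submission_dir) / citation.path
    if not file_path.is_file() or citation.start < 1 or citation.end < citation.start:
        return False
    lines = file_path.read_text(encoding="utf-8", errors="replace").splitlines()
    cited = "\n".join(lines[citation.start - 1:citation.end])
    quote = citation.quote.strip()
    return bool(quote) and quote in cited


def enforce_citation_floor(proposed_score, submission_dir, citations):
    """Keep the proposed score only when at least one citation checks out; otherwise 1."""
    if any(citation_is_verbatim(submission_dir, c) for c in citations):
        return proposed_score
    return 1
```

A check like this catches quotes the model paraphrased or attached to the wrong line range. It does not judge whether the evidence actually supports the chosen anchor.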

### 4. Detect AI-use signal against the policy

Run pattern checks per `references/2-ai-use-policy-mapping.md`:

- **Verbatim public matches** — does any chunk of the submission match a public AI-generated boilerplate exactly? Surface as `signal: verbatim_public_match`.
- **Style consistency vs. complexity** — is the style suspiciously consistent across files of varying complexity? Real candidates' style varies with the difficulty of the section. Surface as `signal: uniform_style`.
- **Generic comments without engagement** — comments that explain what the code does without engaging with the problem-specific decisions are a common AI-generated tell. Surface as `signal: generic_comments`.

Surface signals as NOTES, never as VIOLATIONS. The panel reviews against the disclosed policy. If `ai_use_policy: ai-tools-allowed`, the signals are informational only. If `ai_use_policy: none-allowed`, the panel discusses whether the signals warrant follow-up — the skill does not decide.
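
The notes-only framing can be made explicit in the data the report is built from. In the Python sketch below, the policy values and signal names come from this file; the `frame_signals` function and the keys of each note are assumptions for illustration.

```python
ALLOWED_POLICIES = {"none-allowed", "syntax-help-only", "ai-tools-allowed"}
KNOWN_SIGNALS = {"verbatim_public_match", "uniform_style", "generic_comments"}


def frame_signals(ai_use_policy, signals):
    """Turn raw signal names into notes for the panel; never into violations."""
    if ai_use_policy not in ALLOWED_POLICIES:
        raise ValueError(f"unknown ai_use_policy: {ai_use_policy!r}")
    informational = ai_use_policy == "ai-tools-allowed"
    notes = []
    for signal in signals:
        if signal not in KNOWN_SIGNALS:
            raise ValueError(f"unknown signal: {signal!r}")
        notes.append({
            "signal": signal,
            "kind": "note",  # the only kind this sketch emits; there is no "violation"
            "informational_only": informational,
            "policy": ai_use_policy,
        })
    return notes
```

Because nothing in the structure can express a violation, a downstream consumer cannot turn a signal into an automatic rejection without adding a field that the skill deliberately leaves out.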

### 5. Compute aggregate

Sum the per-dimension scores. Surface the aggregate as a number AND the per-dimension breakdown.

Do NOT translate the aggregate into a recommendation. The skill's schema explicitly omits a `recommendation` field. The report ends after the aggregate and the AI-use signals section.
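
A small Python sketch of the sum, using the scores from the sample report in the output format. How `not assessed` dimensions affect the denominator is not specified here; this sketch assumes they are excluded from both the total and the maximum and listed separately.

```python
NOT_ASSESSED = "not assessed"


def aggregate(scores, max_per_dimension=5):
    """Sum assessed dimension scores; return (total, maximum, skipped dimension ids)."""
    assessed = {d: s for d, s in scores.items() if s != NOT_ASSESSED}
    skipped = [d for d, s in scores.items() if s == NOT_ASSESSED]
    total = sum(assessed.values())
    return total, max_per_dimension * len(assessed), skipped


example = {
    "correctness": 4,
    "code_quality": 3,
    "decision_documentation": 4,
    "error_handling": 2,
    "test_coverage": 4,
}
print(aggregate(example))  # (17, 25, [])
```

The return value carries numbers only, with no field for a recommendation, which matches the schema constraint above.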

### 6. Emit report

Write the report to `report.md` in the submission directory or to stdout (depending on the calling environment). In per-panelist mode, write to `report-<panelist_id>.md`.

## Output format

```markdown
# Take-home evaluation — {Candidate name} — {Role}

Submission: `{submission_dir}` · Rubric: `{rubric_path}` (SHA `{short}`)
Generated: {ISO timestamp} · Skill v1.0 · Model: claude-sonnet-4-6
Panelist: {panelist_id or "single-panelist mode"}
AI-use policy: {ai_use_policy}

## Deterministic checks

- **Build:** {passed | failed | not applicable}
- **Tests:** {N/M passed} ({test command})
- **Linter:** {N findings}
- **Missing deliverables:** {list or "none"}

## Per-dimension scores

### Correctness — 4/5

> Evidence: `src/router.rs:42-57` — handles the request-routing edge case for retries with exponential backoff and jitter, including the cap at max 60s. Anchor 4 ("handles the named edge cases with explicit code paths") matches.

Counter-evidence (would have been 5): retry budget is hardcoded at 5 attempts; the rubric's anchor 5 names "configurable retry budget."

### Code quality — 3/5

> Evidence: `src/router.rs` — file is 800 lines with no module split. Anchor 3 ("readable but lacks structural decomposition") matches.

### Decision-making documented — 4/5

> Evidence: `README.md:25-40` — explains the choice of exponential-vs-fixed backoff with a reference to the failure mode it mitigates. Anchor 4 matches.

### Error handling — 2/5

> Evidence: `src/router.rs:120` — catches and re-raises the network error without distinguishing between transient and permanent failures. Anchor 2 ("error paths exist but do not differentiate") matches.

### Test coverage — 4/5

> Evidence: `tests/router_test.rs` — covers the happy path, three retry scenarios, and the timeout. Missing: the network-partition test the rubric anchor 5 names.

## Aggregate

17/25.

This is the per-dimension sum. The skill does NOT translate this into a hire/no-hire recommendation. The panel debrief is where the decision happens.

## AI-use signal notes

Disclosed policy: **syntax-help-only**.

- ⚠️ `signal: uniform_style` — `src/cache.rs` and `src/router.rs` use the same comment style and naming idioms despite the different complexity. May warrant a follow-up question in the panel debrief.
- ✓ No `verbatim_public_match` signals.
- ✓ No `generic_comments` signals beyond the documented threshold.

The panel discusses these against the disclosed policy. The skill does not decide.
```

## Watch-outs

- **Auto-pass / auto-fail drift.** *Guard:* the report ends after the aggregate. No recommendation field. The aggregate is a sum, not a verdict.
- **Generic-feedback hallucination.** *Guard:* every dimension score requires verbatim citation (file path + line range + content).
- **AI-use false positive.** *Guard:* signals are notes, not violations. The panel decides against the disclosed policy.
- **Unsandboxed candidate code.** *Guard:* skill warns before running in non-sandboxed environments.
- **Bias inheritance.** *Guard:* the rubric is upstream of the skill. Audit the rubric separately if the dimensions correlate with disparate impact.

# Take-home rubric template

The take-home evaluator scores a submission against this rubric shape. Copy the JSON below to your role's rubric file (one per take-home format) and fill in every field. The skill reads the rubric; without it, scoring has nothing to anchor against.

A complete rubric takes 30-90 minutes to author per take-home format. Reuse across roles in the same family is high: a senior-backend take-home rubric is largely the same across companies after you've written it once.

## JSON shape

```json
{
  "take_home_id": "senior-backend-router-rewrite-v3",
  "version": "2026-04-15",
  "expected_deliverables": [
    "README.md",
    "src/**/*.rs",
    "tests/**/*.rs",
    "Cargo.toml"
  ],
  "build_commands": {
    "build": "cargo build --release",
    "test": "cargo test --all",
    "lint": "cargo clippy -- -D warnings"
  },
  "ai_use_policy_match": "syntax-help-only",
  "dimensions": [
    {
      "id": "correctness",
      "label": "Correctness",
      "anchors": {
        "1": "Compiles but does not pass the candidate's own tests, or does not handle the named happy path.",
        "2": "Handles the happy path; ignores the named edge cases (retries, partial failure).",
        "3": "Handles the happy path and the obvious edge cases; misses the subtle ones (clock skew, partition recovery).",
        "4": "Handles the named edge cases with explicit code paths; minor gaps acceptable.",
        "5": "Handles all named edge cases AND demonstrates a configurable retry budget / timeout structure that the rubric explicitly calls for."
      }
    },
    {
      "id": "code_quality",
      "label": "Code quality and structural decomposition",
      "anchors": {
        "1": "Single file, no decomposition; difficult to read.",
        "2": "Decomposed but the decomposition does not follow domain boundaries.",
        "3": "Readable but lacks structural decomposition that would scale past prototype.",
        "4": "Clear module boundaries that follow the domain; idiomatic for the language.",
        "5": "All of 4, plus the structural choices are documented in the README with the alternatives considered."
      }
    },
    {
      "id": "decision_documentation",
      "label": "Decision-making documented",
      "anchors": {
        "1": "No README, or the README only repeats the take-home brief.",
        "2": "README describes what was built without naming the engineering choices.",
        "3": "README names some choices without naming the alternatives.",
        "4": "README names the choices AND explains why each was picked over the named alternatives.",
        "5": "All of 4, plus the README cites the failure modes each choice mitigates."
      }
    },
    {
      "id": "error_handling",
      "label": "Error handling",
      "anchors": {
        "1": "Errors are caught and silently swallowed.",
        "2": "Error paths exist but do not differentiate between transient and permanent failures.",
        "3": "Differentiates transient vs. permanent; lacks structured error types.",
        "4": "Structured error types; retry policy is explicit per error class.",
        "5": "All of 4, plus error paths have explicit observability (logging / metrics / traces) named in the code."
      }
    },
    {
      "id": "test_coverage",
      "label": "Test coverage",
      "anchors": {
        "1": "No tests, or tests do not run.",
        "2": "Tests cover the happy path only.",
        "3": "Tests cover the happy path and one or two edge cases.",
        "4": "Tests cover the happy path and multiple edge cases (timeout, retry, partial failure).",
        "5": "All of 4, plus the network-partition test the rubric explicitly calls for."
      }
    }
  ],
  "rubric_fairness_check": {
    "no_bootcamp_vs_cs_proxies": "Anchors must score on observable behavior in the submission, not on idioms that proxy for educational background. 'Uses obscure language idioms' is forbidden as a positive signal.",
    "no_native-english-only_proxies": "Anchors must NOT score on README writing fluency beyond the level required to communicate the engineering decisions.",
    "documented_in_brief": "The take-home brief shared with the candidate must describe the rubric dimensions and approximate weighting. Surprise dimensions are unfair."
  }
}
```

## Per-field notes

- `take_home_id` — stable identifier for the take-home format. Reused across candidates for the same role family.
- `version` — semver or date. Bumped when the rubric is edited; the skill captures the version in the report so re-scoring against an edited rubric is visible.
- `expected_deliverables` — globs the skill walks against the submission. Missing deliverables surface in the report.
- `build_commands` — the skill runs these in step 2 (deterministic checks). Sandboxed execution required.
- `ai_use_policy_match` — should match the disclosure language in the take-home brief. Mismatch means the candidate's policy understanding doesn't match what the skill calibrates against.
- `dimensions` — array. Each dimension has an `id`, a `label`, and 5 anchor strings. Anchors should be observable behavior, not adjectives.
- `rubric_fairness_check` — three named fairness checks the skill confirms before scoring. If the rubric anchors violate any of these, the skill emits a warning and asks the rubric author to revise. (The skill does not refuse to score on a fairness-check violation, because the rubric is upstream and revising it is the right intervention. But it surfaces the issue.)
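
Before the first scoring run, a rubric file can be checked for shape with a short script. The Python sketch below is an assumption about what such a check might cover (required top-level fields, unique dimension ids, exactly five non-empty anchors per dimension); `validate_rubric` and `load_rubric` are hypothetical helper names, not part of the bundle.

```python
import json

REQUIRED_KEYS = {"take_home_id", "version", "expected_deliverables", "build_commands",
                 "ai_use_policy_match", "dimensions", "rubric_fairness_check"}
ANCHOR_KEYS = {"1", "2", "3", "4", "5"}


def validate_rubric(rubric):
    """Return a list of human-readable problems; an empty list means the shape looks right."""
    problems = [f"missing field: {key}" for key in sorted(REQUIRED_KEYS - rubric.keys())]
    seen_ids = set()
    for i, dim in enumerate(rubric.get("dimensions", [])):
        dim_id = dim.get("id", f"#{i}")
        if dim_id in seen_ids:
            problems.append(f"duplicate dimension id: {dim_id}")
        seen_ids.add(dim_id)
        if not dim.get("label"):
            problems.append(f"{dim_id}: missing label")
        anchors = dim.get("anchors", {})
        if set(anchors) != ANCHOR_KEYS:
            problems.append(f"{dim_id}: anchors must be exactly 1-5")
        elif any(not str(text).strip() for text in anchors.values()):
            problems.append(f"{dim_id}: empty anchor text")
    return problems


def load_rubric(path):
    """Read a rubric JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

This only checks structure. Whether the anchors describe observable behavior, or pass the fairness checks, still needs a human read.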

## Authoring a new dimension

To add a dimension to an existing rubric:

1. Pick observable behavior, not adjectives. "Has good error handling" is not a dimension; "error paths differentiate transient vs. permanent failure" is.
2. Write the 5 anchors as five distinct observable behaviors, each strictly more demanding than the last.
3. Test the dimension on a known submission. Can you score it from the anchors alone, without the original code in your head? If not, the anchors are too vague.
4. Bump the rubric version.

## Authoring a new rubric (for a net-new take-home)

1. Start from the take-home brief. What does the brief tell the candidate to deliver? Those are the `expected_deliverables`.
2. What is the brief asking the candidate to demonstrate? Those become the `dimensions`. Aim for 4-6 dimensions; with more than 6, the panelist can't hold them all in mind.
3. Write the 1-anchor first (the floor: what does a minimal-effort submission look like?), then the 5-anchor (the ceiling: what does the strongest submission look like?), then fill 2-4 between.
4. Write the brief and the rubric in parallel. Anchors that don't show up in the brief are surprise dimensions; anchors in the brief that don't show up in the rubric are unscoreable promises.
5. Run the rubric on a known submission (a prior hire's submission, anonymized). Does it score them where you'd expect?

# AI-use policy mapping

The take-home evaluator runs pattern checks calibrated against the disclosed AI-use policy. The same submission produces different signal interpretation under different policies; this file documents the mapping.

The intent is to surface signals to the panel debrief, not to make a determination. AI-use detection is well known to be unreliable as a forensic tool; the right framing is "discuss with the candidate against the disclosed policy."

## Policy values

### `none-allowed`

The take-home brief told the candidate: "Do not use AI tools (Claude, ChatGPT, Copilot, Cursor, etc.) at any point during this assessment."

Pattern checks:

- **Verbatim public matches** — surface as `signal: verbatim_public_match` if any chunk ≥3 lines matches a known public AI-generated boilerplate exactly.
- **Style consistency vs. complexity** — surface as `signal: uniform_style` if comment style and naming idioms are suspiciously consistent across files of varying complexity.
- **Generic comments without engagement** — surface as `signal: generic_comments` if comments explain what the code does without engaging with the problem-specific decisions, beyond a per-file threshold (default: >40% of comments are generic).

Panel debrief framing: "Signals suggest possible AI use. Discuss with the candidate. The skill does not determine; the panel does, against the disclosed policy."

### `syntax-help-only`

The take-home brief told the candidate: "AI tools are allowed for syntax help (looking up the right method name, checking a regex, formatting). They are NOT allowed for solution generation (asking 'how would I implement this?', 'write me the function for X')."

Pattern checks: same as `none-allowed`, but with different framing. AI-generated boilerplate at the function level is a signal worth discussing; AI-completed identifier names are not.

Panel debrief framing: "AI use was permitted within bounds. Signals indicate where the candidate may have crossed the bounds. Discuss with the candidate."

### `ai-tools-allowed`

The take-home brief told the candidate: "AI tools are allowed throughout. Tell us what tools you used and how, in the README."

Pattern checks: still run, but signals are surfaced as informational only.

Panel debrief framing: "AI use was permitted. The signals are informational. The panel evaluates the SUBMISSION against the rubric; the question is what the candidate built and why, not which tools they used."

In the `ai-tools-allowed` policy, the panel should be ready to evaluate "the candidate's prompting and tool-use judgment" as a positive dimension if the rubric calls it out — many roles in 2026 explicitly want to see how candidates work with AI tools.

## What the skill does NOT do

- **Run a third-party AI-detection model** (GPTZero, Originality, etc.). Those tools have well-documented false-positive rates that climb to 30-50% on technical writing. Their findings do not survive a panel debrief.
- **Output a confidence score for "this was AI-generated."** No such confidence can be honestly assigned; the patterns are signals, not proof.
- **Block the report on signals.** Signals appear in the report. The panel decides. If `none-allowed` and the signals are strong, the panel typically schedules a follow-up conversation rather than auto-rejecting.

## Calibration

The skill's pattern thresholds (e.g. "40% generic comments") are tunable per take-home format. If your team's take-home format produces a lot of boilerplate naturally (e.g. it asks for a CRUD API), the threshold should be raised; if your format produces little boilerplate (e.g. it asks for a custom algorithm), the threshold should be lowered.

The defaults are calibrated against general-engineering take-homes. Tune in `config.json` per take-home format. Document the tuning in the rubric file's notes section.
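
To make the threshold concrete, here is a minimal Python sketch of the `generic_comments` check for one file. It assumes each comment has already been judged generic or not (that judgment is the model's, not this code's) and it reads ">40%" as strictly greater than. The function name and the input shape are illustrative.

```python
DEFAULT_GENERIC_COMMENT_THRESHOLD = 0.40  # default named in the text above


def generic_comments_signal(generic_flags, threshold=DEFAULT_GENERIC_COMMENT_THRESHOLD):
    """generic_flags: one boolean per comment in a file, True when judged generic."""
    if not generic_flags:
        return False  # a file without comments cannot trip this check
    ratio = sum(generic_flags) / len(generic_flags)
    return ratio > threshold  # strictly greater, matching ">40%"
```

For a boilerplate-heavy format such as a CRUD API, passing a higher `threshold` (say 0.6) keeps naturally repetitive comments from tripping the signal; for a custom-algorithm format, a lower value does the opposite.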

## Why surfaced as notes, not as a verdict

1. **False-positive cost is asymmetric.** Auto-rejecting a candidate based on an AI-use signal that turns out to be wrong damages the firm's brand and risks a discrimination claim if the signal correlates with disparate impact. Surfacing for discussion costs nothing.
2. **The disclosed policy is the contract.** What matters is whether the candidate followed the policy they were told about. The signal helps the panel ask; it does not answer.
3. **AI-use detection is unreliable.** Even the best-known detectors have unacceptable error rates. The skill does not pretend otherwise.
4. **Hire decisions involve more than this one signal.** A candidate with a strong submission and a possible AI-use signal under `syntax-help-only` is a candidate to talk to, not a candidate to drop.