claude-skill

Take-home evaluator with Claude

Difficulty

intermediate

Setup time

40min

For

recruiter · hiring-manager · technical-screener

Recruiting and TA

Stack

A Claude Skill that scores a candidate's take-home submission against a rubric written by the hiring team, with line-by-line citations from the submitted code or documents, and produces a structured evaluation report. It never passes or fails a candidate automatically. The hiring panel uses the report to anchor the live debrief; the actual hire/no-hire decision happens in the panel discussion, not in the report. It replaces 60-90 disorganized minutes per panelist of "I read this on Saturday morning and I think it was fine?" with a structured 15-25 minute review per panelist plus a calibrated 30-minute debrief.

When to use

The role uses a take-home as a stage in the loop. Structured interviews are a prerequisite: without a written rubric the skill has nothing to score against.
You want consistent scoring across panelists. Take-home reviews are notoriously inconsistent because each panelist reads at a different time with a different level of attention; the rubric-anchored report is the leveling artifact.
The take-home is a coding exercise, a system design write-up, a written exercise (PRD draft, written sales-call mock), or a built integration that produces inspectable artifacts.

When NOT to use

Automatic pass / fail in the loop. The skill produces a scored report. The hiring decision happens in the panel debrief. Wiring the report's aggregate score to a stage transition triggers the same NYC LL 144 / EU AI Act exposure as automatic rejection at screening.
Live coding interviews. Different workflow (live observation of process, not evaluation of an artifact). The interview debrief workflow handles that case.
Take-homes that need more than 4 hours of candidate work. Long take-homes are themselves a candidate-experience anti-pattern; the skill will not fix that.
Submissions where the candidate did not sign the AI-use disclosure. Rubric scoring is calibrated against a specific AI-use policy (e.g. "AI tools allowed for syntax help, not for solution generation"); without the disclosure, the skill cannot calibrate its detection of AI-use signals.
Plagiarism detection as the primary use. The skill flags suspicious patterns (verbatim matches with public repositories, generic AI-generated boilerplate), but it is not a forensic plagiarism-detection tool. Use a dedicated tool if you need defensible findings.

Setup

Drop in the bundle. Place apps/web/public/artifacts/take-home-evaluator-claude-skill/SKILL.md in your Claude Code skills directory.
Create the rubric. For each take-home, write a JSON rubric with the dimensions you actually score on (correctness, code quality, decision-making documented in comments / README, error handling, test coverage), with 1-5 anchors per dimension. The template is in references/1-take-home-rubric-template.md.
Configure the AI-use policy. The skill prompt tells Claude explicitly which AI use was permitted ("syntax help only", "AI tools allowed throughout", "no AI tools", etc.). The setting maps to the disclosure language in the take-home brief; the two must match.
Configure the panelist distribution mode. Either single-panelist mode (one report per submission) or per-panelist mode (each panelist receives the same submission and generates their own evaluation, and the skill aggregates the deltas across panelists). Per-panelist mode catches scoring drift but multiplies model cost by the number of panelists.
Dry-run on a closed take-home. Score a take-home from a candidate who was hired (or not) last quarter. Compare the skill's per-dimension scores with the panel's actual scores. Adjust the rubric anchors if the skill weights dimensions differently from the panel.
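
The dry-run comparison boils down to lining up two score sheets. A minimal sketch, assuming both the skill's and the panel's scores are available as dicts keyed by dimension id; the function names, the 1.0-point tolerance, and the sample numbers are hypothetical and only for illustration:

```python
def score_deltas(skill_scores: dict[str, float], panel_scores: dict[str, float]) -> dict[str, float]:
    """Skill score minus panel score for every dimension both sides scored."""
    shared = skill_scores.keys() & panel_scores.keys()
    return {dim: skill_scores[dim] - panel_scores[dim] for dim in sorted(shared)}


def dimensions_to_revisit(deltas: dict[str, float], tolerance: float = 1.0) -> list[str]:
    """Dimensions where skill and panel disagree by at least `tolerance` points.

    The tolerance is an assumed value, not something the skill defines.
    """
    return [dim for dim, delta in deltas.items() if abs(delta) >= tolerance]


# Illustrative numbers only, not real data.
skill = {"correctness": 4, "code_quality": 3, "error_handling": 2}
panel = {"correctness": 4, "code_quality": 4.5, "error_handling": 2.5}

deltas = score_deltas(skill, panel)
print(deltas)
print(dimensions_to_revisit(deltas))  # anchors worth rewording first
```

A large delta on one dimension usually points at that dimension's anchors rather than at the skill as a whole, so it makes sense to revise anchors one dimension at a time and re-run.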

What the skill actually does

Six steps. Order matters: deterministic checks (compile, run, file structure) happen before the LLM scores anything, because letting the model score a submission that does not run produces a confident report about a broken artifact.

Validate the submission shape. Check that every deliverable named in the take-home brief exists (e.g. README.md, code files, test files). Missing deliverables → flag them in the report; do NOT score the dimensions that depend on them.
Run deterministic checks. Compile the code. Run the test suite the candidate wrote. Capture the output. These are the auditable, reproducible results; the LLM does not re-litigate them.
Score per rubric dimension. For each dimension in the rubric, score 1-5 with verbatim citations from the candidate's submission (file path + line range + the code or text). Citations are mandatory; without a citation, the score defaults to the rubric's anchor 1. The citation requirement keeps the model anchored in the actual submission instead of generic feedback.
Detect AI-use signal against the policy. Run pattern checks against the disclosed AI-use policy. Verbatim matches with publicly available AI-generated boilerplate, suspiciously consistent style across files of varying complexity, and generic comments that do not engage with the problem-specific decisions all surface as ai-use-signal notes: not as a violation, only as a signal for the panel to discuss against the disclosed policy.
Compute the aggregate WITHOUT a hire/no-hire recommendation. Sum the per-dimension scores. Surface the aggregate as a number. Do NOT translate the aggregate into a recommendation. The skill explicitly returns "report; not a decision" instead of "pass / fail".
Emit a per-panelist or aggregated report. In single-panelist mode, the report goes to the calling panelist. In per-panelist mode, the skill aggregates across panelists, surfaces per-dimension deltas between panelists (and which panelist saw what differently), and emits a debrief-ready report.

Cost reality

Per take-home submission, on Claude Sonnet 4.6:

LLM tokens: typically 15-30k input (rubric + submission code/text + skill instructions) and 3-5k output (report scored per dimension). Roughly $0.15-0.25 per submission in single-panelist mode. Per-panelist mode (3-4 panelists) multiplies that linearly.
CI / sandbox cost: running the candidate's test suite costs what your CI normally costs; usually negligible. Sandboxed execution (recommended: never run candidate code on a panelist's laptop) costs whatever your sandbox runner provider charges.
Panelist time: this is the gain. A panelist's first-pass review of a take-home takes 60-90 minutes when done well, less when done badly. Reviewing the skill's report and noting agree/disagree per dimension takes 15-25 minutes. Total panel time saved per take-home: 2-3 panelist-hours.
Setup time: 40 minutes once for the rubric and the AI-use policy mapping per take-home format. Reuse across roles in the same family is high.

Success metric

Track three things per take-home cycle:

Cross-panelist score variance: the variance across panelists' per-dimension scores. The skill should compress variance (panelists anchored on the same rubric and the same citations) without forcing artificial agreement. Variance below ~0.5 (on a 5-point scale) suggests panelists are rubber-stamping the skill's report; above ~1.5 suggests the rubric anchors are too vague for the take-home to discriminate.
Hire-vs-no-hire correlation with the skill aggregate: over a quarter, does the panel's hiring decision correlate with the skill's aggregate? It should be positive but NOT 1.0; if it is 1.0, the panel is deferring automatically (the failure mode the skill is designed against), and if it is 0, the rubric or the skill is misaligned with what the panel actually values.
Take-home debrief duration: elapsed time from "all panelists submitted reviews" to "decision recorded". It should drop from 1-2 days to under 4 hours, because the report is a shared anchor.

Versus the alternatives

Versus CodeSignal Coding Reports / HackerRank automated grading. These products run the candidate's code against the platform's test cases and emit a score. Choose them if your take-home is structured as well-defined input to well-defined output (LeetCode style). Choose the skill if the take-home is a build (write a small system, design an API, write a PRD), where there are no platform test cases and the rubric is what defines the score. The two are complementary; CodeSignal can be the input to the skill's run-tests step.
Versus manually graded take-homes. Manual grading is right for the highest-stakes hires (founding engineer, principal IC), where the panel's narrative judgment is the deliverable. The skill earns back its setup cost on the 80% of take-homes where consistent application of the rubric is what is missing.
Versus ChatGPT in "review this code" style. Generic chat returns generic feedback. The skill is structurally different: it requires verbatim citations, runs deterministic checks first, and refuses to issue a hire/no-hire recommendation.
Versus no take-home (live-only loops). A reasonable choice for senior roles where references and live rounds carry the weight. The skill is irrelevant if the loop has no take-home.

Watch-outs

Automatic pass / fail drift. Guard: the skill's output ends with the per-dimension scores, the aggregate and the AI-use signal notes. There is no "pass" or "fail" verdict string. The schema explicitly omits a recommendation field.
Generic-feedback hallucination. Guard: every dimension score requires a verbatim citation (file path + line range + content). Scores without citations default to 1.
Rubric bias inheritance. Guard: the rubric is upstream of this skill. Run the rubric through the diverse-slate auditor framing: does the rubric score on dimensions with known disparate impact (e.g. "uses obscure idioms", which often correlates with bootcamp vs. CS background)?
AI-use detection false positive. Guard: AI-use signals are surfaced as notes, not violations. The panel reviews them against the disclosed policy. Auto-flagging as a violation would be the wrong reading; legitimate use of AI tools (within the policy) is increasingly the norm.
Sandboxing failure on candidate code. Guard: the skill explicitly recommends sandboxed execution and warns if the calling environment runs the test suite directly on a panelist's machine. Never run unreviewed candidate code on a machine with access to office secrets.
Submission size explosion. Guard: if the submission exceeds ~50K LOC, the skill warns that scoring will be partial and asks the panelist to identify the parts to focus on. Take-homes that produce 50K LOC are themselves a sign that the brief was wrong.

Stack

The skill bundle lives in apps/web/public/artifacts/take-home-evaluator-claude-skill/ and contains:

SKILL.md: the skill definition
references/1-take-home-rubric-template.md: fillable rubric template
references/2-ai-use-policy-mapping.md: how the disclosed policy maps to the skill's pattern checks

Tools the workflow assumes you use: Claude (the model). Optional: CodeSignal or HackerRank for the deterministic-check part; Ashby for the candidate record. Sandboxed execution is the recruiter's / hiring manager's choice (Docker containers, GitHub Actions, etc.).

Related concepts: structured interviews, behavioral interviews, candidate experience, quality of hire.


Files in this artifact


---
name: take-home-evaluator
description: Score a take-home submission against a rubric, with verbatim citations from the candidate's code or text, plus deterministic checks (compile, run tests). Output is a structured report with per-dimension scores and an aggregate — never a hire/no-hire recommendation. Detects AI-use signals against the disclosed policy as notes for the panel debrief.
---

# Take-home assessment evaluator

## When to invoke

Use this skill when a panelist has a candidate's take-home submission and wants a rubric-anchored evaluation report to bring to the panel debrief. Take the submission directory plus the role rubric as input and return a structured Markdown report.

Do NOT invoke this skill for:

- **Auto-pass / auto-fail in the loop.** This skill produces a scored report. The hire decision happens in the panel debrief, not in the report.
- **Live coding interviews.** Different workflow.
- **Submissions where the candidate did not sign the AI-use disclosure.** The skill calibrates against a specific use-of-AI policy; without the disclosure, there is nothing to calibrate against.
- **Plagiarism forensics.** The skill flags suspicious patterns but is not a defensible plagiarism tool.

## Inputs

- Required: `submission_dir` — path to the candidate's submission directory.
- Required: `rubric` — path to the take-home rubric file. See `references/1-take-home-rubric-template.md` for the shape.
- Required: `ai_use_policy` — string identifying the disclosed policy. One of `none-allowed`, `syntax-help-only`, `ai-tools-allowed`. The skill's pattern checks are calibrated against the policy, not against an absolute "AI-generated" detector.
- Optional: `panelist_id` — if running per-panelist mode, identify the panelist for cross-panelist aggregation.
- Optional: `sandboxed` — boolean. If `false`, the skill warns about running candidate code on the local machine and asks for confirmation before proceeding to step 2.
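
A calling script could validate these inputs before invoking the skill. The sketch below is one possible shape, not the skill's own interface: the class name `EvaluatorInputs`, the `needs_confirmation` property, and the default of `sandboxed=False` are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# The three policy values listed above.
ALLOWED_POLICIES = ("none-allowed", "syntax-help-only", "ai-tools-allowed")


@dataclass(frozen=True)
class EvaluatorInputs:
    submission_dir: Path
    rubric: Path
    ai_use_policy: str
    panelist_id: Optional[str] = None  # set only in per-panelist mode
    sandboxed: bool = False            # assumption: unspecified means "not sandboxed"

    def __post_init__(self) -> None:
        if self.ai_use_policy not in ALLOWED_POLICIES:
            raise ValueError(
                f"ai_use_policy must be one of {ALLOWED_POLICIES}, got {self.ai_use_policy!r}"
            )

    @property
    def needs_confirmation(self) -> bool:
        """True when the panelist must confirm before step 2 runs candidate code."""
        return not self.sandboxed


# Hypothetical paths, for illustration only.
inputs = EvaluatorInputs(
    submission_dir=Path("submissions/candidate-a"),
    rubric=Path("rubrics/senior-backend.json"),
    ai_use_policy="syntax-help-only",
)
print(inputs.needs_confirmation)
```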

## Reference files

- `references/1-take-home-rubric-template.md` — the rubric shape the skill expects.
- `references/2-ai-use-policy-mapping.md` — how each `ai_use_policy` value maps to the pattern checks in step 4.

## Method

Six steps.

### 1. Validate the submission shape

Walk `submission_dir`. Compare against the deliverables named in the take-home brief (the rubric file's `expected_deliverables` field). For each missing deliverable, add to a `missing_deliverables` array on the report. Do NOT score dimensions that depend on missing files; their score is `not assessed` rather than `1`.
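
The deliverables check can be done deterministically with glob matching. A minimal sketch, assuming the `expected_deliverables` entries are globs relative to `submission_dir`; the function name `missing_deliverables` is hypothetical and simply mirrors the report field named above:

```python
import tempfile
from pathlib import Path


def missing_deliverables(submission_dir: Path, expected_globs: list[str]) -> list[str]:
    """Return every expected glob that matches no file under submission_dir."""
    return [
        pattern
        for pattern in expected_globs
        if not any(path.is_file() for path in submission_dir.glob(pattern))
    ]


# Build a throwaway submission that has a README and source but no tests or manifest.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "router.rs").write_text("fn main() {}\n")
    (root / "README.md").write_text("# Router\n")
    missing = missing_deliverables(
        root, ["README.md", "src/**/*.rs", "tests/**/*.rs", "Cargo.toml"]
    )

print(missing)  # ['tests/**/*.rs', 'Cargo.toml']
```

Which dimensions depend on which deliverables is not encoded in the rubric shape, so deciding which scores become `not assessed` still needs a mapping that the rubric author supplies or the skill infers.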

### 2. Run deterministic checks

Use the rubric's `build_commands` when they are defined. Otherwise, look for a build/test command in `package.json`, `Makefile`, `pyproject.toml`, or `Cargo.toml`. Then:

- Compile / install dependencies in a sandboxed environment. If the calling environment is not sandboxed (per `sandboxed: false`), warn and stop until the panelist confirms.
- Run the candidate's test suite. Capture pass/fail counts.
- Run linters / formatters in check mode (do NOT modify the candidate's code). Capture findings.

Record the deterministic results in a `deterministic_checks` block on the report. These are auditable — the LLM does not re-litigate them in step 3.
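
One way to picture this step is a small runner that refuses to execute anything until the sandbox question is settled, then records exit codes. This is a sketch under stated assumptions: the function name `run_checks`, the `confirmed` flag, the 600-second timeout, and the result keys are all hypothetical, and the code does not create a sandbox itself (it has to be launched inside one).

```python
import shlex
import subprocess
import sys
from pathlib import Path


def run_checks(
    commands: dict[str, str], cwd: Path, sandboxed: bool, confirmed: bool = False
) -> dict[str, dict]:
    """Run each named command and record whether it exited with status 0."""
    if not sandboxed and not confirmed:
        # Mirrors the "warn and stop until the panelist confirms" rule.
        raise PermissionError("Not sandboxed: confirm before running candidate code.")
    results = {}
    for name, command in commands.items():
        proc = subprocess.run(
            shlex.split(command), cwd=cwd, capture_output=True, text=True, timeout=600
        )
        results[name] = {
            "command": command,
            "passed": proc.returncode == 0,
            "output": (proc.stdout + proc.stderr)[-2000:],  # keep only the tail
        }
    return results


# Stand-in commands so the sketch runs anywhere; a real rubric would supply
# something like "cargo build --release" and "cargo test --all".
demo_commands = {
    "build": shlex.join([sys.executable, "-c", "print('ok')"]),
    "test": shlex.join([sys.executable, "-c", "raise SystemExit(1)"]),
}
checks = run_checks(demo_commands, Path("."), sandboxed=True)
print({name: result["passed"] for name, result in checks.items()})
```

Recording the raw exit status and output tail, rather than an interpretation of them, is what keeps this block auditable when the report is read later.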

### 3. Score per rubric dimension

For each dimension in the rubric:

- Read the rubric anchors (1-5).
- Find evidence in the submission. Evidence is a verbatim string from the code or text, with file path + line range.
- If you cannot find verbatim evidence for a score above 1, the score is 1 — no inference, no generic feedback.
- Tag each cited line with which dimension it supports.
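
The citation rule can be enforced mechanically after the model proposes a score. A minimal sketch, assuming citations arrive as a path, a 1-indexed inclusive line range, and the quoted text; the `Citation` type and both function names are hypothetical:

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Citation:
    path: str        # relative to the submission directory
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    quote: str       # text claimed to appear verbatim in that range


def citation_is_verbatim(submission_dir: Path, citation: Citation) -> bool:
    """True if the quote really appears inside the cited line range."""
    file_path = submission_dir / citation.path
    if not file_path.is_file():
        return False
    lines = file_path.read_text(encoding="utf-8", errors="replace").splitlines()
    cited = "\n".join(lines[citation.start_line - 1 : citation.end_line])
    quote = citation.quote.strip()
    return bool(quote) and quote in cited


def enforce_citations(proposed_score: int, submission_dir: Path, citations: list[Citation]) -> int:
    """Keep the proposed score only if some citation checks out; otherwise fall back to 1."""
    if any(citation_is_verbatim(submission_dir, c) for c in citations):
        return proposed_score
    return 1


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "router.py").write_text("import time\n\ndef retry():\n    time.sleep(1)\n")
    good = Citation("router.py", 3, 4, "def retry():")
    bad = Citation("router.py", 1, 1, "exponential backoff")  # not in the file
    kept = enforce_citations(4, root, [good])
    floored = enforce_citations(4, root, [bad])

print(kept, floored)  # 4 1
```

A check like this only proves the quote exists where the report says it does; whether the quote actually supports the chosen anchor is still a judgment for the panelist.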

### 4. Detect AI-use signal against the policy

Run pattern checks per `references/2-ai-use-policy-mapping.md`:

- **Verbatim public matches** — does any chunk of the submission match a public AI-generated boilerplate exactly? Surface as `signal: verbatim_public_match`.
- **Style consistency vs. complexity** — is the style suspiciously consistent across files of varying complexity? Real candidates' style varies with the difficulty of the section. Surface as `signal: uniform_style`.
- **Generic comments without engagement** — comments that explain what the code does without engaging with the problem-specific decisions are a common AI-generated tell. Surface as `signal: generic_comments`.

Surface signals as NOTES, never as VIOLATIONS. The panel reviews against the disclosed policy. If `ai_use_policy: ai-tools-allowed`, the signals are informational only. If `ai_use_policy: none-allowed`, the panel discusses whether the signals warrant follow-up — the skill does not decide.

### 5. Compute aggregate

Sum the per-dimension scores. Surface the aggregate as a number AND the per-dimension breakdown.

Do NOT translate the aggregate into a recommendation. The skill's schema explicitly omits a `recommendation` field. The report ends after the aggregate and the AI-use signals section.
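
The aggregate is plain arithmetic, which the sketch below spells out. Two things in it are assumptions rather than documented behavior: the function name `compute_aggregate`, and the choice to leave `not assessed` dimensions out of both the sum and the maximum.

```python
from typing import Optional

MAX_ANCHOR = 5  # rubric anchors run from 1 to 5


def compute_aggregate(scores: dict[str, Optional[int]]) -> dict:
    """Sum the assessed dimensions. `None` stands for `not assessed`."""
    assessed = [score for score in scores.values() if score is not None]
    return {
        "per_dimension": {
            dim: ("not assessed" if score is None else score)
            for dim, score in scores.items()
        },
        "aggregate": sum(assessed),
        "max_possible": MAX_ANCHOR * len(assessed),
        # Deliberately no recommendation field.
    }


report = compute_aggregate(
    {
        "correctness": 4,
        "code_quality": 3,
        "decision_documentation": 4,
        "error_handling": 2,
        "test_coverage": 4,
    }
)
print(f"{report['aggregate']}/{report['max_possible']}")  # 17/25
```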

### 6. Emit report

Write the report to `report.md` in the submission directory or to stdout (depending on the calling environment). In per-panelist mode, write to `report-<panelist_id>.md`.
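
In per-panelist mode, the separate reports can later be compared dimension by dimension. The sketch below shows the file naming described above and one possible way to express cross-panelist deltas; `cross_panelist_deltas` and its max-minus-min "spread" are hypothetical illustrations, not the skill's defined output.

```python
from typing import Optional


def report_filename(panelist_id: Optional[str] = None) -> str:
    """`report.md` in single-panelist mode, `report-<panelist_id>.md` otherwise."""
    return f"report-{panelist_id}.md" if panelist_id else "report.md"


def cross_panelist_deltas(scores_by_panelist: dict[str, dict[str, int]]) -> dict[str, dict]:
    """For each dimension, list every panelist's score and the max-min spread."""
    dimensions = sorted({dim for scores in scores_by_panelist.values() for dim in scores})
    deltas = {}
    for dim in dimensions:
        by_panelist = {p: s[dim] for p, s in scores_by_panelist.items() if dim in s}
        values = list(by_panelist.values())
        deltas[dim] = {"scores": by_panelist, "spread": max(values) - min(values)}
    return deltas


# Illustrative numbers only.
example = {
    "p1": {"correctness": 4, "error_handling": 2},
    "p2": {"correctness": 4, "error_handling": 4},
}
deltas = cross_panelist_deltas(example)
print(report_filename("p1"), deltas["error_handling"]["spread"])
```

Dimensions with a large spread are the natural agenda for the debrief; dimensions with a spread of zero rarely need discussion time.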

## Output format

```markdown
# Take-home evaluation — {Candidate name} — {Role}

Submission: `{submission_dir}` · Rubric: `{rubric_path}` (SHA `{short}`)
Generated: {ISO timestamp} · Skill v1.0 · Model: claude-sonnet-4-6
Panelist: {panelist_id or "single-panelist mode"}
AI-use policy: {ai_use_policy}

## Deterministic checks

- **Build:** {passed | failed | not applicable}
- **Tests:** {N/M passed} ({test command})
- **Linter:** {N findings}
- **Missing deliverables:** {list or "none"}

## Per-dimension scores

### Correctness — 4/5

> Evidence: `src/router.rs:42-57` — handles the request-routing edge case for retries with exponential backoff and jitter, including the cap at max 60s. Anchor 4 ("handles the named edge cases with explicit code paths") matches.

Counter-evidence (would have been 5): retry budget is hardcoded at 5 attempts; the rubric's anchor 5 names "configurable retry budget."

### Code quality — 3/5

> Evidence: `src/router.rs` — file is 800 lines with no module split. Anchor 3 ("readable but lacks structural decomposition") matches.

### Decision-making documented — 4/5

> Evidence: `README.md:25-40` — explains the choice of exponential-vs-fixed backoff with a reference to the failure mode it mitigates. Anchor 4 matches.

### Error handling — 2/5

> Evidence: `src/router.rs:120` — catches and re-raises the network error without distinguishing between transient and permanent failures. Anchor 2 ("error paths exist but do not differentiate") matches.

### Test coverage — 4/5

> Evidence: `tests/router_test.rs` — covers the happy path, three retry scenarios, and the timeout. Missing: the network-partition test the rubric anchor 5 names.

## Aggregate

17/25.

This is the per-dimension sum. The skill does NOT translate this into a hire/no-hire recommendation. The panel debrief is where the decision happens.

## AI-use signal notes

Disclosed policy: **syntax-help-only**.

- ⚠️ `signal: uniform_style` — `src/cache.rs` and `src/router.rs` use the same comment style and naming idioms despite the different complexity. May warrant a follow-up question in the panel debrief.
- ✓ No `verbatim_public_match` signals.
- ✓ No `generic_comments` signals beyond the documented threshold.

The panel discusses these against the disclosed policy. The skill does not decide.
```

## Watch-outs

- **Auto-pass / auto-fail drift.** *Guard:* the report ends after the aggregate and the AI-use signal notes. No recommendation field. The aggregate is a sum, not a verdict.
- **Generic-feedback hallucination.** *Guard:* every dimension score requires verbatim citation (file path + line range + content).
- **AI-use false positive.** *Guard:* signals are notes, not violations. The panel decides against the disclosed policy.
- **Unsandboxed candidate code.** *Guard:* skill warns before running in non-sandboxed environments.
- **Bias inheritance.** *Guard:* the rubric is upstream of the skill. Audit the rubric separately if the dimensions correlate with disparate impact.
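
If the report is also kept as structured data, the first guard can be checked automatically before anything is written to the ATS. A minimal sketch: only `recommendation` is named by the skill, so the other forbidden keys and the function name `forbidden_fields` are assumptions. It checks keys rather than raw strings because the deterministic checks legitimately contain words like "passed" and "failed".

```python
# Only `recommendation` is named in the skill; the other two are assumed look-alikes.
FORBIDDEN_KEYS = {"recommendation", "verdict", "decision"}


def forbidden_fields(report: object, path: str = "") -> list[str]:
    """Return the dotted paths of any verdict-like keys found anywhere in the report."""
    found = []
    if isinstance(report, dict):
        for key, value in report.items():
            here = f"{path}.{key}" if path else str(key)
            if str(key).lower() in FORBIDDEN_KEYS:
                found.append(here)
            found.extend(forbidden_fields(value, here))
    elif isinstance(report, list):
        for index, item in enumerate(report):
            found.extend(forbidden_fields(item, f"{path}[{index}]"))
    return found


clean = {"aggregate": 17, "deterministic_checks": {"build": "passed"}}
drifted = {"aggregate": 17, "meta": {"recommendation": "hire"}}
print(forbidden_fields(clean), forbidden_fields(drifted))
```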

# Take-home rubric template

The take-home evaluator scores a submission against this rubric shape. Copy the JSON below to your role's rubric file (one per take-home format) and fill in every field. The skill reads the rubric; without it, scoring has nothing to anchor against.

A complete rubric takes 30-90 minutes to author per take-home format. Reuse across roles in the same family is high — a senior-backend take-home rubric is largely the same across companies once you've written it once.

## JSON shape

```json
{
  "take_home_id": "senior-backend-router-rewrite-v3",
  "version": "2026-04-15",
  "expected_deliverables": [
    "README.md",
    "src/**/*.rs",
    "tests/**/*.rs",
    "Cargo.toml"
  ],
  "build_commands": {
    "build": "cargo build --release",
    "test": "cargo test --all",
    "lint": "cargo clippy -- -D warnings"
  },
  "ai_use_policy_match": "syntax-help-only",
  "dimensions": [
    {
      "id": "correctness",
      "label": "Correctness",
      "anchors": {
        "1": "Compiles but does not pass the candidate's own tests, or does not handle the named happy path.",
        "2": "Handles the happy path; ignores the named edge cases (retries, partial failure).",
        "3": "Handles the happy path and the obvious edge cases; misses the subtle ones (clock skew, partition recovery).",
        "4": "Handles the named edge cases with explicit code paths; minor gaps acceptable.",
        "5": "Handles all named edge cases AND demonstrates a configurable retry budget / timeout structure that the rubric explicitly calls for."
      }
    },
    {
      "id": "code_quality",
      "label": "Code quality and structural decomposition",
      "anchors": {
        "1": "Single file, no decomposition; difficult to read.",
        "2": "Decomposed but the decomposition does not follow domain boundaries.",
        "3": "Readable but lacks structural decomposition that would scale past prototype.",
        "4": "Clear module boundaries that follow the domain; idiomatic for the language.",
        "5": "All of 4, plus the structural choices are documented in the README with the alternatives considered."
      }
    },
    {
      "id": "decision_documentation",
      "label": "Decision-making documented",
      "anchors": {
        "1": "No README, or the README only repeats the take-home brief.",
        "2": "README describes what was built without naming the engineering choices.",
        "3": "README names some choices without naming the alternatives.",
        "4": "README names the choices AND explains why each was picked over the named alternatives.",
        "5": "All of 4, plus the README cites the failure modes each choice mitigates."
      }
    },
    {
      "id": "error_handling",
      "label": "Error handling",
      "anchors": {
        "1": "Errors are caught and silently swallowed.",
        "2": "Error paths exist but do not differentiate between transient and permanent failures.",
        "3": "Differentiates transient vs. permanent; lacks structured error types.",
        "4": "Structured error types; retry policy is explicit per error class.",
        "5": "All of 4, plus error paths have explicit observability (logging / metrics / traces) named in the code."
      }
    },
    {
      "id": "test_coverage",
      "label": "Test coverage",
      "anchors": {
        "1": "No tests, or tests do not run.",
        "2": "Tests cover the happy path only.",
        "3": "Tests cover the happy path and one or two edge cases.",
        "4": "Tests cover the happy path and multiple edge cases (timeout, retry, partial failure).",
        "5": "All of 4, plus the network-partition test the rubric explicitly calls for."
      }
    }
  ],
  "rubric_fairness_check": {
    "no_bootcamp_vs_cs_proxies": "Anchors must score on observable behavior in the submission, not on idioms that proxy for educational background. 'Uses obscure language idioms' is forbidden as a positive signal.",
    "no_native-english-only_proxies": "Anchors must NOT score on README writing fluency beyond the level required to communicate the engineering decisions.",
    "documented_in_brief": "The take-home brief shared with the candidate must describe the rubric dimensions and approximate weighting. Surprise dimensions are unfair."
  }
}
```

## Per-field notes

- `take_home_id` — stable identifier for the take-home format. Reused across candidates for the same role family.
- `version` — semver or date. Bumped when the rubric is edited; the skill captures the version in the report so re-scoring against an edited rubric is visible.
- `expected_deliverables` — globs the skill walks against the submission. Missing deliverables surface in the report.
- `build_commands` — the skill runs these in step 2 (deterministic checks). Sandboxed execution required.
- `ai_use_policy_match` — should match the disclosure language in the take-home brief. Mismatch means the candidate's policy understanding doesn't match what the skill calibrates against.
- `dimensions` — array. Each dimension has an `id`, a `label`, and 5 anchor strings. Anchors should be observable behavior, not adjectives.
- `rubric_fairness_check` — three named fairness checks the skill confirms before scoring. If the rubric anchors violate any of these, the skill emits a warning and asks the rubric author to revise. (The skill does not refuse to score on a fairness-check violation, because the rubric is upstream and revising it is the right intervention. But it surfaces the issue.)
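
Before handing a rubric to the skill, a quick structural check catches most authoring slips. A minimal sketch, assuming every top-level field in the JSON shape above is required (the template says to fill in every field) and that anchors are keyed by the strings "1" through "5"; the function name `rubric_problems` is hypothetical:

```python
REQUIRED_FIELDS = (
    "take_home_id",
    "version",
    "expected_deliverables",
    "build_commands",
    "ai_use_policy_match",
    "dimensions",
    "rubric_fairness_check",
)
EXPECTED_ANCHORS = {"1", "2", "3", "4", "5"}


def rubric_problems(rubric: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rubric]
    for index, dim in enumerate(rubric.get("dimensions", [])):
        dim_id = dim.get("id", f"dimensions[{index}]")
        if not dim.get("label"):
            problems.append(f"{dim_id}: missing label")
        if set(dim.get("anchors", {})) != EXPECTED_ANCHORS:
            problems.append(f"{dim_id}: anchors must be keyed 1 through 5")
    return problems


# A deliberately incomplete draft; the anchor texts are placeholders.
draft = {
    "take_home_id": "example-v1",
    "version": "2026-04-15",
    "expected_deliverables": ["README.md"],
    "dimensions": [
        {
            "id": "correctness",
            "label": "Correctness",
            "anchors": {"1": "floor", "2": "two", "3": "three", "4": "four"},
        }
    ],
}
problems = rubric_problems(draft)
print("\n".join(problems))
```

This only validates shape. Whether the anchors describe observable behavior, and whether they pass the fairness checks, still needs a human read.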

## Authoring a new dimension

To add a dimension to an existing rubric:

1. Pick observable behavior, not adjectives. "Has good error handling" is not a dimension; "error paths differentiate transient vs. permanent failure" is.
2. Write the 5 anchors as five distinct observable behaviors, each strictly more demanding than the last.
3. Test the dimension on a known submission. Can you score it from the anchors alone, without the original code in your head? If not, the anchors are too vague.
4. Bump the rubric version.

## Authoring a new rubric (for a net-new take-home)

1. Start from the take-home brief. What does the brief tell the candidate to deliver? Those are the `expected_deliverables`.
2. What is the brief asking the candidate to demonstrate? Those become the `dimensions`. Aim for 4-6 dimensions; more than 6 and the panelist can't hold them.
3. Write the 1-anchor first (the floor: what does a minimal-effort submission look like?), then the 5-anchor (the ceiling: what does the strongest submission look like?), then fill 2-4 between.
4. Write the brief and the rubric in parallel. Anchors that don't show up in the brief are surprise dimensions; anchors in the brief that don't show up in the rubric are unscoreable promises.
5. Run the rubric on a known submission (a prior hire's submission, anonymized). Does it score them where you'd expect?

# AI-use policy mapping

The take-home evaluator runs pattern checks calibrated against the disclosed AI-use policy. The same submission produces different signal interpretation under different policies; this file documents the mapping.

The intent is to surface signals for the panel debrief, not to make a determination. AI-use detection is well known to be unreliable as a forensic tool; the right framing is "discuss with the candidate against the disclosed policy."

## Policy values

### `none-allowed`

The take-home brief told the candidate: "Do not use AI tools (Claude, ChatGPT, Copilot, Cursor, etc.) at any point during this assessment."

Pattern checks:

- **Verbatim public matches** — surface as `signal: verbatim_public_match` if any chunk ≥3 lines matches a known public AI-generated boilerplate exactly.
- **Style consistency vs. complexity** — surface as `signal: uniform_style` if comment style and naming idioms are suspiciously consistent across files of varying complexity.
- **Generic comments without engagement** — surface as `signal: generic_comments` if comments explain what the code does without engaging with the problem-specific decisions, beyond a per-file threshold (default: >40% of comments are generic).

Panel debrief framing: "Signals suggest possible AI use. Discuss with the candidate. The skill does not determine; the panel does, against the disclosed policy."

### `syntax-help-only`

The take-home brief told the candidate: "AI tools are allowed for syntax help (looking up the right method name, checking a regex, formatting). They are NOT allowed for solution generation (asking 'how would I implement this?', 'write me the function for X')."

Pattern checks: same as `none-allowed`, but with different framing. AI-generated boilerplate at the function level is a signal worth discussing; AI-completed identifier names are not.

Panel debrief framing: "AI use was permitted within bounds. Signals indicate where the candidate may have crossed the bounds. Discuss with the candidate."

### `ai-tools-allowed`

The take-home brief told the candidate: "AI tools are allowed throughout. Tell us what tools you used and how, in the README."

Pattern checks: still run, but signals are surfaced as informational only.

Panel debrief framing: "AI use was permitted. The signals are informational. The panel evaluates the SUBMISSION against the rubric; the question is what the candidate built and why, not which tools they used."

In the `ai-tools-allowed` policy, the panel should be ready to evaluate "the candidate's prompting and tool-use judgment" as a positive dimension if the rubric calls it out — many roles in 2026 explicitly want to see how candidates work with AI tools.
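
The mapping above can be pictured as a small lookup table: the same pattern checks run under every policy, and only the framing changes. A sketch under assumptions: the dict layout, the `informational_only` flag, and the function name `framing_for` are hypothetical, and the framing strings are shortened from the text above.

```python
POLICY_FRAMING = {
    "none-allowed": {
        "informational_only": False,
        "framing": "Signals suggest possible AI use. Discuss with the candidate.",
    },
    "syntax-help-only": {
        "informational_only": False,
        "framing": "AI use was permitted within bounds. Signals indicate where "
                   "the candidate may have crossed the bounds.",
    },
    "ai-tools-allowed": {
        "informational_only": True,
        "framing": "AI use was permitted. The signals are informational.",
    },
}


def framing_for(policy: str) -> dict:
    """Look up the debrief framing for a disclosed policy value."""
    try:
        return POLICY_FRAMING[policy]
    except KeyError:
        raise ValueError(f"unknown ai_use_policy: {policy!r}") from None


print(framing_for("ai-tools-allowed")["framing"])
```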

## What the skill does NOT do

- **Run a third-party AI-detection model** (GPTZero, Originality, etc.). Those tools have well-documented false-positive rates that climb to 30-50% on technical writing. Their findings do not survive a panel debrief.
- **Output a confidence score for "this was AI-generated."** No such confidence can be honestly assigned; the patterns are signals, not proof.
- **Block the report on signals.** Signals appear in the report. The panel decides. If `none-allowed` and the signals are strong, the panel typically schedules a follow-up conversation rather than auto-rejecting.

## Calibration

The skill's pattern thresholds (e.g. "40% generic comments") are tunable per take-home format. If your team's take-home format produces a lot of boilerplate naturally (e.g. it asks for a CRUD API), the threshold should be raised; if your format produces little boilerplate (e.g. it asks for a custom algorithm), the threshold should be lowered.

The defaults are calibrated against general-engineering take-homes. Tune in `config.json` per take-home format. Document the tuning in the rubric file's notes section.
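
The threshold logic itself is a single ratio comparison. A minimal sketch, assuming each comment in a file has already been labeled generic or not (that labeling is the model's judgment and is not shown here); the function name `generic_comment_signal` is hypothetical and the 0.40 default mirrors the ">40%" default mentioned above:

```python
from typing import Optional


def generic_comment_signal(comment_is_generic: list[bool], threshold: float = 0.40) -> Optional[str]:
    """Return the signal name when the share of generic comments exceeds the threshold."""
    if not comment_is_generic:
        return None  # a file with no comments cannot trip this check
    ratio = sum(comment_is_generic) / len(comment_is_generic)
    return "signal: generic_comments" if ratio > threshold else None


labels = [True, True, True, False, False]  # 3 of 5 comments judged generic
default_result = generic_comment_signal(labels)
crud_result = generic_comment_signal(labels, threshold=0.70)  # raised for a boilerplate-heavy format
print(default_result, crud_result)
```

Raising the threshold for a CRUD-style take-home, as in the second call, suppresses the note for the same file, which is the calibration effect described above.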

## Why surfaced as notes, not as a verdict

1. **False-positive cost is asymmetric.** Auto-rejecting a candidate based on an AI-use signal that turns out to be wrong damages the firm's brand and risks a discrimination claim if the signal correlates with disparate impact. Surfacing for discussion costs nothing.
2. **The disclosed policy is the contract.** What matters is whether the candidate followed the policy they were told about. The signal helps the panel ask; it does not answer.
3. **AI-use detection is unreliable.** Even the best-known detectors have unacceptable error rates. The skill does not pretend otherwise.
4. **Hire decisions involve more than this one signal.** A candidate with a strong submission and a possible AI-use signal under `syntax-help-only` is a candidate to talk to, not a candidate to drop.