claude-skill

Claudeによる面接デブリーフサマリー

Difficulty

初級

Setup time

30min

For

recruiter · hiring-manager · talent-acquisition

Recruiting & TA

Stack

候補者のパネル全体（すべての面接官の構造化スコアカード、オプションのBrightHireまたはMetaviewトランスクリプト、ロールルーブリック）を受け取り、同期デブリーフミーティングの前にパネルが読む、証拠に基づいたデブリーフブリーフを生成するClaude スキルです。ブリーフはルーブリック次元ごとの集約シグナル、合意と不一致の領域、パネルが解決すべき具体的な意思決定ポイント、シグナルが薄い場合のフォローアップ質問を提示します。採用・不採用の推薦は意図的に出力しません。それはパネルの仕事であり、そうでなければこのワークフローはEU AI法附属書IIIの高リスク体制および米国のほとんどの採用AI法規の対象に入ってしまいます。

下流効果：デブリーフが、誰が何をスコアしたかを90分かけてレビューする場ではなく、実際の不一致を30分で議論する場になります。

使う場面

次のすべてが当てはまる場合にスキルを実行します。

ロールルーブリックをカバーする少なくとも3名の面接官で構成されるフル面接ループが候補者について完了している。
すべての面接官がルーブリックに対する構造化スコアカードを提出済みである（自由記述のみのスコアカードはスキルのステップ1の入力チェックで失敗します。apps/web/public/artifacts/interview-debrief-summary-skill/SKILL.mdを参照してください）。
同期デブリーフミーティングまで少なくとも2時間ある。ブリーフはミーティング中にスキャンするものではなく、事前に読むために作られています。
ロールにapps/web/public/artifacts/interview-debrief-summary-skill/references/1-interview-rubric-template.mdの形式に合致した構造化ルーブリックがある。すべての次元に1〜5のアンカーテーブルがあり、すべてのアンカーに行動の説明がある。

使わない場面

このスキルはいくつかの隣接した用途には適していません。

採用・不採用の自動決定。 ブリーフは最終決定を出力しません。パネルのための意思決定ポイントを出力します。自動決定はEU AI法附属書III義務、ニューヨーク市地方法144のバイアス監査要件、イリノイ州AIVIの同意要件、メリーランド州HB 1202の通知ルールを発動させます。このスキルはその体制の外側に留まるよう設計されています。自動決定ロジックに組み込めばその設計を無効にしてしまいます。
採用担当者によるレビューなしに候補者へのフィードバックを送ること。 ブリーフは内部専用です。合成された論拠テキストは内部パネルの言い回しを使用しており、候補者にそのまま提示した場合、差別訴訟の証拠となりえます。
パネルデブリーフの会話を代替すること。 ブリーフはディスカッションへの入力であり、代替ではありません。「ブリーフがコンセンサスを示しているのでデブリーフはスキップしよう」という使い方は、references/3-disagreement-escalation.mdのルールが意図して表面化させる失敗モードです。摩擦のないコンセンサスそのものがキャリブレーションの懸念事項です。
面接官が1名のループ。 3名未満ではパネル合成は意味をなしません。1名の面接官向けフィードバックワークフローを使用してください。
同意のないトランスクリプト。 二者同意が必要な法域（CA、FL、IL、MD、MA、MT、NH、PA、WA）では、これは必須の停止条件です。面接開始時に候補者が録音に同意していない限り、BrightHireまたはMetaviewのトランスクリプトを渡さないでください。
ルーブリック自体の設問に関するキャリブレーションセッション。 パネルがルーブリック（候補者ではなく）について議論しているとき、ブリーフの次元別合成はノイズです。キャリブレーションセッションを別途実施し、ルーブリックが安定したらブリーフを再実行してください。

セットアップ

アーティファクトバンドルはapps/web/public/artifacts/interview-debrief-summary-skill/にあります。内容：

SKILL.md — フロントマター、起動ルール、6ステップの手順、リテラル出力フォーマット、注意点とガードのペアを含むClaude スキル定義。
references/1-interview-rubric-template.md — スキルが入力を検証する構造化ルーブリックの形式。
references/2-debrief-brief-format.md — ブリーフが書かれるリテラルMarkdownフォーマット。
references/3-disagreement-escalation.md — 決定論的な意思決定ポイントルール（レンジ、バーレイザー拒否権、HM対パネル乖離、yes全体の中の一つのno、カバレッジギャップ、根拠不十分クラスター）。

ワークフローを立ち上げるには：

バンドルをClaudeのスキルディレクトリに配置する。 interview-debrief-summary-skill/をプロジェクトの.claude/skills/（またはチームの共有スキル場所）以下に配置します。
ルーブリックテンプレートをロール固有のルーブリックに置き換える。 ロールごとにreferences/1-interview-rubric-template.mdを編集します。すべての次元に行動の説明を含む1〜5のアンカーテーブルが必要です。次元数は4〜7の間に保ってください。4未満ではパネルが三角測量できず、7超ではスコアカードが義務的に記入されて証拠の質が低下します。
スコアカードエクスポートを接続する。 スキルが構造化スコアカードを読めるようATSエクスポートを設定します。Ashby、Greenhouse、LeverはいずれもAPIでスコアカードJSONを公開しています。スキルはSKILL.mdのInputsブロックに従った{interviewer_id, interviewer_role, dimension_scores, evidence_notes}の配列を期待します。
既知の候補者でテストする。 パネルがすでにデブリーフを行い決定を下した候補者で実行します。ブリーフの意思決定ポイントと実際のデブリーフの議論トピックを比較します。ブリーフがパネルで議論されなかったトピックを提示する（またはパネルで議論されたトピックを見逃す）場合、プロンプトではなくまずルーブリックを調整してください。
監査ログディレクトリを設定する。 スキルは実行ごとにルーブリックSHA、面接官数、意思決定ポイント数、タイムスタンプを含む1行をaudit/<YYYY-MM>.jsonlに追記します。監査行に候補者のPIIは含まれません。このログは、ニューヨーク市地方法144/EU AI法の照会に対してワークフローを防衛可能にするものです。

スキルの実際の動作

6ステップの手順が次の順序で実行されます。この順序は重要です。決定論的な検証とマッピングがLLMによるルーブリックアンカーと質問の生成の前に行われ、最後の候補者体験パスが各ステージを個別に割り当てる際に見えない過負荷を検出するために組み立てられたループ全体を再読します。

ルーブリックと入力を検証する。 自由記述のみのルーブリック、3名未満の面接官、2名未満の面接官しかカバーしない次元、20文字未満のevidence_notes文字列で停止します。警告ではなく停止するのは意図的です。不完全な入力で生成されたブリーフはパネルの精神的なアンカーになってしまうためです。
次元ごとに集約する（決定論的）。 平均、レンジ、標準偏差、面接官ロール別の内訳を計算します。LLMはこの時点でまだスコアカードを参照しません。
意思決定ポイントを特定する（決定論的）。 references/3-disagreement-escalation.mdの6つのルールを適用します。意思決定ポイントはLLMが不一致と読み取ったものではなく、構造化シグナルに基づきます。
次元ごとに合成する。 LLMは次元ごとに2〜3文の合成を生成し、evidence_notesの文字列を引用符付きで逐語的に引用します。言い換えはバイアスが入り込む場所であり、スキルはそれを禁止しています。トランスクリプトが利用可能な場合、合成はタイムスタンプの範囲を引用します。「シグナル不十分 — フォローアップを推奨」は「推薦なし」とは区別された第一級の出力です。次元に関する証拠の欠如はパネルが把握すべき情報です。
キャリブレーションチェック。 候補者のスコア分布を同じロールの直近5件のデブリーフの移動平均と比較します。結果はブリーフの末尾の「キャリブレーションノート」ブロックに表示され、次元ごとのインラインには表示されません。意図：会話をフレーム化するものであり、スコアを調整するものではありません。
ブリーフを書いて停止する。 briefs/<candidate_id>-<YYYYMMDD>.mdに書き込みます。監査ログに1行追記します。「候補者に送信する」「Slackに投稿する」「ATSのステージを更新する」エンドポイントは呼び出しません。採用担当者と採用マネージャーが次のアクションを決定するまで、ブリーフは内部専用です。

出力フォーマットは固定です（apps/web/public/artifacts/interview-debrief-summary-skill/references/2-debrief-brief-format.mdを参照）。意図的に「推薦」セクションはなく、「集約シグナル」「次元別合成」「パネルへの意思決定ポイント」「フォローアップ質問」「キャリブレーションノート」「付録 — 面接官別証拠」のみです。採用決定を読み取ろうとする読者は、構造によってディスカッションに押し戻されます。

コスト

トランスクリプトなしの5名の面接官・5つのルーブリック次元での典型的なブリーフは、約18〜25kの入力トークン（ルーブリック＋スコアカード＋証拠ノート＋3つの参照ファイル）と4〜6kの出力トークンになります。現在のAPI料金でClaude Sonnetを使用した場合、デブリーフあたり約$0.10〜$0.15です。トランスクリプトを付加した場合（典型的な30分の面接トランスクリプト：各7〜10kトークン）、5名のループは1デブリーフあたり$0.40〜$0.70に達します。

節約時間の計算が本質的な数値です。典型的な5名のデブリーフミーティングは60〜90分かかり、そのうち30〜50分は実際の意思決定ディスカッションが始まる前の「それぞれが何を見たか」のラウンドロビンです。ブリーフはそのラウンドロビンを置き換えます。あるリファレンス組織で採用担当者がこのスキルを運用した結果、ブリーフを少なくとも4時間前に配布したループでは、デブリーフミーティングの平均が28分（75分から短縮）になったと報告されています。

これはデブリーフあたり約45分の節約で、通常5名の面接官にわたって、合計約3.75人・時間のミーティング時間が削減され、コストは1ドル未満です。

成功指標

注目すべき指標：ブリーフが少なくとも4時間前に配布されたループにおける、カレンダー分でのデブリーフミーティング時間の中央値です。カレンダーツール（またはAshbyの面接スケジュール履歴）から取得し、「ブリーフあり」と「ブリーフなし」のコホートに分けてください。目標軌道：ブリーフなしコホートでの60〜90分の中央値が、最初の4〜6週間でブリーフありコホートでの25〜40分の中央値に低下すること。

並行して監視するカウンター指標：ブリーフありコホートとブリーフなしコホートでの6ヶ月時点の採用後後悔率。デブリーフが速くなったにもかかわらず後悔率が上昇した場合、ブリーフは不一致を表面化させるのではなく平均化させています。references/3-disagreement-escalation.mdの不一致エスカレーションルールを厳格化してください（通常：レンジ閾値を2から1.5に下げるか、関連する次元に「3未満のスコア」トリガーを追加する）。

代替手段との比較

Ashbyの組み込みデブリーフ機能。 Ashbyはダッシュボードビューでスコアカードを集約し、パネル平均を計算します。書面による合成は生成せず、ルールによる意思決定ポイントの提示も、「4.0のコンセンサス」と「4.0の根拠不十分クラスター」の区別もしません。スキルが読み込むデータソースとしてAshbyのビューを使用し、ブリーフの代替としては使用しないでください。
Greenhouseのスコアカード集約。 Greenhouseは面接官ごとの採用・不採用の集計とパネルの推薦集計にロールアップします。この集計こそがスキルが設計で防ごうとする失敗モードです。スコア算術を決定として扱うようパネルを誘導し、パネルの「賛成」全体で平均化されてしまうバーレイザーの拒否権を見えなくします。
採用担当者による手動ノート。 採用担当者がすべてのスコアカードを読んでデブリーフの「テーマ」を1段落のメールにまとめるのが現状の多くのチームです。ループに対する採用担当者の読みが反映されており価値がありますが、採用担当者の時間に対して線形にスケールし、長い目で見ると「HMが望んでいることへのパターンマッチング」に偏る傾向があります。スキルは採用担当者を超えて一貫しており、採用担当者自身がブリーフを書く場合には通常フラグを立てない構造的な不一致（R3 — HM対パネル乖離）を提示します。
何もしない。 デフォルト — 全員が自分のノートを持ってデブリーフに参加し、ディスカッションはラウンドロビンで実施されます。四半期に10件未満の採用規模の小規模チームには問題ありません。それ以上の量では、ラウンドロビンがボトルネックになり、疲労が蓄積するとともにデブリーフの質が低下します。

注意点

1名の強い意見によるバイアス（最初に読んだスコアカードへのアンカリング）。 対策： ステップ2はLLMがいずれのスコアカードも参照する前に、すべての面接官にわたって決定論的に集約します。ステップ3のR3ルール（HM対パネル乖離）は単一の強い意見の乖離を明示的に意思決定ポイントとして提示します。合成は次元別ブロックで面接官名ではなく面接官ロール（HM、ピア、クロスファンクショナル、バーレイザー）で証拠を帰属させるため、ブリーフが上位面接官の方向に切り上げることを防ぎます。
根拠不十分な次元での見かけのコンセンサス。 対策： ステップ1でのevidence_notes最小長チェック（20文字未満で失敗）。ステップ3のR6（根拠不十分クラスター）は、3件以上のスコアが1ポイント以内にクラスターしているが証拠ノートの平均が30文字未満の次元を推薦なしではなくフォローアップ推奨として提示します。これは自由形式のデブリーフで最も一般的なサイレントな失敗です。
スコア算術による決定（平均3.5超を「採用」として扱う）。 対策： ブリーフは採用・不採用の推薦を出力しません。出力フォーマットには意図的に「推薦」ブロックがなく、意思決定ポイントとフォローアップのみです。決定を読み取ろうとする読者は、構造によってディスカッションに押し戻されます。
バーレイザーの拒否権がサイレントにオーバーライドされる。 対策： ステップ3のR2は、バーレイザーのスコアがパネル平均より2以上低い場合を自動的に意思決定ポイントとして提示します。バーレイザーの反対意見が平均化されてしまう状態でブリーフを生成することはできません。パネルの残りが全員一致であっても同様です。
合成に人口統計的なパターンが漏れ込む。 対策： 合成はevidence_notesの文字列を言い換えではなく逐語的に引用するため、LLMがある観察を保護クラス推論を示唆する言語に書き換えることを防ぎます。渡されたevidence_note自体に保護クラスのプロキシ（名前の出自、年齢推定、育児状況の推測、行動アンカーのない「カルチャーフィット」）が含まれている場合、スキルはステップ1で停止し、続行前にその問題のあるノートを書き直すよう提示します。
キャリブレーションノートが判決として過剰解釈される。 対策： キャリブレーションブロックはブリーフの末尾に追記され、次元ごとのインラインには表示されません。ブロックはアクションを提示するのではなく「許容範囲内」または「許容範囲外 — 議論してください」という言語を使用し、同じロールの過去のデブリーフが5件未満しかロードされていない場合はキャリブレーションチェック全体をスキップします。

スタック

AIプロバイダー： Claude（合成ステップにはSonnet、ルーブリックが曖昧な場合の初回実行検証にはOpus）。
ATS： Ashby、Greenhouse、またはLever — スコアカードデータソース。
オプションのトランスクリプト： BrightHireまたはMetaview。面接開始時に記録された二者同意の文書が必要です。
位置付け： このスキルが前提とするルーブリック設計の規律については構造化面接を参照してください。スキルは非構造化面接プロセスを救うことはできません。構造化プロセスが生み出すシグナルを合成するだけです。
ポリシーの枠組み： 候補者データの入力（特にトランスクリプトはGDPRおよびほとんどの米国州プライバシー法制度において機微な個人データ）に必要なTier-Aのエンタープライズ向けAI取り扱いについては法務チーム向けAIポリシーを参照してください。

GitHubでこのページを編集

Files in this artifact

Download all (.zip)

---
name: interview-debrief-summary
description: Synthesize a panel's per-interviewer scorecards (and optional transcripts) into an evidence-grounded debrief brief. Surfaces aggregate signal per rubric dimension, areas of agreement and disagreement, a recommended decision-point for the panel, and follow-up questions when signal is thin. Always stops short of issuing a hire/no-hire decision — the panel decides.
---

# Interview debrief summary

## When to invoke

Invoke this skill once a candidate's full interview loop has concluded and all interviewers have submitted their scorecards. The output is a brief the panel reads *before* the synchronous debrief meeting, so the meeting discusses the actual disagreements rather than being a 90-minute round of note-comparison.

Trigger conditions:

- All scheduled interviews completed in the ATS ([Ashby](/en/tools/ashby/), [Greenhouse](/en/tools/greenhouse/), [Lever](/en/tools/lever/)).
- Every interviewer has submitted a structured scorecard against the role rubric (free-text-only scorecards fail the input check in step 1).
- The debrief meeting is at least 2 hours away (so the brief can be read in advance, not skimmed during the call).

Do NOT invoke for:

- **Auto-deciding hire/no-hire.** This skill never emits a final decision. It emits an aggregate signal and a recommended decision-point for the panel to resolve. Auto-deciding would put the workflow inside EU AI Act Annex III high-risk obligations and most US state hiring-AI statutes (NYC LL 144, IL AIVI, MD HB 1202).
- **Sending feedback to the candidate without recruiter review.** The brief is internal-only. Synthesized rationale text can include phrasing that is fine for an internal panel but actionable as evidence in a discrimination claim if surfaced to the candidate verbatim.
- **Replacing the panel-debrief conversation.** The brief is the input to the discussion, not a substitute. Skipping the debrief because "the brief already shows consensus" is a failure mode this skill is designed to surface against (see `references/3-disagreement-escalation.md`).
- **Single-interviewer loops.** If only one interviewer was scheduled, do not invoke — there is nothing to aggregate. Run a different workflow (single-interviewer feedback) instead.
- **Transcripts without consent.** Do not pass [BrightHire](/en/tools/brighthire/) or [Metaview](/en/tools/metaview/) transcripts unless the candidate consented to recording at interview start. Two-party-consent jurisdictions (CA, FL, IL, MD, MA, MT, NH, PA, WA) make this a hard halt, not a guideline.

## Inputs

- Required: `candidate_id` — the ATS-internal candidate ID.
- Required: `role_rubric` — path to a Markdown file under `references/` with the structured rubric (dimensions, 1-5 anchor scale, anchor descriptions per level). Without this the skill refuses to run; an unstructured rubric is the most common cause of vague synthesis.
- Required: `scorecards` — an array of per-interviewer scorecard objects. Each object: `interviewer_id`, `interviewer_role` (one of `hiring_manager`, `peer`, `cross_functional`, `bar_raiser`), `dimension_scores` (map of dimension name to integer 1-5), `evidence_notes` (map of dimension name to free-text observation, minimum 20 characters per dimension).
- Required: `candidate_metadata` — `role_title`, `level_band`, `loop_type` (one of `onsite`, `virtual_onsite`, `phone_screen_panel`).
- Optional: `transcripts` — array of paths to BrightHire / Metaview transcript exports. When present, the skill cites supporting moments per evidence claim. When absent, the brief notes "transcript-unsupported" on each dimension synthesis.
- Optional: `prior_debriefs` — paths to previous debrief briefs for the same role, used by the calibration check in step 5.

## Reference files

Always read the following from `references/` before generating the brief. They contain the rubric scaffolding, the literal output format, and the disagreement-escalation rules. Without them the brief is generic and the guards that keep the synthesis defensible do not run.

- `references/1-interview-rubric-template.md` — the structured rubric template the role rubric must conform to. Replace the template content with your role-specific rubric before running. The skill validates the passed-in `role_rubric` against this shape.
- `references/2-debrief-brief-format.md` — the literal output format, including the per-dimension synthesis layout and the decision-point-for-the-panel block. The skill writes against this format verbatim — do not freestyle.
- `references/3-disagreement-escalation.md` — rules for when a disagreement gets surfaced as a decision-point versus left as a note. Includes the bar-raiser-veto and the hiring-manager-vs-peer-divergence rules.

## Method

Run these six steps in order. Steps 1-3 are deterministic input validation and aggregation; only step 4 uses the LLM for synthesis. Running the LLM over an unvalidated, free-text-only rubric or over a single interviewer's scorecard produces output that is fast, confident, and unusable.

### 1. Validate the rubric and inputs

Open `role_rubric` and verify it conforms to the shape in `references/1-interview-rubric-template.md`: every dimension has a 1-5 anchor table, every anchor has a behavioral description, no dimension allows free-text scoring only. Halt if any check fails — surface the offending lines.

Then validate `scorecards`:

- At least 3 distinct interviewers (below 3, panel synthesis is not meaningful — surface a one-interviewer single-feedback note instead).
- Every dimension in the rubric is scored by at least 2 interviewers (gaps mean the loop did not cover the dimension; surface as a follow-up question rather than synthesizing absent signal).
- `evidence_notes` strings ≥ 20 characters on every score (free-text-only interviewers get bumped back to re-fill before the brief runs).

The choice to halt rather than warn is intentional: a brief generated on partial inputs becomes the panel's mental anchor, even when the generator notes the partial inputs. Halting forces the missing inputs to be filled before the discussion frame is set.

### 2. Aggregate per dimension (deterministic)

For each rubric dimension, compute:

- Mean score across interviewers.
- Min and max (the range — the disagreement signal).
- Standard deviation (used in step 4 to weight whether to surface as a decision-point).
- Per-interviewer-role breakdown (hiring_manager, peer, cross_functional, bar_raiser scores listed separately so structural disagreements surface).

Why structured rubric instead of free-form synthesis: a free-form synthesis loses the per-dimension comparability that lets the panel discuss specific evidence rather than overall impressions. Without per-dimension comparability, the debrief reverts to "everyone shares their gut feeling, loudest voice wins" — which is the failure mode this entire skill exists to prevent.

### 3. Identify decision-points (deterministic)

Apply the rules from `references/3-disagreement-escalation.md`:

- **Range ≥ 2 across interviewers on any single dimension** → surface as a decision-point.
- **Bar-raiser score ≥ 2 below the panel mean on any dimension** → surface as a decision-point regardless of range (bar-raiser veto semantics).
- **Hiring-manager score ≥ 2 above any other interviewer's score** → surface as a decision-point (single-strong-opinion guard).
- **No-hire from any one interviewer when the rest are hire** → surface as a decision-point with the dissenting evidence verbatim.

These rules run before the LLM synthesizes, so the decision-points are based on the structured signal, not on what the LLM thinks reads as disagreement. The synthesis in step 4 then explains the underlying disagreement; it does not pick which disagreements matter.

### 4. Synthesize per dimension

For each rubric dimension, the LLM produces:

- A two-to-three-sentence synthesis of what the panel saw, grounded in `evidence_notes` strings cited verbatim (no paraphrasing — paraphrasing is where bias enters).
- The evidence supporting the higher scores, attributed to interviewer role (not name — names go in the appendix).
- The evidence supporting the lower scores, attributed similarly.
- When transcripts are available, the timestamp range in the transcript where the supporting evidence appeared. Format: `BrightHire 14:22-15:08`. When transcripts are absent, write `transcript-unsupported` and do not infer.

Why "insufficient signal" is a first-class output, not a fallback: the absence of evidence for a dimension is itself information the panel needs. A dimension with two scores both based on 20-character evidence notes is not "consensus at 4.0"; it is "two interviewers guessed at 4.0". The brief writes "insufficient signal — recommend follow-up" rather than "consensus" in that case. This is different from "no recommendation", which would withhold all output and leave the panel without a structured starting point.

### 5. Calibration check

If `prior_debriefs` is provided, compare the score distribution against the previous 5 debriefs for the same role. Flag if:

- This candidate's mean is more than 1 standard deviation above the rolling mean (possible halo / overscoring).
- This candidate's mean is more than 1 standard deviation below the rolling mean for a dimension where the role has historically scored high (possible single-strong-negative-opinion drag).

Calibration findings appear as a "Calibration note" block at the end of the brief, never inline in the per-dimension synthesis. The intent is to give the panel a frame for the discussion, not to override the specific signal on this candidate.

### 6. Write the brief and stop

Write to `briefs/<candidate_id>-<YYYYMMDD>.md` per the format in `references/2-debrief-brief-format.md`. Append a single line to `audit/<YYYY-MM>.jsonl`: `run_id`, `candidate_id`, `role`, `rubric_sha256`, `interviewer_count`, `dimensions_count`, `decision_points_count`, `transcripts_attached` (boolean), `model_id`, `timestamp`. No candidate PII in the audit line.

Do not call any "send to candidate", "post to Slack channel", or "update ATS stage" endpoint. The brief is internal to the panel until the recruiter and hiring manager decide what to do with the synthesis.

## Output format

```markdown
# Interview debrief brief — {candidate_id} · {role_title} · {level_band}

Generated: {ISO timestamp} · Loop: {loop_type} · Interviewers: {n} ·
Rubric SHA: {short} · Transcripts: {yes|no}

## Aggregate signal

| Dimension | Mean | Range | HM | Peer | XFN | Bar-raiser |
|---|---|---|---|---|---|---|
| Technical depth | 4.2 | 4-5 | 5 | 4 | 4 | 4 |
| Systems design | 3.0 | 2-4 | 4 | 3 | 3 | 2 |
| Communication | 4.0 | 3-5 | 4 | 4 | 5 | 3 |
| Execution under pressure | 3.2 | 3-4 | 3 | 3 | 4 | 3 |
| Ownership | 4.5 | 4-5 | 5 | 4 | 5 | 4 |

## Per-dimension synthesis

### Technical depth — mean 4.2, range 4-5

The panel saw consistent depth on backend systems work. HM cites
"led the migration from a Postgres monolith to a sharded
Citus cluster, owned the cutover playbook end-to-end" (BrightHire
14:22-15:08). Peer and XFN cite the same migration with corroborating
detail. Bar-raiser scored 4 (not 5) on the basis that the candidate's
description of the rollback plan was "more reactive than I'd want at
this level" (transcript-unsupported, scorecard only). No decision-point
surfaced — disagreement is within tolerance.

### Systems design — mean 3.0, range 2-4 — DECISION-POINT

Range exceeds the threshold. HM scored 4 citing "drew the right
boundary between sync and async paths". Bar-raiser scored 2 citing
"could not articulate the trade-off between leader-follower and
multi-leader replication when prompted" (BrightHire 32:10-34:45). The
panel needs to resolve whether the bar-raiser's specific concern about
replication-topology fluency is load-bearing for the level, or whether
it is one weak moment in an otherwise strong design conversation.

### Communication — insufficient signal — RECOMMEND FOLLOW-UP

Two interviewers (HM, Peer) scored 4 with evidence notes under 30
characters. XFN scored 5 with no evidence note. Bar-raiser scored 3
with the note "felt scripted on the situational question, but I may
be reading too much in." This is not consensus at 4.0; it is
under-evidenced. Recommend follow-up question in the next round if
the candidate advances, or a 30-minute follow-up call with the
bar-raiser to walk through the specific moments.

[continues for each remaining dimension]

## Decision-points for the panel

1. **Systems design — replication-topology fluency.** Bar-raiser scored
2, HM scored 4. Resolve: is fluency on multi-leader vs
leader-follower trade-offs required at this level, or is the broader
design judgment sufficient?
2. **Communication — under-evidenced consensus.** Three scores cluster
at 4-5 but evidence notes are thin. Resolve: do we trust the cluster,
or do we ask for a follow-up signal?
3. **Bar-raiser dissent on technical depth.** Bar-raiser at 4 vs panel
mean of 4.2 — within tolerance, but the rollback-plan concern is
worth airing as a development area if hire.

## Follow-up questions if signal is thin

- For Communication: a 30-minute follow-up with the bar-raiser walking
through the situational-question moments.
- For Systems design: a take-home or whiteboard follow-up specifically
on replication topology trade-offs.

## Calibration note

This candidate's mean score across dimensions (3.78) is 0.4 standard
deviations above the rolling mean for the last 5 senior-backend
debriefs (3.51). Within tolerance — no calibration concern flagged.

## Appendix — per-interviewer evidence

[Per-interviewer scorecards, with names, in full. The synthesis above
attributes by role only; names live here so the panel can ask
specific interviewers to elaborate without ambiguity.]
```

## Watch-outs

- **Bias from one strong opinion (anchoring on the first scorecard).** *Guard:* step 2 aggregates deterministically across all interviewers before the LLM sees any single scorecard, and step 3's hiring-manager-vs-peer-divergence rule explicitly surfaces single-strong-opinion divergence as a decision-point. The LLM does not "round up" toward the senior interviewer's score.
- **False consensus on under-evidenced dimensions.** *Guard:* `evidence_notes` minimum-length check in step 1 (≥ 20 chars), and step 4's "insufficient signal" first-class output. A dimension where three interviewers scored 4 with one-word evidence notes is written as "insufficient signal — recommend follow-up", not as "consensus at 4.0". This is the most common silent failure of free-form debriefs.
- **Score-arithmetic-as-decision (treating mean ≥ 3.5 as "hire").** *Guard:* the brief never emits a hire/no-hire recommendation. It emits decision-points for the panel. The output format intentionally has no "Recommendation" block — only "Decision-points for the panel" and "Follow-up questions". A reader who tries to read off a decision finds the structure pushes them back to discussion.
- **Bar-raiser veto silently overridden.** *Guard:* step 3's rule surfaces any bar-raiser score ≥ 2 below the panel mean as a decision-point automatically. The brief cannot be generated in a state where a bar-raiser dissent is averaged away.
- **Demographic patterns leaking into synthesis.** *Guard:* the synthesis cites `evidence_notes` strings verbatim rather than paraphrasing, which prevents the LLM from rewriting an observation into language that telegraphs a protected-class inference. If a passed-in `evidence_note` itself contains protected-class proxies, the skill halts in step 1 and surfaces the offending note for re-write before continuing.
- **Calibration note overinterpreted as a verdict.** *Guard:* the calibration block is appended at the end of the brief, never inline per dimension. The intent is to frame the conversation, not adjust individual scores. The brief explicitly says "within tolerance" or "outside tolerance — discuss" rather than suggesting an action.

# Interview rubric — TEMPLATE

> Replace this template's contents with your role-specific rubric.
> The interview-debrief-summary skill validates the passed-in rubric
> against the shape below in step 1 and halts if any dimension is
> missing the required structure. Do not loosen the structure to make
> a vague rubric pass — fix the rubric instead.

## Role metadata

- **Role title**: {e.g. Senior Backend Engineer}
- **Level band**: {e.g. L5 / Senior IC}
- **EEOC job category**: {e.g. Professionals — required for audit log}
- **Last edited**: {YYYY-MM-DD}
- **Owner**: {hiring manager name + recruiter name}

## Score scale

All dimensions use the same 1-5 scale. Anchors below are the *minimum* behavior required at each level; a candidate scoring above the level exceeds the anchor in addition to meeting it.

| Score | Label | Meaning |
|---|---|---|
| 1 | Strong no | Misses the bar by a wide margin; would block the team |
| 2 | No | Below the bar with no clear path to growing into it in 6mo |
| 3 | Mixed | At the bar with a meaningful gap; viable with development plan |
| 4 | Yes | At or above the bar; ready to contribute on day one |
| 5 | Strong yes | Above the bar with capacity to lift the team |

## Dimensions

Each dimension below MUST have a 1-5 anchor table with behavioral descriptions. Free-text-only anchors fail the rubric validation in step 1.

### Dimension 1 — {e.g. Technical depth}

What this dimension tests: {one sentence}

| Score | Behavioral anchor |
|---|---|
| 1 | {behavior — e.g. "Cannot describe systems they have built without prompting"} |
| 2 | {behavior} |
| 3 | {behavior} |
| 4 | {behavior} |
| 5 | {behavior — e.g. "Walks through trade-offs at multiple levels of the stack unprompted, with concrete examples"} |

Common evidence sources: take-home review, system-design conversation, deep-dive on past projects.

### Dimension 2 — {e.g. Systems design}

What this dimension tests: {one sentence}

| Score | Behavioral anchor |
|---|---|
| 1 | {behavior} |
| 2 | {behavior} |
| 3 | {behavior} |
| 4 | {behavior} |
| 5 | {behavior} |

Common evidence sources: system-design interview, architecture review.

### Dimension 3 — {e.g. Communication}

What this dimension tests: {one sentence}

| Score | Behavioral anchor |
|---|---|
| 1 | {behavior} |
| 2 | {behavior} |
| 3 | {behavior} |
| 4 | {behavior} |
| 5 | {behavior} |

Common evidence sources: every interview; hiring-manager screen explicitly tests structured explanation.

### Dimension 4 — {e.g. Execution under pressure}

What this dimension tests: {one sentence}

| Score | Behavioral anchor |
|---|---|
| 1 | {behavior} |
| 2 | {behavior} |
| 3 | {behavior} |
| 4 | {behavior} |
| 5 | {behavior} |

### Dimension 5 — {e.g. Ownership}

What this dimension tests: {one sentence}

| Score | Behavioral anchor |
|---|---|
| 1 | {behavior} |
| 2 | {behavior} |
| 3 | {behavior} |
| 4 | {behavior} |
| 5 | {behavior} |

> Add or remove dimensions to match the role. Keep the count between
> 4 and 7. Below 4, the panel cannot triangulate; above 7, scorecards
> get filled out as a chore and evidence quality degrades.

## Interviewer-role assignment

Per loop, every dimension should be covered by at least 2 interviewers of different `interviewer_role`s (the skill validates this in step 1). Suggested coverage matrix:

| Dimension | Hiring manager | Peer | Cross-functional | Bar-raiser |
|---|---|---|---|---|
| Technical depth | yes | yes | — | yes |
| Systems design | — | yes | — | yes |
| Communication | yes | yes | yes | yes |
| Execution under pressure | yes | — | yes | — |
| Ownership | yes | yes | — | yes |

## Disqualifiers

Single signals that result in a no-hire regardless of other dimensions. Keep this list short and mechanical. The skill flags these prominently in the brief if any interviewer notes them.

- {e.g. "Misrepresented past role title or scope" — backed by reference check}
- {e.g. "Hostile or dismissive toward an interviewer or coordinator" — noted by 2+ interviewers}

## Things this rubric does NOT score

These get explicitly excluded so they cannot creep back as "intuition":

- "Culture fit" without behavioral anchors — replace with the specific behaviors you mean.
- School prestige as a standalone signal — appears in pattern-match dimensions only.
- Tenure pattern interpretation that penalizes parental leave or health gaps.
- Any inference from photo, name origin, or pronoun usage.

# Debrief brief — output format

> The interview-debrief-summary skill writes against this format
> verbatim in step 6. Do not freestyle the structure — the panel
> reads many of these and consistency is what makes them scannable.
> Replace `{placeholders}` with real values; keep the section
> headings and ordering exactly as below.

## Required structure

Every brief MUST contain these sections in this order:

1. Title line
2. Header (generated, loop type, interviewers, rubric SHA, transcripts)
3. Aggregate signal table
4. Per-dimension synthesis (one block per rubric dimension)
5. Decision-points for the panel
6. Follow-up questions if signal is thin
7. Calibration note (always present; says "no prior data" if first run)
8. Appendix — per-interviewer evidence

There is intentionally NO "Recommendation" section. The brief never emits a hire/no-hire. The panel resolves the decision-points and makes the call in the synchronous debrief.

## Template

```markdown
# Interview debrief brief — {candidate_id} · {role_title} · {level_band}

Generated: {ISO timestamp} · Loop: {loop_type} · Interviewers: {n} ·
Rubric SHA: {short} · Transcripts: {yes|no}

## Aggregate signal

| Dimension | Mean | Range | HM | Peer | XFN | Bar-raiser |
|---|---|---|---|---|---|---|
| {dimension 1} | {mean} | {min}-{max} | {score} | {score} | {score} | {score} |
| {dimension 2} | ... | ... | ... | ... | ... | ... |

Cells with no score: dash (`—`), not zero. The skill never imputes
a missing score.

## Per-dimension synthesis

For each dimension, one block in this exact shape:

### {Dimension name} — mean {x.x}, range {min}-{max}{ — DECISION-POINT|RECOMMEND FOLLOW-UP}{empty if neither}

{Two-to-three-sentence synthesis. Cite `evidence_notes` strings
verbatim in quotation marks. Attribute to interviewer ROLE, not
name (HM, Peer, XFN, Bar-raiser). When transcripts are available,
include the timestamp range as `(BrightHire 14:22-15:08)` after the
quoted evidence. When transcripts are absent, write
`transcript-unsupported` after the quoted evidence and do not infer.}

{Optional second paragraph: the disagreement, if a decision-point.
Names the specific resolved-vs-unresolved tension for the panel to
discuss. Two sentences max.}

The "DECISION-POINT" suffix is added when step 3's escalation rules
fire. The "RECOMMEND FOLLOW-UP" suffix is added when the synthesis
in step 4 marks the dimension as insufficient signal. Neither suffix
when the dimension is consensus-with-evidence.

## Decision-points for the panel

Numbered list. Each item names the dimension, the divergence, and the
specific question to resolve. Three to five items in a typical brief.
If there are zero decision-points, write the literal sentence:

> No decision-points surfaced. The panel may want to confirm that the
> consensus reflects shared evidence rather than shared assumptions
> before treating the loop as resolved.

(That fallback is intentional — frictionless consensus is itself a
calibration concern.)

## Follow-up questions if signal is thin

Bulleted list. Each bullet is a specific follow-up the panel could
run before deciding: a 30-minute follow-up call, a take-home, a
reference check on a specific dimension, a re-interview by a
specific role. Empty list is acceptable; write "None — signal is
sufficient on every dimension" if so.

## Calibration note

One paragraph. Compares this candidate's per-dimension score
distribution against the rolling mean from `prior_debriefs` (last 5
debriefs for the same role). Format:

> This candidate's mean score across dimensions ({x.xx}) is {n.n}
> standard deviations {above|below} the rolling mean for the last
> {k} {role_title} debriefs ({y.yy}). {Within tolerance — no
> calibration concern flagged. | Outside tolerance — discuss whether
> the rubric is being applied consistently this loop.}

If `prior_debriefs` was not provided: write "No prior debriefs
loaded. Calibration check skipped — recommend running with
`prior_debriefs` populated once 5+ same-role debriefs exist."

## Appendix — per-interviewer evidence

Per-interviewer scorecards in full, with names. The synthesis above
attributes by role only; names live here so the panel can ask
specific interviewers to elaborate without ambiguity.

For each interviewer:

### {Interviewer name} — {interviewer_role} — overall {summary score if scorecard provides one, else dash}

| Dimension | Score | Evidence |
|---|---|---|
| {dimension 1} | {score} | {evidence_notes string verbatim} |
| {dimension 2} | ... | ... |
```

## Formatting rules

- Soft-wrap prose paragraphs in the synthesis blocks. Tables, headings, and block quotes are preserved verbatim.
- Use `—` (em-dash) for missing values in the aggregate-signal table.
- Quote evidence notes verbatim in quotation marks. Do not paraphrase. Paraphrasing is where bias and false certainty enter.
- Interviewer-role labels in the synthesis: `HM`, `Peer`, `XFN`, `Bar-raiser`. Always exactly these strings — the brief is sometimes parsed downstream for analytics.
- Timestamp citations: `(Tool TimecodeStart-TimecodeEnd)`. The tool name is `BrightHire` or `Metaview`. Timecodes are `mm:ss-mm:ss`.
- File location: `briefs/<candidate_id>-<YYYYMMDD>.md`.

## What this format intentionally does NOT include

- A "Recommendation" or "Decision" section.
- A confidence score.
- A summary "lean" toward hire or no-hire.
- An overall pass/fail at the top of the brief.

These omissions are load-bearing. Every one of them, in earlier iterations, became the one thing the panel read — turning the brief into the decision and the meeting into a rubber stamp.

# Disagreement escalation rules

> The interview-debrief-summary skill applies these rules in step 3
> (deterministic decision-point identification) before the LLM runs
> the per-dimension synthesis. The rules are deliberately strict —
> the cost of surfacing a non-disagreement as a decision-point is
> 2 minutes of panel discussion; the cost of averaging a real
> disagreement away is a regretted hire or a missed strong candidate.

## Rules

Apply each rule independently. A dimension that triggers any rule is flagged with the `DECISION-POINT` suffix in the synthesis output and appears in the "Decision-points for the panel" section.

### R1. Range-on-dimension

**Trigger:** any single dimension where `max(scores) - min(scores) >= 2`.

**Rationale:** a 2-point spread on a 1-5 scale crosses a meaningful behavioral anchor (e.g. "below the bar" to "at the bar"). Two interviewers seeing the same candidate that differently is a calibration issue or an evidence-asymmetry issue — both worth discussing.

**Example:** Systems design — HM 4, Peer 3, Bar-raiser 2. Range = 2. Surface as decision-point.

### R2. Bar-raiser-veto

**Trigger:** `bar_raiser_score <= panel_mean - 2` on any dimension, where `panel_mean` excludes the bar-raiser.

**Rationale:** the bar-raiser role exists to apply a level-consistent standard across many loops. A bar-raiser scoring 2+ below the rest of the panel on a dimension means the panel is calibrated to a different standard than the one the bar-raiser is holding. That gap is load-bearing — not a tie-breaker, but a calibration discussion.

**Example:** Technical depth — HM 5, Peer 4, XFN 4 (panel mean 4.33), Bar-raiser 2. Surface as decision-point.

**Edge case:** if there is no bar-raiser in the loop, this rule does not fire. The brief notes "no bar-raiser in loop" in the calibration block.

### R3. Hiring-manager-vs-panel divergence

**Trigger:** `hiring_manager_score >= max(other_scores) + 2` on any dimension.

**Rationale:** the hiring manager is the most consequential single voice in most hiring decisions and the most prone to single-strong- opinion bias. A hiring manager scoring 2+ above every other interviewer is the pattern that produces "we hired them because the HM loved them and nobody pushed back."

**Example:** Communication — HM 5, Peer 3, XFN 3, Bar-raiser 3. HM is 2 above max of others. Surface as decision-point.

**Note:** this rule fires *upward* (HM higher than panel), not downward. A hiring manager scoring well below the panel typically self-resolves in the meeting; the upward case is the one that needs structural escalation.

### R4. Single-no-among-yes

**Trigger:** any single interviewer's overall scorecard recommendation is `no_hire` or `strong_no` while every other interviewer recommends `hire` or `strong_hire`.

**Rationale:** a single dissenting no-hire is the highest-information signal in a debrief — either the dissenter saw something the panel missed (in which case the hire is at risk) or the dissenter has a miscalibration on this candidate (in which case it is a coaching opportunity for the dissenter). Both outcomes require explicit discussion. Averaging the dissent away is the failure mode.

**Example:** HM hire, Peer strong_hire, XFN hire, Bar-raiser no_hire. Surface as decision-point with the bar-raiser's evidence verbatim.

### R5. Coverage-gap

**Trigger:** any rubric dimension with fewer than 2 interviewer scores.

**Rationale:** a dimension scored by only one interviewer is not a panel signal; it is one person's read. The brief surfaces the gap as a follow-up question rather than as a decision-point — the recommended action is to gather more signal, not to debate the existing one.

**Output location:** appears in "Follow-up questions if signal is thin", not in "Decision-points for the panel".

### R6. Under-evidenced cluster

**Trigger:** a dimension where 3+ interviewers' scores cluster within 1 point AND the mean evidence-note length across those interviewers is below 30 characters.

**Rationale:** a tight cluster of scores backed by one-sentence evidence is "consensus" only in the same sense that "everyone agreed the food was fine" is a restaurant review. The synthesis writes it as `RECOMMEND FOLLOW-UP` rather than as agreement.

**Output location:** appears as `RECOMMEND FOLLOW-UP` suffix on the per-dimension synthesis AND in "Follow-up questions if signal is thin".

## Rules NOT applied

These were considered and explicitly rejected:

- **"Average score below 3.0 → no-hire decision-point."** Rejected because it conflates an aggregation rule (the brief never aggregates to a hire/no-hire) with an escalation rule (the brief surfaces disagreements). The panel decides whether mean 2.8 means no-hire; the brief just shows that mean 2.8 is the score.
- **"More than X minutes of recorded silence in transcript → flag rapport issue."** Rejected because rapport interpretation from silence is exactly the kind of inference that surfaces protected-class proxies. Transcripts are used for evidence citation only, never for inferred-state analysis.
- **"Panel-tenure-weighted mean."** Rejected because weighting a senior interviewer's score above a junior one builds the seniority bias the bar-raiser role is supposed to neutralize. All scores are equal-weight in the aggregation; structural disagreements (R2, R3) are surfaced separately.

## When the rules conflict

If a single dimension triggers multiple rules (e.g. R1 AND R2 both fire on Systems design), the synthesis surfaces it as one decision-point with both triggers cited. The "Decision-points for the panel" entry names both ("range across panel of 2 points, including bar-raiser scoring 2 below panel mean").

If the brief has more than 5 decision-points, the brief surfaces all of them but adds a paragraph at the top of the section noting that the loop produced unusually high disagreement and the calibration of the rubric (or the loop composition) may itself be the discussion to have first.