claude-skill

Claude を使った任意の契約書からの条項抽出

Difficulty

初級

Setup time

20min

For

legal-ops · in-house-counsel · paralegal · contract-manager

Legal Ops

Stack

単一の締結済み契約書（.docx またはテキストレイヤー付き .pdf）を受け取り、CLM が実際にキーとする条項を含む引用根拠付き JSON レコードを出力する Claude スキルです。準拠法、責任上限、補償、期間、自動更新、解除トリガー、支払い条件、IP 所有権、秘密保持期間に加え、設定したカスタムフィールド（データ所在地、最恵国待遇、支配権の変更、権利譲渡）を抽出します。抽出されたすべての値に逐語的な抜粋、{page, char_span} 引用、信頼スコアが付くため、下流のレビュアーは契約書を読み直すことなく数秒で確認できます。

このページでは実行するタイミング、明示的に実行しないタイミング、コスト、本番リポジトリに向ける前にサイズを測るべき名前付きの失敗モードをカバーします。

使用するタイミング

特権クリアランスを既に通過した契約書に対して構造化された出力が必要な場合にこのスキルを使用します：

CLM データのバックフィル。 フラットファイルのリポジトリ（Box、SharePoint、ネットワークドライブ）を継承し、パラリーガルの四半期分を消費せずに Ironclad または Agiloft のメタデータフィールドを作成する必要がある場合。
条項ライブラリの構築。 ポートフォリオ全体のすべての「責任上限」条項が欲しく、条項ライブラリがプレイブックの定めた立場ではなく実際に合意したことを反映するようにしたい場合。
デューデリジェンス。 ディールがクローズする前に 48 時間で対象会社の契約セット全体の支配権の変更、権利譲渡、最恵国顧客条項を表面化する必要がある場合。
更新のトリアージ。 通知期間日数フィールドが入力された状態で、今後 90 日以内に自動更新されるすべての契約書にフラグを立てる必要がある場合。

アーティファクトバンドルは apps/web/public/artifacts/clause-extraction-claude-skill/ にあります：

SKILL.md — 方法、出力フォーマット、注意事項を含むスキル定義
references/1-clause-taxonomy.md — 契約タイプごとに抽出する条項（見出しと同義語を含む）
references/2-output-schema.json — すべてのレコードが検証される JSON スキーマ（バージョンに固定してください）
references/3-citation-format.md — 引用文法と「存在しない」/「抽出できなかった」フォールバックのルール

使用しないタイミング

このスキルは意図的に範囲が狭いです。以下のいずれかに該当する場合は呼び出しを拒否してください。

アクティブな交渉中の特権ドラフト。 ほとんどの法務チームの AI ポリシー（および当社が推奨する AI ポリシーテンプレート）は、進行中の交渉ドラフト — 特に社外弁護士の修正と弁護士作業成果物 — に明確な線引きをしています。このスキルは特権の問題をすでに解決した締結済みまたは最終近くの契約書向けです。文書が特権クリアランスを通過しているか不確かな場合、答えはノーです。
非 Tier-A AI ベンダー経由のもの。 社内承認済みの Tier-A エンドポイント（Anthropic API 直接、または企業向け Claude テナント）に対してのみ実行してください。コンシューマー向けチャットボットは絶対に不可です。ブラウザプラグインも不可です。「裏で Claude」と謳う未検証の SaaS ラッパーも不可です。契約書を Tier-B ベンダー経由で送ることは特権漏洩のリスクがあります — AI ポリシーを回避するのではなく、呼び出しを拒否してください。スキル自体にはエンドポイント許可リストがハードコードされています。Claude Code または企業テナントの Claude.ai で実行している場合は問題ありません。
下書きまたはリドライン。 このスキルは読み取りのみです。リドラインには、別の契約書リドラインスキルを使用してください。
法的解釈。 出力はテキスト + 引用です。12 ヶ月の責任上限がディールのコンテキストで「十分か」は判断の問題であり、弁護士が担います。

セットアップ

バンドルを ~/.claude/skills/（Claude Code）に配置するか、references/ ディレクトリと SKILL.md を Claude.ai プロジェクトにアップロードします。
references/1-clause-taxonomy.md の内容を自社の実際の分類法に置き換えます。デフォルト分類法には一般的な MSA 条項があります。ほとんどの会社は 5〜10 のカスタムフィールドを追加します（管轄ごとのデータ所在地、支配権変更の除外、非勧誘期間、最恵国待遇の範囲）。
references/2-output-schema.json をバージョンに固定します。分類法の変更のたびにスキーマとスキルの extractor_version をバンプして、下流のコンシューマーがドリフトを検出できるようにします。
既知の契約書で実行します — CLM に値がすでにある契約書を選んでください。抽出された JSON を CLM レコードと比較します。出力が一致するまで分類法の同義語を繰り返し改善します。
スケールで実行します。スキルは契約書ごとです。n8n、シェルループ、または CLM のインテークフックでバッチを調整します。

スキルの実際の動作

4 ステップ、順番通り。

レイアウト保存付きテキスト抽出。 .docx は docx XML 経由で解析されます。.pdf は pdfplumber 経由でページ番号とバウンディングボックスの文字スパンが残ります。PDF にテキストレイヤーがない場合（スキャン画像）、スキルは空のテキストを出力するのではなく error: "ocr_required" で中断します。スキャン PDF を OCR にルーティングすることは別の上流の問題です。このスキルは OCR を行いません。スキャンから「クリーンな」空の抽出を静かに生成することは、大声で失敗するよりも悪いからです。
引用根拠付き抽出、条項ごとに 1 パス。 分類法の各条項について：見出し + 同義語の一致で候補段落を見つけ、候補段落のみ（契約書全体ではなく）を条項定義と共に Claude に渡し、値、≤280 文字の逐語的抜粋、{page, char_span} 引用、high | medium | low の信頼スコアを返します。ソース段落の部分文字列と完全に一致しない抜粋はすべて拒否されます — これは幻覚ガードであり、交渉の余地はありません。条項ごとのプロンプト（1 つのメガプロンプトではなく）により、失敗のみを再試行し、各コールの入力トークンを上限で制限し、幻覚をレコード全体ではなく単一フィールドに分離できます。
固定された output-schema.json に対するスキーマ検証。 検証エラーは出力の errors 配列に格納されます。スキルは型を静かに強制変換しません。
「存在しない」フォールバック。 条項が見つからない場合、value: null, status: "not_present", note: "Searched headings: [...]" を出力します。推測しません。CLM バックフィルパイプラインは null + status:not_present を確認済み不在（そのフィールドなしで契約書を登録）として扱い、null + status:error を再実行が必要（登録しない）として扱います。2 つを混在させると時間とともに CLM データが破損します。

コストの実態

2026 年の Claude 価格 — スキル内で使用するコスト効率の高いモデルで入力 ~3 USD/M トークン、出力 ~15 USD/M トークン — では、コストは入力トークンに支配され、入力トークンは候補段落の長さに支配されます（スキルは条項ごとにマッチした段落のみを送り、契約書全体を送らないため）。

契約書あたりの概算数値：

短い契約書（5 ページ、全条項コールで ~3K 入力トークン、~500 出力トークン）： 契約書あたり ~0.02 USD。
標準 MSA（20 ページ、~12K 入力トークン、~1K 出力トークン）： 契約書あたり ~0.05 USD。
スケジュール付き長いエンタープライズ MSA（60 ページ、~35K 入力トークン、~2K 出力トークン）： 契約書あたり ~0.13 USD。

パイプライン経由で毎月 ~200 件の新規および継承契約書を実行する典型的なミッドマーケット社内チームでは、トークン支出は月 10〜30 USD です。コストはパラリーガルの 1 時間分に比べると端数です。端数でなくなるのは 50,000 件のデューデリジェンスプロジェクトです — 件あたり 0.05 USD で 2,500 USD になります。これは依然として安価ですが、クレジットカード明細で発見するよりも事前に予算化する価値があります。

トークン以外のコスト：confidence: medium | low のすべての抽出（および high の 10% サンプル）には人間のレビューが必要です。medium で ~30 秒、low で ~2 分を見込んでください。スキルはパラリーガルより速いですが、無料ではありません。

成功の指標

初日から計装する価値のある 2 つの指標。

ラベル付きセットでの抽出精度。 手動抽出で 50 件の金本位セットを構築します。条項ごとに精度と再現率を測定します。目標：必須条項（governing_law、liability_cap、term_length_months、auto_renewal）で ≥95% の精度。それ以下では偽陽性が CLM を汚染し、レビュアーはフィールドを無視することを学びます。再現率の重要度は低いです — not_present は重要な答えであり、見逃した条項は人間のレビューにルーティングされます。
契約書あたりのエンドツーエンドの時間。 フラグ付きレコードへの人間レビューパスを含みます。20 ページ MSA の目標：4 分以内（完全な手動抽出の 20〜30 分に対して）。5 倍が見えない場合、人間レビューキューが積極的すぎます — 信頼閾値を厳しくしてください。

代替案との比較

Ironclad のネイティブ AI 条項抽出と比較して。 Ironclad の組み込み抽出は、すべての対象契約書が Ironclad にある場合に優れています。Ironclad の外部からバックフィルする場合（インポートパスが複雑）や、Ironclad のテンプレートセットを超えるカスタム条項が必要な場合に苦労します。このスキルはディスク上の任意のファイルに対して実行し、自社の分類法を使用します。完全に Ironclad に生活している場合はネイティブ抽出を使用してください。複数の宛先にフィードする場合や非 Ironclad リポジトリでデューデリジェンスを行う場合は、このスキルの方が適しています。
Kira Systemsと比較して。 Kira はエンタープライズグレードの老舗です — 高精度、深いテンプレートライブラリ、高価（6 桁）、長い販売サイクル、カスタム条項ごとのトレーニングデータが必要。M&A デューデリジェンスをスケールで行う BigLaw ファームなら Kira はその価値があります。数千件の継承 MSA をバックフィルする 50 人の法務 Ops チームなら Kira は過剰であり、このスキルは必要な精度で 2 桁安価です。
手動パラリーガルレビューと比較して。 正直な比較です。20 ページ MSA から 10 の条項を抽出するパラリーガルは 20〜30 分かかり、簡単な条項（準拠法、期間）で ≥99% の精度、難しい条項（責任上限の構造、補償の除外）で ~90% の精度を達成します。このスキルは 1 分未満で ~0.05 USD で行い、簡単なもので ~95%、難しいもので ~85% を達成し、残りを信頼フラグ経由で人間にルーティングします。ほとんどのチームにとって正しいアプローチはハイブリッドです：すべての契約書にスキルを使用し、フラグ付きレコードにパラリーガルを使用します。

注意事項

Tier-B ベンダー経由の特権漏洩。 承認されていない AI エンドポイントを通じて特権文書をルーティングすることで特権が放棄される可能性があります。ガード： スキルは起動時にハードコードされたエンドポイント許可リスト（api.anthropic.com に加えて企業テナント）をチェックし、設定されたエンドポイントがリストにない場合は実行を拒否します。AI ポリシーに許可リストの所有者をドキュメント化してください。
スキャン PDF での OCR による欠落テキスト。 OCR レイヤーのないスキャン画像 PDF は空のページとして抽出されます。ガードなしでは、スキルはほとんどの条項を not_present と報告し、クリーンな実行のように見えます。ガード： ステップ 1 は抽出文字数が < 50 のページを検出し、誤解を招くレコードを出力するのではなく ocr_required で中断します。契約書を上流で OCR にルーティングして再実行してください。
幻覚された条項。 モデルは求めると存在しない「利便のための解除」条項を親切に発明します。ガード： ステップ 2 のバイトと同一の抜粋部分文字列チェック — ソース段落に文字通り存在しない抜粋はすべて拒否され、条項は status: "error", error: "excerpt_not_grounded" を記録します。設計上、高信頼度の幻覚パスは存在しません。
契約バージョン間のスキーマドリフト。 liability_cap を文字列から {type, amount, period} オブジェクトに変更する分類法の更新は、すべての下流コンシューマーを静かに壊します。ガード： references/2-output-schema.json に extractor_version を固定し、分類法またはスキーマの変更のたびにバンプします。下流コンシューマーは安定性の前提ではなくバージョンをキーにします。
定義された用語の解決。 「スケジュール A に記載されている通り」は値ではなく参照を返します。ガード： スキルは as set forth in / as defined in を検出し、confidence: medium と note: "cross-reference, manual resolution required" を出力します。ナイーブな自動解決は正直なフラグよりも悪いです。
法的アドバイスではありません。 抽出は機械的です。このディールで 12 ヶ月の上限が許容できるかどうかは弁護士が担う判断です。

スタック

Claude — テキスト抽出のオーケストレーション、引用根拠付き条項抽出、スキーマ検証
Ironclad（オプション）— 抽出されたレコードの主要 CLM 宛先。CLM を選定中の場合は alternatives-to-ironclad や best CLM platforms も参照してください。
CLM 背景知識 — CLM の概要と抽出の位置付け。

GitHubでこのページを編集

Files in this artifact

Download all (.zip)

---
name: clause-extraction
description: Extract a fixed set of contract clauses from a single .pdf or .docx and emit citation-grounded JSON with page/span references. Use after intake to backfill CLM metadata, build a clause library, or surface change-of-control / liability terms during diligence.
---

# Clause extraction

## When to invoke

Invoke this skill per contract, after the document has been ingested and you need a structured clause record (governing law, liability cap, term, auto-renewal, indemnification, payment terms, IP ownership, confidentiality term, termination triggers, plus any custom clauses you configure).

Typical callers:

- CLM backfill — populating Ironclad / Agiloft / DealHub metadata for a legacy contract repository
- Diligence — surfacing change-of-control, assignment, MFN clauses on a target company's contract set before deal close
- Clause library — building a corpus of "what we actually agreed to" across a portfolio so the playbook reflects reality

Do NOT invoke this skill for:

- **Privileged drafts in active negotiation** — per AI policy in most legal teams, in-flight negotiation drafts (especially with outside counsel redlines) do not get sent to AI tooling. This skill is for executed or near-final contracts that have already cleared privilege.
- **Anything via non-Tier-A AI vendors.** Run only against the firm-approved Tier-A model endpoint (Anthropic API or your enterprise Claude tenant). A general-purpose chatbot, browser plugin, or unvetted SaaS wrapper is a privilege-leak vector — refuse the invocation rather than route around the AI policy.
- Drafting or redlining clauses (this skill reads only)
- Interpreting legal effect (the output is text + citation; legal judgment stays with counsel)

## Inputs

- Required: `contract_path` — absolute path to a `.pdf` or `.docx`. PDFs must be text-based or pre-OCR'd; scanned-image PDFs without an OCR layer are rejected at step 1.
- Required: `taxonomy` — path to `references/clause-taxonomy.md` (or a custom taxonomy keyed by contract type). Defines the clauses to look for and the expected value type (string, number, boolean, enum).
- Required: `output_schema` — path to `references/output-schema.json`. The JSON Schema the output must validate against. Schema drift across contract versions is the #1 source of downstream pipeline breakage; pinning the schema per run guards against it.
- Optional: `contract_type` — `msa | sow | nda | dpa | order_form`. Selects the clause subset from the taxonomy. Defaults to `msa`.
- Optional: `custom_clauses` — array of additional clause names to look for beyond the taxonomy defaults (e.g. `data_residency_clause`, `most_favored_customer_clause`).

## Reference files

Read these from `references/` before processing. They are templates — replace the placeholder content with your firm's real taxonomy and schema before running on production contracts.

- `references/clause-taxonomy.md` — clause definitions per contract type, with the value type, required/optional flag, and synonym phrases the extraction step matches against
- `references/output-schema.json` — the JSON Schema every emitted record must validate against
- `references/citation-format.md` — citation grammar (page + span anchor) and the rules for "not present" / "could not extract" fallbacks

## Method

Run these steps in order. Do not parallelize — later steps depend on the artifacts produced by earlier ones.

### 1. Text extraction with layout preservation

For `.docx`: parse via the docx XML and emit a flat text stream with paragraph indices and section headings preserved.

For `.pdf`: use a text-layer extractor (pdfplumber or pdfminer.six) that preserves page numbers and bounding-box character spans. If the PDF has no text layer (scanned image), abort with `error: "ocr_required"` rather than silently emitting empty text. Routing a scanned PDF to OCR is a separate upstream concern; this skill does not OCR.

The output of step 1 is a list of `{page, paragraph_index, char_span, text}` records. Every later citation references these coordinates.

### 2. Citation-grounded extraction (one pass per clause)

For each clause in the taxonomy:

1. Locate candidate paragraphs by heading match (e.g. "Governing Law", "Term") and synonym phrase match (e.g. "shall be governed by", "initial term of this Agreement").
2. Pass the candidate paragraphs (not the full contract) to Claude with the clause definition and ask for: the value, the verbatim source excerpt (≤ 280 chars), the `{page, char_span}` citation, and a `confidence` score (`high | medium | low`).
3. **Reject any extracted excerpt that is not byte-identical to a substring of the source paragraphs.** This is the hallucination guard — if the model returns text not actually in the contract, drop the extraction and record `value: null, error: "excerpt_not_grounded"`.

Why one pass per clause and not a single mega-prompt: per-clause prompts let you retry only the failures, cap each call's input tokens (cheaper, faster), and isolate hallucination failures to a single field instead of the whole record.

### 3. Schema validation

Validate the assembled record against `output-schema.json`. If validation fails, emit the validation error in the output's `errors` array. Do not silently coerce types.

### 4. "Not present" fallback

If a clause is not located in step 2 (no candidate paragraphs above confidence threshold), emit `value: null, status: "not_present", note: "Searched headings: [...]; no matching paragraphs found."` Do not guess. "Not present" is a load-bearing answer; CLM backfill pipelines treat `null + status:not_present` differently from `null + error:*`.

## Output format

Always emit a single JSON object per contract. Soft constraints below are enforced by `references/output-schema.json`.

```json
{
"contract_file": "vendor_msa_2026.pdf",
"contract_type": "msa",
"extracted_at": "2026-05-03T14:22:00Z",
"extractor_version": "clause-extraction@2026.05",
"clauses": {
"governing_law": {
"value": "Delaware",
"excerpt": "This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflict of laws principles.",
"citation": { "page": 14, "char_span": [1820, 1980] },
"confidence": "high",
"status": "extracted"
},
"liability_cap": {
"value": "12 months fees",
"excerpt": "In no event shall either party's aggregate liability exceed the fees paid by Customer in the twelve (12) months preceding the event giving rise to the claim.",
"citation": { "page": 18, "char_span": [220, 410] },
"confidence": "high",
"status": "extracted"
},
"auto_renewal": {
"value": true,
"renewal_term_months": 12,
"notice_period_days": 90,
"excerpt": "This Agreement shall automatically renew for successive 12-month terms unless either party provides 90 days' written notice of non-renewal.",
"citation": { "page": 3, "char_span": [50, 230] },
"confidence": "high",
"status": "extracted"
},
"most_favored_customer_clause": {
"value": null,
"status": "not_present",
"note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; no matching paragraphs found."
}
},
"errors": []
}
```

## Watch-outs

- **Privilege leak via Tier-B vendor.** Routing a privileged or attorney-work-product document through a non-approved AI endpoint can waive privilege. Guard: hard-coded allowlist of model endpoints (`ALLOWED_ENDPOINTS = ["api.anthropic.com", "<your-enterprise-tenant>"]`) checked at skill startup. Refuse to run if the configured endpoint is not on the list. Document the allowlist owner in your AI policy.
- **OCR-induced text gaps on scanned PDFs.** If step 1 silently emits empty pages from a scanned image PDF, the skill will report many clauses as `not_present` and look like a clean extraction. Guard: step 1 detects pages with < 50 extracted characters and aborts with `ocr_required` rather than producing a misleading "clean" record.
- **Hallucinated clauses.** Models will helpfully invent a "termination for convenience" clause that doesn't exist if asked. Guard: byte-identical excerpt-substring check in step 2 — any excerpt not literally present in the source paragraphs is rejected. Pair with `confidence: low` flagging for human review on the rest.
- **Schema drift across contract versions.** A taxonomy update that changes `liability_cap` from a string to a structured `{type, amount, period}` silently breaks every downstream consumer. Guard: pin `extractor_version` in the output and bump it on every taxonomy or schema change. Downstream consumers key on version, not on the assumption that the schema is stable.
- **Defined-term resolution.** When a clause says "as set forth in Schedule A" the excerpt is the reference, not the value. Guard: detect the substring "as set forth in" / "as defined in" and emit `confidence: medium, note: "cross-reference, manual resolution required"` rather than treating the reference as the answer.
- **Heading-light contracts.** Contracts without clear section headings (older or short-form) extract less reliably. Guard: when fewer than 60% of expected headings match in step 2, mark the whole record `confidence: medium` and note `"heading_density: low"` so downstream QA routes it to human review.

# Clause taxonomy — TEMPLATE

> Replace this template's contents with your firm's actual clause taxonomy
> per contract type. The clause-extraction skill reads this file on every
> run; without your real taxonomy, extractions will use the generic defaults
> below and miss the clauses your CLM cares about.

The skill keys on `clause_id`. Every clause record in the output JSON uses the `clause_id` as the property name. Adding a clause means: add an entry here AND add the matching property to `output-schema.json`.

## Convention

For each clause:

- `clause_id` — snake_case identifier used as the JSON key
- `value_type` — `string | number | boolean | enum | structured`
- `required` — `true | false` (drives `not_present` vs hard error in validation)
- `headings` — list of section heading strings the locator matches against
- `synonyms` — list of phrase substrings the locator falls back to when no heading matches
- `value_hint` — what the extractor should pull (e.g. "the named jurisdiction state or country", "12-month-fees / 24-month-fees / unlimited / other")

## MSA defaults

### governing_law

- value_type: `string`
- required: true
- headings: `["Governing Law", "Choice of Law", "Applicable Law"]`
- synonyms: `["shall be governed by", "construed in accordance with the laws of"]`
- value_hint: the named jurisdiction (state, province, or country)

### liability_cap

- value_type: `structured` — `{ type: "fees_period" | "fixed_amount" | "unlimited" | "other", amount?: number, period_months?: number, currency?: string }`
- required: true
- headings: `["Limitation of Liability", "Liability Cap", "Cap on Liability"]`
- synonyms: `["aggregate liability shall not exceed", "in no event shall either party's liability exceed"]`
- value_hint: extract the cap amount or formula. Distinguish indirect-damages exclusions (do NOT extract those here) from the cap itself.

### indemnification

- value_type: `structured` — `{ ip_indemnity: boolean, mutual: boolean, carveouts: string[] }`
- required: true
- headings: `["Indemnification", "Indemnity"]`
- synonyms: `["shall defend, indemnify and hold harmless"]`
- value_hint: pull the IP indemnity boolean and the carveouts list (e.g. combinations, modifications, open source).

### term_length_months

- value_type: `number`
- required: true
- headings: `["Term", "Term and Termination"]`
- synonyms: `["initial term of this Agreement", "shall commence on the Effective Date"]`
- value_hint: convert years to months (3-year term → 36).

### auto_renewal

- value_type: `structured` — `{ enabled: boolean, renewal_term_months?: number, notice_period_days?: number }`
- required: true
- headings: `["Renewal", "Term and Termination"]`
- synonyms: `["shall automatically renew", "evergreen", "successive renewal terms"]`

### termination_triggers

- value_type: `structured` — `{ for_convenience: { allowed: boolean, notice_days?: number }, for_cause: { material_breach_cure_days?: number }, for_insolvency: boolean }`
- required: true
- headings: `["Termination"]`
- synonyms: `["may terminate this Agreement", "for material breach"]`

### payment_terms

- value_type: `structured` — `{ net_days: number, currency: string, late_fee_apr?: number }`
- required: true
- headings: `["Payment", "Fees and Payment", "Invoicing"]`
- synonyms: `["payable within", "net thirty (30) days"]`

### ip_ownership

- value_type: `enum` — `vendor | customer | joint | work_for_hire | other`
- required: true
- headings: `["Intellectual Property", "Ownership", "IP Rights"]`
- synonyms: `["all right, title and interest"]`

### confidentiality_term_months

- value_type: `number`
- required: true
- headings: `["Confidentiality", "Non-Disclosure"]`
- synonyms: `["confidentiality obligations shall survive", "for a period of"]`
- value_hint: convert years to months. If trade-secret carveout is "in perpetuity", emit `-1` and set `confidence: medium`.

## NDA defaults

(Replace with your NDA-specific taxonomy. Typical: `term_months`, `survival_period_months`, `permitted_purposes`, `residual_rights`, `return_or_destroy`.)

## DPA defaults

(Replace with your DPA-specific taxonomy. Typical: `data_residency`, `subprocessor_consent`, `audit_rights`, `breach_notification_hours`, `sccs_module_used`.)

## Custom clauses (firm-specific)

Add your firm-specific clauses here. Examples to consider:

- `change_of_control_clause` — boolean + carveouts
- `most_favored_customer_clause` — boolean + scope
- `data_residency_clause` — enum of jurisdictions
- `assignment_restriction` — enum: `no_restriction | consent_required | prohibited`
- `non_solicit_term_months` — number

## Last edited

{YYYY-MM-DD} — bump on every taxonomy change. The extractor records this date in `extractor_version` so downstream consumers can detect schema drift.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://ooligo.io/schemas/clause-extraction/v1.json",
  "title": "Clause extraction record",
  "description": "Output of the clause-extraction skill, one record per contract. Replace this template with your firm's pinned schema. Bump $id on every breaking change so downstream consumers can key on version.",
  "type": "object",
  "required": ["contract_file", "contract_type", "extracted_at", "extractor_version", "clauses", "errors"],
  "additionalProperties": false,
  "properties": {
    "contract_file": { "type": "string", "minLength": 1 },
    "contract_type": { "type": "string", "enum": ["msa", "sow", "nda", "dpa", "order_form"] },
    "extracted_at": { "type": "string", "format": "date-time" },
    "extractor_version": { "type": "string", "pattern": "^clause-extraction@[0-9]{4}\\.[0-9]{2}$" },
    "source_metadata": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "page_count": { "type": "integer", "minimum": 1 },
        "heading_density": { "type": "string", "enum": ["high", "medium", "low"] },
        "ocr_layer_present": { "type": "boolean" }
      }
    },
    "clauses": {
      "type": "object",
      "description": "One property per clause_id from the taxonomy. Use $defs/clauseRecord for each.",
      "additionalProperties": { "$ref": "#/$defs/clauseRecord" }
    },
    "errors": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["code", "message"],
        "properties": {
          "code": { "type": "string", "enum": ["ocr_required", "schema_validation_failed", "endpoint_not_allowed", "excerpt_not_grounded", "taxonomy_mismatch"] },
          "message": { "type": "string" },
          "clause_id": { "type": "string" }
        }
      }
    }
  },
  "$defs": {
    "clauseRecord": {
      "type": "object",
      "required": ["status"],
      "additionalProperties": true,
      "properties": {
        "status": { "type": "string", "enum": ["extracted", "not_present", "error"] },
        "value": {
          "description": "Type depends on the clause's value_type in the taxonomy. Null when status != extracted.",
          "anyOf": [
            { "type": "string" },
            { "type": "number" },
            { "type": "boolean" },
            { "type": "object" },
            { "type": "null" }
          ]
        },
        "excerpt": {
          "type": "string",
          "maxLength": 280,
          "description": "Verbatim substring of the source. MUST be byte-identical to text in the source paragraphs — the hallucination guard rejects any excerpt not literally present."
        },
        "citation": {
          "type": "object",
          "required": ["page", "char_span"],
          "additionalProperties": false,
          "properties": {
            "page": { "type": "integer", "minimum": 1 },
            "char_span": {
              "type": "array",
              "minItems": 2,
              "maxItems": 2,
              "items": { "type": "integer", "minimum": 0 },
              "description": "[start_char_offset, end_char_offset] within the page's extracted text."
            }
          }
        },
        "confidence": { "type": "string", "enum": ["high", "medium", "low"] },
        "note": { "type": "string" },
        "error": { "type": "string" }
      },
      "allOf": [
        {
          "if": { "properties": { "status": { "const": "extracted" } } },
          "then": { "required": ["value", "excerpt", "citation", "confidence"] }
        },
        {
          "if": { "properties": { "status": { "const": "not_present" } } },
          "then": { "required": ["note"] }
        },
        {
          "if": { "properties": { "status": { "const": "error" } } },
          "then": { "required": ["error"] }
        }
      ]
    }
  }
}

# Citation format — TEMPLATE

> The clause-extraction skill emits a citation on every extracted clause so
> the downstream reviewer can verify the extraction in seconds rather than
> re-reading the contract. Without a usable citation grammar, extractions
> are unfalsifiable — and unfalsifiable extractions become silent CLM data
> rot. This file pins the format and the fallback rules.

## Citation grammar

A citation is a structured object, not a string. The skill emits:

```json
{
  "page": 14,
  "char_span": [1820, 1980]
}
```

- `page` — 1-indexed page number in the source PDF, or paragraph cluster index for `.docx` (since `.docx` has no fixed pagination).
- `char_span` — `[start, end]` character offsets within the page's extracted text, where `start` is inclusive and `end` is exclusive.

The `excerpt` field on the clause record is the verbatim substring at that span. The skill enforces that `page_text[char_span[0]:char_span[1]] == excerpt`. If the assertion fails, the extraction is rejected and the clause is recorded with `status: "error", error: "excerpt_not_grounded"`.

## Why structured, not "p. 14, ¶ 3"

Free-text citations like "p. 14, paragraph 3" cannot be machine-verified. A reviewer cannot click them. A pipeline cannot diff them across re-runs. A regression test cannot assert "the citation moved by exactly N characters when we re-ran extraction after taxonomy v2." Structured citations make every extraction reproducible and reviewable.

## Reviewer UX expectations

Downstream tooling (CLM, review queue, audit log) is expected to render the citation as a deep link into the source PDF page with the excerpt highlighted. Without that affordance, reviewers fall back to ctrl-F on the excerpt — which works, but doubles review time.

Recommended renderer behavior:

- Display the excerpt with the citation page badge inline
- On click, open the source PDF at the cited page with the excerpt highlighted (PDF.js supports `#highlight=<text>`)
- Show `confidence` as a colored chip: high = green, medium = amber, low = red

## "Not present" — the load-bearing answer

When a clause is not located, the citation is omitted and `status: "not_present"` is set with a `note` field documenting the search:

```json
{
  "value": null,
  "status": "not_present",
  "note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; searched synonyms: ['most favored', 'no less favorable']; no matching paragraphs found."
}
```

This is intentionally explicit. CLM backfill pipelines treat a `null` with `status: "not_present"` as confirmed-absent (file the contract without that field) and a `null` with `status: "error"` as needs-rerun (do not file). Conflating the two corrupts CLM data over time.

## Cross-reference handling

When the matched paragraph is a pointer ("as set forth in Schedule A"), the skill emits:

```json
{
  "value": "see Schedule A",
  "excerpt": "Liability shall be limited as set forth in Schedule A.",
  "citation": { "page": 18, "char_span": [220, 274] },
  "confidence": "medium",
  "note": "cross-reference; manual resolution required"
}
```

Resolving cross-references is out of scope. The skill could chase the reference into Schedule A, but the failure modes (mis-numbered schedules, amendments overriding the schedule, partially-resolved chains) make naive resolution worse than an honest "needs human" flag.

## Confidence calibration

| Confidence | Meaning | Reviewer action |
|---|---|---|
| `high` | Heading match + synonym match + clean excerpt grounded in source | Spot-check 10% sample; trust the rest |
| `medium` | Synonym match without heading, OR cross-reference, OR low heading density on the contract overall | Review every record |
| `low` | Multiple candidate paragraphs and the model picked one with weak signal, OR excerpt is ≥ 200 chars | Review every record before filing |

The skill MUST NOT emit `high` for a record that did not pass the byte-identical excerpt check. There is no "high-confidence hallucination" case — by construction.

## Last edited

{YYYY-MM-DD}