claude-skill

Klausel-Extraktion aus beliebigen Verträgen mit Claude

Difficulty

Anfänger

Setup time

20min

For

legal-ops · in-house-counsel · paralegal · contract-manager

Legal Ops

Stack

Eine Claude Skill, die einen einzelnen unterzeichneten Vertrag – .docx oder .pdf mit Textebene – entgegennimmt und einen zitierten JSON-Datensatz mit den Klauseln ausgibt, auf die Ihr CLM tatsächlich eingeht: Gerichtsstand, Haftungsdeckelung, Freistellung, Laufzeit, automatische Verlängerung, Kündigungsauslöser, Zahlungsbedingungen, IP-Eigentümerschaft, Vertraulichkeitsdauer, plus beliebige benutzerdefinierte Felder, die Sie konfigurieren (Datenspeicherort, MFN, Change-of-Control, Abtretung). Jeder extrahierte Wert enthält einen verbatim-Auszug, eine {Seite, char_span}-Zitation und einen Konfidenz-Score, sodass der nachgelagerte Reviewer in Sekunden verifizieren kann, ohne den Vertrag erneut zu lesen.

Diese Seite beschreibt, wann man sie ausführt, wann explizit nicht, was sie kostet und die benannten Fehlermodi, die Sie einschätzen sollten, bevor Sie sie auf ein Produktions-Repository richten.

Wann einsetzen

Greifen Sie auf die Skill zurück, wenn Sie einen strukturierten Output-Bedarf gegen Verträge haben, die bereits das Privilege-Clearing bestanden haben:

CLM-Daten-Backfill. Sie haben ein Flat-File-Repository (Box, SharePoint, Netzlaufwerk) geerbt und müssen Ironclad- oder Agiloft-Metadatenfelder befüllen, ohne ein Paralegal-Quartal zu verbrennen.
Klauselbibliotheks-Aufbau. Sie wollen jede „Haftungsdeckelung”-Klausel im Portfolio, damit die Klauselbibliothek widerspiegelt, was Sie tatsächlich vereinbart haben, nicht die angestrebte Position des Playbooks.
Due Diligence. Sie haben 48 Stunden, um Change-of-Control-, Abtretungs- und Most-Favoured-Customer-Klauseln im Vertragswerk eines Targets vor einem Deal-Abschluss aufzuzeigen.
Renewal-Triage. Sie müssen jeden Vertrag markieren, der in den nächsten 90 Tagen automatisch verlängert wird, mit befülltem Kündigungsfristen-Feld.

Das Artifact-Bundle liegt unter apps/web/public/artifacts/clause-extraction-claude-skill/ und liefert:

SKILL.md — die Skill-Definition mit Methode, Ausgabeformat und Hinweisen
references/1-clause-taxonomy.md — die zu extrahierenden Klauseln pro Vertragstyp, mit Überschriften und Synonymen
references/2-output-schema.json — das JSON Schema, gegen das jeder Datensatz validiert wird (an eine Version pinnen)
references/3-citation-format.md — Zitiergrammatik und die Regeln für „nicht vorhanden” / „konnte nicht extrahiert werden”-Fallbacks

Wann NICHT einsetzen

Die Skill ist bewusst eng. Lehnen Sie den Aufruf in einem dieser Fälle ab.

Privilegierte Entwürfe in aktiver Verhandlung. Die KI-Richtlinien der meisten Rechtsabteilungen (und die KI-Richtlinien-Vorlage, die wir empfehlen) ziehen eine klare Linie bei laufenden Verhandlungs-Entwürfen – insbesondere externe Anwalts-Redlines und anwaltliches Arbeitsergebnis. Diese Skill ist für unterzeichnete oder quasi-endgültige Verträge, die die Privilege-Frage bereits geklärt haben. Wenn Sie nicht sicher sind, ob ein Dokument die Prüfung bestanden hat, ist die Antwort nein.
Alles über Nicht-Tier-A-KI-Anbieter. Nur gegen den von Ihrer Firma genehmigten Tier-A-Endpunkt ausführen (Anthropic API direkt oder Ihr Enterprise-Claude-Tenant). Niemals den Consumer-Chatbot. Niemals ein Browser-Plugin. Niemals einen ungeprüften SaaS-Wrapper, der „Claude unter der Haube” verspricht. Das Senden eines Vertrags durch einen Tier-B-Anbieter ist ein Privilege-Leak-Vektor – lehnen Sie den Aufruf ab statt die KI-Richtlinie zu umgehen. Die Skill selbst kodiert eine Endpunkt-Allowlist fest; wenn Sie sie innerhalb von Claude Code oder Claude.ai mit Ihrem Enterprise-Tenant ausführen, sind Sie in Ordnung.
Entwurf oder Redlining. Diese Skill liest nur. Für Redlining verwenden Sie die separate Vertrags-Redline-Skill.
Rechtsauslegung. Der Output ist Text + Zitation. Ob eine 12-monatige Haftungsdeckelung „gut genug” ist, angesichts des Deal-Kontexts, ist ein Urteilsaufruf, der bei der Rechtsberatung bleibt.

Setup

Bundle in ~/.claude/skills/ (Claude Code) ablegen oder das Verzeichnis references/ und SKILL.md in ein Claude.ai-Projekt hochladen.
Den Inhalt von references/1-clause-taxonomy.md durch die tatsächliche Taxonomie Ihrer Firma ersetzen. Die Standard-Taxonomie enthält die üblichen MSA-Klauseln; die meisten Firmen fügen 5–10 benutzerdefinierte Felder hinzu (Datenspeicherort nach Jurisdiktion, Change-of-Control-Ausnahmen, Non-Solicit-Laufzeit, MFN-Umfang).
references/2-output-schema.json an eine Version pinnen. extractor_version im Schema und in der Skill bei jeder Taxonomieänderung erhöhen, damit nachgelagerte Konsumenten Drift erkennen können.
Auf einem bekannten Vertrag ausführen – wählen Sie einen, dessen Klauselwerte Sie bereits im CLM haben. Vergleichen Sie den extrahierten JSON mit dem CLM-Datensatz. Taxonomie-Synonyme iterieren, bis Sie übereinstimmen.
Im großen Maßstab ausführen. Die Skill ist per Vertrag; Batch in n8n, einer Shell-Schleife oder dem Intake-Hook Ihres CLM orchestrieren.

Was die Skill tatsächlich tut

Vier Schritte der Reihe nach.

Textextraktion mit Layout-Erhaltung. .docx wird über die docx-XML geparst; .pdf über pdfplumber, sodass Seitenzahlen und Bounding-Box-Zeichen-Spans erhalten bleiben. Wenn das PDF keine Textebene hat (gescanntes Bild), bricht die Skill mit error: "ocr_required" ab statt leeren Text auszugeben. Gescannte PDFs an OCR zu routen ist eine separate vorgelagerte Angelegenheit; diese Skill führt keine OCR durch, weil das stille Produzieren einer „sauberen” leeren Extraktion aus einem Scan schlimmer ist als laut zu scheitern.
Zitations-verankerte Extraktion, ein Pass pro Klausel. Für jede Klausel in der Taxonomie: Kandidaten-Absätze durch Überschriften- + Synonym-Übereinstimmung finden, nur diese Kandidaten (nicht den gesamten Vertrag) mit der Klauseldefinition an Claude übergeben und den Wert, einen verbatim ≤ 280-Zeichen-Auszug, die {Seite, char_span}-Zitation und einen high | medium | low-Konfidenz-Score zurückfordern. Jeder Auszug, der nicht byte-identisch mit einem Teilstring der Quellabsätze ist, wird abgelehnt – das ist der Halluzinations-Guard, und er ist nicht verhandelbar. Per-Klausel-Prompts (statt eines Mega-Prompts) ermöglichen das Retry nur der Fehlschläge, begrenzen die Input-Tokens jedes Aufrufs und isolieren Halluzinationen auf ein einzelnes Feld statt den gesamten Datensatz.
Schema-Validierung gegen das gepinnte output-schema.json. Validierungsfehler landen im errors-Array des Outputs. Die Skill erzwingt keine stillen Typ-Konvertierungen.
„Nicht vorhanden”-Fallback. Wenn eine Klausel nicht gefunden wird, value: null, status: "not_present", note: "Gesuchte Überschriften: [...]" ausgeben. Nicht raten. CLM-Backfill-Pipelines behandeln null + status:not_present als bestätigt-abwesend (Vertrag ohne dieses Feld ablegen) und null + status:error als needs-rerun (nicht ablegen). Die beiden zu vermischen korrumpiert CLM-Daten über Zeit.

Kostenrealität

Zu 2026-Claude-Preisen – nennen wir es ~$3/M Input-Tokens und ~$15/M Output-Tokens für das kosteneffektive Modell in der Skill – werden die Kosten von Input-Tokens dominiert, und Input-Tokens werden von der Länge der Kandidaten-Absätze dominiert (weil die Skill nie den vollständigen Vertrag sendet, nur die übereinstimmenden Absätze pro Klausel).

Grobe Zahlen pro Vertrag:

Kurzer Vertrag (5 Seiten, ~3K Input-Tokens über alle per-Klausel-Aufrufe, ~500 Output-Tokens): ~$0,02 pro Vertrag.
Standard-MSA (20 Seiten, ~12K Input-Tokens, ~1K Output-Tokens): ~$0,05 pro Vertrag.
Langes Enterprise-MSA mit Anlagen (60 Seiten, ~35K Input-Tokens, ~2K Output-Tokens): ~$0,13 pro Vertrag.

Für ein typisches Mid-Market-Inhouse-Team, das ~200 neue und geerbte Verträge pro Monat durch die Pipeline führt, sind das $10–$30/Monat an Token-Ausgaben. Die Kosten sind Rundungsfehler gegenüber einer Paralegal-Stunde. Wo es aufhört, Rundungsfehler zu sein, ist das 50.000-Vertrags-Due-Diligence-Projekt – bei $0,05 pro Stück sind das $2.500, was immer noch günstig ist, aber es lohnt sich, es im Voraus zu budgetieren statt es auf der Kreditkartenrechnung zu entdecken.

Die Nicht-Token-Kosten: Jede Extraktion mit confidence: medium | low (und eine 10-%-Stichprobe von high) benötigt menschliche Überprüfung. Planen Sie ~30 Sekunden pro Datensatz bei medium und ~2 Minuten bei low. Die Skill ist schneller als ein Paralegal, nicht kostenlos.

Erfolgsmetrik

Zwei Metriken lohnen es, von Tag eins an zu instrumentieren.

Extraktionsgenauigkeit auf einem beschrifteten Set. Erstellen Sie ein 50-Vertrags-Gold-Set mit manuellen Extraktionen. Messen Sie Präzision und Recall pro Klausel. Ziel: ≥ 95 % Präzision auf den erforderlichen Klauseln (governing_law, liability_cap, term_length_months, auto_renewal). Darunter vergiften die False Positives das CLM, und Reviewer lernen, das Feld zu ignorieren. Recall ist weniger wichtig – not_present ist eine tragende Antwort, und eine verfehlte Klausel wird zur menschlichen Überprüfung geleitet.
Zeit pro Vertrag, von Ende zu Ende. Einschließlich des menschlichen Überprüfungs-Passes bei markierten Datensätzen. Ziel für ein 20-seitiges MSA: unter 4 Minuten Wanduhr, vs. 20–30 Minuten für vollständige manuelle Extraktion. Wenn Sie nicht 5× sehen, ist die Warteschlange für menschliche Überprüfung zu aggressiv – die Konfidenz-Schwellenwerte enger stellen.

Vergleich mit Alternativen

vs. Ironclad native KI-Klausel-Extraktion. Ironclads eingebaute Extraktion ist ausgezeichnet, wenn jeder Vertrag, der Ihnen wichtig ist, in Ironclad lebt. Sie kämpft, wenn Sie von außerhalb von Ironclad befüllen (der Import-Pfad ist unhandlich) und wenn Sie benutzerdefinierte Klauseln jenseits von Ironclads Vorlagensatz wollen. Diese Skill läuft gegen jede Datei auf der Festplatte und verwendet Ihre Taxonomie. Wenn Sie vollständig in Ironclad leben, verwenden Sie deren native Extraktion; wenn Sie mehrere Ziele befüttern oder Due Diligence an einem Nicht-Ironclad-Repository durchführen, ist diese Skill die bessere Wahl.
vs. Kira Systems. Kira ist der Enterprise-Grade-Incumbent – hohe Genauigkeit, tiefe Vorlagenbibliothek, teuer (sechsstellig), langer Verkaufszyklus, erfordert Trainingsdaten pro benutzerdefinierter Klausel. Wenn Sie eine BigLaw-Firma sind, die M&A-Due-Diligence im großen Maßstab durchführt, verdient Kira seinen Preis. Wenn Sie ein 50-köpfiges Legal-Ops-Team sind, das einige tausend geerbte MSAs rückwirkend befüllt, ist Kira Overkill und diese Skill ist zwei Größenordnungen günstiger für die Genauigkeit, die Sie brauchen.
vs. manueller Paralegal-Überprüfung. Der ehrliche Vergleich. Ein Paralegal, der 10 Klauseln aus einem 20-seitigen MSA extrahiert, braucht 20–30 Minuten und erreicht ≥ 99 % Genauigkeit bei einfachen Klauseln (Gerichtsstand, Laufzeit) und ~90 % bei den schwierigen (Haftungsdeckelungsstruktur, Freistellungsausnahmen). Diese Skill erledigt es in unter einer Minute für ~$0,05, trifft ~95 % bei einfachen und ~85 % bei schwierigen und leitet den Rest über das Konfidenz-Flag an einen Menschen. Der richtige Ansatz für die meisten Teams ist hybrid: Skill auf jedem Vertrag, Paralegal bei markierten Datensätzen.

Wichtige Hinweise

Privilege-Leak über Tier-B-Anbieter. Das Routing eines privilegierten Dokuments durch einen nicht genehmigten KI-Endpunkt kann das Privileg aufheben. Guard: Die Skill prüft beim Start eine fest kodierte Endpunkt-Allowlist (api.anthropic.com plus Ihr Enterprise-Tenant) und weigert sich zu laufen, wenn der konfigurierte Endpunkt nicht darauf steht. Dokumentieren Sie den Allowlist-Eigentümer in Ihrer KI-Richtlinie.
OCR-bedingte Textlücken bei gescannten PDFs. Ein gescanntes Bild-PDF ohne OCR-Ebene extrahiert als leere Seiten; ohne einen Guard würde die Skill die meisten Klauseln als not_present melden und wie ein sauberer Durchlauf aussehen. Guard: Schritt 1 erkennt Seiten mit < 50 extrahierten Zeichen und bricht mit ocr_required ab statt einen irreführenden Datensatz auszugeben. Routen Sie den Vertrag vorgelagert durch OCR und führen Sie erneut aus.
Halluzinierte Klauseln. Modelle werden hilfreicherweise eine „Kündigung aus Bequemlichkeit”-Klausel erfinden, die nicht existiert, wenn sie gefragt werden. Guard: Die byte-identische Auszug-Teilstring-Prüfung in Schritt 2 – jeder Auszug, der nicht wörtlich in den Quellabsätzen vorhanden ist, wird abgelehnt, und die Klausel erfasst status: "error", error: "excerpt_not_grounded". Es gibt konstruktionsbedingt keinen High-Confidence-Halluzinations-Pfad.
Schema-Drift über Vertragsversionen. Eine Taxonomie-Aktualisierung, die liability_cap von einem String zu einem {type, amount, period}-Objekt ändert, bricht stillschweigend jeden nachgelagerten Konsumenten. Guard: extractor_version in references/2-output-schema.json pinnen und bei jeder Taxonomie- oder Schema-Änderung erhöhen. Nachgelagerte Konsumenten schlüsseln auf Version, nicht auf eine Stabilitätsannahme.
Defined-Term-Auflösung. „Wie in Anlage A dargelegt” gibt die Referenz zurück, nicht den Wert. Guard: Die Skill erkennt wie in ... dargelegt / wie in ... definiert und gibt confidence: medium mit note: "Querverweise, manuelle Auflösung erforderlich" aus. Naive Auto-Auflösung ist schlechter als das ehrliche Flag.
Keine Rechtsberatung. Extraktion ist mechanisch. Ob eine 12-monatige Deckelung für diesen Deal akzeptabel ist, ist ein Urteil, das bei der Rechtsberatung bleibt.

Stack

Claude — Textextraktion-Orchestrierung, zitations-verankerte Klausel-Extraktion, Schema-Validierung
Ironclad (optional) — primäres CLM-Ziel für die extrahierten Datensätze. Siehe auch alternatives-to-ironclad und den best CLM platforms-Vergleich, wenn Sie noch auswählen.
CLM-Hintergrund — was CLM ist und wo Extraktion hineinpasst.

Diese Seite auf GitHub bearbeiten

Files in this artifact

Download all (.zip)

---
name: clause-extraction
description: Extract a fixed set of contract clauses from a single .pdf or .docx and emit citation-grounded JSON with page/span references. Use after intake to backfill CLM metadata, build a clause library, or surface change-of-control / liability terms during diligence.
---

# Clause extraction

## When to invoke

Invoke this skill per contract, after the document has been ingested and you need a structured clause record (governing law, liability cap, term, auto-renewal, indemnification, payment terms, IP ownership, confidentiality term, termination triggers, plus any custom clauses you configure).

Typical callers:

- CLM backfill — populating Ironclad / Agiloft / DealHub metadata for a legacy contract repository
- Diligence — surfacing change-of-control, assignment, MFN clauses on a target company's contract set before deal close
- Clause library — building a corpus of "what we actually agreed to" across a portfolio so the playbook reflects reality

Do NOT invoke this skill for:

- **Privileged drafts in active negotiation** — per AI policy in most legal teams, in-flight negotiation drafts (especially with outside counsel redlines) do not get sent to AI tooling. This skill is for executed or near-final contracts that have already cleared privilege.
- **Anything via non-Tier-A AI vendors.** Run only against the firm-approved Tier-A model endpoint (Anthropic API or your enterprise Claude tenant). A general-purpose chatbot, browser plugin, or unvetted SaaS wrapper is a privilege-leak vector — refuse the invocation rather than route around the AI policy.
- Drafting or redlining clauses (this skill reads only)
- Interpreting legal effect (the output is text + citation; legal judgment stays with counsel)

## Inputs

- Required: `contract_path` — absolute path to a `.pdf` or `.docx`. PDFs must be text-based or pre-OCR'd; scanned-image PDFs without an OCR layer are rejected at step 1.
- Required: `taxonomy` — path to `references/clause-taxonomy.md` (or a custom taxonomy keyed by contract type). Defines the clauses to look for and the expected value type (string, number, boolean, enum).
- Required: `output_schema` — path to `references/output-schema.json`. The JSON Schema the output must validate against. Schema drift across contract versions is the #1 source of downstream pipeline breakage; pinning the schema per run guards against it.
- Optional: `contract_type` — `msa | sow | nda | dpa | order_form`. Selects the clause subset from the taxonomy. Defaults to `msa`.
- Optional: `custom_clauses` — array of additional clause names to look for beyond the taxonomy defaults (e.g. `data_residency_clause`, `most_favored_customer_clause`).

## Reference files

Read these from `references/` before processing. They are templates — replace the placeholder content with your firm's real taxonomy and schema before running on production contracts.

- `references/clause-taxonomy.md` — clause definitions per contract type, with the value type, required/optional flag, and synonym phrases the extraction step matches against
- `references/output-schema.json` — the JSON Schema every emitted record must validate against
- `references/citation-format.md` — citation grammar (page + span anchor) and the rules for "not present" / "could not extract" fallbacks

## Method

Run these steps in order. Do not parallelize — later steps depend on the artifacts produced by earlier ones.

### 1. Text extraction with layout preservation

For `.docx`: parse via the docx XML and emit a flat text stream with paragraph indices and section headings preserved.

For `.pdf`: use a text-layer extractor (pdfplumber or pdfminer.six) that preserves page numbers and bounding-box character spans. If the PDF has no text layer (scanned image), abort with `error: "ocr_required"` rather than silently emitting empty text. Routing a scanned PDF to OCR is a separate upstream concern; this skill does not OCR.

The output of step 1 is a list of `{page, paragraph_index, char_span, text}` records. Every later citation references these coordinates.

### 2. Citation-grounded extraction (one pass per clause)

For each clause in the taxonomy:

1. Locate candidate paragraphs by heading match (e.g. "Governing Law", "Term") and synonym phrase match (e.g. "shall be governed by", "initial term of this Agreement").
2. Pass the candidate paragraphs (not the full contract) to Claude with the clause definition and ask for: the value, the verbatim source excerpt (≤ 280 chars), the `{page, char_span}` citation, and a `confidence` score (`high | medium | low`).
3. **Reject any extracted excerpt that is not byte-identical to a substring of the source paragraphs.** This is the hallucination guard — if the model returns text not actually in the contract, drop the extraction and record `value: null, error: "excerpt_not_grounded"`.

Why one pass per clause and not a single mega-prompt: per-clause prompts let you retry only the failures, cap each call's input tokens (cheaper, faster), and isolate hallucination failures to a single field instead of the whole record.

### 3. Schema validation

Validate the assembled record against `output-schema.json`. If validation fails, emit the validation error in the output's `errors` array. Do not silently coerce types.

### 4. "Not present" fallback

If a clause is not located in step 2 (no candidate paragraphs above confidence threshold), emit `value: null, status: "not_present", note: "Searched headings: [...]; no matching paragraphs found."` Do not guess. "Not present" is a load-bearing answer; CLM backfill pipelines treat `null + status:not_present` differently from `null + error:*`.

## Output format

Always emit a single JSON object per contract. Soft constraints below are enforced by `references/output-schema.json`.

```json
{
"contract_file": "vendor_msa_2026.pdf",
"contract_type": "msa",
"extracted_at": "2026-05-03T14:22:00Z",
"extractor_version": "clause-extraction@2026.05",
"clauses": {
"governing_law": {
"value": "Delaware",
"excerpt": "This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflict of laws principles.",
"citation": { "page": 14, "char_span": [1820, 1980] },
"confidence": "high",
"status": "extracted"
},
"liability_cap": {
"value": "12 months fees",
"excerpt": "In no event shall either party's aggregate liability exceed the fees paid by Customer in the twelve (12) months preceding the event giving rise to the claim.",
"citation": { "page": 18, "char_span": [220, 410] },
"confidence": "high",
"status": "extracted"
},
"auto_renewal": {
"value": true,
"renewal_term_months": 12,
"notice_period_days": 90,
"excerpt": "This Agreement shall automatically renew for successive 12-month terms unless either party provides 90 days' written notice of non-renewal.",
"citation": { "page": 3, "char_span": [50, 230] },
"confidence": "high",
"status": "extracted"
},
"most_favored_customer_clause": {
"value": null,
"status": "not_present",
"note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; no matching paragraphs found."
}
},
"errors": []
}
```

## Watch-outs

- **Privilege leak via Tier-B vendor.** Routing a privileged or attorney-work-product document through a non-approved AI endpoint can waive privilege. Guard: hard-coded allowlist of model endpoints (`ALLOWED_ENDPOINTS = ["api.anthropic.com", "<your-enterprise-tenant>"]`) checked at skill startup. Refuse to run if the configured endpoint is not on the list. Document the allowlist owner in your AI policy.
- **OCR-induced text gaps on scanned PDFs.** If step 1 silently emits empty pages from a scanned image PDF, the skill will report many clauses as `not_present` and look like a clean extraction. Guard: step 1 detects pages with < 50 extracted characters and aborts with `ocr_required` rather than producing a misleading "clean" record.
- **Hallucinated clauses.** Models will helpfully invent a "termination for convenience" clause that doesn't exist if asked. Guard: byte-identical excerpt-substring check in step 2 — any excerpt not literally present in the source paragraphs is rejected. Pair with `confidence: low` flagging for human review on the rest.
- **Schema drift across contract versions.** A taxonomy update that changes `liability_cap` from a string to a structured `{type, amount, period}` silently breaks every downstream consumer. Guard: pin `extractor_version` in the output and bump it on every taxonomy or schema change. Downstream consumers key on version, not on the assumption that the schema is stable.
- **Defined-term resolution.** When a clause says "as set forth in Schedule A" the excerpt is the reference, not the value. Guard: detect the substring "as set forth in" / "as defined in" and emit `confidence: medium, note: "cross-reference, manual resolution required"` rather than treating the reference as the answer.
- **Heading-light contracts.** Contracts without clear section headings (older or short-form) extract less reliably. Guard: when fewer than 60% of expected headings match in step 2, mark the whole record `confidence: medium` and note `"heading_density: low"` so downstream QA routes it to human review.

# Clause taxonomy — TEMPLATE

> Replace this template's contents with your firm's actual clause taxonomy
> per contract type. The clause-extraction skill reads this file on every
> run; without your real taxonomy, extractions will use the generic defaults
> below and miss the clauses your CLM cares about.

The skill keys on `clause_id`. Every clause record in the output JSON uses the `clause_id` as the property name. Adding a clause means: add an entry here AND add the matching property to `output-schema.json`.

## Convention

For each clause:

- `clause_id` — snake_case identifier used as the JSON key
- `value_type` — `string | number | boolean | enum | structured`
- `required` — `true | false` (drives `not_present` vs hard error in validation)
- `headings` — list of section heading strings the locator matches against
- `synonyms` — list of phrase substrings the locator falls back to when no heading matches
- `value_hint` — what the extractor should pull (e.g. "the named jurisdiction state or country", "12-month-fees / 24-month-fees / unlimited / other")

## MSA defaults

### governing_law

- value_type: `string`
- required: true
- headings: `["Governing Law", "Choice of Law", "Applicable Law"]`
- synonyms: `["shall be governed by", "construed in accordance with the laws of"]`
- value_hint: the named jurisdiction (state, province, or country)

### liability_cap

- value_type: `structured` — `{ type: "fees_period" | "fixed_amount" | "unlimited" | "other", amount?: number, period_months?: number, currency?: string }`
- required: true
- headings: `["Limitation of Liability", "Liability Cap", "Cap on Liability"]`
- synonyms: `["aggregate liability shall not exceed", "in no event shall either party's liability exceed"]`
- value_hint: extract the cap amount or formula. Distinguish indirect-damages exclusions (do NOT extract those here) from the cap itself.

### indemnification

- value_type: `structured` — `{ ip_indemnity: boolean, mutual: boolean, carveouts: string[] }`
- required: true
- headings: `["Indemnification", "Indemnity"]`
- synonyms: `["shall defend, indemnify and hold harmless"]`
- value_hint: pull the IP indemnity boolean and the carveouts list (e.g. combinations, modifications, open source).

### term_length_months

- value_type: `number`
- required: true
- headings: `["Term", "Term and Termination"]`
- synonyms: `["initial term of this Agreement", "shall commence on the Effective Date"]`
- value_hint: convert years to months (3-year term → 36).

### auto_renewal

- value_type: `structured` — `{ enabled: boolean, renewal_term_months?: number, notice_period_days?: number }`
- required: true
- headings: `["Renewal", "Term and Termination"]`
- synonyms: `["shall automatically renew", "evergreen", "successive renewal terms"]`

### termination_triggers

- value_type: `structured` — `{ for_convenience: { allowed: boolean, notice_days?: number }, for_cause: { material_breach_cure_days?: number }, for_insolvency: boolean }`
- required: true
- headings: `["Termination"]`
- synonyms: `["may terminate this Agreement", "for material breach"]`

### payment_terms

- value_type: `structured` — `{ net_days: number, currency: string, late_fee_apr?: number }`
- required: true
- headings: `["Payment", "Fees and Payment", "Invoicing"]`
- synonyms: `["payable within", "net thirty (30) days"]`

### ip_ownership

- value_type: `enum` — `vendor | customer | joint | work_for_hire | other`
- required: true
- headings: `["Intellectual Property", "Ownership", "IP Rights"]`
- synonyms: `["all right, title and interest"]`

### confidentiality_term_months

- value_type: `number`
- required: true
- headings: `["Confidentiality", "Non-Disclosure"]`
- synonyms: `["confidentiality obligations shall survive", "for a period of"]`
- value_hint: convert years to months. If trade-secret carveout is "in perpetuity", emit `-1` and set `confidence: medium`.

## NDA defaults

(Replace with your NDA-specific taxonomy. Typical: `term_months`, `survival_period_months`, `permitted_purposes`, `residual_rights`, `return_or_destroy`.)

## DPA defaults

(Replace with your DPA-specific taxonomy. Typical: `data_residency`, `subprocessor_consent`, `audit_rights`, `breach_notification_hours`, `sccs_module_used`.)

## Custom clauses (firm-specific)

Add your firm-specific clauses here. Examples to consider:

- `change_of_control_clause` — boolean + carveouts
- `most_favored_customer_clause` — boolean + scope
- `data_residency_clause` — enum of jurisdictions
- `assignment_restriction` — enum: `no_restriction | consent_required | prohibited`
- `non_solicit_term_months` — number

## Last edited

{YYYY-MM-DD} — bump on every taxonomy change. The extractor records this date in `extractor_version` so downstream consumers can detect schema drift.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://ooligo.io/schemas/clause-extraction/v1.json",
  "title": "Clause extraction record",
  "description": "Output of the clause-extraction skill, one record per contract. Replace this template with your firm's pinned schema. Bump $id on every breaking change so downstream consumers can key on version.",
  "type": "object",
  "required": ["contract_file", "contract_type", "extracted_at", "extractor_version", "clauses", "errors"],
  "additionalProperties": false,
  "properties": {
    "contract_file": { "type": "string", "minLength": 1 },
    "contract_type": { "type": "string", "enum": ["msa", "sow", "nda", "dpa", "order_form"] },
    "extracted_at": { "type": "string", "format": "date-time" },
    "extractor_version": { "type": "string", "pattern": "^clause-extraction@[0-9]{4}\\.[0-9]{2}$" },
    "source_metadata": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "page_count": { "type": "integer", "minimum": 1 },
        "heading_density": { "type": "string", "enum": ["high", "medium", "low"] },
        "ocr_layer_present": { "type": "boolean" }
      }
    },
    "clauses": {
      "type": "object",
      "description": "One property per clause_id from the taxonomy. Use $defs/clauseRecord for each.",
      "additionalProperties": { "$ref": "#/$defs/clauseRecord" }
    },
    "errors": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["code", "message"],
        "properties": {
          "code": { "type": "string", "enum": ["ocr_required", "schema_validation_failed", "endpoint_not_allowed", "excerpt_not_grounded", "taxonomy_mismatch"] },
          "message": { "type": "string" },
          "clause_id": { "type": "string" }
        }
      }
    }
  },
  "$defs": {
    "clauseRecord": {
      "type": "object",
      "required": ["status"],
      "additionalProperties": true,
      "properties": {
        "status": { "type": "string", "enum": ["extracted", "not_present", "error"] },
        "value": {
          "description": "Type depends on the clause's value_type in the taxonomy. Null when status != extracted.",
          "anyOf": [
            { "type": "string" },
            { "type": "number" },
            { "type": "boolean" },
            { "type": "object" },
            { "type": "null" }
          ]
        },
        "excerpt": {
          "type": "string",
          "maxLength": 280,
          "description": "Verbatim substring of the source. MUST be byte-identical to text in the source paragraphs — the hallucination guard rejects any excerpt not literally present."
        },
        "citation": {
          "type": "object",
          "required": ["page", "char_span"],
          "additionalProperties": false,
          "properties": {
            "page": { "type": "integer", "minimum": 1 },
            "char_span": {
              "type": "array",
              "minItems": 2,
              "maxItems": 2,
              "items": { "type": "integer", "minimum": 0 },
              "description": "[start_char_offset, end_char_offset] within the page's extracted text."
            }
          }
        },
        "confidence": { "type": "string", "enum": ["high", "medium", "low"] },
        "note": { "type": "string" },
        "error": { "type": "string" }
      },
      "allOf": [
        {
          "if": { "properties": { "status": { "const": "extracted" } } },
          "then": { "required": ["value", "excerpt", "citation", "confidence"] }
        },
        {
          "if": { "properties": { "status": { "const": "not_present" } } },
          "then": { "required": ["note"] }
        },
        {
          "if": { "properties": { "status": { "const": "error" } } },
          "then": { "required": ["error"] }
        }
      ]
    }
  }
}

# Citation format — TEMPLATE

> The clause-extraction skill emits a citation on every extracted clause so
> the downstream reviewer can verify the extraction in seconds rather than
> re-reading the contract. Without a usable citation grammar, extractions
> are unfalsifiable — and unfalsifiable extractions become silent CLM data
> rot. This file pins the format and the fallback rules.

## Citation grammar

A citation is a structured object, not a string. The skill emits:

```json
{
  "page": 14,
  "char_span": [1820, 1980]
}
```

- `page` — 1-indexed page number in the source PDF, or paragraph cluster index for `.docx` (since `.docx` has no fixed pagination).
- `char_span` — `[start, end]` character offsets within the page's extracted text, where `start` is inclusive and `end` is exclusive.

The `excerpt` field on the clause record is the verbatim substring at that span. The skill enforces that `page_text[char_span[0]:char_span[1]] == excerpt`. If the assertion fails, the extraction is rejected and the clause is recorded with `status: "error", error: "excerpt_not_grounded"`.

## Why structured, not "p. 14, ¶ 3"

Free-text citations like "p. 14, paragraph 3" cannot be machine-verified. A reviewer cannot click them. A pipeline cannot diff them across re-runs. A regression test cannot assert "the citation moved by exactly N characters when we re-ran extraction after taxonomy v2." Structured citations make every extraction reproducible and reviewable.

## Reviewer UX expectations

Downstream tooling (CLM, review queue, audit log) is expected to render the citation as a deep link into the source PDF page with the excerpt highlighted. Without that affordance, reviewers fall back to ctrl-F on the excerpt — which works, but doubles review time.

Recommended renderer behavior:

- Display the excerpt with the citation page badge inline
- On click, open the source PDF at the cited page with the excerpt highlighted (PDF.js supports `#highlight=<text>`)
- Show `confidence` as a colored chip: high = green, medium = amber, low = red

## "Not present" — the load-bearing answer

When a clause is not located, the citation is omitted and `status: "not_present"` is set with a `note` field documenting the search:

```json
{
  "value": null,
  "status": "not_present",
  "note": "Searched headings: ['Most Favored', 'MFN', 'Pricing']; searched synonyms: ['most favored', 'no less favorable']; no matching paragraphs found."
}
```

This is intentionally explicit. CLM backfill pipelines treat a `null` with `status: "not_present"` as confirmed-absent (file the contract without that field) and a `null` with `status: "error"` as needs-rerun (do not file). Conflating the two corrupts CLM data over time.

## Cross-reference handling

When the matched paragraph is a pointer ("as set forth in Schedule A"), the skill emits:

```json
{
  "value": "see Schedule A",
  "excerpt": "Liability shall be limited as set forth in Schedule A.",
  "citation": { "page": 18, "char_span": [220, 274] },
  "confidence": "medium",
  "note": "cross-reference; manual resolution required"
}
```

Resolving cross-references is out of scope. The skill could chase the reference into Schedule A, but the failure modes (mis-numbered schedules, amendments overriding the schedule, partially-resolved chains) make naive resolution worse than an honest "needs human" flag.

## Confidence calibration

| Confidence | Meaning | Reviewer action |
|---|---|---|
| `high` | Heading match + synonym match + clean excerpt grounded in source | Spot-check 10% sample; trust the rest |
| `medium` | Synonym match without heading, OR cross-reference, OR low heading density on the contract overall | Review every record |
| `low` | Multiple candidate paragraphs and the model picked one with weak signal, OR excerpt is ≥ 200 chars | Review every record before filing |

The skill MUST NOT emit `high` for a record that did not pass the byte-identical excerpt check. There is no "high-confidence hallucination" case — by construction.

## Last edited

{YYYY-MM-DD}