claude-skill

Salesforce-Datenbereinigung mit einem Claude Skill

Difficulty

Profi

Setup time

90min

For

revops

RevOps

Stack

Ein Claude Skill, der Salesforce nach dem Datenmüll durchsucht, der Ihre Berichte still verzerrt — doppelte Accounts, verwaiste Contacts, Junk-Leads, fehlerhafte Telefonnummern, Account-Contact-Fehlanpassungen und Stage-Werte, die die Funnel-Definition verletzen — und dann Korrekturen als CSV vorschlägt, die der Operator genehmigt, bevor ein Write landet. Der Skill schreibt nie ohne expliziten Trockendurchlauf plus menschliche Genehmigung, und jede angewendete Änderung wird in einem benutzerdefinierten Audit-Objekt protokolliert, damit sie rückgängig gemacht werden kann.

Das vollständige Bundle befindet sich unter apps/web/public/artifacts/salesforce-data-cleanup-skill/. Die SKILL.md enthält die Eingaben, Methode und das Ausgabeformat, dem der Skill folgt. Drei Referenzdateien dienen als ausfüllbares Gerüst des Operators: dedup-rules.md für Match-Keys und Ähnlichkeitsschwellen, stage-definitions.md für den per-Stage-Pflichtfeld-Satz und survivor-ranking.md für die Gewichte, die verwendet werden, um zu entscheiden, welcher Datensatz bei einem Merge gewinnt.

Wann verwenden

Greifen Sie auf diesen Skill zurück, wenn das Reporting aufgehört hat, vertrauenswürdig zu sein, weil die zugrunde liegenden Objektdaten schneller verfallen sind, als das Team sie bereinigen kann. Spezifische Auslöser: Eine Board-ARR-Zahl stimmt mit der Pipeline-Ansicht des CRO um mehr als ein paar Prozent nicht überein; das SDR-Team beschwert sich, denselben Prospect unter drei verschiedenen Account-Datensätzen zu treffen; ein Marketing-Attribution-Dashboard doppelzählt, weil Contacts in den falschen Accounts existieren; eine jährliche ICP-Re-Segmentierung ist blockiert, weil firmografische Felder in einem Viertel der Accounts fehlen. In all diesen Fällen ist der Engpass Datenhygiene, nicht Strategie.

Der Skill ist auch die richtige Wahl, wenn ein vorhandenes Dedup-Tool einen Shelfware-Effekt erzeugt hat — RevOps hat eine Lizenz, aber niemand vertraut den Vorschlägen genug, um zu handeln. Das Unterscheidungsmerkmal des Skills ist, dass jede vorgeschlagene Zusammenführung mit einer per-Pair-Begründungszeile geliefert wird, die den deterministischen Key zitiert, der ausgelöst hat, und die Survivor-Auswahlsignale, die die Entscheidung getroffen haben. Diese Überprüfbarkeit ist es, die die menschliche Genehmigung entsperrt, die die Bereinigung benötigt.

Wann NICHT verwenden

Verwenden Sie diesen Skill nicht, wenn einer der folgenden Punkte zutrifft.

Sie benötigen ein Echtzeit-Dedup-on-Create-Gate am Point of Lead-Capture. Der Skill ist ein Batch-Tool, das in Chunks scannt, kein synchrones Validierungsregel. Für Point-of-Create-Dedup konfigurieren Sie Salesforces native Duplicate Rules.

Sie benötigen, dass der Skill Writes automatisch anwendet. Es gibt kein Auto-Modus by design. Jede Korrektur durchläuft eine Trockendurchlauf-CSV, die der Operator mit Approve=Y markieren muss, bevor apply_fix eine Zeile berührt. Wenn das Betriebsmodell unüberwachte Writes erfordert, ist der Skill die falsche Form und die richtige Antwort ist ein deterministischer ETL-Job mit expliziter Inhaber-Freigabe im Change-Management-Prozess.

Sie reagieren auf eine DSGVO- oder CCPA-Right-to-Erasure-Anfrage. Verwenden Sie den dokumentierten PII-Löschungsfluss der Plattform, der durch Legal geleitet wird und den richtigen Papierpfad erzeugt. Improvisieren Sie nicht mit einem Bereinigungstool.

Sie möchten Hard-Deletes, die den Papierkorb umgehen. Der Skill hat keinen Hard-Delete-Codepfad. Papierkorb-Disziplin ist nicht verhandelbar; permanente Bereinigungen sind eine bewusste manuelle Plattformaktion.

Der erste Lauf erfolgt gegen Production mit einem schreibberechtigten Token. Zwei schreibgeschützte Scan-Zyklen sind erforderlich, bevor der Skill Write-Credentials akzeptiert, und selbst dann sollte der erste Write-Lauf ein Sandbox-Rehearsal eines Account-Merges sein.

Einrichtung

Legen Sie das Bundle unter apps/web/public/artifacts/salesforce-data-cleanup-skill/ in ~/.claude/skills/salesforce-data-cleanup/ ab. Der Skill-Loader nimmt SKILL.md und das references/-Verzeichnis automatisch auf.
Setzen Sie SFDC_TOKEN auf ein schreibgeschütztes Connected-App-Token. Setzen Sie SFDC_INSTANCE_URL auf den Sandbox-Endpunkt, nicht Production. Der Skill hat sandbox=true als Standard und verweigert das Umschalten ohne ein explizites Override-Flag.
Ersetzen Sie den Inhalt von references/dedup-rules.md, references/stage-definitions.md und references/survivor-ranking.md durch die echten Regeln des Teams. Die Vorlagen sind Gerüste; gegen eine Live-Org zu laufen, wird by design eine hohe False-Positive-Rate produzieren.
Stellen Sie das benutzerdefinierte SObject Cleanup_Audit__c in den Sandbox- und Production-Orgs bereit, indem Sie die in SKILL.md unter „Method, Schritt 5” dokumentierte Feldform verwenden. Das Prüfprotokoll macht Läufe reversibel — ohne es sollte apply_fix nicht ausgeführt werden.
Führen Sie den ersten Discovery-Scan aus. scan_data_health(scope="Account,Contact,Lead,Opportunity"). Erwarten Sie, dass der Scan beim ersten Durchgang Schwachstellen im Dedup-Regelwerk aufzeigt — das ist der Sinn der schreibgeschützten Zyklen.

Was der Skill tatsächlich tut

Der Skill führt fünf Schritte in Reihenfolge aus, vollständig dokumentiert in SKILL.md. Der Discovery-Scan zieht jedes in-scope SObject über die Bulk API in Chunks, da eine einzelne REST-Abfrage gegen eine 100k-Account-Org an Governor-Limits stoßen würde und chunked Bulk die Timeout-Obergrenze bei großen Pulls vermeidet.

Dedup verwendet einen zweistufigen Hybrid-Ansatz. Durchgang eins ist deterministisch — kleingeschriebene Domain, E.164-normalisierte Telefonnummer, NFKD-normalisierter Name mit entfernten Unternehmens-Suffixen. Exakte Übereinstimmungen bei einem einzigen starken Key gehen als high-Konfidenz zur Vorschlags-CSV. Durchgang zwei ist ein Claude-Semantic-Similarity-Vergleich, aber nur für Kandidatenpaare, die bereits ein schwaches deterministisches Signal teilen (gleiche ersten sechs Telefonstellen, gleicher erster Namens-Token, gleiche parent-domain-TLD). Der Narrow-then-rank-Ansatz hält die pro-Scan-Token-Kosten unter fünf Dollar bei einer 100k-Account-Org; reines paarweises Semantik über N^2-Datensätze ist sowohl teuer als auch bei häufigen Namen noisy.

Survivor-Auswahl für Merges verwendet einen zusammengesetzten Score: 0,4 Gewicht auf Aktivitäts-Aktualität der letzten 90 Tage von Tasks und Events, 0,3 Gewicht auf angehängte Contact-Anzahl, 0,2 Gewicht auf Opportunity-Geschichte (Anzahl plus Log des Betrags), 0,1 Gewicht darauf, ob LastModifiedById der Integrationsbenutzer ist. Kein einzelnes Signal ist für sich allein zuverlässig — Most-recent-modification zeigt oft auf einen Backfill, Contact-Anzahl begünstigt alte Datensätze, und Opportunity-Betrag allein verwirft die aktive Beziehung. Der zusammengesetzte Score verfolgt, wo das Team heute tatsächlich arbeitet.

Trockendurchlauf gibt eine CSV mit Operation, Field, Old_Value, New_Value, Confidence, Survivor_Id, Rationale und einer Approve-Spalte aus, die der Operator setzen muss. Apply liest die genehmigte CSV und schreibt über Bulk API, protokolliert jede Änderung in Cleanup_Audit__c mit sowohl dem vorherigen als auch dem neuen JSON-Wert, sodass ein revert(run_id)-Begleiter die Originale erneut anwenden kann.

Kostenrealität

Ein Discovery-Scan gegen eine 100k-Account-, 500k-Contact-Org läuft in ca. zwanzig Minuten Wanduhrzeit und verbraucht ca. 3-5 Dollar Claude API-Token für den Semantic-Similarity-Durchgang. Bulk API-Aufrufkontingent-Nutzung liegt im niedrigen Hunderte-Bereich pro Scan; weit unter der täglichen Obergrenze jeder Standard-Org. Der angewendete Write-Lauf ist selbst die kleinere Kosten — Bulk API-Writes verbrauchen keine Claude-Token, sie verbrennen nur ein paar zusätzliche API-Aufrufe pro genehmigtem Zeilen-Chunk.

Die Headcount-Rechnung ist die eigentliche Geschichte. Ein typischer RevOps-Bereinigungssprint bei den oben genannten Größen läuft zwei oder drei Wochen mit der Zeit eines Analysten pro Quartal plus ein paar Tage von einem Salesforce-Admin. Der Skill reduziert das auf ca. zwei Tage pro Quartal — ein Discovery-Scan, einen halben Tag zur Überprüfung der Trockendurchlauf-CSVs, ein Sandbox-Rehearsal etwaiger Account-Merges und ein Apply-Lauf. Bei einem voll ausgelasteten RevOps-Gehalt ist das eine bedeutende Einsparung über ein Jahr.

Die Kosten, die der Skill nicht eliminiert, sind der Rep-Kommunikations-Overhead. Ein Merge-Lauf ohne Komms verbrennt Vertrauen schneller als die schlechten Daten, und das change_brief.md, das der Skill neben jedem angewendeten Lauf ausgibt, ist eine Vorlage, die der Operator noch senden muss.

Erfolgsmetrik

Achten Sie auf eine Zahl pro Scan: den Anteil der high-Konfidenz-Vorschläge, die der Operator bei der ersten Überprüfung genehmigt. Beim ersten Lauf liegt diese Zahl typischerweise unter fünfzig Prozent — das ist die Feinjustierung des Dedup-Regelwerks, nicht der Skill, der underperformt. Nach drei oder vier Scan-Zyklen, mit angepassten Regeln, sollte diese Zahl über achtzig Prozent liegen. Unter diesem Boden bei Zyklus vier stimmen die Dedup-Regeln in references/dedup-rules.md immer noch nicht mit den Daten überein und benötigen einen weiteren Durchgang vor weiteren Write-Läufen.

Eine sekundäre Metrik: Stage-Verletzungsanzahl über die Zeit. Eine gesunde Org mit einer echten Funnel-Definition sollte sehen, dass diese Zahl Monat für Monat sinkt, wenn RevOps die vorgelagerten Ursachen behebt — Pflichtfeld-Validierungsregeln, Stage-Übergangs-Automatisierungen, Lead-Routing-Logik. Wenn die Stage-Verletzungsanzahl über Bereinigungszyklen hinweg konstant bleibt, ist das Dirty-Data-Problem Prozess und nicht Daten.

vs. Alternativen

DemandTools ist der Incumbent in diesem Bereich. Es ist ein ausgereiftes, deterministisches, GUI-gesteuertes Tool, das RevOps-Teams seit einem Jahrzehnt verwenden. Es ist hervorragend für hochvolumiges deterministisches Dedup; es ist schwächer beim Survivor-Begründungs-Prüfpfad, den dieser Skill ausgibt, und es kann den Semantic-Similarity-Durchgang für unscharfe Firmennamen nicht ohne eine separate Scripting-Ebene durchführen. Wenn das Team bereits für DemandTools zahlt und das Dedup-Regelwerk ausgereift ist, bleiben Sie dabei und ziehen Sie diesen Skill nur für die semantischen Dedup-Randfälle und die Prüfprotokoll-Disziplin in Betracht.

Cloudingo ist der engere Vergleichspunkt — er hat Fuzzy-Matching und einen Review-then-apply-Workflow, der dem ähnelt, was der Skill produziert. Cloudingo ist benutzerfreundlicher für einen nicht-technischen RevOps-Lead. Der Vorteil des Skills ist die per-Pair-Begründungszeile und das Referenzdatei-Modell, das es dem Team ermöglicht, seine Dedup-Regeln in Git neben dem Rest der RevOps-Konfiguration zu versionieren. Wenn RevOps allergisch gegen Git ist, gewinnt Cloudingo.

Ein manueller RevOps-geführter Bereinigungssprint ist die Status-quo-Alternative für Teams ohne Dedup-Tool. Er funktioniert, verbraucht aber die oben dokumentierte Analysten-Zeit und produziert kein wiederverwendbares Artefakt — der nächste Sprint beginnt von vorne. Der Discovery-Scan des Skills ist jedes Mal dasselbe Artefakt, was die Arbeit zusammensetzbar macht.

Fallstricke

Schreibzugriff beim ersten Lauf zu gewähren ist der häufigste Fehler. Der erste Scan zeigt Schwachstellen im Dedup-Regelwerk genauso wie in den Daten; wenn der Skill sie anwendet, werden die False Positives zu echten, auditierten Writes. Der Guard: Der Skill verweigert apply_fix, wenn das konfigurierte Token Schreibbereich hat und das Prüfprotokoll null vorherige Trockendurchlauf-Zeilen für den Scope des Laufs zeigt. Mindestens zwei schreibgeschützte Zyklen, unabhängig davon, wie vertrauenswürdig die Regeln aussehen.

Account-Merges, die auf die falschen Datensätze kaskadieren, ist der teuerste Fehler. Ein falscher Survivor nimmt die falschen Opportunities, Tasks, Events und Contact Roles mit. Der Guard: apply_fix für jede dedup_account-Zeile verweigert die Ausführung, es sei denn, ein Sandbox-Rehearsal mit demselben Run_Id-Präfix ist in den letzten vierzehn Tagen erfolgt, und der Operator hat --rehearsed=true gesetzt. Das Sandbox-Rehearsal ist keine optionale Zeremonie — es ist der Ort, wo die kaskadierenden Nebeneffekte eines gegebenen Merges tatsächlich beobachtet werden.

Reps, die zu zusammengeführten Accounts aufwachen, von denen sie nie gehört haben, ist das kulturelle Versagen, das zukünftige Bereinigungsläufe zunichte macht. Der Guard: Der Skill gibt ein change_brief.md neben jedem angewendeten Lauf aus, das die Merge-Karte, Inhaber-E-Mails und die Anzahl der verschobenen Opportunities auflistet, bereit zum Einfügen in einen Slack-Kanal, bevor Reps sich einloggen. Senden Sie es. Das Überspringen des Komms-Schritts verbrennt Vertrauen schneller als die schlechten Daten es je getan haben.

Hard-Deletes, die den Papierkorb umgehen, ist eine Anfrage, die auftaucht, aber abgelehnt werden sollte. Der Guard: Der Skill hat keinen Hard-Delete-Codepfad. soft_delete ist die einzige Delete-Operation; wer eine permanente Bereinigung will, macht dies durch den manuellen Workflow der Plattform mit der entsprechenden Freigabe.

Stack

Salesforce — Source of Truth und Ziel der Writes; benutzerdefiniertes Objekt Cleanup_Audit__c enthält das reversible Prüfprotokoll
Claude — führt den Semantic-Similarity-Durchgang aus und gibt die per-Pair-Begründungszeilen aus, die Merges überprüfbar machen
Bulk API — für sowohl Reads (chunked Discovery) als auch Writes (chunked Apply) verwendet; nie die synchrone REST-Abfrage-API für vollständige Scans

Diese Seite auf GitHub bearbeiten

Files in this artifact

Download all (.zip)

---
name: salesforce-data-cleanup
description: Find and propose fixes for Salesforce data-quality issues — duplicate accounts, orphan contacts, junk leads, malformed phones, mismatched account-contact relationships, and stage values that violate funnel definitions. Always proposes; never writes without explicit dry-run review and approval.
---

# Salesforce data cleanup

## When to invoke

Whenever you need to surface and propose fixes for Salesforce data-quality issues that are quietly distorting reporting: duplicate accounts and contacts, orphan records (contacts with no account, opportunities with no contact roles), junk leads (test records, role-address contacts, free-mail in B2B), malformed phones and websites, account-contact mismatches (contact email domain does not match account website domain), and stage values that violate the funnel definition (e.g. "Closed Won" with no Close Date, "Negotiation" with zero opportunity contact roles).

The Skill takes an object scope, a dedup-rules file, a stage-definition file, and an ICP rubric. It produces a discovery scan, then per-issue dry-runs as CSVs the operator approves before any write.

Do NOT invoke this skill for:

- Auto-applying writes without a dry-run + human approval. Every fix goes through the dry-run CSV gate. There is no "auto" mode.
- First runs against production with write access. The first two scan cycles must use a read-only API token. Write credentials only after the operator has reviewed the false-positive rate.
- Hard-deletes. The Skill never bypasses the recycle bin. Deletes are soft by default; permanent purges are out of scope and a deliberate manual decision.
- Bulk merges in production without a sandbox rehearsal. Account merges cascade to opportunities, activities, and contact roles — sandbox first.
- Data residency / PII purge requests under GDPR or CCPA. Use the platform's documented right-to-erasure flow, not this Skill.

## Inputs

- Required: `scope` — comma-separated SObject names to scan (e.g. `Account,Contact,Lead,Opportunity`). Default: all four.
- Required: `dedup_rules_path` — path to a `references/dedup-rules.md` style file naming the deterministic match keys (domain, normalized phone, normalized name) and the fuzzy-match thresholds.
- Required: `stage_definitions_path` — path to a `references/stage-definitions.md` style file naming the required field set per Opportunity stage.
- Required: `icp_rubric_path` — path to a `references/icp-rubric.md` style file used to evaluate whether orphan or low-signal records are worth keeping.
- Optional: `survivor_ranking_path` — path to a `references/survivor-ranking.md` style file with weights for activity recency, contact count, opportunity history, and last-modified-by user trust. Default weights apply if absent.
- Optional: `sandbox` — boolean, default `true`. Forces the Skill to talk to the configured sandbox endpoint, not production.
- Optional: `chunk_size` — integer, default `10000`. Bulk API query chunk size.

## Reference files

Always read the following from `references/` before scanning. Without them, the discovery output is generic and the dry-run proposals will not match the operator's actual definitions.

- `references/dedup-rules.md` — deterministic and fuzzy match rules (replace the template with your real rules)
- `references/stage-definitions.md` — required fields per Opportunity stage (replace with your actual funnel definition)
- `references/survivor-ranking.md` — weights for choosing the survivor in a merge proposal (replace with your actual priorities)

## Method

Run these five sub-tasks in order. Do not parallelize: later steps depend on the survivor-selection context produced by earlier steps.

### 1. Discovery scan

Pull each in-scope SObject via the Bulk API in chunks (default 10k rows). Bulk API, not REST query, because a full Account scan on a real org is 100k+ rows and REST chops on governor limits. Chunked because a single 1M-row pull blows past memory and the job times out at the org's configured Bulk API batch ceiling.

For each chunk, classify rows into issue buckets:

- `dedup_account` / `dedup_contact` / `dedup_lead`
- `orphan_contact` (no AccountId), `orphan_opp` (no contact roles)
- `junk_lead` (free-mail in B2B, role addresses, test patterns)
- `format_violation_phone` / `format_violation_website` (regex-fail)
- `mismatch_account_contact` (contact email domain != account website)
- `stage_violation` (required field per stage missing)

Output: a one-page scan summary with per-bucket counts and confidence band.

### 2. Dedup pass — deterministic + semantic hybrid

For dedup buckets the Skill runs a two-pass pipeline. Pass one is pure deterministic regex/normalization: lowercased email domain, E.164-normalized phone, NFKD-normalized + lowercased + suffix-stripped name. Exact matches on either domain or normalized phone go straight to the proposal CSV.

Pass two is Claude semantic similarity, applied only to candidate pairs that share at least one weak deterministic signal (same first 6 digits of phone, same first token of name, same parent-domain TLD). Hybrid because pure regex misses "Acme, Inc." vs "Acme Incorporated — APAC", and pure semantic similarity is too expensive to run pairwise across N^2 records and produces too many false positives on common names. The narrow-then-rank approach holds the per-scan token spend to under $5 for a 100k-Account org.

Confidence bands: `high` (deterministic match on 2+ keys), `medium` (deterministic on 1 key + semantic similarity ≥ 0.85), `low` (semantic only). The dry-run CSV ships only `high` and `medium` by default.

### 3. Survivor selection for merges

For each duplicate pair, propose a survivor using a weighted score:

- Activity recency (last 90 days of Tasks + Events): weight 0.4
- Contact count attached: weight 0.3
- Opportunity history (count + total Amount): weight 0.2
- LastModifiedBy not equal to integration user: weight 0.1

The Skill ranks both records, proposes the higher score as survivor, and emits a per-pair rationale line in the dry-run CSV. RevOps reviews and overrides via a `survivor_override` column before approving.

This composite is used because no single signal is reliable: most-recent modification often points at an integration backfill, contact count favors old crusty records, and Opportunity Amount alone discards the active relationship. The composite tracks "where the team is actually working."

### 4. Dry-run CSV per issue

For the issue the operator picks, emit a CSV with one row per proposed change. Columns: `Id`, `Object`, `Operation` (`merge` / `update` / `soft_delete`), `Field`, `Old_Value`, `New_Value`, `Confidence`, `Survivor_Id` (merges only), `Rationale`, `Approve` (operator-set, default blank).

The Skill writes nothing until a CSV with `Approve=Y` rows is passed back into `apply_fix`. Rows without `Approve=Y` are skipped, including blanks. This is the gate.

### 5. Apply with audit

`apply_fix(issue_id, csv_path)` reads the approved CSV and posts the writes in chunks via the Bulk API. Every write is logged to a custom SObject `Cleanup_Audit__c` with: `Operation`, `Target_Id`, `Field`, `Old_Value_JSON`, `New_Value_JSON`, `Survivor_Id`, `Run_Id`, `Approver_User`, `Timestamp`. The audit record makes the change reversible: a `revert(run_id)` companion script reads the audit log and re-applies `Old_Value_JSON`.

## Output format

The discovery scan returns a Markdown report and writes one dry-run CSV per issue picked. Literal example:

```markdown
# Salesforce data-cleanup scan — 2026-05-03

Scope: Account, Contact, Lead, Opportunity
Sandbox: true
Run ID: run_2026-05-03_a8c1

## Summary by issue type

| Issue | Count | High conf | Medium conf | Low conf |
|----------------------------|------:|----------:|------------:|---------:|
| dedup_account | 1,284 | 412 | 736 | 136 |
| dedup_contact | 8,902 | 3,118 | 4,901 | 883 |
| orphan_contact | 1,071 | 1,071 | - | - |
| junk_lead | 942 | 942 | - | - |
| format_violation_phone | 4,820 | 4,820 | - | - |
| mismatch_account_contact | 612 | 612 | - | - |
| stage_violation | 217 | 217 | - | - |

Pick one to dry-run: `dry_run_fix(issue_id="dedup_account")`
```

Per-issue dry-run CSV (first three lines):

```csv
Id,Object,Operation,Field,Old_Value,New_Value,Confidence,Survivor_Id,Rationale,Approve
0011x00000ABCD1,Account,merge,,,,high,0011x00000WXYZ9,"E.164 phone match + same primary domain; survivor has 14 activities in last 90d vs 0",
0031x00000EFGH2,Contact,update,Email,john@oldmail.example,john@newmail.example,high,,Domain matches account website,
00Q1x00000IJKL3,Lead,soft_delete,,,,high,,Free-mail (gmail) in B2B segment + no activity 18mo,
```

Audit log entry shape (one row in `Cleanup_Audit__c` per write):

```json
{
"Operation": "merge",
"Target_Id": "0011x00000ABCD1",
"Field": null,
"Old_Value_JSON": "{\"Name\":\"Acme, Inc.\",\"Phone\":\"+1-415-555-0100\"}",
"New_Value_JSON": "{\"merged_into\":\"0011x00000WXYZ9\"}",
"Survivor_Id": "0011x00000WXYZ9",
"Run_Id": "run_2026-05-03_a8c1",
"Approver_User": "revops.lead@example.com",
"Timestamp": "2026-05-03T17:42:11Z"
}
```

## Watch-outs

- **Granting write access on the first run.** The first scan will surface false positives because it is also surfacing flaws in the dedup ruleset. Guard: the Skill refuses to run `apply_fix` if the configured token has write scope and `Cleanup_Audit__c` shows zero prior `Operation=dry_run` rows for the run's scope. Two read-only cycles minimum.
- **Account merge cascading to wrong records.** Account merges propagate to all child Opportunities, Contacts, Tasks, and Events. A wrong survivor destroys the wrong history. Guard: `apply_fix` for `dedup_account` rows refuses to run unless `sandbox=true` has been set on at least one prior run with the same `Run_Id` prefix in the last 14 days, and the operator has set `--rehearsed=true` explicitly.
- **Hard-deletes bypassing recycle bin.** Operators sometimes ask the Skill to "really delete" — to skip the recycle bin and free storage. Guard: the Skill has no hard-delete code path. `soft_delete` is the only delete operation; recycle-bin emptying is a manual platform action.
- **Reps waking up to merged accounts.** A clean dedup run with no comms burns trust faster than the bad data did. Guard: the Skill emits a `change_brief.md` alongside every applied run, listing the merge map (loser-Id → survivor-Id, owner email, count of moved opps) ready to paste into a Slack channel before reps log in.

# Dedup rules — TEMPLATE

> Replace this template's contents with your team's actual matching rules.
> The Skill reads this file on every run; without your real rules, the
> dedup proposals will be generic and produce a high false-positive rate
> on the first scan.

## Object scope

Which SObjects this ruleset applies to. The Skill runs the rules per object; rules can differ across Account / Contact / Lead.

- [ ] Account
- [ ] Contact
- [ ] Lead

## Deterministic match keys

Pass one of the dedup pipeline. Two records that match on any single key listed here are surfaced as `high` confidence dedup candidates.

### Account

| Key                       | Normalization                                  | Match strength |
|---------------------------|------------------------------------------------|----------------|
| Primary domain            | lowercase, strip `www.`, strip TLD aliases     | strong         |
| Billing phone             | E.164 normalize, drop extension                | strong         |
| Stripped name + country   | NFKD, lowercase, drop suffixes (Inc/LLC/GmbH)  | medium         |
| D-U-N-S number            | exact                                          | strong         |

### Contact

| Key                       | Normalization                                  | Match strength |
|---------------------------|------------------------------------------------|----------------|
| Email                     | lowercase, strip `+tag`, strip dots in gmail   | strong         |
| Mobile phone              | E.164 normalize                                | strong         |
| Full name + AccountId     | NFKD, lowercase, drop punctuation              | medium         |

### Lead

| Key                       | Normalization                                  | Match strength |
|---------------------------|------------------------------------------------|----------------|
| Email                     | lowercase, strip `+tag`                        | strong         |
| Phone                     | E.164 normalize                                | strong         |

## Fuzzy / semantic similarity threshold

Pass two of the pipeline. Only candidates that share at least one weak signal (same first 6 digits of phone, same first token of name, same parent-domain TLD) are submitted for semantic comparison.

| SObject | Cosine similarity threshold | Notes                                    |
|---------|-----------------------------|------------------------------------------|
| Account | 0.85                        | Lower threshold misses APAC/EU branches  |
| Contact | 0.90                        | Higher because false positives expensive |
| Lead    | 0.80                        | Lower because cleanup tolerates churn    |

## Disqualifiers — never propose merge

Even if rules match, do not propose merge if any of the following holds:

- Records belong to different `RecordType` (e.g. Customer vs Partner)
- Records are owned by different BU `Sales_Region__c` values
- One record has an active Opportunity in `Negotiation` or later — defer to rep review, do not auto-propose
- Records carry a `Do_Not_Merge__c = TRUE` flag set by an admin

## Last edited

{YYYY-MM-DD}

# Stage definitions — TEMPLATE

> Replace this template's contents with your team's actual Opportunity
> stage definition. The Skill uses this to flag rows in the
> `stage_violation` bucket — opportunities sitting in a stage without
> the fields that stage requires.

## Funnel stages

List every active value of the Opportunity `StageName` picklist, in funnel order. Mark which stages are open vs closed-won vs closed-lost.

| Order | StageName             | Type        | Probability % |
|------:|-----------------------|-------------|---------------:|
| 1     | Prospecting           | open        |              5 |
| 2     | Qualification         | open        |             15 |
| 3     | Discovery             | open        |             30 |
| 4     | Solutioning           | open        |             50 |
| 5     | Negotiation           | open        |             70 |
| 6     | Verbal Commit         | open        |             90 |
| 7     | Closed Won            | closed-won  |            100 |
| 8     | Closed Lost           | closed-lost |              0 |

## Required fields per stage

A row sitting in stage N must have all fields listed for N (and all earlier stages) populated. The Skill flags rows that violate this as `stage_violation` candidates.

### Qualification

- `AccountId` not null
- `Primary_Contact__c` not null
- `Lead_Source` not null

### Discovery

- All of Qualification, plus:
- At least 1 OpportunityContactRole with `IsPrimary = TRUE`
- `Pain_Point__c` not null
- `Compelling_Event__c` not null

### Solutioning

- All of Discovery, plus:
- `Decision_Criteria__c` not null
- `Decision_Process__c` not null
- At least 2 OpportunityContactRoles

### Negotiation

- All of Solutioning, plus:
- `Economic_Buyer_Identified__c = TRUE`
- `Mutual_Action_Plan_URL__c` not null

### Verbal Commit

- All of Negotiation, plus:
- `Procurement_Status__c` not null
- `Close_Date` within the next 45 days

### Closed Won

- All of Verbal Commit, plus:
- `Close_Date` not in the future
- `Amount` not null and greater than zero
- `Contract_URL__c` not null
- At least 1 `Quote` record with `Status = "Accepted"`

### Closed Lost

- `Close_Date` not in the future
- `Loss_Reason__c` not null
- `Competitor__c` not null when `Loss_Reason__c = "Competitor"`

## Skip-stage rules

Stages that may be legitimately skipped (the Skill does not flag a violation if the row passes through them):

- `Solutioning` may be skipped for inbound deals under $10k ACV
- `Verbal Commit` may be skipped for self-serve conversions

## Last edited

{YYYY-MM-DD}

# Survivor ranking — TEMPLATE

> Replace this template's contents with your team's actual weighting.
> The Skill uses these weights when proposing which record survives a
> merge. Defaults below reflect the bias "where the team is actually
> working today" rather than "who has the most history."

## Composite score

For each record in a duplicate pair, the Skill computes:

```
score = 0.40 * activity_recency_signal
      + 0.30 * contact_count_signal
      + 0.20 * opportunity_history_signal
      + 0.10 * not_integration_user_signal
```

Higher score wins survivor status. Ties default to the record with the lower (older) Salesforce Id, which is conventionally the canonical record.

## Signal definitions

### activity_recency_signal

- 1.0 if Tasks + Events in the last 30 days exist
- 0.7 if in the last 90 days
- 0.3 if in the last 365 days
- 0.0 otherwise

### contact_count_signal

- Normalized count of active Contacts attached to the Account. `min(count, 20) / 20`. Capped to avoid one record's bloat dominating.

### opportunity_history_signal

- 0.5 weight on count of Opportunities with `IsClosed = FALSE`
- 0.5 weight on log10(1 + sum of `Amount` across all Opportunities)
- Both halves min-max normalized across the candidate pair only

### not_integration_user_signal

- 1.0 if `LastModifiedById` is not in the integration-user allowlist
- 0.0 otherwise

The integration-user allowlist (set by RevOps admin):

- `dataloader@example.com.invalid`
- `marketo-sync@example.com.invalid`
- `outreach-sync@example.com.invalid`
- {add yours}

## Override columns

Even after the Skill proposes a survivor, RevOps can override per-row in the dry-run CSV. The columns the Skill respects on read-back:

- `survivor_override` — set to the Id you want as survivor
- `do_not_merge` — set to `Y` to drop this row from `apply_fix`
- `defer_to_owner_review` — set to `Y` to route to the rep instead

## Disqualifiers — never auto-propose a survivor

Skip survivor proposal entirely if any of:

- Both records have an open Opportunity in `Negotiation` or later — too much risk of breaking deal motion mid-flight
- Records have different Account Owners and one is in their first 30 days on the team (avoid trampling new-hire pipelines)
- One record carries a `Strategic_Account__c = TRUE` flag — kick to a RevOps human

## Last edited

{YYYY-MM-DD}