cursor-rule

Cursor rules for the ops-adjacent data engineer

Difficulty

Advanced

Setup time

15-30 min

For

data-engineer

RevOps, Legal Ops, Recruiting & TA

A .cursorrules file for the data engineer whose primary internal customers are ops teams: RevOps, Legal Ops, and Recruiting. The bundle lives at apps/web/public/artifacts/cursor-rules-data-engineer-ops/.cursorrules. Drop it into .cursor/rules/ in your data platform repository and stop spending the next quarter arguing with your AI assistant about whether "this model should be incremental" or whether "this sync needs a unique_key".

The defining trait of ops-adjacent data work is that your pipelines feed decisions, not just dashboards. A duplicate row in a revenue pipeline model does not trigger an alert; it silently inflates the opportunity count the VP of Sales uses to set quota. A faulty reverse-ETL sync does not fail visibly; it overwrites Salesforce records with stale data that the forecast model then treats as current. The rules in this bundle encode the engineering decisions that keep ops data correct under pressure: idempotence by default, mandatory unique tests, sync sources materialized in the warehouse, explicit rate limits on every external call, and a structured escalation path for when the user reaches for a shortcut.

When to use this

You build and maintain data pipelines with dbt, a cloud warehouse (Snowflake or BigQuery), a reverse-ETL tool (Census or Hightouch), and an orchestrator (n8n or Airflow). Your models feed GTM forecasts, contract analytics for Legal Ops, or headcount models for Recruiting, not just BI dashboards. You write SQL and Python in Cursor and want the AI to default to the data engineering patterns that prevent silent correctness failures rather than the patterns that are fastest to type.

When NOT to use this

Your pipeline feeds a product analytics dashboard, not an ops system. Product analytics tolerates eventual consistency and approximate numbers. The rules here are calibrated to the blast radius of ops data errors (wrong CRM records, faulty headcount models, stale contract counts). The overhead (mandatory tests, incremental defaults, audit logging) is out of proportion for a dashboard that refreshes every 30 minutes and for which nobody will chase a 0.5% discrepancy.
You are a solo analyst without dbt in production. The rules assume a dbt project in version control with CI. If you run ad-hoc queries in a notebook and export to Google Sheets by hand, the rules will surface guidance that does not apply to your setup and may confuse more than they help.
Your warehouse is not Snowflake or BigQuery. The tool-specific subsections reference Snowflake and BigQuery endpoints, limits, and patterns directly. On Redshift, Databricks, or DuckDB the general principles (idempotence, tests, secrets hygiene) still apply, but the concrete guidance points at the wrong APIs.

Setup

Copy the artifact. Take .cursorrules from apps/web/public/artifacts/cursor-rules-data-engineer-ops/.cursorrules and place it in the .cursor/rules/ directory of your data repository. Cursor's Project Rules view confirms that it has loaded.
Remove what does not apply. The file contains sections for Snowflake, BigQuery, Census, Hightouch, n8n, and Airflow. Delete the sections for tools you do not use; unused guidance dilutes the signal and occasionally produces suggestions for tools that are not in your stack.
Set the service account name. Several rules reference svc_dbt_prod@company.iam as a placeholder. Replace it with your actual service account name so that Cursor suggests the right one for code that runs under a service account.
Configure the secret manager. The rules forbid inline credentials and reference a secret manager. Edit the "Secrets" section to name yours ($DBT_SNOWFLAKE_PASSWORD from AWS Secrets Manager, Doppler, 1Password CLI; pick the one your team uses) so that suggestions point at the right call.
Confirm with a test task. Ask Cursor: "Write an incremental dbt model for Salesforce opportunities that merges on opportunity_id, with a unique test and a not_null test on account_id." The output should use {{ ref() }}, declare unique_key = 'opportunity_id', include incremental_strategy = 'merge', and ship with both tests. If it does not, check Cursor's Project Rules view.
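
One way to check the result of the test task is to scan the generated model for the expected markers. The sketch below is a hypothetical string-level checker, not a dbt parser; the sample model, its column names, and the stg_salesforce_opportunities reference are made up for illustration.

```python
import re

# Illustrative regex checks for the markers named in the test task.
# These are plain string patterns and do not validate the SQL itself.
REQUIRED_PATTERNS = {
    "ref": r"\{\{\s*ref\(",
    "unique_key": r"unique_key\s*=\s*['\"]opportunity_id['\"]",
    "incremental_strategy": r"incremental_strategy\s*=\s*['\"]merge['\"]",
}


def missing_markers(model_sql: str) -> list[str]:
    """Return the names of expected markers that do not appear in the SQL."""
    return [
        name
        for name, pattern in REQUIRED_PATTERNS.items()
        if not re.search(pattern, model_sql)
    ]


# Hypothetical example of the kind of output the test task should produce.
sample_model = """
{{ config(
    materialized='incremental',
    unique_key='opportunity_id',
    incremental_strategy='merge'
) }}
select opportunity_id, account_id, amount, updated_at
from {{ ref('stg_salesforce_opportunities') }}
"""

print(missing_markers(sample_model))
```

An empty list means all three markers were found. The unique and not_null tests live in a YAML file and would need a separate check.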

What the rules actually do

The bundle is structured in five layers that apply to every Cursor prompt.

A "before writing code, ask" preamble. Five questions the model asks before it generates: the grain of the model, the downstream consumer, the incremental vs. full-refresh decision, the recovery path on failure, and where credentials live. Written down, these sound obvious. They are the questions that go unasked when an engineer is under deadline pressure to ship the sprint's next data model.

Tool-specific guidance for dbt (unique tests, ref(), incremental strategy, source freshness, service-account discipline), Snowflake (warehouse sizing, auto-suspend, query result caching, Time Travel defaults), BigQuery (partitioning requirements, slot reservations, Storage Write API, column-level policy tags, query labels), Census (materialized-source requirement, API rate limit of 60 req/min, sync identifier configuration, incremental cursor field), Hightouch (same materialization rule, API rate limit of 100 req/min, match-boosting risks on update syncs), n8n (executionOrder, per-node timezone, Code-over-IF-node rule, 1,000-item execution cap), and Airflow (retry defaults, catchup=False, XCom size limits, secret backend).

Defaults to enforce, all four with concrete values. This is the engineering core of the rules:

Rate limiting: Census API at 60 req/min, Hightouch at 100 req/min, Snowflake REST at 10 req/sec with exponential backoff (base 1s, max 30s, factor 2, 5 retries), BigQuery on-demand capped at 10 GB per query for development. Every caller uses a rate limiter; no unguarded bursts.
Idempotence: every incremental dbt model declares unique_key; every reverse-ETL sync keys on the destination's primary key; every webhook handler keys on a source event ID or a payload hash; every orchestrated job tolerates a re-run from the start of the current window.
Observability: every dbt build reports models run/failed and tests passed/failed; every reverse-ETL sync reports rows processed/succeeded/failed/skipped; every n8n and Airflow job writes a structured summary to a data-ops channel; source freshness failures route to the same channel.
Secrets: dbt profiles read credentials from environment variables ($DBT_SNOWFLAKE_ACCOUNT, $DBT_BQ_PROJECT) rather than hard-coding them in ~/.dbt/profiles.yml; one warehouse service account per environment; Census and Hightouch API keys in the secret manager, rotated quarterly; only .env.example, never .env with real values.
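
The rate-limiting default can be sketched as a small sleep-based limiter. This is a minimal illustration, assuming a single-process caller; the class name and the injectable clock and sleep parameters are hypothetical and are not taken from the Census or Hightouch client libraries.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any rolling window of `period_s` seconds."""

    def __init__(self, max_calls: int, period_s: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period_s = period_s
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of recent calls, oldest first

    def acquire(self) -> float:
        """Block until a call is allowed; return the seconds spent waiting."""
        now = self.clock()
        # Drop timestamps that have left the rolling window.
        while self.calls and now - self.calls[0] >= self.period_s:
            self.calls.popleft()
        waited = 0.0
        if len(self.calls) >= self.max_calls:
            # Wait until the oldest call in the window expires.
            waited = self.period_s - (now - self.calls[0])
            self.sleep(waited)
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
        return waited


# Limits quoted in the rules: 60 req/min for Census, 100 req/min for Hightouch.
census_limiter = SlidingWindowLimiter(max_calls=60)
hightouch_limiter = SlidingWindowLimiter(max_calls=100)
```

Calling `census_limiter.acquire()` before each request keeps any rolling 60-second window at or below 60 calls. A token bucket would permit short bursts, which is why the stricter window is used here.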

Why idempotence is a default and not an option: ops data gets reconciled against finance systems. A job that cannot be safely re-run from the beginning will eventually run twice: at a DST change, a scheduler restart, a failed mid-run recovery. When that happens, the options are "tolerate duplicates" or "data corruption". The rules remove the option of tolerating duplicates, so the job has to be built to survive the re-run.

Why observability has concrete targets instead of "add logging": a data job that exits with code 0 but processed 0 rows is a silent failure. Ops teams do not notice stale data until it affects a report. The structured summary line is the mechanism that makes "0 rows processed" visible before it reaches the Monday pipeline review.

Anti-patterns to refuse. Patterns the model rejects outright: full-refresh on a large incremental model; dbt run --full-refresh as a scheduled default in production CI; secrets in dbt --vars; reverse-ETL syncs with a view as source; dbt models without a unique test on the primary key; direct warehouse writes from notebooks without an audit log; SELECT * in production models; Airflow catchup=True on DAGs with a start_date older than 7 days.

A "when the user is wrong" section. The shortcuts that feel fast under deadline pressure and cost time later: full-refresh on a large table "because it's easier", skipping unique tests "because the source guarantees uniqueness", personal credentials for production dbt runs, reverse-ETL with a view as source "because it's faster to set up", skipping source freshness checks "because we know when the data loads". The model refuses these and explains why, not as a lecture but as a one-line redirect to the pattern that does not break at 2 a.m.

Cost reality

Token cost: zero. Cursor rules are local context on every prompt; there is no per-request cost beyond the ~6 KB they occupy in the context window.
Setup time: 15-30 minutes. Drop in the file, trim the tool sections, set the service account name and secret manager reference, run the verification task.
Per-task overhead: 1-2 conversational turns before generation, caused by the preamble questions. For a three-line query that is overhead. For a new incremental model or a reverse-ETL sync definition, the questions surface decisions that would otherwise show up as bugs in production or as findings in a data-quality review.
Cost avoided: ~2-4 hours per data-quality incident. An ops team discovering that a model has produced duplicates for two weeks (root-causing, identifying affected records, writing a fix, communicating the impact) costs 2-4 hours of engineering time and erodes trust in the pipeline for weeks. The rules that prevent the duplicate (mandatory unique test, incremental unique_key) take less than 10 seconds per model to enforce through Cursor suggestions.
Maintenance: ~30 minutes per quarter. dbt minor versions ship every few months. Census and Hightouch API versions are stable but worth a quick check. Snowflake and BigQuery limits are stable year over year. A quarterly review of the version-tagged rules keeps the file current.

Failure modes

The model is marked incremental but has no unique_key. Without a unique_key, dbt's merge strategy has nothing to merge on and falls back to append. The table accumulates duplicates on every run. In a revenue pipeline model, that means opportunity counts creep up silently. Guard: the rules refuse to generate an incremental model without a declared unique_key, and the unique test on the primary key catches the cases that slip through.
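
The difference between the two load behaviors can be shown with a toy in-memory simulation. This is not how dbt implements either strategy; the function names and the sample opportunity rows are hypothetical, and the point is only the effect of loading the same batch twice.

```python
def append_load(table: list[dict], batch: list[dict]) -> list[dict]:
    """Append behavior: every batch row is added, even if it is already present."""
    return table + batch


def merge_load(table: list[dict], batch: list[dict], unique_key: str) -> list[dict]:
    """Merge behavior: rows with a matching key are replaced, new keys are inserted."""
    by_key = {row[unique_key]: row for row in table}
    for row in batch:
        by_key[row[unique_key]] = row
    return list(by_key.values())


# Hypothetical batch of two opportunities.
batch = [
    {"opportunity_id": "006A", "amount": 100},
    {"opportunity_id": "006B", "amount": 250},
]

# Load the same batch twice, as happens after a retry or a re-run.
appended = append_load(append_load([], batch), batch)
merged = merge_load(merge_load([], batch, "opportunity_id"), batch, "opportunity_id")

print(len(appended), sum(r["amount"] for r in appended))
print(len(merged), sum(r["amount"] for r in merged))
```

The appended table holds four rows and double the pipeline amount, while the merged table still holds two rows. That doubled total is the silent inflation the guard is meant to prevent.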

The reverse-ETL sync has a dbt view as its source. The sync runs every 15 minutes. Each run re-executes the view's query against the full warehouse table. At high sync frequency on a large table, that burns warehouse credits and introduces query contention that slows other pipelines. Guard: the rules refuse to generate a sync definition that points at a view, and the dbt model's materialization (table or incremental) is checked before the sync source configuration is generated.

Credentials appear in dbt --vars or in a logged environment variable. dbt --vars '{"api_key": "sk-..."}' writes the value to dbt.log and every CI log collector. A CI system that logs env at startup captures all environment variables. Guard: the rules refuse to generate code with inline credential values and always reference the secret manager by variable name. .env.example with PLACEHOLDER_<VAR> values is generated; .env with real values is refused.

An Airflow DAG is deployed with catchup=True and a 90-day-old start_date. On first deployment, Airflow generates 90 × (runs_per_day) DAG runs and queues them. The scheduler falls behind; tasks that should run today run only once the backlog has cleared. In a DAG that triggers dbt, that means production models do not refresh while the backlog is worked off. Guard: the rules refuse to generate a DAG with catchup=True and a start_date older than 7 days, and always set catchup=False as the default for new DAGs unless the user explicitly documents the need for a historical backfill.
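
The 90 × (runs_per_day) backlog is plain arithmetic and can be worked out before deploying. The helper below is a hypothetical sketch that counts complete schedule intervals; it does not call Airflow, and the dates are invented for the example.

```python
from datetime import datetime, timedelta


def backlog_runs(start_date: datetime, deployed_at: datetime, interval: timedelta) -> int:
    """Count the complete schedule intervals between start_date and deployment."""
    if deployed_at <= start_date:
        return 0
    # Dividing one timedelta by another with // yields a whole number of intervals.
    return (deployed_at - start_date) // interval


# Hypothetical deployment date with a start_date 90 days in the past.
deployed_at = datetime(2024, 6, 1)
start_date = deployed_at - timedelta(days=90)

daily = backlog_runs(start_date, deployed_at, timedelta(days=1))
hourly = backlog_runs(start_date, deployed_at, timedelta(hours=1))
every_15_min = backlog_runs(start_date, deployed_at, timedelta(minutes=15))

print(daily, hourly, every_15_min)
```

A daily schedule queues 90 runs, an hourly one 2,160, and a 15-minute one 8,640, which is why the guard is tied to the age of start_date, not to the schedule.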

No source freshness check is declared for an ops source. An upstream pipeline breaks. The source table stops loading. dbt keeps running against the last loaded data and produces pipeline metrics that look correct but are 72 hours stale. The ops team presents the numbers in a QBR. Guard: the rules require loaded_at_field, warn_after, and error_after declarations in sources.yml for every source table and surface a source freshness failure before the dbt build proceeds.
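
The threshold logic behind warn_after and error_after can be illustrated in a few lines. This is a simplified sketch of the comparison, not dbt's implementation; the 12-hour and 24-hour thresholds and the timestamps are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(max_loaded_at: datetime, now: datetime,
                     warn_after: timedelta, error_after: timedelta) -> str:
    """Classify a source by the age of its most recent loaded_at value."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


# Hypothetical thresholds; pick values that match the source's load cadence.
WARN_AFTER = timedelta(hours=12)
ERROR_AFTER = timedelta(hours=24)

now = datetime(2024, 6, 3, 9, 0, tzinfo=timezone.utc)

# The 72-hour-stale source from the failure mode above.
stale = freshness_status(now - timedelta(hours=72), now, WARN_AFTER, ERROR_AFTER)
late = freshness_status(now - timedelta(hours=18), now, WARN_AFTER, ERROR_AFTER)
fresh = freshness_status(now - timedelta(hours=2), now, WARN_AFTER, ERROR_AFTER)

print(stale, late, fresh)
```

With these thresholds the 72-hour-old source is classified as an error long before the numbers reach a QBR.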

Versus the alternatives

No rules (status quo). Cursor generates plausible dbt SQL without unique tests, with SELECT *, and materialized as a view because that is the default. The absence of rules becomes visible the first time a reverse-ETL sync runs against a view on a 200-million-row table and the warehouse bill arrives, or the first time an ops model produces duplicated pipeline numbers that the CRO has to explain to a board.

A team data engineering style guide in Notion. Functionally equivalent to no rules for AI generation: the style guide is not in the model's context. The Cursor rules file is the style guide that is present on every prompt. The Notion doc and the .cursorrules file can coexist: the Notion doc onboards people; the rules file steers Cursor.

A linter or static analyzer (dbt-checkpoint, sqlfluff). These catch patterns after the code is written, as a post-generation check. They pair well with the Cursor rules: the rules keep the anti-pattern from being generated; the linter catches the cases that slip through. Running both together reduces the number of issues that reach code review.

Generic AI code assistant defaults. A generic Cursor session suggests the fastest-to-type pattern for a given prompt. For dbt, that is often SELECT *, no tests, materialized as a view. For a reverse-ETL sync, it is often "use the view as the source, you can change that later". The rules shift the default from "fastest to type" to "correct under ops-team scrutiny".

Reference

Bundle: apps/web/public/artifacts/cursor-rules-data-engineer-ops/.cursorrules

Place in the repository at: .cursor/rules/.cursorrules

Files in this artifact

# Ops-Adjacent Data Engineer — Cursor rules

You are pairing with a data engineer whose primary customers are internal ops teams: RevOps, Legal Ops, and Recruiting. The pipeline you maintain powers GTM forecasts, headcount models, and contract analytics — not just dashboards. A duplicate row in an incremental model doesn't break a pipeline; it silently inflates the numbers an ops leader makes a hiring decision on. Correctness and observability are non-negotiable.

Stack: dbt (models + tests + sources), a cloud warehouse (Snowflake or BigQuery), a reverse-ETL tool (Census or Hightouch), an orchestrator (n8n or Airflow), and SQL/Python glue.

---

## Before writing code, ask

Ops-adjacent data engineering is accounting work disguised as data work. Before generating any model, job, or sync, confirm:

1. **What is the grain of this model?** One row per opportunity? Per contract version? Per application? An undefined grain produces aggregation bugs that surface in ops reporting as phantom deals, duplicated headcount slots, or inflated contract TCV. If the user cannot state the grain in one sentence, stop and ask.
2. **What downstream systems consume this?** A model that feeds a reverse-ETL sync to Salesforce has different failure semantics than one that feeds a BI dashboard. A bad dashboard is fixed on refresh. A bad sync overwrites CRM records. Know the consumer before writing the model.
3. **Is this incremental or full-refresh?** Incremental models must declare `unique_key` and `incremental_strategy`. Full-refresh on a multi-hundred-million-row table is a warehouse bill, not a data pattern. Ask the volume; the answer changes the strategy.
4. **What is the recovery path when this job fails mid-run?** Partial writes to a warehouse table or a reverse-ETL sync leave the target in an intermediate state. Code that can't be safely re-run from the beginning is code that will corrupt data at 2am. Idempotence is the answer; confirm the user agrees before proceeding.
5. **Where do credentials live?** dbt profiles, warehouse service accounts, reverse-ETL API keys — never in code. If the user hasn't named a secret manager, ask before generating any code that touches auth.

If any answer is missing, ask. Do not assume ops-team defaults — they vary across companies in ways that affect financial reporting.

---

## Tool-specific guidance

### dbt

- Every model ships with a `unique` test on its primary key and a `not_null` test on every column a downstream model joins on. These are two lines. Without them, a duplicate upstream silently produces inflated pipeline numbers or double-counted headcount in ops dashboards.
- Use `{{ ref() }}`, never `database.schema.table`. Raw references bypass dbt's DAG and break environment isolation (dev vs. staging vs. prod point at different schemas; raw refs hard-wire one).
- Incremental models declare `unique_key` (one column or a list) and `incremental_strategy` explicitly. Default strategy is `merge`. `append` is appropriate only when the source guarantees no duplicates and no updates — that is rarer than teams think.
- Source freshness checks on every source table — declared in `sources.yml` with `loaded_at_field`, `warn_after`, and `error_after`. A stale source in an ops model silently breaks forecasting; the freshness test catches it before the ops team's Monday standup does.
- `dbt run` in production runs under a service account (`svc_dbt_prod@company.iam`), not a personal account. The audit trail names the service account; when the engineer leaves, the jobs don't fail.
- `dbt build` (not `dbt run`) in CI — runs models + tests in dependency order, fails fast on test failures before downstream models are materialized.
- Model file naming convention: `<layer>_<domain>_<entity>.sql` (e.g. `stg_salesforce_opportunities.sql`, `fct_revenue_pipeline.sql`). Deviations need a documented reason in the model's description block.
- `dbt docs generate` runs in CI; descriptions on every model and every column that an ops analyst will join on. "See upstream" is not a description.

### Snowflake

- Warehouse sizing: XS for development and ad-hoc queries; S for standard dbt runs; M only for models that demonstrably time out on S. Auto-suspend set to 60 seconds; auto-resume on. Warehouses left running over a weekend cost real money — set auto-suspend or refuse to generate the config without it.
- Query results are cached for 24 hours and can be reused across sessions when the query text and the underlying data are unchanged. `RESULT_SCAN` works on cached results; downstream jobs that re-run the same query within the window consume no warehouse compute. Design orchestration schedules around this where the data doesn't change faster than 24h.
- Snowflake `COPY INTO` for bulk loads; the Snowflake Connector for Python (`snowflake-connector-python>=3.0`) for programmatic writes. The REST API (`/api/v2/statements`) is available for serverless contexts where the Python connector is too heavy — rate limit is 10 requests/second per account.
- Column-level security via Dynamic Data Masking policies — not application-layer filtering. Ops data (salary bands, contract amounts, pipeline values) requires masking policies before any model exposes it to a BI tool. Ask the user which columns are sensitive before generating a model that joins on or selects them.
- Time Travel retention: 1 day by default. Transient tables are capped at 1 day; permanent tables allow up to 90 days on Enterprise Edition or higher. Set `data_retention_time_in_days = 7` on ops fact tables as a minimum. This is the "undo button" for a bad reverse-ETL sync.
- Fail-safe is 7 days on permanent tables (Snowflake-managed, not queryable) and starts when Time Travel retention ends. Document Time Travel retention plus 7 days as the outer bound for "we can recover"; beyond that, a bad sync is permanent.

### BigQuery

- Partitioned tables on ingestion timestamp or a date column — required on any table that will exceed 1 GB or be queried with a date filter. Without partitioning, a full scan on a 500M-row table costs ~$2.50 per query; with partitioning, the same query costs cents. Always ask the user if the table is partitioned before generating queries without a partition filter.
- Slot reservations for production pipelines; on-demand for development. On-demand billing at $6.25/TB scanned; production dbt runs on a fixed slot reservation are predictably priced. If the user doesn't have a reservation, warn before generating a model that scans more than ~20 GB.
- `bq` CLI for one-off loads; `google-cloud-bigquery` Python client (>=3.10) for programmatic work. The Storage Write API (`google-cloud-bigquery-storage`) is 10× faster for high-throughput writes — use it when writing more than 100K rows programmatically.
- Dataset-level IAM: `roles/bigquery.dataViewer` for analysts; `roles/bigquery.dataEditor` for the dbt service account; `roles/bigquery.admin` for the data platform team only. Column-level policy tags for sensitive columns (salary, contract value, pipeline amount).
- Query labels are mandatory for production queries: `{"team": "data-platform", "job": "dbt-prod", "environment": "production"}`. Labels appear in the billing export and are how you know which team ran the expensive query.
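
The cost figures above can be checked with simple arithmetic. The sketch below assumes binary units (GiB and TiB) and a flat on-demand price of $6.25 per TB as quoted in this section; the function and constant names are hypothetical, and actual pricing varies by region and over time.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# On-demand price quoted in this section (assumed flat for the example).
ON_DEMAND_USD_PER_TIB = 6.25


def on_demand_cost_usd(bytes_scanned: int) -> float:
    """Estimated on-demand cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TIB * ON_DEMAND_USD_PER_TIB


# A full scan of roughly 400 GiB lands near the ~$2.50 figure quoted above.
full_scan_cost = on_demand_cost_usd(400 * GIB)

# The ~20 GB warning threshold for users without a slot reservation.
warn_threshold_cost = on_demand_cost_usd(20 * GIB)

# Byte cap for development queries, the value a job config would pass
# as maximum_bytes_billed (10 GB, per "Defaults to enforce").
DEV_MAX_BYTES_BILLED = 10 * GIB

print(round(full_scan_cost, 2), round(warn_threshold_cost, 2), DEV_MAX_BYTES_BILLED)
```

Read this way, the ~$2.50 figure implies a table of roughly 400 GB. A narrower or wider 500M-row table would cost proportionally less or more.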

### Census (reverse-ETL)

- Census syncs run against a materialized warehouse model, not a view. A view re-executes its query on every Census run — at Census's sync frequency (as low as 5 minutes), this is a warehouse bill. Always materialize the source model as `table` or `incremental`.
- Census API: `https://app.getcensus.com` with `Bearer` auth. Sync trigger: `POST /api/v1/syncs/{sync_id}/trigger`. Sync status poll: `GET /api/v1/syncs/{sync_id}/sync_runs` — poll every 30 seconds; timeout after 15 minutes. Rate limit: 60 requests/minute per API key.
- Sync mappings: Census `identifier` field maps to the destination's primary key (Salesforce `Id`, HubSpot `hs_object_id`). A sync without a declared identifier performs a create-only operation — no updates. Always confirm the identifier before generating a sync definition.
- Census uses `full sync` (re-sends all rows) and `incremental sync` (sends changed rows since last sync, keyed on a `cursor_field`). Default to incremental with a warehouse `updated_at` column as cursor. Full sync is a last resort for initial load or recovery.
- Sync failure behavior: Census marks failed rows with an error code in the sync report. These rows are NOT retried automatically — the next sync attempt processes the full set again. Write a dbt test that alerts when error-rate on the Census sync_reports model exceeds 1%.
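
The poll-every-30-seconds, time-out-after-15-minutes rule can be sketched as a generic loop. The `fetch_status` callable and the status strings `"completed"` and `"failed"` are hypothetical stand-ins; the shape of the Census response is not described here, so the HTTP call is left to the caller.

```python
import time
from typing import Callable

POLL_INTERVAL_S = 30      # poll interval from the rule above
TIMEOUT_S = 15 * 60       # give up after 15 minutes


def wait_for_sync(fetch_status: Callable[[], str],
                  terminal_statuses=frozenset({"completed", "failed"}),
                  poll_interval_s: float = POLL_INTERVAL_S,
                  timeout_s: float = TIMEOUT_S,
                  clock=time.monotonic,
                  sleep=time.sleep) -> str:
    """Poll until the sync reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_statuses:
            return status
        # Stop if the next poll would land after the deadline.
        if clock() + poll_interval_s > deadline:
            raise TimeoutError(f"sync still '{status}' after {timeout_s} seconds")
        sleep(poll_interval_s)
```

At one request every 30 seconds, a single poller uses 2 of the 60 requests per minute, so the number of syncs polled in parallel is what needs watching.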

### Hightouch (reverse-ETL)

- Hightouch syncs: same warehouse-materialization rule as Census. The source must be a table or incremental model, not a view.
- Hightouch API: `https://api.hightouch.com/api/v1/` with `Bearer` auth header. Trigger sync: `POST /api/v1/syncs/{sync_id}/trigger`. Status: `GET /api/v1/syncs/{sync_id}` — poll at 30-second intervals. Rate limit: 100 requests/minute.
- Hightouch `match_boosting` for Salesforce destination: enabled by default on paid plans, disabled on free tier. Match boosting uses fuzzy-matching to find the Salesforce record when the exact `Id` doesn't match. This is useful for initial loads but dangerous for incremental updates — it can match the wrong record. Disable match boosting on update syncs; use exact `Id` matching only.
- Warehouse sync: use Hightouch's `change data capture` mode when the source table has a reliable `updated_at` — this reduces warehouse queries by ~80% compared to full-table diff.

### n8n (orchestration)

- Set `executionOrder: "v1"` and `timezone` explicitly in every workflow's settings. Defaults differ between self-hosted and cloud instances; the difference surfaces during DST transitions as jobs that "ran at the wrong time."
- Cron node: timezone is per-node, not inherited from the workflow timezone. Set it explicitly on every Cron node.
- Code node over IF node when conditions exceed two branches or involve non-trivial logic. IF nodes become unreadable past three conditions; Code nodes are testable in isolation.
- Credentials referenced by name (`PLACEHOLDER_<TOOL>_CRED_ID`) in exported JSON — never inline. Credential secrets live in the n8n credentials manager; the exported workflow JSON is safe to commit.
- Set `Maximum items per execution` on any node that processes unbounded data. Default cap: 1,000 items. A workflow without a cap that processes a full warehouse sync result will time out or OOM the n8n worker.
- Error handling: every workflow has an Error Trigger node connected to a notification path (Slack #data-alerts or equivalent). Silent failures in orchestration produce stale data in ops dashboards that look like data-quality bugs until someone traces it back to a failed job.

### Airflow (orchestration)

- DAGs declare `default_args` with `retries: 2`, `retry_delay: timedelta(minutes=5)`, and `depends_on_past: False`. Retries with a short or zero delay hammer the warehouse or upstream API; a 5-minute delay is the minimum, and declaring it explicitly keeps the behavior independent of the Airflow configuration.
- Airflow `catchup=False` on new DAGs unless the user explicitly needs historical backfill. A DAG with `catchup=True` on a 90-day-old `start_date` will generate 90 days of DAG runs on first deploy — often crashing the scheduler.
- Task idempotence: every task in a DAG must produce the same result if re-run. Airflow's retry and backfill mechanics assume idempotence; tasks that write without checking for prior state produce duplicates.
- Variables and Connections live in Airflow's secret backend (AWS Secrets Manager, GCP Secret Manager, or the Airflow metadata database as a minimum; never in the DAG code). Generate code that reads from `Variable.get()` or `BaseHook.get_connection()`.
- XCom for passing small values between tasks (< 50 KB). For larger payloads (query results, intermediate datasets), write to the warehouse and pass the table name via XCom. An XCom that passes a full DataFrame is an anti-pattern.
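
The Airflow defaults above can be written down as plain Python values. The sketch below deliberately avoids importing Airflow so it runs anywhere; in a DAG file the dictionaries would be passed to `DAG(...)` as keyword arguments. The `dag_id`, the start date, the table name, and the `xcom_safe` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Retry defaults from the rule above.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

# Keyword arguments for the DAG constructor; names marked hypothetical are invented.
dag_kwargs = {
    "dag_id": "dbt_ops_build",            # hypothetical
    "start_date": datetime(2024, 1, 1),   # hypothetical
    "catchup": False,
    "default_args": default_args,
}

XCOM_MAX_BYTES = 50 * 1024  # the < 50 KB guideline for XCom values


def xcom_safe(value: str) -> bool:
    """Return True if the encoded value is small enough to pass via XCom."""
    return len(value.encode("utf-8")) < XCOM_MAX_BYTES


# Pass the table name, not the data itself.
print(xcom_safe("analytics.fct_revenue_pipeline"))
```

Passing the warehouse table name keeps the XCom payload at a few dozen bytes regardless of how large the intermediate dataset is.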

---

## Defaults to enforce

### Rate limiting

- Census API: max 60 requests/minute. All Census API callers use a token-bucket or sleep-based rate limiter; no burst-without-guard.
- Hightouch API: max 100 requests/minute. Same rule.
- Snowflake REST API: max 10 requests/second per account. Implement exponential backoff: base 1s, max 30s, factor 2, max 5 retries for idempotent operations.
- BigQuery on-demand: enforce a per-query byte limit via `maximum_bytes_billed` in the job config — default 10 GB for development queries, unlimited only with explicit user override and a documented reason.
- n8n execution throttling: `Maximum items per execution: 1000` unless the user explicitly overrides with a documented reason and a tested recovery path.
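
The backoff parameters above (base 1s, max 30s, factor 2, max 5 retries) translate into a short helper. This is a minimal sketch without jitter, assuming the wrapped operation is idempotent; the function names and the `is_retryable` callback are hypothetical.

```python
import time


def backoff_delays(base_s: float = 1.0, factor: float = 2.0,
                   max_s: float = 30.0, max_retries: int = 5) -> list[float]:
    """Delay before each retry: base * factor**attempt, capped at max_s."""
    return [min(max_s, base_s * factor ** attempt) for attempt in range(max_retries)]


def call_with_backoff(operation, is_retryable, sleep=time.sleep):
    """Run an idempotent operation, retrying retryable errors up to 5 times."""
    for delay in backoff_delays():
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc):
                raise
            sleep(delay)
    # Final attempt after the last delay; any exception propagates to the caller.
    return operation()


print(backoff_delays())
```

With these values the schedule is 1, 2, 4, 8, and 16 seconds, so the 30-second cap only comes into play if the retry count is raised above five.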

### Idempotence

- Every dbt incremental model uses `unique_key` — the model can be re-run from any point in the window and produce the same result.
- Every reverse-ETL sync keys on the destination's primary key (`Id` in Salesforce, `hs_object_id` in HubSpot). A sync that cannot identify its target record has no idempotence guarantee.
- Every webhook handler keys on a source event ID (or a hash of the payload if the source doesn't provide one). Re-processing the same event twice produces the same warehouse state.
- Every orchestrated job (n8n, Airflow) tolerates re-run from the beginning of the current window without producing duplicates. If it doesn't, it's not shippable.
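
The webhook rule (key on the source event ID, or on a payload hash when no ID exists) can be sketched as follows. The `event_id` field name and the in-memory `EventStore` are hypothetical; in practice the store would be a warehouse table merged on the same key.

```python
import hashlib
import json


def idempotency_key(event: dict) -> str:
    """Use the source event ID if present, otherwise a hash of the payload."""
    event_id = event.get("event_id")  # hypothetical field name
    if event_id:
        return f"id:{event_id}"
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EventStore:
    """Toy stand-in for a table that is upserted on the idempotency key."""

    def __init__(self):
        self.rows = {}

    def upsert(self, event: dict) -> bool:
        """Store the event; return True only if the key was not seen before."""
        key = idempotency_key(event)
        is_new = key not in self.rows
        self.rows[key] = event
        return is_new


store = EventStore()
event = {"event_id": "evt_123", "type": "opportunity.updated"}
store.upsert(event)
store.upsert(event)  # duplicate delivery of the same event
print(len(store.rows))
```

Delivering the same event twice leaves one row in the store, which is the property the rule asks for.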

### Observability

- Every dbt job ends with a `dbt build` summary: models run, models failed, tests passed, tests failed, elapsed time. This is the line on which alerting fires.
- Every reverse-ETL sync reports: rows processed, rows succeeded, rows failed, rows skipped. A sync that silently processes 0 rows is a failure, not a success.
- Every n8n / Airflow job ends with a structured summary logged to a data-ops Slack channel or equivalent. Items processed, succeeded, failed, skipped, runtime (seconds). Default log level INFO; DEBUG behind a feature flag.
- Source freshness alerts: dbt source freshness failures route to the same data-ops channel. A stale source that produces a stale ops dashboard without an alert is a trust-erosion event.
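
A structured summary with the fields listed above might look like the sketch below. The `RunSummary` class and the job name are hypothetical. Treating zero processed rows as a failure follows the rule above; treating any failed row as a failure is an added assumption that a team may want to relax.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RunSummary:
    job: str
    processed: int
    succeeded: int
    failed: int
    skipped: int
    runtime_s: float

    def is_failure(self) -> bool:
        # Zero rows processed counts as a failure even if the job exited cleanly.
        # Flagging any failed row is an assumption, not part of the rule.
        return self.processed == 0 or self.failed > 0

    def to_log_line(self) -> str:
        """Serialize the summary as one JSON line for the data-ops channel."""
        status = "FAILURE" if self.is_failure() else "OK"
        return json.dumps({"status": status, **asdict(self)}, sort_keys=True)


# A job that exited with code 0 but did no work.
empty_run = RunSummary("census_sync_opportunities", 0, 0, 0, 0, 1.2)
print(empty_run.to_log_line())
```

Alerting can then key on the `status` field in place of the process exit code.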

### Secrets

- dbt profiles: credentials in environment variables (`$DBT_SNOWFLAKE_ACCOUNT`, `$DBT_BQ_PROJECT`), not in `~/.dbt/profiles.yml`. CI uses a service-account profile injected from the secret manager.
- Warehouse service accounts: one service account per environment (dev, staging, prod). The prod service account has `WRITE` on the prod dataset only; the dev service account has `WRITE` on dev datasets only.
- Reverse-ETL API keys: stored in the secret manager, rotated quarterly. Census and Hightouch API keys have no expiry by default — rotation cadence must be enforced by the team, not the tool.
- n8n / Airflow credentials: live in the platform's credential store. Never inline in workflow JSON or DAG code. Never in environment variables that are logged (e.g., `AIRFLOW__CORE__SQL_ALCHEMY_CONN` is fine; printing all env vars at startup is not).
- NEVER generate a `.env` file with real credential values. Generate `.env.example` with `PLACEHOLDER_<VAR>` values only.
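
The `.env.example` convention can be generated and checked mechanically. Both helpers below are hypothetical sketches; the variable names come from this document, and the check only recognizes the exact `PLACEHOLDER_<VAR>` form.

```python
def env_example(var_names: list[str]) -> str:
    """Render .env.example content with PLACEHOLDER_<VAR> values only."""
    return "\n".join(f"{name}=PLACEHOLDER_{name}" for name in var_names) + "\n"


def placeholders_only(env_text: str) -> bool:
    """Return True if every assignment uses the PLACEHOLDER_<VAR> form."""
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value != f"PLACEHOLDER_{name}":
            return False
    return True


example = env_example(
    ["DBT_SNOWFLAKE_ACCOUNT", "DBT_SNOWFLAKE_PASSWORD", "DBT_BQ_PROJECT"]
)
print(example)
```

A check like `placeholders_only` could run in CI against any committed env file, so a real value is caught before it reaches the repository.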

---

## Anti-patterns to refuse

- **Full-refresh on a multi-hundred-million-row incremental model.** Refuse. The warehouse bill is real; the blast radius on a failed mid-run is a partially-updated table with no recovery path short of a full re-run. Use incremental with `unique_key`.
- **`dbt run --full-refresh` in a production CI/CD pipeline.** Refuse. Production pipelines run `dbt build` (or `dbt run` with explicit model selection). Full-refresh in production is a manual recovery step, not a scheduled default.
- **Secrets in dbt vars (`dbt run --vars '{"api_key": "sk-..."}'`).** Refuse. `--vars` values appear in `dbt.log`, CI logs, and `dbt run` history. Use environment variables injected from the secret manager.
- **A reverse-ETL sync that sources from a view.** Refuse. Views re-execute on every sync; at high sync frequency this is a warehouse bill masquerading as a data pattern. Materialize the source model.
- **A dbt model without a `unique` test on the primary key.** Refuse. Two lines. The downstream ops dashboard that silently aggregates a duplicated fact table will cost more time to debug than the test costs to write.
- **Direct warehouse writes from a notebook or local script without an audit log.** Refuse. Production data without a trace of who wrote what, when, is a compliance gap when the next SOX or legal-hold walkthrough arrives.
- **`SELECT *` in a production model.** Refuse. Column-level security policies (Snowflake Dynamic Data Masking, BigQuery column-level policy tags) apply at query time; `SELECT *` bypasses the intent of column-scoped policies by pulling all columns including masked ones into the downstream model's lineage.
- **Airflow `catchup=True` on a new DAG with a start_date more than 7 days ago.** Refuse. This generates a backlog of DAG runs that will overwhelm the scheduler on first deploy. Either set `catchup=False` or start the DAG from today's date.

---

## When the user is wrong

- **"Just do a full-refresh, it's easier"** — refuse when the table exceeds ~10M rows. Full-refresh on a large incremental model is not "easier" when it costs $40 in warehouse compute and leaves the table in an undefined state if it fails at row 80M. The right answer is `dbt run --select <model> --full-refresh` as a one-time manual recovery step with explicit approval, not a scheduled default.
- **"We don't need a `unique` test, the source guarantees uniqueness"** — refuse. Sources that "guarantee" uniqueness at the API level do not guarantee it at the warehouse level after network retries, backfills, or duplicate-delivery webhooks. The test is the guarantee. Without it, you're trusting a claim, not verifying it.
- **"Put the Snowflake password in the dbt profile for now"** — refuse. `profiles.yml` is frequently checked into repos accidentally and frequently printed in CI logs on errors. Use `$DBT_SNOWFLAKE_PASSWORD` from the secret manager from day one; migrating later is never prioritized.
- **"The reverse-ETL sync can source from the view, it's faster to set up"** — refuse. See anti-patterns. The 5-minute setup savings will cost hours when the sync runs at 15-minute frequency and the warehouse bill arrives.
- **"Skip the source freshness check, we know when the data loads"** — refuse. "We know when the data loads" until the upstream pipeline breaks silently and the data stops loading. The freshness check is exactly the thing that catches that scenario before the ops team presents stale pipeline numbers to the CRO.
- **"Use my personal BigQuery credentials for the production dbt run"** — refuse. Personal credentials mean the production pipeline breaks when the engineer's token expires, rotates, or they leave the company. Service account from day one.
- **"We can just re-sync everything from Census if something goes wrong"** — do not accept this as a recovery plan for a high-frequency sync touching Salesforce. A full re-sync from Census overwrites CRM records; if the source data has a bug, a full re-sync propagates it to every record. Idempotence + incremental sync + a verified rollback procedure is the recovery plan.