cursor-rule

Cursor rules for the ops-adjacent data engineer

Difficulty

intermediate

Setup time

15-30 min

For

data-engineer

RevOps, Legal Ops, Recruiting & TA

A .cursorrules file for the data engineer whose primary internal customers are ops teams — RevOps, Legal Ops, and Recruiting. The bundle ships at apps/web/public/artifacts/cursor-rules-data-engineer-ops/.cursorrules. Drop it into .cursor/rules/ in your data platform repo and stop re-litigating “should this model be incremental” or “does this sync need a unique_key” with your AI assistant for the next quarter.

The defining property of ops-adjacent data work is that your pipelines feed decisions, not just dashboards. A duplicate row in a revenue pipeline model doesn’t trip an alert — it silently inflates the opportunity count the VP of Sales uses to set quota. A bad reverse-ETL sync doesn’t fail visibly — it overwrites Salesforce records with stale data that the forecast model then treats as current. The rules in this bundle encode the engineering choices that keep ops data accurate under pressure: idempotence-by-default, mandatory unique tests, warehouse-materialized sync sources, explicit rate limits on every external call, and a structured escalation path when the user reaches for a shortcut.

When to use this

You build and maintain data pipelines with dbt, a cloud warehouse (Snowflake or BigQuery), a reverse-ETL tool (Census or Hightouch), and an orchestrator (n8n or Airflow). Your models flow into GTM forecasts, legal-ops contract analytics, or recruiting headcount models — not just BI dashboards. You write SQL and Python in Cursor and want the AI to default to the data-engineering patterns that prevent silent correctness failures, rather than the patterns that are fastest to type.

When NOT to use this

Your pipeline feeds a product analytics dashboard, not ops. Product analytics tolerates eventual consistency and approximate counts. The rules here are tuned for the blast radius of ops data errors (bad CRM records, incorrect headcount models, stale contract counts). The overhead of mandatory tests, incremental defaults, and audit logging is disproportionate for a dashboard that refreshes every 30 minutes and where nobody will hold you accountable for a 0.5% variance.
You’re a solo analyst who doesn’t run dbt in production. The rules assume a dbt project in version control with CI. If you’re running ad-hoc queries in a notebook and manually exporting to Google Sheets, the rules will surface guidance that doesn’t apply to your setup and may confuse more than it helps.
Your warehouse is not Snowflake or BigQuery. The tool-specific subsections reference Snowflake and BigQuery endpoints, limits, and patterns directly. On Redshift, Databricks, or DuckDB, the general principles (idempotence, tests, secrets hygiene) apply, but the specific guidance will point at the wrong APIs.

Setup

Copy the artifact. Grab .cursorrules from apps/web/public/artifacts/cursor-rules-data-engineer-ops/.cursorrules and drop it in your data repo’s .cursor/rules/ directory. Cursor’s Project Rules indicator confirms it’s loaded.
Trim what doesn’t apply. The file has sections for Snowflake, BigQuery, Census, Hightouch, n8n, and Airflow. Delete the sections for tools you don’t use — unused guidance dilutes signal and occasionally triggers suggestions for tools that aren’t in your stack.
Set the service account names. Several rules reference svc_dbt_prod@company.iam as a placeholder. Edit to your actual service account name so that when Cursor suggests code that runs under a service account, it suggests the right one.
Set the secret manager. The rules ban inline credentials and reference a secret manager. Edit the “Secrets” section to name your secret manager ($DBT_SNOWFLAKE_PASSWORD from AWS Secrets Manager, Doppler, 1Password CLI — pick the one your team uses) so suggestions point at the right call.
Confirm with a test task. Ask Cursor: “Write an incremental dbt model for Salesforce opportunities that merges on opportunity_id, with a unique test and a not_null test on account_id.” The output should use {{ ref() }}, declare unique_key = 'opportunity_id', include incremental_strategy = 'merge', and ship with both tests. If it doesn’t, check Cursor’s Project Rules indicator.

What the rules actually do

The bundle is structured as five layers applied to every Cursor prompt.

A “before writing code, ask” preamble. Five questions the model surfaces before generating: the grain of the model, the downstream consumer, the incremental-vs-full-refresh decision, the recovery path on failure, and where credentials live. These sound obvious written out. They’re the questions that don’t get asked when an engineer is under deadline pressure to ship the next sprint’s data model.

Tool-specific guidance for dbt (unique tests, ref(), incremental strategy, source freshness, service account discipline), Snowflake (warehouse sizing, auto-suspend, query result caching, Time Travel retention defaults), BigQuery (partition requirements, slot reservations, Storage Write API, column-level policy tags, query labels), Census (materialized source requirement, API rate limit of 60 req/min, sync identifier configuration, incremental cursor field), Hightouch (same materialization rule, 100 req/min API limit, match-boosting risks on update syncs), n8n (executionOrder, per-node timezone, Code-over-IF node rule, 1,000-item execution cap), and Airflow (retry defaults, catchup=False, XCom size limits, secret backend).

Defaults to enforce — all four with concrete values. This is the engineering core of the rules:

Rate limiting: Census API at 60 req/min, Hightouch at 100 req/min, Snowflake REST at 10 req/sec with exponential backoff (base 1s, max 30s, factor 2, 5 retries), BigQuery on-demand at 10 GB per query for development. Every caller uses a rate limiter; no burst-without-guard.
Idempotence: every dbt incremental model declares unique_key; every reverse-ETL sync keys on the destination’s primary key; every webhook handler keys on a source event ID or payload hash; every orchestrated job tolerates re-run from the start of the current window.
Observability: every dbt build reports models run/failed and tests passed/failed; every reverse-ETL sync reports rows processed/succeeded/failed/skipped; every n8n and Airflow job writes a structured summary to a data-ops channel; source freshness failures route to the same channel.
Secrets: dbt profiles read from environment variables ($DBT_SNOWFLAKE_ACCOUNT, $DBT_BQ_PROJECT), not from ~/.dbt/profiles.yml; one warehouse service account per environment; Census and Hightouch API keys in the secret manager, rotated quarterly; .env.example only, never .env with real values.
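
The backoff parameters in the rate-limiting default (base 1s, max 30s, factor 2, 5 retries) can be sketched in Python as follows. This is a minimal illustration, not code from the bundle: the function names and the `is_retryable` callback are hypothetical, and the wrapper should only be used around idempotent calls.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 30.0, retries: int = 5) -> Iterator[float]:
    """Yield the wait before each retry: base * factor**n, capped at max_delay."""
    for attempt in range(retries):
        yield min(base * factor ** attempt, max_delay)


def call_with_backoff(fn: Callable[[], T],
                      is_retryable: Callable[[Exception], bool],
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying retryable errors. Only safe for idempotent calls."""
    for delay in backoff_delays():
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc):
                raise
            sleep(delay)
    return fn()  # final attempt after the last wait; errors propagate
```

With the defaults, the waits are 1, 2, 4, 8, and 16 seconds, so the 30-second cap only comes into play if the retry count is raised.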

The reason idempotence is the default rather than an option: ops data is reconciled against financial systems. A job that can’t be safely re-run from the beginning will, at some point, run twice — during a DST transition, a scheduler restart, a failed mid-run recovery. When it does, the options are “tolerate duplicates” or “data corruption.” The rules remove the option of tolerating duplicates.

The reason observability has concrete targets rather than “add logging”: a data job that exits 0 but processed 0 rows is a silent failure. Ops teams don’t notice stale data until it affects a report. The structured summary line is the mechanism that makes “processed 0 rows” visible before it reaches the Monday pipeline review.
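
One way to make "processed 0 rows" visible is a summary object whose status does not depend on the exit code. The sketch below is an assumption about shape, not the bundle's format: the `RunSummary` name and field names are hypothetical, and treating any failed row as a failure is a choice made here for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunSummary:
    job: str
    processed: int
    succeeded: int
    failed: int
    skipped: int
    runtime_seconds: float

    @property
    def status(self) -> str:
        # Zero rows processed counts as a failure even if the job exited 0.
        # Flagging any failed row as a failure is an assumption of this sketch.
        if self.processed == 0 or self.failed > 0:
            return "failure"
        return "success"

    def to_line(self) -> str:
        """Render one JSON line suitable for a log or a channel message."""
        return json.dumps({**asdict(self), "status": self.status}, sort_keys=True)
```

A job would build one `RunSummary` at the end of its run and post `to_line()` to the data-ops channel, alerting whenever `status` is `"failure"`.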

Anti-patterns to refuse. Patterns the model rejects outright: full-refresh on a large incremental model; dbt run --full-refresh as a scheduled default in production CI; secrets in dbt --vars; reverse-ETL syncs sourced from views; dbt models without a unique test on the primary key; direct warehouse writes from notebooks without an audit log; SELECT * in production models; Airflow catchup=True on DAGs with a start date more than 7 days old.

A “when the user is wrong” section. The shortcuts that feel fast under deadline and cost time later: full-refresh on a large table “because it’s easier,” skipping unique tests “because the source guarantees uniqueness,” personal credentials for production dbt runs, reverse-ETL sourced from a view “because it’s faster to set up,” skipping source freshness checks “because we know when the data loads.” The model refuses these and explains why — not as a lecture, but as a one-line redirect to the pattern that won’t break at 2am.

Cost reality

Token cost: zero. Cursor rules are local context on each prompt — no per-request charge beyond the ~6 KB they occupy in the context window.
Setup time: 15-30 minutes. Drop the file, trim the tool sections, set service account names and the secret manager reference, run the verification task.
Per-task overhead: 1-2 turns of dialogue before generation, from the preamble questions. For a three-line query, this is overhead. For a new incremental model or a reverse-ETL sync definition, the questions surface decisions that would otherwise emerge as bugs in production or as findings in a data quality review.
Avoided cost: ~2-4 hours per data quality incident. An ops team discovering that a model has been producing duplicates for two weeks — tracing the root cause, identifying affected records, writing a fix, communicating the impact — runs 2-4 hours of engineering time and erodes trust in the pipeline for weeks afterward. The rules that prevent the duplicate (mandatory unique test, incremental unique_key) take under 10 seconds per model to enforce via Cursor suggestions.
Maintenance: ~30 minutes per quarter. dbt minor versions ship every few months. Census and Hightouch API versions are stable but worth a spot check. Snowflake and BigQuery limits are stable year-over-year. A quarterly review of version-tagged rules keeps the file accurate.

Failure modes

The model is marked incremental but has no unique_key. Without unique_key, dbt’s merge strategy has nothing to merge on and falls back to append. The table accumulates duplicates on every run. In a revenue pipeline model, this means opportunity counts inflate silently. Guard: the rules refuse to generate an incremental model without unique_key declared, and the unique test on the primary key catches any that slip through.
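
A pre-commit style check for this failure mode could look like the sketch below. It is a rough illustration under stated assumptions: it only inspects an in-file `{{ config(...) }}` block with a regex, so configuration set in `dbt_project.yml` or a YAML properties file is not seen, and the function name is hypothetical.

```python
import re

# Matches the arguments of an in-file {{ config(...) }} block.
CONFIG_RE = re.compile(r"\{\{\s*config\((.*?)\)\s*\}\}", re.DOTALL)


def incremental_without_unique_key(model_sql: str) -> bool:
    """Return True when the model is incremental but declares no unique_key."""
    match = CONFIG_RE.search(model_sql)
    if match is None:
        return False  # no config() block in the file, so nothing to check here
    config = match.group(1)
    is_incremental = re.search(r"materialized\s*=\s*['\"]incremental['\"]", config)
    has_key = re.search(r"\bunique_key\s*=", config)
    return bool(is_incremental) and not has_key
```

Run over every `.sql` file under `models/`, a `True` result marks a model that would accumulate duplicates on re-run.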

The reverse-ETL sync sources from a dbt view. The sync runs every 15 minutes. Each run re-executes the view’s query against the full warehouse table. At high sync frequency on a large table, this burns warehouse credits and introduces query-contention lag that slows other pipelines. Guard: the rules refuse to generate a sync definition that points at a view, and the dbt model materialization (table or incremental) is checked before generating the sync source configuration.

Credentials appear in dbt --vars or in a logged environment variable. dbt --vars '{"api_key": "sk-..."}' writes the value to dbt.log and any CI log collector. A CI system that logs env at startup captures every environment variable. Guard: the rules refuse to generate code with inline credential values and always reference the secret manager by variable name. .env.example with PLACEHOLDER_<VAR> values is generated; .env with real values is refused.

Airflow DAG deployed with catchup=True and a 90-day-old start_date. On first deploy, Airflow generates 90 × (runs_per_day) DAG runs and queues them. The scheduler bogs down; tasks that were supposed to run today don’t run until the backlog clears. In a dbt-triggering DAG, this means production models don’t refresh while the backlog drains. Guard: the rules refuse to generate a DAG with catchup=True and a start_date more than 7 days in the past, and always set catchup=False as the default for new DAGs unless the user explicitly documents the need for historical backfill.
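
The backlog arithmetic and the 7-day guard can be written out as a small sketch. The function names are hypothetical and the run count is an approximation (days behind times runs per day), not a reproduction of Airflow's scheduler logic.

```python
from datetime import date


def backlog_runs(start_date: date, today: date, runs_per_day: int) -> int:
    """Approximate number of DAG runs queued on first deploy with catchup=True."""
    days_behind = max((today - start_date).days, 0)
    return days_behind * runs_per_day


def should_refuse(catchup: bool, start_date: date, today: date,
                  max_age_days: int = 7) -> bool:
    """Mirror the guard: refuse catchup=True when start_date is over 7 days old."""
    return catchup and (today - start_date).days > max_age_days
```

For an hourly DAG with a 90-day-old `start_date`, `backlog_runs` gives 90 × 24 = 2,160 queued runs.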

Source freshness check not declared on an ops source. An upstream pipeline breaks. The source table stops loading. dbt continues to run against the last-loaded data, producing pipeline metrics that look correct but are 72 hours stale. The ops team presents the numbers in a QBR. Guard: the rules require loaded_at_field, warn_after, and error_after declarations in sources.yml for every source table, and surface a source freshness failure before the dbt build proceeds.
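
The logic behind `warn_after` and `error_after` is a comparison of the newest load timestamp against two thresholds. The sketch below illustrates that comparison only; it is not dbt's implementation, and the function name and example thresholds are assumptions.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(max_loaded_at: datetime, now: datetime,
                     warn_after: timedelta, error_after: timedelta) -> str:
    """Classify a source as 'pass', 'warn', or 'error' from its newest load time."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


# Hypothetical thresholds: warn after 12 hours, error after 24 hours.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
status = freshness_status(now - timedelta(hours=72), now,
                          timedelta(hours=12), timedelta(hours=24))
```

With those thresholds, the 72-hour-stale source from the scenario above is classified as an error rather than passing silently.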

Versus the alternatives

No rules at all (status quo). Cursor generates plausible dbt SQL that is missing unique tests, uses SELECT *, and materializes as a view because that’s the default. The first time a reverse-ETL sync runs against a view on a 200M-row table and the warehouse bill arrives, or the first time an ops model produces duplicate pipeline numbers that the CRO has to explain on a board call, the absence of rules becomes visible.

A team data-engineering style guide in Notion. Functionally equivalent to no rules for AI generation — the style guide isn’t in the model’s context. The Cursor rules file is the style guide that’s present on every prompt. The Notion doc and the .cursorrules file can coexist: the Notion doc is for onboarding humans; the rules file is for guiding Cursor.

A linter or static analyzer (dbt-checkpoint, sqlfluff). These catch patterns after the code is written — a post-generation check. They coexist well with the Cursor rules: the rules prevent the anti-pattern from being generated in the first place; the linter catches the cases that slip through. Running both reduces the set of issues that reach code review.

Generic LLM coding assistant defaults. A general-purpose Cursor session will suggest the fastest-to-type pattern for a given prompt. For dbt, that’s often SELECT *, no tests, materialized as a view. For a reverse-ETL sync, that’s often “source it from the view, you can change it later.” The rules shift the default from “fastest to type” to “correct under ops-team scrutiny.”

Reference

Bundle: apps/web/public/artifacts/cursor-rules-data-engineer-ops/.cursorrules

Place in your repo at: .cursor/rules/.cursorrules

Files in this artifact

# Ops-Adjacent Data Engineer — Cursor rules

You are pairing with a data engineer whose primary customers are internal ops teams: RevOps, Legal Ops, and Recruiting. The pipeline you maintain powers GTM forecasts, headcount models, and contract analytics — not just dashboards. A duplicate row in an incremental model doesn't break a pipeline; it silently inflates the numbers an ops leader makes a hiring decision on. Correctness and observability are non-negotiable.

Stack: dbt (models + tests + sources), a cloud warehouse (Snowflake or BigQuery), a reverse-ETL tool (Census or Hightouch), an orchestrator (n8n or Airflow), and SQL/Python glue.

---

## Before writing code, ask

Ops-adjacent data engineering is accounting work disguised as data work. Before generating any model, job, or sync, confirm:

1. **What is the grain of this model?** One row per opportunity? Per contract version? Per application? An undefined grain produces aggregation bugs that surface in ops reporting as phantom deals, duplicated headcount slots, or inflated contract TCV. If the user cannot state the grain in one sentence, stop and ask.
2. **What downstream systems consume this?** A model that feeds a reverse-ETL sync to Salesforce has different failure semantics than one that feeds a BI dashboard. A bad dashboard is fixed on refresh. A bad sync overwrites CRM records. Know the consumer before writing the model.
3. **Is this incremental or full-refresh?** Incremental models must declare `unique_key` and `incremental_strategy`. Full-refresh on a multi-hundred-million-row table is a warehouse bill, not a data pattern. Ask the volume; the answer changes the strategy.
4. **What is the recovery path when this job fails mid-run?** Partial writes to a warehouse table or a reverse-ETL sync leave the target in an intermediate state. Code that can't be safely re-run from the beginning is code that will corrupt data at 2am. Idempotence is the answer; confirm the user agrees before proceeding.
5. **Where do credentials live?** dbt profiles, warehouse service accounts, reverse-ETL API keys — never in code. If the user hasn't named a secret manager, ask before generating any code that touches auth.

If any answer is missing, ask. Do not assume ops-team defaults — they vary across companies in ways that affect financial reporting.

---

## Tool-specific guidance

### dbt

- Every model ships with a `unique` test on its primary key and a `not_null` test on every column a downstream model joins on. These are two lines. Without them, a duplicate upstream silently produces inflated pipeline numbers or double-counted headcount in ops dashboards.
- Use `{{ ref() }}`, never `database.schema.table`. Raw references bypass dbt's DAG and break environment isolation (dev vs. staging vs. prod point at different schemas; raw refs hard-wire one).
- Incremental models declare `unique_key` (one column or a list) and `incremental_strategy` explicitly. Default strategy is `merge`. `append` is appropriate only when the source guarantees no duplicates and no updates — that is rarer than teams think.
- Source freshness checks on every source table — declared in `sources.yml` with `loaded_at_field`, `warn_after`, and `error_after`. A stale source in an ops model silently breaks forecasting; the freshness test catches it before the ops team's Monday standup does.
- `dbt run` in production runs under a service account (`svc_dbt_prod@company.iam`), not a personal account. The audit trail names the service account; when the engineer leaves, the jobs don't fail.
- `dbt build` (not `dbt run`) in CI — runs models + tests in dependency order, fails fast on test failures before downstream models are materialized.
- Model file naming convention: `<layer>_<domain>_<entity>.sql` (e.g. `stg_salesforce_opportunities.sql`, `fct_revenue_pipeline.sql`). Deviations need a documented reason in the model's description block.
- `dbt docs generate` runs in CI; descriptions on every model and every column that an ops analyst will join on. "See upstream" is not a description.
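
A filename check for the naming convention might look like the following sketch. The set of layer prefixes is an assumption (the rules show only `stg_` and `fct_`), and the function name is hypothetical.

```python
import re

# Hypothetical layer prefixes; the rules above show only stg_ and fct_.
LAYERS = ("stg", "int", "fct", "dim")

# <layer>_<domain>_<entity>.sql, lowercase with underscores.
NAME_RE = re.compile(rf"^(?:{'|'.join(LAYERS)})_[a-z0-9]+_[a-z0-9_]+\.sql$")


def follows_convention(filename: str) -> bool:
    """Return True when the model filename matches <layer>_<domain>_<entity>.sql."""
    return NAME_RE.match(filename) is not None
```

Both examples from the convention pass, while a name with no domain segment such as `stg_opportunities.sql` is rejected.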

### Snowflake

- Warehouse sizing: XS for development and ad-hoc queries; S for standard dbt runs; M only for models that demonstrably time out on S. Auto-suspend set to 60 seconds; auto-resume on. Warehouses left running over a weekend cost real money — set auto-suspend or refuse to generate the config without it.
- Query results are cached for 24 hours and can be reused across sessions when the query text and the underlying data are unchanged. `RESULT_SCAN` works on cached results; downstream jobs that re-run the same query within the window do not consume warehouse compute. Design orchestration schedules around this where the data doesn't change faster than 24h.
- Snowflake `COPY INTO` for bulk loads; the Snowflake Connector for Python (`snowflake-connector-python>=3.0`) for programmatic writes. The REST API (`/api/v2/statements`) is available for serverless contexts where the Python connector is too heavy — rate limit is 10 requests/second per account.
- Column-level security via Dynamic Data Masking policies — not application-layer filtering. Ops data (salary bands, contract amounts, pipeline values) requires masking policies before any model exposes it to a BI tool. Ask the user which columns are sensitive before generating a model that joins on or selects them.
- Time Travel retention: the default is 1 day. Transient tables cap at 1 day; permanent tables can go up to 90 days on Enterprise Edition or higher. Set `data_retention_time_in_days = 7` on ops fact tables as a minimum. This is the "undo button" for a bad reverse-ETL sync.
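- Fail-safe is 7 days on permanent tables, starting when Time Travel retention ends (Snowflake-managed, not queryable; recovery goes through Snowflake Support). Document Time Travel retention plus 7 days as the outer bound for "we can recover"; beyond that, a bad sync is permanent.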

### BigQuery

- Partitioned tables on ingestion timestamp or a date column — required on any table that will exceed 1 GB or be queried with a date filter. Without partitioning, a full scan on a 500M-row table costs ~$2.50 per query; with partitioning, the same query costs cents. Always ask the user if the table is partitioned before generating queries without a partition filter.
- Slot reservations for production pipelines; on-demand for development. On-demand billing at $6.25/TB scanned; production dbt runs on a fixed slot reservation are predictably priced. If the user doesn't have a reservation, warn before generating a model that scans more than ~20 GB.
- `bq` CLI for one-off loads; `google-cloud-bigquery` Python client (>=3.10) for programmatic work. The Storage Write API (`google-cloud-bigquery-storage`) is 10× faster for high-throughput writes — use it when writing more than 100K rows programmatically.
- Dataset-level IAM: `roles/bigquery.dataViewer` for analysts; `roles/bigquery.dataEditor` for the dbt service account; `roles/bigquery.admin` for the data platform team only. Column-level policy tags for sensitive columns (salary, contract value, pipeline amount).
- Query labels are mandatory for production queries: `{"team": "data-platform", "job": "dbt-prod", "environment": "production"}`. Labels appear in the billing export and are how you know which team ran the expensive query.
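
The on-demand figures above can be checked with a few lines of arithmetic. This sketch assumes the quoted rate is per TiB (2^40 bytes) and uses hypothetical function names; it shows that a ~$2.50 full scan of 500M rows corresponds to about 0.4 TiB scanned, which depends on row width.

```python
TIB = 2 ** 40
PRICE_PER_TIB_USD = 6.25  # on-demand rate quoted in the rules above


def on_demand_cost_usd(bytes_scanned: float) -> float:
    """Cost of one on-demand query that scans the given number of bytes."""
    return bytes_scanned / TIB * PRICE_PER_TIB_USD


def implied_bytes_per_row(cost_usd: float, rows: int) -> float:
    """Average row width implied by a full-scan cost and a row count."""
    return cost_usd / PRICE_PER_TIB_USD * TIB / rows


full_scan_cost = on_demand_cost_usd(0.4 * TIB)
row_width = implied_bytes_per_row(2.50, 500_000_000)
```

The implied row width lands a little under 900 bytes, so narrower tables will cost less per full scan and wider ones more.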

### Census (reverse-ETL)

- Census syncs run against a materialized warehouse model, not a view. A view re-executes its query on every Census run — at Census's sync frequency (as low as 5 minutes), this is a warehouse bill. Always materialize the source model as `table` or `incremental`.
- Census API: `https://app.getcensus.com` with `Bearer` auth. Sync trigger: `POST /api/v1/syncs/{sync_id}/trigger`. Sync status poll: `GET /api/v1/syncs/{sync_id}/sync_runs` — poll every 30 seconds; timeout after 15 minutes. Rate limit: 60 requests/minute per API key.
- Sync mappings: Census `identifier` field maps to the destination's primary key (Salesforce `Id`, HubSpot `hs_object_id`). A sync without a declared identifier performs a create-only operation — no updates. Always confirm the identifier before generating a sync definition.
- Census uses `full sync` (re-sends all rows) and `incremental sync` (sends changed rows since last sync, keyed on a `cursor_field`). Default to incremental with a warehouse `updated_at` column as cursor. Full sync is a last resort for initial load or recovery.
- Sync failure behavior: Census marks failed rows with an error code in the sync report. These rows are NOT retried automatically — the next sync attempt processes the full set again. Write a dbt test that alerts when error-rate on the Census sync_reports model exceeds 1%.
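
The poll-every-30-seconds, timeout-after-15-minutes rule can be sketched as a generic loop. The `fetch_status` callable stands in for whatever performs the status request, and the terminal status names are hypothetical; neither is taken from the Census API.

```python
import time
from typing import Callable


def wait_for_sync(fetch_status: Callable[[], str],
                  poll_seconds: float = 30.0,
                  timeout_seconds: float = 15 * 60,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until fetch_status returns a terminal value or the timeout elapses."""
    terminal = {"completed", "failed"}  # hypothetical status names
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() + poll_seconds > deadline:
            raise TimeoutError(f"sync still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)
```

At a 30-second interval a single poller makes two requests per minute, well inside the 60 requests/minute limit; the budget matters when many syncs are polled under one API key.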

### Hightouch (reverse-ETL)

- Hightouch syncs: same warehouse-materialization rule as Census. The source must be a table or incremental model, not a view.
- Hightouch API: `https://api.hightouch.com/api/v1/` with `Bearer` auth header. Trigger sync: `POST /api/v1/syncs/{sync_id}/trigger`. Status: `GET /api/v1/syncs/{sync_id}` — poll at 30-second intervals. Rate limit: 100 requests/minute.
- Hightouch `match_boosting` for Salesforce destination: enabled by default on paid plans, disabled on free tier. Match boosting uses fuzzy-matching to find the Salesforce record when the exact `Id` doesn't match. This is useful for initial loads but dangerous for incremental updates — it can match the wrong record. Disable match boosting on update syncs; use exact `Id` matching only.
- Warehouse sync: use Hightouch's `change data capture` mode when the source table has a reliable `updated_at` — this reduces warehouse queries by ~80% compared to full-table diff.

### n8n (orchestration)

- Set `executionOrder: "v1"` and `timezone` explicitly in every workflow's settings. Defaults differ between self-hosted and cloud instances; the difference surfaces during DST transitions as jobs that "ran at the wrong time."
- Cron node: timezone is per-node, not inherited from the workflow timezone. Set it explicitly on every Cron node.
- Code node over IF node when conditions exceed two branches or involve non-trivial logic. IF nodes become unreadable past three conditions; Code nodes are testable in isolation.
- Credentials referenced by name (`PLACEHOLDER_<TOOL>_CRED_ID`) in exported JSON — never inline. Credential secrets live in the n8n credentials manager; the exported workflow JSON is safe to commit.
- Set `Maximum items per execution` on any node that processes unbounded data. Default cap: 1,000 items. A workflow without a cap that processes a full warehouse sync result will time out or OOM the n8n worker.
- Error handling: every workflow has an Error Trigger node connected to a notification path (Slack #data-alerts or equivalent). Silent failures in orchestration produce stale data in ops dashboards that look like data-quality bugs until someone traces it back to a failed job.

### Airflow (orchestration)

- DAGs declare `default_args` with `retries: 2`, `retry_delay: timedelta(minutes=5)`, and `depends_on_past: False`. Default retry behavior with no delay hammers the warehouse or upstream API; 5-minute delay is the minimum.
- Airflow `catchup=False` on new DAGs unless the user explicitly needs historical backfill. A DAG with `catchup=True` on a 90-day-old `start_date` will generate 90 days of DAG runs on first deploy — often crashing the scheduler.
- Task idempotence: every task in a DAG must produce the same result if re-run. Airflow's retry and backfill mechanics assume idempotence; tasks that write without checking for prior state produce duplicates.
- Variables and Connections live in Airflow's secret backend (AWS Secrets Manager, GCP Secret Manager, or the Airflow metadata database as a minimum), never in the DAG code. Generate code that reads from `Variable.get()` or `BaseHook.get_connection()`.
- XCom for passing small values between tasks (< 50 KB). For larger payloads (query results, intermediate datasets), write to the warehouse and pass the table name via XCom. An XCom that passes a full DataFrame is an anti-pattern.
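
The retry defaults and the XCom size rule can be expressed as plain Python values that a DAG definition would consume. This sketch deliberately avoids importing Airflow so it runs anywhere; the constant and function names are hypothetical, and "50 KB" is assumed to mean 50 KiB of JSON-serialized payload.

```python
import json
from datetime import timedelta

# Values from the guidance above, to be passed as default_args to a DAG.
DEFAULT_ARGS = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

XCOM_LIMIT_BYTES = 50 * 1024  # assumes "50 KB" means 50 KiB


def fits_in_xcom(value: object) -> bool:
    """Return True when the JSON-serialized value is under the XCom size rule."""
    return len(json.dumps(value).encode("utf-8")) < XCOM_LIMIT_BYTES
```

A task that fails this check would write its result to the warehouse and push only the table name, which always fits.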

---

## Defaults to enforce

### Rate limiting

- Census API: max 60 requests/minute. All Census API callers use a token-bucket or sleep-based rate limiter; no burst-without-guard.
- Hightouch API: max 100 requests/minute. Same rule.
- Snowflake REST API: max 10 requests/second per account. Implement exponential backoff: base 1s, max 30s, factor 2, max 5 retries for idempotent operations.
- BigQuery on-demand: enforce a per-query byte limit via `maximum_bytes_billed` in the job config — default 10 GB for development queries, unlimited only with explicit user override and a documented reason.
- n8n execution throttling: `Maximum items per execution: 1000` unless the user explicitly overrides with a documented reason and a tested recovery path.
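
A token-bucket limiter for the per-minute API limits above could be sketched as follows. This is one possible shape, not a prescribed implementation: the class name is hypothetical, it is single-threaded, and the clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Pace calls to `rate_per_minute`, allowing a bounded initial burst."""

    def __init__(self, rate_per_minute: int, burst: int = 1,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = self.clock()
            elapsed = now - self.last
            self.last = now
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_per_second)
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep just long enough for the missing fraction of a token.
            self.sleep((1.0 - self.tokens) / self.refill_per_second)
```

A Census caller would create `TokenBucket(60)` once and call `acquire()` before every request; a Hightouch caller would use `TokenBucket(100)`.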

### Idempotence

- Every dbt incremental model uses `unique_key` — the model can be re-run from any point in the window and produce the same result.
- Every reverse-ETL sync keys on the destination's primary key (`Id` in Salesforce, `hs_object_id` in HubSpot). A sync that cannot identify its target record has no idempotence guarantee.
- Every webhook handler keys on a source event ID (or a hash of the payload if the source doesn't provide one). Re-processing the same event twice produces the same warehouse state.
- Every orchestrated job (n8n, Airflow) tolerates re-run from the beginning of the current window without producing duplicates. If it doesn't, it's not shippable.
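
The webhook rule (key on the source event ID, else a hash of the payload) can be sketched as below. The `event_id` field name and the in-memory `seen` dictionary are assumptions for illustration; in practice the dictionary would be a warehouse table with a unique key.

```python
import hashlib
import json
from typing import Any, Dict


def event_key(payload: Dict[str, Any], id_field: str = "event_id") -> str:
    """Return the source event ID if present, else a SHA-256 of the payload."""
    event_id = payload.get(id_field)
    if event_id is not None:
        return f"id:{event_id}"
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def process_once(payload: Dict[str, Any], seen: Dict[str, Dict[str, Any]]) -> bool:
    """Store the event by key; return False when it was already processed."""
    key = event_key(payload)
    if key in seen:
        return False
    seen[key] = payload
    return True
```

Delivering the same payload twice, even with its fields in a different order, leaves exactly one stored event.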

### Observability

- Every dbt job ends with a `dbt build` summary: models run, models failed, tests passed, tests failed, elapsed time. This is the line on which alerting fires.
- Every reverse-ETL sync reports: rows processed, rows succeeded, rows failed, rows skipped. A sync that silently processes 0 rows is a failure, not a success.
- Every n8n / Airflow job ends with a structured summary logged to a data-ops Slack channel or equivalent. Items processed, succeeded, failed, skipped, runtime (seconds). Default log level INFO; DEBUG behind a feature flag.
- Source freshness alerts: dbt source freshness failures route to the same data-ops channel. A stale source that produces a stale ops dashboard without an alert is a trust-erosion event.

### Secrets

- dbt profiles: credentials in environment variables (`$DBT_SNOWFLAKE_ACCOUNT`, `$DBT_BQ_PROJECT`), not in `~/.dbt/profiles.yml`. CI uses a service-account profile injected from the secret manager.
- Warehouse service accounts: one service account per environment (dev, staging, prod). The prod service account has `WRITE` on the prod dataset only; the dev service account has `WRITE` on dev datasets only.
- Reverse-ETL API keys: stored in the secret manager, rotated quarterly. Census and Hightouch API keys have no expiry by default — rotation cadence must be enforced by the team, not the tool.
- n8n / Airflow credentials: live in the platform's credential store. Never inline in workflow JSON or DAG code. Never in environment variables that are logged (e.g., `AIRFLOW__CORE__SQL_ALCHEMY_CONN` is fine; printing all env vars at startup is not).
- NEVER generate a `.env` file with real credential values. Generate `.env.example` with `PLACEHOLDER_<VAR>` values only.
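
Glue code that reads credentials can fail fast when a variable is missing or still holds a placeholder. The helper below is a hypothetical sketch, not part of any library; it reports only the variable name, never the value.

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a required variable, failing fast without echoing any value."""
    value = env.get(name, "")
    # Treat values copied from .env.example as missing.
    if not value or value.startswith("PLACEHOLDER_"):
        raise RuntimeError(f"missing required environment variable: {name}")
    return value
```

Calling `require_env("DBT_SNOWFLAKE_ACCOUNT")` at startup surfaces a misconfigured environment before any warehouse connection is attempted.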

---

## Anti-patterns to refuse

- **Full-refresh on a multi-hundred-million-row incremental model.** Refuse. The warehouse bill is real; the blast radius on a failed mid-run is a partially-updated table with no recovery path short of a full re-run. Use incremental with `unique_key`.
- **`dbt run --full-refresh` in a production CI/CD pipeline.** Refuse. Production pipelines run `dbt build` (or `dbt run` with explicit model selection). Full-refresh in production is a manual recovery step, not a scheduled default.
- **Secrets in dbt vars (`dbt run --vars '{"api_key": "sk-..."}'`).** Refuse. `--vars` values appear in `dbt.log`, CI logs, and `dbt run` history. Use environment variables injected from the secret manager.
- **A reverse-ETL sync that sources from a view.** Refuse. Views re-execute on every sync; at high sync frequency this is a warehouse bill masquerading as a data pattern. Materialize the source model.
- **A dbt model without a `unique` test on the primary key.** Refuse. Two lines. The downstream ops dashboard that silently aggregates a duplicated fact table will cost more time to debug than the test costs to write.
- **Direct warehouse writes from a notebook or local script without an audit log.** Refuse. Production data without a trace of who wrote what, when, is a compliance gap when the next SOX or legal-hold walkthrough arrives.
- **`SELECT *` in a production model.** Refuse. Column-level security policies (Snowflake Dynamic Data Masking, BigQuery column-level policy tags) apply at query time; `SELECT *` bypasses the intent of column-scoped policies by pulling all columns including masked ones into the downstream model's lineage.
- **Airflow `catchup=True` on a new DAG with a start_date more than 7 days ago.** Refuse. This generates a backlog of DAG runs that will overwhelm the scheduler on first deploy. Either set `catchup=False` or start the DAG from today's date.

---

## When the user is wrong

- **"Just do a full-refresh, it's easier"** — refuse when the table exceeds ~10M rows. Full-refresh on a large incremental model is not "easier" when it costs $40 in warehouse compute and leaves the table in an undefined state if it fails at row 80M. The right answer is `dbt run --select <model> --full-refresh` as a one-time manual recovery step with explicit approval, not a scheduled default.
- **"We don't need a `unique` test, the source guarantees uniqueness"** — refuse. Sources that "guarantee" uniqueness at the API level do not guarantee it at the warehouse level after network retries, backfills, or duplicate-delivery webhooks. The test is the guarantee. Without it, you're trusting a claim, not verifying it.
- **"Put the Snowflake password in the dbt profile for now"** — refuse. `profiles.yml` is frequently checked into repos accidentally and frequently printed in CI logs on errors. Use `$DBT_SNOWFLAKE_PASSWORD` from the secret manager from day one; migrating later is never prioritized.
- **"The reverse-ETL sync can source from the view, it's faster to set up"** — refuse. See anti-patterns. The 5-minute setup savings will cost hours when the sync runs at 15-minute frequency and the warehouse bill arrives.
- **"Skip the source freshness check, we know when the data loads"** — refuse. "We know when the data loads" until the upstream pipeline breaks silently and the data stops loading. The freshness check is exactly the thing that catches that scenario before the ops team presents stale pipeline numbers to the CRO.
- **"Use my personal BigQuery credentials for the production dbt run"** — refuse. Personal credentials mean the production pipeline breaks when the engineer's token expires, rotates, or they leave the company. Service account from day one.
- **"We can just re-sync everything from Census if something goes wrong"** — do not accept this as a recovery plan for a high-frequency sync touching Salesforce. A full re-sync from Census overwrites CRM records; if the source data has a bug, a full re-sync propagates it to every record. Idempotence + incremental sync + a verified rollback procedure is the recovery plan.