← Back to blog
engineering·May 13, 2026·6 min read

The CDP question — why your customer data needs a home

You don't need a $50K/year CDP. You need a warehouse, 6 dbt models, and a reverse-ETL tool. Total cost: under $500/month.

engineering

The CDP question — why your customer data needs a home

You don't need a $50K/year Customer Data Platform. You need a warehouse, 6 dbt models, and a reverse-ETL tool. Total cost: under $500/month.

A CDP isn't a product you buy. It's an architecture you build. Every company has customer data — it's just scattered across Salesforce, HubSpot, your website, your product, your email tool, your ad platforms. A CDP collects it, unifies identities, and makes it available for activation. The question isn't whether you need this capability. It's whether you need to pay a vendor $50K/year for it.

What a CDP actually does

Four functions:

  1. Collection — ingest events and attributes from every customer touchpoint (website, app, email, CRM, support)
  2. Identity resolution — figure out that user_123 on your website, john@acme.com in your CRM, and the person who clicked your Google Ad are the same human
  3. Enrichment — add third-party data (company size, industry, funding, tech stack) to known identities
  4. Activation — push unified customer profiles back out to tools that need them (ad platforms, email tools, CRM, support)

Segment, mParticle, and Twilio Engage do all four. They charge accordingly. For most SMBs and mid-market companies, the warehouse-native approach does the same thing for a fraction of the cost.

The warehouse-native CDP

The architecture we build:

Website (PostHog/GA4) ──→ BigQuery
Product (Mixpanel/Amplitude) ──→ BigQuery
CRM (Salesforce/HubSpot) ──→ BigQuery (via Fivetran)
Email (SendGrid/Resend) ──→ BigQuery
Ads (Google/Meta/LinkedIn) ──→ BigQuery
Enrichment (Clay/Clearbit) ──→ BigQuery
Consent (CookieYes) ──→ BigQuery
         │
         ▼
      dbt models (6 core models)
         │
         ▼
   Hightouch / Census (reverse ETL)
         │
         ▼
   Salesforce + Google Ads + Meta Ads + Email

The 6 core dbt models

1. dim_person — the person spine

-- models/core/dim_person.sql
WITH source_contacts AS (
    SELECT email, first_name, last_name, 'salesforce' AS source, sf_contact_id AS source_id
    FROM {{ ref('stg_salesforce__contacts') }}
    UNION ALL
    SELECT email, first_name, last_name, 'hubspot', hs_contact_id
    FROM {{ ref('stg_hubspot__contacts') }}
    UNION ALL
    SELECT email, first_name, last_name, 'sendgrid', sg_contact_id
    FROM {{ ref('stg_sendgrid__contacts') }}
),
 
deduplicated AS (
    SELECT
        LOWER(TRIM(email)) AS email,
        FIRST_VALUE(first_name) OVER (PARTITION BY LOWER(TRIM(email)) ORDER BY source) AS first_name,
        FIRST_VALUE(last_name) OVER (PARTITION BY LOWER(TRIM(email)) ORDER BY source) AS last_name,
        ARRAY_AGG(DISTINCT source) AS source_systems,
        ARRAY_AGG(DISTINCT source_id) AS source_ids,
        ROW_NUMBER() OVER (PARTITION BY LOWER(TRIM(email)) ORDER BY source) AS rn
    FROM source_contacts
    WHERE email IS NOT NULL
    GROUP BY email, first_name, last_name, source, source_id
)
 
SELECT
    {{ dbt_utils.generate_surrogate_key(['email']) }} AS person_id,
    email,
    first_name,
    last_name,
    source_systems,
    source_ids
FROM deduplicated
WHERE rn = 1

2. dim_company — the account spine

-- models/core/dim_company.sql
SELECT
    {{ dbt_utils.generate_surrogate_key(['domain']) }} AS company_id,
    domain,
    FIRST_VALUE(company_name) OVER (PARTITION BY domain ORDER BY source_priority) AS company_name,
    e.employee_count,
    e.industry,
    e.annual_revenue_range,
    e.funding_stage
FROM {{ ref('int_companies__merged') }} c
LEFT JOIN {{ ref('stg_clay__enrichment') }} e
    ON c.domain = e.domain

3. fct_events — the unified event stream

-- models/core/fct_events.sql
SELECT event_id, person_id, event_name, event_at, properties, 'website' AS source
FROM {{ ref('int_posthog__person_events') }}
UNION ALL
SELECT event_id, person_id, event_name, event_at, properties, 'product'
FROM {{ ref('int_mixpanel__person_events') }}
UNION ALL
SELECT event_id, person_id, event_name, event_at, properties, 'email'
FROM {{ ref('int_sendgrid__person_events') }}

4. int_person_company — the person-to-company mapping

Uses email domain to map people to companies. Handles edge cases (gmail.com, outlook.com) by falling back to CRM account associations.

5. mart_customer_360 — the unified profile

Joins person + company + events + enrichment into a single row per person with:

  • All known identifiers
  • Company context
  • Last 90 days of activity
  • Engagement score
  • Lifecycle stage

6. mart_activation_audiences — segments for reverse ETL

-- models/activation/mart_activation_audiences.sql
SELECT
    p.person_id,
    p.email,
    c.company_name,
    c.employee_count,
    CASE
        WHEN p.lifecycle_stage = 'customer' AND p.engagement_score < 30
        THEN 'at_risk_customer'
        WHEN p.lifecycle_stage = 'lead' AND p.engagement_score > 70
        THEN 'hot_lead'
        WHEN c.employee_count > 200 AND p.last_pricing_page_visit IS NOT NULL
        THEN 'enterprise_prospect'
        ELSE 'nurture'
    END AS audience_segment
FROM {{ ref('mart_customer_360') }} p
JOIN {{ ref('dim_company') }} c ON p.company_id = c.company_id

Hightouch picks up these segments and pushes them to:

  • Google Ads — matched customer lists for targeting/exclusion
  • Meta Ads — custom audiences
  • Salesforce — updated lead scores and lifecycle stages
  • SendGrid — email list segmentation

Identity resolution: the hard part

The core challenge of any CDP is figuring out that multiple records across multiple systems represent the same person.

Our approach:

  • Email is the primary key. If two records share an email, they're the same person. Simple, reliable, covers 90% of cases.
  • Company domain is the account key. Extract the domain from the email. john@acme.comacme.com. Map to dim_company.
  • Anonymous → known stitching. When a PostHog user_pseudo_id later identifies via form submission, backfill the person_id onto historical events.
-- Stitch anonymous sessions to known persons
UPDATE {{ ref('fct_events') }} e
SET person_id = i.person_id
FROM {{ ref('int_identity_graph') }} i
WHERE e.anonymous_id = i.anonymous_id
  AND e.person_id IS NULL

This isn't as sophisticated as what Segment does with probabilistic matching across devices. It doesn't need to be. For B2B, email-based identity resolution covers the vast majority of use cases. You don't need to track the same person across their phone, laptop, and tablet — you need to know that the lead in your CRM visited your website and opened your email.

One thing vendor CDPs handle well: consent enforcement. In a warehouse-native approach, you need to build this yourself.

We ingest CookieYes consent data (via Google Tag Manager events) into the warehouse and join it to the person spine:

-- Only activate persons with valid marketing consent
SELECT *
FROM {{ ref('mart_activation_audiences') }}
WHERE person_id IN (
    SELECT person_id
    FROM {{ ref('int_consent__current') }}
    WHERE marketing_consent = TRUE
      AND consent_date IS NOT NULL
)

This ensures that reverse-ETL audiences only include consented users. Non-negotiable for GDPR compliance.

The cost comparison

ComponentVendor CDP (Segment)Warehouse-native
Data collectionIncludedPostHog free tier + Fivetran ($0-200/mo)
WarehouseNot included (you still need one)BigQuery ($50-200/mo)
TransformationBasicdbt Cloud free tier
Identity resolutionIncluded6 dbt models (built once)
Reverse ETLBasic integrationsHightouch ($0-300/mo)
Total$12K-120K/year$100-700/month

The warehouse-native approach is cheaper, more transparent, and gives you full control over the data and logic. The trade-off: you build and maintain the models. For a team with even basic dbt skills, that trade-off is overwhelmingly worth it.


We've built warehouse-native CDPs for B2B companies — from identity resolution to reverse ETL to consent management. If you're evaluating Segment or thinking about unifying your customer data, book a discovery call and we'll show you the warehouse-native alternative.

Got a similar problem?

30 minutes. We'll tell you honestlywhat's broken.