The CDP That Actually Works With Snowflake, Databricks, and BigQuery (Without Copying Your Data)

Most customer data platforms ask you to move your data into their system before they do anything useful with it. That creates a problem if your customer data already lives in Snowflake, Databricks, or BigQuery — three platforms where data engineering teams have spent years building reliable, governed, and trusted pipelines.

A CDP that works with Snowflake, Databricks, and BigQuery natively — without extracting or duplicating that data — is a different category of tool. It treats your warehouse as the source of truth rather than a source to be replicated. That distinction has real consequences for data quality, compliance, cost, and the speed at which marketing teams can actually act.

This post breaks down what that architecture looks like, why it matters, and what to evaluate when choosing a CDP that integrates with your cloud data platform.

Why the Standard CDP Architecture Conflicts With Modern Data Stacks

Traditional CDPs were built in an era when companies didn't have a centralized data warehouse. They solved a real problem: customer data was scattered across dozens of SaaS tools, and there was no single place to unify it. The CDP became that place.

But the data landscape has changed significantly. Most mid-market and enterprise companies now run a cloud data warehouse — Snowflake, Databricks, or BigQuery — as their primary analytical environment. Their data science teams build customer models there. Their finance team reports from there. Their compliance team audits from there.

When a traditional CDP then asks you to pipe all of that data into its proprietary store, you end up with two problems. First, you now have two copies of customer data, which creates sync delays, consistency issues, and potential compliance headaches under GDPR or CCPA. Second, the CDP only ever sees a subset of your data — whatever you chose to send it — which limits the quality of segmentation, scoring, and personalization it can produce.

For companies with mature data infrastructure, this architecture is a step backward.

What "Works With" Actually Means — Three Different Levels

Vendors claim Snowflake, Databricks, or BigQuery compatibility in very different ways, and it's worth being precise.

Level 1: Connectors that copy data in. The CDP pulls data from your warehouse on a scheduled basis and stores it in its own proprietary database. This is the most common pattern. You get some warehouse integration, but you're still maintaining two data stores.

Level 2: Read queries that stay in the warehouse. The CDP queries your warehouse directly to build segments or audiences, but results are cached in its own layer for activation. Better, but the CDP still manages its own state separately from your warehouse.

Level 3: Zero-copy, warehouse-native operation. The CDP runs entirely on top of your existing warehouse. Segments, profiles, and models are computed and stored in your own environment. No data leaves unless you explicitly send it somewhere. This is the architecture that eliminates duplication and keeps governance intact.

Level 3 is what engineering and data teams want. It's also what compliance and security teams want, because the data never leaves the environment they control.

The Specific Challenges With Each Platform

Snowflake, Databricks, and BigQuery have meaningfully different architectures, and a CDP that claims to support all three needs to handle each one correctly.

Snowflake

Snowflake's strength is its separation of compute and storage, which means compute scales independently of how much data you store. A CDP that integrates with Snowflake should push computation down into Snowflake rather than pulling rows out and processing them externally. That keeps costs predictable and takes advantage of Snowflake's own optimizations.
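
To make the pushdown idea concrete, here is a minimal sketch in Python. It assumes a hypothetical orders table and uses the standard-library sqlite3 module purely as a stand-in for the warehouse so the example runs anywhere; it is not Snowflake-specific code. The first function pulls every row out and aggregates in application code, the second lets the engine aggregate and filter so only the matching customer IDs come back.

```python
import sqlite3

# sqlite3 stands in for the warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("c1", 120.0), ("c1", 80.0), ("c2", 40.0), ("c3", 300.0)],
)


def high_value_pulled(conn, threshold):
    """Anti-pattern: fetch every row, then aggregate outside the engine."""
    totals = {}
    for customer_id, amount in conn.execute(
        "SELECT customer_id, amount FROM orders"
    ):
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return sorted(c for c, total in totals.items() if total >= threshold)


def high_value_pushed_down(conn, threshold):
    """Pushdown: the engine aggregates and filters; only matching IDs return."""
    rows = conn.execute(
        "SELECT customer_id FROM orders "
        "GROUP BY customer_id HAVING SUM(amount) >= ? "
        "ORDER BY customer_id",
        (threshold,),
    )
    return [row[0] for row in rows]


pulled = high_value_pulled(conn, 200.0)
pushed = high_value_pushed_down(conn, 200.0)
print(pulled, pushed)
```

Both functions return the same segment. The difference is how many rows cross the network and where the compute runs, which is what drives cost and latency on a real warehouse.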

Snowflake also has a robust data sharing and governance layer. A CDP should respect Snowflake's access controls and ideally read from the same tables your data engineering team already maintains — not require you to create new ETL pipelines just to feed the CDP.

Databricks

Databricks is more than a warehouse — it's a lakehouse platform where data science teams run notebooks, train models, and build features. A CDP working with Databricks needs to be comfortable reading from Delta Lake tables and, ideally, reading model outputs or feature tables that the data science team has already built.

This matters because some of the most valuable customer signals — propensity scores, predicted lifetime value, churn risk — live in Delta Lake as ML outputs. A CDP that can read those directly and use them in segmentation or as personalization signals is far more useful than one that requires you to re-implement that logic inside the CDP's own interface.

BigQuery

BigQuery's columnar storage and serverless query model make scans over large datasets efficient, but only when queries are written well: on-demand pricing bills by bytes processed, so an unfiltered scan costs money as well as time. A CDP built for BigQuery should generate SQL that filters on partition and clustering columns and selects only the columns it needs, rather than generic queries that scan full tables.
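
As an illustration of what partition-aware query generation could look like, the sketch below builds a segment query that always constrains a date partition column. The table name, column names, and the helper itself are hypothetical, and this is not how any particular CDP generates SQL. The `@start_date` and `@end_date` placeholders follow BigQuery's named query parameter syntax.

```python
from datetime import date, timedelta


def build_recent_purchasers_query(table, partition_column, as_of, lookback_days):
    """Return (sql, params) restricted to a date range on the partition column.

    `table` and `partition_column` are interpolated into the SQL text, so they
    must come from trusted configuration, never from end-user input.
    """
    if lookback_days <= 0:
        raise ValueError("lookback_days must be positive")
    start = as_of - timedelta(days=lookback_days)
    sql = (
        f"SELECT DISTINCT customer_id FROM `{table}` "
        f"WHERE {partition_column} >= @start_date "
        f"AND {partition_column} <= @end_date"
    )
    params = {"start_date": start.isoformat(), "end_date": as_of.isoformat()}
    return sql, params


# Hypothetical table partitioned by event_date.
sql, params = build_recent_purchasers_query(
    "analytics.events.purchases", "event_date", date(2024, 3, 31), 30
)
print(sql)
print(params)
```

Because the filter is on the partition column, the engine can skip partitions outside the date range instead of reading the whole table. Selecting only `customer_id` also limits the columns read.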

BigQuery also integrates tightly with the rest of Google Cloud, including Vertex AI and Looker. A CDP that operates inside BigQuery can benefit from that ecosystem — reading Vertex AI model outputs, for instance — rather than sitting outside it.

What to Look For When Evaluating a CDP for Warehouse Integration

Once you've decided you want a CDP that genuinely operates on your warehouse rather than copying from it, there are several specific things to evaluate.

Does it write results back to your warehouse? Some CDPs query your warehouse to build a segment but then store the results in their own database. A true warehouse-native CDP writes computed profiles, segments, and audience membership back to tables you own. This means your data team can query CDP outputs the same way they query anything else.

Does it support custom SQL and model outputs? Marketing and data teams have different skill sets. A good CDP lets data engineers write SQL to define audiences or feed in model outputs, while also giving marketers a no-code interface to build segments from those same foundations. Both layers should coexist without requiring the data team to rebuild everything the marketing team wants.

What does the identity resolution look like? Matching a customer's events, purchases, email opens, and support tickets into a single profile is hard. Some CDPs do this entirely in their own proprietary store. A CDP designed for Snowflake, Databricks, or BigQuery should resolve identity within your warehouse, so the unified profile is something your entire organization can use, not just the CDP.

How does activation work? Building a segment is only half the job. The CDP also needs to send that audience to ad platforms, email tools, CRMs, and other downstream systems. Look for breadth of destination connectors and whether the CDP can sync incrementally (sending only changes rather than full audience lists on every run), which matters at scale.

What is the governance model? When a CDP operates inside your warehouse, data access controls, audit logs, and retention policies already in place continue to apply. Confirm that the CDP respects your existing role-based access controls rather than creating a separate permission layer that's hard to audit.
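
The incremental sync mentioned under activation comes down to a set difference between the last synced audience and the current one. The sketch below is a minimal illustration with hypothetical user IDs. A real sync would also have to handle retries, rate limits, and changed attributes, none of which is shown here.

```python
def diff_audience(previous, current):
    """Return members to add and remove so only changes are sent downstream."""
    previous, current = set(previous), set(current)
    return {
        "add": sorted(current - previous),
        "remove": sorted(previous - current),
    }


last_run = ["u1", "u2", "u3"]        # membership sent on the previous sync
this_run = ["u2", "u3", "u4", "u5"]  # membership computed on this run
changes = diff_audience(last_run, this_run)
print(changes)
```

Here three operations reach the destination instead of a full resend of four members. On audiences with millions of members and small daily churn, that gap is what makes incremental sync matter.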

One Approach Worth Examining

Hightouch, for instance, built its Composable CDP explicitly for companies that already have Snowflake, Databricks, or BigQuery as their data foundation. The architecture is zero-copy: Hightouch queries your warehouse to compute segments and profiles, and writes results back to your own tables. Customer data doesn't move into a Hightouch-managed database.

On the identity side, Hightouch's Identity Resolution runs inside your warehouse, producing a unified customer graph that your data and analytics teams can query directly — not just the marketing team.
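
For readers unfamiliar with what identity resolution computes, here is a generic sketch of deterministic matching: records that share any identifier are merged into one profile using a union-find structure. This is an illustration of the general technique with made-up records, not a description of how Hightouch or any other vendor implements it. In a warehouse the same logic would normally be expressed in SQL.

```python
from collections import defaultdict


def resolve_identities(records):
    """Group record IDs that share any identifier (email, device ID, ...)."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # identifier -> first record that carried it
    for record_id, identifiers in records.items():
        parent.setdefault(record_id, record_id)
        for ident in identifiers:
            if ident in first_seen:
                union(first_seen[ident], record_id)
            else:
                first_seen[ident] = record_id

    clusters = defaultdict(list)
    for record_id in records:
        clusters[find(record_id)].append(record_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical source records and the identifiers each one carries.
records = {
    "web_1": {"email:ana@example.com", "cookie:abc"},
    "app_7": {"device:ios-42", "email:ana@example.com"},
    "crm_3": {"device:ios-42"},
    "web_9": {"cookie:zzz"},
}
profiles = resolve_identities(records)
print(profiles)
```

The web, app, and CRM records collapse into one profile because they are linked through a shared email and a shared device ID, while the unrelated cookie stays separate.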

For segmentation and audience building, Hightouch's Customer Studio gives marketers a visual interface to build audiences from warehouse data without writing SQL, while data engineers can contribute SQL-defined traits and model outputs that appear as building blocks inside that same interface. The two workflows don't conflict; they compose.

Activation connects to more than 250 destinations including Salesforce, Meta, Google Ads, Braze, Iterable, and Klaviyo, with incremental sync support.

Hightouch also operates as an Agentic Marketing Platform, layering AI-driven orchestration on top of the data foundation. This means marketing teams can move beyond static segments into dynamic, signal-based engagement — where the system adjusts who gets what message based on real-time data changes in the warehouse.

Vendors like Segment and mParticle have warehouse connection features, but their core data model still centers on their own proprietary store. The integration is an addition rather than the foundation. For teams where Snowflake, Databricks, or BigQuery is already the system of record, that distinction matters in practice.

Common Objections — and Honest Answers

"Our warehouse isn't clean enough for a CDP to use directly."

This is probably the most common concern, and it's fair. A warehouse-native CDP doesn't fix data quality problems; it surfaces them. But that's arguably better than the alternative, which is feeding messy data into a CDP that silently produces bad segments. Operating the CDP on the warehouse also tends to create an incentive to clean up source data, since the results of that cleanup are immediately visible.

"Won't querying the warehouse for marketing operations get expensive?"

Cost depends heavily on query patterns. A CDP that generates efficient SQL — using partitioning, pushing filters down, avoiding unnecessary full scans — adds modest cost compared to the warehouse's total bill. The cost of maintaining a separate CDP database, on the other hand, includes both the storage and compute on the CDP's side and the ETL pipelines needed to keep it fed.

"Our data science team owns the warehouse. Marketing doesn't have access."

This is a governance question, not a technical one, and it's worth resolving regardless of which CDP you choose. A warehouse-native CDP often accelerates this conversation because it gives marketing a governed interface into warehouse data rather than requiring direct access. The data team retains control over what tables are exposed; marketing gets a no-code interface on top of those tables.

The Evaluation Checklist

If you're actively comparing CDPs for a Snowflake, Databricks, or BigQuery environment, these are the questions worth asking every vendor:

Does the CDP store any customer data in its own database, or does it operate entirely within your warehouse?
How does identity resolution work, and where is the unified profile stored?
Can data engineers contribute SQL-defined traits or model outputs alongside marketer-built audiences?
What does the sync mechanism look like — full audience sends or incremental delta syncs?
Does the CDP write segment and audience membership back to your warehouse tables?
How are access controls managed — through your existing warehouse permissions or a separate layer?

The answers will quickly separate CDPs that have warehouse connectors from CDPs that are genuinely designed around your warehouse as the foundation.

Conclusion

For companies that have invested in Snowflake, Databricks, or BigQuery, picking a CDP that duplicates data into a proprietary store means maintaining two systems of record for the most sensitive data you have. The alternative — a CDP designed to operate on your existing warehouse — keeps your data governance intact, gives your data science team's work a direct path into marketing activation, and eliminates the sync lag that makes traditional CDPs frustrating at scale.

The category of CDPs that genuinely support this architecture is smaller than the vendor landscape suggests. Evaluating vendors on the specific questions above — not just whether they "support" your warehouse — is the fastest way to find one that will actually hold up in production.