Most Talend-to-dbt guides — including our own — stay at the altitude of strategy: audit the estate, bucket the jobs, run both systems in parallel. All true. But the question we actually get on the first call is more concrete:
"We have a Talend job that hits a REST API, mangles the response, and writes it to a table. What does that look like on Google Cloud and dbt?"
So this post is one job, end to end. We'll take a representative Talend pipeline — a tREST → tMap → tBigQueryOutput flow pulling orders from a SaaS API — and rebuild it on Google Cloud and dbt. No abstractions. The same pattern covers most of the API-shaped jobs in a typical estate.
The Talend job we're replacing
The job under the microscope is the one every estate has a few of:
- tREST calls a paginated API endpoint (/v2/orders?page=N), looping until the response is empty.
- tMap flattens the JSON, renames fields, casts types, drops cancelled orders.
- tBigQueryOutput truncates and reloads a reporting.orders table.
- A Talend scheduler entry runs it nightly on a VM that also runs forty other jobs.
It works. It's also a black box: the pagination logic lives in a component nobody opens, the field mapping is a screenshot in a Confluence page, and the whole thing dies if the VM reboots mid-run.
Here's where each piece lands in the Google Cloud + dbt world:
| Talend component | What it did | Replacement |
|---|---|---|
| tREST + loop | Pull paginated API data | Cloud Run function (Python) |
| tBigQueryOutput (raw) | Land the response | BigQuery raw table (raw.orders) |
| tMap | Flatten, rename, cast, filter | dbt staging model |
| Business logic / joins | Shape for reporting | dbt mart model |
| Talend scheduler | Trigger nightly | Cloud Scheduler → Workflows |
| (none — manual checks) | Data quality | dbt tests in CI |
The principle that makes this clean: extraction and transformation stop being the same job. Talend smeared them together inside one .item file. We split them — a function that only lands raw data, and dbt that only transforms what's already landed. That separation is the entire reason the dbt side stays simple.
Step 1 — Extraction: replace tREST with a Cloud Run function
The API pull becomes a small Python service on Cloud Run. It does exactly what tREST did — page through the endpoint — but the logic is now readable code in a Git repo instead of a loop component.
```python
# main.py: deployed to Cloud Run
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

API_BASE = "https://api.example-saas.com/v2"
RAW_TABLE = "my-project.raw.orders"


def fetch_orders(session, since):
    """Page through the orders endpoint until it runs dry."""
    page, rows = 1, []
    while True:
        resp = session.get(
            f"{API_BASE}/orders",
            params={"updated_since": since, "page": page, "per_page": 200},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()["data"]
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows


def load_to_bigquery(rows):
    """Land the raw API response untouched: one JSON payload per row."""
    client = bigquery.Client()
    # Stamp every row in this batch with the same UTC load time.
    ingested_at = datetime.now(timezone.utc).isoformat()
    job = client.load_table_from_json(
        [{"payload": r, "_ingested_at": ingested_at} for r in rows],
        RAW_TABLE,
        job_config=bigquery.LoadJobConfig(
            write_disposition="WRITE_APPEND",
            schema=[
                bigquery.SchemaField("payload", "JSON"),
                bigquery.SchemaField("_ingested_at", "TIMESTAMP"),
            ],
        ),
    )
    job.result()  # block until the load job finishes
```

Two decisions matter here and they're the opposite of how Talend worked:
- Land the response raw. No flattening, no renaming, no filtering in the extractor. Store the API's JSON exactly as it arrived, in a BigQuery JSON column, with an ingestion timestamp. If the API adds a field next quarter, you already have it. If a transform has a bug, you re-run dbt; you don't re-pull six months of history.
- Append, don't truncate. tBigQueryOutput truncated-and-reloaded. We append and let dbt deduplicate. That turns your raw table into an audit log and makes incremental loads possible.
The Talend instinct is to clean data on the way in. The modern instinct is to land it dirty and clean it in SQL, where the logic is version-controlled and testable.
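Because the pagination now lives in plain Python, it can be exercised without touching the real API. The following is a minimal sketch, assuming the same stop-on-empty-page contract as fetch_orders above. FakeResponse, FakeSession and paginate are hypothetical names invented for illustration; they are not part of requests or of the extractor.

```python
# Sketch only: FakeResponse, FakeSession and paginate are illustrative names,
# not part of `requests` or of the extractor above.

class FakeResponse:
    """Stands in for requests.Response with just the methods the loop uses."""

    def __init__(self, data):
        self._data = data

    def raise_for_status(self):
        pass  # the fake never returns an HTTP error

    def json(self):
        return {"data": self._data}


class FakeSession:
    """Serves canned pages; any page past the end comes back empty."""

    def __init__(self, pages):
        self.pages = pages
        self.calls = []

    def get(self, url, params=None, timeout=None):
        self.calls.append(params["page"])
        index = params["page"] - 1
        return FakeResponse(self.pages[index] if index < len(self.pages) else [])


def paginate(session, url, since, per_page=200):
    """Same stop-on-empty-page loop as the extractor, with the URL passed in."""
    page, rows = 1, []
    while True:
        resp = session.get(
            url,
            params={"updated_since": since, "page": page, "per_page": per_page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()["data"]
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows


session = FakeSession([[{"id": "a"}, {"id": "b"}], [{"id": "c"}]])
rows = paginate(session, "https://api.example-saas.com/v2/orders", "2020-01-01T00:00:00Z")
print([r["id"] for r in rows])  # ['a', 'b', 'c']
print(session.calls)            # [1, 2, 3]: the third, empty page ends the loop
```

Note that the loop always makes one extra request to discover the empty page. That is the price of not relying on a total-count field, which this sketch assumes the API does not provide.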
Step 2 — Incremental loads with a watermark
The tREST loop pulled everything, every night. With an append-only raw table that would re-land the full history on each run, and on BigQuery you pay for what you store and scan, so pull only what changed. Store a watermark (the last ingestion timestamp) and pass it as the updated_since param:
```python
def get_watermark(client):
    sql = "SELECT MAX(_ingested_at) AS hwm FROM `my-project.raw.orders`"
    row = next(iter(client.query(sql).result()), None)
    # Fall back to a fixed start date when the raw table is still empty.
    return row.hwm.isoformat() if row and row.hwm else "2020-01-01T00:00:00Z"
```

Now each run lands only new and updated orders. The raw table grows as an append-only history; dbt collapses it to the current state in the next step.
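One wrinkle worth considering: an order updated while the previous run was still paging may fall just before the stored watermark and be missed. A hedged sketch of one way to guard against that is to re-request a small overlap window and let the staging dedup absorb the duplicates. The helper name updated_since_param and the five-minute default are assumptions made for illustration, not something the original job or the API prescribes.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_START = "2020-01-01T00:00:00Z"  # same fallback as get_watermark above


def updated_since_param(hwm, overlap=timedelta(minutes=5)):
    """Turn a stored high-water mark into the `updated_since` query value.

    `overlap` is an assumption of this sketch: re-requesting a small window
    costs a few duplicate rows, which the dedup in staging absorbs.
    """
    if hwm is None:
        return DEFAULT_START
    if hwm.tzinfo is None:
        hwm = hwm.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return (hwm - overlap).astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(updated_since_param(None))
# 2020-01-01T00:00:00Z
print(updated_since_param(datetime(2024, 3, 1, 2, 0, 30, tzinfo=timezone.utc)))
# 2024-03-01T01:55:30Z
```

The right overlap depends on how long a run takes and how the API timestamps its updates, so treat the default as a placeholder to tune.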
Step 3 — Transformation: tMap becomes dbt models
This is where the tMap logic moves — but instead of a visual mapping, it's SQL in layers.
Staging parses the raw JSON and does exactly what tMap did: rename, cast, filter. One model, one job:
```sql
-- models/staging/stg_orders.sql
WITH raw AS (
    SELECT
        payload,
        _ingested_at,
        ROW_NUMBER() OVER (
            PARTITION BY JSON_VALUE(payload, '$.id')
            ORDER BY _ingested_at DESC
        ) AS rn
    FROM {{ source('raw', 'orders') }}
)

SELECT
    JSON_VALUE(payload, '$.id') AS order_id,
    JSON_VALUE(payload, '$.customer.email') AS customer_email,
    CAST(JSON_VALUE(payload, '$.total') AS NUMERIC) AS order_total,
    CAST(JSON_VALUE(payload, '$.created_at') AS TIMESTAMP) AS created_at,
    JSON_VALUE(payload, '$.status') AS status
FROM raw
WHERE rn = 1  -- keep the latest version of each order
  AND JSON_VALUE(payload, '$.status') != 'cancelled'
```

That ROW_NUMBER() … WHERE rn = 1 is the deduplication the append-only raw table needed. It is the equivalent of Talend's truncate-reload, but expressed as data, not destruction.
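To make the order of operations in that query concrete, here is a small in-memory Python sketch of the same logic: pick the newest version of each order first, then drop it if that newest version is cancelled. The helper latest_non_cancelled and the sample rows are made up for illustration; this is not how dbt or BigQuery executes the model.

```python
# Sketch only: latest_non_cancelled is an illustrative helper, not dbt or BigQuery API.

def latest_non_cancelled(raw_rows):
    """Mimic stg_orders: keep the newest version per order id, then drop cancelled."""
    latest = {}
    for row in raw_rows:
        order_id = row["payload"]["id"]
        current = latest.get(order_id)
        # ISO-8601 strings in the same format sort chronologically.
        if current is None or row["_ingested_at"] > current["_ingested_at"]:
            latest[order_id] = row
    return [
        r["payload"] for r in latest.values()
        if r["payload"]["status"] != "cancelled"
    ]


# Invented sample data: order 1 is updated, order 2 is later cancelled.
raw_rows = [
    {"payload": {"id": "1", "status": "open", "total": "10.00"}, "_ingested_at": "2024-03-01T02:00:00Z"},
    {"payload": {"id": "1", "status": "paid", "total": "12.50"}, "_ingested_at": "2024-03-02T02:00:00Z"},
    {"payload": {"id": "2", "status": "paid", "total": "5.00"}, "_ingested_at": "2024-03-01T02:00:00Z"},
    {"payload": {"id": "2", "status": "cancelled", "total": "5.00"}, "_ingested_at": "2024-03-02T02:00:00Z"},
]
print(latest_non_cancelled(raw_rows))
# [{'id': '1', 'status': 'paid', 'total': '12.50'}]
```

Order 2 disappears entirely rather than reverting to its earlier paid version, because the cancelled filter is applied after the latest row has been chosen.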
Mart holds the joins and business logic the dashboards read:
```sql
-- models/marts/mart_revenue_by_month.sql
SELECT
    DATE_TRUNC(created_at, MONTH) AS month,
    COUNT(DISTINCT order_id) AS orders,
    SUM(order_total) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1
```

Register the raw table as a dbt source so lineage starts at the first hop:
```yaml
# models/staging/_sources.yml
sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 26, period: hour }
```

That freshness block gives you something Talend never did for free: dbt source freshness warns loudly if the Cloud Run function has stopped landing data. The pipeline tells you when it's broken instead of going quietly stale.
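The check behind that block is simple enough to sketch in a few lines of Python: compare the newest _ingested_at against the current time and warn once the gap exceeds the threshold. This is an illustration of the idea only; freshness_status is a hypothetical helper, and the exact comparison dbt performs may differ in detail.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(max_loaded_at, now, warn_after=timedelta(hours=26)):
    """Return 'pass' or 'warn' the way a warn_after-only freshness rule would."""
    age = now - max_loaded_at
    return "warn" if age > warn_after else "pass"


now = datetime(2024, 3, 3, 6, 0, tzinfo=timezone.utc)
# Last load at 02:00 today: 4 hours old.
print(freshness_status(datetime(2024, 3, 3, 2, 0, tzinfo=timezone.utc), now))  # pass
# Last load at 02:00 yesterday: 28 hours old, so last night's run was missed.
print(freshness_status(datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc), now))  # warn
```

A 26-hour threshold on a nightly job leaves roughly two hours of slack after a missed run before the warning fires.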
Step 4 — Orchestration: Cloud Scheduler, not a Talend server
The nightly trigger moves off the shared VM entirely. The chain is fully serverless:
- Cloud Scheduler fires on a cron (0 2 * * *).
- It invokes a Cloud Workflow that runs the steps in order: call the Cloud Run extractor, wait for it, then trigger the dbt run.
- dbt builds staging → marts and runs tests.
```yaml
# workflow.yaml: Google Cloud Workflows
main:
  steps:
    - extract:
        call: http.post
        args:
          url: https://orders-extractor-xxxx.run.app
          auth: { type: OIDC }
    - transform:
        call: http.post
        args:
          url: https://api.cloud.getdbt.com/v2/.../jobs/123/run/
          headers: { Authorization: '${"Token " + sys.get_env("DBT_TOKEN")}' }
```

No VM. No always-on Talend runtime. Extraction runs for the ninety seconds it takes to page the API, then scales to zero. We covered this orchestration pattern (Workflows + Cloud Run + dbt, no Airflow cluster) in depth here.
Step 5 — Testing and the parallel run
The job that had zero automated checks gets them now, in the layer where they belong:
```yaml
# models/staging/_schema.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_total
        tests:
          - dbt_utils.accepted_range: { min_value: 0 }
```

dbt test runs on every build and blocks the merge on failure. Then comes the step you do not skip: run both pipelines side by side for two to four weeks and compare row counts and daily revenue totals until they match for a full week. We wrote up why that discipline is non-negotiable, and the failure modes of skipping it, in the migration playbook. Only then do you delete the Talend job.
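The parallel-run comparison can be as simple as a daily diff of two small summaries. Below is a minimal sketch, assuming you have already queried per-day row counts and revenue from both the Talend-loaded table and the dbt mart. The function matching_streak and all figures are invented for illustration.

```python
from decimal import Decimal


def matching_streak(legacy, rebuilt):
    """Count consecutive matching days, ending at the most recent day.

    Both arguments map an ISO date string to (row_count, revenue). A day
    missing from either side counts as a mismatch.
    """
    streak = 0
    for day in sorted(set(legacy) | set(rebuilt), reverse=True):
        if legacy.get(day) != rebuilt.get(day):
            break
        streak += 1
    return streak


# Made-up numbers: the rebuilt pipeline was one order short on the first day.
legacy = {
    "2024-03-01": (120, Decimal("4310.50")),
    "2024-03-02": (98, Decimal("3720.00")),
    "2024-03-03": (131, Decimal("5102.25")),
}
rebuilt = {
    "2024-03-01": (119, Decimal("4290.50")),
    "2024-03-02": (98, Decimal("3720.00")),
    "2024-03-03": (131, Decimal("5102.25")),
}
print(matching_streak(legacy, rebuilt))  # 2
```

Under the rule described above, cutover would wait until this streak reaches seven. Using Decimal rather than float keeps revenue comparisons exact.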
What changed, beyond "it's SQL now"
The line-for-line rewrite isn't the win. Three structural things are:
- The raw layer is permanent. Every API response is preserved in BigQuery. Transform bugs are a dbt run away from fixed, with no re-pulling of history.
- Compute scales to zero. The Talend VM ran 24/7 to do ninety seconds of work a night. Cloud Run bills for the ninety seconds. For most estates, a fleet of these jobs costs less than the single VM they replace.
- Every piece is in Git. The pagination logic, the field mapping, the schedule, the tests: all reviewable in a PR. The "ask Dave" dependency is gone.
| Layer | Talend world | Google Cloud + dbt |
|---|---|---|
| API extraction | tREST loop component | Cloud Run function (Python) |
| Raw storage | (none — transformed on the way in) | BigQuery raw table, JSON column |
| Transformation | tMap on a Talend server | dbt staging + mart models |
| Orchestration | Talend scheduler on a VM | Cloud Scheduler → Workflows |
| Testing | Manual spot checks | dbt tests in CI |
| Compute | Always-on VM | Serverless, scales to zero |
Do this once and the pattern repeats: the next API job, the next file drop, the next database extract all follow the same three-part shape — land raw, transform in dbt, orchestrate serverless. That's how a Talend estate of forty jobs comes apart one clean pipeline at a time.
If you're scoping a Talend-to-dbt move on Google Cloud and want to pressure-test the approach against your actual estate, book a discovery call — we'll walk through what the first pipeline looks like for your stack.