Talend to dbt on Google Cloud: migrating an API ingestion pipeline

Most Talend-to-dbt guides — including our own — stay at the altitude of strategy: audit the estate, bucket the jobs, run both systems in parallel. All true. But the question we actually get on the first call is more concrete:

"We have a Talend job that hits a REST API, mangles the response, and writes it to a table. What does that look like on Google Cloud and dbt?"

So this post is one job, end to end. We'll take a representative Talend pipeline — a tREST → tMap → tBigQueryOutput flow pulling orders from a SaaS API — and rebuild it on Google Cloud and dbt. No abstractions. The same pattern covers most of the API-shaped jobs in a typical estate.

The Talend job we're replacing

The job under the microscope is the one every estate has a few of:

tREST calls a paginated API endpoint (/v2/orders?page=N), looping until the response is empty.
tMap flattens the JSON, renames fields, casts types, drops cancelled orders.
tBigQueryOutput truncates and reloads a reporting.orders table.
A Talend scheduler entry runs it nightly on a VM that also runs forty other jobs.

It works. It's also a black box: the pagination logic lives in a component nobody opens, the field mapping is a screenshot in a Confluence page, and the whole thing dies if the VM reboots mid-run.

Here's where each piece lands in the Google Cloud + dbt world:

Talend component	What it did	Replacement
`tREST` + loop	Pull paginated API data	Cloud Run function (Python)
`tBigQueryOutput` (raw)	Land the response	BigQuery raw table (`raw.orders`)
`tMap`	Flatten, rename, cast, filter	dbt staging model
Business logic / joins	Shape for reporting	dbt mart model
Talend scheduler	Trigger nightly	Cloud Scheduler → Workflows
(none — manual checks)	Data quality	dbt tests in CI

The principle that makes this clean: extraction and transformation stop being the same job. Talend smeared them together inside one .item file. We split them — a function that only lands raw data, and dbt that only transforms what's already landed. That separation is the entire reason the dbt side stays simple.

Step 1 — Extraction: replace tREST with a Cloud Run function

The API pull becomes a small Python service on Cloud Run. It does exactly what tREST did — page through the endpoint — but the logic is now readable code in a Git repo instead of a loop component.

# main.py: deployed to Cloud Run
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

API_BASE = "https://api.example-saas.com/v2"
RAW_TABLE = "my-project.raw.orders"

def fetch_orders(session, since):
    """Page through the orders endpoint until it runs dry."""
    page, rows = 1, []
    while True:
        resp = session.get(
            f"{API_BASE}/orders",
            params={"updated_since": since, "page": page, "per_page": 200},
            timeout=30,
        )
        resp.raise_for_status()  # surface HTTP errors instead of landing partial data
        batch = resp.json()["data"]
        if not batch:
            break  # an empty page means we have read everything
        rows.extend(batch)
        page += 1
    return rows

def load_to_bigquery(rows):
    """Land the raw API response untouched: one JSON payload per row."""
    if not rows:
        return  # nothing changed since the last run, so skip the load job
    client = bigquery.Client()
    # Stamp every row in the batch with one explicit UTC load time
    ingested_at = datetime.now(timezone.utc).isoformat()
    job = client.load_table_from_json(
        [{"payload": r, "_ingested_at": ingested_at} for r in rows],
        RAW_TABLE,
        job_config=bigquery.LoadJobConfig(
            write_disposition="WRITE_APPEND",
            schema=[
                bigquery.SchemaField("payload", "JSON"),
                bigquery.SchemaField("_ingested_at", "TIMESTAMP"),
            ],
        ),
    )
    job.result()  # block until the load job finishes; raises if it failed

Two decisions matter here and they're the opposite of how Talend worked:

Land the response raw. No flattening, no renaming, no filtering in the extractor. Store the API's JSON exactly as it arrived, in a BigQuery JSON column, with an ingestion timestamp. If the API adds a field next quarter, you already have it. If a transform has a bug, you re-run dbt — you don't re-pull six months of history.
Append, don't truncate. tBigQueryOutput truncated-and-reloaded. We append and let dbt deduplicate. That turns your raw table into an audit log and makes incremental loads possible.

The Talend instinct is to clean data on the way in. The modern instinct is to land it dirty and clean it in SQL, where the logic is version-controlled and testable.
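
Because the extractor is now ordinary Python, the pagination loop can be exercised without touching the network. The sketch below is a minimal illustration, not the production code: FakeSession and FakeResponse are hypothetical stand-ins for requests.Session and requests.Response, and fetch_orders is restated with an injected api_base so the block is self-contained.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response."""

    def __init__(self, data):
        self._data = data

    def raise_for_status(self):
        pass  # the fake never returns an HTTP error

    def json(self):
        return {"data": self._data}


class FakeSession:
    """Hypothetical stand-in for requests.Session that serves canned pages."""

    def __init__(self, pages):
        self.pages = pages
        self.calls = []  # records the params of every request

    def get(self, url, params=None, timeout=None):
        self.calls.append(params)
        index = params["page"] - 1
        data = self.pages[index] if index < len(self.pages) else []
        return FakeResponse(data)


def fetch_orders(session, since, api_base="https://api.example-saas.com/v2"):
    """Same loop as the extractor: request pages until one comes back empty."""
    page, rows = 1, []
    while True:
        resp = session.get(
            f"{api_base}/orders",
            params={"updated_since": since, "page": page, "per_page": 200},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()["data"]
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows


# Two pages of data, then the API "runs dry" on page 3
session = FakeSession(pages=[[{"id": "1"}, {"id": "2"}], [{"id": "3"}]])
orders = fetch_orders(session, since="2020-01-01T00:00:00Z")
```

The loop makes one more request than there are pages of data, because the empty page is what tells it to stop. That behaviour was invisible inside the tREST loop component; here it is three lines a reviewer can read.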

Step 2 — Incremental loads with a watermark

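The tREST loop pulled everything, every night. With an append-only raw table, that would re-land the full history on every run, and on BigQuery you pay for what you store and scan, so pull only what changed. Derive a watermark from the raw table (the latest ingestion timestamp) and pass it as the updated_since param: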

def get_watermark(client):
    sql = "SELECT MAX(_ingested_at) AS hwm FROM `my-project.raw.orders`"
    row = next(iter(client.query(sql).result()), None)
    return row.hwm.isoformat() if row and row.hwm else "2020-01-01T00:00:00Z"

Now each run lands only new and updated orders. The raw table grows as an append-only history; dbt collapses it to the current state in the next step.
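
One caveat with this watermark: _ingested_at records when a row landed, not when the API last changed it, so an order updated while a run is in flight can fall just before the next updated_since and be missed. A common mitigation is to rewind the watermark by a small overlap and let the deduplication in the staging model absorb the re-pulled rows. A minimal sketch, assuming a hypothetical ten-minute overlap (the right value depends on how long a run takes); since_param is an illustrative name, not part of the original extractor.

```python
from datetime import datetime, timedelta, timezone

# Same fallback start date that get_watermark uses for an empty table
DEFAULT_START = datetime(2020, 1, 1, tzinfo=timezone.utc)


def since_param(high_watermark, overlap=timedelta(minutes=10)):
    """Turn the stored watermark into an updated_since value, rewound by `overlap`."""
    if high_watermark is None:
        return DEFAULT_START.isoformat()
    return (high_watermark - overlap).isoformat()


hwm = datetime(2024, 5, 1, 2, 0, 0, tzinfo=timezone.utc)
rewound = since_param(hwm)      # ten minutes before the watermark
first_run = since_param(None)   # empty raw table: start from the default
```

Re-pulling a few minutes of orders costs a handful of duplicate raw rows, which is cheap compared with silently losing an update.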

Step 3 — Transformation: tMap becomes dbt models

This is where the tMap logic moves — but instead of a visual mapping, it's SQL in layers.

Staging parses the raw JSON and does exactly what tMap did: rename, cast, filter. One model, one job:

-- models/staging/stg_orders.sql
WITH raw AS (
    SELECT
        payload,
        _ingested_at,
        ROW_NUMBER() OVER (
            PARTITION BY JSON_VALUE(payload, '$.id')
            ORDER BY _ingested_at DESC
        ) AS rn
    FROM {{ source('raw', 'orders') }}
)
SELECT
    JSON_VALUE(payload, '$.id')                      AS order_id,
    JSON_VALUE(payload, '$.customer.email')          AS customer_email,
    CAST(JSON_VALUE(payload, '$.total') AS NUMERIC)  AS order_total,
    CAST(JSON_VALUE(payload, '$.created_at') AS TIMESTAMP) AS created_at,
    JSON_VALUE(payload, '$.status')                  AS status
FROM raw
WHERE rn = 1                       -- keep the latest version of each order
  AND JSON_VALUE(payload, '$.status') IS DISTINCT FROM 'cancelled'  -- NULL-safe: rows with no status are kept

That ROW_NUMBER() … WHERE rn = 1 is the deduplication the append-only raw table needed — the equivalent of Talend's truncate-reload, but expressed as data, not destruction.
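
For readers who think more easily in Python than in window functions, the same keep-the-latest rule can be sketched over in-memory rows. This only illustrates the semantics (the real work stays in SQL), and the sample rows and function names are made up.

```python
def latest_per_order(raw_rows):
    """Keep the most recently ingested row per order id (the rn = 1 rule)."""
    latest = {}
    for row in raw_rows:
        order_id = row["payload"]["id"]
        current = latest.get(order_id)
        if current is None or row["_ingested_at"] > current["_ingested_at"]:
            latest[order_id] = row
    return list(latest.values())


def staged(raw_rows):
    """Deduplicate first, then drop orders whose latest version is cancelled."""
    return [
        row["payload"]
        for row in latest_per_order(raw_rows)
        if row["payload"].get("status") != "cancelled"
    ]


raw_rows = [
    {"payload": {"id": "A1", "status": "open"}, "_ingested_at": "2024-05-01T02:00:00Z"},
    {"payload": {"id": "A1", "status": "paid"}, "_ingested_at": "2024-05-02T02:00:00Z"},
    {"payload": {"id": "B2", "status": "open"}, "_ingested_at": "2024-05-01T02:00:00Z"},
    {"payload": {"id": "B2", "status": "cancelled"}, "_ingested_at": "2024-05-02T02:00:00Z"},
]
result = staged(raw_rows)
```

The order of the two steps matters. Order B2 was open on the first night and cancelled on the second; filtering before deduplicating would let the stale open version survive, which is why the SQL applies the status filter to the rn = 1 rows rather than inside the CTE.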

Mart holds the joins and business logic the dashboards read:

-- models/marts/mart_revenue_by_month.sql
SELECT
    DATE_TRUNC(created_at, MONTH)  AS month,
    COUNT(DISTINCT order_id)        AS orders,
    SUM(order_total)                AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1

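The source that {{ source('raw', 'orders') }} resolves against is declared once, along with a freshness check:

# models/staging/_sources.yml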
sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: { count: 26, period: hour }

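That freshness block gives you something Talend never did for free: dbt source freshness flags the source as stale if the Cloud Run function stopped landing data. With only warn_after set, the command warns; add an error_after threshold if you want it to exit non-zero and fail the run. Either way, the pipeline tells you when it's broken instead of going quietly stale.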
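
What the check computes is easy to state: compare the newest _ingested_at with the current time and flag the source once the gap exceeds the threshold. The Python sketch below mirrors that rule for the 26-hour warn_after above. It illustrates the logic only and is not how dbt implements it; is_stale and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(latest_ingested_at, now, warn_after=timedelta(hours=26)):
    """True when the newest raw row is older than the warn threshold."""
    return now - latest_ingested_at > warn_after


now = datetime(2024, 5, 3, 3, 0, tzinfo=timezone.utc)
last_night = datetime(2024, 5, 2, 2, 0, tzinfo=timezone.utc)     # 25 hours ago
two_nights_ago = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)  # 49 hours ago

healthy = is_stale(last_night, now)        # nightly run landed on schedule
missed = is_stale(two_nights_ago, now)     # one nightly run was skipped
```

A 26-hour window on a 24-hour schedule leaves two hours of slack for a slow run before anyone is alerted.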

Step 4 — Orchestration: Cloud Scheduler, not a Talend server

The nightly trigger moves off the shared VM entirely. The chain is fully serverless:

Cloud Scheduler fires on a cron (0 2 * * *).
It invokes a Cloud Workflow that runs the steps in order: call the Cloud Run extractor, wait for it, then trigger the dbt run.
dbt builds staging → marts and runs tests.

# workflow.yaml — Google Cloud Workflows
main:
  steps:
    - extract:
        call: http.post
        args:
          url: https://orders-extractor-xxxx.run.app
          auth: { type: OIDC }
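    - transform:
        call: http.post
        args:
          # dbt Cloud v2 "trigger job run" endpoint; the host varies by region and account
          url: https://cloud.getdbt.com/api/v2/accounts/.../jobs/123/run/
          headers: { Authorization: '${"Token " + sys.get_env("DBT_TOKEN")}' }
          body: { cause: "Triggered by Cloud Workflows" }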

No VM. No always-on Talend runtime. Extraction runs for the ninety seconds it takes to page the API, then scales to zero. We covered this orchestration pattern — Workflows + Cloud Run + dbt, no Airflow cluster — in depth here.
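
One thing the workflow above does not do is wait for the dbt job to finish: the second http.post only queues a run. If a later step depends on the outcome, something has to poll the run status. The sketch below shows that polling logic in Python, with the status lookup injected as a callable so nothing here calls a real API. The status codes (10 for success, 20 for error, 30 for cancelled) are what the dbt Cloud v2 API is understood to return and should be checked against the current API reference; wait_for_run and get_status are hypothetical names.

```python
import time

SUCCESS = 10        # assumed dbt Cloud code for a successful run
FAILED = {20, 30}   # assumed codes for error and cancelled


def wait_for_run(get_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_status() until the run reaches a terminal state.

    Returns True on success and False on error or cancel. Raises
    TimeoutError if the run is still in progress after max_polls checks.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == SUCCESS:
            return True
        if status in FAILED:
            return False
        sleep(poll_seconds)
    raise TimeoutError("dbt run did not finish in time")


# Usage with a canned sequence standing in for the API: queued, running, success
statuses = iter([1, 3, 10])
finished_ok = wait_for_run(lambda: next(statuses), sleep=lambda seconds: None)
```

Injecting get_status and sleep keeps the loop testable without credentials or real waiting, which is the same argument the post makes for pulling the pagination logic out of Talend.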

Step 5 — Testing and the parallel run

The job that had zero automated checks gets them now, in the layer where they belong:

# models/staging/_schema.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_total
        tests:
          - dbt_utils.accepted_range: { min_value: 0 }

dbt test runs on every build and blocks the merge on failure. Then — the step you do not skip — run both pipelines side by side for two to four weeks and compare row counts and daily revenue totals until they match for a full week. We wrote up why that discipline is non-negotiable, and the failure modes of skipping it, in the migration playbook. Only then do you delete the Talend job.
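
The comparison itself is simple enough to script. A minimal sketch, assuming each pipeline's output has already been summarised into a mapping of day to (row_count, revenue); the function name, the input shape, and the sample numbers are illustrative only.

```python
from decimal import Decimal


def compare_daily(talend, dbt, tolerance=Decimal("0.00")):
    """Return the days on which the two pipelines disagree.

    Each input maps an ISO date string to (row_count, revenue).
    A day missing from either side counts as a mismatch.
    """
    mismatches = {}
    for day in sorted(set(talend) | set(dbt)):
        old, new = talend.get(day), dbt.get(day)
        if old is None or new is None:
            mismatches[day] = (old, new)
            continue
        counts_differ = old[0] != new[0]
        revenue_differs = abs(old[1] - new[1]) > tolerance
        if counts_differ or revenue_differs:
            mismatches[day] = (old, new)
    return mismatches


talend_daily = {
    "2024-05-01": (120, Decimal("4510.50")),
    "2024-05-02": (98, Decimal("3720.00")),
}
dbt_daily = {
    "2024-05-01": (120, Decimal("4510.50")),
    "2024-05-02": (99, Decimal("3745.00")),
}
diff = compare_daily(talend_daily, dbt_daily)
```

An empty result for seven consecutive days is the signal described above. Anything else names the exact day to investigate, and Decimal keeps the revenue comparison free of floating-point noise.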

What changed, beyond "it's SQL now"

The line-for-line rewrite isn't the win. Three structural things are:

The raw layer is permanent. Every API response is preserved in BigQuery. Transform bugs are a dbt run away from fixed — no re-pulling history.
Compute scales to zero. The Talend VM ran 24/7 to do ninety seconds of work a night. Cloud Run bills for the ninety seconds. In many estates, a whole fleet of these jobs can cost less than the single VM they replace.
Every piece is in Git. The pagination logic, the field mapping, the schedule, the tests — all reviewable in a PR. The "ask Dave" dependency is gone.

Layer	Talend world	Google Cloud + dbt
API extraction	`tREST` loop component	Cloud Run function (Python)
Raw storage	(none — transformed on the way in)	BigQuery raw table, JSON column
Transformation	`tMap` on a Talend server	dbt staging + mart models
Orchestration	Talend scheduler on a VM	Cloud Scheduler → Workflows
Testing	Manual spot checks	dbt tests in CI
Compute	Always-on VM	Serverless, scales to zero

Do this once and the pattern repeats: the next API job, the next file drop, the next database extract all follow the same three-part shape — land raw, transform in dbt, orchestrate serverless. That's how a Talend estate of forty jobs comes apart one clean pipeline at a time.

If you're scoping a Talend-to-dbt move on Google Cloud and want to pressure-test the approach against your actual estate, book a discovery call — we'll walk through what the first pipeline looks like for your stack.
