migration · September 15, 2025 · 7 min read

Migrating from Talend to dbt for modern data engineering

A practical guide to replacing Talend's visual ETL with dbt's SQL-first approach — from audit to parallel validation to the day you turn Talend off.


Visual ETL made sense when data teams were small and transformations were simple. Drag a connector, wire a mapping, schedule a job. But at some point the DAG viewer became unreadable, the version control story became "ask Dave," and the server running Talend became the single point of failure nobody wanted to touch.

That's when the conversation about dbt starts.

We've run this migration enough times — across Talend Open Studio, Talend Cloud, and a few Informatica instances for good measure — to know where it goes smoothly and where it doesn't. This is the playbook.

Why teams move

The reasons are remarkably consistent:

  1. SQL-native transformations. dbt runs inside your warehouse. No external compute. No data movement. The warehouse you're already paying for does the work.
  2. Git as the source of truth. Every transformation is a SQL file in a repo. PRs, code review, CI — the same workflow your software engineers already use.
  3. Testing that actually runs. dbt tests execute on every build. Talend quality components exist but nobody enforces them consistently.
  4. Lineage you can trace. dbt docs generate produces a full dependency graph. In Talend, lineage means opening every job and manually following the connections.

The 60–70% improvement in query times we typically see post-migration is a bonus, not the reason. The real win is that your data team can move at the speed of a PR instead of the speed of a change-request ticket.

The audit nobody wants to do (but everyone needs)

Before writing a single dbt model, catalogue every Talend job. For each:

  • Sources and destinations. Where does data come from, where does it land?
  • Transformations. Rename, cast, join, aggregate, filter — name each one.
  • Schedule and owner. Who runs it, how often, what breaks when it doesn't?
  • Consumer. Who actually reads the output?

That last column is where the savings hide. In our experience, 20–30% of Talend jobs are orphaned — they run on schedule, they consume compute, and nobody has looked at their output in months. Retire those. Don't migrate dead weight.
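The "who actually reads the output" column can often be filled from warehouse query history instead of interviews. A minimal sketch on BigQuery (the region qualifier is illustrative; Snowflake's ACCESS_HISTORY view serves the same purpose):

```sql
-- Count how often each table was read in the last 90 days.
-- Talend outputs that never appear here are retirement candidates.
SELECT
    t.dataset_id,
    t.table_id,
    COUNT(*) AS reads_last_90_days
FROM `region-us`.INFORMATION_SCHEMA.JOBS,
    UNNEST(referenced_tables) AS t
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY 1, 2
ORDER BY reads_last_90_days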

Split the rest:

| Bucket | What's in it | Migration path |
| --- | --- | --- |
| Clean SQL mappings | ~70% of jobs | Direct dbt model conversion |
| Iteration / file handling | ~10% of jobs | Orchestrator + dbt vars |
| Obsolete | ~20% of jobs | Archive and delete |

Extraction is not dbt's job

This trips people up. dbt transforms data that's already in the warehouse. It doesn't extract.

Replace Talend's tDBInput and tFileInput components with purpose-built extraction:

  • Fivetran for managed connectors — Salesforce, Shopify, HubSpot, Google Ads, 300+ others.
  • Airbyte for self-hosted or custom sources.
  • Cloud Functions / Workflows for bespoke API pulls.

Raw data lands in your warehouse untouched. Register those tables as dbt sources so lineage starts clean from the first hop.
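Registering those tables is a few lines of YAML. A sketch matching the `source('erp', 'raw_orders')` reference used in the staging model below (schema, second table, and sync column are placeholders for your landing dataset):

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: erp
    schema: raw_erp           # wherever Fivetran / Airbyte lands the data
    tables:
      - name: raw_orders
      - name: raw_customers   # illustrative second table
        loaded_at_field: _synced_at
        freshness:
          warn_after: {count: 24, period: hour}
```

The optional freshness block means dbt can also tell you when the extraction layer silently stops delivering.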

Converting tMap to SQL models

Each tMap component becomes a SQL file. Structure them in layers:

Staging models (stg_*.sql) — one per source table. Rename columns, cast types, filter junk. No joins, no aggregations. One source in, one clean table out.

```sql
-- models/staging/stg_orders.sql
SELECT
    order_id,
    CAST(order_date AS DATE)      AS order_date,
    LOWER(TRIM(customer_email))   AS customer_email,
    order_total_cents / 100.0     AS order_total
FROM {{ source('erp', 'raw_orders') }}
WHERE order_id IS NOT NULL
```

Mart models (mart_*.sql) — this is where joins and business logic live. These are what dashboards read.

```sql
-- models/marts/mart_revenue_by_month.sql
SELECT
    DATE_TRUNC(o.order_date, MONTH)   AS month,
    COUNT(DISTINCT o.order_id)        AS orders,
    SUM(o.order_total)                AS revenue
FROM {{ ref('stg_orders') }} o
GROUP BY 1
```

Don't replicate tMap logic 1:1. The visual abstractions in Talend often paper over bad join logic — rewriting in SQL exposes assumptions you didn't know existed.

The iteration trap

dbt is set-based. It doesn't loop.

If your Talend job iterates over a list of client IDs, date ranges, or file paths, that iteration belongs in your orchestrator:

  • Airflow generates the parameter list.
  • Airflow calls dbt with variables: dbt run --vars '{"client_id": "abc"}'.
  • The dbt model reads the variable: {{ var('client_id') }}.

Clean separation. The orchestrator decides what to run. dbt decides how to transform it.

Teams that try to force iteration into dbt — Jinja loops generating dynamic SQL, macros that call macros — end up with something harder to maintain than the Talend job they replaced.
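On the dbt side, the variable shows up as a plain filter. A sketch (this model and its `client_id` column are hypothetical; the `stg_orders` example above doesn't carry one):

```sql
-- models/marts/mart_client_orders.sql
-- Run once per client by the orchestrator:
--   dbt run --vars '{"client_id": "abc"}'
SELECT
    order_id,
    order_date,
    order_total
FROM {{ ref('stg_orders') }}
WHERE client_id = '{{ var("client_id") }}'
```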

Testing: the part Talend never enforced

dbt's testing framework is its quiet superpower. Start with the basics:

```yaml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_email
        tests:
          - not_null
```

Then layer on business-specific tests:

```yaml
      - name: order_total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

Run dbt test in CI. Failures block merges. You'll catch more data quality bugs in week one than Talend caught in a year.
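A minimal CI wiring, assuming GitHub Actions with warehouse credentials supplied through secrets (the adapter package and secret name are illustrative; swap in whatever your warehouse needs):

```yaml
# .github/workflows/dbt-ci.yml
name: dbt CI
on: pull_request

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-bigquery   # or dbt-snowflake / dbt-databricks
      # dbt build runs models and tests together; any failure fails the PR
      - run: dbt build --target ci
        env:
          WAREHOUSE_KEYFILE: ${{ secrets.WAREHOUSE_KEY }}   # illustrative
```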

Parallel validation: the non-negotiable step

Two to four weeks of running both systems side by side. No exceptions.

Compare daily:

  • Row counts on critical tables.
  • Aggregate values — revenue, user counts, whatever your dashboards report.
  • Dashboards built on both outputs.
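The daily comparison can be a single query per critical table. A sketch, assuming the Talend job writes to `legacy.revenue_by_month` (name illustrative) and the dbt mart from earlier:

```sql
-- Row counts and a headline aggregate, side by side
SELECT 'talend' AS pipeline, COUNT(*) AS row_count, SUM(revenue) AS total_revenue
FROM legacy.revenue_by_month
UNION ALL
SELECT 'dbt', COUNT(*), SUM(revenue)
FROM analytics.mart_revenue_by_month
```

Once the totals match, a set difference (`EXCEPT DISTINCT` on BigQuery, `MINUS` on Snowflake) catches row-level drift that aggregates hide.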

When the numbers match for a full week, retire the Talend job. Not before.

Teams that skip this step spend months discovering tiny discrepancies in production, usually after the Talend server has been decommissioned and the fix is no longer simple.

What the stack looks like after

| Layer | Talend world | dbt world |
| --- | --- | --- |
| Extraction | tDBInput, tFileInput, tREST | Fivetran / Airbyte |
| Transformation | Talend jobs on a dedicated server | dbt models in your warehouse |
| Orchestration | Talend scheduler or cron | Airflow / Dagster / Prefect |
| Testing | Manual spot checks | dbt tests in CI, every build |
| Version control | "Ask Dave" | Git-native, PR-reviewed |
| Deployment | Export + import job archives | dbt run triggered by CI merge |

Timeline

For a medium-complexity estate (30–80 Talend jobs, 2–3 source systems):

| Phase | Duration | What happens |
| --- | --- | --- |
| Audit + bucketing | 1 week | Catalogue, retire dead jobs, scope the migration |
| Extraction setup | 1 week | Fivetran / Airbyte connectors, raw tables landing |
| Core model conversion | 2–3 weeks | Staging + mart models, tests, documentation |
| Parallel validation | 2 weeks | Both systems running, daily comparison |
| Cutover + cleanup | 1 week | Retire Talend, update schedules, close tickets |

Total: 7–8 weeks for a team of two. Faster if the Talend estate is clean. Slower if there's iteration logic or undocumented tribal knowledge baked into the jobs.

The uncomfortable truth

The hardest part of this migration isn't technical. It's getting the team to stop thinking in visual mappings and start thinking in SQL layers. The engineers who built those Talend jobs often have years of muscle memory — they know which tMap to open, which connection to check, which schedule to restart.

That muscle memory is valuable. What changes is the medium. Instead of opening a job designer, you open a SQL file. Instead of checking a tMap, you read a ref(). Instead of restarting a schedule, you re-run a CI pipeline.

The knowledge transfers. The tooling gets out of the way.


We've run this migration for teams across Snowflake, BigQuery, and Databricks — from 20-job Talend estates to 200+. If you're weighing the move, book a discovery call and we'll walk through what it looks like for your stack.

Got a similar problem?

30 minutes. We'll tell you honestly what's broken.