From GUI to Git: Migrating Your ETL from Talend to dbt for BigQuery


In the ever-evolving landscape of data engineering, organizations are continuously seeking more efficient, scalable, and collaborative ways to manage their data transformation pipelines. For years, graphical user interface (GUI)-based ETL tools like Talend have been a popular choice for their visual workflow design. However, a modern, code-centric approach using tools like dbt (data build tool) is rapidly gaining traction, especially for cloud data warehouses like Google BigQuery.


This article will walk you through a simple data aggregation task, first demonstrating how to build it as a Talend job and then showing the equivalent, and arguably more streamlined, process in dbt. We will use a common business scenario: calculating the total sales amount per product category from a sales data table and loading the aggregated results into a BigQuery table.


The Use Case: Aggregating Sales Data

Imagine you have a table in BigQuery named raw_sales with the following columns: order_id, product_category, quantity, and price. Your goal is to create a new table, agg_sales_by_category, that summarizes the total sales amount for each product category.


Here's a small, illustrative sample of our source data in the raw_sales table:

order_id | product_category | quantity | price
---------|------------------|----------|-------
1001     | Electronics      | 2        | 250.00
1002     | Apparel          | 3        | 40.00
1003     | Electronics      | 1        | 500.00
1004     | Apparel          | 5        | 20.00


Our desired output in the agg_sales_by_category table would then be:

product_category | total_sales
-----------------|------------
Electronics      | 1000.00
Apparel          | 220.00



The Talend Approach: A Visual ETL Job

Talend allows you to build data pipelines visually by connecting components on a canvas. For our use case, the job would look like this:

Talend Job Flow: tBigQueryInput → tMap → tAggregateRow → tBigQueryOutput


1. Reading Data from BigQuery (tBigQueryInput)


This component is configured to connect to your Google Cloud project and read data from the raw_sales table. You would specify your project ID, dataset name, and the table name. The schema of the component would be defined to match the columns of the raw_sales table.


2. Calculating Total Price (tMap)


The tMap component is a powerful tool for row-level transformations. In this step, we'll add a new column called sales_amount by multiplying the quantity and price for each row.

  • Input: order_id, product_category, quantity, price

  • Output: product_category, sales_amount (Expression: row1.quantity * row1.price)


3. Aggregating the Data (tAggregateRow)


This is where the core aggregation logic resides. We will group the data by product_category and calculate the sum of sales_amount.

  • Group by: product_category

  • Operations:

    • Output Column: total_sales

    • Function: sum

    • Input Column: sales_amount


4. Writing to BigQuery (tBigQueryOutput)


Finally, this component writes the aggregated data to our target table, agg_sales_by_category, in BigQuery. You'll configure the connection details, dataset, and table name. You can also set the create and write dispositions, such as CREATE_IF_NEEDED and WRITE_TRUNCATE, so the table is created if missing and overwritten on each run.

While this visual approach can be intuitive for simple pipelines, it can become cumbersome to manage, version control, and debug as the logic grows in complexity.
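In fact, everything this four-component job does can be expressed as a single SQL statement run directly in BigQuery. Here is a sketch, using the same placeholder project and dataset names as the dbt model below:

CREATE OR REPLACE TABLE `your_gcp_project.your_dataset.agg_sales_by_category` AS
SELECT
  product_category,
  SUM(quantity * price) AS total_sales
FROM
  `your_gcp_project.your_dataset.raw_sales`
GROUP BY
  product_category;

That observation is the premise of the dbt approach: if the transformation is just SQL, let the warehouse run the SQL and manage everything around it as code.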


The dbt Approach: SQL-based and Git-friendly


dbt allows you to transform data in your warehouse using SQL. It brings software engineering best practices like version control, testing, and documentation to your analytics code.

Here's how you would achieve the same aggregation in a dbt project:

dbt Project Structure:

A typical dbt project has a well-defined folder structure. For our use case, we would create a new SQL model file.

my_dbt_project/
|-- models/
|   |-- aggregations/
|       |-- agg_sales_by_category.sql
|-- dbt_project.yml

The dbt Model (agg_sales_by_category.sql):

This SQL file contains the logic for our transformation. dbt will execute this SQL against BigQuery and materialize the result as a new table (or view).

{{
  config(
    materialized='table'
  )
}}

WITH source_data AS (
  SELECT
    product_category,
    quantity * price AS sales_amount
  FROM
    `your_gcp_project.your_dataset.raw_sales`
)

SELECT
  product_category,
  SUM(sales_amount) AS total_sales
FROM
  source_data
GROUP BY
  product_category

Running the dbt Job:

To run this transformation, you would simply execute the following command in your terminal:

 dbt run --select agg_sales_by_category

dbt connects to your BigQuery project (configured in a profiles.yml file), runs the SQL query, and creates the agg_sales_by_category table in the target dataset configured in your profile and dbt_project.yml.
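For completeness, here is a minimal sketch of what that profiles.yml might look like for BigQuery, assuming OAuth-based authentication and the same placeholder project and dataset names:

my_dbt_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: your_gcp_project
      dataset: your_dataset
      threads: 4
      location: US

The top-level key (my_dbt_project here) must match the profile name declared in dbt_project.yml. For scheduled, non-interactive runs, service-account authentication (method: service-account with a keyfile path) is the usual alternative to oauth.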


The Verdict: Why Modern Data Teams are Flocking to dbt

While Talend has been a stalwart in the ETL world, the shift towards cloud-native data warehouses like BigQuery has highlighted the advantages of a code-first, in-warehouse transformation tool like dbt.

If your organization is currently using a traditional ETL tool like Talend to feed your BigQuery warehouse, it's worth exploring how a migration to dbt can streamline your workflows, improve your data quality, and empower your team to move faster and more reliably. The transition from clicking to coding might seem daunting, but the long-term benefits in terms of efficiency, scalability, and collaboration are well worth the investment.

Reach out to us at sales@warehows.io if you have any requirements around Talend migration.
