Cloud-native data integration service with a visual interface for ETL pipelines. Fully managed and based on the open source CDAP platform.
What is Data Fusion?
Data Fusion is Google’s answer to the growing complexity of data integration in hybrid and multi-cloud environments. While traditional ETL tools are often proprietary and difficult to scale, Data Fusion offers a cloud-native, visual solution built on the open source CDAP platform.
The service lets you build complex data pipelines via drag-and-drop, without writing any code. Over 150 pre-built connectors cover the most common data sources. Under the hood, Data Fusion runs batch pipelines on Apache Spark and scales to petabyte-sized datasets.
Data Fusion is more than an ETL tool. The service offers data lineage at the field level, integrated pipeline monitoring, and the ability to export pipelines as code. This makes it the natural choice for enterprises establishing DataOps practices while enabling business analysts to create pipelines independently.
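Because pipelines can be exported as code, they can also be managed programmatically. The following sketch shows one way to pull a deployed pipeline's JSON definition through the CDAP REST API that every Data Fusion instance exposes; the project, location, instance, and pipeline names ("my-project", "europe-west1", "my-instance", "my_pipeline") are placeholders, and the endpoint lookup assumes the Cloud Data Fusion v1 REST API.

```python
# Minimal sketch: export a deployed Data Fusion pipeline as JSON via the CDAP REST API.
# All resource names below are placeholders; adjust them for your environment.
import json
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())
headers = {"Authorization": f"Bearer {credentials.token}"}

# Look up the CDAP API endpoint of the Data Fusion instance.
instance = requests.get(
    "https://datafusion.googleapis.com/v1/projects/my-project/"
    "locations/europe-west1/instances/my-instance",
    headers=headers).json()
cdap_endpoint = instance["apiEndpoint"]

# Fetch the deployed pipeline; its "configuration" field is the pipeline-as-code JSON
# that can be version-controlled and re-imported into another instance.
app = requests.get(
    f"{cdap_endpoint}/v3/namespaces/default/apps/my_pipeline",
    headers=headers).json()
with open("my_pipeline.json", "w") as f:
    json.dump(json.loads(app["configuration"]), f, indent=2)
```

Exported definitions like this can be committed to version control and redeployed to other instances, which is the usual basis for CI/CD-driven DataOps.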
Common Use Cases
ETL/ELT Pipelines for BigQuery
Data Fusion is the preferred solution for visually creating ETL pipelines that load data from a wide range of sources into BigQuery. Transformations such as joins, aggregations, deduplication, and data quality checks are defined in the visual editor, and built-in scheduling runs pipelines automatically on a recurring basis.
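Behind the visual editor, each pipeline is stored as a JSON document that describes its stages and the connections between them. The sketch below shows the rough shape of such a definition for a Cloud Storage-to-BigQuery load with deduplication and a recurring schedule; the plugin names and properties ("GCSFile", "Deduplicate", "BigQueryTable") as well as the artifact version are illustrative assumptions and should be checked against the plugin palette of your instance.

```python
# Illustrative sketch of a batch ETL pipeline definition (the JSON the visual editor
# produces). Plugin names, properties, and the artifact version are assumptions.
pipeline = {
    "name": "orders_to_bigquery",
    "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM", "version": "6.9.2"},
    "config": {
        "stages": [
            {"name": "Read CSV", "plugin": {
                "name": "GCSFile", "type": "batchsource",
                "properties": {"path": "gs://my-bucket/orders/*.csv", "format": "csv"}}},
            {"name": "Deduplicate", "plugin": {
                "name": "Deduplicate", "type": "batchaggregator",
                "properties": {"uniqueFields": "order_id"}}},
            {"name": "Load BQ", "plugin": {
                "name": "BigQueryTable", "type": "batchsink",
                "properties": {"dataset": "sales", "table": "orders"}}},
        ],
        # Edges of the DAG: source -> deduplication -> sink.
        "connections": [
            {"from": "Read CSV", "to": "Deduplicate"},
            {"from": "Deduplicate", "to": "Load BQ"},
        ],
        # Cron expression for the recurring schedule (here: daily at 02:00).
        "schedule": "0 2 * * *",
    },
}
```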
Cloud-to-Cloud Data Integration
Integrate data between different cloud platforms without managing your own infrastructure. Example: Replicate data from AWS S3 to Google Cloud Storage, transform, and load into BigQuery. Data Fusion automatically manages orchestration and scaling.
On-Premises to Cloud Migration
For hybrid cloud scenarios, Data Fusion offers secure connections to on-premises databases via VPC peering or Cloud VPN. Incremental replication enables smooth migrations without downtime. Change Data Capture (CDC) keeps source and target systems synchronized.
Real-time Streaming Pipelines
Process event streams from Pub/Sub, Kafka, or Cloud Storage notifications in real time. Windowing, aggregations, and stream joins are configured visually. Results can be written to BigQuery for analytics or to Bigtable for low-latency access.
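A streaming pipeline follows the same pipeline-as-JSON idea, but uses the streaming artifact and a micro-batch interval. The sketch below is only illustrative: the plugin names ("GoogleSubscriber", "Window", "BigQueryTable") and their properties are assumptions to verify against your instance's plugin palette.

```python
# Illustrative sketch of a streaming pipeline: Pub/Sub -> window -> BigQuery.
# Plugin names, properties, and the artifact version are assumptions.
streaming_pipeline = {
    "name": "clickstream_to_bigquery",
    "artifact": {"name": "cdap-data-streams", "scope": "SYSTEM", "version": "6.9.2"},
    "config": {
        "batchInterval": "10s",  # micro-batch size of the underlying Spark Streaming job
        "stages": [
            {"name": "Events", "plugin": {
                "name": "GoogleSubscriber", "type": "streamingsource",
                "properties": {"subscription": "clickstream-sub", "project": "my-project"}}},
            {"name": "Window", "plugin": {
                "name": "Window", "type": "windower",
                "properties": {"windowDuration": "300", "slideInterval": "60"}}},
            {"name": "Sink", "plugin": {
                "name": "BigQueryTable", "type": "batchsink",
                "properties": {"dataset": "web", "table": "clicks_windowed"}}},
        ],
        "connections": [
            {"from": "Events", "to": "Window"},
            {"from": "Window", "to": "Sink"},
        ],
    },
}
```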
Data Lake Ingestion
Automate the ingestion of raw data into Cloud Storage data lakes. Data Fusion can pull files from a variety of sources, validate and transform them, and store them partitioned in Cloud Storage. Integration with Data Catalog enables automatic metadata cataloging.
Integration with innFactory
As a Google Cloud partner, innFactory supports you in implementing Data Fusion for enterprise-wide data integration. We help with architecting hybrid cloud pipelines, developing custom plugins, and optimizing pipeline performance.
Our expertise includes migrating from legacy ETL tools to Data Fusion, implementing DataOps practices with CI/CD integration, and building data lineage strategies for governance and compliance.
Contact us for consultation on Data Fusion and data integration on Google Cloud.
Available Tiers & Options
Basic
- Cost-effective entry point
- For simple ETL pipelines
- No pipeline limits
- No high availability
- Limited scaling
- No SLA
Enterprise
- High availability across zones
- Advanced transformations
- Pipeline monitoring and lineage
- CMEK encryption
- Higher costs
- More complex configuration
Frequently Asked Questions
What is Data Fusion?
Data Fusion is a fully managed, cloud-native data integration service with a visual interface. It is based on the open source CDAP (Cask Data Application Platform) and enables building ETL/ELT pipelines via drag-and-drop, without writing code. Apache Spark serves as the execution engine.
When should I use Data Fusion instead of Dataflow?
Data Fusion is suitable for teams without deep programming skills who want to create ETL pipelines visually. Dataflow is better for complex, code-based streaming pipelines with Apache Beam. If you need GUI-based data integration with 150+ pre-built connectors, Data Fusion is the right choice.
Which data sources and destinations does Data Fusion support?
Data Fusion offers over 150 pre-built connectors for BigQuery, Cloud Storage, Cloud SQL, Spanner, Bigtable, Pub/Sub, as well as external sources like Oracle, SAP, Salesforce, S3, Azure Blob Storage, and on-premises databases. Custom plugins can be developed for proprietary systems.
What is the difference between Basic and Enterprise Edition?
Basic Edition is suitable for development and test environments with simple pipelines and no high-availability requirements. Enterprise Edition adds zonal redundancy, advanced transformations, pipeline lineage, monitoring dashboards, and CMEK encryption for production environments.
Can Data Fusion process streaming pipelines?
Yes, Data Fusion supports both batch and streaming pipelines. Streaming sources such as Pub/Sub, Kafka, or Cloud Storage events can be processed and written in real time to BigQuery, Bigtable, or other destinations. Windowing and aggregations are also possible.
How is Data Fusion priced?
Data Fusion charges for instance runtime by the hour, with rates depending on the edition; Basic Edition costs less than Enterprise Edition. In addition, pipeline runs incur costs for the compute resources they use (ephemeral Dataproc clusters). Instances that are no longer needed can be deleted to save costs, after exporting their pipelines for later re-import.
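As a rough illustration of how the two cost components add up, the sketch below uses purely hypothetical rates; consult the current Google Cloud pricing page for your region and edition.

```python
# Back-of-the-envelope cost sketch. The rates are placeholders, NOT list prices.
HOURS_PER_MONTH = 730
instance_rate_per_hour = 1.10    # hypothetical instance rate (edition-dependent)
dataproc_cost_per_run = 0.50     # hypothetical ephemeral Dataproc cost per pipeline run
runs_per_day = 24

instance_cost = HOURS_PER_MONTH * instance_rate_per_hour
execution_cost = runs_per_day * 30 * dataproc_cost_per_run
print(f"Instance: ~${instance_cost:.0f}/month, pipeline compute: ~${execution_cost:.0f}/month")
```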
Is Data Fusion GDPR-compliant?
Yes, Data Fusion is available in EU regions and can be operated in compliance with GDPR. Enterprise Edition offers CMEK (Customer-Managed Encryption Keys) for additional control over encryption. Private instances with VPC peering enable secure connections to on-premises systems without traffic crossing the public internet.
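For illustration, the sketch below creates a private Enterprise instance in an EU region with a customer-managed key via the Cloud Data Fusion v1 REST API (instances.create). Project, region, network, IP range, and key names are placeholders; verify the request fields for your setup.

```python
# Sketch: create a private Enterprise instance in an EU region with CMEK.
# All resource names are placeholders.
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

body = {
    "type": "ENTERPRISE",
    "privateInstance": True,  # no public endpoint; reachable via VPC peering
    "networkConfig": {
        "network": "my-vpc",
        "ipAllocation": "10.124.40.0/22",  # CIDR range reserved for the tenant-project peering
    },
    "cryptoKeyConfig": {
        "keyReference": "projects/my-project/locations/europe-west3/"
                        "keyRings/my-ring/cryptoKeys/my-key",
    },
}
resp = requests.post(
    "https://datafusion.googleapis.com/v1/projects/my-project/"
    "locations/europe-west3/instances?instanceId=df-prod-eu",
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=body)
resp.raise_for_status()  # returns a long-running operation
```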
Does Data Fusion support data lineage?
Yes, Enterprise Edition offers automatic data lineage at the field level. You can visually trace how data flows through transformations, which fields originate from which sources, and where they end up. This is essential for impact analysis and compliance.
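Lineage is primarily explored in the Data Fusion UI, but it can also be queried programmatically. The sketch below assumes CDAP's field-lineage REST endpoint; the exact path and parameters can differ between CDAP versions, so treat it as a starting point rather than a reference.

```python
# Sketch: list fields of a dataset that have recorded lineage in the last 24 hours.
# The endpoint path is an assumption based on CDAP's field-lineage REST API.
import time
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# apiEndpoint of the instance, as looked up in the export example above (placeholder).
cdap_endpoint = "https://my-instance-my-project-dot-euw3.datafusion.googleusercontent.com/api"
end = int(time.time())
start = end - 24 * 3600

fields = requests.get(
    f"{cdap_endpoint}/v3/namespaces/default/datasets/sales_orders/lineage/fields",
    params={"start": start, "end": end},
    headers={"Authorization": f"Bearer {credentials.token}"}).json()
print(fields)
```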
