Cloud-native data integration service with a visual interface for ETL pipelines. Fully managed and based on the open source CDAP platform.
What is Data Fusion?
Data Fusion is Google’s answer to the growing complexity of data integration in hybrid and multi-cloud environments. While traditional ETL tools are often proprietary and difficult to scale, Data Fusion offers a cloud-native, visual solution built on the open source CDAP platform.
The service lets you build complex data pipelines via drag-and-drop, without writing any code. Over 150 pre-built connectors cover the most common data sources. Under the hood, Data Fusion runs batch pipelines on Apache Spark and scales to petabyte-sized datasets.
Data Fusion is more than an ETL tool. The service offers data lineage at the field level, integrated pipeline monitoring, and the ability to export pipelines as code. This makes it the natural choice for enterprises establishing DataOps practices while enabling business analysts to create pipelines independently.
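Because pipelines can be exported as code, they can also be managed programmatically. The following sketch shows one way to pull a deployed pipeline's JSON definition through the CDAP REST API that every Data Fusion instance exposes; the project, location, instance, and pipeline names ("my-project", "europe-west1", "my-instance", "my_pipeline") are placeholders, and the endpoint lookup assumes the Cloud Data Fusion v1 REST API.

```python
# Minimal sketch: export a deployed Data Fusion pipeline as JSON via the CDAP REST API.
# All resource names below are placeholders; adjust them for your environment.
import json
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())
headers = {"Authorization": f"Bearer {credentials.token}"}

# Look up the CDAP API endpoint of the Data Fusion instance.
instance = requests.get(
    "https://datafusion.googleapis.com/v1/projects/my-project/"
    "locations/europe-west1/instances/my-instance",
    headers=headers).json()
cdap_endpoint = instance["apiEndpoint"]

# Fetch the deployed pipeline; its "configuration" field is the pipeline-as-code JSON
# that can be version-controlled and re-imported into another instance.
app = requests.get(
    f"{cdap_endpoint}/v3/namespaces/default/apps/my_pipeline",
    headers=headers).json()
with open("my_pipeline.json", "w") as f:
    json.dump(json.loads(app["configuration"]), f, indent=2)
```

Exported definitions like this can be committed to version control and redeployed to other instances, which is the usual basis for CI/CD-driven DataOps.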
Common Use Cases
ETL/ELT Pipelines for BigQuery
Data Fusion is the preferred solution for visually creating ETL pipelines that load data from a wide range of sources into BigQuery. Transformations such as joins, aggregations, deduplication, and data quality checks are defined in the visual editor, and built-in scheduling runs pipelines automatically on a recurring basis.
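Behind the visual editor, each pipeline is stored as a JSON document that describes its stages and the connections between them. The sketch below shows the rough shape of such a definition for a Cloud Storage-to-BigQuery load with deduplication and a recurring schedule; the plugin names and properties ("GCSFile", "Deduplicate", "BigQueryTable") as well as the artifact version are illustrative assumptions and should be checked against the plugin palette of your instance.

```python
# Illustrative sketch of a batch ETL pipeline definition (the JSON the visual editor
# produces). Plugin names, properties, and the artifact version are assumptions.
pipeline = {
    "name": "orders_to_bigquery",
    "artifact": {"name": "cdap-data-pipeline", "scope": "SYSTEM", "version": "6.9.2"},
    "config": {
        "stages": [
            {"name": "Read CSV", "plugin": {
                "name": "GCSFile", "type": "batchsource",
                "properties": {"path": "gs://my-bucket/orders/*.csv", "format": "csv"}}},
            {"name": "Deduplicate", "plugin": {
                "name": "Deduplicate", "type": "batchaggregator",
                "properties": {"uniqueFields": "order_id"}}},
            {"name": "Load BQ", "plugin": {
                "name": "BigQueryTable", "type": "batchsink",
                "properties": {"dataset": "sales", "table": "orders"}}},
        ],
        # Edges of the DAG: source -> deduplication -> sink.
        "connections": [
            {"from": "Read CSV", "to": "Deduplicate"},
            {"from": "Deduplicate", "to": "Load BQ"},
        ],
        # Cron expression for the recurring schedule (here: daily at 02:00).
        "schedule": "0 2 * * *",
    },
}
```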
Cloud-to-Cloud Data Integration
Integrate data between different cloud platforms without managing your own infrastructure. Example: Replicate data from AWS S3 to Google Cloud Storage, transform, and load into BigQuery. Data Fusion automatically manages orchestration and scaling.
On-Premises to Cloud Migration
For hybrid cloud scenarios, Data Fusion offers secure connections to on-premises databases via VPC peering or Cloud VPN. Incremental replication enables smooth migrations without downtime. Change Data Capture (CDC) keeps source and target systems synchronized.
Real-time Streaming Pipelines
Process event streams from Pub/Sub, Kafka, or Cloud Storage notifications in real time. Windowing, aggregations, and stream joins are configured visually. Results can be written to BigQuery for analytics or to Bigtable for low-latency access.
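A streaming pipeline follows the same pipeline-as-JSON idea, but uses the streaming artifact and a micro-batch interval. The sketch below is only illustrative: the plugin names ("GoogleSubscriber", "Window", "BigQueryTable") and their properties are assumptions to verify against your instance's plugin palette.

```python
# Illustrative sketch of a streaming pipeline: Pub/Sub -> window -> BigQuery.
# Plugin names, properties, and the artifact version are assumptions.
streaming_pipeline = {
    "name": "clickstream_to_bigquery",
    "artifact": {"name": "cdap-data-streams", "scope": "SYSTEM", "version": "6.9.2"},
    "config": {
        "batchInterval": "10s",  # micro-batch size of the underlying Spark Streaming job
        "stages": [
            {"name": "Events", "plugin": {
                "name": "GoogleSubscriber", "type": "streamingsource",
                "properties": {"subscription": "clickstream-sub", "project": "my-project"}}},
            {"name": "Window", "plugin": {
                "name": "Window", "type": "windower",
                "properties": {"windowDuration": "300", "slideInterval": "60"}}},
            {"name": "Sink", "plugin": {
                "name": "BigQueryTable", "type": "batchsink",
                "properties": {"dataset": "web", "table": "clicks_windowed"}}},
        ],
        "connections": [
            {"from": "Events", "to": "Window"},
            {"from": "Window", "to": "Sink"},
        ],
    },
}
```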
Data Lake Ingestion
Automate the ingestion of raw data into Cloud Storage data lakes. Data Fusion can pull files from a variety of sources, validate and transform them, and store them partitioned in Cloud Storage. Integration with Data Catalog enables automatic metadata cataloging.
Integration with innFactory
As a Google Cloud partner, innFactory supports you in implementing Data Fusion for enterprise-wide data integration. We help with architecting hybrid cloud pipelines, developing custom plugins, and optimizing pipeline performance.
Our expertise includes migrating from legacy ETL tools to Data Fusion, implementing DataOps practices with CI/CD integration, and building data lineage strategies for governance and compliance.
Contact us for consultation on Data Fusion and data integration on Google Cloud.
Available Tiers & Options
Basic
- Cost-effective entry point
- For simple ETL pipelines
- No pipeline limits
- No high availability
- Limited scaling
- No SLA
Enterprise
- High availability across zones
- Advanced transformations
- Pipeline monitoring and lineage
- CMEK encryption
- Higher costs
- More complex configuration
Frequently Asked Questions
What is Data Fusion?
Data Fusion is a fully managed, cloud-native data integration service with a visual interface. It is based on the open source CDAP (Cask Data Application Platform) and enables building ETL/ELT pipelines via drag-and-drop, without writing code. Apache Spark serves as the execution engine.
When should I use Data Fusion instead of Dataflow?
Data Fusion is suitable for teams without deep programming skills who want to create ETL pipelines visually. Dataflow is better for complex, code-based streaming pipelines with Apache Beam. If you need GUI-based data integration with 150+ pre-built connectors, Data Fusion is the right choice.
Which data sources and destinations does Data Fusion support?
Data Fusion offers over 150 pre-built connectors for BigQuery, Cloud Storage, Cloud SQL, Spanner, Bigtable, Pub/Sub, as well as external sources like Oracle, SAP, Salesforce, S3, Azure Blob Storage, and on-premises databases. Custom plugins can be developed for proprietary systems.
What is the difference between Basic and Enterprise Edition?
Basic Edition is suitable for development and test environments with simple pipelines and no high-availability requirements. Enterprise Edition adds zonal redundancy, advanced transformations, pipeline lineage, monitoring dashboards, and CMEK encryption for production environments.
Can Data Fusion process streaming pipelines?
Yes, Data Fusion supports both batch and streaming pipelines. Streaming sources such as Pub/Sub, Kafka, or Cloud Storage events can be processed and written in real time to BigQuery, Bigtable, or other destinations. Windowing and aggregations are also possible.
How is Data Fusion priced?
Data Fusion charges for instance runtime by the hour, with rates depending on the edition; Basic Edition costs less than Enterprise Edition. In addition, pipeline runs incur costs for the compute resources they use (ephemeral Dataproc clusters). Instances that are no longer needed can be deleted to save costs, after exporting their pipelines for later re-import.
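As a rough illustration of how the two cost components add up, the sketch below uses purely hypothetical rates; consult the current Google Cloud pricing page for your region and edition.

```python
# Back-of-the-envelope cost sketch. The rates are placeholders, NOT list prices.
HOURS_PER_MONTH = 730
instance_rate_per_hour = 1.10    # hypothetical instance rate (edition-dependent)
dataproc_cost_per_run = 0.50     # hypothetical ephemeral Dataproc cost per pipeline run
runs_per_day = 24

instance_cost = HOURS_PER_MONTH * instance_rate_per_hour
execution_cost = runs_per_day * 30 * dataproc_cost_per_run
print(f"Instance: ~${instance_cost:.0f}/month, pipeline compute: ~${execution_cost:.0f}/month")
```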
Is Data Fusion GDPR-compliant?
Yes, Data Fusion is available in EU regions and can be operated in compliance with GDPR. Enterprise Edition offers CMEK (Customer-Managed Encryption Keys) for additional control over encryption. Private instances with VPC peering enable secure connections to on-premises systems without traffic crossing the public internet.
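For illustration, the sketch below creates a private Enterprise instance in an EU region with a customer-managed key via the Cloud Data Fusion v1 REST API (instances.create). Project, region, network, IP range, and key names are placeholders; verify the request fields for your setup.

```python
# Sketch: create a private Enterprise instance in an EU region with CMEK.
# All resource names are placeholders.
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

body = {
    "type": "ENTERPRISE",
    "privateInstance": True,  # no public endpoint; reachable via VPC peering
    "networkConfig": {
        "network": "my-vpc",
        "ipAllocation": "10.124.40.0/22",  # CIDR range reserved for the tenant-project peering
    },
    "cryptoKeyConfig": {
        "keyReference": "projects/my-project/locations/europe-west3/"
                        "keyRings/my-ring/cryptoKeys/my-key",
    },
}
resp = requests.post(
    "https://datafusion.googleapis.com/v1/projects/my-project/"
    "locations/europe-west3/instances?instanceId=df-prod-eu",
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=body)
resp.raise_for_status()  # returns a long-running operation
```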
Does Data Fusion support data lineage?
Yes, Enterprise Edition offers automatic data lineage at the field level. You can visually trace how data flows through transformations, which fields originate from which sources, and where they end up. This is essential for impact analysis and compliance.
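Lineage is primarily explored in the Data Fusion UI, but it can also be queried programmatically. The sketch below assumes CDAP's field-lineage REST endpoint; the exact path and parameters can differ between CDAP versions, so treat it as a starting point rather than a reference.

```python
# Sketch: list fields of a dataset that have recorded lineage in the last 24 hours.
# The endpoint path is an assumption based on CDAP's field-lineage REST API.
import time
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

# apiEndpoint of the instance, as looked up in the export example above (placeholder).
cdap_endpoint = "https://my-instance-my-project-dot-euw3.datafusion.googleusercontent.com/api"
end = int(time.time())
start = end - 24 * 3600

fields = requests.get(
    f"{cdap_endpoint}/v3/namespaces/default/datasets/sales_orders/lineage/fields",
    params={"start": start, "end": end},
    headers={"Authorization": f"Bearer {credentials.token}"}).json()
print(fields)
```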
