Dataflow is Google’s serverless service for stream and batch data processing: one codebase for both models, automatic scaling, and no cluster management.
What is Dataflow?
Dataflow runs Apache Beam pipelines on Google’s infrastructure. You write pipelines in Java, Python, or Go; Dataflow handles provisioning, scaling, and monitoring. The same code runs for streaming (real-time) and batch (historical data).
Unlike Hadoop/Spark clusters, Dataflow is serverless: no cluster configuration, automatic scaling, pay-per-use.
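To make that concrete, here is a minimal sketch of a batch pipeline in the Python SDK. The project ID, region, and bucket paths are placeholders you would replace with your own:

```python
# Minimal batch pipeline sketch (Python SDK). Project, region, and
# bucket names below are placeholders, not real resources.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",             # use "DirectRunner" to test locally
    project="my-project",                # placeholder project ID
    region="europe-west3",
    temp_location="gs://my-bucket/tmp",  # placeholder bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```

Dataflow provisions the workers when the job is submitted; no cluster exists before or after the run.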
Streaming vs. Batch
Streaming Pipeline (real-time):
Pub/Sub → Dataflow → BigQuery
(continuous)
Batch Pipeline (periodic):
Cloud Storage → Dataflow → BigQuery
(daily, hourly)

Apache Beam enables the same code for both modes. Switch from batch to streaming by changing the pipeline configuration.
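A hedged sketch of what that switch can look like: the same downstream logic is fed from either a bounded Cloud Storage source or an unbounded Pub/Sub source, and only the source and the streaming flag change (topic and bucket names are placeholders):

```python
# Sketch: one transform chain, two modes. Only the source and the
# streaming flag differ; topic/bucket names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(streaming: bool):
    options = PipelineOptions(streaming=streaming)
    with beam.Pipeline(options=options) as p:
        if streaming:
            # unbounded source: read continuously from Pub/Sub
            lines = (
                p
                | "ReadPubSub" >> beam.io.ReadFromPubSub(
                    topic="projects/my-project/topics/events")
                | "Decode" >> beam.Map(lambda b: b.decode("utf-8")))
        else:
            # bounded source: read once from Cloud Storage
            lines = p | "ReadGCS" >> beam.io.ReadFromText(
                "gs://my-bucket/events/*.json")
        # everything from here on is identical in both modes
        lines | "Normalize" >> beam.Map(str.strip) | "Print" >> beam.Map(print)
```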
Core Features
- Unified Model: One SDK for streaming and batch
- Autoscaling: Automatic adding/removing of workers
- Exactly-Once: Guaranteed processing without duplicates
- Windowing: Time-based aggregations (tumbling, sliding, session)
- Watermarks: Handling of late-arriving events (see the sketch after this list)
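To illustrate windowing and watermarks, here is a small sketch that runs locally on the DirectRunner; the keys and timestamps are invented. It uses one-minute tumbling windows, a watermark-driven trigger, and two minutes of allowed lateness:

```python
# Sketch: 1-minute tumbling windows with late-data handling.
# Keys and timestamps are invented; runs locally on the DirectRunner.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([("page", 1, 10), ("page", 1, 70)])
        | "Stamp" >> beam.Map(
            lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                 # tumbling 1-minute windows
            trigger=trigger.AfterWatermark(
                late=trigger.AfterCount(1)),         # re-fire per late event
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=120)                    # accept events up to 2 min late
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```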
Typical Use Cases
Real-time Analytics
Aggregate events from Pub/Sub in real-time and write them to BigQuery. Dashboards show metrics with a latency of seconds.
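A sketch of such a pipeline, with the subscription, table, and field names assumed for illustration:

```python
# Sketch: per-minute event counts from Pub/Sub into BigQuery.
# Subscription, table, and field names are placeholders.
import json
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeyByType" >> beam.Map(lambda e: (e["event_type"], 1))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "Count" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"event_type": kv[0], "count": kv[1]})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.event_counts",
            schema="event_type:STRING,count:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```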
ETL Pipelines
Read data from Cloud Storage, transform, clean, and load into BigQuery. Scheduled jobs for daily/hourly processing.
Pub/Sub to BigQuery
Streaming ingestion of events to BigQuery. Dataflow Templates enable deployment without code.
Event-Driven ML
Real-time feature computation for ML models. Dataflow calculates features, Vertex AI makes predictions.
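A minimal sketch of the feature-computation side. The event fields are invented, and the actual Vertex AI prediction call is only indicated in a comment, since the client code depends on your endpoint setup:

```python
# Sketch: computing ML features from raw events with a ParDo.
# Event fields are invented; the Vertex AI call is left as a comment.
import apache_beam as beam

class ComputeFeatures(beam.DoFn):
    def process(self, event):
        yield {
            "amount": float(event["amount"]),
            "hour_of_day": int(event["timestamp"][11:13]),
            "is_weekend": event.get("weekday") in ("Sat", "Sun"),
        }

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"amount": "12.50",
                        "timestamp": "2024-01-01T09:30:00",
                        "weekday": "Mon"}])
        | beam.ParDo(ComputeFeatures())
        | beam.Map(print)  # in production: send features to a Vertex AI endpoint
    )
```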
Dataflow vs. Dataproc vs. Data Fusion
| Criterion | Dataflow | Dataproc | Data Fusion |
|---|---|---|---|
| Model | Serverless | Managed Cluster | Visual ETL |
| SDK | Apache Beam | Spark, Hadoop | GUI, CDAP |
| Scaling | Automatic | Manual/Autoscaling | Automatic |
| Use Case | New pipelines | Existing Spark jobs | No-code ETL |
| Pricing | Pay-per-use | Pay-per-cluster | Pay-per-hour |
Benefits
- Serverless: No cluster management
- Unified: Streaming and batch with one SDK
- Portable: Apache Beam runs on other runners too
- Integrated: Native GCP connectors for all services
Integration with innFactory
As a Google Cloud Partner, innFactory supports you with Dataflow: pipeline development, migration from Spark to Beam, performance optimization, and cost analysis.
Frequently Asked Questions
What is Dataflow?
Dataflow is a fully managed data processing service that supports both streaming and batch with the same code. It's based on Apache Beam and scales automatically with the workload.
What's the difference between Dataflow and Data Fusion?
Dataflow is for developers writing Apache Beam pipelines in Java, Python, or Go. Data Fusion targets business analysts with a visual drag-and-drop interface that requires no code. For complex, code-based pipelines, use Dataflow.
When should I use Dataflow instead of Dataproc?
Dataflow is serverless with automatic scaling, ideal for pipelines without cluster management. Dataproc is managed Hadoop/Spark for existing Spark jobs or when you need cluster control. For new projects, we often recommend Dataflow.
How does autoscaling work in Dataflow?
Dataflow automatically scales based on backlog (unprocessed messages). For streaming pipelines, workers are added when the backlog grows. You can configure min/max workers and the scaling algorithm.
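With the Python SDK, for example, these knobs map to pipeline options like the following (values are illustrative):

```python
# Sketch: autoscaling-related Dataflow options in the Python SDK.
# The project ID and worker counts are illustrative.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                       # placeholder
    "--region=europe-west3",
    "--autoscaling_algorithm=THROUGHPUT_BASED",   # scale with backlog/throughput
    "--num_workers=2",                            # initial worker count
    "--max_num_workers=20",                       # upper bound for autoscaling
])
```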
How much does Dataflow cost?
Dataflow charges for vCPU time ($0.056/vCPU-hour), RAM ($0.003557/GB-hour), and persistent disk. Streaming is more expensive than batch. Flex Resource Scheduling offers up to 40% discount for batch jobs with flexible timing.
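As a rough worked example with the list prices above: a batch job that keeps 10 single-vCPU workers with 3.75 GB RAM each busy for one hour consumes 10 vCPU-hours and 37.5 GB-hours, i.e. about 10 × $0.056 + 37.5 × $0.003557 ≈ $0.69, before disk and network costs.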
