Dataplex organizes distributed data into logical data lakes without data movement, providing centralized governance, automatic metadata discovery, and data quality monitoring.
What is Dataplex?
Dataplex is a data fabric that organizes data across Cloud Storage and BigQuery into logical structures. Instead of copying data into a central data lake, Dataplex creates virtual views over distributed data sources.
The service provides automatic metadata discovery, data quality checks, and centralized governance policies.
Concepts
Lake
Logical container for related data. Typically per business unit or project.
Zone
Grouping by processing stage:
- Raw Zone: Raw data without transformation
- Curated Zone: Cleansed, structured data
Asset
The actual data: Cloud Storage buckets or BigQuery datasets. Assets are assigned to zones.
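The Lake → Zone → Asset hierarchy can be sketched as a small data model. This is an illustration of the concept only (hypothetical class names, not the google-cloud-dataplex client API):

```python
from dataclasses import dataclass, field

# Hypothetical model of the Dataplex hierarchy -- illustration only,
# not the actual Dataplex API.

@dataclass
class Asset:
    name: str
    resource: str  # physical location: Cloud Storage bucket or BigQuery dataset

@dataclass
class Zone:
    name: str
    stage: str  # "RAW" or "CURATED"
    assets: list = field(default_factory=list)

@dataclass
class Lake:
    name: str
    zones: list = field(default_factory=list)

    def find_asset(self, asset_name: str):
        """Locate an asset by logical name, without knowing where it resides."""
        for zone in self.zones:
            for asset in zone.assets:
                if asset.name == asset_name:
                    return zone.name, asset.resource
        return None

lake = Lake("customer-analytics", zones=[
    Zone("raw", "RAW", assets=[Asset("raw-events", "gs://raw-events")]),
    Zone("curated", "CURATED",
         assets=[Asset("events", "bq://project.curated.events")]),
])
print(lake.find_asset("events"))  # -> ('curated', 'bq://project.curated.events')
```

The point of the sketch: consumers address data by lake, zone, and asset name, while the physical resource (bucket or dataset) stays where it is.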
Lake: Customer Analytics
├── Zone: Raw
│ ├── Asset: gs://raw-events (Cloud Storage)
│ └── Asset: gs://raw-transactions
└── Zone: Curated
├── Asset: bq://project.curated.events (BigQuery)
    ├── Asset: bq://project.curated.customers
Core Features
- Virtual organization: Data stays where it is
- Auto discovery: Schema and statistics captured automatically
- Data quality: Define rules and check automatically
- Central policies: IAM policies at lake level
- Data Catalog integration: All metadata searchable
Typical Use Cases
Data Lake Management
Organize hundreds of storage buckets and BigQuery datasets into logical lakes. Teams find data without knowing where it physically resides.
Data Quality Monitoring
Define quality rules (no null values in key fields, valid date formats) and check automatically on schedule. Alerts on violations.
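The rule types above can be sketched in plain Python. This is a minimal, hypothetical illustration of how such checks behave (sample data and rule names are made up; the real service defines rules declaratively and runs them on a schedule):

```python
import re

# Hypothetical sample rows -- stand-ins for records in a curated asset.
rows = [
    {"customer_id": "C1", "signup_date": "2024-01-15"},
    {"customer_id": None, "signup_date": "2024-02-01"},
    {"customer_id": "C3", "signup_date": "15.02.2024"},
]

# Two rules in the spirit of the examples above:
# no nulls in a key field, and a valid ISO date format.
rules = {
    "customer_id not null": lambda r: r["customer_id"] is not None,
    "signup_date is ISO format": lambda r: bool(
        re.fullmatch(r"\d{4}-\d{2}-\d{2}", r["signup_date"])),
}

# Count violations per rule; a scheduler would run this periodically
# and raise an alert whenever a count is non-zero.
violations = {name: sum(not check(r) for r in rows)
              for name, check in rules.items()}
print(violations)  # {'customer_id not null': 1, 'signup_date is ISO format': 1}
```

In Dataplex itself, rule results are surfaced in Data Catalog rather than computed in application code; the sketch only shows the evaluate-and-count-violations pattern.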
Cross-Team Governance
Centralized policies for data access across multiple teams. Data owners define who can access which zones.
Automatic Documentation
Dataplex automatically captures schemas, statistics, and samples. Teams understand data without manual documentation.
Dataplex vs. Data Catalog
| Feature | Dataplex | Data Catalog |
|---|---|---|
| Metadata search | Yes (via Data Catalog) | Yes |
| Data organization | Lakes, Zones, Assets | No |
| Data quality | Yes | No |
| Policies | Lake-level | Tag-based |
| Discovery | Automatic | Automatic |
Benefits
- No data copying: Virtual organization
- Automatic: Discovery and profiling without manual effort
- Unified: Single view of Cloud Storage and BigQuery
- Governance: Centralized policies and data quality
Integration with innFactory
As a Google Cloud Partner, innFactory supports you with Dataplex: data lake design, data quality strategies, and governance framework implementation.
Frequently Asked Questions
What is Dataplex?
Dataplex is a data fabric service that organizes distributed data into logical lakes without moving it. It provides centralized governance, automatic metadata discovery, and data quality checks across Cloud Storage and BigQuery.
What's the difference between Dataplex and Data Catalog?
Data Catalog is for metadata search and tagging. Dataplex goes further by organizing data into lakes/zones, offering data quality checks, and enabling central policies. Dataplex uses Data Catalog for the metadata layer.
What are Lakes, Zones, and Assets in Dataplex?
A Lake is a logical container for related data (e.g., per business unit). Zones group assets by processing stage (Raw, Curated). Assets are the actual data in Cloud Storage buckets or BigQuery datasets.
How does data quality work in Dataplex?
Dataplex Data Quality defines rules (null values, formats, ranges, uniqueness) and checks them automatically on schedule. Results appear in Data Catalog and can trigger alerts. Auto Data Quality suggests rules based on profiling.
How much does Dataplex cost?
Dataplex charges based on Data Compute Units (DCU) for discovery, quality scans, and processing. Discovery is relatively inexpensive; quality scans on large datasets can cost more. The first 30 days per lake are free.
