Azure Open Datasets on Microsoft Azure
Azure Open Datasets is a collection of curated public datasets specifically optimized for machine learning and data analytics in Azure. The datasets cover areas such as weather, census data, holidays, public safety, and more.
Unlike raw public data sources, Azure Open Datasets are cleaned, normalized, and stored in Azure-optimized formats (Parquet). The datasets are directly usable in Azure Machine Learning, Databricks, Synapse Analytics, and other Azure services without separate download or transformation steps.
The datasets themselves are free to use. You pay only for the Azure services, such as compute or storage, that you use to process them.
Typical Use Cases
ML model training: Enrich your own ML models with weather, demographic, or public transportation data to improve predictions.
Data science prototyping: Start data science projects quickly with clean, immediately available datasets instead of lengthy data acquisition.
Feature engineering: Enrich your own business data with external factors such as weather, holidays, or demographic information.
Education and research: Use real, large datasets for academic projects, courses, and research work.
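As a minimal sketch of the feature engineering use case above, the following example joins a few business records with holiday information. The sample rows, the field names, and the helper `add_holiday_features` are hypothetical and only stand in for a real holidays dataset, whose schema may differ.

```python
from datetime import date

# Hypothetical sample rows standing in for a public-holidays dataset;
# the field names are assumptions, not the real schema.
HOLIDAYS = [
    {"date": date(2024, 1, 1), "country": "US", "name": "New Year's Day"},
    {"date": date(2024, 12, 25), "country": "US", "name": "Christmas Day"},
]

# Hypothetical business data: units sold per day and country.
SALES = [
    {"date": date(2024, 1, 1), "country": "US", "units": 120},
    {"date": date(2024, 1, 2), "country": "US", "units": 340},
    {"date": date(2024, 12, 25), "country": "US", "units": 95},
]


def add_holiday_features(sales, holidays):
    """Return copies of the sales rows with is_holiday and holiday_name added."""
    # Index holidays by (date, country) so each sales row needs one lookup.
    lookup = {(h["date"], h["country"]): h["name"] for h in holidays}
    enriched = []
    for row in sales:
        name = lookup.get((row["date"], row["country"]))
        enriched.append({**row, "is_holiday": name is not None, "holiday_name": name})
    return enriched


if __name__ == "__main__":
    for row in add_holiday_features(SALES, HOLIDAYS):
        print(row["date"], row["units"], row["is_holiday"], row["holiday_name"])
```

The original sales rows are left unchanged, so the same input can be enriched with several external datasets in turn.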
Frequently Asked Questions about Azure Open Datasets
Which datasets are available?
Azure Open Datasets offers over 50 datasets, among them NOAA Weather Data, US Census, Public Holidays, NYC Taxi Trips, COVID-19 Data Lake, and Genomics Data. The complete list is available in the documentation.
How is this different from public data sources?
Azure Open Datasets are cleaned, normalized, and stored in cloud-optimized formats. They are directly accessible via Azure SDKs and services without downloads or separate ETL processes. Additionally, many datasets are automatically updated.
Can I contribute my own datasets?
Currently, you cannot add your own datasets to Azure Open Datasets. To publish your own public datasets, use Azure Storage with public access or Azure Data Share instead.
In which formats are the data available?
Most datasets are stored in Parquet format, which performs well with Azure analytics services. Some are also available as CSV. You can access the data through the Azure Storage Blob APIs, the Python SDK, or directly from Azure Machine Learning or Databricks.
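Access through the Blob APIs can be sketched as follows. The storage account, container, and folder names below are assumptions for illustration and should be confirmed on the documentation page of the dataset you want; the helper functions are hypothetical. The sketch builds the URL for the Blob service "List Blobs" operation and the URL of a single blob, and optionally loads one Parquet file with pandas.

```python
from urllib.parse import urlencode

# Assumed values for illustration; confirm them in the dataset's documentation.
ACCOUNT = "azureopendatastorage"
CONTAINER = "holidaydatacontainer"
FOLDER = "Processed"


def container_url(account, container):
    """Base HTTPS URL of a blob container."""
    return f"https://{account}.blob.core.windows.net/{container}"


def list_blobs_url(account, container, prefix):
    """URL for the Blob service 'List Blobs' operation, filtered by prefix."""
    query = urlencode({"restype": "container", "comp": "list", "prefix": prefix})
    return f"{container_url(account, container)}?{query}"


def blob_url(account, container, blob_path):
    """URL of a single blob inside the container."""
    return f"{container_url(account, container)}/{blob_path.lstrip('/')}"


def read_parquet(url):
    """Load one Parquet blob into a DataFrame (needs pandas and a Parquet engine)."""
    import pandas as pd  # imported lazily so the URL helpers work without pandas

    return pd.read_parquet(url)


if __name__ == "__main__":
    # Prints the listing URL; fetching it returns XML with the blob names,
    # provided the container allows anonymous listing.
    print(list_blobs_url(ACCOUNT, CONTAINER, FOLDER + "/"))
```

Once a listing has revealed the name of a Parquet file, `read_parquet(blob_url(ACCOUNT, CONTAINER, "<path to file>"))` would load it. The file path is deliberately left as a placeholder here.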
Are there usage restrictions?
The datasets are freely available for research, development, and commercial use, but the specific license varies by dataset, so check it before use. Azure Open Datasets imposes no rate limits or quotas of its own; the limits of the underlying Azure services (for example, on storage requests) still apply.
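Because the service limits of Azure Storage still apply, a client that issues many requests may occasionally be throttled. The following is a minimal sketch of retrying with exponential backoff. `ThrottledError` and `call_with_backoff` are hypothetical names, and how a throttled response is detected (for example an HTTP 503 status) is an assumption that depends on the client library in use. The Azure SDKs ship their own retry policies, so this is mainly relevant for hand-written HTTP access.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical marker for a throttled request (for example HTTP 503)."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request(); on ThrottledError wait and retry with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter is injectable so the retry logic can be exercised without actually waiting.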
How often are datasets updated?
Update frequency varies: weather data is updated daily, census data when new releases are published, and other datasets as their source data becomes available. The documentation for each dataset states its update frequency.
Can I use the datasets outside of Azure?
Yes, the datasets are accessible via public Azure Storage URLs and can also be downloaded and used outside Azure. However, use within Azure offers performance advantages through data locality.
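For use outside Azure, a file can be copied to local disk with the Python standard library alone. This is a minimal sketch: the `download` helper is hypothetical, and the URL in the usage note is a placeholder, not a real dataset path.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, destination, chunk_size=1024 * 1024):
    """Stream url to destination in chunks and return the number of bytes written."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so large files are never held in memory all at once.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return destination.stat().st_size
```

A call such as `download("https://<account>.blob.core.windows.net/<container>/<path>.parquet", "data/sample.parquet")` would store one file locally. Repeated large downloads to a location outside Azure lose the data locality advantage mentioned above.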
Alternatives
- AWS: Open Data
- GCP: Public Datasets
Integration with innFactory
As a Microsoft Solutions Partner, innFactory supports you in data science and ML projects with Azure Open Datasets. We help with data integration, feature engineering, and building ML pipelines.
Contact us for a non-binding consultation on Azure Open Datasets and data science on Azure.
