What Legacy Data Has to Do with Spotify's Year in Review

Spotify summarized the last 10 years of listening for over 248 million users in personalized playlists.
Spotify created its first personal year-in-review for users in 2016. By 2019 the challenge had grown significantly: the number of music streams to be analyzed had quintupled compared to 2018 alone. Nevertheless, at the end of 2019 Spotify managed to analyze the complete listening statistics of the last 10 years for over 248 million users and to generate a personal year-end review with personalized playlists for each of them.
Technically, Spotify has been running on Google Cloud since 2016 and is considered one of the largest consumers of virtual computing resources there. For the first three years after moving to GCP alone, Spotify budgeted over $450 million in cloud costs to handle the seemingly endless amount of data. Working together with Google’s engineering teams, Spotify has repeatedly run the largest Dataflow analysis jobs ever executed in Google Cloud. At these data volumes such calculations are also a financial burden, so a job like this must be planned very carefully and cannot simply be restarted. Depending on the service, Google bills GCP usage by compute hours or by data volume; in this case, both were gigantic.
The GCP service “Dataflow” mentioned above is based on the open-source project Apache Beam. Apache Beam is a framework and programming model for defining data processing pipelines. Even before moving to Google Cloud, Spotify began developing “Scio”, a framework that wraps the Apache Beam API in Scala. Apache Beam, and therefore Scio as well, provides all the tools needed to process gigantic ETL workloads as batch or streaming jobs. Jobs written with Scio can then be executed serverlessly on Google Dataflow.
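As a first impression of what such a job looks like in code, here is a minimal sketch modeled on the canonical Scio word-count example; the input and output paths are passed as job arguments and are purely illustrative. The pipeline is only described on the ScioContext, nothing runs until run() is called, and the same code executes locally or on Dataflow depending on the --runner option:

```scala
import com.spotify.scio._

object WordCountJob {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Parse --key=value job arguments and create the pipeline context.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Lazily describe the processing graph (DAG); nothing is executed yet.
    sc.textFile(args("input"))
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .map { case (word, count) => s"$word: $count" }
      .saveAsTextFile(args("output"))

    // Hand the optimized graph to the chosen runner (local DirectRunner or Dataflow).
    sc.run().waitUntilFinish()
  }
}
```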
Scala, Scio and Dataflow as a “Serverless Wonder Weapon” for Processing Legacy Data
Since the majority of our customer projects and our products also run on Google Cloud, Scio is an ideal tool for us to bring legacy data from old systems into the cloud. A job can run once as a batch or, if the old system is operated in parallel during a transition period, continuously stream new data into the new system. Because Google manages the infrastructure and everything runs serverlessly, this is a very cost-efficient way for us to connect legacy systems. At the same time, we use Scio jobs for big data analyses whose results are written to a BigQuery data warehouse, for example; a rough sketch of such a job follows the list below. The advantages for us at innFactory are obvious:
- Seamless scaling across all sizes
- Low development effort
- Scala as our preferred backend language
- Very stable jobs that can also be tested on local systems
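For the BigQuery analyses mentioned above, a job could be sketched roughly as follows. This is only an illustration under assumptions: the PlayCount case class, the aggregation and the table spec are invented, the @BigQueryType annotation comes from Scio’s scio-bigquery module and requires macro annotations to be enabled in the build, and method names can vary slightly between Scio versions.

```scala
import com.spotify.scio._
import com.spotify.scio.bigquery._

object PlayCountJob {
  // Hypothetical result schema; the annotation derives the BigQuery schema at compile time.
  @BigQueryType.toTable
  case class PlayCount(userId: String, plays: Long)

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))        // e.g. exported legacy logs, one user id per line
      .countByValue
      .map { case (userId, plays) => PlayCount(userId, plays) }
      .saveAsTypedBigQuery(args("output")) // table spec such as "project:dataset.play_counts"

    sc.run().waitUntilFinish()
  }
}
```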
Scio 101 – Data Transformation from and to Postgres
The following example is intended to provide a first insight into how legacy data from an old system, or from its database, can be transferred to a new database in the cloud. The complete example can also be found in the innFactory GitHub account.
The example shows how user data from Postgres Database 1 can be transformed and written to a new Postgres Database 2.
At the beginning of the job, the database credentials are extracted from the command-line arguments / Dataflow parameters. We need these arguments later in the Scio context to read and write data via JDBC.
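A sketch of this first step could look like the following, assuming Scio’s scio-jdbc module and the PostgreSQL JDBC driver on the classpath. The argument names (--sourceDbUrl, --sourceDbUser and so on, plus their targetDb counterparts) and the helper function are our own choices for this illustration, not the exact code from the repository:

```scala
import com.spotify.scio._
import com.spotify.scio.jdbc._

object LegacyMigrationJob {
  // Build JDBC connection options from the --key=value job arguments.
  def connectionOptions(args: Args, prefix: String): JdbcConnectionOptions =
    JdbcConnectionOptions(
      username = args(s"${prefix}User"),
      password = Some(args(s"${prefix}Password")),
      connectionUrl = args(s"${prefix}Url"), // e.g. jdbc:postgresql://<host>:5432/<db>
      driverClass = classOf[org.postgresql.Driver]
    )

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    val sourceDb = connectionOptions(args, "sourceDb")
    val targetDb = connectionOptions(args, "targetDb")

    // ... pipeline definition follows (see the next snippet) ...

    sc.run().waitUntilFinish()
  }
}
```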
Scio offers us the possibility to stream all data consistently out of the old database and map each row to a case class. The case classes can then be processed further in the next step and written to the new database as a side output. Since in Scio, similar to Spark, we define a DAG (directed acyclic graph) using higher-order functions, Apache Beam can optimize the execution of the entire flow before the run. In addition, Dataflow ensures that neither the source nor the target is overloaded with too much data. Finally, our example counts the processed persons and prints the result to the console.
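Continuing the sketch from above, the read-transform-write-count part could look roughly like this inside main(), after sourceDb and targetDb have been built. The Person case class, table and column names are invented, and the exact JdbcReadOptions / JdbcWriteOptions signatures depend on the Scio version:

```scala
import java.sql.{PreparedStatement, ResultSet}
import com.spotify.scio.jdbc._

case class Person(id: Long, firstName: String, lastName: String)

// Read all rows from the old database and map each one to a case class.
val persons = sc.jdbcSelect(
  JdbcReadOptions(
    connectionOptions = sourceDb,
    query = "SELECT id, first_name, last_name FROM persons",
    rowMapper = (rs: ResultSet) =>
      Person(rs.getLong("id"), rs.getString("first_name"), rs.getString("last_name"))
  )
)

// Example transformation before writing into the new schema.
val transformed = persons.map(p => p.copy(lastName = p.lastName.trim))

// Write the transformed rows to the new database via JDBC.
transformed.saveAsJdbc(
  JdbcWriteOptions(
    connectionOptions = targetDb,
    statement = "INSERT INTO persons (id, first_name, last_name) VALUES (?, ?, ?)",
    preparedStatementSetter = (p: Person, st: PreparedStatement) => {
      st.setLong(1, p.id)
      st.setString(2, p.firstName)
      st.setString(3, p.lastName)
    }
  )
)

// Count the processed persons and print the result to the console.
transformed.count.debug(prefix = "processed persons: ")
```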
A command along the following lines can be used to start the example; the exact invocation is defined in the GitHub repository.
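Purely as a hypothetical illustration, with project, region, bucket and credentials as placeholders matching the argument names from the sketches above, launching the job on Dataflow via sbt could look like this:

```bash
sbt "runMain LegacyMigrationJob \
  --runner=DataflowRunner \
  --project=<gcp-project-id> \
  --region=europe-west1 \
  --tempLocation=gs://<bucket>/temp \
  --sourceDbUrl=jdbc:postgresql://<host>:5432/<legacy-db> \
  --sourceDbUser=<user> --sourceDbPassword=<password> \
  --targetDbUrl=jdbc:postgresql://<host>:5432/<new-db> \
  --targetDbUser=<user> --targetDbPassword=<password>"
```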
Summary
Since Apache Beam takes care of threading and distributing our job, we can focus on the essential ETL functionality. This makes it very easy to transform data quickly at any time. Thanks to the serverless Google Dataflow service, we also don’t have to worry about scaling multiple servers and pay no fees when no job is running: Dataflow scales down to 0.
