Scio for Big Data Processing
Scio is “a Scala API for Apache Beam and Google Cloud Dataflow inspired by Apache Spark and Scalding”.
I recently had the opportunity to try out the platform while consulting for Middesk. Their goal is to write data pipelines that ingest business entity formation records and store them in a universal format for all 50 states. The trick is that some states have very large datasets.
Reading from large datasets like that and processing them in parallel is where Scio excels. It provides a convenient Scala API on top of the underlying Apache Beam library, which can execute Big Data jobs across several platforms, including Google Dataflow, Spark, and Hadoop. It also offers connectors to Google BigQuery, Bigtable, Elasticsearch, and others that greatly simplify the input and output steps.
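To give a sense of how little I/O plumbing is involved, here is a minimal sketch that reads rows from a BigQuery table and writes them back out as text. The project, table, and field names are hypothetical, and the exact connector methods vary a bit between Scio versions.

```scala
import com.spotify.scio._
import com.spotify.scio.bigquery._

object ExportBusinessNames {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Read each row of a BigQuery table as a TableRow (hypothetical table spec).
    sc.bigQueryTable("my-project:registrations.businesses")
      // Pull out the fields we care about.
      .map(row => s"${row.get("state")}\t${row.get("name")}")
      // Write the results out as plain text.
      .saveAsTextFile(args("output"))

    sc.run()
  }
}
```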
Compared to my past experience working directly on Hadoop, writing Scio jobs for Dataflow is much simpler. Scala is naturally well positioned for this: its first-class functions make each step of a job easy to describe.
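As a sketch of what that looks like in practice, each step is just a Scala function passed to a transformation. The pipe-delimited record layout below is made up for illustration, not Middesk's actual format.

```scala
import com.spotify.scio._

object BusinessesPerState {
  // An ordinary Scala function describing a single parsing step
  // (the pipe-delimited layout is hypothetical).
  def parseState(line: String): Option[String] =
    line.split('|') match {
      case Array(_, _, state, _*) => Some(state.trim)
      case _                      => None
    }

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))                   // read raw lines
      .flatMap(line => parseState(line).toList)  // drop lines that fail to parse
      .countByValue                              // count records per state
      .map { case (state, n) => s"$state\t$n" }
      .saveAsTextFile(args("output"))

    sc.run()
  }
}
```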
Dataflow is well integrated with the rest of Google's cloud platform, GCP. While GCP offers fewer capabilities than Amazon, the functionality it does have is in some ways more polished: covering 80% of use cases makes it easier to make those pieces work together smoothly.
One downside to Dataflow is the difficulty of adjusting the infrastructure of the individual nodes assigned to a job. For instance, while working on a DataDog integration to measure how many lines failed to parse correctly, we were not able to install Google's own log parsing library on the workers. That meant we were forced to send all logs across the wire instead of filtering out the ones we knew would be useless.
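For the narrower problem of counting parse failures, Beam's metrics (exposed in Scio through ScioMetrics) report counters to the runner without extra setup on the workers. This is a rough sketch with hypothetical names, not the DataDog setup described above:

```scala
import com.spotify.scio._

object CountParseFailures {
  // A Beam counter; its value is reported to the runner (names are hypothetical).
  private val parseFailures = ScioMetrics.counter("ingest", "parse_failures")

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))
      .flatMap { line =>
        val fields = line.split('|')
        if (fields.length >= 3) Seq(fields(2).trim)
        else { parseFailures.inc(); Seq.empty } // count lines we could not parse
      }
      .saveAsTextFile(args("output"))

    sc.run()
  }
}
```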
Overall, I’m a fan of Scio and Dataflow. These sorts of libraries go a long way towards paving over some of the complexities of big data processing compared to MapReduce jobs written directly on Hadoop.