Airflow for Scio
Use a SubDag with a Sequential Executor
After exploring a bunch of dead ends, I cracked it: a working JAR-based solution that runs the same as local execution.
I’m using the sbt-pack plugin to generate a Makefile and a /lib directory. Currently I zip those manually and upload the archive to GCS. In Airflow, a GCSOperator downloads the zip, then a BashOperator unzips, installs, and executes the JAR. The zip file is 180–200 MB.
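The manual zipping step is easy to script. Here’s a minimal sketch of zipping the sbt-pack output into a single archive — the directory layout (a Makefile plus a /lib directory) comes from the post, but the function name and paths are placeholders:

```python
import zipfile
from pathlib import Path


def zip_pack_output(pack_dir: str, zip_path: str) -> str:
    """Zip the sbt-pack output directory (Makefile plus /lib) into one archive."""
    pack = Path(pack_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pack.rglob("*")):
            if path.is_file():
                # Store paths relative to the pack dir so the zip unpacks cleanly.
                zf.write(path, path.relative_to(pack))
    return zip_path
```

The upload side could then be a `gsutil cp` call or the google-cloud-storage client, whichever fits your deploy tooling.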
One of the tricky parts is that workers and their environments are not shared between tasks. At first, the GCSDownloadOperator would download the file, but the BashOperator couldn’t find it, because the two tasks were often assigned to separate workers. The current working solution is to put the GCSDownloadOperator and the BashOperator in a SubDAG that uses the SequentialExecutor. This ensures that both tasks execute on the same worker.
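A sketch of that SubDAG wiring, assuming Airflow 1.10-era module paths; the bucket, object path, bash command, and task ids are all placeholders, not the author’s actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_download_operator import (
    GoogleCloudStorageDownloadOperator,
)
from airflow.executors.sequential_executor import SequentialExecutor
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator

DEFAULT_ARGS = {"start_date": datetime(2019, 1, 1)}


def scio_subdag(parent_dag_id, child_dag_id):
    """Download + run inside one SubDAG so both tasks land on the same worker."""
    subdag = DAG(
        dag_id="%s.%s" % (parent_dag_id, child_dag_id),
        default_args=DEFAULT_ARGS,
        schedule_interval=None,
    )
    download = GoogleCloudStorageDownloadOperator(
        task_id="download_zip",
        bucket="my-bucket",               # placeholder
        object="pipelines/pipeline.zip",  # placeholder
        filename="/tmp/pipeline.zip",
        dag=subdag,
    )
    run = BashOperator(
        task_id="unzip_install_run",
        # Placeholder for the unzip / install / execute steps.
        bash_command="cd /tmp && unzip -o pipeline.zip && make install && make run",
        dag=subdag,
    )
    download >> run
    return subdag


with DAG("scio_pipeline", default_args=DEFAULT_ARGS, schedule_interval=None) as dag:
    SubDagOperator(
        task_id="run_scio_jar",
        subdag=scio_subdag("scio_pipeline", "run_scio_jar"),
        executor=SequentialExecutor(),  # forces both tasks onto one worker
    )
```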
Another solution I explored was creating a custom operator that combines the GCS download and the bash execution into a single operator, so both steps naturally occur in the same task. Progress on that was slow due to my mediocre Python skills and limited knowledge of Airflow internals, but if you’re strong in those areas it might be the better solution.
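For anyone who wants to pursue that route, a rough sketch of what such a combined operator could look like, assuming Airflow 1.10-era APIs — the class name, parameters, and command are hypothetical, and this is untested:

```python
import subprocess

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class GCSDownloadAndRunOperator(BaseOperator):
    """Download a zip from GCS and execute it within a single task,
    so both steps necessarily run on the same worker."""

    @apply_defaults
    def __init__(self, bucket, object_name, local_path, command, *args, **kwargs):
        super(GCSDownloadAndRunOperator, self).__init__(*args, **kwargs)
        self.bucket = bucket
        self.object_name = object_name
        self.local_path = local_path
        self.command = command

    def execute(self, context):
        hook = GoogleCloudStorageHook()
        hook.download(self.bucket, self.object_name, self.local_path)
        # Run the same unzip/install/execute steps the BashOperator performed.
        subprocess.check_call(self.command, shell=True)
```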
The only drawback I see with the SubDAG approach is that it creates one SubDAG per pipeline, each of which shows up on Airflow’s DAG dashboard.
I’d recommend automating the steps needed to zip the sbt-pack output. A custom sbt task would be a lightweight way to do it, but a CI/CD deploy process sounds more robust to me.