Containers¶
Co-execution¶
Galaxy job inputs and outputs are very flexible, and staging job inputs, configs, and scripts up and staging results back down doesn’t map cleanly to cloud APIs and cannot be fully reasoned about until job runtime. For this reason, when disk isn’t shared and Galaxy cannot stage jobs directly, the code that knows how to stage Galaxy jobs up and down needs to run in the cloud. Galaxy jobs, however, are typically executed in Biocontainers, minimal containers built for just the tool being executed, which are not appropriate for running Galaxy code.
For this reason, the Pulsar runners that schedule containers run a second container beside the tool container (or before and after it) that is responsible for staging the job up and down, communicating with Galaxy, and so on.
Perhaps the most typical scenario is using the Kubernetes Job API along with a message queue for communication with Galaxy and a Biocontainer. A diagram for this deployment would look something like:
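In Galaxy’s job configuration this scenario might be sketched as below. This is a hedged sketch only: the runner class and parameter names follow common Galaxy conventions, while the AMQP URL, runner id, and environment id are placeholders; consult the job configuration documentation for your Galaxy release before deploying.

```yaml
# job_conf.yml fragment (illustrative; ids and URL are placeholders)
runners:
  pulsar_k8s:
    load: galaxy.jobs.runners.pulsar:PulsarKubernetesJobRunner
    # Galaxy and the Pulsar staging container communicate over this queue.
    amqp_url: "pyamqp://guest:guest@rabbitmq.example.org:5672//"
execution:
  default: pulsar_k8s_environment
  environments:
    pulsar_k8s_environment:
      runner: pulsar_k8s
      # Tools execute in their (Bio)containers on the cluster.
      docker_enabled: true
```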
The modern Galaxy landscape is much more container driven, but the setup can be simplified to use Galaxy dependency resolution from within the “pulsar” container. This allows the tool and the staging code to live side by side and means only one container needs to be requested from the target container environment per execution. The default Pulsar staging container has a Conda environment configured out of the box and some initial tooling for connecting to a CVMFS-available Conda directory.
This one-container approach (staging plus Conda) is available with or without a message queue, and on either Kubernetes or against a GA4GH TES server. The TES version of this with RabbitMQ mediating communication looks like:
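A corresponding Galaxy job configuration fragment for the TES case might look like the following. Treat every name here as an assumption: the runner class, the `tes_url` parameter, and all ids and URLs are illustrative placeholders, so verify them against your Galaxy release’s job runner documentation.

```yaml
# job_conf.yml fragment (illustrative; names and URLs are assumptions)
runners:
  pulsar_tes:
    load: galaxy.jobs.runners.pulsar:PulsarTesJobRunner
    # RabbitMQ mediates communication between Galaxy and Pulsar.
    amqp_url: "pyamqp://guest:guest@rabbitmq.example.org:5672//"
    # Endpoint of the GA4GH TES server that schedules the containers.
    tes_url: "https://tes.example.org"
execution:
  default: pulsar_tes_environment
  environments:
    pulsar_tes_environment:
      runner: pulsar_tes
      docker_enabled: true
```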
Notice that when executing jobs on Kubernetes, the containers of the pod run concurrently. The Pulsar container computes a command line and writes it out; the tool container waits for it on boot and executes it once available; the Pulsar container then waits for a return code from the tool container before proceeding to stage the job out. In the GA4GH TES case, three containers are used instead of two, but they run sequentially, one at a time.
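The concurrent handshake on Kubernetes can be sketched with two threads standing in for the two containers of a pod sharing a volume. This is a conceptual sketch, not Pulsar’s actual implementation: the file names (`command_line`, `return_code`) and the polling interval are illustrative assumptions.

```python
import tempfile
import threading
import time
from pathlib import Path

POLL_INTERVAL = 0.01  # seconds; both sides poll the shared volume


def _publish(path: Path, text: str) -> None:
    # Write to a temporary name, then rename: rename is atomic on POSIX,
    # so the other container never observes a partially written file.
    partial = path.with_suffix(".partial")
    partial.write_text(text)
    partial.rename(path)


def pulsar_container(shared: Path, results: dict) -> None:
    # Stage inputs (elided), then publish the command line for the tool.
    _publish(shared / "command_line", "echo 'Hello Galaxy'")
    # Wait for the tool container to report an exit code, then stage out.
    rc_file = shared / "return_code"
    while not rc_file.exists():
        time.sleep(POLL_INTERVAL)
    results["return_code"] = int(rc_file.read_text())


def tool_container(shared: Path) -> None:
    # The minimal Biocontainer waits on boot for a command line to appear...
    cmd_file = shared / "command_line"
    while not cmd_file.exists():
        time.sleep(POLL_INTERVAL)
    cmd_file.read_text()  # ...runs it (execution elided here)...
    _publish(shared / "return_code", "0")  # ...and reports the exit code.


def run_pod() -> int:
    """Run both 'containers' concurrently against one shared directory."""
    results: dict = {}
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        threads = [
            threading.Thread(target=pulsar_container, args=(shared, results)),
            threading.Thread(target=tool_container, args=(shared,)),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results["return_code"]
```

Running `run_pod()` returns `0`, mirroring the staging container observing a successful tool exit before staging the job out.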
Typically, a message queue is needed for Pulsar to communicate with Galaxy even though the status of the job could potentially be inferred from the container scheduling environment. This is because Pulsar needs to transfer information about job state, etc. back to Galaxy after the job is complete.
More experimentally, a message queue shouldn’t be needed when extended metadata is being collected, because then the whole job state that Galaxy needs to ingest is populated as part of the job itself. In this case it may be possible to get away without a message queue.
Deployment Scenarios¶
Kubernetes¶
GA4GH TES¶
AWS Batch¶
Work in progress.