Galaxy Configuration

Examples

The most complete and up-to-date documentation for configuring Galaxy job destinations is Galaxy's job_conf.xml.sample_advanced file (available on GitHub). These examples simply provide a Pulsar-centric perspective on some of the documentation in that file.

Simple Windows Pulsar Web Server

The following Galaxy job_conf.xml assumes you have deployed a simple Pulsar web server to the Windows host windowshost.example.com on the default port (8913) with a private_token (defined in app.yml) of 123456789changeme. Most Galaxy jobs will simply use Galaxy's local job runner, but msconvert and proteinpilot jobs will be sent to the Pulsar server on windowshost.example.com. Sophisticated tool dependency resolution is not available for Windows-based Pulsar servers, so ensure the underlying applications are on the Pulsar server's PATH.

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarLegacyJobRunner"/>
    </plugins>
    <handlers>
        <handler id="main"/>
    </handlers>
    <destinations default="local">
        <destination id="local" runner="local"/>
        <destination id="win_pulsar" runner="pulsar">
            <param id="url">https://windowshost.example.com:8913/</param>
            <param id="private_token">123456789changeme</param>
        </destination>
    </destinations>
    <tools>
        <tool id="msconvert" destination="win_pulsar" />
        <tool id="proteinpilot" destination="win_pulsar" />
    </tools>
</job_conf>
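On the Pulsar side, the matching app.yml for this setup would carry the shared token. A minimal sketch (the host and port of the web server itself are configured separately, in Pulsar's server.ini):

```yaml
# Pulsar's app.yml (sketch) - private_token must match the value of the
# private_token param in Galaxy's job_conf.xml destination above.
private_token: 123456789changeme
```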

Targeting a Linux Cluster (Pulsar Web Server)

The following Galaxy job_conf.xml assumes you have a very typical Galaxy setup - there is a local, smaller cluster that mounts all of Galaxy's data (so no need for Pulsar) and a bigger shared resource that cannot mount Galaxy's files, requiring the use of Pulsar. This variant routes some larger assembly jobs - namely the trinity and abyss tools - to the remote cluster. Be sure the underlying applications required by the trinity and abyss tools are on the Pulsar server's PATH, or set tool_dependency_dir in app.yml and set up Galaxy env.sh-style package definitions for these applications.

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarRESTJobRunner"/>
    </plugins>
    <handlers>
        <handler id="main"/>
    </handlers>
    <destinations default="local_cluster">
        <destination id="local_cluster" runner="drmaa">
            <param id="native_specification">-P littlenodes -R y -pe threads 4</param>
        </destination>
        <destination id="remote_cluster" runner="pulsar">
            <param id="url">http://remotelogin:8913/</param>
            <param id="submit_native_specification">-P bignodes -R y -pe threads 16</param>
            <!-- Look for trinity package at remote location - define tool_dependency_dir
            in the Pulsar app.yml file.
            -->
            <param id="dependency_resolution">remote</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>

For this configuration, on the Pulsar side be sure to also set a DRMAA_LIBRARY_PATH in local_env.sh, install the Python drmaa module, and configure a DRMAA job manager for Pulsar in job_managers.ini as follows:

[manager:_default_]
type=queued_drmaa
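The DRMAA_LIBRARY_PATH setting in local_env.sh is just an environment export. A sketch (the library path shown is an example - use the libdrmaa.so location for your cluster's scheduler):

```sh
# Pulsar's local_env.sh (sketch) - point DRMAA_LIBRARY_PATH at the
# scheduler's DRMAA shared library (example path; varies by site and scheduler).
export DRMAA_LIBRARY_PATH=/usr/lib/gridengine-drmaa/lib/libdrmaa.so
```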

Targeting a Linux Cluster (Pulsar over Message Queue)

For Pulsar instances sitting behind a firewall, running a web server may be impossible. If the same Pulsar configuration discussed above is additionally configured with a message_queue_url of amqp://rabbituser:rabb8pa8sw0d@mqserver:5672// in app.yml, the following Galaxy configuration will cause this message queue to be used for communication. This is also likely better for large file transfers, since typically your production Galaxy server will be sitting behind a high-performance proxy while Pulsar will not.
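The Pulsar-side app.yml for this variant defines the queue connection. A minimal sketch, reusing the example credentials and hostname from above:

```yaml
# Pulsar's app.yml (sketch) - message_queue_url must match the url param
# of the PulsarMQJobRunner plugin in Galaxy's job_conf.xml below.
message_queue_url: amqp://rabbituser:rabb8pa8sw0d@mqserver:5672//
```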

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Must tell Pulsar where to send files. -->
            <param id="galaxy_url">https://galaxyserver</param>
            <!-- Message Queue Connection (should match message_queue_url in Pulsar's app.yml)
            -->
            <param id="url">amqp://rabbituser:rabb8pa8sw0d@mqserver:5672//</param>
        </plugin>
    </plugins>
    <handlers>
        <handler id="main"/>
    </handlers>
    <destinations default="drmaa">
        <destination id="local_cluster" runner="drmaa">
            <param id="native_specification">-P littlenodes -R y -pe threads 4</param>
        </destination>
        <destination id="remote_cluster" runner="pulsar">
            <!-- Tell Galaxy where files are being stored on remote system, so
                 the web server can simply ask for this information.
            -->
            <param id="jobs_directory">/path/to/remote/pulsar/files/staging/</param>
            <!-- Remaining parameters same as previous example -->
            <param id="submit_native_specification">-P bignodes -R y -pe threads 16</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>

For those interested in this deployment option and new to Message Queues, there is more documentation in Message Queues with Galaxy and Pulsar.

Additionally, Pulsar now ships with rsync and scp transfer actions that can be used in place of the HTTP transport method.

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="pulsar_mq" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Must tell Pulsar where to send files. -->
            <param id="galaxy_url">https://galaxyserver</param>
            <!-- Message Queue Connection (should match message_queue_url in
                 Pulsar's app.yml). pyamqp may be necessary over amqp if SSL is used
            -->
            <param id="url">pyamqp://rabbituser:rabb8pa8sw0d@mqserver:5671//?ssl=1</param>
        </plugin>
    </plugins>
    <handlers>
        <handler id="main"/>
    </handlers>
    <destinations default="pulsar_mq">
        <destination id="remote_cluster" runner="pulsar_mq">
            <!-- This string is replaced by Pulsar, removing the requirement
                 of coordinating Pulsar installation directory between cluster
                 admin and galaxy admin
            -->
            <param id="jobs_directory">__PULSAR_JOBS_DIRECTORY__</param>
            <!-- Provide connection information, should look like:

                    paths:
                        - path: /home/vagrant/  # Home directory for galaxy user
                          action: remote_rsync_transfer # _rsync_ and _scp_ are available
                          ssh_user: vagrant
                          ssh_host: galaxy-vm.host.edu
                          ssh_port: 22

            -->
             <param id="file_action_config">file_actions.yaml</param>
             <!-- Provide an SSH key for access to the local $GALAXY_ROOT,
            should be accessible with the username/hostname provided in
            file_actions.yaml
             -->
             <param id="ssh_key">-----BEGIN RSA PRIVATE KEY-----
            .............
            </param>
            <!-- Allow the remote end to know who is running the job, may need
                 to append @domain.edu after it. Only used if the
                 "DRMAA (via external users) manager" is used
             -->
            <param id="submit_user">$__user_name__</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>

Targeting Apache Mesos (Prototype)

See the commit message for the initial work on this and the related post on galaxy-dev.

Forcing Pulsar to Generate Galaxy Metadata

Typically Galaxy will process Pulsar's outputs and generate metadata on the Galaxy server. One can instead force metadata generation to occur on the Pulsar server. (TODO: document how here).

Etc…

There are many more options for configuring which paths get staged/unstaged, how Galaxy metadata is generated, running jobs as the real user, defining multiple job managers on the Pulsar side, etc. If you ever have any questions, please don't hesitate to ask John Chilton (jmchilton@gmail.com).

Data Staging

Most of the parameters settable in Galaxy's job configuration file job_conf.xml are straightforward - but specifying how Galaxy and Pulsar stage various files may benefit from more explanation.

default_file_action, defined in Galaxy's job_conf.xml, describes how inputs, outputs, indexed reference data, etc. are staged. The default, transfer, has Galaxy initiate HTTP transfers. This makes little sense in the context of message queues, so in that case it should be set to remote_transfer, which causes Pulsar to initiate the file transfers. Additional options are available, including none, copy, and remote_copy.
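For instance, a message-queue destination would typically set the default action as follows (a sketch reusing the remote_cluster destination from the examples above):

```xml
<destination id="remote_cluster" runner="pulsar">
    <!-- Have Pulsar initiate transfers rather than Galaxy - suits MQ setups. -->
    <param id="default_file_action">remote_transfer</param>
    <param id="jobs_directory">/path/to/remote/pulsar/files/staging/</param>
</destination>
```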

In addition to this default - paths may be overridden based on various patterns to allow optimization of file transfers in production infrastructures where various systems mount different file stores and file stores with different paths on different systems.

To do this, the defined Pulsar destination in Galaxy's job_conf.xml may specify a parameter named file_action_config. This needs to be a config file path (if relative, relative to Galaxy's root) such as config/pulsar_actions.yaml (the file can be YAML or JSON - but older Galaxy releases only supported JSON). The following captures the available options:

paths: 
  # Use transfer (or remote_transfer) if only Galaxy mounts a directory.
  - path: /galaxy/files/store/1
    action: transfer

  # Use copy (or remote_copy) if remote Pulsar server also mounts the directory
  # but the actual compute servers do not.
  - path: /galaxy/files/store/2
    action: copy

  # If Galaxy, the Pulsar, and the compute nodes all mount the same directory
  # staging can be disabled altogether for given paths.
  - path: /galaxy/files/store/3
    action: none

  # Following block demonstrates specifying paths by globs as well as rewriting
  # unstructured data in .loc files.
  - path: /mnt/indices/**/bwa/**/*.fa
    match_type: glob
    path_types: unstructured  # Set to *any* to apply to defaults & unstructured paths.
    action: transfer
    depth: 1  # Stage the whole directory with the job, not just the file.

  # Following block demonstrates rewriting paths without staging. Useful for
  # instance if Galaxy's data indices are mounted on both servers but with
  # different paths.
  - path: /galaxy/data
    path_types: unstructured
    action: rewrite
    source_directory: /galaxy/data
    destination_directory: /work/galaxy/data

  # The following demonstrates use of the Rsync transport layer
  - path: /galaxy/files/
    action: remote_rsync_transfer
    # Additionally the action remote_scp_transfer is available which behaves in
    # an identical manner
    ssh_user: galaxy
    ssh_host: f.q.d.n
    ssh_port: 22