Configuration

A configuration file encapsulates an analysis pipeline.

BioLockJ takes a single configuration file as a runtime parameter.

biolockj config.properties

Every line in a BioLockJ configuration file is one of:

BioModule (line starts with #BioModule)
comment (all other lines that start with #, has no effect)
property (name=value)

BioModule execution order#

To include a BioModule in your pipeline, add a #BioModule line to the top your configuration file, as shown in the examples found in templates. Each line has the #BioModule keyword followed by the path for that module. For example:

#BioModule biolockj.module.seq.PearMergeReads
#BioModule biolockj.module.classifier.wgs.Kraken2Classifier
#BioModule biolockj.module.report.r.R_PlotMds

This line is given at the top of the user guide page for each module.

BioModules will be executed in the order they are listed in here.

A typical pipeline contians one classifier module. Any number of sequence pre-processing modules may come before the classifier module. Any number of report modules may come after the classifier module. In addition to the BioModules specified in the configuration file, BioLockJ may add implicit modules that the are required by specified modules. See Example Pipeline.

A module can be given an alias by using the AS keyword in its execution line:

#BioModule biolockj.module.seq.PearMergeReads AS Pear

This is is generally used for modules that are used more than once in the same pipeline. Given this alias, the folder for this module will be called 01_Pear instead of 01_PearMergeReads, and any general properties directed to this module would use the prefix Pear instead of PearMergedReads. An alias must start with a capital letter, and cannot duplicate a name/alias of any other module in the same pipeline.

Properties#

Properties are defined as name-value pairs. List-values are comma separated. Leading and trailing whitespace is removed so "propName=x,y" is equivalent to "propName = x, y".

See the list of available properties.

Variables#

Bash variables can be referenced in the config. They must be "fully dressed": ${VAR}

There are two variables that BioLockJ requires:
BLJ is the file path to the BioLockJ directory and
BLJ_PROJ is the directory where pipelines created by BioLockJ are stored and run.
After installation these are defined in the shell profile. These can referenced in the config file.

The ~ ("tilde") is replaced with ${HOME} if (and only if) the ~ is the first character.

Variables can also be defined in the config file and referenced in the same way:

DIR=/path/to/big/data/dir                    
sra.destinationDir=${DIR}/seqs  
sra.sraAccList=${DIR}/SraAccList.txt      
input.dirPaths=${DIR}/seqs

Variables that are defined in the config file, can be referenced within the config file, however these variables are not added to the module script environment.

If you are referencing environment variables and running in docker, you will need to use the -e parameter to biolockj to pass the variables into the docker environment (even if the variable is defined in the config file). For example:

biolockj --docker -e SHEP=$SHEP,DIR=/path/to/big/data/dir config.properties

Most environement variables will NOT be part of the module script environment. However, any environment variable that is referenced in the configuration file is considered necissary for the pipeline, and it is passed into the main program environment, docker containers, module runtime enviroments.

Environment variables are not the best way to get information to a script because they can be difficult to trace / troubleshoot. However if your script or tool requires a particular environment variable, you can define it in your local environment, and reference it in the config file using an arbitrary property name, for example:

my.variable=${QIIME_CONFIG_FP}

This has essentially the same effect as using the -e QIIME_CONFIG_FP=$QIIME_CONFIG_FP argument in the biolockj command. If this variable is required this is one way to communicate that the value of QIIME_CONFIG_FP may change from one system to the next, but the existence of QIIME_CONFIG_FP is essential for the pipeline to run.

Relative file paths#

File paths can be given using relative paths. The path should start with ./.
The location . is interpreted as being the directory where the primary configuration file is.

Example file structure:

 /users/joe/analysis01/
                       config.properties 
                       metadata.txt
                       /sra/
                            SraAccList.txt

Properties in config.properties can use relative paths:

metadata.filePath=./metadata.txt                    
sra.sraAccList=./sra/SraAccList.txt

Note: ../ is also supported but it does not stack ( ../../../data/ is not supported).

With this design, the "analysis01" folder could be shared or moved and the configuration file would not need to be updated to reflect the new location of the project files it references.

Special properties#

Some properties invoke special handling.

pipeline.defaultProps#

pipeline.defaultProps is a handled before any other property. It is used to link another properties file. The properties from that file are added to the MASTER set. The pipeline.defaultProps property itself is not included in the MASTER properties set.

Module-specific forms#

Many pipeline properties (usually those used by pipeline utilities) can be directed to a specific module. For example, script.numThreads is a general property that specifies that number of threads alloted to each script launched by any module; and PearMergeReads.numThreads overrides that property ONLY for the PearMergeReads module.

exe.* properties#

exe. properties are used to specify the path to common executables. exe. properties are special in that they have the automatic default of returning the the property name minus the exe.-prefex as their value. Modules are sometimes written to use a common tool, such as Rscript or bowtie. These modules will write scripts with the assumption that this command is on the $PATH when the script is executed UNLESS exe.Rscript is given specifying a path to use. The exe. properties are often specified in a defaultProps file for a given environment rather than in individual project properties files.

Most often, docker containers are used because of the executables baked into them, and any exe. configurations are only applicable when not runnig in docker. In a pipeline running in docker, all references to an exe. property will return the default value (by removing the exe. prefix), regardless of how the exe. property is configured. In the rare case where you do need to give the path to executable within a container, you can specify this by using the prefix dockerExe. in place of exe.. In the even rarer case where you want to use an executable from your local machine, while running a pipeline in docker, you can specify this by using the prefix hostExe. in place of exe..

Chaining configuration files#

Although all properties can be configured in one file, we recommend chaining default files through the pipeline.defaultProps option. This can often improve the portability, maintainability, and readability of the project-specific configuration files.

Standard Properties#

BioLockJ will always apply the standard.properties file packaged with BioLockJ under resources/config/default/; you do not need to specify this file in your pipeline.defaultProps chain.
IFF running a pipeline in docker, then BioLockJ will apply the docker.properties file packaged with BioLockJ under resources/config/default/.

User-specified Defaults#

We recommend creating an environment.properties file to assign envionment-specific defaults.

Set cluster & script properties
Set paths to key executables through exe properties
Override standard.properties as needed.
This information is the same for many (or all) projects run in this environment, and entering the info anew for each project is tedious, time-consuming and error-prone.

If using a shared system, consider using a user.properties file.

Set user-specific properties such as download.dir and mail.to.
For shared projects, use a path that will be updated per-user, such as ~/biolock_user.properties

Other logical intermediates my also present themselves. For example, some group of projects may need to override several of the defaults set in environmment.properties, but others still use the those defaults. Projects in this set can use pipeline.defaultProps=group2.properties and the group2.properties files may include pipeline.defaultProps=environment.properties

Project Properties#

Create a new configuration file for each pipeline to assign project-specific properties:

Set the BioModule execution order
Set pipeline.defaultProps = environment.properties
You may use multiple default config files:
pipeline.defaultProps=environment.properties,groupSettings.properties
Override environment.properties and standard.properties as needed
Example project configuration files can be found in templates.

If the same property is given in multiple config files, the highest priority goes to the file used to launch the pipeline. Standard.properties always has the lowest priority.

A copy of each configuration file is stored in the pipeline root directory to serve as primary project documentation.