Validation

Summary

Description: Validation checks whether the output files of a pipeline match the expectation.

Property (type)
    Description

validation.compareOn (list)
    Which columns in the expectation file should be used for the comparison. Options: name, size, md5. Default: use all columns in the expectation file.
    default: null

validation.disableValidation (boolean)
    Turn off validation. No validation file output is produced. Options: Y/N. Default: N
    default: null

validation.expectationFile (file path)
    File path that gives the expected values for file metrics (probably generated by a previous run of the same pipeline).
    default: null

validation.reportOn (list)
    Which attributes of the file should be included in the validation report file. Options: name, size, md5
    default: null

validation.sizeWithinPercent (numeric)
    What percentage difference is permitted between an output file and its expectation. Options: any positive number
    default: null

validation.stopPipeline (boolean)
    If enabled, the validation utility will stop the pipeline if any module fails validation. Options: Y/N
    default: N

The validation utility creates a table for the output of each module where it reports the file name, size and md5. These tables are saved in the validation folder; the validation folder generated by a pipeline can be used as the expectations when re-running the same pipeline.

If there are no expectations, these values are reported in the validation folder.
If there are expectations, these values are reported and compared against the expected values; the result of the comparison is reported as either PASS or FAIL for each file.

If validation.stopPipeline=Y, the validation utility will halt the pipeline if any outputs FAIL to meet expectations; otherwise, the result is reported and the pipeline moves forward.
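The report-and-compare behavior described above can be illustrated with a minimal Python sketch. This is not BioLockJ's implementation; the function names file_metrics and compare are hypothetical, and the sketch ignores soft validation and the special handling of ".gz" files.

```python
import hashlib
import os

def file_metrics(path):
    """Return the name, size in bytes, and md5 hex digest of one file."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    return {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }

def compare(metrics, expected, compare_on=None):
    """Return "PASS" if every compared attribute matches, else "FAIL".

    compare_on=None mimics the documented default of using all columns
    present in the expectation (hypothetical helper, exact match only).
    """
    keys = list(expected) if compare_on is None else list(compare_on)
    keys = [key for key in keys if key in ("name", "size", "md5")]
    matched = all(str(metrics[key]) == str(expected[key]) for key in keys)
    return "PASS" if matched else "FAIL"
```

Passing compare_on=("name",) corresponds to a name-only comparison, in which a file whose size or md5 differs would still be reported as PASS.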

Soft Validation

Many components of a pipeline have the potential for tiny variation: maybe a date is stored in the output, or a reported confidence level is based on a random sampling. With these tiny variations, the file is practically the same, but it will FAIL md5 validation. The file might also be a few bytes bigger or smaller, so it will also FAIL size validation. "Soft validation" is the practice of allowing some wiggle room. If the config file gives validation.sizeWithinPercent=1, then an output file will PASS size validation as long as it is within 1.0% of the expected file size. By default, this value is 0, and a file must be exactly the expected size to pass size validation.
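As a rough illustration of the size check, the sketch below assumes the tolerance is measured as a percentage of the expected size and that a difference exactly equal to the tolerance still passes. Both points are assumptions, and the function name is hypothetical.

```python
def passes_size_validation(actual_size, expected_size, size_within_percent=0.0):
    """Return True if actual_size is within the allowed percentage of expected_size.

    With size_within_percent=0 (the documented default), only an exact
    match passes.
    """
    allowed_difference = expected_size * size_within_percent / 100.0
    return abs(actual_size - expected_size) <= allowed_difference
```

For an expected size of 1000 bytes and a tolerance of 1, a 1005-byte file passes and a 1011-byte file fails under these assumptions.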

Expectations

Give the file path to the expectation file using validation.expectationFile=/path/to/saved/validation.

This path can either point to a tab-delimited table giving the expectations for a single module, or it can point to a folder, in which case BioLockJ assumes that a file within this folder has a name that matches the module being validated. When validating an entire pipeline, the expectation file for all modules can be passed with a single file path. The validation folder created by a pipeline is designed to be used as this input.
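The file-or-folder behavior might be pictured as follows. The exact rule BioLockJ uses to match a file name to a module is not stated here, so the rule in this sketch (the file name contains the module name) is purely an assumption for illustration, as is the function name.

```python
import os

def find_expectation_file(expectation_path, module_name):
    """Return the expectation file to use for one module, or None.

    If the path is a file, it is used directly. If it is a folder, the
    first file whose name contains module_name is returned; this matching
    rule is hypothetical and may differ from the real one.
    """
    if os.path.isfile(expectation_path):
        return expectation_path
    if os.path.isdir(expectation_path):
        for name in sorted(os.listdir(expectation_path)):
            candidate = os.path.join(expectation_path, name)
            if os.path.isfile(candidate) and module_name in name:
                return candidate
    return None
```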

The expectation file format is:

The expectation file is a tab-delimited table.
The first row is column names.
The first column (labeled "name") gives the file names.
Optional column "size" gives the file size in bytes.
Optional column "md5" gives the md5 string.
Optional column "MATCHED_EXPECTION" is always ignored.
The file should not have any other columns.
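The format rules above can be expressed as a small parser. This is a sketch under the rules as listed, not the parser BioLockJ uses; the function name read_expectations is hypothetical.

```python
import csv

IGNORED_COLUMNS = {"MATCHED_EXPECTION"}  # spelled as in the expectation file
KNOWN_COLUMNS = {"name", "size", "md5"}

def read_expectations(handle):
    """Parse a tab-delimited expectation table from an open text handle.

    Returns a dict mapping each file name to its expected attributes.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    columns = [c for c in (reader.fieldnames or []) if c not in IGNORED_COLUMNS]
    if not columns or columns[0] != "name":
        raise ValueError('the first column must be labeled "name"')
    unexpected = set(columns) - KNOWN_COLUMNS
    if unexpected:
        raise ValueError("unexpected columns: " + ", ".join(sorted(unexpected)))
    expectations = {}
    for row in reader:
        entry = {c: row[c] for c in columns if c != "name"}
        if "size" in entry:
            entry["size"] = int(entry["size"])  # sizes are given in bytes
        expectations[row["name"]] = entry
    return expectations
```

A table with only the "name" column yields an empty attribute dict for each file, which corresponds to validating on file names alone.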

Use cases

The expectation is usually based on a previous run of the same pipeline.
Maybe some software has been updated and the results are not expected to change, but you have to re-do your analysis with the latest version to appease reviewers.
Maybe you added a filtering step.
Maybe you just want to change module 5, and you expect 1-4 to produce the same outputs they did last time.
Maybe this analysis has been published and the original researcher made their pipeline available to you; you want to re-run it and know whether the output you generated matches what they had.

The expectation can be set by hand. This is recommended for validation using name only, or soft validation using size only. This is a way to prevent a pipeline from continuing after it is effectively doomed.

For example:
Maybe module 5 is a resource-intensive classifier, and modules 1-4 are processing and filtering steps ending with the SeqFileValidator. If modules 1-4 filter out too much, you might not want to move forward with module 5 until you've made adjustments to the earlier modules.
You could create an expectation file for module 4 that lists only the file names and their pre-filtering file sizes (in bytes), and set validation.sizeWithinPercent=80 and SeqFileValidator.stopPipeline=Y. With this, the pipeline will stop if any of those files are missing from the module 4 output or if any of them have been reduced by more than 80%.
The output file names are predictable if you've ever seen output from that module before.
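A hand-made expectation file like the one in this example could be generated with a short script. The sketch below assumes the module 4 output files have the same names as the pre-filtering input files, which may not hold for a given pipeline; adjust the names to match what the module actually produces. The function name and paths are hypothetical.

```python
import csv
import os

def write_size_expectations(input_dir, expectation_path):
    """Write a tab-delimited expectation table with "name" and "size" columns.

    Each regular file in input_dir contributes one row giving its name and
    its current size in bytes.
    """
    with open(expectation_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "size"])
        for name in sorted(os.listdir(input_dir)):
            path = os.path.join(input_dir, name)
            if os.path.isfile(path):
                writer.writerow([name, os.path.getsize(path)])
```

The resulting file contains no md5 column, so it is suited to name-only validation or soft validation on size.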

Other notes

gzip is a common utility, frequently used with sequence data. It can embed metadata, such as a timestamp or the original file name, in the compressed file; this minor variation can cause md5 checks to fail even when the underlying data are identical. To avoid these misleading failures, the validation utility takes the md5 of the decompressed form of any file whose name ends in ".gz". Thus, the md5 reported for a fastq file is the same regardless of whether it has been gzipped.
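The idea of hashing the decompressed content can be sketched in Python as follows. This illustrates the approach rather than the validation utility's own code, and the function name content_md5 is hypothetical.

```python
import gzip
import hashlib

def content_md5(path):
    """Return the md5 hex digest of a file's content.

    Files whose names end in ".gz" are decompressed on the fly, so the
    digest does not depend on metadata stored in the gzip header.
    """
    opener = gzip.open if path.endswith(".gz") else open
    md5 = hashlib.md5()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Under this approach, a fastq file and its gzipped copy give the same digest.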