YMP - a Flexible Omics Pipeline

Welcome to the YMP documentation!

YMP is a tool that makes it easy to process large amounts of NGS read data. It comes “batteries included” with everything needed to preprocess your reads (QC, trimming, contaminant removal), assemble metagenomes, annotate assemblies, or assemble and quantify RNA-Seq transcripts, offering a choice of tools for each of those processing stages. When your needs exceed what the stock YMP processing stages provide, you can easily add your own, using YMP to drive novel tools, tools specific to your area of research, or tools you wrote yourself.

Features:

batteries included

YMP comes with a large number of Stages implementing common read processing steps. These stages cover the most common topics, including quality control, filtering and sorting of reads, assembly of metagenomes and transcripts, read mapping, community profiling, visualisation and pathway analysis.

For a complete list, check the documentation or the source.

get started quickly

Simply point YMP at a folder containing read files, at a mapping file, a list of URLs or even an SRA RunTable and YMP will configure itself. Use tab expansion to complete your desired series of stages to be applied to your data. YMP will then proceed to do your bidding, downloading raw read files and reference databases as needed, installing requisite software environments and scheduling the execution of tools either locally or on your cluster.

explore alternative workflows

Not sure which assembler works best for your data, or what the effect of more stringent quality trimming would be? YMP is made for this! By keeping the output of each stage in a folder named to match the stack of applied stages, YMP can manage many variant workflows in parallel, while minimizing the amount of duplicate computation and storage.

go beyond the beaten path

Built on top of Bioconda and Snakemake, YMP is easily extended with your own Snakefiles, allowing you to integrate any type of processing you desire into YMP, including your own custom-made tools. Within the YMP framework, you can also make use of the extensions to the Snakemake language provided by YMP (default values, inheritance, recursive wildcard expansion, etc.), making writing rules less error-prone and repetitive.
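
The stage stack naming described under “explore alternative workflows” can be sketched in a few lines of Python. This is an illustration only, not YMP's implementation: the helper names are made up for this example, and the stage names are borrowed from the stage listing in this documentation. It shows how variant workflows share their common prefixes, so each prefix only needs to be computed and stored once.

```python
from itertools import product


def stack_name(project, stages):
    """Join a project name and a series of stages into a stack name."""
    return ".".join([project, *stages])


def variant_stacks(project, choices):
    """Enumerate all stacks given alternative stages for each step."""
    return [stack_name(project, combo) for combo in product(*choices)]


def stage_outputs(stacks):
    """Collect every distinct prefix; each one stands for one stage output."""
    outputs = set()
    for stack in stacks:
        parts = stack.split(".")
        for end in range(2, len(parts) + 1):
            outputs.add(".".join(parts[:end]))
    return sorted(outputs)


# Two filtering variants combined with two assemblers (hypothetical choice).
stacks = variant_stacks(
    "toy",
    [["dust_bbmap", "dust_bbmapE60"], ["assemble_megahit", "assemble_spades"]],
)
for stack in stacks:
    print("ymp make", stack)
print(len(stage_outputs(stacks)), "stage outputs for", len(stacks), "workflows")
```

In this sketch, each of the two filtering outputs is reused by two assemblies rather than being recomputed per workflow.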

Background

Bioinformatical data processing workflows can easily get very complex, even convoluted. On the way from the raw read data to publishable results, a sizeable collection of tools needs to be applied, intermediate outputs verified, reference databases selected, and summary data produced. A host of data files must be managed, processed individually or aggregated by host or spatial transect along the way. And, of course, to arrive at a workflow that is just right for a particular study, many alternative workflow variants need to be evaluated. Which tools perform best? Which parameters are right? Does re-ordering steps make a difference? Should the data be assembled individually, grouped, or should a grand co-assembly be computed? Which reference database is most appropriate?

Answering these questions is a time-consuming process, justifying the plethora of published ready-made pipelines, each providing a polished workflow for a typical study type or use case. The price for the convenience of such a polished pipeline is the lack of flexibility - they are not meant to be adapted or extended to match the needs of a particular study. Workflow management systems, on the other hand, offer great flexibility by focussing on the orchestration of user-defined workflows, but typically require significant initial effort as they come without predefined workflows.

YMP strives to walk the middle ground between these. It brings everything needed for classic metagenome and RNA-Seq workflows, yet, built on the workflow management system Snakemake, it can be easily expanded by simply adding Snakemake rules files. Designed around the needs of processing primarily multi-omic NGS read data, it brings a framework for handling read file metadata, provisioning reference databases, and organizing rules into semantic stages.

Installing and Updating YMP

Working with the GitHub Development Version

Installing from GitHub

  1. Clone the repository:

    git clone --recurse-submodules https://github.com/epruesse/ymp.git
    

    Or, if you have GitHub SSH keys set up:

    git clone --recurse-submodules git@github.com:epruesse/ymp.git
    
  2. Create and activate conda environment:

    conda env create -n ymp --file environment.yaml
    source activate ymp
    
  3. Install YMP into conda environment:

    pip install -e .
    
  4. Verify that YMP works:

    source activate ymp
    ymp --help
    

Updating Development Version

Usually, all you need to do is a pull:

git pull
git submodule update --recursive --remote

If environments were updated, you may want to regenerate the local installations and clean out environments no longer used to save disk space:

source activate ymp
ymp env update
ymp env clean
# alternatively, you can just delete existing envs and let YMP
# reinstall as needed:
# rm -rf ~/.ymp/conda*
conda clean -a

If you see errors before jobs are executed, the core requirements may have changed. To update the YMP conda environment, enter the folder where you installed YMP and run the following:

source activate ymp
conda env update --file environment.yaml

If something changed in setup.py, a re-install may be necessary:

source activate ymp
pip install -U -e .

Configuration

YMP reads its configuration from a YAML formatted file ymp.yml. To run YMP, you need to first tell it which datasets you want to process and where it can find them.

Getting Started

A simple configuration looks like this:

projects:
  myproject:
    data: mapping.csv

This tells YMP to look for a file mapping.csv located in the same folder as your ymp.yml listing the datasets for the project myproject. By default, YMP will use the left most unique column as names for your datasets and try to guess which columns point to your input data.

The matching mapping.csv might look like this:

sample,fq1,fq2
foot,sample1_1.fq.gz,sample1_2.fq.gz
hand,sample2_1.fq.gz,sample2_2.fq.gz

So we have two samples, foot and hand, and the read files for those in the same directory as the configuration file. Using relative or absolute paths you can point to any place in your filesystem. You can also use SRA references like SRR123456 or URLs pointing to remote files.
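
The default choice of the left most unique column can be illustrated with a minimal Python sketch. It is not YMP's actual code: the function name is hypothetical, and treating empty cells as non-unique is an assumption made for this example.

```python
import csv
import io

# The mapping file from above, embedded so the example is self-contained.
MAPPING_CSV = """\
sample,fq1,fq2
foot,sample1_1.fq.gz,sample1_2.fq.gz
hand,sample2_1.fq.gz,sample2_2.fq.gz
"""


def leftmost_unique_column(rows, fieldnames):
    """Return the first column whose values are all distinct and non-empty."""
    for name in fieldnames:
        values = [row[name] for row in rows]
        if all(values) and len(set(values)) == len(values):
            return name
    return None


reader = csv.DictReader(io.StringIO(MAPPING_CSV))
rows = list(reader)
id_col = leftmost_unique_column(rows, reader.fieldnames)
print(id_col, [row[id_col] for row in rows])
```

Here the sample column is picked, so the datasets are named foot and hand.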

The mapping file itself may be in comma separated or tab separated format or may be an Excel file. For Excel files, you may specify the sheet to be used separated from the file name by a % sign. For example:

projects:
  myproject:
    data: myproject.xlsx%sheet3

The matching Excel file could then have a sheet3 with this content:

sample  fq1                          fq2                          srr
foot    /data/foot1.fq.gz            /data/foot2.fq.gz
hand                                                              SRR123456
head    http://datahost/head1.fq.gz  http://datahost/head2.fq.gz  SRR234234

For foot, the two gzipped FastQ files are used. The data for hand is retrieved from SRA and the data for head downloaded from datahost. The SRR number for head is ignored as the URL pair is found first.
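
As a small illustration of the file%sheet notation, the following Python sketch splits such a reference into its two parts. The function name is hypothetical, and splitting at the first % sign is an assumption; YMP may handle file names containing % differently.

```python
def split_data_ref(ref):
    """Split 'file%sheet' into (file, sheet); sheet is None if absent."""
    filename, sep, sheet = ref.partition("%")
    return filename, (sheet if sep else None)


print(split_data_ref("myproject.xlsx%sheet3"))
print(split_data_ref("mapping.csv"))
```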

Referencing Read Files

YMP will search your map file data for references to the read data files. It understands three types of references to your reads:

Local FastQ files: data/some_1.fq.gz, data/some_2.fq.gz

The file names should end in .fastq or .fq, optionally followed by .gz if your data is compressed. You need to provide forward and reverse reads in separate columns; the left most column is assumed to refer to the forward reads.

If the filename is relative (does not start with a /), it is assumed to be relative to the location of ymp.yml.

Remote FastQ files: http://myhost/some_1.fq.gz, http://myhost/some_2.fq.gz

If the filename starts with http:// or https://, YMP will download the files automatically.

Forward and reverse reads need to be either both local or both remote.

SRA Run IDs: SRR123456

Instead of giving names for FastQ files, you may provide SRA Run accessions, e.g. SRR123456 (or ERRnnn or DRRnnn for runs originally submitted to EMBL or DDBJ, respectively). YMP will use fastq-dump to download and extract the SRA files.

Which type to use is determined for each row in your map file data individually. From left to right, the first recognized data source is used in the order they are listed above.
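
The per-row selection described above might look roughly like the following Python sketch. It is a simplified illustration under stated assumptions, not YMP's implementation: the function names and regular expressions are made up for this example, and the rows are those of the Excel example in the previous section.

```python
import re

# Assumed patterns: FastQ file endings and SRA/ENA/DDBJ run accessions.
FASTQ_RE = re.compile(r"\.(fastq|fq)(\.gz)?$")
SRA_RE = re.compile(r"^[SED]RR[0-9]+$")


def classify(value):
    """Classify one cell as 'remote', 'sra', 'local' or None."""
    if not value:
        return None
    if value.startswith(("http://", "https://")):
        return "remote"
    if SRA_RE.match(value):
        return "sra"
    if FASTQ_RE.search(value):
        return "local"
    return None


def pick_source(cells):
    """Scan cells left to right and return the first recognized source."""
    kinds = [classify(cell) for cell in cells]
    for cell, kind in zip(cells, kinds):
        if kind == "sra":
            return kind, [cell]
        if kind in ("local", "remote"):
            # Forward and reverse reads must be of the same kind.
            files = [c for c, k in zip(cells, kinds) if k == kind]
            return kind, files[:2]
    return None


rows = {
    "foot": ["/data/foot1.fq.gz", "/data/foot2.fq.gz", ""],
    "hand": ["", "", "SRR123456"],
    "head": ["http://datahost/head1.fq.gz", "http://datahost/head2.fq.gz", "SRR234234"],
}
for sample, cells in rows.items():
    print(sample, pick_source(cells))
```

For head, the URL pair is found before the SRR number, so the run accession is not used.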

Configuration processing an SRA RunTable:

projects:
  smith17:
    data:
      - SraRunTable.txt
    id_col: Sample_Name_s

Project Configuration

Each project must have a data key defining which mapping file(s) to load. This may be a simple string referring to the file (URLs are OK as well) or a more complex configuration.

Specifying Columns

By default, YMP will choose the columns to use as data set name and to locate the read data automatically. You can override this behavior by specifying the columns explicitly:

  1. Data set names: id_col: Sample

    The left most unique column may not always be the most informative to use as names for the datasets. In the above example, we specify the column to use explicitly with the line id_col: Sample_Name_s as the columns in SRA run tables are sorted alpha-numerically and the left most unique one may well contain random numeric data.

    Default: left most unique column

  2. Data set read columns: read_cols: [fq1, fq2]

    If your map files contain multiple references to source files, e.g. local and remote, and the order of preference used by YMP does not meet your needs you can restrict the search for suitable data references to a set of columns using the key read_cols.

    Default: all columns

Example
projects:
  smith17:
    data:
      - SraRunTable.txt
    id_col: Sample_Name_s
    read_cols: Run_s
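
The effect of read_cols can be pictured as a filter applied to each row before the search for read references starts. The Python sketch below is only an illustration: the function name and the sample row are hypothetical, and keeping the original column order is an assumption.

```python
def restrict_to_read_cols(row, read_cols=None):
    """Return only the columns of row that may be searched for reads."""
    if read_cols is None:
        return dict(row)  # default: all columns
    if isinstance(read_cols, str):
        read_cols = [read_cols]  # a single column name, as in the example
    wanted = set(read_cols)
    # Keep the original left-to-right column order (assumption).
    return {col: val for col, val in row.items() if col in wanted}


# Hypothetical row with both local files and a run accession.
row = {
    "Sample_Name_s": "gut1",
    "fq1": "gut1_1.fq.gz",
    "fq2": "gut1_2.fq.gz",
    "Run_s": "SRR123456",
}
print(restrict_to_read_cols(row, "Run_s"))
print(restrict_to_read_cols(row, ["fq1", "fq2"]))
```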

Multiple Mapping Files per Project

To combine data sets from multiple mapping files, simply list the files under the data key:

projects:
  myproject:
    data:
      - sequencing_run_1.txt
      - sequencing_run_2.txt

The files should at least share one column containing unique values to use as names for the datasets.

If you need to merge meta-data spread over multiple files, you can use the join key:

projects:
  myproject:
    data:
      - join:
          - SraRunTable.txt
          - metadata.xlsx%reference_project
      - metadata.xlsx%our_samples

This will merge rows from SraRunTable.txt with rows in the reference_project sheet in metadata.xlsx if all columns of the same name contain the same data (natural join) and add samples from the our_samples sheet to the bottom of the list.
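
To make the natural join concrete, here is a minimal Python sketch that uses plain dictionaries in place of the mapping files. The column names and values are invented for illustration, and the code does not reflect how YMP implements the join internally.

```python
def natural_join(left, right):
    """Join rows that agree on all columns present in both tables."""
    joined = []
    for lrow in left:
        for rrow in right:
            shared = lrow.keys() & rrow.keys()
            if all(lrow[col] == rrow[col] for col in shared):
                joined.append({**lrow, **rrow})
    return joined


# Stand-ins for SraRunTable.txt and the two sheets of metadata.xlsx.
run_table = [
    {"Sample": "A", "Run_s": "SRR123456"},
    {"Sample": "B", "Run_s": "SRR234234"},
]
reference_project = [
    {"Sample": "A", "Site": "foot"},
    {"Sample": "B", "Site": "hand"},
]
our_samples = [
    {"Sample": "C", "Site": "head", "fq1": "c_1.fq.gz", "fq2": "c_2.fq.gz"},
]

# Joined rows first, then the additional samples appended at the bottom.
samples = natural_join(run_table, reference_project) + our_samples
for row in samples:
    print(row)
```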

Complete Example

projects:
  myproject:
    data:
      - join:
          - SraRunTable.txt
          - metadata.xlsx%reference_project
      - metadata.xlsx%our_samples
      - mapping.csv
    id_col: Sample
    read_cols:
      - fq1
      - fq2
      - Run_s

Command Line

ymp

Welcome to YMP!

Please find the full manual at https://ymp.readthedocs.io

ymp [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

--version

Show the version and exit.

--install-completion

Install command completion for the current shell. Make sure to have psutil installed.

--profile <profile>

Profile execution time using Yappi

env

Manipulate conda software environments

These commands allow accessing the conda software environments managed by YMP. Use e.g.

>>> $(ymp env activate multiqc)

to enter the software environment for multiqc.

ymp env [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

activate

source activate environment

Usage: $(ymp env activate [ENVNAME])

ymp env activate [OPTIONS] ENVNAME

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAME

Required argument

clean

Remove unused conda environments

ymp env clean [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-a, --all

Delete all environments

Arguments

ENVNAMES

Optional argument(s)

export

Export conda environments

Resolved package specifications for the selected conda environments can be exported either in YAML format suitable for use with conda env create -f FILE or in TXT format containing a list of URLs suitable for use with conda create --file FILE. Please note that the TXT format is platform specific.

If other formats are desired, use ymp env list to view the environments’ installation path (“prefix” in conda lingo) and export the specification with the conda command line utility directly.

Note: Environments must be installed before they can be exported. This is due to limitations of the conda utilities. Use the “--create-missing” flag to automatically install missing environments.

ymp env export [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-d, --dest <FILE>

Destination file or directory. If a directory, file names will be derived from environment names and selected export format. Default: print to standard output.

-f, --overwrite

Overwrite existing files

-c, --create-missing

Create environments not yet installed

-s, --skip-missing

Skip environments not yet installed

-t, --filetype <filetype>

Select export format. Default: yml unless FILE ends in ‘.txt’

Options

yml | txt

Arguments

ENVNAMES

Optional argument(s)

install

Install conda software environments

ymp env install [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-p, --conda-prefix <conda_prefix>

Override location for conda environments

-e, --conda-env-spec <conda_env_spec>

Override conda env specs settings

-n, --dry-run

Only show what would be done

-r, --reinstall

Delete existing environment and reinstall

--no-spec

Don’t use conda env spec even if present

--no-archive

Delete existing archives before install

--fresh

Create fresh install. Implies reinstall, no-spec and no-archive

Arguments

ENVNAMES

Optional argument(s)

list

List conda environments

ymp env list [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

--static, --no-static

List environments statically defined via env.yml files

--dynamic, --no-dynamic

List environments defined inline from rule files

-a, --all

List all environments, including outdated ones.

-s, --sort <sort_col>

Sort by column

Options

name | hash | path | installed

-r, --reverse

Reverse sort order

Arguments

ENVNAMES

Optional argument(s)

prepare

Create envs needed to build target

ymp env prepare [OPTIONS] TARGET_FILES

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-n, --dryrun

Only show what would be done

-p, --printshellcmds

Print shell commands to be executed on shell

-k, --keepgoing

Don’t stop after failed job

--lock, --no-lock

Use/don’t use locking to prevent clobbering of files by parallel instances of YMP running

--rerun-incomplete, --ri

Re-run jobs left incomplete in last run

-F, --forceall

Force rebuilding of all stages leading to target

-f, --force

Force rebuilding of target

--notemp

Do not remove temporary files

-t, --touch

Only touch files, faking update

--shadow-prefix <shadow_prefix>

Directory to place data for shadowed rules

-r, --reason

Print reason for executing rule

-N, --nohup

Don’t die once the terminal goes away.

Arguments

TARGET_FILES

Optional argument(s)

remove

Remove conda environments

ymp env remove [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAMES

Optional argument(s)

run

Execute COMMAND with activated environment ENV

Usage: ymp env run <ENV> [--] <COMMAND…>

(Use the “--” if your command line contains option type parameters beginning with - or --)

ymp env run [OPTIONS] ENVNAME [COMMAND]...
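
When calling ymp env run from a script, always inserting the separator (two hyphens) avoids having options of the wrapped command interpreted by ymp. The Python sketch below only builds and prints the argument list; the helper name is hypothetical and multiqc is used merely as an example environment and command.

```python
import shlex


def env_run_argv(envname, command):
    """Build the argument list for 'ymp env run', always inserting '--'."""
    return ["ymp", "env", "run", envname, "--", *command]


argv = env_run_argv("multiqc", ["multiqc", "--version"])
print(shlex.join(argv))
# To actually execute it: subprocess.run(argv, check=True)
```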

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAME

Required argument

COMMAND

Optional argument(s)

update

Update conda environments

ymp env update [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAMES

Optional argument(s)

init

Initialize YMP workspace

ymp init [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

cluster

Set up cluster

ymp init cluster [OPTIONS]

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-y, --yes

Confirm every prompt

demo

Copies YMP tutorial data into the current working directory

ymp init demo [OPTIONS]

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

project
ymp init project [OPTIONS] [NAME]

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-y, --yes

Confirm every prompt

Arguments

NAME

Optional argument

make

Build target(s) locally

ymp make [OPTIONS] TARGET_FILES

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-n, --dryrun

Only show what would be done

-p, --printshellcmds

Print shell commands to be executed on shell

-k, --keepgoing

Don’t stop after failed job

--lock, --no-lock

Use/don’t use locking to prevent clobbering of files by parallel instances of YMP running

--rerun-incomplete, --ri

Re-run jobs left incomplete in last run

-F, --forceall

Force rebuilding of all stages leading to target

-f, --force

Force rebuilding of target

--notemp

Do not remove temporary files

-t, --touch

Only touch files, faking update

--shadow-prefix <shadow_prefix>

Directory to place data for shadowed rules

-r, --reason

Print reason for executing rule

-N, --nohup

Don’t die once the terminal goes away.

-j, --cores <CORES>

The number of parallel threads used for scheduling jobs

--dag

Print the Snakemake execution DAG and exit

--rulegraph

Print the Snakemake rule graph and exit

--debug-dag

Show candidates and selections made while the rule execution graph is being built

--debug

Set the Snakemake debug flag

Arguments

TARGET_FILES

Optional argument(s)

show

Show configuration properties

ymp show [OPTIONS] PROPERTY

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-h, --help
-s, --source

Show source

Arguments

PROPERTY

Optional argument

stage

Manipulate YMP stages

ymp stage [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

list

List available stages

ymp stage list [OPTIONS] STAGE

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-l, --long

Show full stage descriptions

-s, --short

Show only stage names

-c, --code

Show definition file name and line number

-t, --types

Show input/output types

Arguments

STAGE

Optional argument(s)

submit

Build target(s) on cluster

The parameters for cluster execution are drawn from layered profiles. YMP includes base profiles for the “torque” and “slurm” cluster engines.

ymp submit [OPTIONS] TARGET_FILES

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-n, --dryrun

Only show what would be done

-p, --printshellcmds

Print shell commands to be executed on shell

-k, --keepgoing

Don’t stop after failed job

--lock, --no-lock

Use/don’t use locking to prevent clobbering of files by parallel instances of YMP running

--rerun-incomplete, --ri

Re-run jobs left incomplete in last run

-F, --forceall

Force rebuilding of all stages leading to target

-f, --force

Force rebuilding of target

--notemp

Do not remove temporary files

-t, --touch

Only touch files, faking update

--shadow-prefix <shadow_prefix>

Directory to place data for shadowed rules

-r, --reason

Print reason for executing rule

-N, --nohup

Don’t die once the terminal goes away.

-P, --profile <NAME>

Select cluster config profile to use. Overrides cluster.profile setting from config.

-c, --snake-config <FILE>

Provide snakemake cluster config file

-d, --drmaa

Use DRMAA to submit jobs to cluster. Note: Make sure you have a working DRMAA library. Set DRMAA_LIBRARY_PATH if necessary.

-s, --sync

Use synchronous cluster submission, keeping the submit command running until the job has completed. Adds qsub_sync_arg to cluster command

-i, --immediate

Use immediate submission, submitting all jobs to the cluster at once.

--command <CMD>

Use CMD to submit job script to the cluster

--wrapper <CMD>

Use CMD as script submitted to the cluster. See Snakemake documentation for more information.

--max-jobs-per-second <N>

Limit the number of jobs submitted per second

-l, --latency-wait <T>

Time in seconds to wait after job completed until files are expected to have appeared in local file system view. On NFS, this time is governed by the acdirmax mount option, which defaults to 60 seconds.

-J, --cluster-cores <N>

Limit the maximum number of cores used by jobs submitted at a time

-j, --cores <N>

Number of local threads to use

--args <ARGS>

Additional arguments passed to cluster submission command. Note: Make sure the first character of the argument is not ‘-’, prefix with ‘ ’ as necessary.

--scriptname <NAME>

Set the name template used for submitted jobs

Arguments

TARGET_FILES

Optional argument(s)

Stages

Listing of stages implemented in YMP

stage Import[source]

Imports raw read files into YMP.

>>> ymp make toy
>>> ymp make mpic
rule export_qiime_map_file[source]
stage annotate_blast[source]

Annotate sequences with BLAST

Searches a reference database for hits with blastn. Use the E flag to specify the exponent of the required E-value. Use N or Mega to specify default. Use Best to add the -subject_besthit flag.

This stage produces blast7.gz files as output.

>>> ymp make toy.ref_genome.index_blast.annotate_blast
rule blast_db_size[source]

Determines size of BLAST database (for splitting)

rule blast_db_size_SPLIT[source]

Variant of blast_db_size for multi-file blast indices

rule blast_db_size_V4[source]

Variant of blast_db_size for V4 blast indices

rule blastn_join_result[source]

Merges BLAST results

rule blastn_query[source]

Runs BLAST

rule blastn_query_SPLIT[source]

Variant of blastn_query for multi-file blast indices

rule blastn_query_V4[source]

Variant of blastn_query for V4 blast indices

rule blastn_split_query_fasta[source]

Split FASTA query file into chunks for individual BLAST runs

rule blastn_split_query_fasta_hack[source]

Workaround for a problem with snakemake checkpoints and run: statements

stage annotate_diamond[source]

FIXME

rule diamond_blastx_fasta[source]
rule diamond_view[source]

Convert Diamond binary output (daa) to BLAST6 format

stage annotate_prodigal[source]

Call genes using prodigal

>>> ymp make toy.ref_genome.annotate_prodigal
rule prodigal[source]

Predict genes using prodigal

stage annotate_tblastn[source]

Runs tblastn

rule blast7_to_gtf[source]

Convert from Blast Format 7 to GFF/GTF format

rule tblastn_query[source]

Runs a TBLASTN search against an assembly.

stage assemble_megahit[source]

Assemble metagenome using MegaHit.

>>> ymp make toy.assemble_megahit.map_bbmap
>>> ymp make toy.group_ALL.assemble_megahit.map_bbmap
>>> ymp make toy.group_Subject.assemble_megahit.map_bbmap
rule megahit[source]

Runs MegaHit.

stage assemble_spades[source]

Assemble reads using spades

>>> ymp make toy.assemble_spades
>>> ymp make toy.group_ALL.assemble_spades
>>> ymp make toy.group_Subject.assemble_spades
>>> ymp make toy.assemble_spades
>>> ymp make toy.assemble_spadesMeta
>>> ymp make toy.assemble_spadesSc
>>> ymp make toy.assemble_spadesRna
>>> ymp make toy.assemble_spadesIsolate
>>> ymp make toy.assemble_spadesNC
>>> ymp make toy.assemble_spadesMetaNC
rule spades[source]

Runs Spades. Supports reads.by_COLUMN.sp/complete as target for by group co-assembly.

rule spades_input_yaml[source]

Prepares a dataset config for spades. Spades commandline is limited to at most 9 pairs of fq files, so to allow arbitrary numbers we need to use the dataset config option.

Preparing in a separate rule so that the main spades rule can use the shell: rule and not run:, which would preclude it from using conda environments.
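
A dataset config of this kind could be generated along the lines of the Python sketch below. It is not the code of the spades_input_yaml rule: the function name and file paths are hypothetical, the key names follow the dataset description format as documented in the SPAdes manual to the best of our knowledge, and putting all pairs into a single paired-end library is an assumption. Check the result against the manual of your SPAdes version.

```python
import json


def spades_dataset_config(pairs):
    """Describe any number of read pairs as a single paired-end library."""
    library = {
        "orientation": "fr",
        "type": "paired-end",
        "left reads": [fwd for fwd, _ in pairs],
        "right reads": [rev for _, rev in pairs],
    }
    # JSON documents are also valid YAML, so no YAML library is needed.
    return json.dumps([library], indent=2)


# Twelve hypothetical read pairs, more than fit on the command line.
pairs = [(f"/data/s{i}_1.fq.gz", f"/data/s{i}_2.fq.gz") for i in range(1, 13)]
print(spades_dataset_config(pairs))
```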

stage assemble_trinity[source]
rule trinity[source]
rule trinity_stats[source]
stage assemble_unicycler[source]

Assemble reads using unicycler

>>> ymp make toy.assemble_unicycler
rule unicycler[source]

Runs unicycler

stage basecov_bedtools[source]

Creates BLAST index running makeblastdb on input fasta.gz files.

>>> ymp make toy.ref_genome.index_blast
rule bedtools_genomecov[source]
stage bin_metabat2[source]

Bin metagenome assembly into MAGs

>>> ymp make mock.assemble_megahit.map_bbmap.sort_bam.bin_metabat2
>>> ymp make mock.group_ALL.assemble_megahit.map_bbmap.sort_bam.group_ALL.bin_metabat2
rule metabat2_bin[source]

Bin metagenome with MetaBat2

rule metabat2_depth[source]

Generates a depth file from BAM

stage check[source]

Verify file availability

This stage provides rules for checking the file availability at a given point in the stage stack.

Mainly useful for testing and debugging.

rule check_fasta[source]

Verify availability of FastA type reference

rule check_fastp[source]

Verify availability of FastP type reference

stage cluster_cdhit[source]

Clusters protein sequences using CD-HIT

>>> ymp make toy.ref_query.cluster_cdhit
rule cdhit_clstr_to_csv[source]
rule cdhit_faa_single[source]

Clustering predicted genes using cdhit

rule cdhit_prepare_input[source]

Prepares input data for CD-HIT

  • rewrites ‘*’ to ‘X’ as stop-codon not understood by CD-HIT

  • prefixes lost ID to Fasta ID

stage correct_bbmap[source]

Correct read errors by overlapping inside tails

Applies BBMap's “bbmerge.sh ecco” mode. This will overlap the inside of read pairs and choose the base with the higher quality where the alignment contains mismatches and increase the quality score as indicated by the double observation where the alignment contains matches.

>>> ymp make toy.correct_bbmap
>>> ymp make mpic.correct_bbmap
rule bbmap_error_correction[source]

Error correction with BBMerge overlapping

rule bbmap_error_correction_all[source]
rule bbmap_error_correction_se[source]

Error correction with BBMerge overlapping

stage count_diamond[source]
rule diamond_count[source]
stage count_stringtie[source]
rule stringtie[source]
rule stringtie_abundance[source]
rule stringtie_all[source]
rule stringtie_all_target[source]
rule stringtie_gather_ballgown[source]
rule stringtie_merge[source]
stage coverage_samtools[source]

Computes coverage from a sorted bam file using samtools coverage

rule samtools_coverage[source]
stage dedup_bbmap[source]

Remove duplicate reads

Applies BBMap's “dedupe.sh”

>>> ymp make toy.dedup_bbmap
>>> ymp make mpic.dedup_bbmap
rule bbmap_dedupe[source]

Deduplicate reads using BBMap’s dedupe.sh

rule bbmap_dedupe_all[source]
rule bbmap_dedupe_se[source]

Deduplicate reads using BBMap's dedupe.sh

stage dust_bbmap[source]

Perform entropy filtering on reads using BBMap's bbduk.sh

The parameter Enn gives the entropy cutoff. Higher values filter more sequences.

>>> ymp make toy.dust_bbmap
>>> ymp make toy.dust_bbmapE60
rule bbmap_dust[source]
stage extract_reads[source]

Extract reads from BAM file using samtools fastq.

Parameters fn, Fn and Gn are passed through to samtools view. Reads are output only if all bits in f are set, none of the bits in F are set, and any of the bits in G is unset.

The relevant flag bits are:

  • 1: paired

  • 2: proper pair (both aligned in right orientation)

  • 4: unmapped

  • 8: other read unmapped

Some options include:

  • f2: correctly mapped (only proper pairs)

  • F12: both ends mapped (but potentially “improper”)

  • G12: either end mapped

  • F2: not correctly mapped (not proper pair, could also be unmapped)

  • f12: not mapped (neither read mapped)
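
The bit logic behind these options can be checked with a short Python sketch. It mirrors the rule stated above (all bits in f set, no bit in F set, at least one bit in G unset); the function name is made up for this example, and treating a value of 0 as “filter disabled” for G is an assumption.

```python
def keep_read(flag, f=0, F=0, G=0):
    """Apply the f/F/G rules to one SAM flag value."""
    all_f_set = (flag & f) == f
    no_F_set = (flag & F) == 0
    # G=0 disables the filter; otherwise at least one G bit must be unset.
    some_G_unset = G == 0 or (flag & G) != G
    return all_f_set and no_F_set and some_G_unset


PAIRED, PROPER, UNMAPPED, MATE_UNMAPPED = 1, 2, 4, 8

examples = {
    "both mapped": PAIRED | PROPER,
    "mate unmapped": PAIRED | MATE_UNMAPPED,
    "neither mapped": PAIRED | UNMAPPED | MATE_UNMAPPED,
}
for name, flag in examples.items():
    print(
        name,
        "f2:", keep_read(flag, f=2),
        "F12:", keep_read(flag, F=12),
        "G12:", keep_read(flag, G=12),
        "f12:", keep_read(flag, f=12),
    )
```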

rule samtools_fastq[source]
stage extract_seqs[source]

Extract sequences from .fasta.gz file using samtools faidx

Currently requires a .blast7 file as input.

Use parameter Nomatch to instead keep unmatched sequences.

rule samtools_faidx[source]
rule samtools_select_blast[source]
stage filter_bmtagger[source]

Filter(-out) contaminant reads using BMTagger

>>> ymp make toy.ref_phiX.index_bmtagger.remove_bmtagger
>>> ymp make toy.ref_phiX.index_bmtagger.remove_bmtagger.assemble_megahit
>>> ymp make toy.ref_phiX.index_bmtagger.filter_bmtagger
>>> ymp make mpic.ref_phiX.index_bmtagger.remove_bmtagger
rule bmtagger_filter[source]

Filter reads using reference

rule bmtagger_filter_all[source]
rule bmtagger_filter_out[source]

Filter-out reads using reference

rule bmtagger_filter_revread[source]

Filter reads using reference

rule bmtagger_find[source]

Match paired end reads against reference

rule bmtagger_find_se[source]

Match single end reads against reference

rule bmtagger_remove_all[source]
stage format_bbmap[source]

Process sequences with BBMap's format.sh

Parameter Ln filters sequences at a minimum length.

>>> ymp make toy.assemble_spades.format_bbmapL200
rule bbmap_reformat[source]
stage humann2[source]

Compute functional profiles using HUMAnN2

rule humann2[source]

Runs HUMAnN2 with separately processed Metaphlan2 output.

Note

HUMAnN2 has no special support for paired end reads. As per manual, we just feed it the concatenated forward and reverse reads.

rule humann2_all[source]
rule humann2_join_tables[source]

Joins HUMAnN2 per sample output tables

rule humann2_renorm_table[source]

Renormalizes humann2 output tables

stage index_bbmap[source]

Creates BBMap index

>>> ymp make toy.ref_genome.index_bbmap
rule bbmap_makedb[source]

Precomputes BBMap index

stage index_blast[source]
rule blast_makedb[source]

Build Blast index

stage index_bmtagger[source]
rule bmtagger_bitmask[source]
rule bmtagger_index[source]
stage index_bowtie2[source]
>>> ymp make toy.ref_genome.index_bowtie2
rule bowtie2_index[source]
stage index_diamond[source]
rule diamond_makedb[source]

Build Diamond index file

stage map_bbmap[source]

Map reads using BBMap

>>> ymp make toy.assemble_megahit.map_bbmap
>>> ymp make toy.ref_genome.map_bbmap
>>> ymp make mpic.ref_ssu.map_bbmap
rule bbmap_map[source]

Map reads from each (co-)assembly read file to the assembly

rule bbmap_map_SE[source]

Map reads from each (co-)assembly read file to the assembly

stage map_bowtie2[source]

Map reads using Bowtie2

>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2VF
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2F
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2S
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2VS
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2X800
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2I5
>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2L
>>> ymp make toy.assemble_megahit.index_bowtie2.map_bowtie2
>>> ymp make toy.group_Subject.assemble_megahit.index_bowtie2.map_bowtie2
>>> ymp make mpic.ref_ssu.index_bowtie2.map_bowtie2
rule bowtie2_map[source]
rule bowtie2_map_SE[source]
stage map_diamond[source]
rule diamond_blastx_fastq[source]
rule diamond_blastx_fastq2[source]
rule diamond_view_2[source]

Convert Diamond binary output (daa) to BLAST6 format

stage map_hisat2[source]

Map reads using Hisat2

rule hisat2_map[source]

For HISAT2, we always assume a pre-built index, as providing SNPs, haplotypes, etc. is beyond this pipeline's scope.

stage map_star[source]

Map RNA-Seq reads with STAR

rule star_map[source]
stage markdup_sambamba[source]
rule sambamba_markdup[source]
stage metaphlan2[source]

Assess metagenome community composition using Metaphlan 2

rule metaphlan2[source]

Computes community profile from mapped reads and Metaphlan’s custom reference database.

rule metaphlan2_map[source]

Align reads to Metaphlan’s custom reference database.

rule metaphlan2_merge[source]

Merges Metaphlan community profiles.

stage polish_pilon[source]

Polish genomes with Pilon

Requires fasta.gz and sorted.bam files as input.

rule pilon_polish[source]
stage primermatch_bbmap[source]

Filters reads by matching reference primer using BBMap's “bbduk.sh”.

>>> ymp make mpic.ref_primers.primermatch_bbmap
rule bbduk_primer[source]

Splits reads based on primer matching into “primermatch” and “primerfail”.

rule bbduk_primer_all[source]
rule bbduk_primer_se[source]

Splits reads based on primer matching into “primermatch” and “primerfail”.

stage profile_centrifuge[source]

Classify reads using centrifuge

rule centrifuge[source]
stage qc_fastqc[source]

Quality screen reads using FastQC

>>> ymp make toy.qc_fastqc
rule qc_fastqc[source]

Run FastQC on read files

stage qc_multiqc[source]

Aggregate QC reports using MultiQC

rule multiqc_fastqc[source]

Assemble report on all FQ files in a directory

stage qc_quast[source]

Estimate assembly quality using Quast

rule metaquast_all_at_once[source]

Run quast on all assemblies in the previous stage at once.

rule metaquast_by_sample[source]

Run quast on each assembly

rule metaquast_multiq_summary[source]

Aggregate Quast per assembly reports

stage quant_rsem[source]

Quantify transcripts using RSEM

rule rsem_all[source]
rule rsem_all_for_target[source]
rule rsem_quant[source]
stage references[source]

This is a “virtual” stage. It does not process read data, but comprises rules used for reference provisioning.

rule human_db_download[source]

Download HUMAnN2 reference databases

rule prepare_reference[source]

Provisions files in <reference_dir>/<reference_name>

  • Creates symlinks to downloaded references

  • Compresses references provided uncompressed upstream

  • Connects files requested by stages with downloaded files and unpacked archives

rule unpack_archive[source]

Template rule for unpacking references provisioned upstream as archive.

rule unpack_ref_GRCh38_eaa4c10f

Unpacks ref_GRCh38 archive:

URL: ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_snp_tran.tar.gz

Files:

  • ALL.1.ht2

  • ALL.2.ht2

  • ALL.3.ht2

  • ALL.4.ht2

  • ALL.5.ht2

  • ALL.6.ht2

  • ALL.7.ht2

  • ALL.8.ht2

rule unpack_ref_centrifuge_0d910a96

Unpacks ref_centrifuge archive:

URL: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p+h+v.tar.gz

Files:

  • p+h+v.1.cf

  • p+h+v.2.cf

  • p+h+v.3.cf

rule unpack_ref_centrifuge_1ee7c028

Unpacks ref_centrifuge archive:

URL: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz

Files:

  • nt.1.cf

  • nt.2.cf

  • nt.3.cf

rule unpack_ref_centrifuge_43ba6165

Unpacks ref_centrifuge archive:

URL: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed.tar.gz

Files:

  • p_compressed.1.cf

  • p_compressed.2.cf

  • p_compressed.3.cf

rule unpack_ref_centrifuge_a9964521

Unpacks ref_centrifuge archive:

URL: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz

Files:

  • p_compressed+h+v.1.cf

  • p_compressed+h+v.2.cf

  • p_compressed+h+v.3.cf

rule unpack_ref_greengenes_305aa905

Unpacks ref_greengenes archive:

URL: ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz

Files:

  • rep_set/99_otus.fasta

  • rep_set/97_otus.fasta

  • rep_set/94_otus.fasta

rule unpack_ref_metaphlan2_a6545140

Unpacks ref_metaphlan2 archive:

URL: https://depot.galaxyproject.org/software/metaphlan2/metaphlan2_2.6.0_src_all.tar.gz

Files:

  • db_v20/mpa_v20_m200.1.bt2

  • db_v20/mpa_v20_m200.2.bt2

  • db_v20/mpa_v20_m200.3.bt2

  • db_v20/mpa_v20_m200.4.bt2

  • db_v20/mpa_v20_m200.rev.1.bt2

  • db_v20/mpa_v20_m200.rev.2.bt2

  • db_v20/mpa_v20_m200.pkl

rule unpack_ref_mothur_SEED_39c9f686

Unpacks ref_mothur_SEED archive:

URL: https://www.mothur.org/w/images/a/a4/Silva.seed_v128.tgz

Files:

  • silva.seed_v128.tax

  • silva.seed_v128.align

stage remove_bbmap[source]

Filter reads by reference

This stage aligns the reads with a given reference using BBMap in fast mode. Matching reads are collected in the stage filter_bbmap and the remaining reads are collected in the stage remove_bbmap.

>>> ymp make toy.ref_phiX.index_bbmap.remove_bbmap
>>> ymp make toy.ref_phiX.index_bbmap.filter_bbmap
>>> ymp make mpic.ref_phiX.index_bbmap.remove_bbmap
rule bbmap_split[source]
rule bbmap_split_all[source]
rule bbmap_split_all_remove[source]
rule bbmap_split_se[source]
stage sort_bam[source]
rule sambamba_sort[source]
stage split_library[source]

Demultiplexes amplicon sequencing files

This rule is treated specially. If a configured project specifies a barcode_col, reads from the file (or files) are used in combination with

rule fastq_multix[source]
rule split_library_compress_sample[source]
stage trim_bbmap[source]

Trim adapters and low quality bases from reads

Applies BBMap's “bbduk.sh”.

Parameters:

  • A: append to enable adapter trimming

  • Q20: append to select phred score cutoff (default 20)

  • L20: append to select minimum read length (default 20)

>>> ymp make toy.trim_bbmap
>>> ymp make toy.trim_bbmapA
>>> ymp make toy.trim_bbmapAQ10L10
>>> ymp make mpic.trim_bbmap
rule bbmap_trim[source]

Trimming and Adapter Removal using BBTools BBDuk

rule bbmap_trim_all[source]
rule bbmap_trim_se[source]

Trimming and Adapter Removal using BBTools BBDuk

stage trim_sickle[source]

Perform read trimming using Sickle

>>> ymp make toy.trim_sickle
>>> ymp make toy.trim_sickleQ10L10
>>> ymp make mpic.trim_sickleL20
rule sicke_all[source]
rule sickle[source]
rule sickle_se[source]
stage trim_trimmomatic[source]

Adapter trim reads using trimmomatic

>>> ymp make toy.trim_trimmomaticT32
>>> ymp make mpic.trim_trimmomatic
rule trimmomatic_adapter[source]

Trimming with Trimmomatic

rule trimmomatic_adapter_all[source]
rule trimmomatic_adapter_se[source]

Trimming with Trimmomatic

rule download_file_ftp[source]

Downloads remote file using wget

rule download_file_http[source]

Downloads remote file using internal downloader

rule mkdir[source]

Auto-create directories listed in ymp config.

Use these as input:

>>> input: tmpdir = ancient(ymp.get_config().dir.tmp)

Or as param:

>>> param: tmpdir = "/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/latest/doc/tmp"

rule prefetch[source]

Downloads SRA files into NCBI SRA folder (ncbi/public/sra).

rule fastq_dump[source]

Extracts FQ from SRA files

rule cdhit_fna_single[source]

Clustering predicted genes (nuc) using cdhit-est

rule normalize_16S[source]

Normalize 16S by copy number using PICRUSt. Must be run with a closed-reference OTU table.

rule predict_metagenome[source]

Predict metagenome using picrust

rule categorize_by_function[source]

Categorize PICRUSt KOs into pathways

rule rsem_index[source]

Build Genome Index for RSEM

rule star_index[source]

Build Genome Index for Star

API

ymp package

ymp.get_config()[source]

Access the current YMP configuration object.

This object might change once during normal execution: it is deleted before passing control to Snakemake. During unit test execution the object is deleted between all tests.

Return type

ConfigMgr

ymp.print_rule = 0

Set to 1 to show the YMP expansion process as it is applied to the next Snakemake rule definition.

>>> ymp.print_rule = 1
>>> rule broken:
>>>   ...
>>> ymp make broken -vvv
ymp.snakemake_versions = ['6.0.5', '6.1.0', '6.1.1', '6.2.1']

List of versions this version of YMP has been verified to work with

Subpackages

ymp.cli package
ymp.cli.install_completion(ctx, attr, value)[source]

Installs click_completion tab expansion into the user's shell

ymp.cli.install_profiler(ctx, attr, value)[source]
Submodules
ymp.cli.env module
ymp.cli.env.get_env(envname)[source]

Get single environment matching glob pattern

Parameters

envname – environment glob pattern

ymp.cli.env.get_envs(patterns=None)[source]

Get environments matching glob pattern

Parameters

patterns – list of glob patterns to match
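
As a rough illustration of glob matching on environment names, the sketch below uses the standard library's fnmatch module. The helper `match_envs` and the sample environment names are hypothetical; this is not YMP's implementation.

```python
from fnmatch import fnmatchcase


def match_envs(env_names, patterns=None):
    """Return sorted names matching any glob pattern (all names if none given)."""
    if not patterns:
        return sorted(env_names)
    return sorted(
        name for name in env_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    )


# Hypothetical environment names, for illustration only
envs = ["bbmap", "bowtie2", "megahit", "multiqc"]
print(match_envs(envs, ["b*"]))
print(match_envs(envs, ["m*qc", "mega*"]))
```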

ymp.cli.init module

Implements subcommands for ymp init

ymp.cli.init.have_command(cmd)[source]
ymp.cli.make module

Implements subcommands for ymp make and ymp submit

class ymp.cli.make.TargetParam[source]

Bases: click.types.ParamType

Handles tab expansion for build targets

classmethod complete(ctx, incomplete)[source]

Try to complete incomplete command

This is executed on tab or tab-tab from the shell

Parameters
  • ctx – click context object

  • incomplete – last word in command line up until cursor

Returns

list of words incomplete can be completed to

ymp.cli.make.debug(msg, *args, **kwargs)[source]
ymp.cli.make.snake_params(func)[source]

Default parameters for subcommands launching Snakemake

ymp.cli.make.start_snakemake(kwargs)[source]

Execute Snakemake with given parameters and targets

Fixes paths of kwargs[‘targets’] to be relative to YMP root.
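
One way to picture this path fix-up, assuming targets are typed relative to the current directory and need to be re-expressed relative to the project root. The helper `targets_relative_to_root` and the directory names are hypothetical.

```python
import posixpath


def targets_relative_to_root(targets, cwd, root):
    """Re-express targets given relative to cwd as paths relative to root."""
    return [
        posixpath.relpath(posixpath.join(cwd, target), start=root)
        for target in targets
    ]


# Hypothetical layout: the command is run from a subdirectory of the root
print(targets_relative_to_root(
    ["toy.trim_bbmap"], cwd="/work/project/sub", root="/work/project"
))
```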

ymp.cli.shared_options module
class ymp.cli.shared_options.Group(name=None, commands=None, **attrs)[source]

Bases: click.core.Group

command(*args, **kwargs)[source]

A shortcut decorator for declaring and attaching a command to the group. This takes the same arguments as command() but immediately registers the created command with this instance by calling into add_command().

class ymp.cli.shared_options.Log[source]

Bases: object

Set up Logging

classmethod logfile_option(ctx, param, val)[source]
mod_level(n)[source]
classmethod quiet_option(ctx, param, val)[source]
static set_logfile(filename)[source]
classmethod verbose_option(ctx, param, val)[source]
class ymp.cli.shared_options.LogFormatter[source]

Bases: coloredlogs.ColoredFormatter

Initialize a ColoredFormatter object.

Parameters
  • fmt – A log format string (defaults to DEFAULT_LOG_FORMAT).

  • datefmt – A date/time format string (defaults to None, but see the documentation of BasicFormatter.formatTime()).

  • style – One of the characters %, { or $ (defaults to DEFAULT_FORMAT_STYLE)

  • level_styles – A dictionary with custom level styles (defaults to DEFAULT_LEVEL_STYLES).

  • field_styles – A dictionary with custom field styles (defaults to DEFAULT_FIELD_STYLES).

Raises

Refer to check_style().

This initializer uses colorize_format() to inject ANSI escape sequences in the log format string before it is passed to the initializer of the base class.

format(record)[source]

Apply level-specific styling to log records.

Parameters

record – A LogRecord object.

Returns

The result of logging.Formatter.format().

This method injects ANSI escape sequences that are specific to the level of each log record (because such logic cannot be expressed in the syntax of a log format string). It works by making a copy of the log record, changing the msg field inside the copy and passing the copy into the format() method of the base class.

snakemake_level_styles = {'crirical': {'color': 'red'}, 'debug': {'color': 'blue'}, 'error': {'color': 'red'}, 'info': {'color': 'green'}, 'warning': {'color': 'yellow'}}
class ymp.cli.shared_options.TqdmHandler(stream=None)[source]

Bases: logging.StreamHandler

Tqdm aware logging StreamHandler

Passes all log writes through tqdm to allow progress bars and log messages to coexist without clobbering terminal

Initialize the handler.

If stream is not specified, sys.stderr is used.

emit(record)[source]

Emit a record.

If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.

ymp.cli.shared_options.command(*args, **kwargs)[source]
ymp.cli.shared_options.enable_debug(_ctx, param, val)[source]
ymp.cli.shared_options.group(*args, **kwargs)[source]
ymp.cli.shared_options.log_options(f)[source]
ymp.cli.shared_options.nohup(ctx, param, val)[source]

Make YMP continue after the shell dies.

  • redirects stdout and stderr through pipes into a subprocess that won’t die if it can no longer write to either

  • closes stdin

ymp.cli.show module

Implements subcommands for ymp show

class ymp.cli.show.ConfigPropertyParam[source]

Bases: click.types.ParamType

Handles tab expansion for ymp show arguments

complete(_ctx, incomplete)[source]

Try to complete incomplete command

This is executed on tab or tab-tab from the shell

Parameters
  • ctx – click context object

  • incomplete – last word in command line up until cursor

Returns

list of words incomplete can be completed to

convert(value, param, ctx)[source]

Convert value of param given context

Parameters
  • value – string passed on command line

  • param – click parameter object

  • ctx – click context object

property properties

Find properties offered by ConfigMgr

ymp.cli.show.show_help(ctx, _param=None, value=True)[source]

Display click command help

ymp.cli.stage module
ymp.cli.stage.wrap(header, data)[source]
ymp.stage package

YMP processes data in stages, each of which is contained in its own directory.

with Stage("trim_bbmap") as S:
  S.doc("Trim reads with BBMap")
  rule bbmap_trim:
    output: "{:this:}/{sample}{:pairnames:}.fq.gz"
    input:  "{:prev:}/{sample}{:pairnames:}.fq.gz"
    ...
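
The one-directory-per-stage layout can be sketched as follows. This assumes that a target is a dot-separated stack of stage names (as in the ymp make examples) and that each stage's directory is named after the stack up to and including that stage. The helper `stack_prefixes` is hypothetical.

```python
def stack_prefixes(target):
    """List the directory name of every stage in a dot-separated stack."""
    parts = target.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


for directory in stack_prefixes("toy.trim_bbmap.assemble_megahit"):
    print(directory)
```

Under this assumption, two stacks that share a prefix (for example the same trimming stage followed by different assemblers) would also share the directories for that prefix.
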
Submodules
ymp.stage.base module

Base classes for all Stage types

class ymp.stage.base.Activateable(*args, **kwargs)[source]

Bases: object

Mixin for Stages that can be filled with rules from Snakefiles.

add_rule(rule, workflow)[source]
Return type

None

check_active_stage(name)[source]
Return type

None

static get_active()[source]
Return type

BaseStage

register_inout(name, target, item)[source]

Determine stage input/output file type from prev/this filename

Detects patterns like “PREFIX{: NAME :}/INFIX{TARGET}.EXT”. Also checks if there is an active stage.

Parameters
  • name (str) – The NAME

  • target (Set) – Set to which to add the type

  • item (str) – The filename

Return type

None

Returns

Normalized output pattern

rules: List[snakemake.rules.Rule]

Rules in this stage

static set_active(stage)[source]
Return type

None

class ymp.stage.base.BaseStage(name)[source]

Bases: object

Base class for stage types

altname

Alternative name

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

Returns a dictionary with the keys a subset of inputs and the values identifying redirections. An empty string indicates that no redirection is to take place. Otherwise, the string is the suffix to be appended to the prior StageStack.

Return type

Dict[str, str]
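
A minimal sketch of the return value described above, assuming a stage keeps a mapping from output file type to redirection suffix. The free function and the file type strings are hypothetical stand-ins, not the method's actual implementation.

```python
from typing import Dict, Set


def can_provide(stage_outputs: Dict[str, str], inputs: Set[str]) -> Dict[str, str]:
    """Map each requested input this stage can supply to its redirection suffix.

    An empty string means "no redirection".
    """
    return {name: stage_outputs[name] for name in inputs if name in stage_outputs}


# Hypothetical outputs: file type -> redirection suffix ("" = none)
outputs = {"/{sample}.fq.gz": "", "/{sample}.fasta.gz": ".ref_genome"}
print(can_provide(outputs, {"/{sample}.fasta.gz", "/{sample}.bam"}))
```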

doc(doc)[source]

Add documentation to Stage

Parameters

doc (str) – Docstring passed to Sphinx

Return type

None

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack, output_types=None)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_value=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups (List[str]) – Set of columns the values of which should form IDs

  • match_value (Optional[str]) – Limit output to rows with this value

  • match_groups (Optional[List[str]]) – … in these groups

Return type

List[str]

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

Return type

Set[str]

get_outputs(path)[source]

Returns a dictionary of outputs

Return type

Dict[str, str]

get_path(stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

Return type

str

has_checkpoint()[source]
Return type

bool

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

modify_next_group(_stack)[source]
name

The name of the stage is a string uniquely identifying it among all stages.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

class ymp.stage.base.ConfigStage(name, cfg)[source]

Bases: ymp.stage.base.BaseStage

Base for stages created via configuration

These Stages are defined in ymp.yml rather than in a rules file.

cfg

The configuration object defining this Stage.

property defined_in

List of files defining this stage

Used to invalidate caches.

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

filename

Semi-colon separated list of file names defining this Stage.

lineno

Line number within the first file at which this Stage is defined.

ymp.stage.expander module
class ymp.stage.expander.StageExpander[source]

Bases: ymp.snakemake.ColonExpander

  • Registers rules with stages when they are created

class Formatter(expander)[source]

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(key, args, kwargs)[source]
get_value_(key, args, kwargs)[source]
regroup = re.compile('(?<!{){\\s*([^{}\\s]+)\\s*}(?!})')
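
To see what this pattern matches, the sketch below applies it to a sample filename pattern. The sample string is made up; how YMP uses the matches internally is not shown here.

```python
import re

# Pattern copied from the regroup attribute above
regroup = re.compile(r'(?<!{){\s*([^{}\s]+)\s*}(?!})')

# Single-brace fields are captured; doubled braces are skipped
sample_pattern = "{:this:}/{sample}.{{escaped}}.fq.gz"
fields = regroup.findall(sample_pattern)
print(fields)
```
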
expand_ruleinfo(rule, item, expand_args, rec)[source]
expand_str(rule, item, expand_args, rec, cb)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

ymp.stage.groupby module

Implements forward grouping

Grouping allows processing multiple input datasets at once, such as in a co-assembly. It is initiated by adding the virtual stage “group_<COL>” directly before the stage that should be grouping its output. “<COL>” may be a project data column, in which case all data for which column COL shares a value will be combined, or “ALL”, which combines all samples. The output filename prefix will be either the column value or “ALL”.

>>> ymp make mock.group_sample.assemble_megahit
>>> ymp make mock.group_ALL.assemble_megahit

Subsequent stages will use the most fine-grained grouping required by their input data.
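
The effect of grouping by a column or by “ALL” can be sketched in plain Python. The rows, the column names and the helper `group_samples` are hypothetical; YMP derives this information from the project data.

```python
from collections import defaultdict


def group_samples(rows, column):
    """Group sample IDs by the value of `column`, or into one group for "ALL"."""
    groups = defaultdict(list)
    for row in rows:
        key = "ALL" if column == "ALL" else row[column]
        groups[key].append(row["sample"])
    return dict(groups)


# Hypothetical project data
rows = [
    {"sample": "s1", "Subject": "A"},
    {"sample": "s2", "Subject": "A"},
    {"sample": "s3", "Subject": "B"},
]
print(group_samples(rows, "Subject"))
print(group_samples(rows, "ALL"))
```

Each key corresponds to one output filename prefix, so grouping by Subject would give two co-assemblies here and grouping by ALL would give one.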

# FIXME: How to avoid re-specifying groupby?

class ymp.stage.groupby.GroupBy(name)[source]

Bases: ymp.stage.base.BaseStage

Virtual stage for grouping

PREFIX = 'group_'
docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

modify_next_group(stack)[source]
Return type

List[str]

ymp.stage.params module
class ymp.stage.params.Param(stage, key, name, value=None, default=None)[source]

Bases: abc.ABC

Stage Parameter (base class)

property constraint
format(groupdict)[source]
classmethod make(stage, typ, key, name, value, default)[source]
Return type

Param

parse(wildcards, nodefault=False)[source]
pattern(show_constraint=True)[source]

String to add to filenames passed to Snakemake

I.e. a pattern of the form {wildcard,constraint}

regex: str = NotImplemented
type_name: str = NotImplemented

Name of type, must be overwritten by children

types: Dict[str, Type[ymp.stage.params.Param]] = {'choice': <class 'ymp.stage.params.ParamChoice'>, 'flag': <class 'ymp.stage.params.ParamFlag'>, 'int': <class 'ymp.stage.params.ParamInt'>, 'ref': <class 'ymp.stage.params.ParamRef'>}

Type/Class mapping for param types

property wildcard
class ymp.stage.params.ParamChoice(*args, **kwargs)[source]

Bases: ymp.stage.params.Param

Stage Choice Parameter

type_name: str = 'choice'

Name of type, must be overwritten by children

class ymp.stage.params.ParamFlag(*args, **kwargs)[source]

Bases: ymp.stage.params.Param

Stage Flag Parameter

format(groupdict)[source]
parse(wildcards)[source]

Returns function that will extract parameter value from wildcards

type_name: str = 'flag'

Name of type, must be overwritten by children

class ymp.stage.params.ParamInt(*args, **kwargs)[source]

Bases: ymp.stage.params.Param

Stage Int Parameter

type_name: str = 'int'

Name of type, must be overwritten by children

class ymp.stage.params.ParamRef(stage, key, name, value=None, default=None)[source]

Bases: ymp.stage.params.Param

Reference Choice Parameter

property regex
type_name: str = 'ref'

Name of type, must be overwritten by children

class ymp.stage.params.Parametrizable(*args, **kwargs)[source]

Bases: ymp.stage.base.BaseStage

add_param(key, typ, name, value=None, default=None)[source]

Add parameter to stage

Example

>>> with Stage("test") as S:
>>>   S.add_param("N", "int", "nval", default=50)
>>>   rule:
>>>      shell: "echo {param.nval}"

This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.
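
How a name such as “testN123” could be split into stage name and parameter value is sketched below with a regular expression. The helper `parse_stage_name` is hypothetical and handles only a single int parameter.

```python
import re


def parse_stage_name(name, stage="test", key="N", default=50):
    """Parse "test" or "testN123" into {"nval": int}; return None if no match."""
    match = re.fullmatch(rf"{re.escape(stage)}(?:{re.escape(key)}(\d+))?", name)
    if match is None:
        return None
    value = match.group(1)
    return {"nval": int(value) if value is not None else default}


print(parse_stage_name("test"))
print(parse_stage_name("testN123"))
```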

Parameters
  • key – The character to use in the Stage name

  • typ – The type of the parameter (int, flag)

  • name – Name of parameter in params

  • value – value {param.xyz} should be set to if param given

  • default – default value for {param.xyz} if no param given

Return type

bool

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

format(groupdict)[source]
match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

property params
parse(name)[source]
Return type

Dict[str, str]

property regex
ymp.stage.pipeline module

Pipelines Module

Contains classes for pre-configured pipelines comprising multiple stages.

class ymp.stage.pipeline.Pipeline(name, cfg)[source]

Bases: ymp.stage.params.Parametrizable, ymp.stage.base.ConfigStage

A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.

Pipelines are configured via ymp.yml.

Example

pipelines:
  my_pipeline:
    hide: false
    params:
      length:
        key: L
        type: int
        default: 20
    stages:
      - stage_1{length}:
          hide: true
      - stage_2
      - stage_3

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

The result dictionary values will point to the “real” output.

Return type

Dict[str, str]

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, mygroups=None, target=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups – Set of columns the values of which should form IDs

  • match_value – Limit output to rows with this value

  • match_groups – … in these groups

get_path(stack, typ=None)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

hide_outputs

If true, outputs of stages are hidden by default

property outputs

The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.

Return type

Dict[str, str]
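
The override rule can be pictured as merging one dictionary per stage in pipeline order. The helper `pipeline_outputs` and the example file types and values are hypothetical.

```python
def pipeline_outputs(stage_outputs):
    """Merge per-stage output dicts in pipeline order; later stages win."""
    merged = {}
    for outputs in stage_outputs:
        merged.update(outputs)
    return merged


# Hypothetical outputs of two consecutive stages: file type -> providing stack
stages = [
    {"/{sample}.fq.gz": ".stage_1"},
    {"/{sample}.fq.gz": ".stage_1.stage_2", "/{sample}.fasta.gz": ".stage_1.stage_2"},
]
print(pipeline_outputs(stages))
```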

property params
pipeline

Path fragment describing this pipeline

stages

Dictionary of stages with configuration options for each

ymp.stage.project module

This module defines “Project”, a Stage type defined by a project matrix file giving units and meta data for input files.

class ymp.stage.project.PandasTableBuilder[source]

Bases: object

Builds the data table describing each sample in a project

This class implements loading and combining tabular data files as specified by the YAML configuration.

Format:
  • string items are files

  • lists of files are concatenated top to bottom

  • dicts must have one “command” value:

    • ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers

    • ‘table’ contains a list of one-item dicts of the form key:value[,value...]; an in-place table is created from the keys (a list of dicts is necessary because dicts are unordered)

    • ‘paste’ contains a list of tables pasted left to right; tables pasted must be of equal length or length 1

  • if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD

Example

- top.csv
- join:
  - excel.xlsx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
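
The three combining operations can be sketched with plain lists of row dictionaries. The function names are hypothetical and tables are assumed to be non-empty; the class itself works with pandas, which this sketch does not use.

```python
def concat(tables):
    """Concatenate tables (lists of row dicts) top to bottom."""
    return [row for table in tables for row in table]


def natural_join(left, right):
    """Join two non-empty tables on all headers they share."""
    shared = set(left[0]) & set(right[0])
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if all(lrow[h] == rrow[h] for h in shared)
    ]


def paste(tables):
    """Paste tables left to right; tables of length 1 are repeated."""
    length = max(len(table) for table in tables)
    rows = []
    for i in range(length):
        row = {}
        for table in tables:
            row.update(table[0] if len(table) == 1 else table[i])
        rows.append(row)
    return rows


left = [{"sample": "s1", "fq1": "s1.1.fq"}, {"sample": "s2", "fq1": "s2.1.fq"}]
right = [{"sample": "s1", "site": "gut"}, {"sample": "s2", "site": "skin"}]
print(natural_join(left, right))
print(paste([left, [{"project": "toy"}]]))
```
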
load_data(cfg, key)[source]
class ymp.stage.project.Project(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

Contains configuration for a source dataset to be processed

KEY_BCCOL = 'barcode_col'
KEY_DATA = 'data'
KEY_IDCOL = 'id_col'
KEY_READCOLS = 'read_cols'
RE_FILE = re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')
RE_REMOTE = re.compile('^(?:https?|ftp|sftp)://(?:.*)')
RE_SRR = re.compile('^[SED]RR[0-9]+$')
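
The sketch below applies the three patterns to a few made-up values. The `classify` helper and the order of its checks are assumptions; remote URLs are tested first because RE_FILE excludes only the "http://" prefix, so an "https://" URL ending in .fq.gz would also match it.

```python
import re

# Patterns copied from the class attributes above
RE_FILE = re.compile(r'^(?!http://).*(?:fq|fastq)(?:|\.gz)$')
RE_REMOTE = re.compile(r'^(?:https?|ftp|sftp)://(?:.*)')
RE_SRR = re.compile(r'^[SED]RR[0-9]+$')


def classify(value):
    """Label a table cell as remote URL, SRA run accession, local file or other."""
    if RE_REMOTE.match(value):
        return "remote"
    if RE_SRR.match(value):
        return "srr"
    if RE_FILE.match(value):
        return "file"
    return "other"


for cell in ["https://example.org/a.fq.gz", "SRR1234567", "data/a.fastq", "notes.txt"]:
    print(cell, classify(cell))
```
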
choose_fq_columns()[source]

Configures the columns referencing the fastq sources

choose_id_column()[source]

Configures column to use as index on runs

If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
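
The fallback rule (leftmost unique column) can be sketched as follows. The helper `leftmost_unique_column` and the sample table are hypothetical.

```python
def leftmost_unique_column(columns, rows):
    """Return the first column whose values are all distinct, or None."""
    for column in columns:
        values = [row[column] for row in rows]
        if len(set(values)) == len(values):
            return column
    return None


# Hypothetical project data: "Subject" repeats, "sample" does not
columns = ["Subject", "sample", "fq1"]
rows = [
    {"Subject": "A", "sample": "s1", "fq1": "s1.1.fq"},
    {"Subject": "A", "sample": "s2", "fq1": "s2.1.fq"},
    {"Subject": "B", "sample": "s3", "fq1": "s3.1.fq"},
]
print(leftmost_unique_column(columns, rows))
```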

property data

Pandas dataframe of runs

Lazy loading property, first call may take a while.

do_get_ids(_stack, groups, match_groups=None, match_values=None)[source]
docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

encode_barcode_path(barcode_file, run, pair)[source]
property fq_names

Names of all FastQ files

property fwd_fq_names

Names of forward FastQ files (se and pe)

property fwd_pe_fq_names

Names of forward FastQ files part of pair

get_all_targets(stack, output_types=None)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_fq_names(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]

Get pipeline names of fq files

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_values=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups – Set of columns the values of which should form IDs

  • match_value – Limit output to rows with this value

  • match_groups – … in these groups

property idcol
iter_samples(variables=None)[source]
minimize_variables(groups)[source]

Removes redundant groupings

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

property pe_fq_names

Names of paired end FastQ files

property project_name
raw_reads_source_path(args, kwargs)[source]
property rev_pe_fq_names

Names of reverse FastQ files part of pair

property runs

Pandas dataframe index of runs

Lazy loading property, first call may take a while.

property se_fq_names

Names of single end FastQ files

property source_cfg
source_path(target, pair, nosplit=False)[source]

Get path for FQ file for run and pair

unsplit_path(barcode_id, pairname)[source]
property variables
class ymp.stage.project.SQLiteProjectData(cfg, key, name='data')[source]

Bases: object

columns()[source]
property db_url
dump()[source]
duplicate_rows(column)[source]
fetch(cols, idcols=None, values=None)[source]
Return type

List[List[str]]

groupby_dedup(cols)[source]
identifying_columns()[source]
property nrows
query(*args)[source]
rows(col)[source]
string_columns()[source]
ymp.stage.reference module
class ymp.stage.reference.Archive(name, dirname, tar, url, strip, files)[source]

Bases: object

dirname = None
files = None
get_files()[source]
hash = None
make_unpack_rule(baserule)[source]
name = None
strip_components = None
tar = None
class ymp.stage.reference.Reference(name, cfg)[source]

Bases: ymp.stage.base.Activateable, ymp.stage.base.ConfigStage

Represents (remote) reference file/database configuration

add_resource(rsc)[source]
files: Dict[str, str]

Files provided by the reference. Keys are the file names within ymp (“target.extension”), symlinked into dir.ref/ref_name/ and values are the path to the reference file from workspace root.

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_file(filename)[source]
get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_value=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups (List[str]) – Set of columns the values of which should form IDs

  • match_value (Optional[str]) – Limit output to rows with this value

  • match_groups (Optional[List[str]]) – … in these groups

Return type

List[str]

get_path(_stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

make_unpack_rules(baserule)[source]
property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

prev(args=None, kwargs=None)[source]
rules: List[snakemake.rules.Rule]

Rules in this stage

this(args=None, kwargs=None)[source]
ymp.stage.stack module

Implements the StageStack

class ymp.stage.stack.StageStack(path)[source]

Bases: object

The “head” of a processing chain - a stack of stages

all_targets()[source]
complete(incomplete)[source]
debug = False

Set to True to enable additional Stack debug logging

property defined_in
get_ids(select_cols, where_cols=None, where_vals=None)[source]
group: List[str]

Grouping in effect for this StageStack. An empty list groups into one pseudo target, ‘ALL’.

classmethod instance(path)[source]

Cached access to StageStack

Parameters
  • path – Stage path

  • stage – Stage object at head of stack

name

Name of the stack, i.e. its full path

property path

On disk location of files provided by this stack

prev(_args=None, kwargs=None)[source]

Directory of previous stage

Return type

StageStack

prev_stage

Stage below top stage or None if first in stack

prevs

Mapping of each input type required by the stage of this stack to the prefix stack providing it.

project

Project on which the stack operates. This is currently needed for grouping variables.

resolve_prevs()[source]
show_info()[source]
stage

Top Stage

stage_name

Top Stage Name

stage_names

Names of stages on stack

stages

Stages on stack

target(args, kwargs)[source]

Determines the IDs for a given input data type and output ID (replaces “{:target:}”).

property targets

Determines the IDs to be built by this Stage Stack (replaces “{:targets:}”).

used_stacks = {}
ymp.stage.stack.find_stage(name)[source]
ymp.stage.stack.norm_wildcards(pattern)[source]
ymp.stage.stage module

Implements the “Stage”

At its most basic, a “Stage” is a set of Snakemake rules that share an output folder.

class ymp.stage.stage.Stage(name, altname=None, env=None, doc=None)[source]

Bases: ymp.snakemake.WorkflowObject, ymp.stage.params.Parametrizable, ymp.stage.base.Activateable, ymp.stage.base.BaseStage

Creates a new stage

While entered using with, several stage specific variables are expanded within rules:

  • {:this:} – The current stage directory

  • {:that:} – The alternate output stage directory

  • {:prev:} – The previous stage’s directory

Parameters
  • name (str) – Name of this stage

  • altname (Optional[str]) – Alternate name of this stage (used for stages with multiple output variants, e.g. filter_x and remove_x).

  • doc (Optional[str]) – See doc()

  • env (Optional[str]) – See env()

altname: str

Alternative stage name (deprecated)

bin(_args=None, kwargs=None)[source]

Dynamic ID for splitting stages

checkpoints: Dict[str, Set[str]]

Checkpoints in this stage

env(name)[source]

Add package specifications to Stage environment

Note

This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types not supporting conda environments

Parameters

name (str) – Environment name or filename

>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>    S.env("blast")
>>>    rule testing:
>>>       ...
>>> with Stage("test", env="blast") as S:
>>>    rule testing:
>>>       ...
>>> with Stage("test") as S:
>>>    rule testing:
>>>       conda: "blast"
>>>       ...
Return type

None

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_checkpoint_ids(stack, mygroup, target)[source]
get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, mygroups=None, target=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups – Set of columns the values of which should form IDs

  • match_value – Limit output to rows with this value

  • match_groups – … in these groups

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

has_checkpoint()[source]
Return type

bool

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Set[str]

prev(_args, kwargs)[source]

Gathers {:prev:} calls from rules

Here, input requirements for each stage are collected.

Return type

None

require(**kwargs)[source]

Override inferred stage inputs

In theory, this should not be needed. But it’s simpler for now.

requires

Contains override stage inputs

satisfy_inputs(other_stage, inputs)[source]
Return type

Dict[str, str]

that(_args=None, kwargs=None)[source]

Alternate directory of current stage

Used for splitting stages

this(args=None, kwargs=None)[source]

Replaces {:this:} in rules

Also gathers output capabilities of each stage.

wc2path(wc)[source]

Submodules

ymp.blast module

Parsers for blast output formats 6 (CSV) and 7 (CSV with comments between queries).

class ymp.blast.BlastBase[source]

Bases: object

Base class for BLAST readers and writers

FIELD_MAP = {'% identity': 'pident', 'alignment length': 'length', 'bit score': 'bitscore', 'evalue': 'evalue', 'gap opens': 'gapopen', 'mismatches': 'mismatch', 'q. end': 'qend', 'q. start': 'qstart', 'query acc.': 'qacc', 'query frame': 'qframe', 'query length': 'qlen', 's. end': 'send', 's. start': 'sstart', 'sbjct frame': 'sframe', 'score': 'score', 'subject acc.': 'sacc', 'subject strand': 'sstrand', 'subject tax ids': 'staxids', 'subject title': 'stitle'}

Map between field short and long names

FIELD_REV_MAP = {'bitscore': 'bit score', 'evalue': 'evalue', 'gapopen': 'gap opens', 'length': 'alignment length', 'mismatch': 'mismatches', 'pident': '% identity', 'qacc': 'query acc.', 'qend': 'q. end', 'qframe': 'query frame', 'qlen': 'query length', 'qstart': 'q. start', 'sacc': 'subject acc.', 'score': 'score', 'send': 's. end', 'sframe': 'sbjct frame', 'sstart': 's. start', 'sstrand': 'subject strand', 'staxids': 'subject tax ids', 'stitle': 'subject title'}

Reversed map from short to long name

FIELD_TYPE = {'bitscore': <class 'float'>, 'evalue': <class 'float'>, 'gapopen': <class 'int'>, 'length': <class 'int'>, 'mismatch': <class 'int'>, 'pident': <class 'float'>, 'qend': <class 'int'>, 'qframe': <class 'int'>, 'qlen': <class 'int'>, 'qstart': <class 'int'>, 'score': <class 'float'>, 'send': <class 'int'>, 'sframe': <class 'int'>, 'sstart': <class 'int'>, 'staxids': <function BlastBase.tupleofint>, 'stitle': <class 'str'>}

Map defining types of fields

tupleofint()[source]
class ymp.blast.BlastParser[source]

Bases: ymp.blast.BlastBase

Base class for BLAST readers

get_fields()[source]
class ymp.blast.BlastWriter[source]

Bases: ymp.blast.BlastBase

Base class for BLAST writers

write_hit(hit)[source]
class ymp.blast.Fmt6Parser(fileobj)[source]

Bases: ymp.blast.BlastParser

Parser for BLAST format 6 (CSV)

Hit

alias of ymp.blast.BlastHit

field_types = [None, None, <class 'float'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'float'>, <class 'float'>]
fields = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']

Default field types

get_fields()[source]
class ymp.blast.Fmt7Parser(fileobj)[source]

Bases: ymp.blast.BlastParser

Parses BLAST results in format ‘7’ (CSV with comments)

PAT_DATABASE = '# Database: '
PAT_FIELDS = '# Fields: '
PAT_HITSFOUND = ' hits found'
PAT_QUERY = '# Query: '
get_fields()[source]

Returns list of available field names

Format 7 specifies which columns it contains in comment lines, allowing this parser to be agnostic of the selection of columns made when running BLAST.

Return type

List[str]

Returns

List of field names (e.g. ['sacc', 'qacc', 'evalue'])
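
The header-driven field detection described above can be sketched as follows. This is a minimal illustration, not the ymp.blast implementation: the function name fields_from_header_sketch is hypothetical, the map is a subset copied from BlastBase.FIELD_MAP, and the assumption is that long field names in the "# Fields: " comment line are separated by ", ".

```python
# Subset of BlastBase.FIELD_MAP as listed above (long name -> short name).
FIELD_MAP = {
    "% identity": "pident",
    "alignment length": "length",
    "bit score": "bitscore",
    "evalue": "evalue",
    "query acc.": "qacc",
    "subject acc.": "sacc",
}
PAT_FIELDS = "# Fields: "


def fields_from_header_sketch(line):
    """Return short field names from a format 7 '# Fields: ' comment line.

    Returns None for any other line. Unknown long names are passed
    through unchanged (an assumption made for this sketch).
    """
    if not line.startswith(PAT_FIELDS):
        return None
    long_names = line[len(PAT_FIELDS):].strip().split(", ")
    return [FIELD_MAP.get(name, name) for name in long_names]


header = "# Fields: query acc., subject acc., % identity, evalue\n"
short_names = fields_from_header_sketch(header)
```

Because the column list is read from the file itself, a parser built this way does not need to know which columns were requested when BLAST was run.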

isfirsthit()[source]

Returns True if the current hit is the first hit for the current query

Return type

bool

class ymp.blast.Fmt7Writer(fileobj)[source]

Bases: ymp.blast.BlastWriter

write_header()[source]

Writes BLAST7 format header

write_hit(hit)[source]
write_hitset()[source]
ymp.blast.reader(fileobj, t=7)[source]

Creates a reader for files in BLAST format

>>> from ymp import blast
>>> with open(blast_file) as infile:
...    reader = blast.reader(infile)
...    for hit in reader:
...       print(hit)
Parameters
  • fileobj – iterable yielding lines in blast format

  • t (int) – number of blast format type

Return type

BlastParser

ymp.blast.writer(fileobj, t=7)[source]

Creates a writer for files in BLAST format

>>> from ymp import blast
>>> with open(blast_file, "w") as outfile:
...    writer = blast.writer(outfile)
...    for hit in hits:
...       writer.write_hit(hit)
Return type

BlastWriter

ymp.blast2gff module

ymp.cluster module

Module handling talking to cluster management systems

>>> python -m ymp.cluster slurm status <jobid>
class ymp.cluster.ClusterMS[source]

Bases: object

class ymp.cluster.Lsf[source]

Bases: ymp.cluster.ClusterMS

Talking to LSF

states = {'DONE': 'success', 'EXIT': 'failed', 'PEND': 'running', 'POST_DONE': 'success', 'POST_ERR': 'failed', 'PSUSP': 'running', 'RUN': 'running', 'SSUSP': 'running', 'UNKWN': 'running', 'USUSP': 'running', 'WAIT': 'running'}
static status(jobid)[source]
static submit(args)[source]
class ymp.cluster.Slurm[source]

Bases: ymp.cluster.ClusterMS

Talking to Slurm

states = {'BOOT_FAIL': 'failed', 'CANCELLED': 'failed', 'COMPLETED': 'success', 'COMPLETING': 'running', 'CONFIGURING': 'running', 'DEADLINE': 'failed', 'FAILED': 'failed', 'NODE_FAIL': 'failed', 'PENDING': 'running', 'PREEMPTED': 'failed', 'RESIZING': 'running', 'REVOKED': 'running', 'RUNNING': 'running', 'SPECIAL_EXIT': 'running', 'SUSPENDED': 'running', 'TIMEOUT': 'failed'}
static status(jobid)[source]

Print the status of job jobid to stdout (as needed by Snakemake)

Anecdotal benchmarking shows 200ms per invocation, half used by Python startup and half by calling sacct. Using scontrol show job instead of sacct -pbs is faster by 80ms, but finished jobs are purged after an unknown time window.
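
The translation from a scheduler state to the three status words Snakemake expects can be sketched as a plain dictionary lookup. The table below is copied from Slurm.states above; the function name snakemake_status_sketch is hypothetical, and two behaviours are assumptions of this sketch rather than documented YMP behaviour: only the first word of the state is used (sacct can report states such as "CANCELLED by 1234"), and unknown states are reported as "running".

```python
# State table copied from ymp.cluster.Slurm.states as documented above.
SLURM_STATES = {
    "BOOT_FAIL": "failed", "CANCELLED": "failed", "COMPLETED": "success",
    "COMPLETING": "running", "CONFIGURING": "running", "DEADLINE": "failed",
    "FAILED": "failed", "NODE_FAIL": "failed", "PENDING": "running",
    "PREEMPTED": "failed", "RESIZING": "running", "REVOKED": "running",
    "RUNNING": "running", "SPECIAL_EXIT": "running", "SUSPENDED": "running",
    "TIMEOUT": "failed",
}


def snakemake_status_sketch(slurm_state):
    """Map a Slurm job state string to 'success', 'failed' or 'running'."""
    words = slurm_state.split()
    key = words[0] if words else ""
    # Assumption: anything unrecognised is treated as still running.
    return SLURM_STATES.get(key, "running")
```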

ymp.cluster.error(*args, **kwargs)[source]

ymp.common module

Collection of shared utility classes and methods

class ymp.common.AttrDict[source]

Bases: dict

AttrDict adds accessing stored keys as attributes to dict
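
A minimal sketch of the idea, assuming nothing about the actual ymp.common.AttrDict beyond the one-line description above (the class name AttrDictSketch is hypothetical):

```python
class AttrDictSketch(dict):
    """dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so dict methods such as .items() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


cfg = AttrDictSketch(threads=4)
cfg.memory = "15G"  # stored as cfg["memory"]
```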

class ymp.common.Cache(root)[source]

Bases: object

close()[source]
commit()[source]
get_cache(name, clean=False, *args, **kwargs)[source]
load(cache, key)[source]
load_all(cache)[source]
store(cache, key, obj)[source]
class ymp.common.CacheDict(cache, name, *args, loadfunc=None, itemloadfunc=None, itemdata=None, **kwargs)[source]

Bases: ymp.common.AttrDict

get(key, default=None)[source]

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D’s items[source]
keys() → a set-like object providing a view on D’s keys[source]
values() → an object providing a view on D’s values[source]
class ymp.common.MkdirDict[source]

Bases: ymp.common.AttrDict

Creates directories as they are requested

class ymp.common.NoCache(root)[source]

Bases: object

close()[source]
commit()[source]
get_cache(name, clean=False, *args, **kwargs)[source]
load(_cache, _key)[source]
load_all(_cache)[source]
store(cache, key, obj)[source]
ymp.common.ensure_list(obj)[source]

Wrap obj in a list as needed

ymp.common.flatten(item)[source]

Flatten lists without turning strings into letters

ymp.common.format_number(num, unit='')[source]
Return type

int

ymp.common.format_time(seconds, unit=None)[source]

Prints time in SLURM format

Return type

str

ymp.common.is_container(obj)[source]

Check if object is container, considering strings not containers
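
The three helpers above (ensure_list, flatten, is_container) share one concern: strings are iterable in Python but should be treated as single values. The sketch below illustrates that behaviour under stated assumptions. The *_sketch names are hypothetical, and treating only list, tuple and set as containers is a simplification chosen for this example, not necessarily what ymp.common does.

```python
def is_container_sketch(obj):
    """True for list-like objects, False for strings and scalars."""
    return isinstance(obj, (list, tuple, set)) and not isinstance(obj, (str, bytes))


def ensure_list_sketch(obj):
    """Wrap obj in a list unless it already is a container."""
    if obj is None:
        return []  # assumption: None becomes an empty list
    return list(obj) if is_container_sketch(obj) else [obj]


def flatten_sketch(item):
    """Flatten nested containers, keeping strings whole."""
    if not is_container_sketch(item):
        return [item]
    result = []
    for sub in item:
        result.extend(flatten_sketch(sub))
    return result
```

A naive recursive flatten that only checks for iterability would split "ab" into "a" and "b"; the explicit string check avoids that.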

ymp.common.parse_number(s='')[source]

Basic 1k 1m 1g 1t parsing.

  • assumes base 2

  • returns “byte” value

  • accepts “1kib”, “1kb” or “1k”
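
The rules listed above can be sketched with a small regular expression. This is an illustration only: the name parse_size_sketch is hypothetical, and accepting upper-case input, decimal fractions and surrounding whitespace are assumptions of this sketch.

```python
import re

# Power of 1024 for each unit prefix (base 2, as stated above).
_POWERS = {"": 0, "k": 1, "m": 2, "g": 3, "t": 4}
_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?)(?:i?b)?\s*")


def parse_size_sketch(text):
    """Parse strings such as '1k', '1kb' or '1kib' into a byte count."""
    match = _SIZE_RE.fullmatch(text.lower())
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, prefix = match.groups()
    return int(float(number) * 1024 ** _POWERS[prefix])
```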

ymp.common.parse_time(timestr)[source]

Parses time in “SLURM” format

  • <minutes>

  • <minutes>:<seconds>

  • <hours>:<minutes>:<seconds>

  • <days>-<hours>

  • <days>-<hours>:<minutes>

  • <days>-<hours>:<minutes>:<seconds>

Return type

int
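
The six accepted forms can be handled by splitting on "-" and ":" and deciding from the presence of a day part how to read the remaining fields. The sketch below is illustrative: the name parse_slurm_time_sketch is hypothetical, and returning the duration in seconds is an assumption (the documentation above only states that an int is returned).

```python
def parse_slurm_time_sketch(timestr):
    """Convert a SLURM style time string to a number of seconds."""
    days, dash, rest = timestr.strip().rpartition("-")
    parts = [int(part) for part in rest.split(":")]
    if dash:
        # "<days>-<hours>[:<minutes>[:<seconds>]]": fields start at hours.
        if len(parts) > 3:
            raise ValueError(f"cannot parse time: {timestr!r}")
        hours, minutes, seconds = parts + [0] * (3 - len(parts))
        days = int(days)
    else:
        days = 0
        if len(parts) == 1:      # "<minutes>"
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:    # "<minutes>:<seconds>"
            hours, (minutes, seconds) = 0, parts
        elif len(parts) == 3:    # "<hours>:<minutes>:<seconds>"
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"cannot parse time: {timestr!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

Note that a bare number means minutes, while the first field after a day prefix means hours; this asymmetry is why the day part has to be checked first.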

ymp.config module

class ymp.config.ConfigExpander(config_mgr)[source]

Bases: ymp.snakemake.ColonExpander

class Formatter(expander)[source]

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(field_name, args, kwargs)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

class ymp.config.ConfigMgr(root, conffiles)[source]

Bases: object

Manages workflow configuration

This is a singleton object of which only one instance should be around at a given time. It is available in the rules files as icfg and via ymp.get_config() elsewhere.

ConfigMgr loads and maintains the workflow configuration as given in the ymp.yml files located in the workflow root directory, the user config folder (~/.ymp) and the installation etc folder.

CONF_DEFAULT_FNAME = '/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/latest/src/ymp/etc/defaults.yml'
CONF_FNAME = 'ymp.yml'
CONF_USER_FNAME = '/home/docs/.ymp/ymp.yml'
KEY_LIMITS = 'resource_limits'
KEY_PIPELINES = 'pipelines'
KEY_PROJECTS = 'projects'
KEY_REFERENCES = 'references'
RULE_MAIN_FNAME = '/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/latest/src/ymp/rules/Snakefile'
property absdir

Dictionary of absolute paths of named YMP directories

classmethod activate()[source]
property cluster

The YMP cluster configuration.

property conda
property dir

Dictionary of relative paths of named YMP directories

The directory paths are relative to the YMP root workdir.

property ensuredir

Dictionary of absolute paths of named YMP directories

Directories will be created on the fly as they are requested.

expand(item, **kwargs)[source]
classmethod find_config()[source]

Locates ymp config files and ymp root

The root ymp work dir is determined as the first (parent) directory containing a file named ConfigMgr.CONF_FNAME (default ymp.yml).

The stack of config files comprises 1. the default config ConfigMgr.CONF_DEFAULT_FNAME (etc/defaults.yml in the ymp package directory), 2. the user config ConfigMgr.CONF_USER_FNAME (~/.ymp/ymp.yml) and 3. the ymp.yml in the ymp root.

Returns

  • root – Root working directory

  • conffiles – list of active configuration files
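
The root lookup described for find_config (the first directory, walking upwards, that contains a file named ymp.yml) can be sketched with pathlib. The function name find_root_sketch is hypothetical, and returning None when no config file is found is an assumption of this sketch; the documentation does not say what find_config does in that case.

```python
from pathlib import Path

CONF_FNAME = "ymp.yml"  # value of ConfigMgr.CONF_FNAME as listed above


def find_root_sketch(start):
    """Return the closest directory at or above start containing ymp.yml."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        if (directory / CONF_FNAME).is_file():
            return directory
    return None  # assumption: signal "not found" with None
```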

classmethod instance()[source]

Returns the active Ymp ConfigMgr instance

property pairnames
property pipeline

Configure pipelines

property platform

Name of current platform (macos or linux)

property ref

Configure references

property rules
property shell

The shell used by YMP

Change by adding e.g. shell: /path/to/shell to ymp.yml.

property snakefiles

Snakefiles used under this config in parsing order

classmethod unload()[source]
property workflow
class ymp.config.OverrideExpander(cfgmgr)[source]

Bases: ymp.snakemake.BaseExpander

Override rule parameters, resources and threads using config values

Example

Set the wordsize parameter in the bmtagger_bitmask rule to 12:

ymp.yml
overrides:
  rules:
    bmtagger_bitmask:
      params:
        wordsize: 12
      resources:
        memory: 15G
      threads: 12
expand(rule, ruleinfo, **kwargs)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object, which is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level

types = {'params': typing.Mapping, 'resources': typing.Mapping, 'threads': <class 'int'>}
class ymp.config.ResourceLimitsExpander(cfg)[source]

Bases: ymp.snakemake.BaseExpander

Allows adjusting resources to local compute environment

Each config item defines processing for an item in resources: or the special resource threads. Each item may have a default value filled in for rules not defining the resource, min and max defining the lower and upper bounds, and a scale value applied to the default to adjust resources up or down globally. Values in time or “human readable” format may be parsed specially by passing the format values time or number, respectively. These values will also be reformatted, with the optional parameter unit defining the output format (k/g/m/t for numbers and minutes/seconds for time). Additional resource values may be generated from configured ones using the from keyword (e.g. to provide both mem_mb and mem_gb from a generic mem value).

static adjust_value(value, default, scale, minimum, maximum)[source]

Applies default, scale, minimum and maximum to a numeric value

Return type

Optional[int]
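
One way to read the description of default, scale, min and max is sketched below. The function name adjust_value_sketch is hypothetical, and the order of operations is an interpretation of the text rather than confirmed behaviour: the default is used only when the rule sets no value, scale is applied to that default, and the bounds are applied last to whatever value is in effect.

```python
def adjust_value_sketch(value, default=None, scale=None, minimum=None, maximum=None):
    """Fill in, scale and clamp a numeric resource value."""
    if value is None:
        value = default
        # Assumption: scale only affects a filled-in default.
        if value is not None and scale is not None:
            value = value * scale
    if value is None:
        return None  # nothing configured and nothing requested
    if minimum is not None:
        value = max(value, minimum)
    if maximum is not None:
        value = min(value, maximum)
    return int(value)
```

Clamping last means a configured maximum always holds, even if scaling would push a default above it.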

expand(rule, ruleinfo, **kwargs)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object, which is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level

Return type

None

expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field (str) – the field to check

Return type

bool

Returns

True if field should be expanded.

formatters = {'number': <function format_number>, 'time': <function format_time>}
parse_config(cfg)[source]

Parses limits config

parsers = {'number': <function parse_number>, 'time': <function parse_time>}

ymp.dna module

ymp.dna.nuc2aa(seq)
ymp.dna.nuc2num(seq)

ymp.download module

class ymp.download.DownloadThread[source]

Bases: object

get(url, dest, md5)[source]
main()[source]
terminate()[source]
class ymp.download.FileDownloader(block_size=4096, timeout=300, parallel=4, loglevel=30, alturls=None, retry=3)[source]

Bases: object

Manages download of a set of URLs

Downloads happen concurrently using asynchronous network IO.

Parameters
  • block_size (int) – Byte size of chunks to download

  • timeout (int) – Aiohttp cumulative timeout

  • parallel (int) – Number of files to download in parallel

  • loglevel (int) – Log level for messages sent to logging (errors are sent with loglevel+10)

  • alturls – List of regexps modifying URLs

  • retry (int) – Number of times to retry download
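
The interplay of the parallel and retry parameters can be illustrated with plain asyncio. This sketch does not use aiohttp and is not the FileDownloader implementation: download_all_sketch and fake_fetch are hypothetical names, the network call is replaced by a stand-in coroutine, and interpreting retry as "additional attempts after the first failure" is an assumption.

```python
import asyncio


async def download_all_sketch(urls, fetch, parallel=4, retry=3):
    """Run fetch(url) for every URL, at most `parallel` at a time."""
    semaphore = asyncio.Semaphore(parallel)

    async def fetch_one(url):
        async with semaphore:  # limits the number of concurrent fetches
            for attempt in range(retry + 1):
                try:
                    return await fetch(url)
                except OSError:
                    if attempt == retry:
                        raise  # retries exhausted

    return await asyncio.gather(*(fetch_one(url) for url in urls))


calls = {}


async def fake_fetch(url):
    """Stand-in for a real download; fails once for URLs ending in 'flaky'."""
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("flaky") and calls[url] == 1:
        raise OSError("temporary failure")
    await asyncio.sleep(0)
    return "data from " + url


results = asyncio.run(download_all_sketch(["a", "b/flaky"], fake_fetch, parallel=2))
```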

error(msg, *args, **kwargs)[source]

Send error to logger

Message is sent with a log level 10 higher than the default for this object.

Return type

None

get(urls, dest, md5s=None)[source]

Download a list of URLs

Parameters
Return type

None

log(msg, *args, modlvl=0, **kwargs)[source]

Send message to logger

Honors loglevel set for the FileDownloader object.

Parameters
  • msg (str) – The log message

  • modlvl (int) – Added to default logging level for object

Return type

None

static make_bar_format(desc_width=20, count_width=0, rate=False, eta=False, have_total=True)[source]

Construct bar_format for tqdm

Parameters
  • desc_width (int) – minimum space allocated for description

  • count_width (int) – min space for counts

  • rate (bool) – show rate to right of progress bar

  • eta (bool) – show eta to right of progress bar

  • have_total (bool) – whether a total exists (required to add percentage)

Return type

str

ymp.env module

This module manages the conda environments.

class ymp.env.CondaPathExpander(config, *args, **kwargs)[source]

Bases: ymp.snakemake.BaseExpander

Applies search path for conda environment specifications

File names supplied via rule: conda: "some.yml" are replaced with absolute paths if they are found in any searched directory. Each search_paths entry is appended to the directory containing the top level Snakefile and the directory checked for the filename. Thereafter, the stack of including Snakefiles is traversed backwards. If no file is found, the original name is returned.

expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

format(conda_env, *args, **kwargs)[source]

Format item using *args and **kwargs

class ymp.env.Env(env_file=None, workflow=None, env_dir=None, container_img=None, cleanup=None, name=None, packages=None, base='none', channels=None)[source]

Bases: ymp.snakemake.WorkflowObject, snakemake.deployment.conda.Env

Represents YMP conda environment

Snakemake expects the conda environments in a per-workflow directory configured by conda_prefix. YMP sets this value by default to ~/.ymp/conda, which has a greater chance of being on the same file system as the conda cache, allowing for hard linking of environment files.

Within the folder conda_prefix, each environment is created in a folder named by the hash of the environment definition file’s contents and the conda_prefix path. This class inherits from snakemake.deployment.conda.Env to ensure that the hash we use is identical to the one Snakemake will use during workflow execution.

The class provides additional features for updating environments, creating environments dynamically and executing commands within those environments.

Note

This is not called from within the execution. Snakemake instantiates its own Env object purely based on the filename.

Creates an inline defined conda environment

Parameters
  • name (Optional[str]) – Name of conda environment (and basename of file)

  • packages (Union[list, str, None]) – package(s) to be installed into environment. Version constraints can be specified in each package string separated from the package name by whitespace. E.g. "blast =2.6*"

  • channels (Union[list, str, None]) – channel(s) to be selected for the environment

  • base (str) – Select a set of default channels and packages to be added to the newly created environment. Sets are defined in conda.defaults in ymp.yml

create(dryrun=False, reinstall=False, nospec=False, noarchive=False)[source]

Ensure the conda environment has been created

Inherits from snakemake.deployment.conda.Env.create

Behavior of super class
  • Resolve remote file

  • If containerized, check environment path exists and return if true

  • Check for interrupted env create, delete if so

  • Return if environment exists

  • Install from archive if env_archive exists

  • Install using self.frontend if not_careful

Handling pre-computed environment specs

In addition to freezing environments by maintaining a copy of the package binaries, we allow maintaining a copy of the package binary URLs, from which the archive folder is populated on demand. We just download those to self.archive and pass on.

export(stream, typ='yml')[source]

Freeze environment

static get_installed_env_hashes()[source]
property installed
run(command)[source]

Execute command in environment

Returns exit code of command run.

set_prefix(prefix)[source]
update()[source]

Update conda environment

ymp.exceptions module

Exceptions raised by YMP

exception ymp.exceptions.YmpConfigError(obj, msg, key=None)[source]

Bases: ymp.exceptions.YmpLocateableError

Indicates an error in the ymp.yml config files

Parameters
  • obj (object) – Subtree of config causing error

  • msg (str) – The message to display

  • key (Optional[object]) – Key indicating part of obj causing error

  • exc – Upstream exception causing error

get_fileline()[source]

Retrieve filename and linenumber from object associated with exception

Returns

Tuple of filename and linenumber

exception ymp.exceptions.YmpException[source]

Bases: Exception

Base class of all YMP Exceptions

exception ymp.exceptions.YmpLocateableError(obj, msg, show_includes=True)[source]

Bases: ymp.exceptions.YmpPrettyException

Errors that have a file location to be shown

Parameters
  • obj (object) – The object causing the exception. Must have lineno and filename as these will be shown as part of the error message on the command line.

  • msg (str) – The message to display

  • show_includes (bool) – Whether or not the “stack” of includes should be printed.

get_fileline()[source]

Retrieve filename and linenumber from object associated with exception

Return type

Tuple[str, int]

Returns

Tuple of filename and linenumber

show(file=None)[source]
Return type

None

exception ymp.exceptions.YmpPrettyException(message)[source]

Bases: ymp.exceptions.YmpException, click.exceptions.ClickException, snakemake.exceptions.WorkflowError

Exception that does not lead to stack trace on CLI

Inheriting from ClickException makes click print only the self.msg value of the exception, rather than allowing Python to print a full stack trace.

This is useful for exceptions indicating usage or configuration errors. We use this, instead of click.UsageError and friends so that the exceptions can be caught and handled explicitly where needed.

Note that click will call the show method on this object to print the exception. The default implementation from click will just prefix the msg with Error:.

FIXME: This does not work if the exception is raised from within the snakemake workflow as snakemake.snakemake catches and reformats exceptions.

rule = None
snakefile = None
exception ymp.exceptions.YmpRuleError(obj, msg, show_includes=True)[source]

Bases: ymp.exceptions.YmpLocateableError

Indicates an error in the rules files

This could e.g. be a Stage or Environment defined twice.

exception ymp.exceptions.YmpStageError(msg)[source]

Bases: ymp.exceptions.YmpPrettyException

Indicates an error in the requested stage stack

show(file=None)[source]
Return type

None

exception ymp.exceptions.YmpSystemError(message)[source]

Bases: ymp.exceptions.YmpPrettyException

Indicates problem running YMP with available system software

exception ymp.exceptions.YmpUsageError(message)[source]

Bases: ymp.exceptions.YmpPrettyException

General usage error

exception ymp.exceptions.YmpWorkflowError(message)[source]

Bases: ymp.exceptions.YmpPrettyException

Indicates an error during workflow execution

E.g. failures to expand dynamic variables

ymp.gff module

Implements simple reader and writer for GFF (general feature format) files.

Unfinished

  • only supports one version, GFF 3.2.3.

  • no escaping

class ymp.gff.Attributes(ID, Name, Alias, Parent, Target, Gap, Derives_From, Note, Dbxref, Ontology_term, Is_circular)

Bases: tuple

Create new instance of Attributes(ID, Name, Alias, Parent, Target, Gap, Derives_From, Note, Dbxref, Ontology_term, Is_circular)

Alias

Alias for field number 2

Dbxref

Alias for field number 8

Derives_From

Alias for field number 6

Gap

Alias for field number 5

ID

Alias for field number 0

Is_circular

Alias for field number 10

Name

Alias for field number 1

Note

Alias for field number 7

Ontology_term

Alias for field number 9

Parent

Alias for field number 3

Target

Alias for field number 4

class ymp.gff.Feature(seqid, source, type, start, end, score, strand, phase, attributes)

Bases: tuple

Create new instance of Feature(seqid, source, type, start, end, score, strand, phase, attributes)

attributes

Alias for field number 8

end

Alias for field number 4

phase

Alias for field number 7

score

Alias for field number 5

seqid

Alias for field number 0

source

Alias for field number 1

start

Alias for field number 3

strand

Alias for field number 6

type

Alias for field number 2

class ymp.gff.reader(fileobj)[source]

Bases: object

class ymp.gff.writer(fileobj)[source]

Bases: object

write(feature)[source]
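
As an illustration of the record layout, the following minimal sketch rebuilds a tuple with the same field names as Feature and converts between it and a tab-separated GFF3 line. It does not use ymp.gff itself. The helpers parse_gff_line and format_gff_line are hypothetical names, and storing the attributes column as a plain dict (rather than an Attributes tuple) is an assumption made for brevity. Like the module, the sketch performs no escaping.

```python
from collections import namedtuple

# Field names copied from the Feature signature documented above.
Feature = namedtuple(
    "Feature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff_line(line):
    """Split one tab-separated GFF3 line into a Feature tuple (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    # Attributes are 'key=value' pairs separated by ';' (no unescaping here).
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair)
    # start and end are the only columns converted to int in this sketch.
    return Feature(*cols[:3], int(cols[3]), int(cols[4]), *cols[5:8], attrs)


def format_gff_line(feature):
    """Turn a Feature tuple back into a tab-separated line (hypothetical helper)."""
    attrs = ";".join(f"{key}={value}" for key, value in feature.attributes.items())
    return "\t".join(str(col) for col in feature[:8]) + "\t" + attrs


line = "chr1\tprodigal\tCDS\t10\t90\t.\t+\t0\tID=gene1;Name=abc"
feature = parse_gff_line(line)
print(feature.seqid, feature.start, feature.end, feature.attributes["ID"])
```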

ymp.helpers module

This module contains helper functions.

Not all of these are currently in use

class ymp.helpers.OrderedDictMaker[source]

Bases: object

odict creates OrderedDict objects in a dict-literal like syntax

>>>  my_ordered_dict = odict[
>>>    'key': 'value'
>>>  ]

Implementation: odict uses the Python slice syntax, which is similar to dict literals. The [] operator is implemented by overriding __getitem__. Slices passed to the operator as object[start1:stop1:step1, start2:...] reach the implementation as a sequence of slice objects with start, stop and step members. odict simply creates an OrderedDict by iterating over that sequence.
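
The slice trick described above can be sketched in a few lines. This is a minimal re-implementation for illustration, not the YMP source; the class name OrderedDictMakerSketch is hypothetical. It assumes that a single key: value pair arrives as one slice object and several pairs arrive as a tuple of slices, which is standard Python subscript behaviour.

```python
from collections import OrderedDict


class OrderedDictMakerSketch:
    """Builds an OrderedDict from odict['key': value, ...] syntax (illustrative only)."""

    def __getitem__(self, keys):
        # A single 'key': value pair is passed as one slice, not as a tuple.
        if isinstance(keys, slice):
            keys = (keys,)
        # slice.start holds the key, slice.stop holds the value.
        return OrderedDict((item.start, item.stop) for item in keys)


odict = OrderedDictMakerSketch()
single = odict['key': 'value']
several = odict['first': 1, 'second': 2, 'third': 3]
print(list(several.items()))
```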

ymp.helpers.update_dict(dst, src)[source]

Recursively update dictionary dst with src

  • Treats a list as atomic, replacing it with new list.

  • Dictionaries are overwritten by item

  • None is replaced by empty dict if necessary
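
The three rules above can be read as the following sketch. It is one possible interpretation written for illustration, not the YMP implementation: the function name update_dict_sketch is hypothetical, and the handling of None (a None value in dst is replaced by an empty dict when src supplies a dictionary for that key) is an assumption about what "if necessary" means.

```python
def update_dict_sketch(dst, src):
    """Recursively merge src into dst and return dst (illustrative only)."""
    for key, value in src.items():
        if isinstance(value, dict):
            # Assumption: None (or any non-dict) in dst makes way for a dict.
            if not isinstance(dst.get(key), dict):
                dst[key] = {}
            # Dictionaries are merged item by item.
            update_dict_sketch(dst[key], value)
        else:
            # Lists and scalars are atomic: the new value replaces the old one.
            dst[key] = value
    return dst


config = {"trim": {"length": 20, "quality": 10}, "samples": ["a", "b"], "extra": None}
override = {"trim": {"quality": 30}, "samples": ["c"], "extra": {"threads": 4}}
merged = update_dict_sketch(config, override)
print(merged)
```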

ymp.map2otu module

class ymp.map2otu.MapfileParser(minid=0)[source]

Bases: object

read(mapfiles)[source]
write(outfile)[source]
class ymp.map2otu.emirge_info(line)[source]

Bases: object

ymp.map2otu.main()[source]

ymp.nuc2aa module

ymp.nuc2aa.fasta_dna2aa(inf, outf)[source]
ymp.nuc2aa.nuc2aa(seq)[source]
ymp.nuc2aa.nuc2num(seq)[source]

ymp.snakemake module

Extends Snakemake Features

class ymp.snakemake.BaseExpander[source]

Bases: object

Base class for Snakemake expansion modules.

Subclasses should override the expand() method if they need to work on the entire RuleInfo object or the format() and expands_field() methods if they intend to modify specific fields.

expand(rule, item, expand_args=None, rec=- 1, cb=False)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object that is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level

expand_dict(rule, item, expand_args, rec)[source]
expand_func(rule, item, expand_args, rec, debug)[source]
expand_list(rule, item, expand_args, rec, cb)[source]
expand_ruleinfo(rule, item, expand_args, rec)[source]
expand_str(rule, item, expand_args, rec, cb)[source]
expand_tuple(rule, item, expand_args, rec, cb)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

format(item, *args, **kwargs)[source]

Format item using *args and **kwargs

format_annotated(item, expand_args)[source]

Wrapper for format() preserving AnnotatedString flags

Calls format() to format item into a new string and copies flags from original item.

This is used by expand()

exception ymp.snakemake.CircularReferenceException(deps, rule)[source]

Bases: ymp.exceptions.YmpRuleError

Exception raised if parameters in rule contain a circular reference

class ymp.snakemake.ColonExpander[source]

Bases: ymp.snakemake.FormatExpander

Expander using {:xyz:} formatted variables.

regex = re.compile('\n        \\{:\n            (?=(\n                \\s*\n                 (?P<name>(?:.(?!\\s*\\:\\}))*.)\n                \\s*\n            ))\\1\n        :\\}\n        ', re.VERBOSE)
spec = '{{:{}:}}'
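
To show what the two attributes above do, the sketch below copies the regex and spec values verbatim and applies them with the standard re module. The helper expand_colon_vars and the sample values are hypothetical; how ColonExpander itself looks up replacement values is not shown here.

```python
import re

# Pattern and spec copied from the ColonExpander attributes listed above.
regex = re.compile(r"""
    \{:
        (?=(
            \s*
             (?P<name>(?:.(?!\s*\:\}))*.)
            \s*
        ))\1
    :\}
    """, re.VERBOSE)
spec = "{{:{}:}}"


def expand_colon_vars(text, values):
    """Replace every {:name:} in text with values[name] (hypothetical helper)."""
    return regex.sub(lambda match: str(values[match.group("name")]), text)


# spec builds the variable syntax for a given name.
placeholder = spec.format("sample")
# Whitespace around the name is tolerated and not part of the captured name.
expanded = expand_colon_vars("{: sample :}.{:ext:}", {"sample": "A1", "ext": "fq.gz"})
print(placeholder, expanded)
```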
class ymp.snakemake.DefaultExpander(**kwargs)[source]

Bases: ymp.snakemake.InheritanceExpander

Adds default values to rules

The implementation simply makes all rules inherit from a defaults rule.

Creates DefaultExpander

Each parameter passed is considered a RuleInfo default value. Where applicable, Snakemake’s argtuples ([],{}) must be passed.

get_super(rule, ruleinfo)[source]

Find rule parent

Parameters
  • rule (Rule) – Rule object being built

  • ruleinfo (RuleInfo) – RuleInfo object describing rule being built

Returns

name of parent rule and RuleInfo describing parent rule or (None, None).

Return type

2-Tuple

exception ymp.snakemake.ExpandLateException[source]

Bases: Exception

class ymp.snakemake.ExpandableWorkflow(*args, **kwargs)[source]

Bases: snakemake.workflow.Workflow

Adds hook for additional rule expansion methods to Snakemake

Constructor for ExpandableWorkflow overlay attributes

This may be called on an already initialized Workflow object.

classmethod activate()[source]

Installs the ExpandableWorkflow

Replaces the Workflow object in the snakemake.workflow module with an instance of this class and initializes default expanders (the snakemake syntax).

add_rule(name=None, lineno=None, snakefile=None, checkpoint=False, allow_overwrite=False)[source]

Add a rule.

Parameters
  • name – name of the rule

  • lineno – line number within the snakefile where the rule was defined

  • snakefile – name of file in which rule was defined

classmethod clear()[source]
classmethod ensure_global_workflow()[source]
get_rule(name=None)[source]

Get rule by name. If name is none, the last created rule is returned.

Parameters

name – the name of the rule

global_workflow = <ymp.snakemake.ExpandableWorkflow object>
classmethod load_workflow(snakefile='/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/latest/src/ymp/rules/Snakefile')[source]
classmethod register_expanders(*expanders)[source]

Register objects whose expand() method will be called on each RuleInfo object before it is passed on to Snakemake.

rule(name=None, lineno=None, snakefile=None, checkpoint=None)[source]

Intercepts “rule:”. Here we have the entire ruleinfo object.

class ymp.snakemake.FormatExpander[source]

Bases: ymp.snakemake.BaseExpander

Expander using a custom formatter object.

class Formatter(expander)[source]

Bases: ymp.string.ProductFormatter

parse(format_string)[source]
format(*args, **kwargs)[source]

Format item using *args and **kwargs

get_names(pattern)[source]
regex = re.compile('\n        \\{\n            (?=(\n                (?P<name>[^{}]+)\n            ))\\1\n        \\}\n        ', re.VERBOSE)
spec = '{{{}}}'
exception ymp.snakemake.InheritanceException(msg, rule, parent, include=None, lineno=None, snakefile=None)[source]

Bases: snakemake.exceptions.RuleException

Exception raised for errors during rule inheritance

Creates a new instance of RuleException.

Arguments
  • message – the exception message

  • include – iterable of other exceptions to be included

  • lineno – the line the exception originates

  • snakefile – the file the exception originates

class ymp.snakemake.InheritanceExpander[source]

Bases: ymp.snakemake.BaseExpander

Adds class-like inheritance to Snakemake rules

To avoid redundancy between closely related rules, e.g. rules for single ended and paired end data, YMP allows Snakemake rules to inherit from another rule.

Example

Derived rules are always created with an implicit ruleorder statement, making Snakemake prefer the parent rule if either parent or child rule could be used to generate the requested output file(s).

Derived rules initially contain the same attributes as the parent rule. Each attribute assigned to the child rule overrides the matching attribute in the parent. Where attributes may contain named and unnamed values, specifying a named value overrides only the value of that name while specifying an unnamed value overrides all unnamed values in the parent attribute.
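
The override rule for attributes with named and unnamed values can be sketched as follows. This is a simplified model, not the InheritanceExpander code: attributes are represented as (args, kwargs) tuples, the function name override_attribute is hypothetical, and the file names are made up for illustration.

```python
def override_attribute(parent, child):
    """Combine one rule attribute of a parent and a child rule (illustrative only).

    Both arguments are (args, kwargs) tuples: args holds the unnamed values,
    kwargs the named ones.
    """
    parent_args, parent_kwargs = parent
    child_args, child_kwargs = child
    # Any unnamed value in the child replaces all unnamed values of the parent.
    args = list(child_args) if child_args else list(parent_args)
    # A named value in the child replaces only the value of the same name.
    kwargs = {**parent_kwargs, **child_kwargs}
    return args, kwargs


parent_input = (["{file}.R1.fq.gz", "{file}.R2.fq.gz"], {"index": "ref.idx", "adapters": "a.fa"})

# Child sets only a named value: unnamed values are inherited unchanged.
named_only = override_attribute(parent_input, ([], {"adapters": "b.fa"}))
# Child sets an unnamed value: all unnamed parent values are dropped.
unnamed_only = override_attribute(parent_input, (["{file}.fq.gz"], {}))
print(named_only)
print(unnamed_only)
```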

KEYWORD = 'ymp: extends'

Comment keyword enabling inheritance

expand(rule, ruleinfo)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object that is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level

get_code_line(rule)[source]

Returns the source line defining rule

Return type

str

get_super(rule, ruleinfo)[source]

Find rule parent

Parameters
  • rule (Rule) – Rule object being built

  • ruleinfo (RuleInfo) – RuleInfo object describing rule being built

Returns

name of parent rule and RuleInfo describing parent rule or (None, None).

Return type

2-Tuple

class ymp.snakemake.NamedList(fromtuple=None, **kwargs)[source]

Bases: snakemake.io.Namedlist

Extended version of Snakemake’s Namedlist

  • Fixes array assignment operator: Writing a field via [] operator updates the value accessed via . operator.

  • Adds fromtuple to constructor: Builds from Snakemake’s typical (args, kwargs) tuples as present in ruleinfo structures.

  • Adds update_tuple method: Updates values in (args,kwargs) tuples as present in ruleinfo structures.

get_names(*args, **kwargs)[source]

Export get_names as public func

update_tuple(totuple)[source]

Update values in (args, kwargs) tuple.

The tuple must be the same as used in the constructor and must not have been modified.

class ymp.snakemake.RecursiveExpander[source]

Bases: ymp.snakemake.BaseExpander

Recursively expands {xyz} wildcards in Snakemake rules.

expand(rule, ruleinfo)[source]

Recursively expand wildcards within RuleInfo object

expands_field(field)[source]

Returns true for all fields but shell:, message: and wildcard_constraints.

We don’t want to mess with the regular expressions in the fields in wildcard_constraints:, and there is little use in expanding message: or shell: as these already have all wildcards applied just before job execution (by format_wildcards()).
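
The general idea of recursive expansion, including the circular reference case, can be sketched as below. This is a simplified model working on a flat dict of strings, not the RecursiveExpander implementation: the function name expand_recursive is hypothetical, and a plain ValueError stands in for CircularReferenceException. Names that are not fields of the rule (real wildcards such as {sample}) are left untouched for later expansion.

```python
import re

FIELD = re.compile(r"\{([^{}]+)\}")


def expand_recursive(fields):
    """Expand {name} references between the values of fields (illustrative only)."""
    resolved = {}

    def resolve(name, stack):
        if name in resolved:
            return resolved[name]
        if name in stack:
            chain = " -> ".join(stack + (name,))
            raise ValueError(f"circular reference: {chain}")

        def replace(match):
            ref = match.group(1)
            if ref in fields:
                # Reference to another field: expand that field first.
                return resolve(ref, stack + (name,))
            # Unknown name: keep the placeholder for later wildcard expansion.
            return match.group(0)

        resolved[name] = FIELD.sub(replace, fields[name])
        return resolved[name]

    return {name: resolve(name, ()) for name in fields}


rule_fields = {
    "prefix": "{sample}.trimmed",
    "output": "{prefix}.bam",
    "log": "{output}.log",
}
expanded_fields = expand_recursive(rule_fields)
print(expanded_fields["log"])
```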

exception ymp.snakemake.RemoveValue[source]

Bases: Exception

Return to remove a value from the list

class ymp.snakemake.SnakemakeExpander[source]

Bases: ymp.snakemake.BaseExpander

Expand wildcards in strings returned from functions.

Snakemake does not do this by default, leaving wildcard expansion to the functions provided themselves. Since we never want {input} to be in a string returned as a file, we expand those always.

expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

format(item, *args, **kwargs)[source]

Format item using *args and **kwargs

class ymp.snakemake.WorkflowObject(*args, **kwargs)[source]

Bases: object

Base for extension classes defined from snakefiles

This currently encompasses ymp.env.Env and ymp.stage.stage.Stage.

This mixin sets the properties filename and lineno according to the definition source in the rules file. It also maintains a registry within the Snakemake workflow object and provides an accessor method to this registry.

property defined_in
filename

Name of file in which object was defined

Type

str

classmethod get_registry(clean=False)[source]

Return all objects of this class registered with current workflow

lineno

Line number of object definition

Type

int

classmethod new_registry()[source]
register()[source]

Add self to registry

ymp.snakemake.check_snakemake()[source]
Return type

bool

ymp.snakemake.get_workflow()[source]

Get active workflow, loading one if necessary

ymp.snakemake.load_workflow(snakefile)[source]

Load new workflow

ymp.snakemake.make_rule(name=None, lineno=None, snakefile=None, **kwargs)[source]
ymp.snakemake.networkx()[source]
ymp.snakemake.print_ruleinfo(rule, ruleinfo, func=<bound method Logger.debug of <Logger ymp.snakemake (WARNING)>>)[source]

Logs contents of Rule and RuleInfo objects.

Parameters
  • rule (Rule) – Rule object to be printed

  • ruleinfo (RuleInfo) – Matching RuleInfo object to be printed

  • func – Function used for printing (default is log.debug)

ymp.snakemake.ruleinfo_fields = {'benchmark': {'apply_wildcards': True, 'format': 'string'}, 'conda_env': {'apply_wildcards': True, 'format': 'string'}, 'container_img': {'format': 'string'}, 'docstring': {'format': 'string'}, 'func': {'format': 'callable'}, 'input': {'apply_wildcards': True, 'format': 'argstuple', 'funcparams': ('wildcards',)}, 'log': {'apply_wildcards': True, 'format': 'argstuple'}, 'message': {'format': 'string', 'format_wildcards': True}, 'norun': {'format': 'bool'}, 'output': {'apply_wildcards': True, 'format': 'argstuple'}, 'params': {'apply_wildcards': True, 'format': 'argstuple', 'funcparams': ('wildcards', 'input', 'resources', 'output', 'threads')}, 'priority': {'format': 'numeric'}, 'resources': {'format': 'argstuple', 'funcparams': ('wildcards', 'input', 'attempt', 'threads')}, 'script': {'format': 'string'}, 'shadow_depth': {'format': 'string_or_true'}, 'shellcmd': {'format': 'string', 'format_wildcards': True}, 'threads': {'format': 'int', 'funcparams': ('wildcards', 'input', 'attempt', 'threads')}, 'version': {'format': 'object'}, 'wildcard_constraints': {'format': 'argstuple'}, 'wrapper': {'format': 'string'}}

describes attributes of snakemake.workflow.RuleInfo

ymp.snakemakelexer module

ymp.snakemakelexer
class ymp.snakemakelexer.SnakemakeLexer(*args, **kwds)[source]

Bases: pygments.lexers.python.PythonLexer

name = 'Snakemake'

Name of the lexer

tokens = {'globalkeyword': [(<pygments.lexer.words object>, Token.Keyword)], 'root': [('(rule|checkpoint)((?:\\s|\\\\\\s)+)', <function bygroups.<locals>.callback>, 'rulename'), 'rulekeyword', 'globalkeyword', inherit], 'rulekeyword': [(<pygments.lexer.words object>, Token.Keyword)], 'rulename': [('[a-zA-Z_]\\w*', Token.Name.Class, '#pop')]}

Dict of {'state': [(regex, tokentype, new_state), ...], ...}

The initial state is ‘root’. new_state can be omitted to signify no state transition. If it is a string, the state is pushed on the stack and changed. If it is a tuple of strings, all states are pushed on the stack and the current state will be the topmost. It can also be combined('state1', 'state2', ...) to signify a new, anonymous state combined from the rules of two or more existing ones. Furthermore, it can be ‘#pop’ to signify going back one step in the state stack, or ‘#push’ to push the current state on the stack again.

The tuple can also be replaced with include('state'), in which case the rules from the state named by the string are included in the current one.

ymp.sphinxext module

This module contains a Sphinx extension for documenting YMP stages and Snakemake rules.

The SnakemakeDomain (name sm) provides the following directives:

.. sm:rule:: name

Describes a Snakemake rule

.. sm:stage:: name

Describes a YMP Stage

Both directives accept an optional source parameter. If given, a link to the source code of the stage or rule definition will be added. The format of the string passed is filename:line. Referenced Snakefiles will be highlighted with pygments and added to the documentation when building HTML.

The extension also provides an autodoc-like directive:

.. autosnake:: filename

Generates documentation from Snakefile filename.

class ymp.sphinxext.AutoSnakefileDirective(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: docutils.parsers.rst.Directive

Implements RSt directive .. autosnake:: filename

The directive extracts docstrings from rules in snakefile and auto-generates documentation.

has_content = False

This rule does not accept content

Type

bool

load_workflow(file_path)[source]

Load the Snakefile

Return type

ExpandableWorkflow

parse_doc(doc, source, idt=0)[source]

Convert doc string to StringList

Parameters
  • doc (str) – Documentation text

  • source (str) – Source filename

  • idt (int) – Result indentation in characters (default 0)

Return type

StringList

Returns

StringList of re-indented documentation wrapped in newlines

parse_rule(rule_name, idt=0)[source]

Convert Rule to StringList

Parameters
  • rule – Rule object

  • idt (int) – Result indentation in characters (default 0)

Returns

StringList containing formatted Rule documentation

Return type

StringList

parse_stage(stage, idt=0)[source]
Return type

StringList

required_arguments = 1

This rule needs one argument (the filename)

Type

int

run()[source]

Entry point

tpl_rule = '.. sm:rule:: {name}'

Template for generated Rule RSt

Type

str

tpl_source = '   :source: {filename}:{lineno}'

Template option source

Type

str

tpl_stage = '.. sm:stage:: {name}'

Template for generated Stage RSt

Type

str

ymp.sphinxext.BASEPATH = '/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/latest/src'

Path in which YMP package is located

Type

str

class ymp.sphinxext.CondaDomain(env)[source]

Bases: sphinx.domains.Domain

name = 'conda'

domain name: should be short, but unique

object_types: Dict[str, ObjType] = {'package': <sphinx.domains.ObjType object>}

type (usually directive) name -> ObjType instance

roles: Dict[str, Union[RoleFunction, XRefRole]] = {'package': <sphinx.roles.XRefRole object>}

role name -> role callable

class ymp.sphinxext.DomainTocTreeCollector[source]

Bases: sphinx.environment.collectors.EnvironmentCollector

Add Sphinx Domain entries to the TOC

clear_doc(app, env, docname)[source]

Clear data from environment

If we have cached data in environment for document docname, we should clear it here.

Return type

None

get_ref(node)[source]
Return type

Optional[Node]

locate_in_toc(app, node)[source]
Return type

Optional[Node]

make_heading(node)[source]
Return type

List[Node]

merge_other(app, env, docnames, other)[source]

Merge with results from parallel processes

Called if Sphinx is processing documents in parallel. We should merge this from other into env for all docnames.

Return type

None

process_doc(app, doctree)[source]

Process doctree

This is called by read-doctree, so after the doctree has been loaded. The signal is processed in registered first order, so we are called after built-in extensions, such as the sphinx.environment.collectors.toctree extension building the TOC.

Return type

None

select_doc_nodes(doctree)[source]

Select the nodes for which entries in the TOC are desired

This is a separate method so that it might be overridden by subclasses wanting to add other types of nodes to the TOC.

Return type

List[Node]

select_toc_location(app, node)[source]

Select location in TOC where node should be referenced

Return type

Node

toc_insert(docname, tocnode, node, heading)[source]
Return type

None

class ymp.sphinxext.SnakemakeDomain(env)[source]

Bases: sphinx.domains.Domain

Snakemake language domain

clear_doc(docname)[source]

Delete objects derived from file docname

data_version = 0

data version, bump this when the format of self.data changes

directives: Dict[str, Any] = {'rule': <class 'ymp.sphinxext.SnakemakeRule'>, 'stage': <class 'ymp.sphinxext.YmpStage'>}

directive name -> directive class

get_objects()[source]

Return an iterable of “object descriptions”.

Object descriptions are tuples with six items:

name

Fully qualified name.

dispname

Name to display when searching/linking.

type

Object type, a key in self.object_types.

docname

The document where it is to be found.

anchor

The anchor name for the object.

priority

How “important” the object is (determines placement in search results). One of:

1

Default priority (placed before full-text matches).

0

Object is important (placed before default-priority objects).

2

Object is unimportant (placed after full-text matches).

-1

Object should not show up in search at all.

initial_data: Dict = {'objects': {}}

data value for a fresh environment

label = 'Snakemake'

domain label: longer, more descriptive (used in messages)

name = 'sm'

domain name: should be short, but unique

object_types: Dict[str, ObjType] = {'rule': <sphinx.domains.ObjType object>, 'stage': <sphinx.domains.ObjType object>}

type (usually directive) name -> ObjType instance

resolve_xref(env, fromdocname, builder, typ, target, node, contnode)[source]

Resolve the pending_xref node with the given typ and target.

This method should return a new node, to replace the xref node, containing the contnode which is the markup content of the cross-reference.

If no resolution can be found, None can be returned; the xref node will then be given to the missing-reference event, and if that yields no resolution, replaced by contnode.

The method can also raise sphinx.environment.NoUri to suppress the missing-reference event being emitted.

roles: Dict[str, Union[RoleFunction, XRefRole]] = {'rule': <sphinx.roles.XRefRole object>, 'stage': <sphinx.roles.XRefRole object>}

role name -> role callable

class ymp.sphinxext.SnakemakeRule(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: sphinx.util.docutils.SphinxDirective, Generic[sphinx.directives.T]

Directive sm:rule:: describing a Snakemake rule

typename = 'rule'
class ymp.sphinxext.YmpObjectDescription(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: sphinx.util.docutils.SphinxDirective, Generic[sphinx.directives.T]

Base class for RSt directives in SnakemakeDomain

Since this inherits from Sphinx’ ObjectDescription, content generated by the directive will always be inside an addnodes.desc.

Parameters

source – Specify source position as file:line to create link

add_target_and_index(name, sig, signode)[source]

Add cross-reference IDs and entries to self.indexnode

Return type

None

get_index_text(typename, name)[source]

Formats object for entry into index

Return type

str

handle_signature(sig, signode)[source]

Parse rule signature sig into RST nodes and append them to signode.

The return value identifies the object and is passed to add_target_and_index() unchanged.

Parameters
  • sig (str) – Signature string (i.e. string passed after directive)

  • signode (desc) – Node created for object signature

Return type

str

Returns

Normalized signature (white space removed)

option_spec: Dict[str, DirectiveOption] = {'source': <function unchanged>}

Mapping of option names to validator functions.

typename = '[object name]'
class ymp.sphinxext.YmpStage(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: sphinx.util.docutils.SphinxDirective, Generic[sphinx.directives.T]

Directive sm:stage:: describing a YMP stage

typename = 'stage'
ymp.sphinxext.collect_pages(app)[source]

Add Snakefiles to documentation (in HTML mode)

ymp.sphinxext.relpath(path)[source]

Make absolute path relative to BASEPATH

Parameters

path (str) – absolute path

Return type

str

Returns

path relative to BASEPATH

ymp.sphinxext.setup(app)[source]

Register the extension with Sphinx

ymp.string module

exception ymp.string.FormattingError(message, fieldname)[source]

Bases: AttributeError

class ymp.string.GetNameFormatter[source]

Bases: string.Formatter

get_names(pattern)[source]
class ymp.string.OverrideJoinFormatter[source]

Bases: string.Formatter

Formatter with overridable join method

The default formatter joins all arguments with "".join(args). This class overrides _vformat() with identical code, changing only that line to one that can be overridden by a derived class.

join(args)[source]

Joins the expanded pieces of the template string to form the output.

This function is equivalent to ''.join(args). By overriding it, alternative methods can be implemented, e.g. to create a list of strings, each corresponding to one point in the cross product of the expanded variables.

Return type

Union[List[str], str]

class ymp.string.PartialFormatter[source]

Bases: string.Formatter

Formats what it can and leaves the remainder untouched

get_field(field_name, args, kwargs)[source]
class ymp.string.ProductFormatter[source]

Bases: ymp.string.OverrideJoinFormatter

String Formatter that creates a list of strings each expanded using one point in the cartesian product of all replacement values.

If none of the arguments evaluate to lists, the result is a string, otherwise it is a list.

>>> ProductFormatter().format("{A} and {B}", A=[1,2], B=[3,4])
"1 and 3"
"1 and 4"
"2 and 3"
"2 and 4"
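
The cartesian product behaviour can be approximated with the standard library alone. The sketch below is not the ProductFormatter implementation (which works by overriding join()); the function product_format is a hypothetical stand-in that assumes every field in the template is a plain keyword name. It returns a string when no value is a list and a list of strings otherwise.

```python
from itertools import product
from string import Formatter


def product_format(template, **values):
    """Format template once per combination of list-valued fields (illustrative only)."""
    # Collect field names in order of first appearance, without duplicates.
    names = list(dict.fromkeys(
        name for _, name, _, _ in Formatter().parse(template) if name
    ))
    if not any(isinstance(values[name], list) for name in names):
        return template.format(**values)
    # Scalars are treated as lists with a single choice.
    choices = [
        values[name] if isinstance(values[name], list) else [values[name]]
        for name in names
    ]
    return [
        template.format(**dict(zip(names, combination)))
        for combination in product(*choices)
    ]


pairs = product_format("{A} and {B}", A=[1, 2], B=[3, 4])
plain = product_format("{A} and {B}", A=1, B=3)
print(pairs)
print(plain)
```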
format_field(value, format_spec)[source]
join(args)[source]

Joins the expanded pieces of the template string to form the output.

This function is equivalent to ''.join(args). By overriding it, alternative methods can be implemented, e.g. to create a list of strings, each corresponding to one point in the cross product of the expanded variables.

Return type

Union[List[str], str]

class ymp.string.QuotedElementFormatter(*args, **kwargs)[source]

Bases: snakemake.utils.SequenceFormatter

class ymp.string.RegexFormatter(regex)[source]

Bases: string.Formatter

String Formatter accepting a regular expression defining the format of the expanded tags.

get_names(format_string)[source]

Get set of field names in format_string

Return type

Set[str]

parse(format_string)[source]

Parse format_string into tuples. Tuples contain:

  • literal_text: text to copy

  • field_name: followed by field name

  • format_spec:

  • conversion:

ymp.string.make_formatter(product=None, regex=None, partial=None, quoted=None)[source]
Return type

Formatter

ymp.util module

ymp.util.R(code='', **kwargs)[source]

Execute R code

This function executes the R code given as a string. Additional arguments are injected into the R environment. The value of the last R statement is returned.

The function requires rpy2 to be installed.

Parameters
  • code (str) – R code to be executed

  • **kwargs (dict) – variables to inject into R globalenv

Returns

value of last R statement

>>>  R("1*1", input=input)
ymp.util.Rmd(rmd, out, **kwargs)[source]
ymp.util.activate_R()[source]
ymp.util.check_input(names, minlines=0, minbytes=0)[source]
Return type

Callable

ymp.util.ensure_list(arg)[source]
ymp.util.fasta_names(fasta_file)[source]
ymp.util.file_not_empty(fn, minsize=1)[source]

Checks if a file is not empty, accounting for the gzip minimum size of 20 bytes

ymp.util.filter_input(name, also=None, join=None, minsize=None)[source]
Return type

Callable

ymp.util.filter_out_empty(*args)[source]

Removes empty sets of files from input file lists.

Takes a variable number of file lists of equal length and removes indices where any of the files is empty. Strings are converted to lists of length 1.

Returns a generator tuple.

Example: r1, r2 = filter_out_empty(input.r1, input.r2)
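
The filtering described above can be sketched as follows. This is an illustrative re-implementation, not the YMP function: filter_out_empty_sketch is a hypothetical name, emptiness is judged by plain file size via os.path.getsize (the gzip minimum size handled by file_not_empty is ignored), and lists are returned instead of generators.

```python
import os


def filter_out_empty_sketch(*file_lists, minsize=1):
    """Drop every index at which any of the files is empty (illustrative only)."""
    # A bare string counts as a list of length 1.
    lists = [[item] if isinstance(item, str) else list(item) for item in file_lists]
    # Keep an index only if every file at that index has at least minsize bytes.
    kept = [
        group for group in zip(*lists)
        if all(os.path.getsize(name) >= minsize for name in group)
    ]
    # Transpose back into one list per input argument.
    return tuple([group[i] for group in kept] for i in range(len(lists)))
```

Called as r1, r2 = filter_out_empty_sketch(input.r1, input.r2), a sample whose reverse read file is empty is removed from both r1 and r2, so the two lists stay aligned.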

ymp.util.glob_wildcards(pattern, files=None)[source]

Glob the values of the wildcards by matching the given pattern to the filesystem. Returns a named tuple with a list of values for each wildcard.

ymp.util.is_fq(path)[source]
ymp.util.make_local_path(icfg, url)[source]
ymp.util.read_propfiles(files)[source]

ymp.yaml module

class ymp.yaml.AttrItemAccessMixin[source]

Bases: object

Mixin class mapping dot to bracket access

Added to classes implementing __getitem__, __setitem__ and __delitem__, this mixin will allow accessing items using dot notation. I.e. “object.xyz” is translated to “object[xyz]”.
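
A mixin of this kind can be sketched as below. This is a minimal illustration, not the YMP class: the names AttrItemAccessSketch and Config are hypothetical, and translating KeyError into AttributeError is an assumption made so that hasattr() and similar lookups keep working.

```python
class AttrItemAccessSketch:
    """Maps attribute access to item access (illustrative only)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value

    def __delattr__(self, name):
        del self[name]


class Config(AttrItemAccessSketch, dict):
    """A dict whose items are also reachable with dot notation."""


cfg = Config()
cfg.threads = 4          # stored as cfg["threads"]
cfg["memory"] = "8G"     # readable as cfg.memory
print(cfg.threads, cfg.memory)
```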

class ymp.yaml.Entry(filename, yaml, index)[source]

Bases: object

exception ymp.yaml.LayeredConfAccessError(obj, msg, key=None, stack=None)[source]

Bases: ymp.yaml.LayeredConfError, KeyError, IndexError

Can’t access

exception ymp.yaml.LayeredConfError(obj, msg, key=None, stack=None)[source]

Bases: ymp.exceptions.YmpConfigError

Error in LayeredConf

get_fileline()[source]

Retrieve filename and linenumber from object associated with exception

Returns

Tuple of filename and linenumber

class ymp.yaml.LayeredConfProxy(maps, root=None, parent=None, key=None)[source]

Bases: ymp.yaml.MultiMapProxy

Layered configuration

save(outstream=None, layer=0)[source]
exception ymp.yaml.LayeredConfWriteError(obj, msg, key=None, stack=None)[source]

Bases: ymp.yaml.LayeredConfError

Can’t write

exception ymp.yaml.MixedTypeError(obj, msg, key=None, stack=None)[source]

Bases: ymp.yaml.LayeredConfError

Mixed types in proxy collection

class ymp.yaml.MultiMapProxy(maps, root=None, parent=None, key=None)[source]

Bases: ymp.yaml.MultiProxy, ymp.yaml.AttrItemAccessMixin, collections.abc.Mapping

Mapping Proxy for layered containers

get(k[, d]) → D[k] if k in D, else d. d defaults to None.[source]
get_paths(absolute=False)[source]
items() → a set-like object providing a view on D’s items[source]
keys() → a set-like object providing a view on D’s keys[source]
values() → an object providing a view on D’s values[source]
class ymp.yaml.MultiMapProxyItemsView(mapping)[source]

Bases: ymp.yaml.MultiMapProxyMappingView, collections.abc.ItemsView

ItemsView for MultiMapProxy

class ymp.yaml.MultiMapProxyKeysView(mapping)[source]

Bases: ymp.yaml.MultiMapProxyMappingView, collections.abc.KeysView

KeysView for MultiMapProxy

class ymp.yaml.MultiMapProxyMappingView(mapping)[source]

Bases: collections.abc.MappingView

MappingView for MultiMapProxy

class ymp.yaml.MultiMapProxyValuesView(mapping)[source]

Bases: ymp.yaml.MultiMapProxyMappingView, collections.abc.ValuesView

ValuesView for MultiMapProxy

class ymp.yaml.MultiProxy(maps, root=None, parent=None, key=None)[source]

Bases: object

Base class for layered container structure

add_layer(name, container)[source]
get_fileline(key=None)[source]
get_files()[source]
get_linenos()[source]
get_path(key=None, absolute=False)[source]
remove_layer(name)[source]
to_yaml(show_source=False)[source]
class ymp.yaml.MultiSeqProxy(maps, root=None, parent=None, key=None)[source]

Bases: ymp.yaml.MultiProxy, ymp.yaml.AttrItemAccessMixin, collections.abc.Sequence

Sequence Proxy for layered containers

extend(item)[source]
get_paths(absolute=False)[source]
class ymp.yaml.WorkdirTag(path)[source]

Bases: object

classmethod from_yaml(_constructor, node)[source]
classmethod to_yaml(representer, instance)[source]
yaml_tag = '!workdir'
ymp.yaml.load(files, root=None)[source]

Load configuration files

Creates a LayeredConfProxy configuration object from a set of YAML files.

Files listed later will override parts of earlier included files
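
The override order can be illustrated with collections.ChainMap from the standard library. This is an analogy only, not how LayeredConfProxy is implemented: the layer contents are made up, and unlike the "parts of" wording above, a ChainMap does not merge nested dictionaries.

```python
from collections import ChainMap

# Each dict stands for the parsed content of one YAML file, in load order.
layers = [
    {"threads": 1, "tmpdir": "/tmp"},   # listed first: lowest priority
    {"threads": 8},                      # listed later: overrides earlier files
]

# ChainMap searches its first mapping first, so the load order is reversed.
conf = ChainMap(*reversed(layers))
print(conf["threads"], conf["tmpdir"])
```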

ymp.yaml.resolve_installed_package(fname, stack)[source]

Indices and tables