
YMP - a Flexible Omics Pipeline

Welcome to the YMP documentation!

YMP is a tool that makes it easy to process large amounts of NGS read data. It comes “batteries included” with everything needed to preprocess your reads (QC, trimming, contaminant removal), assemble metagenomes, annotate assemblies, or assemble and quantify RNA-Seq transcripts, offering a choice of tools for each of those processing stages. When your needs exceed what the stock YMP processing stages provide, you can easily add your own, using YMP to drive novel tools, tools specific to your area of research, or tools you wrote yourself.

Features:

batteries included

YMP comes with a large number of Stages implementing common read processing steps. These stages cover the most common topics, including quality control, filtering and sorting of reads, assembly of metagenomes and transcripts, read mapping, community profiling, visualisation and pathway analysis.

For a complete list, check the documentation or the source.

get started quickly

Simply point YMP at a folder containing read files, at a mapping file, a list of URLs or even an SRA RunTable and YMP will configure itself. Use tab expansion to complete your desired series of stages to be applied to your data. YMP will then proceed to do your bidding, downloading raw read files and reference databases as needed, installing requisite software environments and scheduling the execution of tools either locally or on your cluster.

explore alternative workflows

Not sure which assembler works best for your data, or what the effect of more stringent quality trimming would be? YMP is made for this! By keeping the output of each stage in a folder named to match the stack of applied stages, YMP can manage many variant workflows in parallel, while minimizing the amount of duplicate computation and storage.
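The naming idea can be illustrated with a small sketch. It assumes, based only on targets such as toy.assemble_megahit.map_bbmap used elsewhere in this documentation, that a stage stack is written as dot-separated names starting with the project. The helper names split_target and shared_stack are hypothetical and not part of YMP.

```python
def split_target(target):
    """Split 'project.stage1.stage2' into the project name and its stage stack."""
    project, *stages = target.split(".")
    return project, stages


def shared_stack(target_a, target_b):
    """Return the leading part of the stack that two workflow variants have in common.

    Output for this shared part would only need to be computed and stored once.
    """
    common = []
    for part_a, part_b in zip(target_a.split("."), target_b.split(".")):
        if part_a != part_b:
            break
        common.append(part_a)
    return ".".join(common)


# Two variants that differ only in the assembler share the trimming output.
print(split_target("toy.trim_bbmap.assemble_megahit"))
print(shared_stack("toy.trim_bbmap.assemble_megahit",
                   "toy.trim_bbmap.assemble_metaspades"))
```

Under this assumption, comparing two assemblers after the same trimming step only adds the two assembly folders; the trimmed reads are reused.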

go beyond the beaten path

Built on top of Bioconda and Snakemake, YMP is easily extended with your own Snakefiles, allowing you to integrate any type of processing you desire into YMP, including your own custom-made tools. Within the YMP framework, you can also make use of the extensions to the Snakemake language provided by YMP (default values, inheritance, recursive wildcard expansion, etc.), making writing rules less error-prone and repetitive.

Background

Bioinformatical data processing workflows can easily get very complex, even convoluted. On the way from the raw read data to publishable results, a sizeable collection of tools needs to be applied, intermediate outputs verified, reference databases selected, and summary data produced. A host of data files must be managed, processed individually or aggregated by host or spatial transect along the way. And, of course, to arrive at a workflow that is just right for a particular study, many alternative workflow variants need to be evaluated. Which tools perform best? Which parameters are right? Does re-ordering steps make a difference? Should the data be assembled individually, grouped, or should a grand co-assembly be computed? Which reference database is most appropriate?

Answering these questions is a time-consuming process, justifying the plethora of published ready-made pipelines, each providing a polished workflow for a typical study type or use case. The price for the convenience of such a polished pipeline is the lack of flexibility: they are not meant to be adapted or extended to match the needs of a particular study. Workflow management systems, on the other hand, offer great flexibility by focussing on the orchestration of user-defined workflows, but typically require significant initial effort as they come without predefined workflows.

YMP strives to walk the middle ground between these. It brings everything needed for classic metagenome and RNA-Seq workflows, yet, being built on the workflow management system Snakemake, it can be easily expanded by simply adding Snakemake rules files. Designed around the needs of processing primarily multi-omic NGS read data, it brings a framework for handling read file meta data, provisioning reference databases, and organizing rules into semantic stages.

Installing and Updating YMP

Working with the GitHub Development Version

Installing from GitHub

  1. Clone the repository:

    git clone --recurse-submodules https://github.com/epruesse/ymp.git
    

    Or, if you have GitHub SSH keys set up:

    git clone --recurse-submodules git@github.com:epruesse/ymp.git
    
  2. Create and activate conda environment:

    conda env create -n ymp --file environment.yaml
    source activate ymp
    
  3. Install YMP into conda environment:

    pip install -e .
    
  4. Verify that YMP works:

    source activate ymp
    ymp --help
    

Updating Development Version

Usually, all you need to do is a pull:

git pull
git submodule update --recursive --remote

If environments were updated, you may want to regenerate the local installations and clean out environments no longer used to save disk space:

source activate ymp
ymp env update
ymp env clean
# alternatively, you can just delete existing envs and let YMP
# reinstall as needed:
# rm -rf ~/.ymp/conda*
conda clean -a

If you see errors before jobs are executed, the core requirements may have changed. To update the YMP conda environment, enter the folder where you installed YMP and run the following:

source activate ymp
conda env update --file environment.yaml

If something changed in setup.py, a re-install may be necessary:

source activate ymp
pip install -U -e .

Configuration

YMP reads its configuration from a YAML formatted file ymp.yml. To run YMP, you need to first tell it which datasets you want to process and where it can find them.

Getting Started

A simple configuration looks like this:

projects:
  myproject:
    data: mapping.csv

This tells YMP to look for a file mapping.csv located in the same folder as your ymp.yml listing the datasets for the project myproject. By default, YMP will use the left most unique column as names for your datasets and try to guess which columns point to your input data.

The matching mapping.csv might look like this:

sample,fq1,fq2
foot,sample1_1.fq.gz,sample1_2.fq.gz
hand,sample2_1.fq.gz,sample2_2.fq.gz

So we have two samples, foot and hand, and the read files for those in the same directory as the configuration file. Using relative or absolute paths you can point to any place in your filesystem. You can also use SRA references like SRR123456 or URLs pointing to remote files.
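To make the "left most unique column" default more concrete, here is a minimal sketch of how such a column could be picked from a comma separated mapping file. It is an illustration only and not YMP's actual implementation; the function name leftmost_unique_column is hypothetical.

```python
import csv
import io


def leftmost_unique_column(csv_text):
    """Return the name of the first column whose values are all distinct."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # fieldnames keeps the header order, i.e. left to right
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        if len(set(values)) == len(values):
            return name
    return None


MAPPING = """sample,fq1,fq2
foot,sample1_1.fq.gz,sample1_2.fq.gz
hand,sample2_1.fq.gz,sample2_2.fq.gz
"""

print(leftmost_unique_column(MAPPING))
```

For the mapping file shown above, the sample column is the first one without repeated values, so it would supply the dataset names.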

The mapping file itself may be in comma separated or tab separated format or may be an Excel file. For Excel files, you may specify the sheet to be used separated from the file name by a % sign. For example:

projects:
  myproject:
    data: myproject.xlsx%sheet3

The matching Excel file could then have a sheet3 with this content:

sample  fq1                          fq2                          srr
foot    /data/foot1.fq.gz            /data/foot2.fq.gz
hand                                                              SRR123456
head    http://datahost/head1.fq.gz  http://datahost/head2.fq.gz  SRR234234

For foot, the two gzipped FastQ files are used. The data for hand is retrieved from SRA and the data for head downloaded from datahost. The SRR number for head is ignored as the URL pair is found first.
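The file%sheet notation can be taken apart with a single string operation. The sketch below is only an illustration of the convention; split_sheet is a hypothetical helper, and it assumes the first % sign in the value is the separator.

```python
def split_sheet(data_ref):
    """Split 'file.xlsx%sheet' into (file name, sheet name or None)."""
    filename, separator, sheet = data_ref.partition("%")
    return filename, (sheet if separator else None)


print(split_sheet("myproject.xlsx%sheet3"))
print(split_sheet("mapping.csv"))
```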

Referencing Read Files

YMP will search your map file data for references to the read data files. It understands three types of references to your reads:

Local FastQ files: data/some_1.fq.gz, data/some_2.fq.gz

The file names should end in .fastq or .fq, optionally followed by .gz if your data is compressed. You need to provide forward and reverse reads in separate columns; the left most column is assumed to refer to the forward reads.

If the filename is relative (does not start with a /), it is assumed to be relative to the location of ymp.yml.

Remote FastQ files: http://myhost/some_1.fq.gz, http://myhost/some_2.fq.gz

If the filename starts with http:// or https://, YMP will download the files automatically.

Forward and reverse reads need to be either both local or both remote.

SRA Run IDs: SRR123456

Instead of giving names for FastQ files, you may provide SRA Run accessions, e.g. SRR123456 (or ERRnnn or DRRnnn for runs originally submitted to EMBL or DDBJ, respectively). YMP will use fastq-dump to download and extract the SRA files.

Which type to use is determined for each row in your map file data individually. From left to right, the first recognized data source is used in the order they are listed above.
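The three reference types can be told apart with simple pattern checks. The following sketch follows the rules stated above (file name suffix, URL scheme, run accession prefix) and scans a row from left to right. It is not YMP's own code, the names classify_reference and first_source are hypothetical, and it ignores the requirement that FastQ files come in forward/reverse pairs.

```python
import re

# SRR, ERR and DRR run accessions followed by digits
SRA_RUN = re.compile(r"^[SED]RR[0-9]+$")
# .fastq or .fq, optionally followed by .gz
FASTQ_SUFFIX = re.compile(r"\.(fastq|fq)(\.gz)?$")


def classify_reference(value):
    """Return 'remote', 'sra', 'local' or None for a single table cell."""
    value = value.strip()
    if value.startswith(("http://", "https://")):
        return "remote"
    if SRA_RUN.match(value):
        return "sra"
    if FASTQ_SUFFIX.search(value):
        return "local"
    return None


def first_source(row):
    """Scan the cells of one row left to right and report the first recognized type."""
    for cell in row:
        kind = classify_reference(cell)
        if kind is not None:
            return kind
    return None


print(first_source(["head", "http://datahost/head1.fq.gz",
                    "http://datahost/head2.fq.gz", "SRR234234"]))
```

Applied to the head row of the Excel example, the URL is found before the SRR number, which matches the behavior described there.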

Configuration processing an SRA RunTable:

projects:
  smith17:
    data:
      - SraRunTable.txt
    id_col: Sample_Name_s

Project Configuration

Each project must have a data key defining which mapping file(s) to load. This may be a simple string referring to the file (URLs are OK as well) or a more complex configuration.

Specifying Columns

By default, YMP will choose the columns to use as data set name and to locate the read data automatically. You can override this behavior by specifying the columns explicitly:

  1. Data set names: id_col: Sample

    The left most unique column may not always be the most informative to use as names for the datasets. In the above example, we specify the column to use explicitly with the line id_col: Sample_Name_s as the columns in SRA run tables are sorted alpha-numerically and the left most unique one may well contain random numeric data.

    Default: left most unique column

  2. Data set read columns: read_cols: [fq1, fq2]

    If your map files contain multiple references to source files, e.g. local and remote, and the order of preference used by YMP does not meet your needs you can restrict the search for suitable data references to a set of columns using the key read_cols.

    Default: all columns

Example
projects:
  smith17:
    data:
      - SraRunTable.txt
    id_col: Sample_Name_s
    read_cols: Run_s

Multiple Mapping Files per Project

To combine data sets from multiple mapping files, simply list the files under the data key:

projects:
  myproject:
    data:
      - sequencing_run_1.txt
      - sequencing_run_2.txt

The files should at least share one column containing unique values to use as names for the datasets.

If you need to merge meta-data spread over multiple files, you can use the join key:

projects:
  myproject:
    data:
      - join:
          - SraRunTable.txt
          - metadata.xlsx%reference_project
      - metadata.xlsx%our_samples

This will merge rows from SraRunTable.txt with rows in the reference_project sheet in metadata.xlsx if all columns of the same name contain the same data (natural join) and add samples from the our_samples sheet to the bottom of the list.
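As a reminder of what a natural join does, here is a small self-contained sketch operating on lists of dictionaries. It only illustrates the join semantics and is not how YMP reads or merges its mapping files; the column names and values in the example are made up.

```python
def natural_join(left_rows, right_rows):
    """Combine rows whose values agree in every column name the two rows share."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            shared = left.keys() & right.keys()
            if all(left[name] == right[name] for name in shared):
                joined.append({**left, **right})
    return joined


# Hypothetical contents of a run table and a metadata sheet
runs = [
    {"Run_s": "SRR1", "Sample": "foot"},
    {"Run_s": "SRR2", "Sample": "hand"},
]
metadata = [
    {"Sample": "foot", "Site": "left"},
    {"Sample": "hand", "Site": "right"},
]

for row in natural_join(runs, metadata):
    print(row)
```

Each run is matched to the metadata row with the same Sample value, and the result carries the columns of both tables.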

Complete Example

projects:
  myproject:
    data:
      - join:
          - SraRunTable.txt
          - metadata.xlsx%reference_project
      - metadata.xlsx%our_samples
      - mapping.csv
    id_col: Sample
    read_cols:
      - fq1
      - fq2
      - Run_s

Command Line

ymp

Welcome to YMP!

Please find the full manual at https://ymp.readthedocs.io

ymp [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

--version

Show the version and exit.

--install-completion

Install command completion for the current shell. Make sure to have psutil installed.

--profile <profile>

Profile execution time using Yappi

env

Manipulate conda software environments

These commands allow accessing the conda software environments managed by YMP. Use e.g.

>>> $(ymp env activate multiqc)

to enter the software environment for multiqc.

ymp env [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

activate

source activate environment

Usage: $(ymp env activate [ENVNAME])

ymp env activate [OPTIONS] ENVNAME

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAME

Required argument

clean

Remove unused conda environments

ymp env clean [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-a, --all

Delete all environments

Arguments

ENVNAMES

Optional argument(s)

export

Export conda environments

Resolved package specifications for the selected conda environments can be exported either in YAML format suitable for use with conda env create -f FILE or in TXT format containing a list of URLs suitable for use with conda create --file FILE. Please note that the TXT format is platform specific.

If other formats are desired, use ymp env list to view the environments’ installation path (“prefix” in conda lingo) and export the specification with the conda command line utility directly.

Note:
Environments must be installed before they can be exported. This is due
to limitations of the conda utilities. Use the “--create-missing” flag to
automatically install missing environments.

ymp env export [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-d, --dest <FILE>

Destination file or directory. If a directory, file names will be derived from environment names and selected export format. Default: print to standard output.

-f, --overwrite

Overwrite existing files

-c, --create-missing

Create environments not yet installed

-s, --skip-missing

Skip environments not yet installed

-t, --filetype <filetype>

Select export format. Default: yml unless FILE ends in ‘.txt’

Options

yml|txt

Arguments

ENVNAMES

Optional argument(s)

install

Install conda software environments

ymp env install [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-p, --conda-prefix <conda_prefix>

Override location for conda environments

-e, --conda-env-spec <conda_env_spec>

Override conda env specs settings

-n, --dry-run

Only show what would be done

-f, --force

Install environment even if it already exists

Arguments

ENVNAMES

Optional argument(s)

list

List conda environments

ymp env list [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

--static, --no-static

List environments statically defined via env.yml files

--dynamic, --no-dynamic

List environments defined inline from rule files

-a, --all

List all environments, including outdated ones.

-s, --sort <sort_col>

Sort by column

Options

name|hash|path|installed

-r, --reverse

Reverse sort order

Arguments

ENVNAMES

Optional argument(s)

prepare

Create envs needed to build target

ymp env prepare [OPTIONS] TARGET_FILES

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-n, --dryrun

Only show what would be done

-p, --printshellcmds

Print shell commands to be executed on shell

-k, --keepgoing

Don’t stop after failed job

--lock, --no-lock

Use/don’t use locking to prevent clobbering of files by parallel instances of YMP running

--rerun-incomplete, --ri

Re-run jobs left incomplete in last run

-F, --forceall

Force rebuilding of all stages leading to target

-f, --force

Force rebuilding of target

--notemp

Do not remove temporary files

-t, --touch

Only touch files, faking update

--shadow-prefix <shadow_prefix>

Directory to place data for shadowed rules

-r, --reason

Print reason for executing rule

-N, --nohup

Don’t die once the terminal goes away.

Arguments

TARGET_FILES

Optional argument(s)

remove

Remove conda environments

ymp env remove [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAMES

Optional argument(s)

run

Execute COMMAND with activated environment ENV

Usage: ymp env run <ENV> [--] <COMMAND…>

(Use the “--” if your command line contains option type parameters beginning with - or --)

ymp env run [OPTIONS] ENVNAME [COMMAND]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

Arguments

ENVNAME

Required argument

COMMAND

Optional argument(s)

update

Update conda environments

ymp env update [OPTIONS] [ENVNAMES]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

--reinstall <reinstall>

Remove and reinstall environments rather than trying to update

Arguments

ENVNAMES

Optional argument(s)

init

Initialize YMP workspace

ymp init [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

cluster

Set up cluster

ymp init cluster [OPTIONS]

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-y, --yes

Confirm every prompt

demo

Copies YMP tutorial data into the current working directory

ymp init demo [OPTIONS]

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

project
ymp init project [OPTIONS] [NAME]

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-y, --yes

Confirm every prompt

Arguments

NAME

Optional argument

make

Build target(s) locally

ymp make [OPTIONS] TARGET_FILES

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-n, --dryrun

Only show what would be done

-p, --printshellcmds

Print shell commands to be executed on shell

-k, --keepgoing

Don’t stop after failed job

--lock, --no-lock

Use/don’t use locking to prevent clobbering of files by parallel instances of YMP running

--rerun-incomplete, --ri

Re-run jobs left incomplete in last run

-F, --forceall

Force rebuilding of all stages leading to target

-f, --force

Force rebuilding of target

--notemp

Do not remove temporary files

-t, --touch

Only touch files, faking update

--shadow-prefix <shadow_prefix>

Directory to place data for shadowed rules

-r, --reason

Print reason for executing rule

-N, --nohup

Don’t die once the terminal goes away.

-j, --cores <CORES>

The number of parallel threads used for scheduling jobs

--dag

Print the Snakemake execution DAG and exit

--rulegraph

Print the Snakemake rule graph and exit

--debug-dag

Show candidates and selections made while the rule execution graph is being built

--debug

Set the Snakemake debug flag

Arguments

TARGET_FILES

Optional argument(s)

show

Show configuration properties

ymp show [OPTIONS] PROPERTY

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-h, --help
-s, --source

Show source

Arguments

PROPERTY

Optional argument

stage

Manipulate YMP stages

ymp stage [OPTIONS] COMMAND [ARGS]...

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

list

List available stages

ymp stage list [OPTIONS] STAGE

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-l, --long

Show full stage descriptions

-s, --short

Show only stage names

-c, --code

Show definition file name and line number

-t, --types

Show input/output types

Arguments

STAGE

Optional argument(s)

submit

Build target(s) on cluster

The parameters for cluster execution are drawn from layered profiles. YMP includes base profiles for the “torque” and “slurm” cluster engines.

ymp submit [OPTIONS] TARGET_FILES

Options

-P, --pdb

Drop into debugger on uncaught exception

-q, --quiet

Decrease log verbosity

-v, --verbose

Increase log verbosity

--log-file <log_file>

Specify a log file

-n, --dryrun

Only show what would be done

-p, --printshellcmds

Print shell commands to be executed on shell

-k, --keepgoing

Don’t stop after failed job

--lock, --no-lock

Use/don’t use locking to prevent clobbering of files by parallel instances of YMP running

--rerun-incomplete, --ri

Re-run jobs left incomplete in last run

-F, --forceall

Force rebuilding of all stages leading to target

-f, --force

Force rebuilding of target

--notemp

Do not remove temporary files

-t, --touch

Only touch files, faking update

--shadow-prefix <shadow_prefix>

Directory to place data for shadowed rules

-r, --reason

Print reason for executing rule

-N, --nohup

Don’t die once the terminal goes away.

-P, --profile <NAME>

Select cluster config profile to use. Overrides cluster.profile setting from config.

-c, --snake-config <FILE>

Provide snakemake cluster config file

-d, --drmaa

Use DRMAA to submit jobs to cluster. Note: Make sure you have a working DRMAA library. Set DRMAA_LIBRARY_PATH if necessary.

-s, --sync

Use synchronous cluster submission, keeping the submit command running until the job has completed. Adds qsub_sync_arg to cluster command

-i, --immediate

Use immediate submission, submitting all jobs to the cluster at once.

--command <CMD>

Use CMD to submit job script to the cluster

--wrapper <CMD>

Use CMD as script submitted to the cluster. See Snakemake documentation for more information.

--max-jobs-per-second <N>

Limit the number of jobs submitted per second

-l, --latency-wait <T>

Time in seconds to wait after job completed until files are expected to have appeared in local file system view. On NFS, this time is governed by the acdirmax mount option, which defaults to 60 seconds.

-J, --cluster-cores <N>

Limit the maximum number of cores used by jobs submitted at a time

-j, --cores <N>

Number of local threads to use

--args <ARGS>

Additional arguments passed to cluster submission command. Note: Make sure the first character of the argument is not ‘-‘, prefix with ‘ ‘ as necessary.

--scriptname <NAME>

Set the name template used for submitted jobs

Arguments

TARGET_FILES

Optional argument(s)

Stages

Listing of stages implemented in YMP

.. sm:stage:: 
   :source: ymp/rules/00_import.rules:64

   Imports raw read files into YMP.

   >>> ymp make toy
   >>> ymp make mpic
stage annotate_blast[source]

Annotate sequences with BLAST

Searches a reference database for hits with blastn. Use the E flag to specify the exponent of the required E-value. Use N or Mega to specify default. Use Best to add the -subject_besthit flag.

stage annotate_diamond[source]

FIXME

stage annotate_prodigal[source]

Call genes using prodigal

>>> ymp make toy.ref_genome.annotate_prodigal
stage annotate_tblastn[source]

Runs tblastn

stage assemble_megahit[source]

Assemble metagenome using MegaHit.

>>> ymp make toy.assemble_megahit.map_bbmap
>>> ymp make toy.group_ALL.assemble_megahit.map_bbmap
>>> ymp make toy.group_Subject.assemble_megahit.map_bbmap
stage assemble_metaspades[source]

Assemble reads using metaspades

>>> ymp make toy.assemble_metaspades
>>> ymp make toy.group_ALL.assemble_metaspades
>>> ymp make toy.group_Subject.assemble_metaspades
stage assemble_trinity[source]
stage bin_metabat2[source]

Bin metagenome assembly into MAGs

stage check[source]

Verify file availability

This stage provides rules for checking the file availability at a given point in the stage stack.

Mainly useful for testing and debugging.

stage cluster_cdhit[source]

Clusters protein sequences using CD-HIT

>>> ymp make toy.ref_query.cluster_cdhit
stage correct_bbmap[source]

Correct read errors by overlapping inside tails

Applies BBMap's “bbmerge.sh ecco” mode. This will overlap the inside of read pairs and choose the base with the higher quality where the alignment contains mismatches and increase the quality score as indicated by the double observation where the alignment contains matches.

>>> ymp make toy.correct_bbmap
>>> ymp make mpic.correct_bbmap
stage count_diamond[source]
stage count_stringtie[source]
stage coverage_samtools[source]

Computes coverage from a sorted bam file using samtools coverage

stage dedup_bbmap[source]

Remove duplicate reads

Applies BBMap’s “dedupe.sh”

>>> ymp make toy.dedup_bbmap
>>> ymp make mpic.dedup_bbmap
stage dust_bbmap[source]

Perform entropy filtering on reads using BBMap’s bbduk.sh

The parameter Enn gives the entropy cutoff. Higher values filter more sequences.

>>> ymp make toy.dust_bbmap
>>> ymp make toy.dust_bbmapE60
stage extract_reads[source]

Extract reads from BAM file using samtools fastq.

Parameters fn, Fn and Gn are passed through. Some options include:

  • f2: fully mapped (only proper pairs)

  • F2: not fully mapped (unmapped at least one read)

  • f12: not mapped (neither read mapped)
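The numbers in these options are SAM flag bit masks: 2 marks a proper pair, 4 an unmapped read and 8 an unmapped mate, so 12 is 4 + 8. The sketch below shows how such masks select reads, assuming the stage parameters fn, Fn and Gn correspond to the samtools options -f, -F and -G. The function keep_read is a hypothetical illustration, not part of YMP or samtools.

```python
PROPER_PAIR = 0x2    # each segment properly aligned
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def keep_read(flag, f=0, F=0, G=0):
    """Decide whether a read with the given SAM flag passes the filters.

    f: keep only reads with ALL of these bits set
    F: drop reads with ANY of these bits set
    G: drop reads with ALL of these bits set
    """
    if (flag & f) != f:
        return False
    if flag & F:
        return False
    if G and (flag & G) == G:
        return False
    return True


# 99 = paired + proper pair + mate reverse + first in pair
print(keep_read(99, f=PROPER_PAIR))
# 77 = paired + read unmapped + mate unmapped + first in pair
print(keep_read(77, f=READ_UNMAPPED | MATE_UNMAPPED))
```

A read whose mate mapped but which is itself unmapped has only one of the two bits in 12 set, so it is not selected by f12 but is selected by F2.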

stage extract_seqs[source]

Extract sequences from .fasta.gz file using samtools faidx

Currently requires a .blast7 file as input.

Use parameter Nomatch to instead keep unmatched sequences.

stage filter_bmtagger[source]

Filter(-out) contaminant reads using BMTagger

>>> ymp make toy.ref_phiX.index_bmtagger.remove_bmtagger
>>> ymp make toy.ref_phiX.index_bmtagger.remove_bmtagger.assemble_megahit
>>> ymp make toy.ref_phiX.index_bmtagger.filter_bmtagger
>>> ymp make mpic.ref_phiX.index_bmtagger.remove_bmtagger
stage format_bbmap[source]

Process sequences with BBMap’s format.sh

Parameter Ln filters sequences at a minimum length.

>>> ymp make toy.assemble_metaspades.format_bbmapL200
stage humann2[source]

Compute functional profiles using HUMAnN2

stage index_bbmap[source]
>>> ymp make toy.ref_genome.index_bbmap
stage index_blast[source]
stage index_bmtagger[source]
stage index_bowtie2[source]
>>> ymp make toy.ref_genome.index_bowtie2
stage index_diamond[source]
stage map_bbmap[source]

Map reads using BBMap

>>> ymp make toy.assemble_megahit.map_bbmap
>>> ymp make toy.ref_genome.map_bbmap
>>> ymp make mpic.ref_ssu.map_bbmap
stage map_bowtie2[source]

Map reads using Bowtie2

>>> ymp make toy.ref_genome.index_bowtie2.map_bowtie2
>>> ymp make toy.assemble_megahit.index_bowtie2.map_bowtie2
>>> ymp make toy.group_Subject.assemble_megahit.index_bowtie2.map_bowtie2
>>> ymp make mpic.ref_ssu.index_bowtie2.map_bowtie2
stage map_diamond[source]
stage map_hisat2[source]

Map reads using Hisat2

stage map_star[source]

Map RNA-Seq reads with STAR

stage metaphlan2[source]

Assess metagenome community composition using Metaphlan 2

stage primermatch_bbmap[source]

Filters reads by matching reference primer

>>> ymp make mpic.ref_primers.primermatch_bbmap
stage profile_centrifuge[source]

Classify reads using centrifuge

stage qc_fastqc[source]

Quality screen reads using FastQC

>>> ymp make toy.qc_fastqc
stage qc_multiqc[source]

Aggregate QC reports using MultiQC

stage qc_quast[source]

Estimate assembly quality using Quast

stage quant_rsem[source]

Quantify transcripts using RSEM

stage references[source]

This is a “virtual” stage. It does not process read data, but comprises rules used for reference provisioning.

stage remove_bbmap[source]

Filter reads by reference

This stage aligns the reads with a given reference using BBMap in fast mode. Matching reads are collected in the stage filter_bbmap and remaining reads are collected in the stage remove_bbmap.

>>> ymp make toy.ref_phiX.index_bbmap.remove_bbmap
>>> ymp make toy.ref_phiX.index_bbmap.filter_bbmap
>>> ymp make mpic.ref_phiX.index_bbmap.remove_bbmap
stage sort_bam[source]
stage split_library[source]

Demultiplexes amplicon sequencing files

This rule is treated specially. If a configured project specifies a barcode_col, reads from the file (or files) are used in combination with

stage trim_bbmap[source]

Trim adapters and low quality bases from reads

Applies BBMap’s “bbduk.sh”.

Parameters:

  • A: append to enable adapter trimming

  • Q20: append to select phred score cutoff (default 20)

  • L20: append to select minimum read length (default 20)

>>> ymp make toy.trim_bbmap
>>> ymp make toy.trim_bbmapA
>>> ymp make toy.trim_bbmapAQ10L10
>>> ymp make mpic.trim_bbmap
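Stage parameters such as those in trim_bbmapAQ10L10 are appended directly to the stage name. As a rough sketch of how such a suffix could be decoded, the following parser handles the three trim_bbmap parameters. It assumes the parameters appear in the order A, Q, L as in the examples above; the function parse_trim_bbmap and the dictionary keys are hypothetical and do not reflect YMP internals.

```python
import re

# Optional 'A', then optional 'Q<number>', then optional 'L<number>'
SUFFIX = re.compile(r"^(?P<adapter>A)?(?:Q(?P<qual>\d+))?(?:L(?P<length>\d+))?$")


def parse_trim_bbmap(stage):
    """Decode a stage name like 'trim_bbmapAQ10L10' into its parameters."""
    name = "trim_bbmap"
    if not stage.startswith(name):
        raise ValueError(f"not a {name} stage: {stage}")
    match = SUFFIX.match(stage[len(name):])
    if match is None:
        raise ValueError(f"unrecognized parameters in: {stage}")
    return {
        "adapter_trimming": match["adapter"] is not None,
        "min_quality": int(match["qual"] or 20),   # documented default 20
        "min_length": int(match["length"] or 20),  # documented default 20
    }


print(parse_trim_bbmap("trim_bbmapAQ10L10"))
```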
stage trim_sickle[source]

Perform read trimming using Sickle

>>> ymp make toy.trim_sickle
>>> ymp make toy.trim_sickleQ10L10
>>> ymp make mpic.trim_sickleL20
stage trim_trimmomatic[source]

Adapter trim reads using trimmomatic

>>> ymp make toy.trim_trimmomaticT32
>>> ymp make mpic.trim_trimmomatic
rule download_file_ftp[source]

Downloads remote file using wget

rule download_file_http[source]

Downloads remote file using internal downloader

rule prefetch[source]

Downloads SRA files into NCBI SRA folder (ncbi/public/sra).

rule fastq_dump[source]

Extracts FQ from SRA files

rule combine_with_ref[source]
rule align_mafft[source]
rule blast7_merge[source]

Merges blast results from all samples into single file

rule blast7_extract[source]

Generates meta-data csv and sequence fasta pair from blast7 file for one gene.

rule blast7_extract_merge[source]

Merges extracted csv/fasta pairs over all samples.

rule blast7_all[source]
rule blast7_reports[source]
rule blast7_eval_hist[source]
rule blast7_eval_plot[source]
rule cdhit_fna_single[source]

Clustering predicted genes (nuc) using cdhit-est

rule 87[source]
rule 88[source]
rule 89[source]
rule 90[source]
rule 91[source]
rule 92[source]
rule faa_fastp[source]
rule fasta_to_fastp_gz[source]
rule gunzip[source]

Generic temporary gunzip

Use ruleorder: gunzip > myrule to prefer gunzipping over re-running a rule. E.g.

>>> ruleorder: gunzip > myrule
>>> rule myrule:
>>>   output: temp("some.txt"), "some.txt.gzip"
rule mkdir[source]

Auto-create directories listed in ymp config.

Use these as input:

>>> input: tmpdir = ancient(icfg.dir.tmp)

rule fq2fa[source]

Unzip and convert fastq to fasta

rule make_otu_table[source]
rule otu_to_qiime_txt[source]
rule otu_to_biom[source]
rule blast7_coverage_per_otu[source]
rule pick_open_otus[source]

Pick open reference OTUs

rule pick_closed_otus[source]

Pick closed reference OTUs

rule rarefy_table[source]
rule convert_to_closed_ref[source]

Convert open reference otu table to closed reference

rule env_wait[source]
rule ticktock[source]
rule noop[source]
rule normalize_16S[source]

Normalize 16S by copy number using picrust, must be run with closed reference OTU table

rule predict_metagenome[source]

Predict metagenome using picrust

rule categorize_by_function[source]

Categorize PICRUSt KOs into pathways

rule raxml_tree[source]
rule rsem_index[source]

Build Genome Index for RSEM

rule scnic_within_minsamp[source]
rule scnic_within_sparcc_filter[source]
rule star_index[source]

Build Genome Index for Star

API

ymp package

ymp.get_config()[source]

Access the current YMP configuration object.

This object might change once during normal execution: it is deleted before passing control to Snakemake. During unit test execution the object is deleted between all tests.

Return type

ConfigMgr

ymp.print_rule = 0

Set to 1 to show the YMP expansion process as it is applied to the next Snakemake rule definition.

>>> ymp.print_rule = 1
>>> rule broken:
>>>   ...
>>> ymp make broken -vvv
ymp.snakemake_versions = ['5.20.1']

List of Snakemake versions that this version of YMP has been verified to work with

Subpackages

ymp.cli package
ymp.cli.install_completion(ctx, attr, value)[source]

Installs click_completion tab expansion into the user’s shell

ymp.cli.install_profiler(ctx, attr, value)[source]
Submodules
ymp.cli.env module
ymp.cli.env.get_env(envname)[source]

Get single environment matching glob pattern

Parameters

envname – environment glob pattern

ymp.cli.env.get_envs(patterns=None)[source]

Get environments matching glob pattern

Parameters

patterns – list of strings to match
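
Glob matching of this kind can be pictured with the standard library's fnmatch module. The following is a minimal sketch, not YMP's implementation: the environment names and the helper match_envs are hypothetical, and returning everything when no pattern is given is an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical environment names; a real installation will list others.
KNOWN_ENVS = ["blast", "bbmap", "bowtie2", "megahit"]


def match_envs(patterns=None, envs=KNOWN_ENVS):
    """Return the names in envs that match any of the glob patterns.

    Assumption: with no patterns given, every environment is returned.
    """
    if not patterns:
        return list(envs)
    return [env for env in envs
            if any(fnmatchcase(env, pattern) for pattern in patterns)]


print(match_envs(["b*"]))
```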

ymp.cli.init module

Implements subcommands for ymp init

ymp.cli.init.have_command(cmd)[source]
ymp.cli.make module

Implements subcommands for ymp make and ymp submit

class ymp.cli.make.TargetParam[source]

Bases: click.types.ParamType

Handles tab expansion for build targets

classmethod complete(ctx, incomplete)[source]

Try to complete incomplete command

This is executed on tab or tab-tab from the shell

Parameters
  • ctx – click context object

  • incomplete – last word in command line up until cursor

Returns

list of words incomplete can be completed to

exception ymp.cli.make.YmpConfigNotFound[source]

Bases: ymp.exceptions.YmpException

Exception raised by YMP if no config was found in current path

ymp.cli.make.debug(msg, *args, **kwargs)[source]
ymp.cli.make.snake_params(func)[source]

Default parameters for subcommands launching Snakemake

ymp.cli.make.start_snakemake(kwargs)[source]

Execute Snakemake with given parameters and targets

Fixes paths of kwargs[‘targets’] to be relative to YMP root.

ymp.cli.shared_options module
class ymp.cli.shared_options.Group(name=None, commands=None, **attrs)[source]

Bases: click.core.Group

command(*args, **kwargs)[source]

A shortcut decorator for declaring and attaching a command to the group. This takes the same arguments as command() but immediately registers the created command with this instance by calling into add_command().

class ymp.cli.shared_options.Log[source]

Bases: object

Set up Logging

classmethod logfile_option(ctx, param, val)[source]
mod_level(n)[source]
classmethod quiet_option(ctx, param, val)[source]
static set_logfile(filename)[source]
classmethod verbose_option(ctx, param, val)[source]
class ymp.cli.shared_options.LogFormatter[source]

Bases: coloredlogs.ColoredFormatter

Initialize a ColoredFormatter object.

Parameters
  • fmt – A log format string (defaults to DEFAULT_LOG_FORMAT).

  • datefmt – A date/time format string (defaults to None, but see the documentation of BasicFormatter.formatTime()).

  • style – One of the characters %, { or $ (defaults to DEFAULT_FORMAT_STYLE)

  • level_styles – A dictionary with custom level styles (defaults to DEFAULT_LEVEL_STYLES).

  • field_styles – A dictionary with custom field styles (defaults to DEFAULT_FIELD_STYLES).

Raises

Refer to check_style().

This initializer uses colorize_format() to inject ANSI escape sequences in the log format string before it is passed to the initializer of the base class.

format(record)[source]

Apply level-specific styling to log records.

Parameters

record – A LogRecord object.

Returns

The result of logging.Formatter.format().

This method injects ANSI escape sequences that are specific to the level of each log record (because such logic cannot be expressed in the syntax of a log format string). It works by making a copy of the log record, changing the msg field inside the copy and passing the copy into the format() method of the base class.

snakemake_level_styles = {'crirical': {'color': 'red'}, 'debug': {'color': 'blue'}, 'error': {'color': 'red'}, 'info': {'color': 'green'}, 'warning': {'color': 'yellow'}}
class ymp.cli.shared_options.TqdmHandler(stream=None)[source]

Bases: logging.StreamHandler

Tqdm aware logging StreamHandler

Passes all log writes through tqdm to allow progress bars and log messages to coexist without clobbering the terminal

Initialize the handler.

If stream is not specified, sys.stderr is used.

emit(record)[source]

Emit a record.

If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.

ymp.cli.shared_options.command(*args, **kwargs)[source]
ymp.cli.shared_options.enable_debug(ctx, param, val)[source]
ymp.cli.shared_options.group(*args, **kwargs)[source]
ymp.cli.shared_options.log_options(f)[source]
ymp.cli.shared_options.nohup(ctx, param, val)[source]

Make YMP continue after the shell dies.

  • redirects stdout and stderr into pipes and a subprocess that won’t die if it can no longer write to either

  • closes stdin

ymp.cli.show module

Implements subcommands for ymp show

class ymp.cli.show.ConfigPropertyParam[source]

Bases: click.types.ParamType

Handles tab expansion for ymp show arguments

complete(_ctx, incomplete)[source]

Try to complete incomplete command

This is executed on tab or tab-tab from the shell

Parameters
  • ctx – click context object

  • incomplete – last word in command line up until cursor

Returns

list of words incomplete can be completed to

convert(value, param, ctx)[source]

Convert value of param given context

Parameters
  • value – string passed on command line

  • param – click parameter object

  • ctx – click context object

property properties

Find properties offered by ConfigMgr

ymp.cli.show.show_help(ctx, _param=None, value=True)[source]

Display click command help

ymp.cli.stage module
ymp.cli.stage.wrap(header, data)[source]
ymp.stage package

YMP processes data in stages, each of which is contained in its own directory.

with Stage("trim_bbmap") as S:
  S.doc("Trim reads with BBMap")
  rule bbmap_trim:
    output: "{:this:}/{sample}{:pairnames:}.fq.gz"
    input:  "{:prev:}/{sample}{:pairnames:}.fq.gz"
    ...
Submodules
ymp.stage.base module
class ymp.stage.base.BaseStage(name)[source]

Bases: object

Base class for stage types

STAMP_FILENAME = 'all_targets.stamp'

The name of the stamp file that is touched to indicate completion of the stage.

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

Returns a dictionary with the keys a subset of inputs and the values identifying redirections. An empty string indicates that no redirection is to take place. Otherwise, the string is the suffix to be appended to the prior StageStack.

Return type

Dict[str, str]
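
The return convention described above might look as follows for a plain stage that never redirects. This is a minimal sketch under stated assumptions: the free function, its outputs argument and the file patterns are hypothetical and only illustrate the shape of the returned dictionary.

```python
from typing import Dict, Set


def can_provide(outputs: Set[str], inputs: Set[str]) -> Dict[str, str]:
    """Map each requested input this stage can supply to a redirection.

    An empty string means "no redirection"; this sketch never redirects.
    """
    return {name: "" for name in inputs if name in outputs}


# Hypothetical file patterns, for illustration only.
stage_outputs = {"/{sample}.R1.fq.gz", "/{sample}.R2.fq.gz"}
wanted = {"/{sample}.R1.fq.gz", "/{sample}.contigs.fasta"}
print(can_provide(stage_outputs, wanted))
```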

doc(doc)[source]

Add documentation to Stage

Parameters

doc (str) – Docstring passed to Sphinx

Return type

None

docstring: str

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

Return type

Set[str]

get_path(stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

Return type

str

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

name

The name of the stage is a string uniquely identifying it among all stages.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

class ymp.stage.base.ConfigStage(name, cfg)[source]

Bases: ymp.stage.base.BaseStage

Base for stages created via configuration

These Stages derive from ymp.yml and not from a rules file.

cfg

The configuration object defining this Stage.

property defined_in

List of files defining this stage

Used to invalidate caches.

filename

Semicolon-separated list of file names defining this Stage.

lineno

Line number within the first file at which this Stage is defined.

ymp.stage.expander module
class ymp.stage.expander.StageExpander[source]

Bases: ymp.snakemake.ColonExpander

  • Registers rules with stages when they are created

class Formatter(expander)[source]

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(key, args, kwargs)[source]
get_value_(key, args, kwargs)[source]
expand_ruleinfo(rule, item, expand_args, rec)[source]
expand_str(rule, item, expand_args, rec, cb)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

ymp.stage.groupby module
class ymp.stage.groupby.GroupBy(name)[source]

Bases: ymp.stage.base.BaseStage

Dummy stage for grouping

ymp.stage.pipeline module

Pipelines Module

Contains classes for pre-configured pipelines comprising multiple stages.

class ymp.stage.pipeline.Pipeline(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.

Pipelines are configured via ymp.yml.

Example

pipelines:
  my_pipeline:
    - stage_1
    - stage_2
    - stage_3

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

The result dictionary values will point to the “real” output.

Return type

Dict[str, str]

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_path(stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

property outputs

The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.

TODO: Allow hiding the output of intermediary stages.

Return type

Dict[str, str]
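
The override rule for pipeline outputs can be illustrated with an ordinary dictionary merge. This is a sketch only: the function pipeline_outputs, the output names and the use of a dot-joined stack suffix as the redirection value are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def pipeline_outputs(stages: List[Tuple[str, Set[str]]]) -> Dict[str, str]:
    """Merge stage outputs; later stages override earlier ones.

    Each value is the (assumed) stack suffix of the stage that produced
    the output last.
    """
    merged: Dict[str, str] = {}
    suffix = ""
    for name, outputs in stages:
        suffix += "." + name
        for output in outputs:
            merged[output] = suffix
    return merged


# Hypothetical stages and output names.
example = [
    ("stage_1", {"reads"}),
    ("stage_2", {"reads", "contigs"}),
    ("stage_3", {"contigs"}),
]
print(pipeline_outputs(example))
```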

property pipeline
ymp.stage.project module
class ymp.stage.project.PandasProjectData(cfg)[source]

Bases: object

column(col)[source]
columns()[source]
dump()[source]
duplicate_rows(column)[source]
get(idcol, row, col)[source]
groupby_dedup(cols)[source]

Return non-redundant identifying subset of cols

identifying_columns()[source]
rows(cols)[source]
string_columns()[source]
class ymp.stage.project.PandasTableBuilder[source]

Bases: object

Builds the data table describing each sample in a project

This class implements loading and combining tabular data files as specified by the YAML configuration.

Format:
  • string items are files

  • lists of files are concatenated top to bottom

  • dicts must have one “command” value:

    • ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers

    • ‘table’ contains a list of one-item dicts; the dicts have the form key:value[,value...]; an in-place table is created from the keys; a list of dicts is necessary as dicts are unordered

    • ‘paste’ contains a list of tables pasted left to right; the tables pasted must be of equal length or length 1

  • if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD

Example

- top.csv
- join:
  - excel.xslx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
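
How a ‘table’ entry could expand into rows is sketched below in plain Python. This is not the PandasTableBuilder code: build_table is a hypothetical helper, and the requirement that all columns have the same number of values is an assumption.

```python
from typing import Dict, List


def build_table(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn [{"col": "v1,v2"}, ...] into a list of row dicts."""
    columns = {}
    for item in items:
        (key, values), = item.items()  # each dict must hold exactly one entry
        columns[key] = [value.strip() for value in values.split(",")]
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


rows = build_table([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
    {"fq2": "s1.2.fq, s2.2.fq, s3.2.fq"},
])
for row in rows:
    print(row)
```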
load_data(cfg)[source]
class ymp.stage.project.Project(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

Contains configuration for a source dataset to be processed

KEY_BCCOL = 'barcode_col'
KEY_DATA = 'data'
KEY_IDCOL = 'id_col'
KEY_READCOLS = 'read_cols'
RE_FILE = re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')
RE_REMOTE = re.compile('^(?:https?|ftp|sftp)://(?:.*)')
RE_SRR = re.compile('^[SED]RR[0-9]+$')
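
The three patterns can be tried out directly. The sketch below copies them from the attributes above; the classify helper and the order in which it applies the checks are assumptions and do not describe how Project itself uses them.

```python
import re

# Patterns copied from the class attributes listed above.
RE_REMOTE = re.compile(r"^(?:https?|ftp|sftp)://(?:.*)")
RE_SRR = re.compile(r"^[SED]RR[0-9]+$")
RE_FILE = re.compile(r"^(?!http://).*(?:fq|fastq)(?:|\.gz)$")


def classify(value: str) -> str:
    """Label a table cell; the order of the checks is an assumption."""
    if RE_REMOTE.match(value):
        return "remote"
    if RE_SRR.match(value):
        return "srr"
    if RE_FILE.match(value):
        return "file"
    return "other"


for cell in ["SRR1234567", "reads/s1.1.fq.gz", "ftp://host/s1.fastq", "s1"]:
    print(cell, "->", classify(cell))
```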
choose_fq_columns()[source]

Configures the columns referencing the fastq sources

choose_id_column()[source]

Configures column to use as index on runs

If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
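
The selection rule can be sketched without pandas. The helper below is hypothetical and works on a list of row dicts rather than a dataframe; the exception types are assumptions.

```python
from typing import Dict, List, Optional


def choose_id_column(rows: List[Dict[str, str]],
                     configured: Optional[str] = None) -> str:
    """Pick the configured column if valid, else the leftmost unique one."""
    columns = list(rows[0])  # dicts preserve column order

    def is_unique(column: str) -> bool:
        values = [row[column] for row in rows]
        return len(set(values)) == len(values)

    if configured is not None:
        if configured not in columns:
            raise KeyError(f"column {configured!r} not found")
        if not is_unique(configured):
            raise ValueError(f"column {configured!r} is not unique")
        return configured
    for column in columns:
        if is_unique(column):
            return column
    raise ValueError("no unique column found")


# Hypothetical sample sheet.
sheet = [
    {"site": "A", "sample": "s1", "fq1": "s1.1.fq"},
    {"site": "A", "sample": "s2", "fq1": "s2.1.fq"},
]
print(choose_id_column(sheet))
```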

property data

Pandas dataframe of runs

Lazy loading property, first call may take a while.

encode_barcode_path(barcode_file, run, pair)[source]
property fq_names

Names of all FastQ files

property fwd_fq_names

Names of forward FastQ files (se and pe)

property fwd_pe_fq_names

Names of forward FastQ files part of pair

get_fq_names(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]

Get pipeline names of fq files

get_ids(groups, match_groups=None, match_value=None)[source]
property idcol
iter_samples(variables=None)[source]
minimize_variables(groups)[source]
property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

property pe_fq_names

Names of paired end FastQ files

property project_name
raw_reads_source_path(args, kwargs)[source]
property rev_pe_fq_names

Names of reverse FastQ files part of pair

property runs

Pandas dataframe index of runs

Lazy loading property, first call may take a while.

property se_fq_names

Names of single end FastQ files

property source_cfg
source_path(target, pair, nosplit=False)[source]

Get path for FQ file for run and pair

unsplit_path(barcode_id, pairname)[source]
property variables
class ymp.stage.project.SQLiteProjectData(cfg, name='data')[source]

Bases: object

column(col)[source]
columns()[source]
property db_url
dump()[source]
duplicate_rows(column)[source]
get(idcol, row, col)[source]
groupby_dedup(cols)[source]
identifying_columns()[source]
property nrows
query(*args)[source]
rows(col)[source]
string_columns()[source]
ymp.stage.reference module
class ymp.stage.reference.Archive(name, dirname, tar, url, strip, files)[source]

Bases: object

dirname = None
files = None
get_files()[source]
hash = None
make_unpack_rule(baserule)[source]
name = None
strip_components = None
tar = None
class ymp.stage.reference.Reference(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

Represents (remote) reference file/database configuration

add_files(rsc, local_path)[source]
get_file(filename)[source]
get_path(_stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

make_unpack_rules(baserule)[source]
property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

ymp.stage.stack module
class ymp.stage.stack.StageStack(path, stage=None)[source]

Bases: object

The “head” of a processing chain - a stack of stages

all_targets()[source]
complete(incomplete)[source]
property defined_in
classmethod get(path, stage=None)[source]

Cached access to StageStack

Parameters
  • path – Stage path

  • stage – Stage object at head of stack

property path

On disk location of files provided by this stack

prev(args=None, kwargs=None)[source]

Directory of previous stage

resolve_prevs()[source]
target(args, kwargs)[source]

Finds the target in the prev stage matching current target

property targets

Returns the current targets

used_stacks = {}
ymp.stage.stack.find_stage(name)[source]
ymp.stage.stack.norm_wildcards(pattern)[source]
ymp.stage.stage module
class ymp.stage.stage.Param(stage, key, name, value=None, default=None)[source]

Bases: object

Stage Parameter (base class)

property constraint
pattern(show_constraint=True)[source]

String to add to filenames passed to Snakemake

I.e. a pattern of the form {wildcard,constraint}

class ymp.stage.stage.ParamChoice(*args, **kwargs)[source]

Bases: ymp.stage.stage.Param

Stage Choice Parameter

param_func()[source]

Returns function that will extract parameter value from wildcards

class ymp.stage.stage.ParamFlag(*args, **kwargs)[source]

Bases: ymp.stage.stage.Param

Stage Flag Parameter

param_func()[source]

Returns function that will extract parameter value from wildcards

class ymp.stage.stage.ParamInt(*args, **kwargs)[source]

Bases: ymp.stage.stage.Param

Stage Int Parameter

param_func()[source]

Returns function that will extract parameter value from wildcards

class ymp.stage.stage.Stage(name, altname=None, env=None, doc=None)[source]

Bases: ymp.snakemake.WorkflowObject, ymp.stage.base.BaseStage

Creates a new stage

While entered using with, several stage specific variables are expanded within rules:

  • {:this:} – The current stage directory

  • {:that:} – The alternate output stage directory

  • {:prev:} – The previous stage’s directory

Parameters
  • name (str) – Name of this stage

  • altname (Optional[str]) – Alternate name of this stage (used for stages with multiple output variants, e.g. filter_x and remove_x).

  • doc (Optional[str]) – See Stage.doc

  • env (Optional[str]) – See Stage.env

active = None

Currently active stage (“entered”)

add_param(key, typ, name, value=None, default=None)[source]

Add parameter to stage

Example

>>> with Stage("test") as S:
>>>   S.add_param("N", "int", "nval", default=50)
>>>   rule:
>>>      shell: "echo {param.nval}"

This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.

Parameters
  • key – The character to use in the Stage name

  • typ – The type of the parameter (int, flag)

  • name – Name of parameter in params

  • value – value {param.xyz} should be set to if param given

  • default – default value for {param.xyz} if no param given
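
The naming behaviour from the example (“test” versus “testN123”) can be mimicked with a regular expression. This is a sketch only: the pattern and the helper nval_for are assumptions and do not show how Stage builds its wildcard constraints.

```python
import re

# Assumed shape of the name pattern for stage "test" with key "N".
STAGE_NAME = re.compile(r"^test(?:N(?P<nval>[0-9]+))?$")


def nval_for(stage_name: str, default: int = 50) -> int:
    """Return the integer after "N", or the default if it is absent."""
    match = STAGE_NAME.match(stage_name)
    if match is None:
        raise ValueError(f"{stage_name!r} does not name stage 'test'")
    value = match.group("nval")
    return default if value is None else int(value)


for name in ["test", "testN123"]:
    print(name, "->", nval_for(name))
```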

env(name)[source]

Add package specifications to Stage environment

Note

This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types not supporting conda environments

Parameters

name (str) – Environment name or filename

>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>    S.env("blast")
>>>    rule testing:
>>>       ...
>>> with Stage("test", env="blast") as S:
>>>    rule testing:
>>>       ...
>>> with Stage("test") as S:
>>>    rule testing:
>>>       conda: "blast"
>>>       ...
Return type

None

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

prev(args, kwargs)[source]

Gathers {:prev:} calls from rules

require(**kwargs)[source]

Override inferred stage inputs

In theory, this should not be needed. But it’s simpler for now.

satisfy_inputs(other_stage, inputs)[source]
Return type

Dict[str, str]

that(args=None, kwargs=None)[source]

Alternate directory of current stage

Used for splitting stages

this(args=None, kwargs=None)[source]

Directory of current stage

wc2path(wc)[source]
wildcards(args=None, kwargs=None)[source]
ymp.stage.stage.norm_wildcards(pattern)[source]

Submodules

ymp.blast module

Parsers for blast output formats 6 (CSV) and 7 (CSV with comments between queries).

class ymp.blast.BlastParser[source]

Bases: object

Base class for BLAST parsers

FIELD_MAP = {'% identity': 'pident', 'alignment length': 'length', 'bit score': 'bitscore', 'evalue': 'evalue', 'gap opens': 'gapopen', 'mismatches': 'mismatch', 'q. end': 'qend', 'q. start': 'qstart', 'query acc.': 'qacc', 'query frame': 'qframe', 'query length': 'qlen', 's. end': 'send', 's. start': 'sstart', 'sbjct frame': 'sframe', 'score': 'score', 'subject acc.': 'sacc', 'subject strand': 'sstrand', 'subject tax ids': 'staxids', 'subject title': 'stitle'}
FIELD_TYPE = {'bitscore': <class 'float'>, 'evalue': <class 'float'>, 'gapopen': <class 'int'>, 'length': <class 'int'>, 'mismatch': <class 'int'>, 'pident': <class 'float'>, 'qend': <class 'int'>, 'qframe': <class 'int'>, 'qlen': <class 'int'>, 'qstart': <class 'int'>, 'score': <class 'float'>, 'send': <class 'int'>, 'sframe': <class 'int'>, 'sstart': <class 'int'>, 'staxids': <function BlastParser.tupleofint>, 'stitle': <class 'str'>}
tupleofint()[source]
class ymp.blast.Fmt6Parser(fileobj)[source]

Bases: ymp.blast.BlastParser

Parser for BLAST format 6 (CSV)

Hit

alias of BlastHit

field_types = [None, None, <class 'float'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'int'>, <class 'float'>, <class 'float'>]
fields = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']

Default field types

get_fields()[source]
class ymp.blast.Fmt7Parser(fileobj)[source]

Bases: ymp.blast.BlastParser

Parses BLAST results in format ‘7’ (CSV with comments)

DATABASE = '# Database: '
FIELDS = '# Fields: '
HITSFOUND = ' hits found'
QUERY = '# Query: '
get_fields()[source]

Returns list of available field names

Format 7 specifies which columns it contains in comment lines, allowing this parser to be agnostic of the selection of columns made when running BLAST.

Return type

List[str]

Returns

List of field names (e.g. ['sacc', 'qacc', 'evalue'])
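
Reading the column names from a format 7 comment line might look like the sketch below. It reuses the FIELDS prefix and a subset of FIELD_MAP from above; parse_fields is a hypothetical helper, and passing unknown names through unchanged is an assumption.

```python
FIELDS = "# Fields: "  # prefix constant listed for Fmt7Parser

# Subset of BlastParser.FIELD_MAP shown above.
FIELD_MAP = {
    "query acc.": "qacc",
    "subject acc.": "sacc",
    "% identity": "pident",
    "evalue": "evalue",
    "bit score": "bitscore",
}


def parse_fields(line: str):
    """Return short field names for a '# Fields: ' comment line."""
    if not line.startswith(FIELDS):
        raise ValueError("not a fields line")
    names = line[len(FIELDS):].strip().split(", ")
    # Assumption: names missing from the map are passed through unchanged.
    return [FIELD_MAP.get(name, name) for name in names]


print(parse_fields("# Fields: subject acc., query acc., evalue\n"))
```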

isfirsthit()[source]

Returns True if the current hit is the first hit for the current query

Return type

bool

ymp.blast.reader(fileobj, t=7)[source]

Creates a reader for files in BLAST format

>>> with open(blast_file) as infile:
>>>    reader = blast.reader(infile)
>>>    for hit in reader:
>>>       print(hit)
Parameters
  • fileobj – iterable yielding lines in blast format

  • t (int) – number of blast format type

Return type

BlastParser

ymp.blast2gff module

ymp.cluster module

Module handling talking to cluster management systems

>>> python -m ymp.cluster slurm status <jobid>
class ymp.cluster.ClusterMS[source]

Bases: object

class ymp.cluster.Lsf[source]

Bases: ymp.cluster.ClusterMS

Talking to LSF

states = {'DONE': 'success', 'EXIT': 'failed', 'PEND': 'running', 'POST_DONE': 'success', 'POST_ERR': 'failed', 'PSUSP': 'running', 'RUN': 'running', 'SSUSP': 'running', 'UNKWN': 'running', 'USUSP': 'running', 'WAIT': 'running'}
static status(jobid)[source]
static submit(args)[source]
class ymp.cluster.Slurm[source]

Bases: ymp.cluster.ClusterMS

Talking to Slurm

states = {'BOOT_FAIL': 'failed', 'CANCELLED': 'failed', 'COMPLETED': 'success', 'COMPLETING': 'running', 'CONFIGURING': 'running', 'DEADLINE': 'failed', 'FAILED': 'failed', 'NODE_FAIL': 'failed', 'PENDING': 'running', 'PREEMPTED': 'failed', 'RESIZING': 'running', 'REVOKED': 'running', 'RUNNING': 'running', 'SPECIAL_EXIT': 'running', 'SUSPENDED': 'running', 'TIMEOUT': 'failed'}
static status(jobid)[source]

Print the status of the job jobid to stdout (as needed by Snakemake)

Anecdotal benchmarking shows 200ms per invocation, half used by Python startup and half by calling sacct. Using scontrol show job instead of sacct -pbs is faster by 80ms, but finished jobs are purged after an unknown time window.
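
The translation from a Slurm job state to the status word reported to Snakemake amounts to a table lookup. The sketch below uses a subset of the states table above; snakemake_status is a hypothetical helper, and both the first-word handling and the fallback to “running” for unknown states are assumptions.

```python
# Subset of the Slurm.states table shown above.
STATES = {
    "COMPLETED": "success",
    "RUNNING": "running",
    "PENDING": "running",
    "FAILED": "failed",
    "TIMEOUT": "failed",
    "CANCELLED": "failed",
}


def snakemake_status(slurm_state: str) -> str:
    """Translate a Slurm state string into success, running or failed."""
    # Assumption: only the first word matters, so that a value such as
    # "CANCELLED by 1000" still resolves, and unknown states are
    # reported as still running.
    words = slurm_state.split()
    key = words[0] if words else ""
    return STATES.get(key, "running")


print(snakemake_status("COMPLETED"))
```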

ymp.cluster.error(*args, **kwargs)[source]

ymp.common module

Collection of shared utility classes and methods

class ymp.common.AttrDict[source]

Bases: dict

AttrDict adds accessing stored keys as attributes to dict
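
Attribute access on a dict can be sketched in a few lines. The class below is an illustration under a different name, not the AttrDict source; the keys are hypothetical.

```python
class AttrDictSketch(dict):
    """dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


dirs = AttrDictSketch(tmp="tmp", reports="reports")
print(dirs.tmp, dirs["reports"])
```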

class ymp.common.Cache(root)[source]

Bases: object

close()[source]
commit()[source]
get_cache(name, clean=False, *args, **kwargs)[source]
load(cache, key)[source]
load_all(cache)[source]
store(cache, key, obj)[source]
class ymp.common.CacheDict(cache, name, *args, loadfunc=None, itemloadfunc=None, itemdata=None, **kwargs)[source]

Bases: ymp.common.AttrDict

get(k[, d]) → D[k] if k in D, else d. d defaults to None.[source]
items() → a set-like object providing a view on D’s items[source]
keys() → a set-like object providing a view on D’s keys[source]
values() → an object providing a view on D’s values[source]
class ymp.common.MkdirDict[source]

Bases: ymp.common.AttrDict

Creates directories as they are requested
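
Creating a directory at the moment it is requested can be sketched by overriding item access. This is an illustration under a different name, not the MkdirDict source; it covers item access only and uses a temporary directory so that it runs anywhere.

```python
import os
import tempfile


class MkdirDictSketch(dict):
    """dict of directory paths that creates a directory when it is read."""

    def __getitem__(self, key):
        path = super().__getitem__(key)
        os.makedirs(path, exist_ok=True)
        return path


root = tempfile.mkdtemp()
dirs = MkdirDictSketch(tmp=os.path.join(root, "tmp"))
print(os.path.isdir(dirs["tmp"]))
```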

ymp.common.ensure_list(obj)[source]

Wrap obj in a list as needed

ymp.common.flatten(item)[source]

Flatten lists without turning strings into letters
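
The point of treating strings specially is that a string is itself iterable, so a naive flatten would split it into characters. A minimal sketch follows; flatten_sketch is hypothetical, and whether the real function returns a list or a generator is not stated here.

```python
def flatten_sketch(item):
    """Flatten nested lists and tuples; strings stay whole."""
    if isinstance(item, str) or not isinstance(item, (list, tuple)):
        return [item]
    result = []
    for element in item:
        result.extend(flatten_sketch(element))
    return result


print(flatten_sketch(["ab", ["cd", ("ef", ["gh"])]]))
```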

ymp.common.is_container(obj)[source]

Check if object is container, considering strings not containers

ymp.common.parse_number(s='')[source]

Basic 1k 1m 1g 1t parsing.

  • assumes base 2

  • returns “byte” value

  • accepts “1kib”, “1kb” or “1k”
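
The three rules above can be sketched with a regular expression and powers of 1024. This is not the parse_number source: the function name, the accepted spellings beyond those listed and the error handling are assumptions.

```python
import re

_UNITS = {"": 0, "k": 1, "m": 2, "g": 3, "t": 4}
_NUMBER = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([kmgt]?)(?:i?b)?\s*$", re.I)


def parse_number_sketch(text: str) -> int:
    """Parse strings such as "1k", "1kb" or "1kib" into a byte count."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    value, unit = match.groups()
    # Base 2: each unit step multiplies by 1024.
    return int(float(value) * 1024 ** _UNITS[unit.lower()])


for text in ["1k", "1kb", "1kib", "2m"]:
    print(text, "->", parse_number_sketch(text))
```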

ymp.config module

class ymp.config.ConfigExpander(config_mgr)[source]

Bases: ymp.snakemake.ColonExpander

class Formatter(expander)[source]

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(field_name, args, kwargs)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

class ymp.config.ConfigMgr(root, conffiles)[source]

Bases: object

Manages workflow configuration

This is a singleton object of which only one instance should be around at a given time. It is available in the rules files as icfg and via ymp.get_config() elsewhere.

ConfigMgr loads and maintains the workflow configuration as given in the ymp.yml files located in the workflow root directory, the user config folder (~/.ymp) and the installation etc folder.

CONF_DEFAULT_FNAME = '/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/stable/src/ymp/etc/defaults.yml'
CONF_FNAME = 'ymp.yml'
CONF_USER_FNAME = '/home/docs/.ymp/ymp.yml'
KEY_PIPELINES = 'pipelines'
KEY_PROJECTS = 'projects'
KEY_REFERENCES = 'references'
RULE_MAIN_FNAME = '/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/stable/src/ymp/rules/Snakefile'
property absdir

Dictionary of absolute paths of named YMP directories

classmethod activate()[source]
property cluster

The YMP cluster configuration.

property conda
property dir

Dictionary of relative paths of named YMP directories

The directory paths are relative to the YMP root workdir.

property ensuredir

Dictionary of absolute paths of named YMP directories

Directories will be created on the fly as they are requested.

expand(item, **kwargs)[source]
classmethod find_config()[source]

Locates ymp config files and ymp root

The root ymp work dir is determined as the first (parent) directory containing a file named ConfigMgr.CONF_FNAME (default ymp.yml).

The stack of config files comprises 1. the default config ConfigMgr.CONF_DEFAULT_FNAME (etc/defaults.yml in the ymp package directory), 2. the user config ConfigMgr.CONF_USER_FNAME (~/.ymp/ymp.yml) and 3. the ymp.yml in the ymp root.

Returns

  • root – Root working directory

  • conffiles – list of active configuration files
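
The search for the root directory can be sketched with pathlib. The helper find_root is hypothetical and returns None when nothing is found; how ConfigMgr reports a missing config is not shown here.

```python
from pathlib import Path
from typing import Optional

CONF_FNAME = "ymp.yml"


def find_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above start holding ymp.yml."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        if (directory / CONF_FNAME).is_file():
            return directory
    return None


print(find_root(Path.cwd()))
```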

classmethod instance()[source]

Returns the active Ymp ConfigMgr instance

property limits

The YMP limits configuration.

mem(base='0', per_thread=None, unit='m')[source]

Clamp memory to configuration limits

Parameters
  • base – base memory requested

  • per_thread – additional mem required per allocated thread

  • unit – output unit (b, k, m, g, t)

property pairnames
property pipeline

Configure pipelines

property platform

Name of current platform (macos or linux)

property ref

Configure references

property shell

The shell used by YMP

Change by adding e.g. shell: /path/to/shell to ymp.yml.

property snakefiles

Snakefiles used under this config in parsing order

classmethod unload()[source]
class ymp.config.OverrideExpander(cfgmgr)[source]

Bases: ymp.snakemake.BaseExpander

Apply rule attribute overrides from ymp.yml config

Example

Set the wordsize parameter in the bmtagger_bitmask rule to 12:

ymp.yml
overrides:
  rules:
    bmtagger_bitmask:
      params:
        wordsize: 12
expand(rule, ruleinfo, **kwargs)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object, which is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level

ymp.dna module

ymp.dna.nuc2aa(seq)
ymp.dna.nuc2num(seq)

ymp.download module

class ymp.download.DownloadThread[source]

Bases: object

get(url, dest, md5)[source]
main()[source]
terminate()[source]
class ymp.download.FileDownloader(block_size=4096, timeout=300, parallel=4, loglevel=30, alturls=None, retry=3)[source]

Bases: object

Manages download of a set of URLs

Downloads happen concurrently using asynchronous network IO.

Parameters
  • block_size (int) – Byte size of chunks to download

  • timeout (int) – Aiohttp cumulative timeout

  • parallel (int) – Number of files to download in parallel

  • loglevel (int) – Log level for messages sent to logging (errors are sent with loglevel+10)

  • alturls – List of regexps modifying URLs

  • retry (int) – Number of times to retry download

error(msg, *args, **kwargs)[source]

Send error to logger

Message is sent with a log level 10 higher than the default for this object.

Return type

None

get(urls, dest, md5s=None)[source]

Download a list of URLs

Parameters
Return type

None

log(msg, *args, modlvl=0, **kwargs)[source]

Send message to logger

Honors loglevel set for the FileDownloader object.

Parameters
  • msg (str) – The log message

  • modlvl (int) – Added to default logging level for object

Return type

None

static make_bar_format(desc_width=20, count_width=0, rate=False, eta=False, have_total=True)[source]

Construct bar_format for tqdm

Parameters
  • desc_width (int) – minimum space allocated for description

  • count_width (int) – min space for counts

  • rate (bool) – show rate to right of progress bar

  • eta (bool) – show eta to right of progress bar

  • have_total (bool) – whether a total exists (required to add percentage)

Return type

str

ymp.env module

This module manages the conda environments.

class ymp.env.CondaPathExpander(config, *args, **kwargs)[source]

Bases: ymp.snakemake.BaseExpander

Applies search path for conda environment specifications

File names supplied via rule: conda: "some.yml" are replaced with absolute paths if they are found in any searched directory. Each search_paths entry is appended to the directory containing the top level Snakefile and the directory checked for the filename. Thereafter, the stack of including Snakefiles is traversed backwards. If no file is found, the original name is returned.

expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

format(conda_env, *args, **kwargs)[source]

Format item using *args and **kwargs

class ymp.env.Env(env_file=None, dag=None, singularity_img=None, container_img=None, cleanup=None, name=None, packages=None, base='none', channels=None, rule=None)[source]

Bases: ymp.snakemake.WorkflowObject, snakemake.deployment.conda.Env

Represents YMP conda environment

Snakemake expects the conda environments in a per-workflow directory configured by conda_prefix. YMP sets this value by default to ~/.ymp/conda, which has a greater chance of being on the same file system as the conda cache, allowing for hard linking of environment files.

Within the folder conda_prefix, each environment is created in a folder named by the hash of the environment definition file’s contents and the conda_prefix path. This class inherits from snakemake.deployment.conda.Env to ensure that the hash we use is identical to the one Snakemake will use during workflow execution.

The class provides additional features for updating environments, creating environments dynamically and executing commands within those environments.

Note

This is not called from within the execution. Snakemake instantiates its own Env object purely based on the filename.

Creates an inline defined conda environment

Parameters
  • name (Optional[str]) – Name of conda environment (and basename of file)

  • packages (Union[list, str, None]) – package(s) to be installed into environment. Version constraints can be specified in each package string separated from the package name by whitespace. E.g. "blast =2.6*"

  • channels (Union[list, str, None]) – channel(s) to be selected for the environment

  • base (str) – Select a set of default channels and packages to be added to the newly created environment. Sets are defined in conda.defaults in ymp.yml

create(dryrun=False, force=False)[source]

Ensure the conda environment has been created

Inherits from snakemake.conda.Env.create

Behavior of super class

The environment is installed in a folder in conda_prefix named according to a hash of the environment.yaml defining the environment and the value of conda-prefix (Env.hash). The latter is included as installed environments cannot be moved.

  • If this folder (Env.path) exists, nothing is done.

  • If a folder named according to the hash of just the contents of environment.yaml exists, the environment is created by unpacking the tar balls in that folder.
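
The two folder names mentioned above can be pictured with a small hashing sketch. The choice of MD5, the order in which prefix and contents are fed to the hash, and the full-length digest are assumptions for illustration; the authoritative computation lives in Snakemake's Env class.

```python
import hashlib

def content_hash(env_yaml: bytes) -> str:
    """Hash of the environment definition alone (names the archive folder)."""
    return hashlib.md5(env_yaml).hexdigest()

def env_hash(env_yaml: bytes, conda_prefix: str) -> str:
    """Hash of prefix plus definition (names the installed environment)."""
    md5 = hashlib.md5()
    md5.update(conda_prefix.encode())
    md5.update(env_yaml)
    return md5.hexdigest()

spec = b"channels: [bioconda]\ndependencies: [blast =2.6*]\n"
hash_home = env_hash(spec, "/home/user/.ymp/conda")
hash_scratch = env_hash(spec, "/scratch/conda")
archive_name = content_hash(spec)
```

Because the prefix enters the first hash, the same definition installed under two prefixes gets two distinct folders, while the archive of package tarballs can be shared.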

Handling pre-computed environment specs

In addition to freezing environments by maintaining a copy of the package binaries, we allow maintaining a copy of the package binary URLs, from which the archive folder is populated on demand.

If a file {Env.name}.txt exists in conda.spec FIXME

export(stream, typ='yml')[source]

Freeze environment

static get_installed_env_hashes()[source]
property installed
run(command)[source]

Execute command in environment

Returns exit code of command run.

set_prefix(prefix)[source]
update()[source]

Update conda environment

ymp.exceptions module

Exceptions raised by YMP

exception ymp.exceptions.YmpConfigError(obj, msg, key=None, exc=None)[source]

Bases: ymp.exceptions.YmpNoStackException

Indicates an error in the ymp.yml config files

Parameters
  • obj (object) – Subtree of config causing error

  • msg (str) – The message to display

  • key (object) – Key indicating part of obj causing error

  • exc (Optional[Exception]) – Upstream exception causing error

exception ymp.exceptions.YmpException[source]

Bases: Exception

Base class of all YMP Exceptions

exception ymp.exceptions.YmpNoStackException(message)[source]

Bases: ymp.exceptions.YmpException, click.exceptions.ClickException

Exception that does not lead to stack trace on CLI

Inheriting from ClickException makes click print only the self.msg value of the exception, rather than allowing Python to print a full stack trace.

This is useful for exceptions indicating usage or configuration errors. We use this, instead of click.UsageError and friends so that the exceptions can be caught and handled explicitly where needed.

Note that click will call the show method on this object to print the exception. The default implementation from click will just prefix the msg with Error:.

FIXME: This does not work if the exception is raised from within the snakemake workflow as snakemake.snakemake catches and reformats exceptions.
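
The effect of the show mechanism can be mimicked without click. In the sketch below, NoStackError, UsageProblem and run_cli are hypothetical stand-ins: run_cli plays the role that click's command runner plays for ClickException, printing only the message instead of a stack trace.

```python
import io

class NoStackError(Exception):
    """Stand-in for an exception that the CLI reports without a trace."""

    def __init__(self, message):
        super().__init__(message)
        self.msg = message

    def show(self, file):
        file.write("Error: " + self.msg + "\n")

class UsageProblem(NoStackError):
    """Example subclass for a usage error."""

def run_cli(command, stderr):
    """Run command; report NoStackError via show() and return an exit code."""
    try:
        command()
    except NoStackError as exc:
        exc.show(stderr)
        return 1
    return 0

def bad_command():
    raise UsageProblem("no such stage: trim_foo")

captured = io.StringIO()
exit_code = run_cli(bad_command, captured)
```

Since the exceptions are ordinary Python exceptions, library code can still catch and handle them before they reach the command line layer.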

exception ymp.exceptions.YmpRuleError(obj, msg)[source]

Bases: ymp.exceptions.YmpNoStackException

Indicates an error in the rules files

This could e.g. be a Stage or Environment defined twice.

Parameters
  • obj (object) – The object causing the exception. Must have lineno and filename as these will be shown as part of the error message on the command line.

  • msg (str) – The message to display

show()[source]
Return type

None

exception ymp.exceptions.YmpStageError(msg)[source]

Bases: ymp.exceptions.YmpNoStackException

Indicates an error in the requested stage stack

show()[source]
Return type

None

exception ymp.exceptions.YmpSystemError(message)[source]

Bases: ymp.exceptions.YmpNoStackException

Indicates problem running YMP with available system software

exception ymp.exceptions.YmpUsageError(message)[source]

Bases: ymp.exceptions.YmpNoStackException

exception ymp.exceptions.YmpWorkflowError(message)[source]

Bases: ymp.exceptions.YmpNoStackException

Indicates an error during workflow execution

E.g. failures to expand dynamic variables

ymp.gff module

Implements simple reader and writer for GFF (general feature format) files.

Unfinished

  • only supports one version, GFF 3.2.3.

  • no escaping

class ymp.gff.Attributes(ID, Name, Alias, Parent, Target, Gap, Derives_From, Note, Dbxref, Ontology_term, Is_circular)

Bases: tuple

Create new instance of Attributes(ID, Name, Alias, Parent, Target, Gap, Derives_From, Note, Dbxref, Ontology_term, Is_circular)

property Alias

Alias for field number 2

property Dbxref

Alias for field number 8

property Derives_From

Alias for field number 6

property Gap

Alias for field number 5

property ID

Alias for field number 0

property Is_circular

Alias for field number 10

property Name

Alias for field number 1

property Note

Alias for field number 7

property Ontology_term

Alias for field number 9

property Parent

Alias for field number 3

property Target

Alias for field number 4

class ymp.gff.Feature(seqid, source, type, start, end, score, strand, phase, attributes)

Bases: tuple

Create new instance of Feature(seqid, source, type, start, end, score, strand, phase, attributes)

property attributes

Alias for field number 8

property end

Alias for field number 4

property phase

Alias for field number 7

property score

Alias for field number 5

property seqid

Alias for field number 0

property source

Alias for field number 1

property start

Alias for field number 3

property strand

Alias for field number 6

property type

Alias for field number 2

class ymp.gff.reader(fileobj)[source]

Bases: object

class ymp.gff.writer(fileobj)[source]

Bases: object

write(feature)[source]
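
To illustrate how a nine-column GFF line maps onto the Feature tuple, here is a minimal sketch. It redefines a Feature namedtuple with the same field names rather than importing ymp.gff; parse_gff_line and format_gff_line are hypothetical helpers, all columns are kept as strings, and no escaping is handled.

```python
from collections import namedtuple

Feature = namedtuple(
    "Feature",
    "seqid source type start end score strand phase attributes",
)

def parse_gff_line(line):
    """Split one tab-separated GFF line into a Feature (values stay str)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 columns, got %d" % len(columns))
    return Feature(*columns)

def format_gff_line(feature):
    """Inverse of parse_gff_line."""
    return "\t".join(feature) + "\n"

line = "contig1\tprodigal\tCDS\t3\t98\t.\t+\t0\tID=gene1;Name=abc\n"
feature = parse_gff_line(line)

# Column 9 holds key=value pairs separated by semicolons
attributes = dict(
    pair.split("=", 1) for pair in feature.attributes.split(";")
)
```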

ymp.helpers module

This module contains helper functions.

Not all of these are currently in use

class ymp.helpers.OrderedDictMaker[source]

Bases: object

odict creates OrderedDict objects in a dict-literal like syntax

>>>  my_ordered_dict = odict[
>>>    'key': 'value'
>>>  ]

Implementation: odict uses the Python slice syntax, which is similar to dict literals. The [] operator is implemented by overriding __getitem__. Slices passed to the operator as object[start1:stop1:step1, start2:...] are passed to the implementation as a sequence of objects with start, stop and step members. odict simply creates an OrderedDict by iterating over that sequence.
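
The slice trick can be reproduced in a few lines. The class below is a sketch written for this illustration (OrderedDictMakerSketch is a hypothetical name), not a copy of the YMP implementation.

```python
from collections import OrderedDict

class OrderedDictMakerSketch:
    """Builds an OrderedDict from obj['key': value, ...] syntax."""

    def __getitem__(self, keys):
        # A single 'k': v arrives as one slice, several arrive as a tuple
        if isinstance(keys, slice):
            keys = (keys,)
        return OrderedDict((item.start, item.stop) for item in keys)

odict = OrderedDictMakerSketch()

ordered = odict['b': 2, 'a': 1]
single = odict['key': 'value']
```

Each 'key': value pair is parsed by Python as a slice whose start is the key and whose stop is the value; the step member is unused here.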

ymp.helpers.update_dict(dst, src)[source]

Recursively update dictionary dst with src

  • Treats a list as atomic, replacing it with new list.

  • Dictionaries are overwritten by item

  • None is replaced by empty dict if necessary
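
The three rules can be sketched as below. update_dict_sketch is a hypothetical re-implementation based only on the description above; in particular, replacing any non-dict destination value by an empty dict before merging is an assumption.

```python
def update_dict_sketch(dst, src):
    """Recursively merge src into dst and return dst."""
    if dst is None:
        dst = {}
    for key, value in src.items():
        if isinstance(value, dict):
            current = dst.get(key)
            if not isinstance(current, dict):
                current = {}  # None (or a scalar) makes way for a dict
            dst[key] = update_dict_sketch(current, value)
        else:
            dst[key] = value  # lists and scalars are replaced wholesale
    return dst

base = {"a": {"x": 1, "y": [1, 2]}, "b": None, "c": 1}
override = {"a": {"y": [3], "z": 2}, "b": {"k": "v"}, "c": 5}
merged = update_dict_sketch(base, override)
```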

ymp.map2otu module

class ymp.map2otu.MapfileParser(minid=0)[source]

Bases: object

read(mapfiles)[source]
write(outfile)[source]
class ymp.map2otu.emirge_info(line)[source]

Bases: object

ymp.map2otu.main()[source]

ymp.nuc2aa module

ymp.nuc2aa.fasta_dna2aa(inf, outf)[source]
ymp.nuc2aa.nuc2aa(seq)[source]
ymp.nuc2aa.nuc2num(seq)[source]

ymp.snakemake module

Extends Snakemake Features

class ymp.snakemake.BaseExpander[source]

Bases: object

Base class for Snakemake expansion modules.

Subclasses should override the expand() method if they need to work on the entire RuleInfo object, or the format() and expands_field() methods if they intend to modify specific fields.

expand(rule, item, expand_args=None, rec=- 1, cb=False)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object, which is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level
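
The recursive descent over item types can be pictured with the following sketch. expand_item and fill_dir are hypothetical names, the formatting callback is a toy replacement, and RuleInfo handling, recursion limits and AnnotatedString flags are left out.

```python
def expand_item(item, format_str, expand_args=None):
    """Apply format_str to every string in a nested structure."""
    if item is None or isinstance(item, (int, float)):
        return item
    if isinstance(item, str):
        return format_str(item, expand_args)
    if isinstance(item, list):
        return [expand_item(sub, format_str, expand_args) for sub in item]
    if isinstance(item, tuple):
        return tuple(expand_item(sub, format_str, expand_args) for sub in item)
    if isinstance(item, dict):
        return {key: expand_item(sub, format_str, expand_args)
                for key, sub in item.items()}
    if callable(item):
        # Functions are wrapped: their result is expanded when they are
        # finally called (e.g. once wildcards are available)
        def late(*args, **kwargs):
            return expand_item(item(*args, **kwargs), format_str, expand_args)
        return late
    raise TypeError("cannot expand %r" % type(item))

def fill_dir(text, expand_args):
    """Toy formatter replacing a single placeholder."""
    return text.replace("{dir}", "out")

tree = {
    "input": ["{dir}/a.fq", lambda sample: "{dir}/" + sample + ".fq"],
    "threads": 4,
}
expanded = expand_item(tree, fill_dir)
late_value = expanded["input"][1]("b")
```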

expand_dict(rule, item, expand_args, rec)[source]
expand_func(rule, item, expand_args, rec, debug)[source]
expand_list(rule, item, expand_args, rec, cb)[source]
expand_ruleinfo(rule, item, expand_args, rec)[source]
expand_str(rule, item, expand_args, rec, cb)[source]
expand_tuple(rule, item, expand_args, rec, cb)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

format(item, *args, **kwargs)[source]

Format item using *args and **kwargs

format_annotated(item, expand_args)[source]

Wrapper for format() preserving AnnotatedString flags

Calls format() to format item into a new string and copies flags from original item.

This is used by expand().

exception ymp.snakemake.CircularReferenceException(deps, rule)[source]

Bases: ymp.exceptions.YmpRuleError

Exception raised if parameters in rule contain a circular reference

class ymp.snakemake.ColonExpander[source]

Bases: ymp.snakemake.FormatExpander

Expander using {:xyz:} formatted variables.

regex = re.compile('\n \\{:\n (?=(\n \\s*\n (?P<name>(?:.(?!\\s*\\:\\}))*.)\n \\s*\n ))\\1\n :\\}\n ', re.VERBOSE)
spec = '{{:{}:}}'
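
The behaviour of the regex and spec attributes shown above can be tried out directly with the re module. The pattern below is copied from the regex attribute; the sample string and the values dict are made up for the demonstration.

```python
import re

COLON_REGEX = re.compile(r"""
    \{:
        (?=(
            \s*
            (?P<name>(?:.(?!\s*\:\}))*.)
            \s*
        ))\1
    :\}
    """, re.VERBOSE)
COLON_SPEC = "{{:{}:}}"

text = "{:dir.in:}/{sample}.{: ext :}"

# Only {:...:} fields match; the plain {sample} wildcard is left alone
names = [match.group("name") for match in COLON_REGEX.finditer(text)]

values = {"dir.in": "raw", "ext": "fq.gz"}  # hypothetical replacement values
replaced = COLON_REGEX.sub(lambda m: values[m.group("name")], text)

# spec turns a field name back into its tagged form
tagged = COLON_SPEC.format("dir.in")
```

Whitespace around the name inside the tag is matched but not included in the name group.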
class ymp.snakemake.DefaultExpander(**kwargs)[source]

Bases: ymp.snakemake.InheritanceExpander

Adds default values to rules

The implementation simply makes all rules inherit from a defaults rule.

Creates DefaultExpander

Each parameter passed is considered a RuleInfo default value. Where applicable, Snakemake’s argtuples ([],{}) must be passed.

get_super(rule, ruleinfo)[source]

Find rule parent

Parameters
  • rule (Rule) – Rule object being built

  • ruleinfo (RuleInfo) – RuleInfo object describing rule being built

Returns

name of parent rule and RuleInfo describing parent rule or (None, None).

Return type

2-Tuple

exception ymp.snakemake.ExpandLateException[source]

Bases: Exception

class ymp.snakemake.ExpandableWorkflow(*args, **kwargs)[source]

Bases: snakemake.workflow.Workflow

Adds hook for additional rule expansion methods to Snakemake

Constructor for ExpandableWorkflow overlay attributes

This may be called on an already initialized Workflow object.

classmethod activate()[source]

Installs the ExpandableWorkflow

Replaces the Workflow object in the snakemake.workflow module with an instance of this class and initializes default expanders (the snakemake syntax).

add_rule(name=None, lineno=None, snakefile=None, checkpoint=False)[source]

Add a rule.

Parameters
  • name – name of the rule

  • lineno – line number within the snakefile where the rule was defined

  • snakefile – name of file in which rule was defined

classmethod clear()[source]
classmethod ensure_global_workflow()[source]
get_rule(name=None)[source]

Get rule by name. If name is none, the last created rule is returned.

Parameters

name – the name of the rule

global_workflow = <ymp.snakemake.ExpandableWorkflow object>
classmethod load_workflow(snakefile='/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/stable/src/ymp/rules/Snakefile')[source]
classmethod register_expanders(*expanders)[source]

Register objects whose expand() method will be called on each RuleInfo object before it is passed on to Snakemake.

rule(name=None, lineno=None, snakefile=None, checkpoint=None)[source]

Intercepts “rule:”. Here we have the entire ruleinfo object.

class ymp.snakemake.FormatExpander[source]

Bases: ymp.snakemake.BaseExpander

Expander using a custom formatter object.

class Formatter(expander)[source]

Bases: ymp.string.ProductFormatter

parse(format_string)[source]
format(*args, **kwargs)[source]

Format item using *args and **kwargs

get_names(pattern)[source]
regex = re.compile('\n \\{\n (?=(\n (?P<name>[^{}]+)\n ))\\1\n \\}\n ', re.VERBOSE)
spec = '{{{}}}'
exception ymp.snakemake.InheritanceException(msg, rule, parent, include=None, lineno=None, snakefile=None)[source]

Bases: snakemake.exceptions.RuleException

Exception raised for errors during rule inheritance

Creates a new instance of RuleException.

Parameters
  • message – the exception message

  • include – iterable of other exceptions to be included

  • lineno – the line the exception originates

  • snakefile – the file the exception originates

class ymp.snakemake.InheritanceExpander[source]

Bases: ymp.snakemake.BaseExpander

Adds class-like inheritance to Snakemake rules

To avoid redundancy between closely related rules, e.g. rules for single ended and paired end data, YMP allows Snakemake rules to inherit from another rule.

Example

Derived rules are always created with an implicit ruleorder statement, making Snakemake prefer the parent rule if either parent or child rule could be used to generate the requested output file(s).

Derived rules initially contain the same attributes as the parent rule. Each attribute assigned to the child rule overrides the matching attribute in the parent. Where attributes may contain named and unnamed values, specifying a named value overrides only the value of that name while specifying an unnamed value overrides all unnamed values in the parent attribute.
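
The override rules for named and unnamed values can be made concrete with a small sketch operating on (args, kwargs) tuples. inherit_attribute is a hypothetical helper written from the description above, and the file names are invented.

```python
def inherit_attribute(parent, child):
    """Merge one rule attribute given as (unnamed list, named dict)."""
    if child is None:
        return parent  # attribute not set in child: inherit as is
    if parent is None:
        return child
    parent_args, parent_kwargs = parent
    child_args, child_kwargs = child
    # Any unnamed value in the child replaces all unnamed parent values
    args = list(child_args) if child_args else list(parent_args)
    # Named values are overridden one name at a time
    kwargs = dict(parent_kwargs)
    kwargs.update(child_kwargs)
    return args, kwargs

parent_input = (["{sample}.fq.gz"],
                {"index": "ref.idx", "adapters": "adapters.fa"})

child_named = ([], {"index": "other.idx"})
child_unnamed = (["{sample}.R1.fq.gz", "{sample}.R2.fq.gz"], {})

merged_named = inherit_attribute(parent_input, child_named)
merged_unnamed = inherit_attribute(parent_input, child_unnamed)
```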

KEYWORD = 'ymp: extends'

Comment keyword enabling inheritance

expand(rule, ruleinfo)[source]

Expands RuleInfo object and children recursively.

Will call format() (via format_annotated()) on str items encountered in the tree and wrap encountered functions to be called once the wildcards object is available.

Set ymp.print_rule = 1 before a rule: statement in snakefiles to enable debug logging of recursion.

Parameters
  • rule – The snakemake.rules.Rule object to be populated with the data from the RuleInfo object passed from item

  • item – The item to be expanded. Initially a snakemake.workflow.RuleInfo object, which is descended into recursively. May ultimately be None, str, function, int, float, dict, list or tuple.

  • expand_args – Parameters passed on late expansion (when the DAG tries to instantiate the rule into a job).

  • rec – Recursion level

get_code_line(rule)[source]

Returns the source line defining rule

Return type

str

get_super(rule, ruleinfo)[source]

Find rule parent

Parameters
  • rule (Rule) – Rule object being built

  • ruleinfo (RuleInfo) – RuleInfo object describing rule being built

Returns

name of parent rule and RuleInfo describing parent rule or (None, None).

Return type

2-Tuple

class ymp.snakemake.NamedList(fromtuple=None, **kwargs)[source]

Bases: snakemake.io.Namedlist

Extended version of Snakemake’s Namedlist

  • Fixes array assignment operator: Writing a field via [] operator updates the value accessed via . operator.

  • Adds fromtuple to constructor: Builds from Snakemake’s typical (args, kwargs) tuples as present in ruleinfo structures.

  • Adds update_tuple method: Updates values in (args,kwargs) tuples as present in ruleinfo structures.

Create the object.

Parameters
  • toclone – another Namedlist that shall be cloned

  • fromdict – a dict that shall be converted to a Namedlist (keys become names)

get_names(*args, **kwargs)[source]

Export get_names as public func

update_tuple(totuple)[source]

Update values in (args, kwargs) tuple. The tuple must be the same as used in the constructor and must not have been modified.

class ymp.snakemake.RecursiveExpander[source]

Bases: ymp.snakemake.BaseExpander

Recursively expands {xyz} wildcards in Snakemake rules.

expand(rule, ruleinfo)[source]

Recursively expand wildcards within RuleInfo object

expands_field(field)[source]

Returns true for all fields but shell:, message: and wildcard_constraints.

We don’t want to mess with the regular expressions in the fields in wildcard_constraints:, and there is little use in expanding message: or shell: as these already have all wildcards applied just before job execution (by format_wildcards()).
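
The idea of recursive expansion, including the circular reference case, can be sketched on a plain dict of string fields. This is a toy model: expand_recursive is a hypothetical function, field names are simple keys, and a ValueError stands in for CircularReferenceException.

```python
import re

FIELD = re.compile(r"\{([^{}]+)\}")

def expand_recursive(fields):
    """Replace {name} by fields[name] until only foreign wildcards remain."""

    def resolve(name, stack):
        if name in stack:
            chain = " -> ".join(stack + [name])
            raise ValueError("circular reference: " + chain)

        def substitute(match):
            ref = match.group(1)
            if ref in fields:
                return resolve(ref, stack + [name])
            return match.group(0)  # unknown wildcard: keep for later

        return FIELD.sub(substitute, fields[name])

    return {name: resolve(name, []) for name in fields}

rule_fields = {
    "dir": "out/{sample}",
    "output": "{dir}/result.txt",
}
expanded_fields = expand_recursive(rule_fields)

try:
    expand_recursive({"a": "{b}", "b": "{a}"})
    circular_detected = False
except ValueError:
    circular_detected = True
```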

class ymp.snakemake.SnakemakeExpander[source]

Bases: ymp.snakemake.BaseExpander

Expand wildcards in strings returned from functions.

Snakemake does not do this by default, leaving wildcard expansion to the functions provided themselves. Since we never want {input} to be in a string returned as a file, we expand those always.

expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

format(item, *args, **kwargs)[source]

Format item using *args and **kwargs

class ymp.snakemake.WorkflowObject(*args, **kwargs)[source]

Bases: object

Base for extension classes defined from snakefiles

This currently encompasses ymp.env.Env and ymp.stage.Stage.

This mixin sets the properties filename and lineno according to the definition source in the rules file. It also maintains a registry within the Snakemake workflow object and provides an accessor method to this registry.

property defined_in
filename

Name of file in which object was defined

Type

str

classmethod get_registry(clean=False)[source]

Return all objects of this class registered with current workflow

lineno

Line number of object definition

Type

int

classmethod new_registry()[source]
register()[source]

Add self to registry

ymp.snakemake.check_snakemake()[source]
Return type

bool

ymp.snakemake.get_workflow()[source]

Get active workflow, loading one if necessary

ymp.snakemake.load_workflow(snakefile)[source]

Load new workflow

ymp.snakemake.make_rule(name=None, lineno=None, snakefile=None, **kwargs)[source]
ymp.snakemake.networkx()[source]
ymp.snakemake.print_ruleinfo(rule, ruleinfo, func=<bound method Logger.debug of <Logger ymp.snakemake (WARNING)>>)[source]

Logs contents of Rule and RuleInfo objects.

Parameters
  • rule (Rule) – Rule object to be printed

  • ruleinfo (RuleInfo) – Matching RuleInfo object to be printed

  • func – Function used for printing (default is log.debug)

ymp.snakemake.ruleinfo_fields = {'benchmark': {'apply_wildcards': True, 'format': 'string'}, 'conda_env': {'apply_wildcards': True, 'format': 'string'}, 'container_img': {'format': 'string'}, 'docstring': {'format': 'string'}, 'func': {'format': 'callable'}, 'input': {'apply_wildcards': True, 'format': 'argstuple', 'funcparams': ('wildcards',)}, 'log': {'apply_wildcards': True, 'format': 'argstuple'}, 'message': {'format': 'string', 'format_wildcards': True}, 'norun': {'format': 'bool'}, 'output': {'apply_wildcards': True, 'format': 'argstuple'}, 'params': {'apply_wildcards': True, 'format': 'argstuple', 'funcparams': ('wildcards', 'input', 'resources', 'output', 'threads')}, 'priority': {'format': 'numeric'}, 'resources': {'format': 'argstuple', 'funcparams': ('wildcards', 'input', 'attempt', 'threads')}, 'script': {'format': 'string'}, 'shadow_depth': {'format': 'string_or_true'}, 'shellcmd': {'format': 'string', 'format_wildcards': True}, 'threads': {'format': 'int', 'funcparams': ('wildcards', 'input', 'attempt', 'threads')}, 'version': {'format': 'object'}, 'wildcard_constraints': {'format': 'argstuple'}, 'wrapper': {'format': 'string'}}

describes attributes of snakemake.workflow.RuleInfo

ymp.snakemakelexer module

ymp.snakemakelexer
class ymp.snakemakelexer.SnakemakeLexer(*args, **kwds)[source]

Bases: pygments.lexers.python.PythonLexer

name = 'Snakemake'
tokens = {'globalkeyword': [(<pygments.lexer.words object>, Token.Keyword)], 'root': [('(rule|checkpoint)((?:\\s|\\\\\\s)+)', <function bygroups.<locals>.callback>, 'rulename'), 'rulekeyword', 'globalkeyword', ('\\n', Token.Text), ('^(\\s*)([rRuUbB]{,2})("""(?:.|\\n)*?""")', <function bygroups.<locals>.callback>), ("^(\\s*)([rRuUbB]{,2})('''(?:.|\\n)*?''')", <function bygroups.<locals>.callback>), ('\\A#!.+$', Token.Comment.Hashbang), ('#.*$', Token.Comment.Single), ('\\\\\\n', Token.Text), ('\\\\', Token.Text), 'keywords', ('(def)((?:\\s|\\\\\\s)+)', <function bygroups.<locals>.callback>, 'funcname'), ('(class)((?:\\s|\\\\\\s)+)', <function bygroups.<locals>.callback>, 'classname'), ('(from)((?:\\s|\\\\\\s)+)', <function bygroups.<locals>.callback>, 'fromimport'), ('(import)((?:\\s|\\\\\\s)+)', <function bygroups.<locals>.callback>, 'import'), 'expr'], 'rulekeyword': [(<pygments.lexer.words object>, Token.Keyword)], 'rulename': [('[a-zA-Z_]\\w*', Token.Name.Class, '#pop')]}

ymp.sphinxext module

This module contains a Sphinx extension for documenting YMP stages and Snakemake rules.

The SnakemakeDomain (name sm) provides the following directives:

.. sm:rule:: name

Describes a Snakemake rule

.. sm:stage:: name

Describes a YMP Stage

Both directives accept an optional source parameter. If given, a link to the source code of the stage or rule definition will be added. The format of the string passed is filename:line. Referenced Snakefiles will be highlighted with pygments and added to the documentation when building HTML.

The extension also provides an autodoc-like directive:

.. autosnake:: filename

Generates documentation from Snakefile filename.

class ymp.sphinxext.AutoSnakefileDirective(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: docutils.parsers.rst.Directive

Implements RSt directive .. autosnake:: filename

The directive extracts docstrings from rules in snakefile and auto-generates documentation.

has_content = False

This rule does not accept content

Type

bool

load_workflow(file_path)[source]

Load the Snakefile

Return type

ExpandableWorkflow

parse_doc(doc, source, idt=0)[source]

Convert doc string to StringList

Parameters
  • doc (str) – Documentation text

  • source (str) – Source filename

  • idt (int) – Result indentation in characters (default 0)

Return type

StringList

Returns

StringList of re-indented documentation wrapped in newlines

parse_rule(rule, idt=0)[source]

Convert Rule to StringList

Parameters
  • rule (Rule) – Rule object

  • idt (int) – Result indentation in characters (default 0)

Returns

StringList containing formatted Rule documentation

Return type

StringList

parse_stage(stage, idt=0)[source]
Return type

StringList

required_arguments = 1

This rule needs one argument (the filename)

Type

int

run()[source]

Entry point

tpl_rule = '.. sm:rule:: {name}'

Template for generated Rule RSt

Type

str

tpl_source = ' :source: {filename}:{lineno}'

Template option source

Type

str

tpl_stage = '.. sm:stage:: {name}'

Template for generated Stage RSt

Type

str

ymp.sphinxext.BASEPATH = '/home/docs/checkouts/readthedocs.org/user_builds/ymp/checkouts/stable/src'

Path in which YMP package is located

Type

str

class ymp.sphinxext.DomainTocTreeCollector[source]

Bases: sphinx.environment.collectors.EnvironmentCollector

Add Sphinx Domain entries to the TOC

clear_doc(app, env, docname)[source]

Clear data from environment

If we have cached data in environment for document docname, we should clear it here.

Return type

None

get_ref(node)[source]
Return type

Optional[Node]

locate_in_toc(app, node)[source]
Return type

Optional[Node]

make_heading(node)[source]
Return type

List[Node]

merge_other(app, env, docnames, other)[source]

Merge with results from parallel processes

Called if Sphinx is processing documents in parallel. We should merge this from other into env for all docnames.

Return type

None

process_doc(app, doctree)[source]

Process doctree

This is called by read-doctree, so after the doctree has been loaded. The signal is processed in registered first order, so we are called after built-in extensions, such as the sphinx.environment.collectors.toctree extension building the TOC.

Return type

None

select_doc_nodes(doctree)[source]

Select the nodes for which entries in the TOC are desired

This is a separate method so that it might be overridden by subclasses wanting to add other types of nodes to the TOC.

Return type

List[Node]

select_toc_location(app, node)[source]

Select location in TOC where node should be referenced

Return type

Node

toc_insert(docname, tocnode, node, heading)[source]
Return type

None

class ymp.sphinxext.SnakemakeDomain(env)[source]

Bases: sphinx.domains.Domain

Snakemake language domain

clear_doc(docname)[source]

Delete objects derived from file docname

data_version = 0
directives = {'rule': <class 'ymp.sphinxext.SnakemakeRule'>, 'stage': <class 'ymp.sphinxext.YmpStage'>}
get_objects()[source]

Return an iterable of “object descriptions”.

Object descriptions are tuples with six items:

name

Fully qualified name.

dispname

Name to display when searching/linking.

type

Object type, a key in self.object_types.

docname

The document where it is to be found.

anchor

The anchor name for the object.

priority

How “important” the object is (determines placement in search results). One of:

1

Default priority (placed before full-text matches).

0

Object is important (placed before default-priority objects).

2

Object is unimportant (placed after full-text matches).

-1

Object should not show up in search at all.

initial_data = {'objects': {}}
label = 'Snakemake'
name = 'sm'
object_types = {'rule': <sphinx.domains.ObjType object>, 'stage': <sphinx.domains.ObjType object>}
resolve_xref(env, fromdocname, builder, typ, target, node, contnode)[source]

Resolve the pending_xref node with the given typ and target.

This method should return a new node, to replace the xref node, containing the contnode which is the markup content of the cross-reference.

If no resolution can be found, None can be returned; the xref node will then be given to the missing-reference event, and if that yields no resolution, replaced by contnode.

The method can also raise sphinx.environment.NoUri to suppress the missing-reference event being emitted.

roles = {'rule': <sphinx.roles.XRefRole object>, 'stage': <sphinx.roles.XRefRole object>}
class ymp.sphinxext.SnakemakeRule(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: ymp.sphinxext.YmpObjectDescription

Directive sm:rule:: describing a Snakemake rule

typename = 'rule'
class ymp.sphinxext.YmpObjectDescription(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: sphinx.directives.ObjectDescription

Base class for RSt directives in SnakemakeDomain

Since this inherits from Sphinx’ ObjectDescription, content generated by the directive will always be inside an addnodes.desc.

Parameters

source – Specify source position as file:line to create link

add_target_and_index(name, sig, signode)[source]

Add cross-reference IDs and entries to self.indexnode

Return type

None

get_index_text(typename, name)[source]

Formats object for entry into index

Return type

str

handle_signature(sig, signode)[source]

Parse rule signature sig into RST nodes and append them to signode.

The return value identifies the object and is passed to add_target_and_index() unchanged.

Parameters
  • sig (str) – Signature string (i.e. string passed after directive)

  • signode (desc) – Node created for object signature

Return type

str

Returns

Normalized signature (white space removed)

option_spec = {'source': <function unchanged>}
typename = '[object name]'
class ymp.sphinxext.YmpStage(name, arguments, options, content, lineno, content_offset, block_text, state, state_machine)[source]

Bases: ymp.sphinxext.YmpObjectDescription

Directive sm:stage:: describing a YMP stage

typename = 'stage'
ymp.sphinxext.collect_pages(app)[source]

Add Snakefiles to documentation (in HTML mode)

ymp.sphinxext.relpath(path)[source]

Make absolute path relative to BASEPATH

Parameters

path (str) – absolute path

Return type

str

Returns

path relative to BASEPATH

ymp.sphinxext.setup(app)[source]

Register the extension with Sphinx

ymp.string module

exception ymp.string.FormattingError(message, fieldname)[source]

Bases: AttributeError

class ymp.string.GetNameFormatter[source]

Bases: string.Formatter

get_names(pattern)[source]
class ymp.string.OverrideJoinFormatter[source]

Bases: string.Formatter

Formatter with overridable join method

The default formatter joins all arguments with "".join(args). This class overrides _vformat() with identical code, changing only that line to one that can be overridden by a derived class.

join(args)[source]

Joins the expanded pieces of the template string to form the output.

This function is equivalent to ''.join(args). By overriding it, alternative methods can be implemented, e.g. to create a list of strings, each corresponding to one point in the cross product of the expanded variables.

Return type

Union[List[str], str]

class ymp.string.PartialFormatter[source]

Bases: string.Formatter

Formats what it can and leaves the remainder untouched
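
One way to obtain this behaviour with the standard library is to override get_field so that unknown fields are emitted unchanged. The sketch below is written for illustration (PartialFormatterSketch is a hypothetical name) and only covers plain fields; a format spec on a missing field would be applied to the placeholder text.

```python
import string

class PartialFormatterSketch(string.Formatter):
    """Fills in known fields and re-emits unknown ones verbatim."""

    def get_field(self, field_name, args, kwargs):
        try:
            return super().get_field(field_name, args, kwargs)
        except (KeyError, IndexError, AttributeError):
            # Unknown field: return its original "{name}" text as the value
            return "{" + field_name + "}", field_name

partial = PartialFormatterSketch().format(
    "{dir}/{sample}.{ext}", dir="out", ext="fq"
)
```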

get_field(field_name, args, kwargs)[source]
class ymp.string.ProductFormatter[source]

Bases: ymp.string.OverrideJoinFormatter

String Formatter that creates a list of strings each expanded using one point in the cartesian product of all replacement values.

If none of the arguments evaluate to lists, the result is a string, otherwise it is a list.

>>> ProductFormatter().format("{A} and {B}", A=[1,2], B=[3,4])
"1 and 3"
"1 and 4"
"2 and 3"
"2 and 4"
format_field(value, format_spec)[source]
join(args)[source]

Joins the expanded pieces of the template string to form the output.

This function is equivalent to ''.join(args). By overriding it, alternative methods can be implemented, e.g. to create a list of strings, each corresponding to one point in the cross product of the expanded variables.

Return type

Union[List[str], str]
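
The cartesian product expansion described for ProductFormatter can be approximated with itertools.product. product_format is a hypothetical standalone function, not the class itself; it assumes simple field names and treats lists and tuples as multi-valued.

```python
from itertools import product
from string import Formatter

def product_format(template, **values):
    """Expand template once per combination of list-valued fields."""
    # Collect field names in order of first appearance
    names = []
    for _, field, _, _ in Formatter().parse(template):
        if field and field not in names:
            names.append(field)

    def as_choices(value):
        return list(value) if isinstance(value, (list, tuple)) else [value]

    choices = [as_choices(values[name]) for name in names]
    results = [
        template.format(**dict(zip(names, combination)))
        for combination in product(*choices)
    ]
    # A plain string is returned when no field was multi-valued
    if any(isinstance(values[name], (list, tuple)) for name in names):
        return results
    return results[0]

expanded_list = product_format("{A} and {B}", A=[1, 2], B=[3, 4])
expanded_str = product_format("{A} and {B}", A=1, B=3)
```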

class ymp.string.QuotedElementFormatter(*args, **kwargs)[source]

Bases: snakemake.utils.SequenceFormatter

class ymp.string.RegexFormatter(regex)[source]

Bases: string.Formatter

String Formatter accepting a regular expression defining the format of the expanded tags.

get_names(format_string)[source]

Get the set of field names in format_string.

Return type

Set[str]

parse(format_string)[source]

Parse format_string into tuples. Each tuple contains:

  • literal_text – text to copy

  • field_name – followed by field name

  • format_spec

  • conversion
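
For orientation, the base class string.Formatter produces tuples of the same four-element shape for the standard brace syntax. The sketch below uses only the standard library, not RegexFormatter, whose field syntax is set by the regular expression passed to it. The helper field_names is a hypothetical name showing how a set of field names can be derived from the parsed tuples.

```python
import string

# Each tuple is (literal_text, field_name, format_spec, conversion).
parsed = list(string.Formatter().parse("reads/{sample:>8}_{pair!r}.fq"))
for item in parsed:
    print(item)


def field_names(format_string):
    """Return the set of named replacement fields in format_string."""
    return {
        name
        for _literal, name, _spec, _conv in string.Formatter().parse(format_string)
        if name  # skips trailing literal text (None) and anonymous "{}" fields
    }


print(field_names("{dir}/{sample}.{pair}.fq.gz"))
```

Trailing literal text is reported as a final tuple whose field_name, format_spec and conversion are all None.
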

ymp.string.make_formatter(product=None, regex=None, partial=None, quoted=None)[source]

ymp.util module

ymp.util.R(code='', **kwargs)[source]

Execute R code

This function executes the R code given as a string. Additional arguments are injected into the R environment. The value of the last R statement is returned.

The function requires rpy2 to be installed.

Parameters
  • code (str) – R code to be executed

  • **kwargs (dict) – variables to inject into R globalenv

Returns

value of last R statement

>>> R("1*1", input=input)
ymp.util.Rmd(rmd, out, **kwargs)[source]
ymp.util.activate_R()[source]
ymp.util.fasta_names(fasta_file)[source]
ymp.util.file_not_empty(fn)[source]

Checks whether a file is not empty, accounting for the gz minimum size of 20 bytes.
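
A size-based check of this kind might look like the sketch below. The function looks_non_empty is a hypothetical name, and both the strict comparison and the reliance on the ".gz" file extension are assumptions, not confirmed details of ymp.util.file_not_empty.

```python
import os


def looks_non_empty(path, gz_min_size=20):
    """Hypothetical sketch: size check with a higher threshold for .gz files."""
    size = os.path.getsize(path)
    if path.endswith(".gz"):
        # A gzip file carries header and trailer bytes even without payload,
        # so a size at or below the threshold is treated as empty.
        return size > gz_min_size
    return size > 0
```
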

ymp.util.filter_out_empty(*args)[source]

Removes empty sets of files from input file lists.

Takes a variable number of file lists of equal length and removes indices where any of the files is empty. Strings are converted to lists of length 1.

Returns a generator tuple.

Example: r1, r2 = filter_out_empty(input.r1, input.r2)
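
The index-wise filtering described above can be sketched as follows. This is an illustration under assumptions, not the YMP source: drop_empty_columns is a hypothetical name, emptiness is judged by a plain size check, and the result is a tuple of lists instead of generators.

```python
import os


def drop_empty_columns(*file_lists):
    """Hypothetical sketch: keep only indices where every file has content."""
    # A single file name is treated as a list of length 1.
    lists = [[fl] if isinstance(fl, str) else list(fl) for fl in file_lists]
    keep = [
        idx
        for idx, group in enumerate(zip(*lists))
        if all(os.path.getsize(path) > 0 for path in group)
    ]
    # One filtered list per input list, in the original argument order.
    return tuple([files[idx] for idx in keep] for files in lists)
```

Filtering all lists by the same indices keeps paired files, such as forward and reverse reads, aligned after the empty entries are dropped.
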

ymp.util.glob_wildcards(pattern, files=None)[source]

Glob the values of the wildcards by matching the given pattern to the filesystem. Returns a named tuple with a list of values for each wildcard.
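
The wildcard matching step can be illustrated with a regular expression over an explicit list of paths. The sketch below is an assumption-laden stand-in, not the YMP or Snakemake implementation: match_wildcards is a hypothetical name, it does not touch the filesystem, and every wildcard is assumed to match one or more arbitrary characters.

```python
import re
from collections import namedtuple


def match_wildcards(pattern, files):
    """Hypothetical sketch: collect wildcard values from a list of paths."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    names, regex = [], ""
    for idx, part in enumerate(parts):
        if idx % 2 == 0:
            regex += re.escape(part)      # literal text between wildcards
        elif part in names:
            regex += f"(?P={part})"       # repeated wildcard must match again
        else:
            names.append(part)
            regex += f"(?P<{part}>.+)"    # first occurrence captures a value
    values = {name: [] for name in names}
    for path in files:
        match = re.fullmatch(regex, path)
        if match:
            for name in names:
                values[name].append(match.group(name))
    return namedtuple("Wildcards", names)(**values)


found = match_wildcards(
    "{sample}.{pair}.fq.gz", ["A.R1.fq.gz", "A.R2.fq.gz", "notes.txt"]
)
print(found.sample, found.pair)
```
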

ymp.util.is_fq(path)[source]
ymp.util.make_local_path(icfg, url)[source]
ymp.util.read_propfiles(files)[source]

ymp.yaml module

class ymp.yaml.AttrItemAccessMixin[source]

Bases: object

Mixin class mapping dot to bracket access

Added to classes implementing __getitem__, __setitem__ and __delitem__, this mixin will allow accessing items using dot notation. I.e. “object.xyz” is translated to “object[xyz]”.
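
A mixin of this kind can be sketched in a few lines. The classes below are hypothetical illustrations (AttrItemAccess and Settings are invented names) and may differ from AttrItemAccessMixin in how errors and private attributes are handled.

```python
class AttrItemAccess:
    """Hypothetical sketch of a dot-to-bracket access mixin."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value

    def __delattr__(self, name):
        del self[name]


class Settings(AttrItemAccess, dict):
    """A dict whose items can also be reached as attributes."""


cfg = Settings(threads=4)
cfg.memory = "8G"        # stored as cfg["memory"]
print(cfg.threads, cfg["memory"])
```

Converting KeyError into AttributeError keeps hasattr() and getattr() with a default working as expected on such objects.
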

exception ymp.yaml.LayeredConfAccessError[source]

Bases: ymp.yaml.LayeredConfError, KeyError, IndexError

Can’t access

exception ymp.yaml.LayeredConfError[source]

Bases: Exception

Error in LayeredConf

class ymp.yaml.LayeredConfProxy(maps, parent=None, key=None)[source]

Bases: ymp.yaml.MultiMapProxy

Layered configuration

save(outstream=None, layer=0)[source]
exception ymp.yaml.LayeredConfWriteError[source]

Bases: ymp.yaml.LayeredConfError

Can’t write

exception ymp.yaml.MixedTypeError[source]

Bases: Exception

Mixed types in proxy collection

class ymp.yaml.MultiMapProxy(maps, parent=None, key=None)[source]

Bases: collections.abc.Mapping, ymp.yaml.MultiProxy, ymp.yaml.AttrItemAccessMixin

Mapping Proxy for layered containers

get(k[, d]) → D[k] if k in D, else d. d defaults to None.[source]
items() → a set-like object providing a view on D’s items[source]
keys() → a set-like object providing a view on D’s keys[source]
values() → an object providing a view on D’s values[source]
class ymp.yaml.MultiMapProxyItemsView(mapping)[source]

Bases: ymp.yaml.MultiMapProxyMappingView, collections.abc.ItemsView

ItemsView for MultiMapProxy

class ymp.yaml.MultiMapProxyKeysView(mapping)[source]

Bases: ymp.yaml.MultiMapProxyMappingView, collections.abc.KeysView

KeysView for MultiMapProxy

class ymp.yaml.MultiMapProxyMappingView(mapping)[source]

Bases: collections.abc.MappingView

MappingView for MultiMapProxy

class ymp.yaml.MultiMapProxyValuesView(mapping)[source]

Bases: ymp.yaml.MultiMapProxyMappingView, collections.abc.ValuesView

ValuesView for MultiMapProxy

class ymp.yaml.MultiProxy(maps, parent=None, key=None)[source]

Bases: object

Base class for layered container structure

add_layer(name, container)[source]
get_files()[source]
get_linenos()[source]
make_map_proxy(key, items)[source]
make_seq_proxy(key, items)[source]
remove_layer(name)[source]
to_yaml(show_source=False)[source]
class ymp.yaml.MultiSeqProxy(maps, parent=None, key=None)[source]

Bases: collections.abc.Sequence, ymp.yaml.MultiProxy, ymp.yaml.AttrItemAccessMixin

Sequence Proxy for layered containers

extend(item)[source]
ymp.yaml.load(files)[source]

Load configuration files

Creates a LayeredConfProxy configuration object from a set of YAML files.
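
As a rough analogy for layered lookup, the standard library's collections.ChainMap resolves keys across several mappings without merging them. The sketch below is not LayeredConfProxy: plain dicts stand in for parsed YAML files, and the precedence order shown is ChainMap's, which is an assumption with respect to YMP.

```python
from collections import ChainMap

# Hypothetical configuration layers standing in for parsed YAML files.
defaults = {"threads": 1, "tmpdir": "/tmp"}
project = {"threads": 8}

# ChainMap searches its maps left to right, so entries in `project`
# shadow those in `defaults` without copying or modifying either dict.
conf = ChainMap(project, defaults)
print(conf["threads"], conf["tmpdir"])
```
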

Indices and tables