ymp.stage package¶

YMP processes data in stages, each of which is contained in its own directory.

with Stage("trim_bbmap") as S:
  S.doc("Trim reads with BBMap")
  rule bbmap_trim:
    output: "{:this:}/{sample}{:pairnames:}.fq.gz"
    input:  "{:prev:}/{sample}{:pairnames:}.fq.gz"
    ...

Submodules¶

ymp.stage.base module¶

Base classes for all Stage types

class ymp.stage.base.Activateable(*args, **kwargs)[source]¶

Bases: object

Mixin for Stages that can be filled with rules from Snakefiles.

add_rule(rule, workflow)[source]¶

Return type: None

check_active_stage(name)[source]¶

Return type: None

static get_active()[source]¶

Return type: BaseStage

register_inout(name, target, item)[source]¶

Determine stage input/output file type from prev/this filename

Detects patterns like “PREFIX{: NAME :}/INFIX{TARGET}.EXT”. Also checks if there is an active stage.

Parameters

name (str) – The NAME
target (Set) – Set to which to add the type
item (str) – The filename

Return type

None

Returns

Normalized output pattern
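As a rough illustration of the pattern detection described above, the following sketch pulls the "/INFIX{TARGET}.EXT" part out of a filename that references a given NAME. The helper name detect_type and its regular expression are assumptions made for this example; they are not taken from the YMP sources.

```python
import re


def detect_type(name, item):
    """Return the '/INFIX{TARGET}.EXT' part of item, or None.

    Hypothetical helper: the regular expression below is an assumption
    made for illustration, not the expression used by YMP.
    """
    pattern = re.compile(
        r"\{:\s*" + re.escape(name) + r"\s*:\}"  # the "{: NAME :}" marker
        r"(/[^{}]*\{[^{}:,]+\}\.\S+)"            # "/INFIX{TARGET}.EXT"
    )
    match = pattern.search(item)
    return match.group(1) if match else None


# Collect detected types in a set, as the `target` parameter suggests.
outputs = set()
found = detect_type("this", "{:this:}/{sample}.fq.gz")
if found:
    outputs.add(found)
```

Filenames without a matching "{: NAME :}" marker yield None in this sketch, so nothing is added to the set.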

rules: List[snakemake.rules.Rule]¶: Rules in this stage

static set_active(stage)[source]¶

Return type: None

class ymp.stage.base.BaseStage(name)[source]¶

Bases: object

Base class for stage types

altname¶: Alternative name

can_provide(inputs)[source]¶

Determines which of inputs this stage can provide.

Returns a dictionary whose keys are a subset of inputs and whose values identify redirections. An empty string indicates that no redirection takes place. Otherwise, the string is the suffix to be appended to the prior StageStack.

Return type: Dict[str, str]

doc(doc)[source]¶

Add documentation to Stage

Parameters: doc (str) – Docstring passed to Sphinx
Return type: None

docstring: Optional[str]¶: The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack, output_types=None)[source]¶

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type: List[str]

get_group(stack, default_groups)[source]¶

Determine output grouping for stage

Parameters

stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_value=None)[source]¶

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters

groups (List[str]) – Set of columns the values of which should form IDs
match_value (Optional[str]) – Limit output to rows with this value
match_groups (Optional[List[str]]) – … in these groups

Return type

List[str]

get_inputs()[source]¶

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

Return type: Set[str]

get_outputs(path)[source]¶

Returns a dictionary of outputs

Return type: Dict[str, str]

get_path(stack)[source]¶

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

Return type: str

has_checkpoint()[source]¶

Return type: bool

match(name)[source]¶

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type: bool

modify_next_group(_stack)[source]¶

name¶: The name of the stage is a string uniquely identifying it among all stages.

property outputs¶

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type: Union[Set[str], Dict[str, str]]

class ymp.stage.base.ConfigStage(name, cfg)[source]¶

Bases: ymp.stage.base.BaseStage

Base for stages created via configuration

These Stages derive from the ymp.yml and not from a rules file.

cfg¶: The configuration object defining this Stage.

property defined_in¶

List of files defining this stage

Used to invalidate caches.

docstring: Optional[str]¶: The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

filename¶: Semicolon-separated list of file names defining this Stage.

lineno¶: Line number within the first file at which this Stage is defined.

ymp.stage.expander module¶

class ymp.stage.expander.StageExpander[source]¶

Bases: ymp.snakemake.ColonExpander

Registers rules with stages when they are created

class Formatter(expander)[source]¶

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(key, args, kwargs)[source]¶

get_value_(key, args, kwargs)[source]¶

regroup = re.compile('(?<!{){\\s*([^{}\\s]+)\\s*}(?!})')¶
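To make the expression above easier to read, this short sketch applies it to a sample string. The pattern is copied from the regroup attribute; the sample string is made up for illustration.

```python
import re

# Pattern copied from the `regroup` attribute documented above.
regroup = re.compile(r"(?<!{){\s*([^{}\s]+)\s*}(?!})")

# Made-up input: a plain wildcard, an escaped (doubled) brace pair,
# and a wildcard padded with whitespace.
text = "{sample}.{{escaped}}.{ target }"
names = regroup.findall(text)
print(names)
```

Single-brace names are captured with surrounding whitespace stripped, while the doubled braces are skipped by the lookbehind and lookahead.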

expand_ruleinfo(rule, item, expand_args, rec)[source]¶

expand_str(rule, item, expand_args, rec, cb)[source]¶

expands_field(field)[source]¶

Checks if this expander should expand a Rule field type

Parameters: field – the field to check
Returns: True if field should be expanded.

ymp.stage.groupby module¶

Implements forward grouping

Grouping allows processing multiple input datasets at once, such as in a co-assembly. It is initiated by adding the virtual stage “group_<COL>” directly before the stage that should be grouping its output. “<COL>” may be a project data column, in which case all data for which column COL shares a value will be combined, or “ALL”, which combines all samples. The output filename prefix will be either the column value or “ALL”.

>>> ymp make mock.group_sample.assemble_megahit
>>> ymp make mock.group_ALL.assemble_megahit

Subsequent stages will use the most fine-grained grouping required by their input data.
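The effect of grouping by a column versus "ALL" can be sketched in plain Python. The function group_samples and the column names below are hypothetical and only mirror the behaviour described above; YMP's own implementation may differ.

```python
from collections import defaultdict


def group_samples(rows, column):
    """Group sample names by a column value, or lump them under "ALL"."""
    groups = defaultdict(list)
    for row in rows:
        key = "ALL" if column == "ALL" else row[column]
        groups[key].append(row["sample"])
    return dict(groups)


# Hypothetical project data; the column names are made up for this sketch.
rows = [
    {"sample": "s1", "subject": "A"},
    {"sample": "s2", "subject": "A"},
    {"sample": "s3", "subject": "B"},
]
by_subject = group_samples(rows, "subject")
everything = group_samples(rows, "ALL")
```

The dictionary keys correspond to the output filename prefixes: one per column value, or the single prefix "ALL".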

# FIXME: How to avoid re-specifying groupby?

class ymp.stage.groupby.GroupBy(name)[source]¶

Bases: ymp.stage.base.BaseStage

Virtual stage for grouping

PREFIX = 'group_'¶

docstring: Optional[str]¶: The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_group(stack, default_groups)[source]¶

Determine output grouping for stage

Parameters

stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

match(name)[source]¶

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type: bool

modify_next_group(stack)[source]¶

Return type: List[str]

ymp.stage.params module¶

class ymp.stage.params.Param(stage, key, name, value=None, default=None)[source]¶

Bases: abc.ABC

Stage Parameter (base class)

property constraint¶

format(groupdict)[source]¶

classmethod make(stage, typ, key, name, value, default)[source]¶

Return type: Param

parse(wildcards, nodefault=False)[source]¶

pattern(show_constraint=True)[source]¶

String to add to filenames passed to Snakemake

I.e. a pattern of the form {wildcard,constraint}

regex: str = NotImplemented¶

type_name: str = NotImplemented¶: Name of type, must be overwritten by children

types: Dict[str, Type[ymp.stage.params.Param]] = {'choice': <class 'ymp.stage.params.ParamChoice'>, 'flag': <class 'ymp.stage.params.ParamFlag'>, 'int': <class 'ymp.stage.params.ParamInt'>, 'ref': <class 'ymp.stage.params.ParamRef'>}¶: Type/Class mapping for param types

property wildcard¶

class ymp.stage.params.ParamChoice(*args, **kwargs)[source]¶

Bases: ymp.stage.params.Param

Stage Choice Parameter

type_name: str = 'choice'¶: Name of type, must be overwritten by children

class ymp.stage.params.ParamFlag(*args, **kwargs)[source]¶

Bases: ymp.stage.params.Param

Stage Flag Parameter

format(groupdict)[source]¶

parse(wildcards)[source]¶: Returns function that will extract parameter value from wildcards

type_name: str = 'flag'¶: Name of type, must be overwritten by children

class ymp.stage.params.ParamInt(*args, **kwargs)[source]¶

Bases: ymp.stage.params.Param

Stage Int Parameter

type_name: str = 'int'¶: Name of type, must be overwritten by children

class ymp.stage.params.ParamRef(stage, key, name, value=None, default=None)[source]¶

Bases: ymp.stage.params.Param

Reference Choice Parameter

property regex¶

type_name: str = 'ref'¶: Name of type, must be overwritten by children

class ymp.stage.params.Parametrizable(*args, **kwargs)[source]¶

Bases: ymp.stage.base.BaseStage

add_param(key, typ, name, value=None, default=None)[source]¶

Add parameter to stage

Example

>>> with Stage("test") as S:
>>>   S.add_param("N", "int", "nval", default=50)
>>>   rule:
>>>      shell: "echo {param.nval}"

This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.

Parameters

key – The character to use in the Stage name
typ – The type of the parameter (int, flag, choice or ref)
name – Name of parameter in params
value – value {param.xyz} should be set to if param given
default – default value for {param.xyz} if no param given

Return type

bool
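How a name such as "testN123" could be split into stage name and parameter value is sketched below. The function parse_stage_name and its regular expression are assumptions for illustration only; they show the idea of a suffix modifier with a default, not YMP's actual parsing code.

```python
import re


def parse_stage_name(name, stage="test", key="N", default=50):
    """Extract an integer parameter from a name like "testN123".

    Hypothetical helper: returns None if the name does not refer to
    the stage, the parsed value if the key is present, and the
    default otherwise.
    """
    pattern = re.escape(stage) + r"(?:" + re.escape(key) + r"(\d+))?"
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    value = match.group(1)
    return int(value) if value is not None else default


plain = parse_stage_name("test")
with_param = parse_stage_name("testN123")
```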

docstring: Optional[str]¶: The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

format(groupdict)[source]¶

match(name)[source]¶

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type: bool

property params¶

parse(name)[source]¶

Return type: Dict[str, str]

property regex¶

ymp.stage.pipeline module¶

Pipelines Module

Contains classes for pre-configured pipelines comprising multiple stages.

class ymp.stage.pipeline.Pipeline(name, cfg)[source]¶

Bases: ymp.stage.params.Parametrizable, ymp.stage.base.ConfigStage

A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.

Pipelines are configured via ymp.yml.

Example

pipelines:
  my_pipeline:
    hide: false
    params:
      length:
        key: L
        type: int
        default: 20
    stages:
      - stage_1{length}:
          hide: true
      - stage_2
      - stage_3

can_provide(inputs)[source]¶

Determines which of inputs this stage can provide.

The result dictionary values will point to the “real” output.

Return type: Dict[str, str]

docstring: Optional[str]¶: The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack)[source]¶

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_group(stack, default_groups)[source]¶

Determine output grouping for stage

Parameters

stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, mygroups=None, target=None)[source]¶

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters

groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups

get_path(stack, typ=None)[source]¶

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

hide_outputs¶: If true, outputs of stages are hidden by default

property outputs¶

The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.

Return type: Dict[str, str]
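The override rule described above amounts to a dictionary merge in pipeline order. In this sketch the function merge_outputs, the stage names and the file patterns are all made up, and mapping each output type to the name of the providing stage is an assumption about what the dictionary values represent.

```python
def merge_outputs(stage_outputs):
    """Merge per-stage outputs; later stages override earlier ones."""
    merged = {}
    for stage_name, outputs in stage_outputs:
        for output_type in outputs:
            merged[output_type] = stage_name  # later assignment wins
    return merged


# Hypothetical pipeline of three stages, listed in execution order.
stages = [
    ("trim", {"/{sample}.fq.gz"}),
    ("assemble", {"/{sample}.fasta.gz"}),
    ("polish", {"/{sample}.fasta.gz"}),
]
merged = merge_outputs(stages)
```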

property params¶

pipeline¶: Path fragment describing this pipeline

stages¶: Dictionary of stages with configuration options for each

ymp.stage.project module¶

This module defines “Project”, a Stage type defined by a project matrix file giving units and meta data for input files.

class ymp.stage.project.PandasTableBuilder[source]¶

Bases: object

Builds the data table describing each sample in a project

This class implements loading and combining tabular data files as specified by the YAML configuration.

Format:

string items are files
lists of files are concatenated top to bottom
dicts must have one “command” value:
- ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers
- ‘table’ contains a list of one-item dicts of the form key:value[,value...]; an in-place table is created from the keys (a list of dicts is necessary as dicts are unordered)
- ‘paste’ contains a list of tables pasted left to right; tables pasted must be of equal length or length 1
if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD

Example

- top.csv
- join:
  - excel.xlsx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
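The ‘table’ command from the example above can be sketched without pandas. The function build_table is hypothetical and only demonstrates how a list of one-item dicts with comma-separated values maps to rows; the real builder works on pandas data frames and may treat edge cases differently.

```python
def build_table(items):
    """Turn a list of one-item dicts into a list of row dicts."""
    columns = {}
    for item in items:
        (key, values), = item.items()  # each dict holds exactly one pair
        columns[key] = [v.strip() for v in values.split(",")]
    lengths = {len(vals) for vals in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


table = build_table([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
    {"fq2": "s1.2.fq, s2.2.fq, s3.2.fq"},
])
```

Each entry of the list contributes one column, and the order of the list fixes the column order.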

load_data(cfg, key)[source]¶

class ymp.stage.project.Project(name, cfg)[source]¶

Bases: ymp.stage.base.ConfigStage

Contains configuration for a source dataset to be processed

KEY_BCCOL = 'barcode_col'¶

KEY_DATA = 'data'¶

KEY_IDCOL = 'id_col'¶

KEY_READCOLS = 'read_cols'¶

RE_FILE = re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')¶

RE_REMOTE = re.compile('^(?:https?|ftp|sftp)://(?:.*)')¶

RE_SRR = re.compile('^[SED]RR[0-9]+$')¶
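The three expressions above can be tried out on a few made-up values. The patterns are copied from the class attributes; the function classify and the order in which it applies the patterns are assumptions for this sketch and do not necessarily reflect how Project uses them.

```python
import re

# Patterns copied from the class attributes listed above.
RE_FILE = re.compile(r"^(?!http://).*(?:fq|fastq)(?:|\.gz)$")
RE_REMOTE = re.compile(r"^(?:https?|ftp|sftp)://(?:.*)")
RE_SRR = re.compile(r"^[SED]RR[0-9]+$")


def classify(value):
    """Label a table cell as 'srr', 'remote', 'file' or None.

    The order of the checks is an assumption made for this sketch.
    """
    if RE_SRR.match(value):
        return "srr"
    if RE_REMOTE.match(value):
        return "remote"
    if RE_FILE.match(value):
        return "file"
    return None


labels = [classify(v) for v in
          ["SRR123456", "ftp://host/reads.fastq.gz", "data/s1.1.fq.gz", "notes.txt"]]
```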

choose_fq_columns()[source]¶: Configures the columns referencing the fastq sources

choose_id_column()[source]¶

Configures column to use as index on runs

If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
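The selection logic described above can be sketched on plain lists of dicts. The function below is a simplified, hypothetical stand-in that takes its data as arguments; the real method operates on the project's own data and configuration.

```python
def choose_id_column(columns, rows, configured=None):
    """Pick the ID column: the configured one, else the leftmost unique."""

    def is_unique(col):
        values = [row[col] for row in rows]
        return len(set(values)) == len(values)

    if configured is not None:
        if configured not in columns:
            raise KeyError(f"column {configured!r} not found")
        if not is_unique(configured):
            raise ValueError(f"column {configured!r} is not unique")
        return configured
    for col in columns:  # scan left to right
        if is_unique(col):
            return col
    raise ValueError("no unique column found")


# Hypothetical project table; column names are made up.
columns = ["subject", "sample", "fq1"]
rows = [
    {"subject": "A", "sample": "s1", "fq1": "s1.1.fq"},
    {"subject": "A", "sample": "s2", "fq1": "s2.1.fq"},
]
idcol = choose_id_column(columns, rows)
```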

property data¶

Pandas dataframe of runs

Lazy loading property, first call may take a while.

do_get_ids(_stack, groups, match_groups=None, match_values=None)[source]¶

docstring: Optional[str]¶: The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

encode_barcode_path(barcode_file, run, pair)[source]¶

property fq_names¶: Names of all FastQ files

property fwd_fq_names¶: Names of forward FastQ files (se and pe)

property fwd_pe_fq_names¶: Names of forward FastQ files part of pair

get_all_targets(stack, output_types=None)[source]¶

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type: List[str]

get_fq_names(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]¶: Get pipeline names of fq files

get_group(stack, default_groups)[source]¶

Determine output grouping for stage

Parameters

stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_values=None)[source]¶

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters

groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups

property idcol¶

iter_samples(variables=None)[source]¶

minimize_variables(groups)[source]¶: Removes redundant groupings

property outputs¶

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type: Union[Set[str], Dict[str, str]]

property pe_fq_names¶: Names of paired end FastQ files

property project_name¶

raw_reads_source_path(args, kwargs)[source]¶

property rev_pe_fq_names¶: Names of reverse FastQ files part of pair

property runs¶

Pandas dataframe index of runs

Lazy loading property, first call may take a while.

property se_fq_names¶: Names of single end FastQ files

property source_cfg¶

source_path(target, pair, nosplit=False)[source]¶: Get path for FQ file for run and pair

unsplit_path(barcode_id, pairname)[source]¶

property variables¶

class ymp.stage.project.SQLiteProjectData(cfg, key, name='data')[source]¶

Bases: object

columns()[source]¶

property db_url¶

dump()[source]¶

duplicate_rows(column)[source]¶

fetch(cols, idcols=None, values=None)[source]¶

Return type: List[List[str]]

groupby_dedup(cols)[source]¶

identifying_columns()[source]¶

property nrows¶

query(*args)[source]¶

rows(col)[source]¶

string_columns()[source]¶

ymp.stage.reference module¶

class ymp.stage.reference.Archive(name, dirname, tar, url, strip, files)[source]¶

Bases: object

dirname = None¶

files = None¶

get_files()[source]¶

hash = None¶

make_unpack_rule(baserule)[source]¶

name = None¶

strip_components = None¶

tar = None¶

class ymp.stage.reference.Reference(name, cfg)[source]¶

Bases: ymp.stage.base.Activateable, ymp.stage.base.ConfigStage

Represents (remote) reference file/database configuration

add_resource(rsc)[source]¶

files: Dict[str, str]¶: Files provided by the reference. Keys are the file names within ymp (“target.extension”), symlinked into dir.ref/ref_name/ and values are the path to the reference file from workspace root.

get_all_targets(stack)[source]¶

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type: List[str]

get_file(filename)[source]¶

get_group(stack, default_groups)[source]¶

Determine output grouping for stage

Parameters

stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_value=None)[source]¶

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters

groups (List[str]) – Set of columns the values of which should form IDs
match_value (Optional[str]) – Limit output to rows with this value
match_groups (Optional[List[str]]) – … in these groups

Return type

List[str]

get_path(_stack)[source]¶

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

make_unpack_rules(baserule)[source]¶

property outputs¶

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type: Union[Set[str], Dict[str, str]]

prev(args=None, kwargs=None)[source]¶

rules: List[snakemake.rules.Rule]¶: Rules in this stage

this(args=None, kwargs=None)[source]¶

ymp.stage.stack module¶

Implements the StageStack

class ymp.stage.stack.StageStack(path)[source]¶

Bases: object

The “head” of a processing chain - a stack of stages

all_targets()[source]¶

complete(incomplete)[source]¶

debug = False¶: Set to true to enable additional Stack debug logging

property defined_in¶

get_ids(select_cols, where_cols=None, where_vals=None)[source]¶

group: List[str]¶: Grouping in effect for this StageStack. An empty list groups into one pseudo target, ‘ALL’.

classmethod instance(path)[source]¶

Cached access to StageStack

Parameters

path – Stage path
stage – Stage object at head of stack
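Cached access of this kind is commonly implemented as a class method backed by a dictionary keyed on the path. The class CachedStack and its _cache attribute below are hypothetical and serve only to illustrate the idea; they are not the StageStack implementation.

```python
class CachedStack:
    """Toy stand-in for a class with one cached instance per path."""

    _cache = {}  # hypothetical cache, maps path -> instance

    def __init__(self, path):
        self.path = path

    @classmethod
    def instance(cls, path):
        """Return the cached object for path, creating it on first use."""
        if path not in cls._cache:
            cls._cache[path] = cls(path)
        return cls._cache[path]


first = CachedStack.instance("mock.trim_bbmap")
second = CachedStack.instance("mock.trim_bbmap")
other = CachedStack.instance("mock")
```

Repeated requests for the same path return the same object, so any work done in the constructor happens only once per path.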

name¶: Name of stack, i.e. its full path

property path¶: On disk location of files provided by this stack

prev(_args=None, kwargs=None)[source]¶

Directory of previous stage

Return type: StageStack

prev_stage¶: Stage below top stage or None if first in stack

prevs¶: Mapping of each input type required by the stage of this stack to the prefix stack providing it.

project¶: Project on which stack operates. This is needed for grouping variables currently.

resolve_prevs()[source]¶

show_info()[source]¶

stage¶: Top Stage

stage_name¶: Top Stage Name

stage_names¶: Names of stages on stack

stages¶: Stages on stack

target(args, kwargs)[source]¶: Determines the IDs for a given input data type and output ID (replaces “{:target:}”).

property targets¶: Determines the IDs to be built by this Stage Stack (replaces “{:targets:}”).

used_stacks = {}¶

ymp.stage.stack.find_stage(name)[source]¶

ymp.stage.stack.norm_wildcards(pattern)[source]¶

ymp.stage.stage module¶

Implements the “Stage”

At its most basic, a “Stage” is a set of Snakemake rules that share an output folder.

class ymp.stage.stage.Stage(name, altname=None, env=None, doc=None)[source]¶

Bases: ymp.snakemake.WorkflowObject, ymp.stage.params.Parametrizable, ymp.stage.base.Activateable, ymp.stage.base.BaseStage

Creates a new stage

While entered using with, several stage specific variables are expanded within rules:

{:this:} – The current stage directory
{:that:} – The alternate output stage directory
{:prev:} – The previous stage’s directory
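The following toy expander shows what replacing such variables in a filename pattern might look like. The regular expression, the function expand and the directory names are all assumptions for illustration; the actual expansion is performed by YMP's expander classes and is more involved.

```python
import re

# Hypothetical pattern for "{:name:}" expressions, allowing inner spaces.
COLON_EXPR = re.compile(r"\{:\s*(\w+)\s*:\}")


def expand(pattern, values):
    """Replace known "{:name:}" expressions, leaving everything else as is."""

    def replace(match):
        name = match.group(1)
        return values.get(name, match.group(0))

    return COLON_EXPR.sub(replace, pattern)


# Made-up directories for the current and previous stage.
values = {"this": "mock.trim_bbmap", "prev": "mock"}
output = expand("{:this:}/{sample}.fq.gz", values)
source = expand("{: prev :}/{sample}.fq.gz", values)
```

Ordinary Snakemake wildcards such as {sample} pass through untouched in this sketch, as do colon expressions without a known value.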

Parameters

name (str) – Name of this stage
altname (Optional[str]) – Alternate name of this stage (used for stages with multiple output variants, e.g. filter_x and remove_x).
doc (Optional[str]) – See doc()
env (Optional[str]) – See env()

altname: str¶: Alternative stage name (deprecated)

bin(_args=None, kwargs=None)[source]¶: Dynamic ID for splitting stages

checkpoints: Dict[str, Set[str]]¶: Checkpoints in this stage

env(name)[source]¶

Add package specifications to Stage environment

Note

This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types that do not support conda environments.

Parameters: name (str) – Environment name or filename

>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>    S.env("blast")
>>>    rule testing:
>>>       ...

>>> with Stage("test", env="blast") as S:
>>>    rule testing:
>>>       ...

>>> with Stage("test") as S:
>>>    rule testing:
>>>       conda: "blast"
>>>       ...

Return type: None

get_all_targets(stack)[source]¶

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_checkpoint_ids(stack, mygroup, target)[source]¶

get_group(stack, default_groups)[source]¶

Determine output grouping for stage

Parameters

stack – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, mygroups=None, target=None)[source]¶

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters

groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups

get_inputs()[source]¶

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

has_checkpoint()[source]¶

Return type: bool

match(name)[source]¶

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

property outputs¶

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type: Set[str]

prev(_args, kwargs)[source]¶

Gathers {:prev:} calls from rules

Here, input requirements for each stage are collected.

Return type: None

require(**kwargs)[source]¶

Override inferred stage inputs

In theory, this should not be needed. But it’s simpler for now.

requires¶: Contains override stage inputs

satisfy_inputs(other_stage, inputs)[source]¶

Return type: Dict[str, str]

that(_args=None, kwargs=None)[source]¶

Alternate directory of current stage

Used for splitting stages

this(args=None, kwargs=None)[source]¶

Replaces {:this:} in rules

Also gathers output capabilities of each stage.

wc2path(wc)[source]¶