ymp.stage package

YMP processes data in stages, each of which is contained in its own directory.

with Stage("trim_bbmap") as S:
  S.doc("Trim reads with BBMap")
  rule bbmap_trim:
    output: "{:this:}/{sample}{:pairnames:}.fq.gz"
    input:  "{:prev:}/{sample}{:pairnames:}.fq.gz"
    ...

Submodules

ymp.stage.base module

Base classes for all Stage types

class ymp.stage.base.Activateable(*args, **kwargs)[source]

Bases: object

Mixin for Stages that can be filled with rules from Snakefiles.

add_rule(rule, workflow)[source]
Return type

None

check_active_stage(name)[source]
Return type

None

static get_active()[source]
Return type

BaseStage

register_inout(name, target, item)[source]

Determine stage input/output file type from prev/this filename

Detects patterns like “PREFIX{: NAME :}/INFIX{TARGET}.EXT”. Also checks if there is an active stage.

Parameters
  • name (str) – The NAME

  • target (Set) – Set to which to add the type

  • item (str) – The filename

Return type

None

Returns

Normalized output pattern
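
The pattern shape described above can be illustrated with a small regular expression. This is a minimal sketch, not YMP's actual implementation; the expression and the helper name split_stage_pattern are assumptions made for this example.

```python
import re

# Hypothetical regex for "PREFIX{: NAME :}/INFIX{TARGET}.EXT"; not taken from YMP.
STAGE_PATTERN = re.compile(
    r"^(?P<prefix>.*?)\{:\s*(?P<name>\w+)\s*:\}/"
    r"(?P<infix>[^{}]*)\{(?P<target>\w+)\}(?P<ext>.*)$"
)


def split_stage_pattern(item):
    """Return the named parts of a stage filename pattern, or None."""
    match = STAGE_PATTERN.match(item)
    return match.groupdict() if match else None


parts = split_stage_pattern("{:this:}/{sample}.fq.gz")
```

Here parts holds the NAME ("this"), the TARGET ("sample") and the extension (".fq.gz"); a filename without a stage reference yields None.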

rules: List[snakemake.rules.Rule]

Rules in this stage

static set_active(stage)[source]
Return type

None

class ymp.stage.base.BaseStage(name)[source]

Bases: object

Base class for stage types

altname

Alternative name

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

Returns a dictionary with the keys a subset of inputs and the values identifying redirections. An empty string indicates that no redirection is to take place. Otherwise, the string is the suffix to be appended to the prior StageStack.

Return type

Dict[str, str]
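
The return value described above might look as follows. This is a standalone sketch under stated assumptions: the free function, the file-type keys and the suffix value are hypothetical and only mirror the documented shape of the result.

```python
from typing import Dict, Set


def can_provide(outputs: Dict[str, str], inputs: Set[str]) -> Dict[str, str]:
    """Map each requested input this stage offers to its redirection suffix."""
    return {name: outputs[name] for name in inputs if name in outputs}


# Hypothetical outputs: "" means no redirection, anything else is a suffix.
stage_outputs = {"/{sample}.fq.gz": "", "/{sample}.fasta": ".assemble"}
provided = can_provide(stage_outputs, {"/{sample}.fq.gz", "/{sample}.bam"})
```

Only the requested input that the stage offers appears in the result, here with an empty string because no redirection takes place.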

doc(doc)[source]

Add documentation to Stage

Parameters

doc (str) – Docstring passed to Sphinx

Return type

None

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack, output_types=None)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_value=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups (List[str]) – Set of columns the values of which should form IDs

  • match_value (Optional[str]) – Limit output to rows with this value

  • match_groups (Optional[List[str]]) – … in these groups

Return type

List[str]

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

Return type

Set[str]

get_outputs(path)[source]

Returns a dictionary of outputs

Return type

Dict[str, str]

get_path(stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

Return type

str

has_checkpoint()[source]
Return type

bool

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

modify_next_group(_stack)[source]
name

The name of the stage is a string uniquely identifying it among all stages.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

class ymp.stage.base.ConfigStage(name, cfg)[source]

Bases: ymp.stage.base.BaseStage

Base for stages created via configuration

These stages are defined in ymp.yml rather than in a rules file.

cfg

The configuration object defining this Stage.

property defined_in

List of files defining this stage

Used to invalidate caches.

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

filename

Semicolon-separated list of file names defining this Stage.

lineno

Line number within the first file at which this Stage is defined.

ymp.stage.expander module

class ymp.stage.expander.StageExpander[source]

Bases: ymp.snakemake.ColonExpander

  • Registers rules with stages when they are created

class Formatter(expander)[source]

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(key, args, kwargs)[source]
get_value_(key, args, kwargs)[source]
regroup = re.compile('(?<!{){\\s*([^{}\\s]+)\\s*}(?!})')
expand_ruleinfo(rule, item, expand_args, rec)[source]
expand_str(rule, item, expand_args, rec, cb)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.
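
The regroup expression listed for Formatter matches single-brace fields and skips doubled braces. The short check below uses the pattern exactly as printed in the class listing; the sample strings are made up for illustration.

```python
import re

# Pattern copied from the `regroup` attribute of StageExpander.Formatter.
regroup = re.compile(r"(?<!{){\s*([^{}\s]+)\s*}(?!})")

# Single braces are captured, with surrounding whitespace stripped.
single = regroup.findall("{:prev:}/{ sample }.fq.gz")

# Doubled braces are left alone by the look-behind and look-ahead.
escaped = regroup.findall("{{sample}}")
```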

ymp.stage.groupby module

Implements forward grouping

Grouping allows processing multiple input datasets at once, such as in a co-assembly. It is initiated by adding the virtual stage “group_<COL>” directly before the stage that should be grouping its output. “<COL>” may be a project data column, in which case all data for which column COL shares a value will be combined, or “ALL”, which combines all samples. The output filename prefix will be either the column value or “ALL”.

>>> ymp make mock.group_sample.assemble_megahit
>>> ymp make mock.group_ALL.assemble_megahit

Subsequent stages will use the most fine-grained grouping required by their input data.
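
The effect of grouping by a column versus by “ALL” can be sketched in plain Python. This is an illustration only: the rows, the column names and the helper group_samples are hypothetical and do not reflect how YMP stores project data.

```python
from collections import defaultdict

# Hypothetical project rows; real projects load these from a data table.
rows = [
    {"sample": "s1", "site": "gut"},
    {"sample": "s2", "site": "gut"},
    {"sample": "s3", "site": "skin"},
]


def group_samples(rows, column):
    """Group sample names by a column value, or under "ALL"."""
    groups = defaultdict(list)
    for row in rows:
        key = "ALL" if column == "ALL" else row[column]
        groups[key].append(row["sample"])
    return dict(groups)


by_site = group_samples(rows, "site")
everything = group_samples(rows, "ALL")
```

The dictionary keys correspond to the output filename prefixes mentioned above: one per column value, or the single prefix “ALL”.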

# FIXME: How to avoid re-specifying groupby?

class ymp.stage.groupby.GroupBy(name)[source]

Bases: ymp.stage.base.BaseStage

Virtual stage for grouping

PREFIX = 'group_'
docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

modify_next_group(stack)[source]
Return type

List[str]

ymp.stage.params module

class ymp.stage.params.Param(stage, key, name, value=None, default=None)[source]

Bases: abc.ABC

Stage Parameter (base class)

property constraint
format(groupdict)[source]
classmethod make(stage, typ, key, name, value, default)[source]
Return type

Param

parse(wildcards, nodefault=False)[source]
pattern(show_constraint=True)[source]

String to add to filenames passed to Snakemake

I.e. a pattern of the form {wildcard,constraint}
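
A minimal sketch of building such a pattern, assuming the wildcard name and its regular expression are already known. The function wildcard_pattern is hypothetical and is not the library's method.

```python
def wildcard_pattern(wildcard, constraint, show_constraint=True):
    """Build a Snakemake-style wildcard, optionally with its regex constraint."""
    if show_constraint and constraint:
        return "{" + wildcard + "," + constraint + "}"
    return "{" + wildcard + "}"


with_constraint = wildcard_pattern("nval", r"\d+")
without_constraint = wildcard_pattern("nval", r"\d+", show_constraint=False)
```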

regex: str = NotImplemented
type_name: str = NotImplemented

Name of type, must be overwritten by children

types: Dict[str, Type[ymp.stage.params.Param]] = {'choice': <class 'ymp.stage.params.ParamChoice'>, 'flag': <class 'ymp.stage.params.ParamFlag'>, 'int': <class 'ymp.stage.params.ParamInt'>, 'ref': <class 'ymp.stage.params.ParamRef'>}

Type/Class mapping for param types

property wildcard
class ymp.stage.params.ParamChoice(*args, **kwargs)[source]

Bases: ymp.stage.params.Param

Stage Choice Parameter

type_name: str = 'choice'

Name of type, must be overwritten by children

class ymp.stage.params.ParamFlag(*args, **kwargs)[source]

Bases: ymp.stage.params.Param

Stage Flag Parameter

format(groupdict)[source]
parse(wildcards)[source]

Returns function that will extract parameter value from wildcards

type_name: str = 'flag'

Name of type, must be overwritten by children

class ymp.stage.params.ParamInt(*args, **kwargs)[source]

Bases: ymp.stage.params.Param

Stage Int Parameter

type_name: str = 'int'

Name of type, must be overwritten by children

class ymp.stage.params.ParamRef(stage, key, name, value=None, default=None)[source]

Bases: ymp.stage.params.Param

Reference Choice Parameter

property regex
type_name: str = 'ref'

Name of type, must be overwritten by children

class ymp.stage.params.Parametrizable(*args, **kwargs)[source]

Bases: ymp.stage.base.BaseStage

add_param(key, typ, name, value=None, default=None)[source]

Add parameter to stage

Example

>>> with Stage("test") as S:
>>>   S.add_param("N", "int", "nval", default=50)
>>>   rule:
>>>      shell: "echo {param.nval}"

This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.

Parameters
  • key – The character to use in the Stage name

  • typ – The type of the parameter (int, flag, choice, ref)

  • name – Name of parameter in params

  • value – value {param.xyz} should be set to if param given

  • default – default value for {param.xyz} if no param given

Return type

bool
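
How a name such as “testN123” resolves to a parameter value can be sketched without YMP. The helper parse_stage_name and its regular expression are assumptions for illustration; they only reproduce the behavior described for the “test” stage above.

```python
import re


def parse_stage_name(name, stage="test", key="N", default=50):
    """Return the integer parameter encoded in a stage name, or the default."""
    match = re.fullmatch(rf"{re.escape(stage)}(?:{re.escape(key)}(\d+))?", name)
    if match is None:
        raise ValueError(f"{name!r} does not refer to stage {stage!r}")
    return int(match.group(1)) if match.group(1) else default


plain = parse_stage_name("test")
custom = parse_stage_name("testN123")
```

With these inputs, plain falls back to the default and custom takes the value given in the stage name.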

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

format(groupdict)[source]
match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

property params
parse(name)[source]
Return type

Dict[str, str]

property regex

ymp.stage.pipeline module

Pipelines Module

Contains classes for pre-configured pipelines comprising multiple stages.

class ymp.stage.pipeline.Pipeline(name, cfg)[source]

Bases: ymp.stage.params.Parametrizable, ymp.stage.base.ConfigStage

A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.

Pipelines are configured via ymp.yml.

Example

pipelines:
  my_pipeline:
    hide: false
    params:
      length:
        key: L
        type: int
        default: 20
    stages:
      - stage_1{length}:
          hide: true
      - stage_2
      - stage_3

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

The result dictionary values will point to the “real” output.

Return type

Dict[str, str]

docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, mygroups=None, target=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups – Set of columns the values of which should form IDs

  • match_value – Limit output to rows with this value

  • match_groups – … in these groups

get_path(stack, typ=None)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

hide_outputs

If true, outputs of stages are hidden by default

property outputs

The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.

Return type

Dict[str, str]
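
The override rule can be pictured as successive dictionary updates. This is a sketch under assumptions: merge_outputs, the file-type keys and the path fragments are hypothetical and chosen only to show later stages replacing earlier entries.

```python
def merge_outputs(stage_outputs):
    """Combine per-stage outputs; later stages override earlier ones."""
    merged = {}
    for outputs in stage_outputs:
        merged.update(outputs)
    return merged


# Hypothetical file types mapped to the stage path providing them.
merged = merge_outputs([
    {"/{sample}.fq.gz": ".stage_1"},
    {"/{sample}.fq.gz": ".stage_1.stage_2", "/{sample}.fasta": ".stage_1.stage_2"},
])
```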

property params
pipeline

Path fragment describing this pipeline

stages

Dictionary of stages with configuration options for each

ymp.stage.project module

This module defines “Project”, a Stage type defined by a project matrix file giving units and meta data for input files.

class ymp.stage.project.PandasTableBuilder[source]

Bases: object

Builds the data table describing each sample in a project

This class implements loading and combining tabular data files as specified by the YAML configuration.

Format:
  • string items are files

  • lists of files are concatenated top to bottom

  • dicts must have one “command” value:

    • ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers

    • ‘table’ contains a list of one-item dicts of the form key:value[,value...]; an in-place table is created from the keys (a list of dicts is necessary because dicts are unordered)

    • ‘paste’ contains a list of tables pasted left to right; the tables pasted must be of equal length or length 1

  • if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD

Example

- top.csv
- join:
  - excel.xlsx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
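
The ‘table’ command in the example above can be read as a list of columns. The following sketch is not the PandasTableBuilder code; build_table is a hypothetical helper that uses plain dictionaries in place of a pandas table.

```python
def build_table(items):
    """Turn a list of one-item dicts into a column-name -> values mapping."""
    table = {}
    for item in items:
        # Each dict holds exactly one "column: v1,v2,..." pair.
        (column, values), = item.items()
        table[column] = [value.strip() for value in values.split(",")]
    return table


table = build_table([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
])
```

The list order fixes the column order, which is the reason the format asks for a list of one-item dicts.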
load_data(cfg, key)[source]
class ymp.stage.project.Project(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

Contains configuration for a source dataset to be processed

KEY_BCCOL = 'barcode_col'
KEY_DATA = 'data'
KEY_IDCOL = 'id_col'
KEY_READCOLS = 'read_cols'
RE_FILE = re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')
RE_REMOTE = re.compile('^(?:https?|ftp|sftp)://(?:.*)')
RE_SRR = re.compile('^[SED]RR[0-9]+$')
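
The three expressions can be tried out directly. In the sketch below the patterns are copied from the attributes above, while the classify helper, its labels and the order of the checks are assumptions and not part of Project.

```python
import re

# Patterns copied from the Project class attributes.
RE_REMOTE = re.compile(r"^(?:https?|ftp|sftp)://(?:.*)")
RE_SRR = re.compile(r"^[SED]RR[0-9]+$")
RE_FILE = re.compile(r"^(?!http://).*(?:fq|fastq)(?:|\.gz)$")


def classify(value):
    """Label a table cell; the check order and labels are assumptions."""
    if RE_REMOTE.match(value):
        return "remote"
    if RE_SRR.match(value):
        return "srr"
    if RE_FILE.match(value):
        return "file"
    return "other"


labels = [classify(v) for v in ("s1.1.fq.gz", "ftp://host/s1.fq", "SRR123456", "gut")]
```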
choose_fq_columns()[source]

Configures the columns referencing the fastq sources

choose_id_column()[source]

Configures column to use as index on runs

If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
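
The fallback rule, picking the leftmost column whose values are all distinct, can be sketched as follows. The function and the sample data are hypothetical; the real method works on the project's data table.

```python
def leftmost_unique_column(columns, rows):
    """Return the first column whose values are all distinct, or None."""
    for index, column in enumerate(columns):
        values = [row[index] for row in rows]
        if len(set(values)) == len(values):
            return column
    return None


columns = ["site", "sample", "fq1"]
rows = [("gut", "s1", "s1.fq"), ("gut", "s2", "s2.fq"), ("skin", "s3", "s3.fq")]
id_column = leftmost_unique_column(columns, rows)
```

Here “site” repeats, so “sample” is selected even though “fq1” is unique as well.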

property data

Pandas dataframe of runs

Lazy loading property, first call may take a while.

do_get_ids(_stack, groups, match_groups=None, match_values=None)[source]
docstring: Optional[str]

The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.

encode_barcode_path(barcode_file, run, pair)[source]
property fq_names

Names of all FastQ files

property fwd_fq_names

Names of forward FastQ files (se and pe)

property fwd_pe_fq_names

Names of forward FastQ files part of pair

get_all_targets(stack, output_types=None)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_fq_names(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]

Get pipeline names of fq files

get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_values=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups – Set of columns the values of which should form IDs

  • match_value – Limit output to rows with this value

  • match_groups – … in these groups

property idcol
iter_samples(variables=None)[source]
minimize_variables(groups)[source]

Removes redundant groupings

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

property pe_fq_names

Names of paired end FastQ files

property project_name
raw_reads_source_path(args, kwargs)[source]
property rev_pe_fq_names

Names of reverse FastQ files part of pair

property runs

Pandas dataframe index of runs

Lazy loading property, first call may take a while.

property se_fq_names

Names of single end FastQ files

property source_cfg
source_path(target, pair, nosplit=False)[source]

Get path for FQ file for run and pair

unsplit_path(barcode_id, pairname)[source]
property variables
class ymp.stage.project.SQLiteProjectData(cfg, key, name='data')[source]

Bases: object

columns()[source]
property db_url
dump()[source]
duplicate_rows(column)[source]
fetch(cols, idcols=None, values=None)[source]
Return type

List[List[str]]

groupby_dedup(cols)[source]
identifying_columns()[source]
property nrows
query(*args)[source]
rows(col)[source]
string_columns()[source]

ymp.stage.reference module

class ymp.stage.reference.Archive(name, dirname, tar, url, strip, files)[source]

Bases: object

dirname = None
files = None
get_files()[source]
hash = None
make_unpack_rule(baserule)[source]
name = None
strip_components = None
tar = None
class ymp.stage.reference.Reference(name, cfg)[source]

Bases: ymp.stage.base.Activateable, ymp.stage.base.ConfigStage

Represents (remote) reference file/database configuration

add_resource(rsc)[source]
files: Dict[str, str]

Files provided by the reference. Keys are the file names within ymp (“target.extension”), symlinked into dir.ref/ref_name/ and values are the path to the reference file from workspace root.

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]

get_file(filename)[source]
get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack (StageStack) – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, match_groups=None, match_value=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups (List[str]) – Set of columns the values of which should form IDs

  • match_value (Optional[str]) – Limit output to rows with this value

  • match_groups (Optional[List[str]]) – … in these groups

Return type

List[str]

get_path(_stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

make_unpack_rules(baserule)[source]
property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

prev(args=None, kwargs=None)[source]
rules: List[snakemake.rules.Rule]

Rules in this stage

this(args=None, kwargs=None)[source]

ymp.stage.stack module

Implements the StageStack

class ymp.stage.stack.StageStack(path)[source]

Bases: object

The “head” of a processing chain - a stack of stages

all_targets()[source]
complete(incomplete)[source]
debug = False

Set to true to enable additional Stack debug logging

property defined_in
get_ids(select_cols, where_cols=None, where_vals=None)[source]
group: List[str]

Grouping in effect for this StageStack. An empty list groups into one pseudo target, ‘ALL’.

classmethod instance(path)[source]

Cached access to StageStack

Parameters
  • path – Stage path

  • stage – Stage object at head of stack

name

Name of stack, i.e. its full path

property path

On disk location of files provided by this stack

prev(_args=None, kwargs=None)[source]

Directory of previous stage

Return type

StageStack

prev_stage

Stage below top stage or None if first in stack

prevs

Mapping of each input type required by the stage of this stack to the prefix stack providing it.

project

Project on which the stack operates. This is currently needed for grouping variables.

resolve_prevs()[source]
show_info()[source]
stage

Top Stage

stage_name

Top Stage Name

stage_names

Names of stages on stack

stages

Stages on stack

target(args, kwargs)[source]

Determines the IDs for a given input data type and output ID (replaces “{:target:}”).

property targets

Determines the IDs to be built by this Stage Stack (replaces “{:targets:}”).

used_stacks = {}
ymp.stage.stack.find_stage(name)[source]
ymp.stage.stack.norm_wildcards(pattern)[source]

ymp.stage.stage module

Implements the “Stage”

At its most basic, a “Stage” is a set of Snakemake rules that share an output folder.

class ymp.stage.stage.Stage(name, altname=None, env=None, doc=None)[source]

Bases: ymp.snakemake.WorkflowObject, ymp.stage.params.Parametrizable, ymp.stage.base.Activateable, ymp.stage.base.BaseStage

Creates a new stage

While the stage is entered using with, several stage-specific variables are expanded within rules:

  • {:this:} – The current stage directory

  • {:that:} – The alternate output stage directory

  • {:prev:} – The previous stage’s directory
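
The substitution of these variables can be pictured with a simple replacement. This is a sketch only: expand_stage_vars is a hypothetical helper, and the directory names used for this and prev are assumptions rather than values YMP would necessarily produce.

```python
import re


def expand_stage_vars(pattern, values):
    """Replace {:name:} placeholders, leaving ordinary {wildcards} untouched."""
    return re.sub(
        r"\{:\s*(\w+)\s*:\}",
        lambda match: values[match.group(1)],
        pattern,
    )


# Hypothetical directories for a stack "mock.trim_bbmap".
expanded = expand_stage_vars(
    "{:prev:}/{sample}.fq.gz",
    {"this": "mock.trim_bbmap", "prev": "mock"},
)
```

The ordinary Snakemake wildcard {sample} is left in place for Snakemake to resolve later.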

Parameters
  • name (str) – Name of this stage

  • altname (Optional[str]) – Alternate name of this stage (used for stages with multiple output variants, e.g. filter_x and remove_x).

  • doc (Optional[str]) – See doc()

  • env (Optional[str]) – See env()

altname: str

Alternative stage name (deprecated)

bin(_args=None, kwargs=None)[source]

Dynamic ID for splitting stages

checkpoints: Dict[str, Set[str]]

Checkpoints in this stage

env(name)[source]

Add package specifications to Stage environment

Note

This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types that do not support conda environments.

Parameters

name (str) – Environment name or filename

>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>    S.env("blast")
>>>    rule testing:
>>>       ...
>>> with Stage("test", env="blast") as S:
>>>    rule testing:
>>>       ...
>>> with Stage("test") as S:
>>>    rule testing:
>>>       conda: "blast"
>>>       ...
Return type

None

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_checkpoint_ids(stack, mygroup, target)[source]
get_group(stack, default_groups)[source]

Determine output grouping for stage

Parameters
  • stack – The stack for which output grouping is requested.

  • default_groups (List[str]) – Grouping determined from stage inputs

  • override_groups – Override grouping from GroupBy stage or None.

Return type

List[str]

get_ids(stack, groups, mygroups=None, target=None)[source]

Determine the target ID names for a set of active groupings

Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.

Parameters
  • groups – Set of columns the values of which should form IDs

  • match_value – Limit output to rows with this value

  • match_groups – … in these groups

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

has_checkpoint()[source]
Return type

bool

match(name)[source]

Check if the name can refer to this stage

As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Set[str]

prev(_args, kwargs)[source]

Gathers {:prev:} calls from rules

Here, input requirements for each stage are collected.

Return type

None

require(**kwargs)[source]

Override inferred stage inputs

In theory, this should not be needed. But it’s simpler for now.

requires

Contains override stage inputs

satisfy_inputs(other_stage, inputs)[source]
Return type

Dict[str, str]

that(_args=None, kwargs=None)[source]

Alternate directory of current stage

Used for splitting stages

this(args=None, kwargs=None)[source]

Replaces {:this:} in rules

Also gathers output capabilities of each stage.

wc2path(wc)[source]