ymp.stage package

YMP processes data in stages, each of which is contained in its own directory.

with Stage("trim_bbmap") as S:
  S.doc("Trim reads with BBMap")
  rule bbmap_trim:
    output: "{:this:}/{sample}{:pairnames:}.fq.gz"
    input:  "{:prev:}/{sample}{:pairnames:}.fq.gz"
    ...

Submodules

ymp.stage.base module

class ymp.stage.base.BaseStage(name)[source]

Bases: object

Base class for stage types

STAMP_FILENAME = 'all_targets.stamp'

The name of the stamp file that is touched to indicate completion of the stage.

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

Returns a dictionary whose keys are a subset of inputs and whose values identify redirections. An empty string indicates that no redirection takes place. Otherwise, the string is the suffix to be appended to the prior StageStack.

Return type

Dict[str, str]
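
As a rough illustration of the return value described above, the following sketch uses a hypothetical stand-in class, ToyStage, which is not part of YMP. It assumes the stage's outputs are a plain set, so every input it can provide maps to the empty string (no redirection).

```python
from typing import Dict, Set


class ToyStage:
    """Hypothetical stand-in for a stage; not the YMP implementation."""

    def __init__(self, outputs: Set[str]) -> None:
        self.outputs = outputs

    def can_provide(self, inputs: Set[str]) -> Dict[str, str]:
        # Keys: the subset of `inputs` this stage is able to produce.
        # Values: "" stands for "no redirection".
        return {name: "" for name in inputs if name in self.outputs}


stage = ToyStage({"/{sample}.fq.gz", "/{sample}.bam"})
provided = stage.can_provide({"/{sample}.fq.gz", "/{sample}.vcf"})
```

Only the input that appears in the stage's outputs ends up as a key in provided.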

doc(doc)[source]

Add documentation to Stage

Parameters

doc (str) – Docstring passed to Sphinx

Return type

None

docstring: str

The docstring describing this stage. Visible via ymp stage list and in the generated Sphinx documentation.

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

Return type

List[str]
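
The typical case can be pictured with a small sketch. The helper all_targets_for and the stack path "project.trim_bbmap" are assumptions made for illustration. Only the stamp file name is taken from STAMP_FILENAME above.

```python
import posixpath
from typing import List

# Value quoted from BaseStage.STAMP_FILENAME in this reference.
STAMP_FILENAME = "all_targets.stamp"


def all_targets_for(stack_path: str) -> List[str]:
    """Hypothetical helper: the single target is the stamp file
    located inside the stack's directory."""
    return [posixpath.join(stack_path, STAMP_FILENAME)]


targets = all_targets_for("project.trim_bbmap")
```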

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

Return type

Set[str]
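
The reason for returning a copy can be seen in a minimal sketch. InputStage is a hypothetical class, not YMP code. Callers may modify the returned set without affecting the stage's internal state.

```python
from typing import Set


class InputStage:
    """Hypothetical stage holding its inputs in a private set."""

    def __init__(self) -> None:
        self._inputs: Set[str] = {"/{sample}.fq.gz"}

    def get_inputs(self) -> Set[str]:
        # Return a copy so that callers cannot mutate self._inputs.
        return set(self._inputs)


stage = InputStage()
requested = stage.get_inputs()
requested.add("/{sample}.bam")  # modifies only the caller's copy
```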

get_path(stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

Return type

str

match(name)[source]

Check if the name can refer to this stage

As a component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

Return type

bool

name

The name of the stage is a string uniquely identifying it among all stages.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

class ymp.stage.base.ConfigStage(name, cfg)[source]

Bases: ymp.stage.base.BaseStage

Base for stages created via configuration

These Stages derive from the ymp.yml and not from a rules file.

cfg

The configuration object defining this Stage.

property defined_in

List of files defining this stage

Used to invalidate caches.

filename

Semicolon-separated list of file names defining this Stage.

lineno

Line number within the first file at which this Stage is defined.

ymp.stage.expander module

class ymp.stage.expander.StageExpander[source]

Bases: ymp.snakemake.ColonExpander

  • Registers rules with stages when they are created

class Formatter(expander)[source]

Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter

get_value(key, args, kwargs)[source]
get_value_(key, args, kwargs)[source]
expand_ruleinfo(rule, item, expand_args, rec)[source]
expand_str(rule, item, expand_args, rec, cb)[source]
expands_field(field)[source]

Checks if this expander should expand a Rule field type

Parameters

field – the field to check

Returns

True if field should be expanded.

ymp.stage.groupby module

class ymp.stage.groupby.GroupBy(name)[source]

Bases: ymp.stage.base.BaseStage

Dummy stage for grouping

ymp.stage.pipeline module

Pipelines Module

Contains classes for pre-configured pipelines comprising multiple stages.

class ymp.stage.pipeline.Pipeline(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.

Pipelines are configured via ymp.yml.

Example

pipelines:
  my_pipeline:
    - stage_1
    - stage_2
    - stage_3

can_provide(inputs)[source]

Determines which of inputs this stage can provide.

The result dictionary values will point to the “real” output.

Return type

Dict[str, str]

get_all_targets(stack)[source]

Targets to build to complete this stage given stack.

Typically, this is the StageStack’s path appended with the stamp name.

get_path(stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

property outputs

The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.

TODO: Allow hiding the output of intermediary stages.

Return type

Dict[str, str]
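
The override rule can be sketched as a simple merge. This is a simplification under stated assumptions: merge_outputs is a hypothetical helper, and the dictionary values here are just the name of the stage that produced each output, whereas the real property returns redirection strings.

```python
from typing import Dict, List, Tuple


def merge_outputs(stages: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Hypothetical helper: map each output to the last stage providing it."""
    merged: Dict[str, str] = {}
    for stage_name, outputs in stages:
        for output in outputs:
            # A later stage overwrites the entry of an earlier one.
            merged[output] = stage_name
    return merged


merged = merge_outputs([
    ("stage_1", ["/{sample}.fq.gz"]),
    ("stage_2", ["/{sample}.fq.gz", "/{sample}.bam"]),
])
```

Both stages produce the .fq.gz output, and the entry from stage_2 is the one that survives.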

property pipeline

ymp.stage.project module

class ymp.stage.project.PandasProjectData(cfg)[source]

Bases: object

column(col)[source]
columns()[source]
dump()[source]
duplicate_rows(column)[source]
get(idcol, row, col)[source]
groupby_dedup(cols)[source]

Return non-redundant identifying subset of cols

identifying_columns()[source]
rows(cols)[source]
string_columns()[source]
class ymp.stage.project.PandasTableBuilder[source]

Bases: object

Builds the data table describing each sample in a project

This class implements loading and combining tabular data files as specified by the YAML configuration.

Format:
  • string items are files

  • lists of files are concatenated top to bottom

  • dicts must have one “command” value:

    • ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers

    • ‘table’ contains a list of one-item dicts of the form key:value[,value...]; an in-place table is created from the keys (a list of dicts is necessary because dicts are unordered)

    • ‘paste’ contains a list of tables pasted left to right; the tables pasted must be of equal length or of length 1

  • if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD

Example

- top.csv
- join:
  - excel.xlsx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
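
The ‘table’ command in the example above can be read as follows. This is a minimal sketch in plain Python, not the pandas-based implementation. The helper build_inline_table is hypothetical, and stripping whitespace around values is an assumption.

```python
from typing import Dict, List


def build_inline_table(items: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Hypothetical helper turning a list of one-item dicts into columns."""
    table: Dict[str, List[str]] = {}
    for item in items:
        if len(item) != 1:
            raise ValueError("each table entry must be a one-item dict")
        # Unpack the single key/value pair of this entry.
        (key, values), = item.items()
        # Assumption: values are comma-separated, surrounding spaces ignored.
        table[key] = [value.strip() for value in values.split(",")]
    return table


table = build_inline_table([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
    {"fq2": "s1.2.fq, s2.2.fq, s3.2.fq"},
])
```

The list of dicts keeps the column order (sample, fq1, fq2) explicit.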
load_data(cfg)[source]
class ymp.stage.project.Project(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

Contains configuration for a source dataset to be processed

KEY_BCCOL = 'barcode_col'
KEY_DATA = 'data'
KEY_IDCOL = 'id_col'
KEY_READCOLS = 'read_cols'
RE_FILE = re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')
RE_REMOTE = re.compile('^(?:https?|ftp|sftp)://(?:.*)')
RE_SRR = re.compile('^[SED]RR[0-9]+$')
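
To show what the three patterns above accept, the sketch below recompiles them as written and applies them to a few made-up values. The classify helper and the order in which it checks the patterns are assumptions for illustration, not the logic used by Project.

```python
import re

# Patterns copied from the class attributes listed above.
RE_FILE = re.compile(r"^(?!http://).*(?:fq|fastq)(?:|\.gz)$")
RE_REMOTE = re.compile(r"^(?:https?|ftp|sftp)://(?:.*)")
RE_SRR = re.compile(r"^[SED]RR[0-9]+$")


def classify(value: str) -> str:
    """Hypothetical helper; the check order is an assumption."""
    if RE_SRR.match(value):
        return "srr"
    if RE_REMOTE.match(value):
        return "remote"
    if RE_FILE.match(value):
        return "file"
    return "other"


kinds = [
    classify(value)
    for value in (
        "SRR1234567",
        "ftp://example.org/a.fq.gz",
        "reads/a_R1.fastq.gz",
        "notes.txt",
    )
]
```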
choose_fq_columns()[source]

Configures the columns referencing the fastq sources

choose_id_column()[source]

Configures column to use as index on runs

If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
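
The fallback rule (leftmost unique column) can be sketched without pandas. The helper leftmost_unique_column and the sample data are hypothetical. The real method operates on the project's data table.

```python
from typing import Dict, List, Optional


def leftmost_unique_column(columns: Dict[str, List[str]]) -> Optional[str]:
    """Hypothetical helper: first column whose values are all distinct."""
    # Dicts preserve insertion order, so iteration goes left to right.
    for name, values in columns.items():
        if len(set(values)) == len(values):
            return name
    return None


id_col = leftmost_unique_column({
    "subject": ["a", "a", "b"],     # not unique, skipped
    "sample": ["s1", "s2", "s3"],   # first unique column
    "fq1": ["s1.fq", "s2.fq", "s3.fq"],
})
```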

property data

Pandas dataframe of runs

Lazy loading property, first call may take a while.
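
One common way to obtain such lazy loading in Python is functools.cached_property, sketched below. Whether Project uses this mechanism is not stated here. ToyProject and its data are made up for illustration.

```python
from functools import cached_property
from typing import Dict, List


class ToyProject:
    """Hypothetical class with an expensive, lazily loaded attribute."""

    def __init__(self) -> None:
        self.load_count = 0

    @cached_property
    def data(self) -> List[Dict[str, str]]:
        # Runs only on first access; the result is stored on the instance.
        self.load_count += 1
        return [{"sample": "s1"}, {"sample": "s2"}]


project = ToyProject()
first = project.data   # triggers loading
second = project.data  # served from the cache
```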

encode_barcode_path(barcode_file, run, pair)[source]
property fq_names

Names of all FastQ files

property fwd_fq_names

Names of forward FastQ files (se and pe)

property fwd_pe_fq_names

Names of forward FastQ files part of pair

get_fq_names(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]

Get pipeline names of fq files

get_ids(groups, match_groups=None, match_value=None)[source]
property idcol
iter_samples(variables=None)[source]
minimize_variables(groups)[source]
property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

property pe_fq_names

Names of paired end FastQ files

property project_name
raw_reads_source_path(args, kwargs)[source]
property rev_pe_fq_names

Names of reverse FastQ files part of pair

property runs

Pandas dataframe index of runs

Lazy loading property, first call may take a while.

property se_fq_names

Names of single end FastQ files

property source_cfg
source_path(target, pair, nosplit=False)[source]

Get path for FQ file for run and pair

unsplit_path(barcode_id, pairname)[source]
property variables
class ymp.stage.project.SQLiteProjectData(cfg, name='data')[source]

Bases: object

column(col)[source]
columns()[source]
property db_url
dump()[source]
duplicate_rows(column)[source]
get(idcol, row, col)[source]
groupby_dedup(cols)[source]
identifying_columns()[source]
property nrows
query(*args)[source]
rows(col)[source]
string_columns()[source]

ymp.stage.reference module

class ymp.stage.reference.Archive(name, dirname, tar, url, strip, files)[source]

Bases: object

dirname = None
files = None
get_files()[source]
hash = None
make_unpack_rule(baserule)[source]
name = None
strip_components = None
tar = None
class ymp.stage.reference.Reference(name, cfg)[source]

Bases: ymp.stage.base.ConfigStage

Represents (remote) reference file/database configuration

add_files(rsc, local_path)[source]
get_file(filename)[source]
get_path(_stack)[source]

On disk location for this stage given stack.

Called by StageStack to determine the real path for virtual stages (which must override this function).

make_unpack_rules(baserule)[source]
property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

Return type

Union[Set[str], Dict[str, str]]

ymp.stage.stack module

class ymp.stage.stack.StageStack(path, stage=None)[source]

Bases: object

The “head” of a processing chain - a stack of stages

all_targets()[source]
complete(incomplete)[source]
property defined_in
classmethod get(path, stage=None)[source]

Cached access to StageStack

Parameters
  • path – Stage path

  • stage – Stage object at head of stack
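
Cached access of this kind is often implemented with a class-level dictionary, in the spirit of the used_stacks attribute listed below. The sketch uses a hypothetical ToyStack class and omits the stage argument. It is not the actual StageStack code.

```python
from typing import Dict


class ToyStack:
    """Hypothetical stack class with a per-class instance cache."""

    used_stacks: Dict[str, "ToyStack"] = {}

    def __init__(self, path: str) -> None:
        self.path = path

    @classmethod
    def get(cls, path: str) -> "ToyStack":
        # Create the object on first request, reuse it afterwards.
        if path not in cls.used_stacks:
            cls.used_stacks[path] = cls(path)
        return cls.used_stacks[path]


a = ToyStack.get("project.trim_bbmap")
b = ToyStack.get("project.trim_bbmap")
```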

property path

On disk location of files provided by this stack

prev(args=None, kwargs=None)[source]

Directory of previous stage

resolve_prevs()[source]
target(args, kwargs)[source]

Finds the target in the prev stage matching current target

property targets

Returns the current targets

used_stacks = {}
ymp.stage.stack.find_stage(name)[source]
ymp.stage.stack.norm_wildcards(pattern)[source]

ymp.stage.stage module

class ymp.stage.stage.Param(stage, key, name, value=None, default=None)[source]

Bases: object

Stage Parameter (base class)

property constraint
pattern(show_constraint=True)[source]

String to add to filenames passed to Snakemake

I.e. a pattern of the form {wildcard,constraint}
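
A minimal sketch of how such a pattern string might be assembled follows. The function wildcard_pattern, the wildcard name and the constraint are assumptions for illustration. Only the {wildcard,constraint} form is taken from the description above.

```python
def wildcard_pattern(name: str, constraint: str, show_constraint: bool = True) -> str:
    """Hypothetical helper building a Snakemake-style wildcard pattern."""
    if show_constraint and constraint:
        return "{" + name + "," + constraint + "}"
    return "{" + name + "}"


with_constraint = wildcard_pattern("nval", "[0-9]+")
without_constraint = wildcard_pattern("nval", "[0-9]+", show_constraint=False)
```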

class ymp.stage.stage.ParamChoice(*args, **kwargs)[source]

Bases: ymp.stage.stage.Param

Stage Choice Parameter

param_func()[source]

Returns function that will extract parameter value from wildcards

class ymp.stage.stage.ParamFlag(*args, **kwargs)[source]

Bases: ymp.stage.stage.Param

Stage Flag Parameter

param_func()[source]

Returns function that will extract parameter value from wildcards

class ymp.stage.stage.ParamInt(*args, **kwargs)[source]

Bases: ymp.stage.stage.Param

Stage Int Parameter

param_func()[source]

Returns function that will extract parameter value from wildcards

class ymp.stage.stage.Stage(name, altname=None, env=None, doc=None)[source]

Bases: ymp.snakemake.WorkflowObject, ymp.stage.base.BaseStage

Creates a new stage

While entered using with, several stage-specific variables are expanded within rules:

  • {:this:} – The current stage directory

  • {:that:} – The alternate output stage directory

  • {:prev:} – The previous stage’s directory

Parameters
  • name (str) – Name of this stage

  • altname (Optional[str]) – Alternate name of this stage (used for stages with multiple output variants, e.g. filter_x and remove_x).

  • doc (Optional[str]) – See Stage.doc

  • env (Optional[str]) – See Stage.env

active = None

Currently active stage (“entered”)
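
The relationship between with and the active attribute can be pictured with a small context manager. ToyStage is a hypothetical class. Resetting active to None on exit is an assumption about the behavior, not a statement about the YMP source.

```python
from typing import Optional


class ToyStage:
    """Hypothetical stage that tracks which instance is entered."""

    active: Optional["ToyStage"] = None

    def __init__(self, name: str) -> None:
        self.name = name

    def __enter__(self) -> "ToyStage":
        ToyStage.active = self  # mark this stage as entered
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        ToyStage.active = None  # assumption: cleared when leaving the block
        return False            # do not suppress exceptions


with ToyStage("trim_bbmap") as S:
    inside = ToyStage.active is S
outside = ToyStage.active
```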

add_param(key, typ, name, value=None, default=None)[source]

Add parameter to stage

Example

>>> with Stage("test") as S:
>>>   S.add_param("N", "int", "nval", default=50)
>>>   rule:
>>>      shell: "echo {param.nval}"

This adds a stage “test” that can optionally be called as “testN123”. Called as “test”, it prints “50”; called as “testN123”, it prints “123”.

Parameters
  • key – The character to use in the Stage name

  • typ – The type of the parameter (int, flag)

  • name – Name of parameter in params

  • value – Value {param.xyz} should be set to if the parameter is given

  • default – Default value for {param.xyz} if no parameter is given
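
How a suffix such as “N123” might be separated from the stage name is sketched below for the int case. The function parse_int_param and its regular expression are assumptions for illustration and do not reproduce the matching logic of Stage.

```python
import re
from typing import Optional


def parse_int_param(candidate: str, stage: str, key: str, default: int) -> Optional[int]:
    """Hypothetical helper: return the parameter value encoded in
    `candidate`, the default if no suffix is present, or None if
    `candidate` does not refer to `stage` at all."""
    pattern = re.compile(rf"^{re.escape(stage)}(?:{re.escape(key)}(\d+))?$")
    match = pattern.match(candidate)
    if match is None:
        return None
    if match.group(1) is None:
        return default
    return int(match.group(1))


plain = parse_int_param("test", stage="test", key="N", default=50)
suffixed = parse_int_param("testN123", stage="test", key="N", default=50)
unrelated = parse_int_param("other", stage="test", key="N", default=50)
```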

env(name)[source]

Add package specifications to Stage environment

Note

This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types that do not support conda environments.

Parameters

name (str) – Environment name or filename

>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>    S.env("blast")
>>>    rule testing:
>>>       ...
>>> with Stage("test", env="blast") as S:
>>>    rule testing:
>>>       ...
>>> with Stage("test") as S:
>>>    rule testing:
>>>       conda: "blast"
>>>       ...
Return type

None

get_inputs()[source]

Returns the set of inputs required by this stage

This function must return a copy, to ensure internal data is not modified.

match(name)[source]

Check if the name can refer to this stage

As a component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.

property outputs

Returns the set of outputs this stage is able to generate.

May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.

prev(args, kwargs)[source]

Gathers {:prev:} calls from rules

require(**kwargs)[source]

Override inferred stage inputs

In theory, this should not be needed. But it’s simpler for now.

satisfy_inputs(other_stage, inputs)[source]
Return type

Dict[str, str]

that(args=None, kwargs=None)[source]

Alternate directory of current stage

Used for splitting stages

this(args=None, kwargs=None)[source]

Directory of current stage

wc2path(wc)[source]
wildcards(args=None, kwargs=None)[source]
ymp.stage.stage.norm_wildcards(pattern)[source]