ymp.stage package
YMP processes data in stages, each of which is contained in its own directory.
with Stage("trim_bbmap") as S:
    S.doc("Trim reads with BBMap")
    rule bbmap_trim:
        output: "{:this:}/{sample}{:pairnames:}.fq.gz"
        input: "{:prev:}/{sample}{:pairnames:}.fq.gz"
        ...
Submodules
ymp.stage.base module
Base classes for all Stage types
-
class
ymp.stage.base.
Activateable
(*args, **kwargs)[source]¶ Bases:
object
Mixin for Stages that can be filled with rules from Snakefiles.
-
register_inout
(name, target, item)[source]¶ Determine stage input/output file type from prev/this filename
Detects patterns like “PREFIX{: NAME :}/INFIX{TARGET}.EXT”. Also checks if there is an active stage.
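The pattern above can be illustrated with a regular expression. This is a minimal sketch and not YMP's implementation: the regex `INOUT_RE` and the helper `parse_inout` are assumptions made for illustration only.

```python
import re

# Hypothetical regex for "PREFIX{: NAME :}/INFIX{TARGET}.EXT"; not taken from YMP.
INOUT_RE = re.compile(
    r"^(?P<prefix>.*?)\{:\s*(?P<name>\w+)\s*:\}/"
    r"(?P<infix>[^{}]*)\{(?P<target>\w+)\}(?P<ext>.*)$"
)


def parse_inout(item):
    """Return the pattern components as a dict, or None if `item` does not match."""
    match = INOUT_RE.match(item)
    return match.groupdict() if match else None


print(parse_inout("{:prev:}/{sample}.fq.gz"))
print(parse_inout("plain/file.txt"))
```

In this sketch, NAME distinguishes input (`prev`) from output (`this`) filenames, and everything after `{TARGET}` is treated as the file extension.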
-
rules
: List[snakemake.rules.Rule]¶ Rules in this stage
-
-
class
ymp.stage.base.
BaseStage
(name)[source]¶ Bases:
object
Base class for stage types
-
altname
¶ Alternative name
-
can_provide
(inputs)[source]¶ Determines which of inputs this stage can provide.
Returns a dictionary whose keys are a subset of inputs and whose values identify redirections. An empty string indicates that no redirection takes place. Otherwise, the string is the suffix to be appended to the prior StageStack.
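The return value described above can be sketched as follows. This assumes a stage's outputs are available either as a set (no redirection) or as a dict mapping file types to redirection suffixes; the standalone function and its `outputs` argument are hypothetical and do not mirror the real method signature.

```python
def can_provide(inputs, outputs):
    """Sketch: map each providable input to its redirection suffix.

    `outputs` is a hypothetical stand-in for a stage's outputs: either a set
    (no redirection) or a dict of {file type: suffix}.
    """
    if isinstance(outputs, dict):
        return {name: outputs[name] for name in inputs if name in outputs}
    # An empty string means "no redirection".
    return {name: "" for name in inputs if name in outputs}


print(can_provide({"/{sample}.fq.gz", "/{sample}.bam"}, {"/{sample}.fq.gz"}))
```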
-
docstring
: Optional[str]¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
get_all_targets
(stack, output_types=None)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_group
(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids
(stack, groups, match_groups=None, match_value=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
-
get_inputs
()[source]¶ Returns the set of inputs required by this stage
This function must return a copy, to ensure internal data is not modified.
-
get_path
(stack)[source]¶ On disk location for this stage given stack.
Called by StageStack to determine the real path for virtual stages (which must override this function).
- Return type
-
match
(name)[source]¶ Check if the name can refer to this stage
As a component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
name
¶ The name of the stage is a string uniquely identifying it among all stages.
-
-
class
ymp.stage.base.
ConfigStage
(name, cfg)[source]¶ Bases:
ymp.stage.base.BaseStage
Base for stages created via configuration
These stages derive from ymp.yml and not from a rules file.
-
cfg
¶ The configuration object defining this Stage.
-
property
defined_in
¶ List of files defining this stage
Used to invalidate caches.
-
docstring
: Optional[str]¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
filename
¶ Semi-colon separated list of file names defining this Stage.
-
lineno
¶ Line number within the first file at which this Stage is defined.
-
ymp.stage.expander module
-
class
ymp.stage.expander.
StageExpander
[source]¶ Bases:
ymp.snakemake.ColonExpander
Registers rules with stages when they are created
-
class
Formatter
(expander)[source]¶ Bases:
ymp.snakemake.FormatExpander.Formatter
,ymp.string.PartialFormatter
-
regroup
= re.compile('(?<!{){\\s*([^{}\\s]+)\\s*}(?!})')¶
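To see what `regroup` matches, the pattern can be tried on a few filename templates. The sketch below copies the pattern shown above into a raw string; the sample strings are made up for illustration. It matches any single-braced placeholder (including the `{:name:}` form), tolerates whitespace inside the braces, and skips doubled braces.

```python
import re

# Same pattern as Formatter.regroup, written as a raw string.
regroup = re.compile(r"(?<!{){\s*([^{}\s]+)\s*}(?!})")

# Single-braced placeholders are captured without surrounding whitespace.
print(regroup.findall("{:this:}/{sample}{:pairnames:}.fq.gz"))

# Doubled braces, as used to escape literal braces, are not matched.
print(regroup.findall("{ sample }/{{literal}}/{x}"))
```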
-
ymp.stage.groupby module
Implements forward grouping
Grouping allows processing multiple input datasets at once, such as in a co-assembly. It is initiated by adding the virtual stage “group_<COL>” directly before the stage that should be grouping its output. “<COL>” may be a project data column, in which case all data for which column COL shares a value will be combined, or “ALL”, which combines all samples. The output filename prefix will be either the column value or “ALL”.
>>> ymp make mock.group_sample.assemble_megahit
>>> ymp make mock.group_ALL.assemble_megahit
Subsequent stages will use the most fine-grained grouping required by their input data.
# FIXME: How to avoid re-specifying groupby?
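The effect of grouping by a column versus “ALL” can be sketched in plain Python. This is only an illustration of the idea, not YMP code: the helper `group_samples`, the `sample` and `site` columns, and the row data are all hypothetical.

```python
from collections import defaultdict


def group_samples(rows, column):
    """Sketch: combine sample IDs by `column`, or into one "ALL" group."""
    groups = defaultdict(list)
    for row in rows:
        key = "ALL" if column == "ALL" else row[column]
        groups[key].append(row["sample"])
    return dict(groups)


# Hypothetical project table; the column names are made up for illustration.
rows = [
    {"sample": "s1", "site": "gut"},
    {"sample": "s2", "site": "gut"},
    {"sample": "s3", "site": "skin"},
]
print(group_samples(rows, "site"))
print(group_samples(rows, "ALL"))
```

Each key of the result corresponds to one output filename prefix: a column value, or “ALL”.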
-
class
ymp.stage.groupby.
GroupBy
(name)[source]¶ Bases:
ymp.stage.base.BaseStage
Virtual stage for grouping
-
PREFIX
= 'group_'¶
-
docstring
: Optional[str]¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
get_group
(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
match
(name)[source]¶ Check if the name can refer to this stage
As a component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
ymp.stage.params module
-
class
ymp.stage.params.
Param
(stage, key, name, value=None, default=None)[source]¶ Bases:
abc.ABC
Stage Parameter (base class)
-
property
constraint
¶
-
pattern
(show_constraint=True)[source]¶ String to add to filenames passed to Snakemake
I.e. a pattern of the form
{wildcard,constraint}
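As an illustration of that form, a Snakemake wildcard with an inline regex constraint can be assembled like this. The function below is a hypothetical standalone helper, not the actual property-based implementation in `Param`.

```python
def wildcard_pattern(wildcard, constraint, show_constraint=True):
    """Sketch: build a Snakemake-style wildcard, optionally with its regex constraint."""
    if show_constraint and constraint:
        return "{" + wildcard + "," + constraint + "}"
    return "{" + wildcard + "}"


print(wildcard_pattern("nval", r"\d+"))
print(wildcard_pattern("nval", r"\d+", show_constraint=False))
```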
-
types
: Dict[str, Type[ymp.stage.params.Param]] = {'choice': <class 'ymp.stage.params.ParamChoice'>, 'flag': <class 'ymp.stage.params.ParamFlag'>, 'int': <class 'ymp.stage.params.ParamInt'>, 'ref': <class 'ymp.stage.params.ParamRef'>}¶ Type/Class mapping for param types
-
property
wildcard
¶
-
property
-
class
ymp.stage.params.
ParamChoice
(*args, **kwargs)[source]¶ Bases:
ymp.stage.params.Param
Stage Choice Parameter
-
class
ymp.stage.params.
ParamFlag
(*args, **kwargs)[source]¶ Bases:
ymp.stage.params.Param
Stage Flag Parameter
-
class
ymp.stage.params.
ParamInt
(*args, **kwargs)[source]¶ Bases:
ymp.stage.params.Param
Stage Int Parameter
-
class
ymp.stage.params.
ParamRef
(stage, key, name, value=None, default=None)[source]¶ Bases:
ymp.stage.params.Param
Reference Choice Parameter
-
property
regex
¶
-
property
-
class
ymp.stage.params.
Parametrizable
(*args, **kwargs)[source]¶ Bases:
ymp.stage.base.BaseStage
-
add_param
(key, typ, name, value=None, default=None)[source]¶ Add parameter to stage
Example
>>> with Stage("test") as S:
>>>     S.add_param("N", "int", "nval", default=50)
>>>     rule:
>>>         shell: "echo {param.nval}"
This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.
- Parameters
key – The character to use in the Stage name
typ – The type of the parameter (e.g. int, flag)
name – Name of the parameter in params
value – value {param.xyz} should be set to if param given
default – default value for {param.xyz} if no param given
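How a name such as “testN123” maps to a parameter value can be sketched with a regular expression. This is a simplified illustration for a single int parameter, not YMP's matching logic; the function `parse_stage_name` and its defaults are hypothetical and only mirror the example above.

```python
import re


def parse_stage_name(name, stage="test", key="N", param="nval", default=50):
    """Sketch: split "testN123" into the stage name and an int parameter."""
    match = re.fullmatch(rf"{re.escape(stage)}(?:{re.escape(key)}(\d+))?", name)
    if match is None:
        return None  # the name does not refer to this stage
    value = int(match.group(1)) if match.group(1) else default
    return {param: value}


print(parse_stage_name("test"))
print(parse_stage_name("testN123"))
```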
- Return type
-
docstring
: Optional[str]¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
match
(name)[source]¶ Check if the name can refer to this stage
As a component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
property
params
¶
-
property
regex
¶
-
ymp.stage.pipeline module
Pipelines Module
Contains classes for pre-configured pipelines comprising multiple stages.
-
class
ymp.stage.pipeline.
Pipeline
(name, cfg)[source]¶ Bases:
ymp.stage.params.Parametrizable
,ymp.stage.base.ConfigStage
A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.
Pipelines are configured via ymp.yml.
Example
pipelines:
  my_pipeline:
    hide: false
    params:
      length:
        key: L
        type: int
        default: 20
    stages:
      - stage_1{length}:
          hide: true
      - stage_2
      - stage_3
-
can_provide
(inputs)[source]¶ Determines which of inputs this stage can provide.
The result dictionary values will point to the “real” output.
-
docstring
: Optional[str]¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
get_all_targets
(stack)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_group
(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids
(stack, groups, mygroups=None, target=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
- Parameters
groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups
-
get_path
(stack, typ=None)[source]¶ On disk location for this stage given stack.
Called by StageStack to determine the real path for virtual stages (which must override this function).
-
hide_outputs
¶ If true, outputs of stages are hidden by default
-
property
outputs
¶ The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.
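The override rule can be sketched with a dictionary that is filled in pipeline order. This is an assumption-laden illustration: the function `pipeline_outputs` and the representation of stage outputs as (stage path, file types) pairs are hypothetical.

```python
def pipeline_outputs(stage_outputs):
    """Sketch: merge per-stage outputs; later stages override earlier ones.

    `stage_outputs` is a hypothetical list of (stage path, file types) pairs
    in pipeline order.
    """
    merged = {}
    for path, file_types in stage_outputs:
        for file_type in file_types:
            merged[file_type] = path  # a later stage replaces an earlier provider
    return merged


merged = pipeline_outputs([
    ("stage_1", ["/{sample}.fq.gz", "/{sample}.log"]),
    ("stage_2", ["/{sample}.fq.gz"]),
    ("stage_3", ["/{sample}.bam"]),
])
print(merged)
```

Here `stage_2` becomes the provider of the FastQ files, while the log file produced only by `stage_1` stays attributed to it.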
-
property
params
¶
-
pipeline
¶ Path fragment describing this pipeline
-
stages
¶ Dictionary of stages with configuration options for each
ymp.stage.project module
This module defines “Project”, a Stage type defined by a project matrix file giving units and meta data for input files.
-
class
ymp.stage.project.
PandasTableBuilder
[source]¶ Bases:
object
Builds the data table describing each sample in a project
This class implements loading and combining tabular data files as specified by the YAML configuration.
- Format:
string items are files
lists of files are concatenated top to bottom
dicts must have one “command” value:
    ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers
    ‘table’ contains a list of one-item dicts; dicts have the form key:value[,value...]; an in-place table is created from the keys; a list of dicts is necessary as dicts are unordered
    ‘paste’ contains a list of tables pasted left to right; tables pasted must be of equal length or length 1
if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD
Example
- top.csv
- join:
  - excel.xslx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
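The ‘table’ command can be illustrated without pandas. The sketch below is not the PandasTableBuilder implementation; it assumes values are split on commas and stripped of whitespace, and the helper name `build_table` is hypothetical.

```python
def build_table(items):
    """Sketch: turn a list of one-item dicts into a list of row dicts."""
    columns = {}
    for item in items:
        (key, values), = item.items()  # each dict must have exactly one entry
        columns[key] = [value.strip() for value in values.split(",")]
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


table = build_table([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
    {"fq2": "s1.2.fq, s2.2.fq, s3.2.fq"},
])
for row in table:
    print(row)
```

Using a list of one-item dicts keeps the column order explicit, which is the reason given above for that format.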
-
class
ymp.stage.project.
Project
(name, cfg)[source]¶ Bases:
ymp.stage.base.ConfigStage
Contains configuration for a source dataset to be processed
-
KEY_BCCOL
= 'barcode_col'¶
-
KEY_DATA
= 'data'¶
-
KEY_IDCOL
= 'id_col'¶
-
KEY_READCOLS
= 'read_cols'¶
-
RE_FILE
= re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')¶
-
RE_REMOTE
= re.compile('^(?:https?|ftp|sftp)://(?:.*)')¶
-
RE_SRR
= re.compile('^[SED]RR[0-9]+$')¶
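The three patterns can be exercised on sample values as follows. The regexes are copied from the attributes above; the `classify` helper and the order in which it applies them are assumptions for illustration and may differ from how Project uses them.

```python
import re

# Patterns copied from the Project class attributes.
RE_REMOTE = re.compile(r"^(?:https?|ftp|sftp)://(?:.*)")
RE_SRR = re.compile(r"^[SED]RR[0-9]+$")
RE_FILE = re.compile(r"^(?!http://).*(?:fq|fastq)(?:|\.gz)$")


def classify(value):
    """Sketch: label a table cell; the checking order is an assumption."""
    if RE_REMOTE.match(value):
        return "remote"
    if RE_SRR.match(value):
        return "srr"
    if RE_FILE.match(value):
        return "file"
    return "other"


for value in ["sample.fq.gz", "SRR123456", "ftp://host/x.fastq", "notes.txt"]:
    print(value, classify(value))
```

Note that RE_FILE excludes only the `http://` prefix, so in this sketch RE_REMOTE is checked first to keep `https://` and `ftp://` URLs from being labelled as local files.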
-
choose_id_column
()[source]¶ Configures column to use as index on runs
If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
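The selection rule can be sketched in plain Python. The real method works on the project's pandas data and takes no arguments; the standalone function, its parameters, and the exception types below are hypothetical.

```python
def choose_id_column(rows, columns, id_col=None):
    """Sketch: return the configured ID column or the leftmost unique one."""
    def is_unique(col):
        values = [row[col] for row in rows]
        return len(set(values)) == len(values)

    if id_col is not None:
        if id_col not in columns:
            raise KeyError(f"column {id_col!r} does not exist")
        if not is_unique(id_col):
            raise ValueError(f"column {id_col!r} is not unique")
        return id_col
    for col in columns:  # columns are ordered left to right
        if is_unique(col):
            return col
    raise ValueError("no unique column found")


# Hypothetical run table: "site" repeats, "sample" is unique.
rows = [{"site": "gut", "sample": "s1"}, {"site": "gut", "sample": "s2"}]
columns = ["site", "sample"]
print(choose_id_column(rows, columns))
```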
-
property
data
¶ Pandas dataframe of runs
Lazy loading property, first call may take a while.
-
docstring
: Optional[str]¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
property
fq_names
¶ Names of all FastQ files
-
property
fwd_fq_names
¶ Names of forward FastQ files (se and pe)
-
property
fwd_pe_fq_names
¶ Names of forward FastQ files part of pair
-
get_all_targets
(stack, output_types=None)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_fq_names
(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]¶ Get pipeline names of fq files
-
get_group
(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids
(stack, groups, match_groups=None, match_values=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
- Parameters
groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups
-
property
idcol
¶
-
property
outputs
¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
property
pe_fq_names
¶ Names of paired end FastQ files
-
property
project_name
¶
-
property
rev_pe_fq_names
¶ Names of reverse FastQ files part of pair
-
property
runs
¶ Pandas dataframe index of runs
Lazy loading property, first call may take a while.
-
property
se_fq_names
¶ Names of single end FastQ files
-
property
source_cfg
¶
-
property
variables
¶
-
ymp.stage.reference module
-
class
ymp.stage.reference.
Archive
(name, dirname, tar, url, strip, files)[source]¶ Bases:
object
-
dirname
= None¶
-
files
= None¶
-
hash
= None¶
-
name
= None¶
-
strip_components
= None¶
-
tar
= None¶
-
-
class
ymp.stage.reference.
Reference
(name, cfg)[source]¶ Bases:
ymp.stage.base.Activateable
,ymp.stage.base.ConfigStage
Represents (remote) reference file/database configuration
-
files
: Dict[str, str]¶ Files provided by the reference. Keys are the file names within ymp (“target.extension”), symlinked into dir.ref/ref_name/ and values are the path to the reference file from workspace root.
-
get_all_targets
(stack)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_group
(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids
(stack, groups, match_groups=None, match_value=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
-
get_path
(_stack)[source]¶ On disk location for this stage given stack.
Called by StageStack to determine the real path for virtual stages (which must override this function).
-
property
outputs
¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
rules
: List[snakemake.rules.Rule]¶ Rules in this stage
-
ymp.stage.stack module
Implements the StageStack
-
class
ymp.stage.stack.
StageStack
(path)[source]¶ Bases:
object
The “head” of a processing chain - a stack of stages
-
debug
= False¶ Set to true to enable additional Stack debug logging
-
property
defined_in
¶
-
group
: List[str]¶ Grouping in effect for this StageStack. An empty list groups into one pseudo target, ‘ALL’.
-
classmethod
instance
(path)[source]¶ Cached access to StageStack
- Parameters
path – Stage path
stage – Stage object at head of stack
-
name
¶ Name of the stack, i.e. its full path
-
property
path
¶ On disk location of files provided by this stack
-
prev_stage
¶ Stage below top stage or None if first in stack
-
prevs
¶ Mapping of each input type required by the stage of this stack to the prefix stack providing it.
-
project
¶ Project on which the stack operates. This is currently needed for grouping variables.
-
stage
¶ Top Stage
-
stage_name
¶ Top Stage Name
-
stage_names
¶ Names of stages on stack
-
stages
¶ Stages on stack
-
target
(args, kwargs)[source]¶ Determines the IDs for a given input data type and output ID (replaces “{:target:}”).
-
property
targets
¶ Determines the IDs to be built by this Stage Stack (replaces “{:targets:}”).
-
used_stacks
= {}¶
-
ymp.stage.stage module
Implements the “Stage”
At its most basic, a “Stage” is a set of Snakemake rules that share an output folder.
-
class
ymp.stage.stage.
Stage
(name, altname=None, env=None, doc=None)[source]¶ Bases:
ymp.snakemake.WorkflowObject
,ymp.stage.params.Parametrizable
,ymp.stage.base.Activateable
,ymp.stage.base.BaseStage
Creates a new stage
While entered using with, several stage-specific variables are expanded within rules:
{:this:} – The current stage directory
{:that:} – The alternate output stage directory
{:prev:} – The previous stage’s directory
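The expansion of these variables can be pictured as a simple substitution. The sketch below is not how YMP's expander works internally; the function `expand_stage_vars` and the directory names passed to it are hypothetical, and `{:that:}` is left out for brevity.

```python
import re


def expand_stage_vars(pattern, this, prev):
    """Sketch: replace {:this:} and {:prev:} with directory names."""
    values = {"this": this, "prev": prev}

    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders such as {:pairnames:} untouched.
        return values.get(name, match.group(0))

    return re.sub(r"\{:\s*(\w+)\s*:\}", substitute, pattern)


print(expand_stage_vars("{:this:}/{sample}.fq.gz", "mock.trim_bbmap", "mock"))
print(expand_stage_vars("{:prev:}/{sample}.fq.gz", "mock.trim_bbmap", "mock"))
```

Ordinary Snakemake wildcards such as `{sample}` are not touched by this substitution.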
- Parameters
-
env
(name)[source]¶ Add package specifications to Stage environment
Note
This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types that do not support conda environments.
- Parameters
name (
str
) – Environment name or filename
>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>     S.env("blast")
>>>     rule testing:
>>>         ...

>>> with Stage("test", env="blast") as S:
>>>     rule testing:
>>>         ...

>>> with Stage("test") as S:
>>>     rule testing:
>>>         conda: "blast"
>>>         ...
- Return type
-
get_all_targets
(stack)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_ids
(stack, groups, mygroups=None, target=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
- Parameters
groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups
-
get_inputs
()[source]¶ Returns the set of inputs required by this stage
This function must return a copy, to ensure internal data is not modified.
-
match
(name)[source]¶ Check if the name can refer to this stage
As a component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
-
property
outputs
¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
prev
(_args, kwargs)[source]¶ Gathers {:prev:} calls from rules
Here, input requirements for each stage are collected.
- Return type
-
require
(**kwargs)[source]¶ Override inferred stage inputs
In theory, this should not be needed. But it’s simpler for now.
-
requires
¶ Contains override stage inputs
-
that
(_args=None, kwargs=None)[source]¶ Alternate directory of current stage
Used for splitting stages