ymp.stage package¶
YMP processes data in stages, each of which is contained in its own directory.
with Stage("trim_bbmap") as S:
    S.doc("Trim reads with BBMap")
    rule bbmap_trim:
        output: "{:this:}/{sample}{:pairnames:}.fq.gz"
        input: "{:prev:}/{sample}{:pairnames:}.fq.gz"
        ...
Submodules¶
ymp.stage.base module¶
Base classes for all Stage types
-
class ymp.stage.base.Activateable(*args, **kwargs)[source]¶
Bases: object
Mixin for Stages that can be filled with rules from Snakefiles.
-
register_inout(name, target, item)[source]¶ Determine stage input/output file type from prev/this filename
Detects patterns like “PREFIX{: NAME :}/INFIX{TARGET}.EXT”. Also checks if there is an active stage.
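As a rough illustration of the kind of pattern described above, the following sketch splits a filename of the form PREFIX{: NAME :}/INFIX{TARGET}.EXT with a regular expression. The regular expression and the function name `split_stage_pattern` are assumptions made for this example; they are not taken from YMP's implementation.

```python
import re

# Hypothetical regex for "PREFIX{: NAME :}/INFIX{TARGET}.EXT" (illustration only).
PATTERN = re.compile(
    r"^(?P<prefix>.*?)"           # anything before the stage placeholder
    r"\{:\s*(?P<name>\w+)\s*:\}"  # "{: NAME :}", e.g. "{:prev:}" or "{:this:}"
    r"/(?P<infix>[^{}]*)"         # text between "/" and the target wildcard
    r"\{(?P<target>\w+)\}"        # "{TARGET}", e.g. "{sample}"
    r"(?P<ext>.*)$"               # the remainder is treated as the file type
)


def split_stage_pattern(filename):
    """Return the named parts of the pattern, or None if it does not match."""
    match = PATTERN.match(filename)
    return match.groupdict() if match else None


parts = split_stage_pattern("{:prev:}/{sample}.R1.fq.gz")
print(parts)
```

In this sketch, `name` tells whether the file is an input (`prev`) or an output (`this`), and `ext` stands in for the file type.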
-
rules: List[snakemake.rules.Rule]¶ Rules in this stage
-
-
class ymp.stage.base.BaseStage(name)[source]¶
Bases: object
Base class for stage types
-
altname¶ Alternative name
-
can_provide(inputs)[source]¶ Determines which of inputs this stage can provide.
Returns a dictionary with the keys a subset of inputs and the values identifying redirections. An empty string indicates that no redirection is to take place. Otherwise, the string is the suffix to be appended to the prior StageStack.
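The return value described above can be pictured with a small stand-alone function. This is a sketch under assumptions: the free function, the `outputs` mapping and the file type strings are hypothetical and only mirror the documented contract (keys are a subset of the inputs, an empty string means no redirection).

```python
def can_provide(outputs, inputs):
    """Map each requested input this stage provides to its redirection suffix.

    `outputs` maps file types to a redirection suffix ("" means no redirection).
    """
    return {ftype: outputs[ftype] for ftype in inputs if ftype in outputs}


# Hypothetical stage outputs: the FASTA file is redirected to a sub-stage.
outputs = {"/{sample}.fq.gz": "", "/{sample}.fasta.gz": ".assembly"}
result = can_provide(outputs, {"/{sample}.fq.gz", "/{sample}.bam"})
print(result)
```

Only the requested file type that the stage actually produces appears in the result; the unavailable `.bam` input is simply absent.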
-
docstring: Optional[str]¶ The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.
-
get_all_targets(stack, output_types=None)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_group(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids(stack, groups, match_groups=None, match_value=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
-
get_inputs()[source]¶ Returns the set of inputs required by this stage
This function must return a copy, to ensure internal data is not modified.
-
get_path(stack)[source]¶ On disk location for this stage given stack.
Called by StageStack to determine the real path for virtual stages (which must override this function).
- Return type
-
match(name)[source]¶ Check if the name can refer to this stage
As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
name¶ The name of the stage is a string uniquely identifying it among all stages.
-
-
class ymp.stage.base.ConfigStage(name, cfg)[source]¶
Bases: ymp.stage.base.BaseStage
Base for stages created via configuration
These Stages derive from the ymp.yml and not from a rules file.
-
cfg¶ The configuration object defining this Stage.
-
property
defined_in¶ List of files defining this stage
Used to invalidate caches.
-
docstring: Optional[str]¶ The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.
-
filename¶ Semi-colon separated list of file names defining this Stage.
-
lineno¶ Line number within the first file at which this Stage is defined.
-
ymp.stage.expander module¶
-
class ymp.stage.expander.StageExpander[source]¶
Bases: ymp.snakemake.ColonExpander
Registers rules with stages when they are created
-
class Formatter(expander)[source]¶
Bases: ymp.snakemake.FormatExpander.Formatter, ymp.string.PartialFormatter
-
regroup= re.compile('(?<!{){\\s*([^{}\\s]+)\\s*}(?!})')¶
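To see what this expression matches, it can be tried on a sample string. The expression below is copied from the `regroup` attribute above; the input string is made up for the example.

```python
import re

# Regular expression copied from the `regroup` attribute above.
regroup = re.compile('(?<!{){\\s*([^{}\\s]+)\\s*}(?!})')

# Single-brace names are captured; doubled braces are skipped by the lookarounds.
names = regroup.findall("{sample}.{{escaped}}.{ ext }")
print(names)
```

The lookbehind and lookahead exclude `{{escaped}}`, and the `\s*` parts allow whitespace around the name inside the braces.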
-
ymp.stage.groupby module¶
Implements forward grouping
Grouping allows processing multiple input datasets at once, such as in a co-assembly. It is initiated by adding the virtual stage “group_<COL>” directly before the stage that should be grouping its output. “<COL>” may be a project data column, in which case all data for which column COL shares a value will be combined, or “ALL”, which combines all samples. The output filename prefix will be either the column value or “ALL”.
>>> ymp make mock.group_sample.assemble_megahit
>>> ymp make mock.group_ALL.assemble_megahit
Subsequent stages will use the most fine-grained grouping required by their input data.
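The effect of grouping by a column versus grouping by “ALL” can be sketched in plain Python. The function `group_samples`, the column names and the sample rows are hypothetical; the sketch only illustrates how column values, or the single name “ALL”, become output prefixes.

```python
from collections import defaultdict


def group_samples(rows, column):
    """Group sample names by the value of `column`, or into one group for "ALL"."""
    groups = defaultdict(list)
    for row in rows:
        key = "ALL" if column == "ALL" else row[column]
        groups[key].append(row["sample"])
    return dict(groups)


# Hypothetical project data with a "sample" and a "site" column.
rows = [
    {"sample": "s1", "site": "gut"},
    {"sample": "s2", "site": "gut"},
    {"sample": "s3", "site": "skin"},
]
by_site = group_samples(rows, "site")
everything = group_samples(rows, "ALL")
print(by_site)
print(everything)
```

Grouping by `site` yields one combined dataset per site, while `ALL` yields a single dataset containing every sample.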
# FIXME: How to avoid re-specifying groupby?
-
class ymp.stage.groupby.GroupBy(name)[source]¶
Bases: ymp.stage.base.BaseStage
Virtual stage for grouping
-
PREFIX= 'group_'¶
-
docstring: Optional[str]¶ The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.
-
get_group(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
match(name)[source]¶ Check if the name can refer to this stage
As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
ymp.stage.params module¶
-
class ymp.stage.params.Param(stage, key, name, value=None, default=None)[source]¶
Bases: abc.ABC
Stage Parameter (base class)
-
property
constraint¶
-
pattern(show_constraint=True)[source]¶ String to add to filenames passed to Snakemake
I.e. a pattern of the form {wildcard,constraint}
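A minimal sketch of how such a pattern string could be assembled follows. The helper `wildcard_pattern` is hypothetical, and how YMP derives the wildcard name and the constraint is not shown here; only the resulting `{wildcard,constraint}` form comes from the description above.

```python
def wildcard_pattern(wildcard, constraint, show_constraint=True):
    """Build "{wildcard,constraint}" or, without the constraint, "{wildcard}"."""
    if show_constraint and constraint:
        return "{" + wildcard + "," + constraint + "}"
    return "{" + wildcard + "}"


# Hypothetical integer parameter named "nval" constrained to digits.
with_constraint = wildcard_pattern("nval", r"\d+")
without_constraint = wildcard_pattern("nval", r"\d+", show_constraint=False)
print(with_constraint, without_constraint)
```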
-
types: Dict[str, Type[ymp.stage.params.Param]] = {'choice': <class 'ymp.stage.params.ParamChoice'>, 'flag': <class 'ymp.stage.params.ParamFlag'>, 'int': <class 'ymp.stage.params.ParamInt'>, 'ref': <class 'ymp.stage.params.ParamRef'>}¶ Type/Class mapping for param types
-
property
wildcard¶
-
property
-
class ymp.stage.params.ParamChoice(*args, **kwargs)[source]¶
Bases: ymp.stage.params.Param
Stage Choice Parameter
-
class ymp.stage.params.ParamFlag(*args, **kwargs)[source]¶
Bases: ymp.stage.params.Param
Stage Flag Parameter
-
class ymp.stage.params.ParamInt(*args, **kwargs)[source]¶
Bases: ymp.stage.params.Param
Stage Int Parameter
-
class ymp.stage.params.ParamRef(stage, key, name, value=None, default=None)[source]¶
Bases: ymp.stage.params.Param
Reference Choice Parameter
-
property
regex¶
-
property
-
class ymp.stage.params.Parametrizable(*args, **kwargs)[source]¶
Bases: ymp.stage.base.BaseStage
-
add_param(key, typ, name, value=None, default=None)[source]¶ Add parameter to stage
Example
>>> with Stage("test") as S:
>>>     S.add_param("N", "int", "nval", default=50)
>>>     rule:
>>>         shell: "echo {param.nval}"
This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.
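How a parametrized name such as “testN123” might be taken apart can be sketched with a regular expression. The function `parse_stage_name` and its regular expression are assumptions for illustration, not YMP's matching code.

```python
import re


def parse_stage_name(name, stage="test", key="N", default=50):
    """Return the integer parameter encoded in `name`, or None if no match."""
    # Stage name followed by an optional "<key><digits>" suffix.
    pattern = re.escape(stage) + "(?:" + re.escape(key) + r"(\d+))?"
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    return int(match.group(1)) if match.group(1) else default


print(parse_stage_name("test"))
print(parse_stage_name("testN123"))
```

With these inputs the bare name falls back to the default, while the suffixed name yields the value encoded after the key character.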
- Parameters
key – The character to use in the Stage name
typ – The type of the parameter (int, flag)
name – Name of parameter in params
value – value {param.xyz} should be set to if param given
default – default value for {param.xyz} if no param given
- Return type
-
docstring: Optional[str]¶ The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.
-
match(name)[source]¶ Check if the name can refer to this stage
As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
property
params¶
-
property
regex¶
-
ymp.stage.pipeline module¶
Pipelines Module
Contains classes for pre-configured pipelines comprising multiple stages.
-
class ymp.stage.pipeline.Pipeline(name, cfg)[source]¶
Bases: ymp.stage.params.Parametrizable, ymp.stage.base.ConfigStage
A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.
Pipelines are configured via ymp.yml.
Example
pipelines:
  my_pipeline:
    hide: false
    params:
      length:
        key: L
        type: int
        default: 20
    stages:
      - stage_1{length}:
          hide: true
      - stage_2
      - stage_3
-
can_provide(inputs)[source]¶ Determines which of inputs this stage can provide.
The result dictionary values will point to the “real” output.
-
docstring: Optional[str]¶ The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.
-
get_all_targets(stack)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_group(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids(stack, groups, mygroups=None, target=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
- Parameters
groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups
-
get_path(stack, typ=None)[source]¶ On disk location for this stage given stack.
Called by StageStack to determine the real path for virtual stages (which must override this function).
-
hide_outputs¶ If true, outputs of stages are hidden by default
-
property
outputs¶ The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.
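The override rule for pipeline outputs can be sketched as a dictionary merge. The function `merge_outputs`, the stage names and the file types are hypothetical; the sketch only shows that a file type produced by several stages ends up attributed to the last one.

```python
def merge_outputs(stage_outputs):
    """Combine per-stage outputs; later stages override earlier ones."""
    merged = {}
    for stage_name, file_types in stage_outputs:
        for file_type in file_types:
            merged[file_type] = stage_name  # a later stage replaces the entry
    return merged


# Hypothetical pipeline of two stages that both produce FASTQ files.
stages = [
    ("stage_1", ["/{sample}.fq.gz"]),
    ("stage_2", ["/{sample}.fq.gz", "/{sample}.bam"]),
]
merged = merge_outputs(stages)
print(merged)
```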
-
property
params¶
-
pipeline¶ Path fragment describing this pipeline
-
stages¶ Dictionary of stages with configuration options for each
ymp.stage.project module¶
This module defines “Project”, a Stage type defined by a project matrix file giving units and meta data for input files.
-
class ymp.stage.project.PandasTableBuilder[source]¶
Bases: object
Builds the data table describing each sample in a project
This class implements loading and combining tabular data files as specified by the YAML configuration.
- Format:
string items are files
lists of files are concatenated top to bottom
dicts must have one “command” value:
  ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers
  ‘table’ contains a list of one-item dicts; dicts have the form key:value[,value...]; an in-place table is created from the keys; list-of-dict is necessary as dicts are unordered
  ‘paste’ contains a list of tables pasted left to right; tables pasted must be of equal length or length 1
if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD
Example
- top.csv
- join:
    - excel.xslx%left.csv
    - right.tsv
- table:
    - sample: s1,s2,s3
    - fq1: s1.1.fq, s2.1.fq, s3.1.fq
    - fq2: s1.2.fq, s2.2.fq, s3.2.fq
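The ‘table’ command in the example above can be sketched without pandas. The function `build_inline_table` is hypothetical and uses plain lists and dicts rather than the DataFrame the class presumably builds; it only shows how one-item dicts of comma-separated values become rows.

```python
def build_inline_table(items):
    """Turn a list of one-item dicts ("key: value[,value...]") into rows."""
    columns = {}
    for item in items:
        (key, values), = item.items()  # each dict holds exactly one entry
        columns[key] = [value.strip() for value in values.split(",")]
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


table = build_inline_table([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
    {"fq2": "s1.2.fq, s2.2.fq, s3.2.fq"},
])
for row in table:
    print(row)
```

Using a list of one-item dicts keeps the column order explicit, which matches the rationale given in the format description.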
-
class ymp.stage.project.Project(name, cfg)[source]¶
Bases: ymp.stage.base.ConfigStage
Contains configuration for a source dataset to be processed
-
KEY_BCCOL= 'barcode_col'¶
-
KEY_DATA= 'data'¶
-
KEY_IDCOL= 'id_col'¶
-
KEY_READCOLS= 'read_cols'¶
-
RE_FILE= re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')¶
-
RE_REMOTE= re.compile('^(?:https?|ftp|sftp)://(?:.*)')¶
-
RE_SRR= re.compile('^[SED]RR[0-9]+$')¶
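The three expressions above can be tried on a few sample values. The patterns are copied from the class attributes; the function `classify` and the order in which it applies the patterns are assumptions of this sketch, not the logic used by Project.

```python
import re

# Patterns copied from the Project class attributes above.
RE_FILE = re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')
RE_REMOTE = re.compile('^(?:https?|ftp|sftp)://(?:.*)')
RE_SRR = re.compile('^[SED]RR[0-9]+$')


def classify(value):
    """Label a table cell; the order of checks is an assumption of this sketch."""
    if RE_SRR.match(value):
        return "srr"
    if RE_REMOTE.match(value):
        return "remote"
    if RE_FILE.match(value):
        return "file"
    return "other"


for value in ["SRR123456", "https://example.org/a.fq.gz", "reads/a.fastq", "notes.txt"]:
    print(value, classify(value))
```

Note that a value such as `https://example.org/a.fq.gz` matches both RE_REMOTE and RE_FILE, since RE_FILE only excludes the `http://` prefix, so the order of the checks matters.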
-
choose_id_column()[source]¶ Configures column to use as index on runs
If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
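The selection rule can be sketched on plain rows instead of a pandas dataframe. The function below is a hypothetical stand-in with made-up column names and error messages; it follows the two documented cases (validate a configured column, otherwise take the leftmost unique one).

```python
def choose_id_column(columns, rows, id_col=None):
    """Pick the configured column if valid, else the leftmost unique one."""
    def is_unique(col):
        values = [row[col] for row in rows]
        return len(set(values)) == len(values)

    if id_col is not None:
        if id_col not in columns:
            raise ValueError(f"Configured column '{id_col}' does not exist")
        if not is_unique(id_col):
            raise ValueError(f"Configured column '{id_col}' is not unique")
        return id_col
    for col in columns:  # columns are ordered left to right
        if is_unique(col):
            return col
    raise ValueError("No unique column found")


columns = ["site", "sample"]
rows = [{"site": "gut", "sample": "s1"}, {"site": "gut", "sample": "s2"}]
chosen = choose_id_column(columns, rows)
print(chosen)
```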
-
property
data¶ Pandas dataframe of runs
Lazy loading property, first call may take a while.
-
docstring: Optional[str]¶ The docstring describing this stage. Visible via ymp stage list and in the generated sphinx documentation.
-
property
fq_names¶ Names of all FastQ files
-
property
fwd_fq_names¶ Names of forward FastQ files (se and pe)
-
property
fwd_pe_fq_names¶ Names of forward FastQ files part of pair
-
get_all_targets(stack, output_types=None)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_fq_names(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]¶ Get pipeline names of fq files
-
get_group(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids(stack, groups, match_groups=None, match_values=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
- Parameters
groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups
-
property
idcol¶
-
property
outputs¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
property
pe_fq_names¶ Names of paired end FastQ files
-
property
project_name¶
-
property
rev_pe_fq_names¶ Names of reverse FastQ files part of pair
-
property
runs¶ Pandas dataframe index of runs
Lazy loading property, first call may take a while.
-
property
se_fq_names¶ Names of single end FastQ files
-
property
source_cfg¶
-
property
variables¶
-
ymp.stage.reference module¶
-
class ymp.stage.reference.Archive(name, dirname, tar, url, strip, files)[source]¶
Bases: object
-
dirname= None¶
-
files= None¶
-
hash= None¶
-
name= None¶
-
strip_components= None¶
-
tar= None¶
-
-
class ymp.stage.reference.Reference(name, cfg)[source]¶
Bases: ymp.stage.base.Activateable, ymp.stage.base.ConfigStage
Represents (remote) reference file/database configuration
-
files: Dict[str, str]¶ Files provided by the reference. Keys are the file names within ymp (“target.extension”), symlinked into dir.ref/ref_name/ and values are the path to the reference file from workspace root.
-
get_all_targets(stack)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_group(stack, default_groups)[source]¶ Determine output grouping for stage
- Parameters
stack (StageStack) – The stack for which output grouping is requested.
default_groups (List[str]) – Grouping determined from stage inputs
override_groups – Override grouping from GroupBy stage or None.
- Return type
-
get_ids(stack, groups, match_groups=None, match_value=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
-
get_path(_stack)[source]¶ On disk location for this stage given stack.
Called by StageStack to determine the real path for virtual stages (which must override this function).
-
property
outputs¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
rules: List[snakemake.rules.Rule]¶ Rules in this stage
-
ymp.stage.stack module¶
Implements the StageStack
-
class ymp.stage.stack.StageStack(path)[source]¶
Bases: object
The “head” of a processing chain - a stack of stages
-
debug= False¶ Set to true to enable additional Stack debug logging
-
property
defined_in¶
-
group: List[str]¶ Grouping in effect for this StageStack. An empty list groups into one pseudo target, ‘ALL’.
-
classmethod
instance(path)[source]¶ Cached access to StageStack
- Parameters
path – Stage path
stage – Stage object at head of stack
-
name¶ Name of stack, which is its full path
-
property
path¶ On disk location of files provided by this stack
-
prev_stage¶ Stage below top stage or None if first in stack
-
prevs¶ Mapping of each input type required by the stage of this stack to the prefix stack providing it.
-
project¶ Project on which stack operates. This is currently needed for grouping variables.
-
stage¶ Top Stage
-
stage_name¶ Top Stage Name
-
stage_names¶ Names of stages on stack
-
stages¶ Stages on stack
-
target(args, kwargs)[source]¶ Determines the IDs for a given input data type and output ID (replaces “{:target:}”).
-
property
targets¶ Determines the IDs to be built by this Stage Stack (replaces “{:targets:}”).
-
used_stacks= {}¶
-
ymp.stage.stage module¶
Implements the “Stage”
At its most basic, a “Stage” is a set of Snakemake rules that share an output folder.
-
class ymp.stage.stage.Stage(name, altname=None, env=None, doc=None)[source]¶
Bases: ymp.snakemake.WorkflowObject, ymp.stage.params.Parametrizable, ymp.stage.base.Activateable, ymp.stage.base.BaseStage
Creates a new stage
While entered using with, several stage specific variables are expanded within rules:
{:this:} – The current stage directory
{:that:} – The alternate output stage directory
{:prev:} – The previous stage’s directory
- Parameters
-
env(name)[source]¶ Add package specifications to Stage environment
Note
This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types not supporting conda environments
- Parameters
name (str) – Environment name or filename
>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>     S.env("blast")
>>>     rule testing:
>>>         ...
>>> with Stage("test", env="blast") as S:
>>>     rule testing:
>>>         ...
>>> with Stage("test") as S:
>>>     rule testing:
>>>         conda: "blast"
>>>         ...
- Return type
-
get_all_targets(stack)[source]¶ Targets to build to complete this stage given stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_ids(stack, groups, mygroups=None, target=None)[source]¶ Determine the target ID names for a set of active groupings
Called from {:target:} and {:targets:}. For {:targets:}, groups is the set of active groupings for the stage stack. For {:target:}, it’s the same set for the source of the file type, the current grouping and the current target.
- Parameters
groups – Set of columns the values of which should form IDs
match_value – Limit output to rows with this value
match_groups – … in these groups
-
get_inputs()[source]¶ Returns the set of inputs required by this stage
This function must return a copy, to ensure internal data is not modified.
-
match(name)[source]¶ Check if the name can refer to this stage
As component of a StageStack, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
-
property
outputs¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
prev(_args, kwargs)[source]¶ Gathers {:prev:} calls from rules
Here, input requirements for each stage are collected.
- Return type
-
require(**kwargs)[source]¶ Override inferred stage inputs
In theory, this should not be needed. But it’s simpler for now.
-
requires¶ Contains override stage inputs
-
that(_args=None, kwargs=None)[source]¶ Alternate directory of current stage
Used for splitting stages