ymp.stage package
YMP processes data in stages, each of which is contained in its own directory.
with Stage("trim_bbmap") as S:
    S.doc("Trim reads with BBMap")
    rule bbmap_trim:
        output: "{:this:}/{sample}{:pairnames:}.fq.gz"
        input:  "{:prev:}/{sample}{:pairnames:}.fq.gz"
        ...
Submodules
ymp.stage.base module
-
class
ymp.stage.base.
BaseStage
(name)[source]¶ Bases:
object
Base class for stage types
-
STAMP_FILENAME
= 'all_targets.stamp'¶ The name of the stamp file that is touched to indicate completion of the stage.
-
can_provide
(inputs)[source]¶ Determines which of
inputs
this stage can provide. Returns a dictionary whose keys are a subset of
inputs
and whose values identify redirections. An empty string indicates that no redirection takes place. Otherwise, the string is the suffix to be appended to the prior StageStack.
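The shape of the return value described above can be sketched as follows. This is only an illustration: the function `can_provide_sketch` and its `outputs` mapping are hypothetical and not part of YMP, and the file patterns and the suffix are made-up values.

```python
def can_provide_sketch(inputs, outputs):
    """Return {input: redirection} for each requested input found in outputs.

    `outputs` is a hypothetical mapping from output name to redirection
    suffix; an empty string means "no redirection".
    """
    return {name: outputs[name] for name in inputs if name in outputs}


# Hypothetical outputs: one provided directly, one redirected.
outputs = {"/{sample}.fq.gz": "", "/{sample}.bam": ".map_bowtie2"}
result = can_provide_sketch({"/{sample}.fq.gz", "/{sample}.fasta"}, outputs)
print(result)
```

Only the requested input that the stage knows about appears in the result; the unknown `/{sample}.fasta` is simply left out.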
-
doc
(doc)[source]¶ Add documentation to Stage
- Parameters
doc (
str
) – Docstring passed to Sphinx
- Return type
None
-
docstring
: str¶ The docstring describing this stage. Visible via
ymp stage list
and in the generated sphinx documentation.
-
get_all_targets
(stack)[source]¶ Targets to build to complete this stage given
stack.
Typically, this is the StageStack’s path appended with the stamp name.
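A minimal sketch of the typical case, assuming "appended" means joined as a path component. The helper `all_targets_sketch` and the example path are hypothetical; only the stamp file name is taken from the documented `STAMP_FILENAME`.

```python
import posixpath

STAMP_FILENAME = "all_targets.stamp"  # value documented for BaseStage


def all_targets_sketch(stack_path):
    """Join a stack path and the stamp file name (hypothetical helper)."""
    return posixpath.join(stack_path, STAMP_FILENAME)


target = all_targets_sketch("toy/trim_bbmap")
print(target)
```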
-
get_inputs
()[source]¶ Returns the set of inputs required by this stage
This function must return a copy, to ensure internal data is not modified.
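The copy requirement can be illustrated with a small stand-in class. `InputsSketch` is hypothetical and not the YMP implementation; it only shows why returning a copy protects the internal state.

```python
class InputsSketch:
    """Hypothetical stand-in for a stage; not the YMP class."""

    def __init__(self, inputs):
        self._inputs = set(inputs)

    def get_inputs(self):
        # Return a copy so callers cannot mutate the internal set.
        return set(self._inputs)


stage = InputsSketch({"/{sample}.fq.gz"})
requested = stage.get_inputs()
requested.add("/{sample}.bam")  # modifies only the caller's copy
print(stage.get_inputs())
```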
-
get_path
(stack)[source]¶ On disk location for this stage given
stack.
Called by
StageStack
to determine the real path for virtual stages (which must override this function).
- Return type
-
match
(name)[source]¶ Check if the
name
can refer to this stage.
As a component of a
StageStack
, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
- Return type
-
name
¶ The name of the stage is a string uniquely identifying it among all stages.
-
-
class
ymp.stage.base.
ConfigStage
(name, cfg)[source]¶ Bases:
ymp.stage.base.BaseStage
Base for stages created via configuration
These Stages derive from the
ymp.yml
and not from a rules file.
-
cfg
¶ The configuration object defining this Stage.
-
property
defined_in
¶ List of files defining this stage
Used to invalidate caches.
-
filename
¶ Semicolon-separated list of file names defining this Stage.
-
lineno
¶ Line number within the first file at which this Stage is defined.
-
ymp.stage.expander module
-
class
ymp.stage.expander.
StageExpander
[source]¶ Bases:
ymp.snakemake.ColonExpander
Registers rules with stages when they are created
-
class
Formatter
(expander)[source]¶ Bases:
ymp.snakemake.FormatExpander.Formatter
,ymp.string.PartialFormatter
ymp.stage.groupby module
-
class
ymp.stage.groupby.
GroupBy
(name)[source]¶ Bases:
ymp.stage.base.BaseStage
Dummy stage for grouping
ymp.stage.pipeline module
Pipelines Module
Contains classes for pre-configured pipelines comprising multiple stages.
-
class
ymp.stage.pipeline.
Pipeline
(name, cfg)[source]¶ Bases:
ymp.stage.base.ConfigStage
A virtual stage aggregating a sequence of stages, i.e. a pipeline or sub-workflow.
Pipelines are configured via
ymp.yml.
Example
pipelines:
  my_pipeline:
    - stage_1
    - stage_2
    - stage_3
-
can_provide
(inputs)[source]¶ Determines which of
inputs
this stage can provide. The result dictionary values will point to the “real” output.
-
get_all_targets
(stack)[source]¶ Targets to build to complete this stage given
stack.
Typically, this is the StageStack’s path appended with the stamp name.
-
get_path
(stack)[source]¶ On disk location for this stage given
stack.
Called by
StageStack
to determine the real path for virtual stages (which must override this function).
-
property
outputs
¶ The outputs of a pipeline are the sum of the outputs of each component stage. Outputs of stages further down the pipeline override those generated earlier.
TODO: Allow hiding the output of intermediary stages.
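The override rule for pipeline outputs can be sketched as a merge in pipeline order. The helper `pipeline_outputs_sketch` and the `(stage_name, output_names)` pairs are hypothetical; the real property may store different data.

```python
def pipeline_outputs_sketch(stages):
    """Merge per-stage outputs in pipeline order; later stages win.

    `stages` is a hypothetical list of (stage_name, output_names) pairs.
    """
    merged = {}
    for stage_name, output_names in stages:
        for output in output_names:
            merged[output] = stage_name  # overrides any earlier producer
    return merged


stages = [
    ("stage_1", ["/{sample}.fq.gz"]),
    ("stage_2", ["/{sample}.fq.gz", "/{sample}.bam"]),
]
merged = pipeline_outputs_sketch(stages)
print(merged)
```

Here both stages produce `/{sample}.fq.gz`, so the entry ends up pointing at `stage_2`, the stage further down the pipeline.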
-
property
pipeline
¶
ymp.stage.project module
-
class
ymp.stage.project.
PandasTableBuilder
[source]¶ Bases:
object
Builds the data table describing each sample in a project
This class implements loading and combining tabular data files as specified by the YAML configuration.
- Format:
  string items are files
  lists of files are concatenated top to bottom
  dicts must have one “command” value:
    ‘join’ contains a two-item list; the two items are joined ‘naturally’ on shared headers
    ‘table’ contains a list of one-item dicts of the form key:value[,value...]; an in-place table is created from the keys (a list of dicts is necessary as dicts are unordered)
    ‘paste’ contains a list of tables pasted left to right; pasted tables must be of equal length or length 1
  if a value is a valid path relative to the csv/tsv/xls file’s location, it is expanded to a path relative to CWD
Example
- top.csv
- join:
  - excel.xslx%left.csv
  - right.tsv
- table:
  - sample: s1,s2,s3
  - fq1: s1.1.fq, s2.1.fq, s3.1.fq
  - fq2: s1.2.fq, s2.2.fq, s3.2.fq
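To make the ‘table’ command more concrete, the sketch below turns a list of one-item dicts into rows using only the standard library. `table_sketch` is a hypothetical helper, it returns plain dicts rather than a pandas table, and the requirement that all columns have the same length is an assumption of this sketch.

```python
def table_sketch(items):
    """Build rows from a list of one-item dicts of the form key: "v1,v2,...".

    Hypothetical helper; returns a list of row dicts instead of a DataFrame.
    """
    columns = {}
    for item in items:
        if len(item) != 1:
            raise ValueError("each table entry must be a one-item dict")
        (key, values), = item.items()
        columns[key] = [value.strip() for value in values.split(",")]
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # Assumption of this sketch: columns must be equally long.
        raise ValueError("all columns must have the same number of values")
    n_rows = lengths.pop() if lengths else 0
    return [{key: columns[key][i] for key in columns} for i in range(n_rows)]


rows = table_sketch([
    {"sample": "s1,s2,s3"},
    {"fq1": "s1.1.fq, s2.1.fq, s3.1.fq"},
    {"fq2": "s1.2.fq, s2.2.fq, s3.2.fq"},
])
print(rows[0])
```

Because the entries arrive as a list, the column order follows the order in the configuration.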
-
class
ymp.stage.project.
Project
(name, cfg)[source]¶ Bases:
ymp.stage.base.ConfigStage
Contains configuration for a source dataset to be processed
-
KEY_BCCOL
= 'barcode_col'¶
-
KEY_DATA
= 'data'¶
-
KEY_IDCOL
= 'id_col'¶
-
KEY_READCOLS
= 'read_cols'¶
-
RE_FILE
= re.compile('^(?!http://).*(?:fq|fastq)(?:|\\.gz)$')¶
-
RE_REMOTE
= re.compile('^(?:https?|ftp|sftp)://(?:.*)')¶
-
RE_SRR
= re.compile('^[SED]RR[0-9]+$')¶
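The three patterns above can be tried out directly. In the sketch below the regular expressions are copied from the class attributes, while the function `classify_sketch`, its labels and the order of the checks are assumptions made for illustration, not YMP's own logic.

```python
import re

# Patterns copied from the class attributes documented above.
RE_REMOTE = re.compile(r"^(?:https?|ftp|sftp)://(?:.*)")
RE_SRR = re.compile(r"^[SED]RR[0-9]+$")
RE_FILE = re.compile(r"^(?!http://).*(?:fq|fastq)(?:|\.gz)$")


def classify_sketch(value):
    """Label a table cell; the check order is an assumption of this sketch."""
    if RE_SRR.match(value):
        return "srr"
    if RE_REMOTE.match(value):
        return "remote"
    if RE_FILE.match(value):
        return "file"
    return "other"


for value in ["SRR1234567", "ftp://example.org/a.fq.gz", "reads/s1.1.fastq", "s1"]:
    print(value, "->", classify_sketch(value))
```

Note that the lookahead in `RE_FILE` excludes only `http://`, so an `https://` or `ftp://` URL ending in `.fq.gz` matches both `RE_REMOTE` and `RE_FILE`. The sketch checks `RE_REMOTE` first for that reason.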
-
choose_id_column
()[source]¶ Configures column to use as index on runs
If explicitly configured via KEY_IDCOL, verifies that the column exists and that it is unique. Otherwise chooses the leftmost unique column in the data.
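The selection rule can be sketched without pandas. `choose_id_column_sketch` is a hypothetical helper working on a list of row dicts; its `configured` argument stands in for an explicit `id_col` setting, and the exception types are choices of this sketch.

```python
def choose_id_column_sketch(columns, rows, configured=None):
    """Pick an index column from tabular data (hypothetical helper).

    `columns` is the left-to-right list of column names, `rows` a list of
    dicts, and `configured` stands in for an explicit `id_col` setting.
    """
    def is_unique(column):
        values = [row[column] for row in rows]
        return len(set(values)) == len(values)

    if configured is not None:
        # Explicit configuration: verify existence and uniqueness.
        if configured not in columns:
            raise KeyError(f"configured id column {configured!r} not found")
        if not is_unique(configured):
            raise ValueError(f"configured id column {configured!r} is not unique")
        return configured
    # Otherwise take the leftmost column whose values are all distinct.
    for column in columns:
        if is_unique(column):
            return column
    raise ValueError("no unique column available")


columns = ["project", "sample", "fq1"]
rows = [
    {"project": "toy", "sample": "s1", "fq1": "s1.1.fq"},
    {"project": "toy", "sample": "s2", "fq1": "s2.1.fq"},
]
chosen = choose_id_column_sketch(columns, rows)
print(chosen)
```

The `project` column repeats the value `toy`, so the leftmost unique column is `sample`.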
-
property
data
¶ Pandas dataframe of runs
Lazy loading property, first call may take a while.
-
property
fq_names
¶ Names of all FastQ files
-
property
fwd_fq_names
¶ Names of forward FastQ files (se and pe)
-
property
fwd_pe_fq_names
¶ Names of forward FastQ files part of pair
-
get_fq_names
(only_fwd=False, only_rev=False, only_pe=False, only_se=False)[source]¶ Get pipeline names of fq files
-
property
idcol
¶
-
property
outputs
¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict, with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
property
pe_fq_names
¶ Names of paired end FastQ files
-
property
project_name
¶
-
property
rev_pe_fq_names
¶ Names of reverse FastQ files part of pair
-
property
runs
¶ Pandas dataframe index of runs
Lazy loading property, first call may take a while.
-
property
se_fq_names
¶ Names of single end FastQ files
-
property
source_cfg
¶
-
property
variables
¶
-
ymp.stage.reference module
-
class
ymp.stage.reference.
Archive
(name, dirname, tar, url, strip, files)[source]¶ Bases:
object
-
dirname
= None¶
-
files
= None¶
-
hash
= None¶
-
name
= None¶
-
strip_components
= None¶
-
tar
= None¶
-
-
class
ymp.stage.reference.
Reference
(name, cfg)[source]¶ Bases:
ymp.stage.base.ConfigStage
Represents (remote) reference file/database configuration
-
get_path
(_stack)[source]¶ On disk location for this stage given
stack.
Called by
StageStack
to determine the real path for virtual stages (which must override this function).
-
ymp.stage.stack module
-
class
ymp.stage.stack.
StageStack
(path, stage=None)[source]¶ Bases:
object
The “head” of a processing chain - a stack of stages
-
property
defined_in
¶
-
classmethod
get
(path, stage=None)[source]¶ Cached access to StageStack
- Parameters
path – Stage path
stage – Stage object at head of stack
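Cached access through a classmethod is a common pattern and might look roughly like the sketch below. `StackSketch` is a hypothetical stand-in, not the YMP class, and keying the cache by path alone is an assumption. The class-level dictionary is named after the documented `used_stacks` attribute.

```python
class StackSketch:
    """Hypothetical stand-in for a cached stack class; not the YMP class."""

    used_stacks = {}  # class-level cache keyed by path

    def __init__(self, path, stage=None):
        self.path = path
        self.stage = stage

    @classmethod
    def get(cls, path, stage=None):
        # Create the object on first request, then reuse it.
        if path not in cls.used_stacks:
            cls.used_stacks[path] = cls(path, stage)
        return cls.used_stacks[path]


first = StackSketch.get("toy.trim_bbmap")
second = StackSketch.get("toy.trim_bbmap")
print(first is second)
```

Repeated calls with the same path return the same object, so any expensive setup in the constructor runs only once per path.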
-
property
path
¶ On disk location of files provided by this stack
-
property
targets
¶ Returns the current targets
-
used_stacks
= {}¶
-
property
ymp.stage.stage module
-
class
ymp.stage.stage.
Param
(stage, key, name, value=None, default=None)[source]¶ Bases:
object
Stage Parameter (base class)
-
property
constraint
¶
-
property
-
class
ymp.stage.stage.
ParamChoice
(*args, **kwargs)[source]¶ Bases:
ymp.stage.stage.Param
Stage Choice Parameter
-
class
ymp.stage.stage.
ParamFlag
(*args, **kwargs)[source]¶ Bases:
ymp.stage.stage.Param
Stage Flag Parameter
-
class
ymp.stage.stage.
ParamInt
(*args, **kwargs)[source]¶ Bases:
ymp.stage.stage.Param
Stage Int Parameter
-
class
ymp.stage.stage.
Stage
(name, altname=None, env=None, doc=None)[source]¶ Bases:
ymp.snakemake.WorkflowObject
,ymp.stage.base.BaseStage
Creates a new stage
While entered using
with
, several stage-specific variables are expanded within rules:
{:this:} – The current stage directory
{:that:} – The alternate output stage directory
{:prev:} – The previous stage’s directory
- Parameters
-
active
= None¶ Currently active stage (“entered”)
-
add_param
(key, typ, name, value=None, default=None)[source]¶ Add parameter to stage
Example
>>> with Stage("test") as S:
>>>     S.add_param("N", "int", "nval", default=50)
>>>     rule:
>>>         shell: "echo {param.nval}"
This would add a stage “test”, optionally callable as “testN123”, printing “50” or in the case of “testN123” printing “123”.
- Parameters
key – The character to use in the Stage name
typ – The type of the parameter (int, flag)
name – Name of parameter in params
value – value {param.xyz} should be set to if param given
default – default value for {param.xyz} if no param given
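How a name such as "testN123" could be split into stage name and parameter value is sketched below. `parse_stage_name_sketch` is a hypothetical helper using a regular expression; YMP's own matching may work differently.

```python
import re


def parse_stage_name_sketch(name, stage="test", key="N", default=50):
    """Extract an integer parameter from a stage name such as "testN123".

    Hypothetical helper; returns None if `name` does not refer to `stage`.
    """
    pattern = rf"{re.escape(stage)}(?:{re.escape(key)}(\d+))?"
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    # Use the default when the optional parameter suffix is absent.
    return int(match.group(1)) if match.group(1) else default


print(parse_stage_name_sketch("test"))
print(parse_stage_name_sketch("testN123"))
```

This mirrors the example above: the bare name yields the default of 50, while "testN123" yields 123.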
-
env
(name)[source]¶ Add package specifications to Stage environment
Note
This sets the environment for all rules within the stage, which leads to errors with Snakemake rule types not supporting conda environments
- Parameters
name (
str
) – Environment name or filename
>>> Env("blast", packages="blast =2.7*")
>>> with Stage("test") as S:
>>>     S.env("blast")
>>>     rule testing:
>>>         ...

>>> with Stage("test", env="blast") as S:
>>>     rule testing:
>>>         ...

>>> with Stage("test") as S:
>>>     rule testing:
>>>         conda: "blast"
>>>         ...
- Return type
None
-
get_inputs
()[source]¶ Returns the set of inputs required by this stage
This function must return a copy, to ensure internal data is not modified.
-
match
(name)[source]¶ Check if the
name
can refer to this stage.
As a component of a
StageStack
, a stage may be identified by alternative names and may also be parametrized by suffix modifiers. Stage types supporting this behavior must override this function.
-
property
outputs
¶ Returns the set of outputs this stage is able to generate.
May return either a set or a dict, with the dictionary values representing redirections in the case of virtual stages such as Pipeline or Reference.
-
require
(**kwargs)[source]¶ Override inferred stage inputs
In theory, this should not be needed. But it’s simpler for now.