Workflow configuration reference

samples

Path to the samples.tsv file, which contains one row per sample. The file can be parsed easily with pandas and opened in LibreOffice or Excel. Typically:

samples: "config/samples.tsv"
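As a minimal sketch of how such a file can be read, the snippet below parses a tab-separated table held in memory. It uses Python's standard csv module instead of pandas only to stay dependency-free. The column names follow the examples in this reference, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical content of config/samples.tsv (illustration only).
SAMPLES_TSV = (
    "rna_id\tprobe\ttemperature\tmagnesium\n"
    "sequence_id_1\t1M7\t37\tnoMg\n"
    "sequence_id_1\tNMIA\t37\tnoMg\n"
)


def read_samples(handle):
    """Return one dict per sample row of a tab-separated file."""
    return list(csv.DictReader(handle, delimiter="\t"))


samples = read_samples(io.StringIO(SAMPLES_TSV))
print(len(samples), samples[0]["probe"])
```

With a real file, the same function can be called on `open("config/samples.tsv", newline="")`.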

sequences

List of the studied RNAs: each entry maps an identifier to the corresponding FASTA file. The FASTA file should contain a description line giving the origin of the sequence (sequence database) and its unique identifier. Every value of the “rna_id” column in samples.tsv must have a corresponding entry in this section.

Example:

sequences:
  sequence_id_1: "resources/sequence1.fa"
  sequence_id_2: "resources/sequence2.fa"
  sequence_id_3: "resources/sequence3.fa"
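The rule that every “rna_id” needs an entry in sequences can be checked before running the workflow. The sketch below is not part of the workflow: `missing_sequences` is a hypothetical helper, and the data are made up for illustration.

```python
def missing_sequences(samples, sequences):
    """Return the sorted rna_id values that have no entry in `sequences`."""
    return sorted({row["rna_id"] for row in samples} - set(sequences))


# Hypothetical data: the sequences section and two sample rows.
sequences = {
    "sequence_id_1": "resources/sequence1.fa",
    "sequence_id_2": "resources/sequence2.fa",
}
samples = [{"rna_id": "sequence_id_1"}, {"rna_id": "sequence_id_3"}]

missing = missing_sequences(samples, sequences)
print(missing)
```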

conditions

List each condition here. Each item in this list must correspond to a column of the samples.tsv file and must also appear in format -> condition.

Example:

conditions:
  - probe
  - temperature
  - magnesium
  - ...
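Both constraints on the conditions list can be verified with a few lines of Python. This is a sketch under the assumption that format -> condition uses Python-style `{name}` placeholders, as in the examples of this reference. `template_fields` and `check_conditions` are hypothetical helpers, not functions of the workflow.

```python
from string import Formatter


def template_fields(template):
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_conditions(conditions, sample_columns, condition_format):
    """List every condition that breaks one of the two rules."""
    problems = []
    fields = template_fields(condition_format)
    for cond in conditions:
        if cond not in sample_columns:
            problems.append(f"{cond}: no such column in samples.tsv")
        if cond not in fields:
            problems.append(f"{cond}: not referenced in format -> condition")
    return problems


# Hypothetical inputs: "magnesium" is missing from both places.
problems = check_conditions(
    ["probe", "temperature", "magnesium"],
    ["rna_id", "probe", "temperature"],
    "{probe}_{temperature}",
)
print(problems)
```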

allow_auto_import

Allow Snakemake to import external files automatically. Not fully tested yet.

format

Section defining the formatting of file names and titles.

condition

Format of the conditions, used to identify samples and to construct file names. All variable conditions should be referenced here. Each condition name must correspond to a column header in the samples.tsv file.

Example:

format:
  ...
  condition: "{probe}_{temperature}_{magnesium}"
  ...
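To illustrate how such a template turns one sample row into a label, the sketch below expands it with Python's `str.format`. Whether the workflow performs the substitution exactly this way is an assumption, and the sample values are hypothetical.

```python
condition_format = "{probe}_{temperature}_{magnesium}"

# Hypothetical row of samples.tsv; extra columns such as rna_id are ignored.
sample = {
    "rna_id": "sequence_id_1",
    "probe": "1M7",
    "temperature": 37,
    "magnesium": "noMg",
}

label = condition_format.format(**sample)
print(label)
```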

control_condition

Same as condition, but for control data. {control} is replaced by the value of rawdata -> control.

Example:

format:
  ...
  control_condition: "{control}_of_{probe}_{temperature}C_{magnesium}"
  ...
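The sketch below shows the same kind of expansion for control data, taking the control name from a configuration dictionary. The `config` dictionary and the sample values are hypothetical, and the use of `str.format` is an assumption about how the substitution behaves.

```python
# Hypothetical excerpt of the loaded configuration.
config = {
    "rawdata": {"control": "DMSO"},
    "format": {
        "control_condition": "{control}_of_{probe}_{temperature}C_{magnesium}"
    },
}

# Hypothetical condition values of one sample.
sample = {"probe": "1M7", "temperature": 37, "magnesium": "noMg"}

control_label = config["format"]["control_condition"].format(
    control=config["rawdata"]["control"], **sample
)
print(control_label)
```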

message

Format used to create readable Snakemake messages when running the workflow. Conditions must be prefixed with “wildcards.”.

Example:

format:
  ...
  message: "{wildcards.rna_id} with {wildcards.probe} with {wildcards.temperature}°C {wildcards.magnesium}"
  ...
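The “wildcards.” prefix works because the placeholders are resolved as attributes of a wildcards object. Snakemake does this formatting itself. The sketch below only imitates it, with a `SimpleNamespace` standing in for Snakemake's wildcards object and with hypothetical values.

```python
from types import SimpleNamespace

message_format = (
    "{wildcards.rna_id} with {wildcards.probe} with "
    "{wildcards.temperature}°C {wildcards.magnesium}"
)

# Stand-in for the wildcards object that Snakemake provides to a rule.
wildcards = SimpleNamespace(
    rna_id="sequence_id_1", probe="1M7", temperature="37", magnesium="noMg"
)

message = message_format.format(wildcards=wildcards)
print(message)
```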

rawdata

Information about the raw data.

path_prefix

Path to the folder containing the raw experimental data. It is prefixed to “control_file” and “probe_file” so that data can be imported automatically from this folder. Both absolute paths and paths relative to the project root are accepted.

Example:

rawdata:
  ...
  path_prefix: "path/to/my/raw/data/folder"
  ...

type

Type of the raw data. It sets the starting point of the pipeline: either files taken directly from the sequencer or data already converted to CSV. Accepted values: “fluo-ceq8000” or “fluo-ce”.

Example:

rawdata:
  ...
  type: "fluo-ceq8000"
  ...

control

Name of the control

Example:

rawdata:
  ...
  control: "DMSO"
  ...

qushape

QuShape project generator and extractor. This section configures how Snakemake generates QuShape projects. These settings must match the configuration of your capillary sequencer.

- channels: definition of the sequencer channels used for this project. Channel indices start at 0 here: channel 1 corresponds to the value 0.
- use_subsequence: boolean telling whether the pipeline should handle RNAs longer than what your sequencer can read. In that case, you use several reverse transcription primers to obtain several starting points for sequencing, and the workflow assembles the resulting subsequences into a full RNA.
- run_qushape: asks the workflow to launch QuShape while running. You must also set run_qushape_conda_env to the conda environment that contains QuShape.
- check_integrity: true by default, and it is highly recommended to keep it so. When true, the workflow checks that the sequence inside the QuShape file is the same as the sequence declared in samples.tsv and config.yaml. If you use a QuShape file that was not generated by the workflow, it may contain another sequence (a longer one, for example), in which case the check fails. Disable this check only if you know what you are doing. Warning: wrong sequences can lead to incoherent data, duplicated entries in output files, and other inconsistencies.

Example:

qushape:
 check_integrity: true
 use_subsequence: false
 run_qushape: false
 run_qushape_conda_env: qushape
 channels:
   RX: 0 # Channel 1
   RXS1: 2 # Channel 3
   BG: 0 # Channel 1
   BGS1: 2 # Channel 3
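Because the configuration counts channels from 0 while the sequencer counts them from 1, an off-by-one mistake is easy to make. The hypothetical helper below (not part of the workflow) performs the conversion and reproduces the values of the example.

```python
def channel_index(channel_number):
    """Convert a 1-based sequencer channel number to the 0-based config value."""
    if channel_number < 1:
        raise ValueError("sequencer channels are numbered from 1")
    return channel_number - 1


# Same assignment as in the example: channels 1 and 3 of the sequencer.
channels = {
    "RX": channel_index(1),
    "RXS1": channel_index(3),
    "BG": channel_index(1),
    "BGS1": channel_index(3),
}
print(channels)
```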

normalization

Reactivity Normalization configuration

The items in this section are equivalent to the flags available when running

python workflow/scripts/tools/normalize_reactivity.py --help

Refer to the output of this command for more accurate information.

normalization:
  # Nucleotides that are reactive to the SHAPE probe
  reactive_nucleotides: ["A", "C", "G", "U"]
  # All values above this percentile are considered outliers, prior to any other treatment.
  stop_percentile: 90.
  # Values below this threshold are considered non-reactive (0.0)
  low_norm_reactivity_threshold: -0.3
  # Normalization methods to use; several are accepted. Authorized values are: simple, interquartile
  norm_methods:
    - simple
    - interquartile
  # Outlier threshold for the simple normalization method
  simple_outlier_percentile: 98.
  # Percentile of values averaged to create the normalization term
  simple_norm_term_avg_percentile: 90.

aggregate

Replicate Aggregation configuration

The items in this section are equivalent to the flags available when running

python workflow/scripts/tools/aggregate_reactivity.py --help

Refer to the output of this command for more accurate information.

aggregate:
  norm_method: simple # or interquartile

  # You can specify the name of the column used for normalization instead of the normalization method
  #norm_column: "simple_norm_reactivity"
  min_ndata_perc: 0.5
  min_nsubdata_perc: 0.66
  max_mean_perc: 0.682
  min_dispersion: 0.05

ipanemap

IPANEMAP configuration

To calculate structures, you can define how conditions are combined as input for IPANEMAP. The pools section lets you do so.

Example:

ipanemap:
  ...
  pools:
    - id: 1M7_noMg_37C
      rna_id: sequence1
      conditions:
        - temperature: 37
          magnesium: noMg
          probe: 1M7
    - id: allProbe_noMg_37C
      rna_id: sequence1
      conditions:
        - temperature: 37
          magnesium: noMg
          probe: 1M7
        - temperature: 37
          magnesium: noMg
          probe: NMIA
        - temperature: 37
          magnesium: noMg
          probe: BzCN
  ...

Each run of IPANEMAP can be entirely configured. To create a new run (or pool), add a dash in the pools section. Each pool should have a unique id of your choice.

rna_id is the sequence given as input to IPANEMAP. This id must correspond to an id in the sequences section of this configuration file.

conditions lists the sets of conditions you want to add to the IPANEMAP run. Each set of conditions starts with a dash and must contain all the conditions referenced in the conditions section of this configuration file.
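The three rules on pools (unique id, known rna_id, complete condition sets) lend themselves to a quick sanity check. The sketch below is a hypothetical helper written for this reference, not code from the workflow, and the pools it checks are deliberately faulty.

```python
def check_pools(pools, sequences, conditions):
    """Return a list of human-readable problems found in the pools section."""
    problems = []
    seen_ids = set()
    for pool in pools:
        pool_id = pool["id"]
        if pool_id in seen_ids:
            problems.append(f"{pool_id}: duplicated pool id")
        seen_ids.add(pool_id)
        if pool["rna_id"] not in sequences:
            problems.append(f"{pool_id}: unknown rna_id {pool['rna_id']!r}")
        for index, cond_set in enumerate(pool["conditions"]):
            missing = [name for name in conditions if name not in cond_set]
            if missing:
                problems.append(
                    f"{pool_id}: condition set {index} lacks {', '.join(missing)}"
                )
    return problems


# Hypothetical pools: the second one reuses an id, names an unknown
# sequence, and omits the magnesium condition.
pools = [
    {"id": "p1", "rna_id": "sequence1",
     "conditions": [{"temperature": 37, "magnesium": "noMg", "probe": "1M7"}]},
    {"id": "p1", "rna_id": "unknown",
     "conditions": [{"temperature": 37, "probe": "1M7"}]},
]
problems = check_pools(
    pools,
    {"sequence1": "resources/sequence1.fa"},
    ["probe", "temperature", "magnesium"],
)
print("\n".join(problems))
```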

In the ipanemap example:

The first pool (1M7_noMg_37C) is run with “sequence1” and one condition:
- all aggregated samples with no magnesium, a temperature of 37°C, and 1M7 probing

The second pool (allProbe_noMg_37C) is run with “sequence1” and three conditions:
- all aggregated samples with no magnesium, a temperature of 37°C, and 1M7 probing
- all aggregated samples with no magnesium, a temperature of 37°C, and NMIA probing
- all aggregated samples with no magnesium, a temperature of 37°C, and BzCN probing
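The way a pool picks its samples can be pictured as a filter over the sample table: one group of samples per condition set. The sketch below is only an illustration of that idea. `select_samples` is a hypothetical helper, the sample rows are made up, and values are compared as strings on the assumption that the TSV yields text while the YAML may yield numbers.

```python
def select_samples(samples, rna_id, condition_sets):
    """Return, for each condition set, the samples of `rna_id` that match it."""
    groups = []
    for cond_set in condition_sets:
        matched = [
            sample for sample in samples
            if sample["rna_id"] == rna_id
            and all(str(sample[key]) == str(value)
                    for key, value in cond_set.items())
        ]
        groups.append(matched)
    return groups


# Hypothetical rows of samples.tsv.
samples = [
    {"rna_id": "sequence1", "probe": "1M7", "temperature": "37", "magnesium": "noMg"},
    {"rna_id": "sequence1", "probe": "NMIA", "temperature": "37", "magnesium": "noMg"},
    {"rna_id": "sequence1", "probe": "1M7", "temperature": "25", "magnesium": "noMg"},
    {"rna_id": "sequence2", "probe": "1M7", "temperature": "37", "magnesium": "noMg"},
]

# First pool of the example: a single condition set.
pool = {
    "id": "1M7_noMg_37C",
    "rna_id": "sequence1",
    "conditions": [{"temperature": 37, "magnesium": "noMg", "probe": "1M7"}],
}

groups = select_samples(samples, pool["rna_id"], pool["conditions"])
print(len(groups), len(groups[0]))
```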

The parameters in sampling, clustering, pareto, and visualization are equivalent to the content of the IPANEMAP config file. Please refer to the IPANEMAP documentation:

https://github.com/afafbioinfo/IPANEMAP

folders

Folder names

Intermediate folder names are fully customizable.

Default values:

folders:
  fluo-ceq8000: 1.1-fluo-ceq8000
  fluo-ce: 1.2-fluo-ce
  qushape: 2-qushape
  reactivity: 3.1-reactivity
  normreact: 3.2-normreact
  aggreact: 4.1-aggreact
  aggreact-ipanemap: 4.2-aggreact-ipanemap
  ipanemap-config: 5.1-ipanemap-config
  ipanemap-out: 5.2-ipanemap-out
  structure: 5.3-structure
  varna: 5.4-varna