Python CLI

A walk-through of Python CLI

CLI Reference

CLI Reference [OPTIONS] COMMAND [ARGS]...

Options

--debug, --no-debug

create

Creates a ribo file from a given reference, annotation and alignment file.

The resulting ribo file contains a single experiment whose name is provided in “–name”. If “–name” is not provided, the experiment name will be extracted from the ribo file path after stripping the file extension.

RiboPy works on transcriptomic coordinates. In other words, all coordinates are relative to transcripts where the first nucleotide of the transcript is always zero. Each entry in the alignment reference must come from a single transcript. Ribo-Seq data aligned against genomic coordinates is NOT usable with RiboPy.

A note on transcript <-> reference:
Sequencing reads are mapped against a reference to generate alignment files in bed or bam format. Each entry of this reference is coming from a transcript. Therefore, in this context, refrence and transcript correspond to the same entity. Thus, the names ‘reference’ and ‘transcript’ are used interchangebly. For example ‘reference names’ and ‘transcript names’ refer to the same list of names.
Reference Lengths File:

Reference lengths must be provided in the ‘lengths’ option. This must be a tab separated file where transcript names are in the first coulmn and the transcript lengths are in the seocond column.



Example:
TRANSCRIPT_1 1512
TRANSCRIPT_2 1387
Annotation Examples:

Below are some valid and invalid annotation file examples. ribopy will report an error and fail to generate a ribo file if an invalid annotation file is given.



VALID ANNOTATION:

(Assuming length of TRANSCRIPT_1 is 1252)

TRANSCRIPT_1 0 21 UTR5 0 +
TRANSCRIPT_1 21 1041 CDS 0 +
TRANSCRIPT_1 1041 1252 UTR3 0 +



INVALID ANNOTATION:

(Assuming length of TRANSCRIPT_2 is 1000. Nucleotide positions 32,33 and 34 are not annotated)

TRANSCRIPT_1 0 32 UTR5 0 +
TRANSCRIPT_1 35 920 CDS 0 +
TRANSCRIPT_1 920 1000 UTR3 0 +



INVALID ANNOTATION:

(Assuming length of TRANSCRIPT_3 is 1200) (There is no CDS region.)

TRANSCRIPT_3 0 50 UTR5 0 +
TRANSCRIPT_3 50 1200 UTR3 0 +



INVALID ANNOTATION:

(Assuming length of TRANSCRIPT_4 is 1000) (There is no UTR3 region.)

TRANSCRIPT_3 0 700 UTR5 0 +
TRANSCRIPT_3 100 1000 CDS 0 +
Region Counts & Left / Right Span:

For each transcript, the number of reads aligning to each region (UTR5, CDS, and UTR3) are stored. For this quantification, we exclude nucleotides in some proximity of start and stop sites.

This proximity is defined by taking ‘leftspan’ many nucleotides to the left of start / stop sites and ‘rightspan’ many nucleotides to the right of start / stop sites. The regions around start and stop sites are called UTR5_junction and UTR3_junction respectively.



UTR5_juction: Nucleotides ‘around’ start site.
This region is between UTR5 and CDS.



UTR3_juction: Nucleotides ‘around’ stop site.
This region is between CDS and UTR3.

Note that ‘around’ is precisely defined by ‘leftspan’ and ‘rightspan’ arguments.

Read Length Range:
Quantified Ribo-Seq data (metagene coverage, region counts, transcript ocoverage (if any)) is stored for each read length for a given range. This range is defined by –lengthmin and –lengthmax. Both values are inclusive. So, for example, for –lengthmin 19 –lengthmax 21, region counts are computed and stored for RNA fragments of length 19,20 and 21 separetely. For human Ribo-Seq data, a range from 15 to 35 can be sufficient for conventional experiments.
Metadata:

Metadata, either for the ribo file or for any experiment, is an optional argument. Metadata must be provided in yaml format.

 Users can provide metadata for

  1. ribo file
–ribometa
  1. experiment
–expmeta

Metadata is provided in pairs of the form ‘label: value’



Example Metadata:
cell-line: HEK
enzyme: RNASEI
Examples:

Below are some ribo file generation examples in different settings.

1) Some command line exaples to create ribo files. Create a file named WT.ribo in the current directory. Provide alignment data in a zipped bed file. Do NOT store coverage data.



ribopy create --name WT \
   --alignmentfile WT.bed.gz \
   --reference appris_human_v1 \
   --lengths appris_len.tsv \
   --annotation appris_regions.bed \
   --radius 50 \
   -l 35 -r 15 \
   --lengthmin 15 --lengthmax 35 \
   --nocoverage \
   WT.ribo

2) Create a file named Treatment_1.ribo in the current directory. Provide alignment data in a zipped bam file. Store coverage data.



ribopy create --name Treatment_1 \
    --alignmentfile Treatment_1.bam \
    --format bam \
    --reference appris_human_v1 \
    --lengths appris_len.tsv \
    --annotation appris_regions.bed \
    --radius 50 \
    -l 35 -r 15 \
    --lengthmin 15 --lengthmax 35 \
   Treatment_1.ribo

3) Create a file named WT_2.ribo in the current directory. Provide alignment data in a zipped bed file. Do NOT store coverage data. Read metadata of this experiment from WT_2_meta.yaml



ribopy create --name WT_2 \
    --alignmentfile WT_2.bed.gz \
    --reference appris_human_v1 \
    --lengths appris_len.tsv \
    --annotation appris_regions.bed \
    --radius 50 \
    -l 35 -r 15 \
    --lengthmin 15 --lengthmax 35 \
    --nocoverage \
    --expmeta WT_2_meta.yaml \
    WT_2.ribo
CLI Reference create [OPTIONS] RIBOFILE

Options

-a, --alignmentfile <alignmentfile>

Aligned Ribo-Seq Data File If no file is provided, it is read from standard input.Alignment file can be in bed or bam format.Bed files with ‘.gz’ or ‘.gzip’extension are assumed to be gzipped.So they will be automatiocally unzipped.If no aligment file is given,it is read from the standard input.Thus, the user can pipe the output of some processto ribopy to produce ribo files.

-f, --alignmentformat <alignmentformat>

Alignment File Format [default: bed]

Options:bed|bam
--name <name>

Experiment name If name is not provided, experiment name is extracted from the ribo file path. If –name parameter is omitted, experiment name is determined using the ribo file path. For example if the ribo file path is /home/user/data/WT.ribo, then, experiment name is set to ‘WT’. Experiment name can only contain alphanumeric characters, ‘_’, ‘-‘ and ‘.’.

--reference <reference>

Reference name It is good practice to give a unique name to a particular transcriptome and annotation pair and use it consistently.For convenience, the user has the freedomto choose a name for the transcript reference and the annotation, used to create a ribo file,If used consistently, one can tell whether two ribo files are coming from the same annotation or not.Note that ribo files having different reference names can not be merged. [required]

--lengthsfile, --lengths <lengthsfile>

A tab-separated file containing ref. name and ref. lengths [required]

--annotationfile, --annotation <annotationfile>

A bed file defining UTR5, CDS and UTR3 regions.Each transcript must be annotated. More explicitly, the coordinates of the regions UTR5, CDS and UTR3 must be provided for each transcript. This annotation is provided in a BED file. The annotation can not contain gaps. All nucleotides of a transcript must belong to a region. If a transcript does not have a CDS or UTR3 region, it must be excluded from the lengths file and annotation. Note that bed files are 0-based, start position is included and the end position is excluded. The order and transcript names in the lengths file must match the order in the annotation file. Annotation data is kept in ribo files. It is possible to extract the annotation from a ribo file using the dump command. [required]

--metageneradius, --radius <metageneradius>

Number of nucleotides on either side of start / stop sites for metagene analysis.

Aligning transcripts by their start / stop sitesand aggregating the coverage gives us metagene data.The number of nucleotides to be taken on either side ofstart / stop site is given by the metagene radius.

-l, --leftspan <leftspan>

Number of nucleotides to the left of start / stop sites for defining UTR5 / UTR3 junction regions. [default: 35]

-r, --rightspan <rightspan>

Number of nucleotides to the right of start / stop sites for defining UTR5 / UTR3 junction regions. [default: 15]

--lengthmin, --min <lengthmin>

Minimum read length to be counted [default: 15]

--lengthmax, --max <lengthmax>

Maximum read length to be counted [default: 35]

--ribometa <ribometa>

Metadata file, for the ribo file, in yaml format.

--expmeta <expmeta>

Metadata file, for the experiment, in yaml format.

--nocoverage

Do not store coverage. By default, coverage IS stored at nucleotide resolution, of each transcript, for each read length. Turning this flag on decreases size of ribo file at the cost of coverage data. From the alignment data, five prime end of the reads, mapping to each nucleotide position, of each transcript is computed for each read length. We call this coverage. Coverage is used for metagene analysis and region counts. By default, ribopy stores coverage data in the ribo file. Keeping coverage increases the ribo file size considerably. Users can choose NOT to keep coverage by setting this flag.

-n, --nprocess <nprocess>

Number of cores to be used. [default: 1]

Arguments

RIBOFILE

Required argument

dump

Dump selected parts of ribo files to particular formats

CLI Reference dump [OPTIONS] COMMAND [ARGS]...

annotation

Gives UTR5, CDS and UTR3 annotation in bed format.

Annotation is written to standard output if no output file is given.

Examples:
  1. Store the annotation in a bed file
ribopy dump annotation -o regions.bed sample.ribo
  1. Print annotation to standard output
ribopy dump annotation sample.ribo
CLI Reference dump annotation [OPTIONS] RIBO

Options

-o, --out <out>

Output file in bed format

Arguments

RIBO

Required argument

coverage

Prints the coverage of each transcript.

This command prints the coverage of each transcript, at nucleotide resolution, for a given read length range. For a single read length, set ‘lowerlength’ equal to ‘upperlength’. The values for the given range are aggregated by summation.

Append “.gz” to the output file name to have the output in zipped form.

File format:

options: bg, tsv

bg: Bedgraph

The columns of the bedgraph file are of the form transcript_name location_start location_end coverage_value Only the nonzero values are reported in the bedgraph file. Bedgraph is 0-based and location_start is inclusive and location_end is exclusive.

See https://genome.ucsc.edu/goldenPath/help/bedgraph.html for a detailed description of bedgraph file format.

tsv: Tab separated File

In this format, coverage values for all nucleotide positions, regardless of their value ( 0 or not ) are reported.

The columns of the file are of the form transcript_name comma_separated_coverage

Examples:
  1. Get coverage data, of WT, for read length 21 in zipped bedgraph format



ribopy --lowerlength 21 --upperlength 21 \
    -o coverage.bg.gz \
    --fromat bg sample.ribo WT
  1. Get coverage data, of treatment_1, for range 26 to 30 in tsv format



ribopy --lowerlength 26 --upperlength 30
      -o coverage.tsv
      --format tsv
      sample.ribo treatment_1
CLI Reference dump coverage [OPTIONS] RIBO EXPERIMENT

Options

-o, --out <out>

Output file in csv format

--lowerlength <lowerlength>

Minimum read length to take

--upperlength <upperlength>

Maximum read length to take

--format <format>

Output file format

Options:bg|tsv

Arguments

RIBO

Required argument

EXPERIMENT

Required argument

metagene

Dumps metagene data to a csv file or standard output.

Metagene data is obtained from coverage around start / stop site in a pre-defined radius.

Examples:

Below are some examples for different scenarios.

  1. Get metagene data around START site for the experiment WT. Report results for read lengths from 28 to 32. Agrregate data by summing accros read lengths. Also, values are summed across transcripts. Save the results in start.csv.

    

    ribopy dump metagene \
       --experiment WT \
       --site start \
       --out start.csv \
       --lowerlength 28 --upperlength 32 \
       --sumlengths \
       sample.ribo
    
  2. Get metagene data around STOP site for the experiment Treatment_1. Report results for read length 31. Values are summed across transcripts. Print the results on the standard output.

    

    ribopy dump metagene \
       --experiment WT \
       --site stop \
       --lowerlength 31 --upperlength 31 \
       sample.ribo
    
  3. Get metagene data around START site for the experiment WT. Report results for read lengths from 30 to 32. Report results for each read length. Also, values are summed across transcripts. Save the results in start.csv.

    

    ribopy dump metagene \
       --experiment WT \
       --site start \
       --out start.csv \
       --lowerlength 30 --upperlength 32 \
       sample.ribo
    
CLI Reference dump metagene [OPTIONS] RIBO

Options

-s, --site <site>

Site type. [required]

Options:start|stop
-o, --out <out>

Output file in bed format

-e, --experiment <experiment>

Name of the experiment

-l, --lowerlength <lowerlength>

Minimum read length to be taken

-u, --upperlength <upperlength>

Maximum read length to be taken

--sumlengths

Sum accross lengths

--nosumtrans

Do NOT aggregate values accross transcripts

Arguments

RIBO

Required argument

reference-lengths

Gives transcript names and their lengths.

Annotation is written to standard output if no output file is given.

The first column corresponds to transcript names and the second column corresponds to transcript lengths.

Examples:

Print the output to the terminal

ribopy dump reference-lengths sample.ribo

Save the output, in gzipped form, in lengths.csv.gz and use , to separate columns

ribopy dump reference-lengths -o lengths.csv.gz --sep "," sample.ribo
CLI Reference dump reference-lengths [OPTIONS] RIBO

Options

-o, --out <out>

Output file

-s, --sep <sep>

Column separator, default is t (tab)

Arguments

RIBO

Required argument

region-counts

Dumps a given region in csv format to a file or standard output.

Examples:

Below are some examples for different scenarios.

  1. Get number of reads mapping to the coding sequence (CDS) for the experiment WT. Report results for read lengths from 28 to 32. Agrregate data by summing accros read lengths. Also, values are summed across transcripts. Save the results in cds.csv.

    

    ribopy dump region-counts \
       --experiment WT \
       --region CDS \
       --out cds.csv \
       --lowerlength 28 --upperlength 32 \
       --sumlengths \
       --sumtrans \
       sample.ribo
    
  2. Get number of reads mapping to UTR3 for the experiment Treatment. Report results for read lengths from 30 to 32. Agrregate data by summing accros read lengths. CDS occupancy is reported for each transcript. Save the results in cds.csv.

    

    ribopy dump region-counts \
       --experiment Treatment \
       --region UTR3 \
       --out cds.csv \
       --lowerlength 30 --upperlength 32 \
       --sumlengths \
       sample.ribo
    
CLI Reference dump region-counts [OPTIONS] RIBO

Options

-r, --region <region>

Site type. [required]

Options:UTR5|UTR5_junction|CDS|UTR3|UTR3_junction
-o, --out <out>

Output file in csv format

-e, --experiment <experiment>

Name of the experiment

--lowerlength <lowerlength>

Minimum read length to be taken

--upperlength <upperlength>

Maximum read length to be taken

--sumlengths

Sum accross lengths

--sumtrans

Sum accross transcripts

Arguments

RIBO

Required argument

info

Displays a summary information about the given ribo file

CLI Reference info [OPTIONS] RIBOFILE

Arguments

RIBOFILE

Required argument

merge

Merges a set of given ribo files into one ribo file.

The input ribo files are merged into a new ribo file. The resulting ribo file has the union of the experiments of the input files

The ribo files to be merged must be compatible: They must have the same

  1. Reference Name
  2. Transcript Names & Transcript Lengths
  3. Annotation
  4. Ribo file parameters:
    1. Metagene Radius
    2. Left Span & Right Span
    3. Min & Max Read Length
  5. They can not have overlapping experiment names

This option is especially useful for Ribo-Seq Data processed together and needs to be analyzed together.

CLI Reference merge [OPTIONS] OUT_RIBO_PATH [IN_RIBO_PATHS]...

Arguments

OUT_RIBO_PATH

Required argument

IN_RIBO_PATHS

Optional argument(s)

metadata

Display, set or delete user-defined metadata

If no name is given, the metadata of the ribo file is set, displayed or deleted.

CLI Reference metadata [OPTIONS] COMMAND [ARGS]...

delete

Deletes user-defined metadata of the ribo file or experiment

If no name is provided, metadata of the ribo file is deleted.

If name is given, metadata of the corresponding experiment is deleted.

CLI Reference metadata delete [OPTIONS] RIBO

Options

--name <name>

experiment name

--force

Set metadata without prompting user.

Arguments

RIBO

Required argument

get

Displays user-defined metadata of the ribo file or experiment

If no name is provided, metadata of the ribo file is displayed.

If name is given, metadata of the corresponding experiment is displayed.

CLI Reference metadata get [OPTIONS] RIBO

Options

--name <name>

experiment name

Arguments

RIBO

Required argument

set

Stores the metadata in the meta file.

If no name is given, the metadata of the ribo file is set.

If name is provided, the metadata of the corresponding experiment is set.

The metadata must be in yaml format.

Example Ribo Metadata File Contents:
pipeline_name: RiboFlow
pipeline_version: v1.0.2
project: elongation blocker
Example Library Metadata File Contents:
Cell_Line: Human ESC
Treatment: Drug A
Enzyme: RNASEI
CLI Reference metadata set [OPTIONS] RIBO

Options

--name <name>

experiment name

--meta <meta>

Metadata File [required]

--force

Set metadata without prompting user.

Arguments

RIBO

Required argument

plot

Generate some basic plots for ribo files.

CLI Reference plot [OPTIONS] COMMAND [ARGS]...

lengthdist

Plots the distribution of the ribosome footprint lengths.

The x-axis is the length of the protected ribosome footprints. The y-axis is the raw or normalized frequecies.

At most 7 experiments can be provided for a single plot.

Pdf and png output formats are supported. If “dump” option is provided, the data is written to the provided file path.

If the frequencies are normalized using the “–normalize” option, the y-axis becomes the percentages of the frequencies.

Examples:

1) Plot CDS length distribution of exp_1 and exp_2 and normalize the frequencies.

ribopy plot lengthdist  \
   -o multiple_dist.pdf \
   -r CDS --normalize \
   project.ribo exp_1 exp_2

2)Plot only main_exp and write the data to out.csv.

ribopy plot lengthdist \
    -d out.csv \
    -o main_exp.pdf \
    -r CDS \
    project.ribo  main_exp
CLI Reference plot lengthdist [OPTIONS] RIBO [EXPERIMENTS]...

Options

-r, --region <region>

Region type. [required]

Options:UTR5|UTR5_junction|CDS|UTR3_junction|UTR3
-o, --out <out>

Output file in bed format [required]

-t, --title <title>

Plot title.

--normalize

Normalize by total metagene site coverage

-d, --dump <dump>

Dump the data to csv file

Arguments

RIBO

Required argument

EXPERIMENTS

Optional argument(s)

metagene

Generates metagene plots.

The x-axis is the relative nucleotide positions and the start /stop site is at the center (origin at 0). The y-axis is the raw or normalized coverage.

At most 7 experiments can be provided for a single plot.

For a given start or stop site, the coverage around the site of interest is plotted.

The supported output extensions are pdf and png.

If a length range is not provided, the minimum and the maximum ranges from the ribo file are read and used as the range. The values in the length range are aggregated to generate the metagene plot.

Examples:
  1. Get coverage data, of WT, for read length 21 in zipped bedgraph format
ribopy plot metagene -s start -o hela_1.pdf project.ribo HeLa_1
  1. Two experiments in one plot, stop site, with normalization
ribopy plot metagene -s start --normalize -o hela_1_2.pdf project.ribo HeLa_1 HeLa_2

3) Plot start site usiong footprints of length 20,21,22 Dump the data to out.csv file.

ribopy plot metagene -s start \
  --lowerlength 20 upperlength 22 \
  -o hela_1.pdf \
  -d out.csv \
  project.ribo HeLa_1
CLI Reference plot metagene [OPTIONS] RIBO [EXPERIMENTS]...

Options

-s, --site <site>

Site type. [required]

Options:start|stop
-o, --out <out>

Output file. [required]

--lowerlength <lowerlength>

Lower read length

--upperlength <upperlength>

Upper read length

-t, --title <title>

Plot title.

--normalize

Normalize by total metagene site coverage

-d, --dump <dump>

Dump the data to csv file

Arguments

RIBO

Required argument

EXPERIMENTS

Optional argument(s)

regioncounts

Generates barplots of the percentages of the UTR5, CDS and UTR3 counts.

The raw counts are saved to a csv file if “–dump” option is provided.

If a length range is not provided, then the range is determined using the mininum and the maximum read lengths in the ribo file.

Examples:

1) Plot the region counts for the experiment names sample for lengths from 29 to 31

ribopy plot regioncounts --lowerlength 29 --upperlength 31 \
    --out sample.region_counts.pdf test.ribo sample

2) Plot region counts for the two experiments WT and DrugA for all lengths combined

ribopy plot regioncounts --out wt_anddrug.pdf other.ribo WT DrugA
CLI Reference plot regioncounts [OPTIONS] RIBO [EXPERIMENTS]...

Options

-o, --out <out>

Output file [required]

--lowerlength <lowerlength>

Lower read length

--upperlength <upperlength>

Upper read length

-t, --title <title>

Plot title.

--horizontal

Draw bars horizontally.

-d, --dump <dump>

Dump the data to csv file

Arguments

RIBO

Required argument

EXPERIMENTS

Optional argument(s)

rnaseq

Display, set or delete RNA-Seq data

CLI Reference rnaseq [OPTIONS] COMMAND [ARGS]...

delete

Delete RNA-Seq data of a particular experiment

Example
——-

ribopy rnaseq delete –name WT test.ribo

CLI Reference rnaseq delete [OPTIONS] RIBO

Options

--name <name>

experiment name [required]

--force

Delete RNA-Seq without prompting user.

Arguments

RIBO

Required argument

get

Get transcript expression data of a given experiment

If no output parameter is provided, the results are printed to standard output.

Transcript expression is reported in two columns where the first column corresponds to transcript names and the second column corresponds to transcript expression.

Examples
——–

Save the transcript expression in a tab separated file in gzipped form.

  1. ribopy rnaseq get –name WT –out transcript_exp.tsv.gz test.ribo

Print the transcript expression on the screen

  1. ribopy rnaseq get –name WT test.ribo
CLI Reference rnaseq get [OPTIONS] RIBO

Options

--name <name>

experiment name

--out <out>

Output File

--sep <sep>

Column Separator: Default is tab

Arguments

RIBO

Required argument

set

Store the transcript expression data of an experiment

?? MISSING DOCUMENTATION ??

CLI Reference rnaseq set [OPTIONS] RIBO

Options

-n, --name <name>

experiment name [required]

-a, --alignment <alignment>

RNASeq alignments in bed or bam format.

-c, --counts <counts>

Transcript Expression File

-f, --format <format>

RNA-Seq alignment format

Options:bed|bam
--sep <sep>

Column Separator for counts file [default: ]

--force

Set RNA-Seq without prompting user.

Arguments

RIBO

Required argument