Python API

A walk-through of the Python API

Quick reference

Ribo Attributes

Essential Ribo class attributes

ribopy.Ribo(ribo_file[, alias, file_mode]) Ribo is an interface to ribo files.
ribopy.Ribo.experiments List of experiments in the ribo object
ribopy.Ribo.minimum_length Reads used to generate the ribo file are at least minimum_length nucleotides long
ribopy.Ribo.maximum_length Reads used to generate the ribo file are at most maximum_length nucleotides long
ribopy.Ribo.metagene_radius Number of nucleotides to the left and right of start / stop sites in metagene coverage.
ribopy.Ribo.left_span Number of nucleotides to the left of start / stop sites to define UTR5 & UTR3 junction regions
ribopy.Ribo.right_span Number of nucleotides to the right of start / stop sites to define UTR5 & UTR3 junction regions
ribopy.Ribo.format_version Ribo file format version

Getter Functions

Methods for reading ribosome profiling data.

ribopy.Ribo.get_metagene(site_type[, …]) Returns metagene data at start / stop site
ribopy.Ribo.get_region_counts(region_name[, …]) Returns number of reads mapping to UTRs or CDS.
ribopy.Ribo.get_coverage(experiment[, …]) Returns coverage at nucleotide resolution.
ribopy.Ribo.get_rnaseq([experiments]) Returns region counts coming from RNA-Seq data.

Plot Functions

Some essential plots for ribosome profiling analysis.

ribopy.Ribo.plot_metagene(site_type[, …]) Generates coverage plots around start / stop sites.
ribopy.Ribo.plot_lengthdist(region_type, …) Generates distribution of the reads according to length
ribopy.Ribo.plot_region_counts(experiments) Generates bar plots of region counts

Ribo

class ribopy.Ribo(ribo_file, alias=None, file_mode='r')

Ribo is an interface to ribo files.

It provides access to ribo file attributes, metadata and ribosome profiling data in a ribo file.

Parameters:
  • ribo_file (str, BytesIO) – Path to a ribo file or a handle to a ribo file
  • file_mode (str , [choices: "r", "r+"]) – h5py file mode. The ribo file must already exist, so only read ("r") and read & write ("r+") modes are allowed. Be extremely careful with the "r+" option. For most use cases, the read-only option "r" is sufficient. Deleting or modifying existing attributes or data tables is strongly discouraged, as this can corrupt the file.
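As a minimal sketch of typical read-only use, the helper below gathers the essential attributes from the quick reference into one dictionary. The helper name `summarize_ribo` and the file name in the docstring are hypothetical; only the attribute names are taken from this reference.

```python
def summarize_ribo(ribo):
    """Collect a few essential attributes of an open Ribo object.

    `ribo` is expected to be a ribopy.Ribo instance, for example:
        from ribopy import Ribo
        ribo = Ribo("sample.ribo", file_mode="r")  # hypothetical path
    """
    return {
        "experiments": list(ribo.experiments),
        # Shortest and longest read lengths stored in the file
        "read_length_range": (ribo.minimum_length, ribo.maximum_length),
        "metagene_radius": ribo.metagene_radius,
        # Spans that define the UTR5 / UTR3 junction regions
        "junction_spans": (ribo.left_span, ribo.right_span),
        "format_version": ribo.format_version,
    }
```

Opening the file with "r" keeps the sketch safe to run against real data, since nothing in it needs write access.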
experiments

List of experiments in the ribo object

format_version

Ribo file format version

get_coverage(experiment, range_lower=0, range_upper=0, alias=False)

Returns coverage at nucleotide resolution.

Note that coverage data is an optional entity of a ribo file.

get_length_dist(region_name, experiments=[])

Returns the number of reads for each length for a given region

Parameters:region_name (str, list) –
get_metadata(experiment=None)

Returns user defined metadata in dictionary form

Parameters:experiment (str (default = None)) – If None, the metadata of the ribo file itself is returned. Otherwise, returns the metadata of the given experiment.
Returns:result
Return type:dict
get_metagene(site_type, experiments=[], sum_lengths=True, sum_references=True, range_lower=0, range_upper=0, alias=False)

Returns metagene data at start / stop site

Metagene data is reported for a range of read lengths. Every read length in this range is included in the returned data frame. This range is provided as an interval by specifying range_lower and range_upper, both of which are included. If range is not specified, the minimum and maximum read lengths from the ribo file are taken as the range definition.

Parameters:
  • site_type (str [choices: 'start' , 'stop']) – Determines the site, around which metagene coverage is going to be reported.
  • sum_lengths (bool [default: True]) – If True, metagene data is summed up across the given length range. If False, metagene data is reported for each length in the given range.
  • sum_references (bool [default: True]) – If True, metagene data is summed across references (transcripts). If False, metagene data is reported for each transcript.
  • experiments (list, str) – List of experiment(s). If empty, all experiments are included.
Returns:

metagene_df – This data frame contains coverage around the start or stop site. The column labels are the positions relative to the start or stop site. The index of the data frame depends on two parameters: sum_lengths and sum_references. Any level whose parameter is set to True is absent from the index. For example, if sum_lengths is True, metagene_df has no length index because the values are summed across lengths.

Return type:

pandas.DataFrame
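The read-length range and the effect of sum_lengths=True can be illustrated in plain Python. This is only a sketch of the semantics described above, not ribopy's implementation. Both function names are hypothetical, and treating 0 as "not specified" is an assumption based on the default values in the signature.

```python
def resolve_length_range(range_lower, range_upper, minimum_length, maximum_length):
    """Return the inclusive (lower, upper) read-length range.

    Assumption: a value of 0 (the default in the signature) means
    "not specified", in which case the bound from the ribo file is used.
    """
    lower = range_lower if range_lower else minimum_length
    upper = range_upper if range_upper else maximum_length
    if lower > upper:
        raise ValueError("range_lower must not exceed range_upper")
    return lower, upper


def sum_across_lengths(coverage_by_length, lower, upper):
    """Sum per-position coverage over all read lengths in [lower, upper]."""
    selected = [cov for length, cov in coverage_by_length.items()
                if lower <= length <= upper]
    if not selected:
        return []
    # zip(*selected) walks the relative positions in parallel
    return [sum(position_values) for position_values in zip(*selected)]
```

Here `coverage_by_length` maps each read length to a list of counts, one per relative position. Both ends of the range are included, matching the description above.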

get_region_counts(region_name, sum_lengths=True, sum_references=True, range_lower=0, range_upper=0, experiments=[], alias=False)

Returns number of reads mapping to UTRs or CDS.

Region counts are reported for a range of read lengths. Every read length in this range is included in the returned data frame. This range is provided as an interval by specifying range_lower and range_upper, both of which are included. If range is not specified, the minimum and maximum read lengths from the ribo file are taken as the range definition.

Parameters:
  • region_name (str [choices: "UTR5", "UTR5_junction", "CDS", "UTR3_junction", "UTR3"]) – For the definition of these regions, check the documentation at https://ribopy.readthedocs.io. The region definitions come from the annotation in the ribo file.
  • sum_lengths (bool [default: True]) – If True, sum region counts across read lengths. If False, report region counts for each length.
  • sum_references (bool [default: True]) – If True, sum region counts across references (transcripts). If False, report region counts for each reference.
  • range_lower (int) – Minimum read length to be included in the result.
  • range_upper (int) – Maximum read length to be included in the result.
  • experiments (list, str [default: []]) – List of experiment(s). If empty, all experiments are included.
Returns:

region_counts – The index of the dataframe depends on the parameters sum_lengths and sum_references. More precisely, if sum_lengths is False, there will be an index for read length. If True, there won’t be such an index as data is aggregated by read lengths. Similarly, if sum_references is False, there will be an index for reference (transcript) names. If True, there won’t be such an index. The columns correspond to experiments.

Return type:

pd.DataFrame
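The way sum_lengths and sum_references shape the index can be sketched with a plain dictionary for a single experiment. This is an illustration of the rule described above, not the library's code. The function name `aggregate_counts` and the input layout are hypothetical.

```python
from collections import defaultdict


def aggregate_counts(counts, sum_lengths=True, sum_references=True):
    """Aggregate {(transcript, read_length): count} for one experiment.

    The key of the result keeps only the levels that were NOT summed:
    (), (read_length,), (transcript,) or (transcript, read_length).
    """
    result = defaultdict(int)
    for (transcript, length), count in counts.items():
        key = ()
        if not sum_references:
            key += (transcript,)  # keep a level for the reference name
        if not sum_lengths:
            key += (length,)      # keep a level for the read length
        result[key] += count
    return dict(result)
```

With both flags left at True the result collapses to a single total. Setting either flag to False brings the corresponding level back into the key, which mirrors how the data frame index gains a level.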

get_rnaseq(experiments=None)

Returns region counts coming from RNA-Seq data.

Note that RNA-Seq data is an optional entity of Ribo File.

In contrast to region counts for ribosome profiling data, the resulting data frame does not have separate entries for read lengths. The columns of the data frame correspond to regions. The index of the data frame has two levels: experiment and transcript.

Parameters:
Parameters:experiments (list, str) – List of experiment(s) whose RNA-Seq data is to be reported
Returns:rnaseq_df – Data frame containing RNA-Seq counts for each region.
Return type:pd.DataFrame
has_coverage(experiment)

Checks if the given experiment has coverage data.

Parameters:experiment (str) – Name of the experiment whose coverage data is inquired.
Returns:result
Return type:bool
has_metadata(experiment=None)

Checks if Ribo has metadata

Parameters:experiment (str (default = None)) – If None, the metadata of the ribo file itself is inquired. Otherwise, checks if the given experiment has metadata
Returns:result
Return type:bool
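Since metadata is user defined and may be absent, one cautious pattern is to check has_metadata before calling get_metadata. The wrapper name `metadata_or_empty` is hypothetical, and falling back to an empty dictionary is a design choice of this sketch rather than library behavior.

```python
def metadata_or_empty(ribo, experiment=None):
    """Return user-defined metadata as a dict, or {} when none is stored.

    With experiment=None the metadata of the ribo file itself is used;
    otherwise the metadata of the named experiment.
    """
    if ribo.has_metadata(experiment):
        return ribo.get_metadata(experiment)
    return {}
```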
has_rnaseq(experiment)

Checks if the given experiment has RNA-Seq data.

Parameters:experiment (str) – Name of the experiment
Returns:rnaseq_exists
Return type:bool
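Coverage and RNA-Seq data are optional, so it can help to survey what each experiment carries before calling get_coverage or get_rnaseq. The following is a small sketch. The helper name `available_data` is hypothetical, while the methods it calls are the ones documented above.

```python
def available_data(ribo):
    """Map each experiment to the optional data sets it carries."""
    return {
        experiment: {
            "coverage": ribo.has_coverage(experiment),
            "rnaseq": ribo.has_rnaseq(experiment),
        }
        for experiment in ribo.experiments
    }
```

The result can then be used to pick only the experiments for which a later getter call is meaningful.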
info

All Ribo attributes packed in a dictionary.

left_span

Number of nucleotides to the left of start / stop sites to define UTR5 & UTR3 junction regions

maximum_length

Reads used to generate the ribo file are at most maximum_length nucleotides long

metagene_radius

Number of nucleotides to the left and right of start / stop sites in metagene coverage.

minimum_length

Reads used to generate the ribo file are at least minimum_length nucleotides long

plot_lengthdist(region_type, experiments, title='', normalize=False, output_file='', colors=['blue', 'red', 'green', 'brown', 'orange', 'violet', 'black'])

Generates distribution of the reads according to length

The x-axis is the read length and the y-axis is the number of reads mapping to the particular region.

Parameters:
  • region_type (str [choices: "UTR5", "UTR5_junction", "CDS", "UTR3_junction", "UTR3"]) – For the definition of these regions, check the documentation at https://ribopy.readthedocs.io. The region definitions come from the annotation in the ribo file.
  • experiments (list, str) – List of experiments to be plotted
  • title (str) – Title of the plot
  • normalize (bool (default = False)) – Normalize each experiment by total number of reads.
  • output_file (str (default = "")) – If provided, the output will be saved in this path.
  • colors ((default: ["blue", "red", "green", "brown", "orange", "violet", "black"])) – Colors of lines.
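What normalize=True amounts to can be sketched in a few lines. This is an illustration under an assumption, not the library's code: the total is taken as the sum of the counts passed in, which may differ from the total that ribopy uses. The function name is hypothetical.

```python
def normalize_length_dist(length_counts):
    """Scale {read_length: count} so that the values sum to 1.

    Assumption: "total number of reads" is the sum of the given counts.
    """
    total = sum(length_counts.values())
    if total == 0:
        # Avoid division by zero for an experiment without reads
        return {length: 0.0 for length in length_counts}
    return {length: count / total for length, count in length_counts.items()}
```

Normalizing this way puts experiments with different sequencing depths on a comparable scale when they share one plot.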
plot_metagene(site_type, title='', experiments=[], range_lower=0, range_upper=0, normalize=False, output_file='', colors=['blue', 'red', 'green', 'brown', 'orange', 'violet', 'black'])

Generates coverage plots around start / stop sites.

metagene_radius nucleotides are taken on either side of the site. The x-axis shows the nucleotide positions relative to site_type and the y-axis shows their frequency.

Setting normalize = True might be helpful when plotting more than one experiment. Values are normalized by the total number of mapped reads in the experiment.

Parameters:
  • site_type (str [choices: "start", "stop"]) – Coverage plot is centered around the site type.
  • title (str, default = "") – Title of the plot
  • experiments (list, str) – List of experiments to be plotted
  • range_lower (int) – Minimum read length to be included in the metagene coverage data.
  • range_upper (int) – Maximum read length to be included in the metagene coverage data.
  • normalize (bool (default = False)) – Normalize metagene data by number of mapped reads in the experiment
  • output_file (str, (default = "")) – If non-empty, the plot will be saved in the provided path.
  • colors (list, (default: ["blue", "red", "green", "brown", "orange", "violet", "black"])) – Colors of lines.
plot_region_counts(experiments, title='', range_lower=0, range_upper=0, horizontal=True, output_file='')

Generates bar plots of region counts

The bar plot is generated from the percentages of the counts of the regions UTR5, CDS and UTR3.

Parameters:
  • experiments (list, str) – List of experiments to be plotted
  • title (str) – Title of the plot
  • range_lower (int) – Minimum read length to be included in the bar plot data.
  • range_upper (int) – Maximum read length to be included in the bar plot data.
  • horizontal (bool (default = True)) – Generates bar plots horizontally. Especially for long experiment names, this is the preferred method.
  • output_file (str (default = "")) – If provided, the output will be saved in this path.
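The percentages behind the bars can be illustrated with a short calculation. This is a sketch of the arithmetic only. The function name and the plain-dictionary input are hypothetical, and the counts in the usage note are made-up illustration values.

```python
def region_percentages(region_counts):
    """Convert {"UTR5": n, "CDS": n, "UTR3": n} into percentages."""
    regions = ("UTR5", "CDS", "UTR3")
    total = sum(region_counts[region] for region in regions)
    if total == 0:
        # No reads in any region: report zeros instead of dividing by zero
        return {region: 0.0 for region in regions}
    return {region: 100.0 * region_counts[region] / total for region in regions}
```

For example, counts of 10, 80 and 10 for UTR5, CDS and UTR3 give 10%, 80% and 10%.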
print_info(return_str=False)

Prints Ribo file information in string format.

Parameters:return_str (bool) – If True, returns info_str and does not print it. If False, prints info_str and returns None.
Returns:info_str – Ribo File Summary String
Return type:str

Notes

If this information is needed in a structured form, use the “info” attribute of the ribo object.

reference_name

Name of the reference (transcript assembly & annotation)

ribopy_version

Version of ribopy used to create the ribo file

right_span

Number of nucleotides to the right of start / stop sites to define UTR5 & UTR3 junction regions

transcript_index

Transcript indices in dictionary form.

transcript_name -> transcript_index

The transcript index comes from the order of the transcripts in the ribo file.

transcript_lengths

Transcript lengths in dictionary form.

transcript_name -> transcript_length

transcript_names

Transcript names in the ribo reference

transcript_offsets

Transcript offsets

This gives us the initial position of the transcript in determining coverage. If the transcripts are linearly lined up according to the order in the ribo file, then this offset is the position of the first nucleotide of the transcript.
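The relationship between transcript order, lengths and offsets can be sketched as a cumulative sum. This illustrates the description above and is not ribopy's implementation. The function name is hypothetical, and 0-based offsets are an assumption.

```python
from itertools import accumulate


def compute_offsets(names, lengths):
    """Return {transcript_name: offset} for transcripts lined up in order.

    `names` gives the transcript order and `lengths` maps each name to
    its length. Assumption: offsets are 0-based, so the first transcript
    starts at position 0.
    """
    names = list(names)
    # Running totals of the lengths give the start of each next transcript
    starts = [0] + list(accumulate(lengths[name] for name in names))
    # zip stops at the shorter sequence, dropping the final grand total
    return dict(zip(names, starts))
```

Given transcripts A, B and C with lengths 100, 250 and 50 in that order, the sketch places them at offsets 0, 100 and 350.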