Biobix

What is PROTEOFORMER?

PROTEOFORMER is a proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide-to-MS/MS matching. It can be combined with canonical protein databases or used independently for the identification of novel translation products. The pipeline makes use of the recently developed next-generation sequencing strategy termed ribosome profiling (RIBO-seq), which provides genome-wide information on protein synthesis in vivo. RIBO-seq is based on the deep sequencing of ribosome-protected mRNA fragments. Because RIBO-seq maps the location of translating ribosomes on mRNA with sub-codon precision, it can indicate which portion of the genome is actually being translated at the time of the experiment, and it can account for sequence variations such as single nucleotide polymorphisms and RNA splicing.
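To make the idea of sub-codon precision concrete, the following minimal sketch counts how ribosome footprints distribute over the three reading frames of a coding sequence. It is an illustration only, not PROTEOFORMER code: the function name `frame_counts`, the toy coordinates, and the fixed 12-nucleotide offset from the read 5' end to the ribosomal P-site are assumptions (a fixed offset is a common convention for footprints of roughly 28 to 30 nt, but real data usually require an offset calibrated per read length).

```python
from collections import Counter

# Assumption: fixed 12-nt offset from the read 5' end to the P-site.
# Real data typically need an offset calibrated per read length.
P_SITE_OFFSET = 12


def frame_counts(read_starts, cds_start, offset=P_SITE_OFFSET):
    """Count P-site positions per reading frame (0, 1, 2) of a CDS.

    read_starts: 0-based transcript coordinates of RPF 5' ends.
    cds_start:   0-based transcript coordinate of the first base of the start codon.
    """
    counts = Counter({0: 0, 1: 0, 2: 0})
    for start in read_starts:
        p_site = start + offset
        if p_site >= cds_start:  # skip P-sites upstream of the CDS
            counts[(p_site - cds_start) % 3] += 1
    return dict(counts)


# Hypothetical 5' ends on a transcript whose CDS starts at position 100.
toy_starts = [88, 91, 94, 97, 100, 101, 103, 106, 110]
print(frame_counts(toy_starts, cds_start=100))  # {0: 7, 1: 2, 2: 0}
```

A strong enrichment in one frame, as in this toy data, is the kind of triplet periodicity that indicates active translation in that frame.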

**Fig. 1. Overview of the PROTEOFORMER 2.0 pipeline.**

PROTEOFORMER takes as input two FASTQ files (NGS read files representing ribosome-protected fragments (RPFs)), one for elongating and one for initiating ribosomes, and outputs a FASTA protein sequence database of derived translation products based on Ensembl transcript annotations. Furthermore, specific metrics (e.g. FPKM, coverage, gene RPF abundance) are derived to enable verification of the RIBO-seq data quality. The alignment and RPF density information is also written out to allow easy upload and visual evaluation in a genome browser environment. The pipeline consists of the following major parts:

Raw Input Quality Control: This preliminary step determines the general quality of the raw sequencing data using FastQC.
Mapping: This step makes use of transcriptome mappers (STAR or TopHat2) to align RPFs from the input files against a reference genome based on the corresponding Ensembl annotation bundle. Before mapping, reads are preprocessed (adaptor clipping, quality filtering, pre-mapping against rRNA, tRNA, …).
Post-mapping quality control: This step analyses the quality of the preprocessed, aligned reads using FastQC and mQC. It also gives the user a general overview of the data.
Transcript calling: This step makes use of the ribosome profiles of the elongating ribosomes to mark transcripts with experimental evidence of translation. Extra annotation is also added in this step (CCDS ID, canonical transcript).
TIS calling: This step identifies translation initiation sites (TISs). It implements a rule-based algorithm that combines RIBO-seq information from two related translation inhibitors, an elongation inhibitor (e.g. cycloheximide) and an initiation inhibitor (e.g. lactimidomycin), to distinguish initiation sites from positions of elongating ribosomes.
Variation calling: This part of the pipeline uses SAMtools and/or dbSNP to identify variants in the mapped reads.
Translation assembly: This step assembles all translation products based on the TIS, transcript isoform, and/or SNP information derived from the RIBO-seq data.
Translation database: Finally, a (non-)redundant FASTA-formatted database of derived translation products is generated, from which all duplicate sequences and subsequences can be removed.
FLOSS calculation: This step first calculates the reference fractions and cutoff values based on known protein-coding transcripts. With these, the FLOSS scores are calculated and classified for each possible translation product.
Feature summarization: The user can generate lists of RPF counts per feature (gene, transcript, exon, ORF, promoter).
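The FLOSS calculation step listed above can be sketched as follows. This is a minimal illustration that assumes the commonly used definition of the fragment length organization similarity score (half the summed absolute difference between the read-length fractions of a candidate translation product and those of a reference set of known protein-coding transcripts). The function names, the 26 to 34 nt length window, and the toy read lengths are assumptions for illustration and are not taken from the PROTEOFORMER source; the cutoff and classification step is not shown.

```python
from collections import Counter


def length_fractions(read_lengths, min_len=26, max_len=34):
    """Fraction of reads at each length within [min_len, max_len].

    The 26-34 nt window is an assumption made for this sketch.
    """
    counts = Counter(n for n in read_lengths if min_len <= n <= max_len)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in the requested length range")
    return {n: counts.get(n, 0) / total for n in range(min_len, max_len + 1)}


def floss(candidate_fractions, reference_fractions):
    """Half the summed absolute difference of two length distributions.

    0 means identical distributions, 1 means completely disjoint ones.
    """
    lengths = set(candidate_fractions) | set(reference_fractions)
    return 0.5 * sum(
        abs(candidate_fractions.get(n, 0.0) - reference_fractions.get(n, 0.0))
        for n in lengths
    )


# Toy data: read lengths pooled over known protein-coding transcripts
# (reference) and read lengths observed on one candidate translation product.
reference_lengths = [28] * 2 + [29] * 6 + [30] * 2
candidate_lengths = [29] * 3 + [30]

reference = length_fractions(reference_lengths)
candidate = length_fractions(candidate_lengths)
print(round(floss(candidate, reference), 3))  # 0.2
```

A low score means the footprints on the candidate have a length distribution similar to that of known coding sequences; classification against a cutoff would follow as a separate step.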

Datasets without initiation inhibitor data lanes cannot be analysed with TIS calling, variation calling and translation assembly. Instead, tools have been developed that use PRICE or SPECtre to determine the translated open reading frames.
See our manuscript for a full explanation, and please cite our work if you plan to use the pipeline (http://dx.doi.org/10.1093/nar/gku1283). The full user manual can be found on GitHub.

PROTEOFORMER was developed in Perl 5 and is freely available for download in a script version and as a Galaxy implementation.

Galaxy Version

All tools can be downloaded from GitHub. The tool documentation, reference information download instructions and installation guidelines are in the README. Pre-configured ublast- and blastp-formatted databases can be downloaded from the data download links below.

The dbSNP databases can be downloaded from the NCBI website and placed in the tool-data as described in the README.

Script Based Version

A script-based version of the pipeline can be found on GitHub.

The requirements and documentation are available in the README.

Virtual Machine

A customized virtual machine (Ubuntu 12.04 LTS) with all script dependencies and a Galaxy server already installed can be downloaded via the PROTEOFORMER VM link. The Galaxy server instance is installed in the folder /opt/galaxy-server.

PROTEOFORMER was developed and tested on Mac and Ubuntu 12.04 LTS.

Publications

Crappé, J., Ndah, E., Koch, A., Steyaert, S., Gawron, D., De Keulenaer, S., De Meester, E., De Meyer, T., Van Criekinge, W., Van Damme, P., and Menschaert, G. (2014) PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res. 43, e29

Verbruggen, S., Ndah, E., Van Criekinge, W., Gessulat, S., Kuster, B., Wilhelm, M., Van Damme, P., and Menschaert, G. (2019) Mol. Cell. Proteomics, https://doi.org/10.1074/mcp.RA118.001218

Downloads

Available from GitHub.
Virtual machine.

Data Dependencies

dbSNP database.
Species-specific ublast- and blastp-formatted SWISSPROT databases [May 2014] can be downloaded below (see Blast links).
Species-specific SQLite databases holding the necessary Ensembl information can also be downloaded below (see Ensembl_SQLiteDB links), or can be generated using this Python script.
Species-specific reference sequence and annotation information can be generated using this Python script.

Homo sapiens (Ensembl_SQLiteDB, Blast).
Mus musculus (Ensembl_SQLiteDB, Blast).
Drosophila melanogaster (Ensembl_SQLiteDB, Blast).
Arabidopsis thaliana (Ensembl_SQLiteDB, Blast).

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
You may obtain a copy of the License at http://www.gnu.org/licenses/.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Lab of bioinformatics and computational genomics
