PROTEOFORMER is a proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide-to-MS/MS matching. It can be combined with canonical protein databases or used independently to identify novel translation products. The pipeline makes use of ribosome profiling (RIBO-seq), a recently developed next-generation sequencing strategy that provides genome-wide information on protein synthesis in vivo. RIBO-seq is based on the deep sequencing of ribosome-protected mRNA fragments and maps the location of translating ribosomes on mRNA with sub-codon precision. It can indicate which portion of the genome is actually being translated at the time of the experiment, and it can account for sequence variation such as single nucleotide polymorphisms and RNA splicing.
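To illustrate what sub-codon precision means in practice, the sketch below assigns each ribosome-protected fragment to a reading frame relative to a coding sequence start. It is not taken from the PROTEOFORMER code base: the function name, the coordinate convention and the fixed P-site offset of 12 nucleotides are assumptions made for illustration only, and real analyses typically estimate the offset per fragment length.

```python
# Illustrative sketch only; not part of PROTEOFORMER.
# Assumption: a fixed P-site offset of 12 nt from the 5' end of each footprint.
P_SITE_OFFSET = 12


def frame_counts(read_starts, cds_start, offset=P_SITE_OFFSET):
    """Count footprints per reading frame (0, 1, 2) relative to the CDS start.

    read_starts: 0-based transcript coordinates of the footprint 5' ends.
    cds_start:   0-based transcript coordinate of the first CDS nucleotide.
    """
    counts = [0, 0, 0]
    for start in read_starts:
        p_site = start + offset              # inferred P-site position
        counts[(p_site - cds_start) % 3] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical 5' end positions on a transcript whose CDS starts at 100.
    print(frame_counts([88, 91, 94, 89], cds_start=100))
```

A strong excess of footprints in one frame is the triplet periodicity expected from actively translating ribosomes.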
PROTEOFORMER takes as input two FASTQ files (NGS read files representing ribosome-protected fragments (RPFs)), one for the elongating and one for the initiating ribosomes, and outputs a FASTA protein sequence database of derived translation products based on Ensembl transcript annotations. Furthermore, specific metrics (e.g. FPKM, coverage, gene RPF abundance) are derived to enable verification of the RIBO-seq data quality. The alignment and RPF density information is also written out so that it can easily be uploaded to and visually evaluated in a genome browser environment. The pipeline consists of eight major parts:
Datasets without initiation inhibitor data lanes cannot be analysed with TIS calling, variation calling and translation assembly. For such datasets, tools are provided that use PRICE or SPECtre to determine the translated open reading frames. See our manuscript for a full explanation, and please cite our work if you plan on using the pipeline (http://dx.doi.org/10.1093/nar/gku1283). The full user manual can be found on GitHub.
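As a reminder of what the FPKM metric listed among the quality metrics measures, the sketch below applies the standard definition: fragments per kilobase of feature per million mapped fragments. It shows the general formula only and is not PROTEOFORMER's own implementation; the function and parameter names are hypothetical.

```python
# Illustrative sketch only; not part of PROTEOFORMER.
def fpkm(read_count, feature_length_bp, total_mapped_reads):
    """Fragments per kilobase of feature per million mapped fragments.

    FPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical numbers: 500 RPFs on a 2,000 bp feature, 10 million mapped RPFs.
    print(fpkm(500, 2000, 10_000_000))
```

Normalising by both feature length and library size makes RPF abundance comparable between genes and between samples.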
PROTEOFORMER was developed in Perl 5 and is freely available for download as a script version and as a Galaxy implementation.
All tools can be downloaded from GitHub. The tool documentation, instructions for downloading the reference information, and installation guidelines are in the README. Pre-configured ublast- and blastp-formatted databases can be downloaded from the data download links below.
The dbSNP databases can be downloaded from the NCBI website and placed in the tool-data folder as described in the README.
A script-based version of the pipeline can be found on GitHub.
The requirements and documentation are available in the README.
A customized virtual machine (Ubuntu 12.04 LTS) with all script dependencies and a Galaxy server already installed can be downloaded from the link PROTEOFORMER VM. The Galaxy server instance is installed in the folder /opt/galaxy-server.
PROTEOFORMER was developed and tested on Mac and Ubuntu 12.04 LTS.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. You may obtain a copy of the License at http://www.gnu.org/licenses/. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Copyright (C) 2014 G. Menschaert, J. Crappé, E. Ndah, A. Koch & S. Steyaert