Question

Novel peptide identification

4.8 years ago

jahanshanzida • 0

Hello,

I have 30,000 peptide sequences from a human brain sample. I want to compare them with the peptide sequences in the PRIDE database to see whether my peptides are novel. I am facing two problems.

First, I have to download the PRIDE dataset to a Linux server because my computer can't handle downloading such huge datasets. Second, how can I compare the sequences in R? Please give me some suggestions.

Shanzida.

R • 1.4k views

ADD COMMENT • link 4.8 years ago by jahanshanzida • 0

What is the length of your peptides? You could potentially use something like blat to do sequence searches against the PRIDE database, which I assume is in fasta format?

ADD REPLY • link 4.8 years ago by GenoMax 141k

Thank you so much for your reply. The length of my peptide sequences is 30,000, and it is very difficult to search them all at the same time. I also tried splitting the search into batches, but that increases the rate of mistakes. That's why I tried to download the peptide dataset to the Linux server and then compare it using the PGA package from Bioconductor in R. But I couldn't download the peptide sequence dataset on Linux.

I also have not seen a blat search option against the PRIDE database on the PRIDE website. Would you please give me some suggestions?

ADD REPLY • link 4.8 years ago by jahanshanzida • 0

The PRIDE database is the world's largest data repository of mass spectrometry-based proteomics data, so ignore my naive suggestion of using a sequence search tool (blat).

Someone else is going to have to help you with this one.

They appear to have a number of utilities available here: https://github.com/PRIDE-Utilities

ADD REPLY • link 4.8 years ago by GenoMax 141k

Thank you so much for the reply and consideration of my problem.

ADD REPLY • link 4.8 years ago by jahanshanzida • 0

score 1 · Answer 1 · 2019-07-03

Do you mean that you have 30,000 distinct peptides that you want to search or you have one (or more) peptides of length 30,000 (i.e. 30,000 characters long)?

I think you probably mean 30,000 peptides, and I assume your peptides are pretty short, i.e. anywhere from 2 amino acids to a few dozen, but nothing thousands of characters long, right?

The PRIDE FTP server generally has RAW files (sometimes mzML, which is an open format instead of the proprietary RAW format) and sometimes summary files, like Excel or text. If you search PRIDE for human and brain, you get 172 experiments. So unless there's a specific dataset you know has peptides reported, you would most likely have to reprocess the .RAW files yourself to get peptides from PRIDE.
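Since the original question was about getting files onto a Linux server, here is a minimal Python sketch of a streamed download that never loads the whole file into memory. The base URL and the year/month/accession directory layout in `pride_file_url` are my assumptions about how the PRIDE archive is organized, and both function names are made up for illustration. Check the actual file links on the project page before relying on it.

```python
import shutil
import urllib.request
from pathlib import Path


def pride_file_url(year, month, accession, filename,
                   base="https://ftp.pride.ebi.ac.uk/pride/data/archive"):
    """Build a download URL for one project file.

    ASSUMPTION: the archive is laid out as <base>/<YYYY>/<MM>/<accession>/<file>.
    Verify against the links shown on the project's PRIDE page.
    """
    return f"{base}/{year:04d}/{month:02d}/{accession}/{filename}"


def download(url, dest):
    """Stream the content at `url` to `dest` in chunks and return the path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in fixed-size chunks
    return dest
```

You would call `download(pride_file_url(year, month, accession, filename), "data/" + filename)` with the year, month, accession and file name read off the project page. Plain `wget` or `curl` on the server does the same job if you don't need it scripted.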

To process RAW files and get peptide identifications from PRIDE data, you'd need a proteomics pipeline (or at the very least, a database search tool like Comet or Tide from Crux). Additionally, your pipeline/database search would need to be configured specifically for the type of MS experiment (label-free quantification (LFQ), tandem mass tag (TMT), iTRAQ). The PRIDE pages report which type of experiment was done.

I picked PXD005119 at random (for example). On the top right, you can see the project files. It seems like they analyzed the data with Mascot (a database search tool; .mgf, the Mascot Generic Format, is its peak-list format). If I open one of the .mgf files, it holds spectra rather than data at the peptide level, so it's not readily useful. Other experiments just show the .RAW files and inferred proteins, but not the peptides.

If you're hoping to find peptide sequences that are novel to the brain, you'd have to compare against many PRIDE experiments to be sure they're really novel and not just missed by the experiment.
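To make the "compare against many experiments" point concrete: once you have a peptide list per experiment (from reprocessing or from deposited result tables), you could count in how many experiments each of your peptides shows up. This is only a sketch; the function name and the dict-of-lists input are hypothetical, and it does exact matching after upper-casing, nothing smarter.

```python
from collections import Counter


def experiments_seen_in(my_peptides, experiments):
    """Count, for each query peptide, how many reference experiments contain it.

    `experiments` maps an experiment label to an iterable of peptide sequences
    (hypothetical input shape, e.g. built from per-experiment result tables).
    """
    queries = [p.strip().upper() for p in my_peptides if p.strip()]
    counts = Counter()
    for peptides in experiments.values():
        seen = {p.strip().upper() for p in peptides}
        # Count each query at most once per experiment.
        counts.update(q for q in set(queries) if q in seen)
    return {q: counts[q] for q in queries}
```

A peptide with a count of 0 is only "not seen in the experiments you looked at", which is weaker than "novel", so the more experiments you include the better.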

If you're looking for new peptides and don't care if they're brain expressed, you could just compare your peptides against the human UniProt database and report anything that isn't a substring of a protein.
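The substring check could look roughly like the sketch below, assuming you have downloaded the human proteome as a FASTA file. The function names are made up. The `merge_il` option is my addition: isoleucine and leucine have the same mass, so MS usually can't tell them apart, and treating them as equal avoids calling a peptide "novel" just because of an I/L swap.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def _normalize(seq, merge_il):
    seq = seq.strip().upper()
    return seq.replace("I", "L") if merge_il else seq


def find_unmatched(peptides, protein_sequences, merge_il=True):
    """Return the peptides that are not a substring of any protein sequence."""
    # Join all proteins with a separator that never occurs in a sequence,
    # so a peptide cannot match across the boundary between two proteins.
    haystack = "|".join(_normalize(s, merge_il) for s in protein_sequences)
    unmatched = []
    for peptide in peptides:
        key = _normalize(peptide, merge_il)
        if not key:
            continue  # skip blank lines in the input
        if key not in haystack:
            unmatched.append(peptide)
    return unmatched
```

Usage would be something like `find_unmatched(my_peptides, [seq for _, seq in read_fasta("human_proteome.fasta")])`, where the file name is a placeholder. For 30,000 peptides a plain substring search should be tolerable; for much larger inputs a k-mer index or an Aho-Corasick automaton would be the next step.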

Since you mentioned R, I've heard good things about MSGF+ (a database search tool that would let you take RAW MS data and identify peptides) and there's a Bioconductor wrapper for it: http://bioconductor.org/packages/release/bioc/html/MSGFplus.html Maybe it would be enough for a first pass. Make sure you read the papers for the tools you use and have a good understanding of what you're dealing with, though, as peptide identification is very error prone.

I also wanted to point out that instead of dealing with proprietary RAW files, you can convert them to mzML using msconvert (documentation) from the ProteoWizard suite. My approach (since I was using Linux) was to download the Docker image with Wine and use that to convert all RAW files to mzML before searching with Comet.
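A batch conversion could be scripted along these lines. This is a sketch, not my exact setup: it assumes an `msconvert` executable is reachable from wherever the script runs, uses only the `--mzML` and `-o` options, and the `launcher` argument is a hypothetical hook for whatever prefix your environment needs (for example `("wine",)` inside the Docker image).

```python
import subprocess
from pathlib import Path


def msconvert_command(raw_file, out_dir, launcher=()):
    """Build the argument list for converting one RAW file to mzML."""
    return [*launcher, "msconvert", str(raw_file), "--mzML", "-o", str(out_dir)]


def convert_all(raw_dir, out_dir, launcher=()):
    """Convert every .raw/.RAW file in `raw_dir`; return the files processed."""
    raw_files = sorted(
        p for p in Path(raw_dir).iterdir() if p.suffix.lower() == ".raw"
    )
    for raw in raw_files:
        # check=True stops at the first failed conversion instead of
        # silently continuing with a partial set of mzML files.
        subprocess.run(msconvert_command(raw, out_dir, launcher), check=True)
    return raw_files
```

Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.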