Question: Is this de novo assembly too big for proteomics?
Asked 4.1 years ago by benaneely (United States):

I am assisting on a project that is attempting shotgun proteomic analysis of a non-model organism (an invertebrate); the sample being queried has had RNA-seq performed in parallel. I was not involved in the assembly and only have a FASTA of the de novo assembly, created using unknown parameters. This FASTA has 96k entries with an average length of 552 nt, ranging from 224 to 14,000 nt. The number of entries seems too large and the sequences too short in my opinion, but I have no experience with this or anything to compare it to.

I am using Mascot (a protein search engine); if this were DNA, Mascot would do a six-frame translation on the fly, but since I want a three-frame translation, I need to do this myself before loading the database.
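For illustration, the three-frame (forward-strand) translation step can be sketched in plain Python; in practice EMBOSS transeq does the same job, and the function name here is just a hypothetical:

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}

def translate_three_frames(seq):
    """Translate a transcript in the three forward reading frames."""
    seq = seq.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        protein = "".join(
            CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
            for i in range(offset, len(seq) - 2, 3)
        )
        frames.append(protein)
    return frames
```

For example, `translate_three_frames("ATGGCC")` returns `["MA", "W", "G"]`. Note that each transcript yields three database entries, which is exactly what triples the search space below.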

The first issue I have is that a three-frame translation (via either transeq or getorf from EMBOSS) means my search space will be almost 300k entries, which is pretty ridiculous imo (UniProtKB for all of Mammalia is 1.4 million entries, so this one individual would have as many as 1/5 of all known mammalian protein sequences). This has deleterious effects on search performance, both in computational cost and in decreased detection.

So how do you gauge whether a de novo assembly is "good" and usable for an application such as proteomics when all you have to go on is the generated FASTA? I was thinking of using something like CD-HIT to remove redundancy from the FASTA, but I am sure there are other clever suggestions folks could offer up.
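On gauging the assembly from the FASTA alone: simple summary statistics (entry count, length distribution, N50) are a common first look. A stdlib-only sketch; the minimal FASTA parser and file path are assumptions, not a specific tool's API:

```python
def read_fasta_lengths(path):
    """Return a list of sequence lengths from a FASTA file (minimal parser)."""
    lengths, current = [], 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest length L such that contigs >= L cover half the total assembly."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0
```

For the assembly described above, one would expect ~96k lengths with a mean near 552 nt; an N50 well below typical transcript lengths for the taxon would support the fragmentation concern raised in the comments.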

Also, is there anything else I am missing as to best practices here?

The optimal answer is to take the original fastq files and deal with them myself, and therefore know the assembly conditions, but for the time being I wanted to see what the community had to say, if anything.

Tags: rna-seq, proteomics • 1.1k views
Modified 3.1 years ago by h.mon • written 4.1 years ago by benaneely

I would suggest removing everything below 500 nt, and then using a clustering algorithm to remove isoforms of the same transcript.
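A minimal sketch of that length filter (the 500 nt cutoff is the one suggested above; records are assumed to be name-to-sequence pairs already parsed from the FASTA):

```python
def filter_by_length(records, min_len=500):
    """Keep only entries whose sequence is at least min_len nucleotides long."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_len}

# Toy example with hypothetical entry names:
toy = {"contig_1": "A" * 100, "contig_2": "A" * 600}
kept = filter_by_length(toy)  # only contig_2 survives
```

Short fragments rarely contain complete ORFs anyway, so this cheaply shrinks the search space before any clustering step.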

— Adrian Pelin, 4.1 years ago

You definitely need to learn how the de novo transcriptome was made. Lots of small sequences suggests it is fragmented: one long gene can become several separate FASTA entries by breaking at ambiguous or unmappable sites. This has serious implications for your proteomic analysis.

— karl.stamm, 4.1 years ago
Answered 3.1 years ago by h.mon:

For proteomics, I suggest you translate your transcriptome in silico with TransDecoder and/or GeneMarkS-T, and use the resulting peptide set as the reference database. Six- or three-frame translation will generate a lot of false positives.

After in silico translation, you can reduce redundancy with CD-HIT.
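CD-HIT is the standard tool here (it clusters by sequence identity, e.g. at a 90% threshold). Purely to illustrate the idea of redundancy removal, a crude stdlib stand-in that drops exact duplicates and peptides wholly contained in a longer one:

```python
def drop_contained(peptides):
    """Remove duplicates and sequences contained in a longer sequence.

    A rough, quadratic-time stand-in for redundancy removal; CD-HIT's
    identity-based clustering also merges near-identical sequences,
    not just exact substrings, and scales to large databases.
    """
    kept = []
    for seq in sorted(set(peptides), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept
```

For example, `drop_contained(["MKV", "MKVLL", "MKV", "GGG"])` returns `["MKVLL", "GGG"]`. For a 300k-entry database you would use CD-HIT itself rather than anything like this.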

Modified 2.2 years ago • written 3.1 years ago by h.mon

