I am assisting on a project attempting shotgun proteomic analysis of a non-model organism (an invertebrate), and the sample being queried has had RNA-seq performed in parallel. I was not involved in the assembly and only have a fasta of the de novo assembly, created with unknown parameters. This fasta has 96k entries with an average length of 552 bp, ranging from 224 to 14,000 bp. The number of entries seems too large to me and the sequences too short, but I have no experience or point of comparison.
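For what it's worth, those summary numbers came from a quick pass over the fasta; a minimal sketch of that kind of length-statistics pass (including N50, which assemblers usually report), shown here on a toy in-memory fasta rather than my real file:

```python
import io

def fasta_lengths(handle):
    """Collect per-record sequence lengths from a fasta stream."""
    lengths, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest length L such that contigs of length >= L hold at least half the bases."""
    half, running = sum(lengths) / 2.0, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Toy example; in practice the handle would be the real assembly fasta.
toy = io.StringIO(">a\nACGT\nACGT\n>b\nACG\n>c\nACGTACGTACGT\n")
lengths = fasta_lengths(toy)
print(len(lengths), min(lengths), max(lengths), n50(lengths))  # → 3 3 12 12
```

I mention N50 because count/mean/min/max alone (the numbers above) don't say much about how the assembled bases are distributed across contigs.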
If this were DNA, Mascot (the protein search engine I am using) would do a 6-frame translation on the fly; but since this is transcriptome data I only want a 3-frame translation, which I need to perform myself before loading the database.
The first issue is that a 3-frame translation (via either transeq or getorf from EMBOSS) means my search space will be almost 300k entries, which seems pretty ridiculous to me (UniProtKB for all of Mammalia is 1.4 million entries, so this one individual would contribute as much as a fifth as many sequences as all known mammalian proteins). This has deleterious effects on search performance, both in computational cost and in decreased detection sensitivity.
So how do you gauge whether a de novo assembly is "good" and usable for an application such as proteomics when all you have to go on is the generated fasta? I was thinking of using something like CD-HIT to remove redundancy from the fasta, but I am sure there are other clever suggestions folks could offer.
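On the CD-HIT idea: as I understand it, CD-HIT clusters sequences at a chosen identity threshold, whereas even plain exact-duplicate removal can shrink a redundant de novo assembly noticeably. The trivial end of that can be sketched as follows (the record names are made up):

```python
def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive).

    `records` is an iterable of (name, sequence) pairs. Note this only
    removes exact copies; CD-HIT goes further and collapses near-identical
    sequences at a chosen identity threshold.
    """
    seen, kept = set(), []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept

# Hypothetical toy records: contig_2 duplicates contig_1.
records = [("contig_1", "ATGGCC"), ("contig_2", "atggcc"), ("contig_3", "ATGTTT")]
print(drop_exact_duplicates(records))  # → [('contig_1', 'ATGGCC'), ('contig_3', 'ATGTTT')]
```

Whether clustering at, say, 95% identity is appropriate before a proteomics search (it could merge genuine isoforms) is part of what I am asking.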
Also, is there anything else I am missing regarding best practices here?
The optimal answer is going to be to take the original fastq files and process them myself, so that I know the assembly conditions, but for the time being I wanted to see what the community had to say, if anything.