Question

RNA sequencing depth for downstream proteomics


7.7 years ago

benaneely ▴ 70

Background: I am working on a project where we have samples from many novel species of an invertebrate (genome size unknown) on which we want to perform proteomic analysis. Shotgun proteomics searches tandem mass spectrometry data against a database of known proteins, typically the CDS annotations of a genome assembly (though additional experimental evidence helps the assignments). For this project the plan is to use RNA-seq followed by de novo assembly to construct an exon database for each species. We have done this once and it didn't work well, but we are fine-tuning our assembly parameters based on downstream performance.

Question: How many clusters of 100 bp paired-end reads should we sequence to generate the de novo assembly (25, 50, 75 or 100M)? Given the number of species we want to cover, this step needs to be cost-efficient. But once downstream processing time, the proteomic analysis, and its computational time are factored in, 100M may actually be the best choice if quality suffers at 25M. The biggest downstream problem is a heavily fragmented assembly, since that makes the proteomic search space prohibitively large.

Any thoughts or suggestions are appreciated. I have no experience with how much assembly quality differs between 25M and 100M reads. Thanks.

RNA-Seq sequencing proteomics • 1.6k views

ADD COMMENT • link updated 7.7 years ago by brent_wilson ▴ 140 • written 7.7 years ago by benaneely ▴ 70

score 0 · Answer 1 · 2016-08-04

The quality of the assembly does not depend only on the number of reads used. One thing that makes a large difference is DSN (duplex-specific nuclease) treatment to normalize transcript abundance in the sample. Adding more reads will usually help, but it's often hard to find a good metric for assembly quality: contiguity (N50) can increase without any increase in accuracy, and there may even be a tradeoff between contiguity and how well the contigs reflect the real transcripts.
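To make the N50 caveat concrete, here is a minimal Python sketch of the usual N50 definition (the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half the total). The contig lengths are made-up numbers for illustration only, not data from any real assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical assembly: two real transcripts plus six short fragments.
honest = [1000, 800, 100, 100, 100, 100, 100, 100]

# Same bases, but the two long contigs were wrongly joined into one chimera.
chimeric = [1800, 100, 100, 100, 100, 100, 100]

print(n50(honest), n50(chimeric))
```

The misjoined assembly scores a higher N50 even though it is less correct, which is why N50 alone is a weak guide when comparing assemblies built from different read depths.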

Without knowing your transcriptome size, it's really hard to know how many reads to go with. One option is to use as many reads as you can afford. Another is to sequence one sample very deeply, subsample its reads to several depths, build an assembly from each subsample, and compare them (a sensitivity analysis); the findings for that one sample can then inform the depth for the others.
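The subsampling step of such a sensitivity analysis could be sketched roughly as follows. This is an illustrative Python sketch, not a tested pipeline: the function names and file names are hypothetical, it assumes standard four-line FASTQ records with R1 and R2 in the same order, and it keeps each pair with a fixed probability rather than hitting an exact read count. In practice a dedicated tool such as seqtk would likely be used instead.

```python
import gzip
import random
from itertools import islice


def open_text(path, mode):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return pairs kept.

    Both mates are kept or dropped together so the output stays paired.
    """
    rng = random.Random(seed)
    kept = 0
    with open_text(r1_in, "r") as f1, open_text(r2_in, "r") as f2, \
            open_text(r1_out, "w") as o1, open_text(r2_out, "w") as o2:
        while True:
            rec1 = list(islice(f1, 4))  # one FASTQ record = 4 lines
            rec2 = list(islice(f2, 4))
            if not rec1 or not rec2:
                break
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept


# Hypothetical usage, assuming a deep run of about 100M pairs:
# for depth in (25, 50, 75):
#     subsample_pairs("deep_R1.fastq.gz", "deep_R2.fastq.gz",
#                     f"sub{depth}M_R1.fastq.gz", f"sub{depth}M_R2.fastq.gz",
#                     fraction=depth / 100, seed=depth)
```

Each subsample would then be assembled with the same parameters and judged on the metric that matters downstream, for example the size of the resulting protein search database or the number of peptide identifications, rather than on N50 alone.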

There is an interesting article in PLoS ONE that you might check out as well:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4701411/

I hope this is helpful!

Brent Wilson, PhD | Project Scientist | Cofactor Genomics
