Contig cutoff of 300 bp - why? (Trinity transcriptome assembly)
6.2 years ago
BioBing ▴ 150

Hi all,

This may be a stupid question, but I am curious to find out why exactly many papers choose a contig cutoff of 300 bp when performing a Trinity de novo transcriptome assembly.

In my head, it makes sense, since short contigs probably do not provide a lot of information and are just "contaminating" the transcriptome. But this is only speculation, since I have not found any reliable source with a good and precise explanation of why the cutoff is useful.

Do any of you have a good explanation, or even a paper/review you could recommend on the topic? I would be grateful.

This question has been buzzing in my head for the last couple of days, and I am curious to find out why exactly we choose a 300 bp cutoff in a Trinity de novo transcriptome assembly (probably in other assemblers as well, but I work with Trinity).

Cheers, Birgitte

Trinity RNA-Seq Assembly Contigs • 2.5k views

Can you do a local similarity search for the shorter transcripts, just to see what you miss when you apply a cutoff of 300 bp?

6.2 years ago
h.mon 35k

It is because shorter transcripts tend to be less useful for the standard RNA-seq analysis pipeline: they are usually lowly expressed, of unknown function, and often some sort of artifact (chimeric transcripts, poor quality, contamination from other species, and so on).

You can check this for yourself: 1) Map the reads onto the assembled transcriptome and check coverage by transcript length. Shorter transcripts will have much lower average coverage. 2) Do a similarity search (DIAMOND is quite fast) and check for significant hits (use a somewhat stringent e-value cutoff to avoid false positives). You will see that short transcripts have a greater proportion of unannotated (no hits) search results.
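The two checks above can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the hit table is tab-separated with the query ID in the first column (as in DIAMOND/BLAST tabular output), and `coverage` is a hypothetical dict of contig ID to mean read depth that you have already computed from your read mapping. The function names are made up for illustration.

```python
from statistics import mean

def read_fasta_lengths(handle):
    """Return {contig_id: sequence_length} from a FASTA file handle."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Contig ID is the header text up to the first whitespace
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

def read_hit_ids(handle):
    """Return the set of query IDs from a tab-separated hit table (query ID in column 1)."""
    return {line.rstrip("\n").split("\t")[0] for line in handle if line.strip()}

def summarise_by_length(lengths, hit_ids, coverage, cutoff=300):
    """Compare contigs shorter than the cutoff with contigs at or above it.

    coverage is an assumed {contig_id: mean_depth} dict; contigs missing
    from it are counted as depth 0.
    """
    classes = (
        ("short", lambda n: n < cutoff),
        ("long", lambda n: n >= cutoff),
    )
    summary = {}
    for label, in_class in classes:
        ids = [cid for cid, n in lengths.items() if in_class(n)]
        if not ids:
            continue
        summary[label] = {
            "n": len(ids),
            "frac_no_hit": sum(cid not in hit_ids for cid in ids) / len(ids),
            "mean_coverage": mean(coverage.get(cid, 0.0) for cid in ids),
        }
    return summary
```

If the short class shows a clearly higher `frac_no_hit` and a lower `mean_coverage` than the long class in your own assembly, that is the pattern described above. Note that the hit table should already be filtered by your e-value cutoff before it is read in.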

That is not to say all short transcripts are useless or artifacts, but the regular RNA-seq pipeline is looking at larger patterns, so we focus on the data in which we have more biological confidence and statistical power. You may keep short transcripts and try to find something interesting, but in that case you will have to do a lot of validation later on to prove you are not looking at an artifact.
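If you want to set the short contigs aside instead of throwing them away, something like the following might do. It is a minimal sketch in plain Python, not part of Trinity; the function names and the default cutoff of 300 are just for illustration.

```python
def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_by_length(records, cutoff=300):
    """Partition records into (kept, short); kept contigs are >= cutoff bp."""
    kept, short = [], []
    for header, seq in records:
        (kept if len(seq) >= cutoff else short).append((header, seq))
    return kept, short

def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Writing the `short` list to its own file keeps those contigs available for a separate similarity search or later validation, while the main analysis runs on the `kept` set.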

6.2 years ago
jwhan.algae ▴ 10

It is an old story, from the days of EST sequencing. As you know, mRNAs consist of three parts: the 5'-UTR, the ORF, and the 3'-UTR. A UTR was usually ~300 bp long, so we applied a length cutoff of about 300 bp when the cDNA library was constructed. Although NGS-based sequencing does not follow this rule, a contig that short is mostly useless because it is too short to analyze. Another reason: paired-end sequencing has an internal distance of about 350 bp between reads, so you can imagine that a contig shorter than that could hardly be generated from properly paired data. It was more likely produced from orphan reads than from paired reads. Therefore, most researchers remove contigs below 300 bp. I hope you get a good answer to your question.

