Question

Contig cutoff of 300 bp - why? (Trinity transcriptome assembly)

0

7.6 years ago

BioBing ▴ 150

Hi all,

This may be a stupid question, but I am curious to find out why exactly many papers choose a contig cutoff of 300 bp when performing a Trinity de novo transcriptome assembly.

In my head it makes sense, since short contigs probably do not provide much information and are just "contaminating" the transcriptome. But this is only speculation, since I have not found any reliable source with a good and precise explanation of why the cutoff is useful.

Do any of you have a good explanation, or even a paper/review you could recommend on the topic? I would be grateful.

This question has been buzzing in my head for the last couple of days, and I am curious to find out why exactly we choose a 300 bp cutoff in a Trinity de novo transcriptome assembly (probably with other assemblers as well, but I work with Trinity).

Cheers, Birgitte

Trinity RNA-Seq Assembly Contigs • 3.3k views

0

Can you run a local similarity search on the shorter transcripts, just to see what you miss when you apply a cutoff of 300 bp?
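To try this, the contigs below the cutoff first have to be pulled out of the assembly. Here is a minimal sketch in plain Python, assuming the assembly is an ordinary FASTA file; the function names (`read_fasta`, `split_by_length`, `to_fasta`) are made up for illustration and are not part of Trinity:

```python
def read_fasta(text):
    """Parse FASTA text into {sequence id: sequence}; the id is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def split_by_length(records, cutoff=300):
    """Return (short, long): contigs below the cutoff and contigs at or above it."""
    short = {k: s for k, s in records.items() if len(s) < cutoff}
    long_ = {k: s for k, s in records.items() if len(s) >= cutoff}
    return short, long_


def to_fasta(records):
    """Serialize {id: sequence} back to FASTA text, one line per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

The `short` set can then be written out with `to_fasta` and used as the query for the similarity search, so you see exactly which contigs the 300 bp cutoff would discard.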

7.6 years ago by lakhujanivijay 5.9k

1

7.6 years ago

jwhan.algae ▴ 10

It is an old story, from the days of EST sequencing. As you know, mRNAs consist of three parts: the 3'-UTR, the 5'-UTR and the ORF. A UTR was usually ~300 bp long, so a length cutoff of about 300 bp was applied when cDNA libraries were constructed. NGS-based sequencing does not follow this rule, but a contig that short is still mostly useless because it is too short to analyze. Another reason is that paired-end sequencing leaves an inner distance of roughly 350 bp between the two reads. As you can imagine, a contig shorter than that can hardly have been generated from properly paired reads; it was more likely produced from orphan reads. Therefore, most researchers remove contigs below 300 bp. I hope you get a good answer to your question.

score 3 · Accepted Answer · 2017-12-05

It is because shorter transcripts are likely to be less useful for the standard RNAseq analysis pipeline: they tend to be lowly expressed, of unknown function, and are often some sort of artifact (chimeric transcripts, poor quality, contaminants from other species, and so on).

You can check this for yourself: 1) map the reads onto the assembled transcriptome and check coverage by transcript length; shorter transcripts will have much lower average coverage. 2) Do a similarity search (DIAMOND is quite fast) and check for significant hits (use a somewhat stringent e-value cutoff to avoid false positives); you will see that a greater proportion of short transcripts come back unannotated (no hits).
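Both checks can be summarized with a few lines of Python once the mapping and the search are done. This is only a sketch under some assumptions: `lengths` and `coverage` are dictionaries you build yourself from the assembly and from your mapper's output, the search result is in the 12-column BLAST-style tabular format (query id in column 1, e-value in column 11), and the `1e-5` e-value threshold is just an example value. The function names are hypothetical.

```python
def length_class(length, cutoff=300):
    """Label a transcript as 'short' (below cutoff) or 'long' (at or above it)."""
    return "short" if length < cutoff else "long"


def mean_coverage_by_length(lengths, coverage, cutoff=300):
    """Mean per-transcript coverage for short vs. long transcripts."""
    sums = {"short": [0.0, 0], "long": [0.0, 0]}
    for name, length in lengths.items():
        if name not in coverage:
            continue  # transcript without a coverage value is skipped
        key = length_class(length, cutoff)
        sums[key][0] += coverage[name]
        sums[key][1] += 1
    return {k: (total / n if n else None) for k, (total, n) in sums.items()}


def hit_fraction_by_length(lengths, tabular_hits, cutoff=300, max_evalue=1e-5):
    """Fraction of short vs. long transcripts with at least one hit passing the e-value cutoff."""
    annotated = set()
    for line in tabular_hits.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if float(fields[10]) <= max_evalue:
            annotated.add(fields[0])
    counts = {"short": [0, 0], "long": [0, 0]}
    for name, length in lengths.items():
        key = length_class(length, cutoff)
        counts[key][0] += 1
        counts[key][1] += name in annotated
    return {k: (hit / n if n else None) for k, (n, hit) in counts.items()}
```

If the pattern described above holds for your data, the `short` class should show both a lower mean coverage and a lower fraction of annotated transcripts than the `long` class.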

That is not to say all short transcripts are useless or artifacts, but the regular RNAseq pipeline is looking at larger patterns, so we focus on the data in which we have more biological confidence and statistical power. You may keep short transcripts and try to find something interesting, but in that case you will have to do a lot of validation later on to prove you are not looking at an artifact.