Question

The difference between okay.cds and okay.fasta produced by tr2aacds

6.6 years ago

xioli2013 ▴ 10

Hi community,

I have used tr2aacds.pl from the EvidentialGene workflow to remove redundant transcripts from a Trinity de novo assembly.

It generated several files in the okayset directory (okalt.aa, okalt.cds, okalt.fasta, okay.aa, okay.cds, okay.fasta). I have read in some papers that you need to use both the primary and the alternative sequences (I assume these are the okay.cds and okalt.cds files).

I concatenated okay.cds and okalt.cds and mapped the reads to this assembly; however, the mapping rate dropped dramatically, from 80% to 35%.

Then I did the same with okay.fasta and okalt.fasta, and this time the mapping rate was around 77%.

I am not sure which set is the right one to use, so I would like to understand the relationship between the cds files and the fasta files generated by tr2aacds.pl.

Thank you,

xp

tr2aacds remove redundancy evidentialGenes • 2.3k views

ADD COMMENT • link updated 2.2 years ago by Sara • 0 • written 6.6 years ago by xioli2013 ▴ 10

How did you concatenate your okay.cds files? I have eight samples, which I assembled individually using Trinity. I then ran Evigene on each assembly, obtaining eight okay.cds files. Now I want to concatenate them so that I can run BUSCO on the combined set.

ADD REPLY • link 2.2 years ago by Sara • 0

score 3 · Accepted Answer · 2017-09-19

6.6 years ago

gilbert.bionet ▴ 160

The okayset/ files from tr2aacds that you ask about contain the subset of your input transcripts that are classified as non-redundant, valid coding genes (that is, okay). There are three kinds of file:

1. your input transcripts, with the same suffix as you used for input to tr2aacds (.fasta here),
2. protein translations from these, with the "aa" suffix, and
3. coding sequences, with the "cds" suffix: the subrange of (1) that translates into (2).

Coding sequences are shorter than full transcript fasta sequences. The "okalt" subset holds alternative transcripts to the "okay" primary transcripts. The tool name "tr2aacds" is shorthand for "transcripts converted to aa-protein and cds-coding sequences".
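To see how much shorter the coding sequences are, one option is to compare total bases per record ID between the transcript file and the cds file. The sketch below is only an illustration, not part of tr2aacds: the helper names (`read_fasta`, `coding_fraction`) and the toy records are hypothetical, and it assumes the cds and transcript records share the same ID as the first word of the header. Bases outside the coding range (the untranslated ends) are missing from the cds file, so reads that come from them have nothing to align to.

```python
from io import StringIO


def read_fasta(handle):
    """Return {record_id: sequence} from an open FASTA handle.

    The record ID is taken as the first whitespace-delimited word of
    the header line (an assumption about the header layout).
    """
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def coding_fraction(transcripts, cds):
    """Fraction of transcript bases that fall inside the matching CDS."""
    shared = transcripts.keys() & cds.keys()
    transcript_bases = sum(len(transcripts[i]) for i in shared)
    cds_bases = sum(len(cds[i]) for i in shared)
    return cds_bases / transcript_bases if transcript_bases else 0.0


# Hypothetical toy data standing in for okay.fasta and okay.cds.
toy_transcripts = read_fasta(StringIO(
    ">tr1\nGGGGGATGAAACCCTAGTTTTT\n>tr2\nCCATGGGGTAACCCCCCCC\n"
))
toy_cds = read_fasta(StringIO(
    ">tr1 type=cds\nATGAAACCCTAG\n>tr2 type=cds\nATGGGGTAA\n"
))

fraction = coding_fraction(toy_transcripts, toy_cds)
print(f"coding fraction of transcript bases: {fraction:.2f}")
```

With real data, replace the `StringIO` objects with `open("okay.fasta")` and `open("okay.cds")`. A low coding fraction would be consistent with a lower mapping rate against the cds files, although it does not by itself prove that this is the cause.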

Don Gilbert

ADD COMMENT • link 6.6 years ago by gilbert.bionet ▴ 160

Thank you for the explanation. Based on this, using the files with the cds suffix from the okayset should be correct. Is it valid to concatenate okay.cds and okalt.cds together as an assembly? And why am I losing mapping rate when using bowtie2 to align reads to this 'assembly'?

xp

ADD REPLY • link 6.6 years ago by xioli2013 ▴ 10

Here is an example of mapping reads back to the fasta file made by concatenating .okay.cds and .okalt.cds (using bowtie2):

13139058 reads; of these:
  13139058 (100.00%) were unpaired; of these:
    8065211 (61.38%) aligned 0 times
    4721541 (35.94%) aligned exactly 1 time
    352306 (2.68%) aligned >1 times
38.62% overall alignment rate

I also used tgicl alone to test this, which gave a better overall alignment rate:

13139058 reads; of these:
  13139058 (100.00%) were unpaired; of these:
    3131157 (23.83%) aligned 0 times
    8431567 (64.17%) aligned exactly 1 time
    1576334 (12.00%) aligned >1 times
76.17% overall alignment rate
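For unpaired reads, the overall rate in these summaries is the sum of the "aligned exactly 1 time" and "aligned >1 times" counts divided by the total number of reads. A minimal sketch that recomputes it from the two summaries above; the function name `overall_alignment_rate` is hypothetical and the parser assumes the unpaired summary layout shown here, not paired-end output.

```python
import re


def overall_alignment_rate(summary):
    """Recompute the overall alignment rate (%) from an unpaired summary."""
    total = int(re.search(r"^\s*(\d+) reads; of these:", summary, re.M).group(1))
    once = int(re.search(r"(\d+) \([\d.]+%\) aligned exactly 1 time", summary).group(1))
    multi = int(re.search(r"(\d+) \([\d.]+%\) aligned >1 times", summary).group(1))
    return 100.0 * (once + multi) / total


# Summary for the concatenated cds files, copied from this post.
cds_summary = """13139058 reads; of these:
  13139058 (100.00%) were unpaired; of these:
    8065211 (61.38%) aligned 0 times
    4721541 (35.94%) aligned exactly 1 time
    352306 (2.68%) aligned >1 times
38.62% overall alignment rate"""

# Summary for the tgicl test, copied from this post.
tgicl_summary = """13139058 reads; of these:
  13139058 (100.00%) were unpaired; of these:
    3131157 (23.83%) aligned 0 times
    8431567 (64.17%) aligned exactly 1 time
    1576334 (12.00%) aligned >1 times
76.17% overall alignment rate"""

cds_rate = overall_alignment_rate(cds_summary)
tgicl_rate = overall_alignment_rate(tgicl_summary)
print(f"cds: {cds_rate:.2f}%  tgicl: {tgicl_rate:.2f}%")
print(f"difference: {tgicl_rate - cds_rate:.2f} percentage points")
```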

Why is the difference so large?

Thanks,

xp

ADD REPLY • link 6.5 years ago by xioli2013 ▴ 10