Question: How do de novo genome/transcriptome assemblers treat ambiguous bases?
Asked 5.1 years ago by Rohit (California):

Dear Biostarists,

I have a basic yet important question. How do assemblers treat ambiguous bases (N's) in order to avoid erroneous contigs?

I read that Velvet treats each N as an A, but what about other de novo genome assemblers such as SOAPdenovo2 (open-source black box), CLC (commercial mystique), ABySS, ALLPATHS, Minia and others?

How do de novo transcriptome assemblers such as Trinity, Trans-ABySS, SOAPdenovo-Trans, Rnnotator and others treat them?

 


I suspect that this will depend entirely on the tool and that you'll have to ask the authors or read the code (this may not be mentioned in the papers) to find out.

written 5.1 years ago by Devon Ryan
Answered 5.1 years ago by Rohit (California):

Dear all,

I can try to answer the question now. Note that I have written "probably" where a tool depends on another one.

Velvet (probably Oases too) - Replace N's with A

ABySS (probably Trans-ABySS too) - Replaces N's based on consensus sequences that cover that base; the consensus comes from sequences of 90% identity, aligned with the DIALIGN-TX aligner

SOAPdenovo2, SOAPdenovo-Trans - Replace N's with G

ALLPATHS - Ambiguous bases are saved as random bases

Rnnotator - Uses Velvet, so probably N's to A
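
To make the fixed-substitution behaviours listed so far concrete, here is a minimal sketch in Python. It is not code from any of these assemblers: the function name `replace_ambiguous` and the `policy` argument are made up for illustration, and the mapping of policies to tools only repeats what is reported in this thread (A for Velvet, G for SOAPdenovo2, random for ALLPATHS).

```python
import random

def replace_ambiguous(read, policy, rng=None):
    """Return `read` with every non-ACGT character replaced.

    policy is "A", "C", "G", "T" (fixed substitution) or "random".
    Hypothetical helper for illustration, not taken from any assembler.
    """
    if policy not in ("A", "C", "G", "T", "random"):
        raise ValueError("policy must be one of A, C, G, T or 'random'")
    rng = rng or random  # pass a seeded random.Random for reproducible output
    out = []
    for base in read.upper():
        if base in "ACGT":
            out.append(base)            # unambiguous base: keep as is
        elif policy == "random":
            out.append(rng.choice("ACGT"))
        else:
            out.append(policy)          # fixed substitution, e.g. N -> A
    return "".join(out)

print(replace_ambiguous("ACNNT", "A"))  # fixed N -> A
print(replace_ambiguous("ACNNT", "G"))  # fixed N -> G
```

Note that under a fixed substitution a run of N's turns into a run of a single base, which is something to keep in mind when homopolymer stretches show up in contigs.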

IDBA - According to the authors, sequencing depth is taken into account during assembly:
"Basically, we try to correct the graph based on the sequencing depth. It identifies similar paths and removes paths with very low sequencing depth comparing to neighbors. Note that it doesn't introduce new k-mers in this process. The assumption is that the actual sequence must appear in the graph and have higher depth."
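
The depth-based idea in that description can be sketched roughly as below. This is only an illustration under my own assumptions, not IDBA's implementation: the function name `prune_low_depth_paths`, the `(sequence, mean_depth)` representation and the `min_depth_ratio` threshold are all hypothetical, and the paths are assumed to have already been identified as similar alternatives between the same two graph nodes.

```python
def prune_low_depth_paths(paths, min_depth_ratio=0.1):
    """Drop alternative paths whose depth is very low compared to the best one.

    paths: list of (sequence, mean_depth) tuples for similar, competing paths.
    Only existing paths are removed; no new sequence is ever created.
    """
    if not paths:
        return []
    best_depth = max(depth for _, depth in paths)
    return [(seq, depth) for seq, depth in paths
            if depth >= min_depth_ratio * best_depth]

alternatives = [("ACGTA", 50.0), ("ACTTA", 2.0)]
print(prune_low_depth_paths(alternatives))
```

The point of the sketch is the last sentence of the quote: the surviving sequence is always one that was already in the graph, so nothing is invented to fill in an uncertain base.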

Minia - If there are ambiguous bases in the input, i.e. N's in reads, then Minia cuts reads around them: precisely, it discards any k-mer containing at least one N.
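
The k-mer discarding described for Minia amounts to something like the following sketch. It is not Minia's code; `kmers_without_n` is a made-up name, and as a simplifying assumption it drops any k-mer with a character outside A, C, G, T rather than only N.

```python
def kmers_without_n(read, k):
    """Yield the k-mers of `read` that contain only A, C, G and T."""
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Skip every k-mer that overlaps an ambiguous position.
        if all(base in "ACGT" for base in kmer):
            yield kmer

print(list(kmers_without_n("ACGTNACGT", 3)))
# ['ACG', 'CGT', 'ACG', 'CGT']
```

A single N removes the k k-mers that overlap it (fewer near the read ends), so the read is effectively split in two at that position.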

Trinity - Ignored first, later treated as mismatches

Non-[GATC] characters will be ignored during the early phases of Trinity (Jellyfish, Inchworm, and Chrysalis, I think), and then likely treated as mismatches during the final Butterfly phase. Trinity simply isn't compatible with them for the most part, though it shouldn't error out as a result of such characters.
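
As an illustration of what "treated as mismatches" could mean in practice, here is a small sketch. It is an assumption on my part and not Trinity code; `count_mismatches` is a hypothetical helper that compares a read to an equally long candidate sequence.

```python
def count_mismatches(read, reference):
    """Count differing positions; a non-ACGT read base always counts as a mismatch."""
    if len(read) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum(
        1
        for r, c in zip(read.upper(), reference.upper())
        if r not in "ACGT" or r != c
    )

print(count_mismatches("ACNT", "ACGT"))
```

Under this reading, a read with many N's simply looks like a poor match to every path, so it contributes little rather than causing an error.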

CLCbio - I do not have a commercial license, so I do not receive support. Those with a commercial license should try asking them, as their code is unreadable.

 


Hi, when looking at an ABySS assembly there are lots of cases where there is a long run of a single base where I presume N's should be, so I assumed ABySS replaced N's with a random base. Could you explain what is meant by "Replace N's based on consensus sequences that fill that base, consensus sequence of 90% identity through DIALIGN-TX aligner"? How could this result in what I am seeing? Hope you can help, thanks!

written 4.5 years ago by jomaco190

The first mention of DIALIGN-TX comes at the PopBubbles step. This is already at the assembly stage, where N's are replaced based on other sequences that are 90% similar to that particular path.
I guess that if there are no similar sequences, random bases are assigned.

written 4.5 years ago by Rohit
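
The consensus-based replacement described in this reply could look roughly like the sketch below. It only follows the description given in this thread and is not ABySS or DIALIGN-TX code: the names `identity` and `fill_n_from_similar` are hypothetical, the sequences are assumed to be already aligned and of equal length (no aligner is called), and the 0.9 threshold is simply the 90% figure mentioned above.

```python
from collections import Counter

def identity(a, b):
    """Fraction of agreeing positions, skipping positions where either has an N."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "N" and y != "N"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def fill_n_from_similar(seq, others, min_identity=0.9):
    """Replace each N in `seq` with the majority base from sufficiently similar sequences."""
    similar = [o for o in others
               if len(o) == len(seq) and identity(seq, o) >= min_identity]
    out = []
    for i, base in enumerate(seq):
        if base != "N":
            out.append(base)
            continue
        votes = Counter(o[i] for o in similar if o[i] != "N")
        # No similar sequence covers this position: leave the N untouched here.
        out.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(out)

print(fill_n_from_similar("ACGTNACGTA", ["ACGTTACGTA", "ACGTTACGTA", "TTTTTTTTTT"]))
```

The sketch leaves an N in place when no similar sequence is available. The guess above is that random bases are assigned in that case, which would be a one-line change but is not something this thread confirms.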