A question about *-unitigs.fa and *-scaffolds.fa
9.1 years ago

Dear folks,

I have read that "The *-unitigs is the assembly without using any paired end or mate pair information, and the *-contigs.fa is the next step, using paired end information. The *-scaffolds.fa is your scaffolded contigs".

The stats for my *-unitigs.fa, *-contigs.fa, and *-scaffolds.fa files are listed below. As I understand it, the difference in total length between *-unitigs.fa and *-scaffolds.fa should equal the number of Ns in *-scaffolds.fa. Is that right? However, when I counted the Ns in *-scaffolds.fa, I found only 509. What am I missing?

Also, as I understand it, the scaffolding step uses paired-end/mate-pair information to join the contigs together, right? If so, there should be 99 gaps filled with Ns in my *-scaffolds.fa (316 unitigs versus 217 scaffolds). But I don't think there are 99 gaps of Ns in *-scaffolds.fa. What am I missing here?

*-unitigs.fa
sequence #: 316 total length: 4678712   max length: 236245      N50: 34091      N90: 10761
*-contigs.fa
sequence #: 221 total length: 4776303   max length: 236245      N50: 40151      N90: 15786
*-scaffolds.fa
sequence #: 217 total length: 4776366   max length: 236245      N50: 40489      N90: 16196

Thank you so much!

PS: I just realized that I should post my question here. I apologize if it has been posted twice.

Best,
Xiaofei

abyss assembly • 3.7k views
9.1 years ago
benv ▴ 730

Hi Xiaofei,

Consider that each perfectly repeated sequence is represented only once in the unitigs file, even though it may occur many times in the genome. It is my understanding that ABySS outputs multiple copies of such repetitive sequences during the contig and scaffold stages. For example, if unitigs A and B both overlap a repeat unitig C, C will be merged onto the ends of both A and B when building contigs. I suspect that is the reason for the unexpected growth in total length.
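
To make the A/B/C example concrete, here is a toy sketch in Python. The sequences and lengths are invented, and simple concatenation stands in for what ABySS actually does (real unitigs overlap their neighbours, and that overlap is ignored here), so treat it as an illustration of the bookkeeping only. For reference, in the stats above the total grows by 4776303 - 4678712 = 97591 bases from unitigs to contigs, and by only 63 bases from contigs to scaffolds.

```python
# Toy model of a repeat unitig being copied onto two neighbours.
# All sequences are made up; plain concatenation is an assumption
# that ignores the overlaps between real unitigs.
unitigs = {
    "A": "ACGTACGTAC",  # unique sequence, 10 bases
    "B": "TTGACCATGA",  # unique sequence, 10 bases
    "C": "GGGGCCCC",    # repeat, stored once in the unitigs file, 8 bases
}


def total_length(seqs):
    """Sum of the lengths of all sequences."""
    return sum(len(s) for s in seqs)


# Contig stage (simplified): the repeat C is merged onto the end of both
# A and B, so it now appears twice in the output.
contigs = [unitigs["A"] + unitigs["C"], unitigs["B"] + unitigs["C"]]

unitig_total = total_length(unitigs.values())
contig_total = total_length(contigs)

print("unitigs:", len(unitigs), "sequences,", unitig_total, "bases")
print("contigs:", len(contigs), "sequences,", contig_total, "bases")
print("growth :", contig_total - unitig_total, "bases (one extra copy of C)")
```

The number of sequences drops while the total length rises, and no Ns are involved at all, which is the same pattern as going from *-unitigs.fa to *-contigs.fa in the stats above.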

As an aside, there are other things that happen during assembly that affect your total sequence length in the opposite direction (making it unexpectedly smaller):

  • removing "shim" contigs (abyss-filtergraph program)
  • second bubble popping stage (PopBubbles program)
  • merging sequences that overlap (Overlap and PathOverlap programs)
  • merging alternate paths into a consensus sequence (PathConsensus program)

If you're really curious what is happening during assembly, you can look at each step of the pipeline in the abyss-pe Makefile and also the output in the corresponding FASTA file (*-1.fa, *-2.fa, etc.).
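
If you want to compare the stages yourself, a small script along these lines could help. It is a minimal sketch, not part of ABySS: the function names are my own, gaps are assumed to be runs of N characters, and N50 is computed with the common "half of the total length" definition, so the numbers may differ slightly from whatever tool produced the stats in the question.

```python
import re
import sys


def read_fasta(path):
    """Yield one sequence string per FASTA record in the file."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def n50(lengths):
    """Length of the sequence at which half the total length is reached."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(seqs):
    """Count sequences, bases, N bases, and gaps (runs of N)."""
    seqs = [s.upper() for s in seqs]
    lengths = [len(s) for s in seqs]
    return {
        "sequences": len(seqs),
        "total_length": sum(lengths),
        "n_bases": sum(s.count("N") for s in seqs),
        "gaps": sum(len(re.findall(r"N+", s)) for s in seqs),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Print one summary line per FASTA file given on the command line.
    for path in sys.argv[1:]:
        print(path, summarize(read_fasta(path)))
```

Saved under a name of your choosing (say, the hypothetical fasta_stats.py), it can be run on the unitigs, contigs, and scaffolds files in one go. The "gaps" value is the number to compare against the expected count of scaffold joins, and "n_bases" against the difference in total length.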

Thank you so much! It really helps! :-)
