Why is one genome assembly more fragmented than the others?
3.8 years ago
katjanjarosz ▴ 10

Hello,

I have assembled the genomes of several yeast strains to the pseudochromosome level. All of them were treated in exactly the same way, so I don't understand why one of them is more fragmented than the others. The data are Illumina short reads from Nextera libraries. FastQC reported failures for "per base sequence content" and "sequence duplication levels" in every strain except the fragmented one, which only got warnings. Do you have any idea why this assembly could be more fragmented than the others?

Any suggestions or help would be much appreciated.

assembly next-gen genome sequencing sequence • 1.4k views
It is difficult to answer this without additional details. For example, what program did you use for the assembly? How about some assembly stats: number of contigs/scaffolds, average contig size and contig N50? Did you read carefully through its output for possible errors or warnings? What is the sequencing depth of your sample?
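If those numbers are not already in the assembler's log, they are quick to compute. A minimal sketch, assuming the contigs or scaffolds are in a plain FASTA file (the file name in the usage comment is a placeholder):

```python
def read_contig_lengths(fasta_lines):
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(lengths):
    """Summarise an assembly: sequence count, total size, mean length and N50."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "mean_bp": sum(lengths) / len(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


# Usage (hypothetical file name):
# with open("assembly.fasta") as handle:
#     print(assembly_stats(read_contig_lengths(handle)))
```

Running it on every strain gives a side-by-side view of contig count, total size and N50, which is usually enough to see whether one assembly is an outlier.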

Oddly enough, fragmented assemblies can be caused both by too shallow and too deep sequencing runs. In the first instance it is because there are not enough reads available. For too deep sequencing, which I suspect may be the issue here, non-random sequencing errors may accumulate at such a level that they disrupt the contiguity of an assembly.
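To get a rough idea of which side you are on, the nominal depth can be estimated from the read count alone. A back-of-the-envelope sketch; the genome size and read numbers in the usage comment are illustrative assumptions, not values from this thread:

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Nominal sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


# Usage with made-up numbers: 8 million reads of 150 bp on a ~12 Mb genome.
# expected_coverage(8_000_000, 150, 12_000_000)
```

This ignores trimming, duplicates and unmapped reads, so the depth measured after mapping will normally come out lower.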

Thank you for the suggestions. I used IDBA-UD to assemble the reads into contigs and Ragout to build the pseudochromosomes. All of the other final assemblies have an N50 of around 900 kb. This one has an N50 of 28 kb and almost 5 times more scaffolds. Coverage is comparable in all strains, over 100x. Also, this assembly is almost 2 times larger than the others.

"Coverage is comparable in all strains, over 100x. Also, this assembly is almost 2 times larger than others."

Like I said, it sounds like too deep a coverage. As to it being two times larger, that could happen because non-random mutations make it appear as if you have several strains that differ in discrete spots. These artificial SNPs make the assembly appear larger and more fragmented, and you can easily test whether this is the case: run your assembly through cd-hit-est (available here) and cluster at 90-95% identity, which should drop its size to what is expected. If it does, reduce the coverage to 60-80x (BBnorm can do that) and see if that assembles better.
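As a rough sketch of those two steps, here is a small helper that only builds the command lines. All file names are placeholders, and the flags are just the ones I would start from, so check them against each tool's help output before running anything:

```python
import shlex


def cd_hit_est_cmd(assembly, clustered, identity=0.95):
    """Build a cd-hit-est call that clusters sequences at the given identity (-c)."""
    return ["cd-hit-est", "-i", assembly, "-o", clustered, "-c", str(identity)]


def bbnorm_cmd(reads1, reads2, out1, out2, target=70):
    """Build a bbnorm.sh call that normalises paired reads to a target depth."""
    return [
        "bbnorm.sh",
        f"in={reads1}",
        f"in2={reads2}",
        f"out={out1}",
        f"out2={out2}",
        f"target={target}",
    ]


if __name__ == "__main__":
    # Placeholder file names; print the commands rather than executing them.
    print(shlex.join(cd_hit_est_cmd("assembly.fasta", "assembly.cdhit.fasta")))
    print(shlex.join(bbnorm_cmd("reads_1.fq.gz", "reads_2.fq.gz",
                                "norm_1.fq.gz", "norm_2.fq.gz")))
```

The printed commands can be pasted into a shell or passed to `subprocess.run` once the paths are real. Compare the total size of the clustered FASTA with the other strains before deciding whether to normalise and re-assemble.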

Thank you for the help. As you suggested, I ran the assembly through cd-hit-est with the parameter -c 0.9. Unfortunately, it is still twice as large as the assemblies of the other strains.

Haplotype assembly is the only other thing that comes to mind. How sure are you that your strain is pure and haploid?

I'd still reduce the coverage and see what comes out of that assembly. It may seem counter-intuitive that throwing away data could lead to a better assembly, but I have seen it happen many times.

According to the flow cytometry results, this strain and all the others are diploid. It should be pure. I recalculated the average coverage a different way, by mapping the reads (after trimming and k-mer based error correction) to the draft assembly. This strain has significantly lower coverage (80x) than the other strains (above 140x). Should I still try to lower the coverage? What do you think about assembling the reads before trimming and correction?
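One way such a mapping-based average could be computed is sketched below. It assumes per-position depths in the three-column, tab-separated format written by `samtools depth -a` (contig, position, depth), and it is not necessarily the exact procedure used for the numbers above:

```python
def mean_depth(depth_lines):
    """Average depth over all positions in `samtools depth`-style lines."""
    total = 0
    positions = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += int(fields[2])
        positions += 1
    return total / positions if positions else 0.0


# Usage (hypothetical file name, produced with e.g. `samtools depth -a sorted.bam > depth.tsv`):
# with open("depth.tsv") as handle:
#     print(mean_depth(handle))
```

Because the denominator is the length of the draft assembly itself, an assembly that is twice the expected size will show roughly half the per-position depth for the same amount of data.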

I don't know exactly what the reason is, but did you assess all your assemblies with BUSCO? Maybe the assembly you mentioned contains contamination, or something is missing from your raw data?

I haven't used BUSCO, but I will check it out. Thank you for the suggestion. I don't know where the problem could be. Maybe I will ask the data providers.

3.8 years ago
katjanjarosz ▴ 10

Thank you so much for the help. It turned out that this strain was mixed with another strain, which is why the assembly was twice as large as the others. The data providers suggested mapping the reads to a reference genome first and then proceeding to the assembly.
