Question: Why is one genome assembly more fragmented than the others?
katjanjarosz (Poland) wrote, 5 weeks ago:

Hello,

I have assembled the genomes of various yeast strains to pseudochromosome level. All of them were treated in exactly the same way, and I don't understand why one of them is more fragmented than the others. The data were generated by Illumina short-read sequencing of Nextera libraries. FastQC returned errors for "per base sequence content" and "sequence duplication levels" for all strains except the fragmented one, for which it returned only a warning. Do you have any suggestions as to why this assembly could be more fragmented than the others?

Any suggestions or help would be much appreciated.

modified 5 weeks ago • written 5 weeks ago by katjanjarosz

It is difficult to answer this without additional details. For example, what program did you use for the assembly? How about some assembly stats: number of contigs/scaffolds, average contig size and contig N50? Did you read carefully through its output for possible errors or warnings? What is the sequencing depth of your sample?
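For anyone who wants to report the stats asked for above, here is a minimal Python sketch that computes contig count, total size, mean contig length and N50 from a FASTA file. The function names and the example file name are hypothetical, and most assemblers or assembly QC tools will print the same figures for you.

```python
def fasta_lengths(lines):
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths = []
    current = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(contig_lengths):
    """Contig count, total size, mean length and N50 for a list of lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    # N50: length of the contig at which the running sum of the
    # longest-first contigs reaches half of the total assembly size.
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths) if lengths else 0.0,
        "n50_bp": n50,
    }


if __name__ == "__main__":
    # "assembly.fasta" is a placeholder file name, not one from this thread.
    with open("assembly.fasta") as handle:
        print(assembly_stats(fasta_lengths(handle)))
```

Running this on every strain's assembly gives directly comparable numbers, which makes an outlier easier to spot.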

Oddly enough, fragmented assemblies can be caused both by too shallow and too deep sequencing runs. In the first instance it is because there are not enough reads available. For too deep sequencing, which I suspect may be the issue here, non-random sequencing errors may accumulate at such a level that they disrupt the contiguity of an assembly.

written 5 weeks ago by Mensur Dlakic

Thank you for the suggestions. I used IDBA-UD to assemble the reads into contigs and Ragout to create pseudochromosomes. All of the other final assemblies have an N50 of around 900 kb. This one has an N50 of 28 kb and almost 5 times more scaffolds. Coverage is comparable in all strains, over 100x. Also, this assembly is almost 2 times larger than others.

written 5 weeks ago by katjanjarosz

> Coverage is comparable in all strains, over 100x. Also, this assembly is almost 2 times larger than others.

Like I said, it sounds like the coverage is too deep. As for the assembly being two times larger, that could happen because non-random mutations make it appear as if you have several strains that differ in discrete spots. These artificial SNPs make the assembly appear larger and more fragmented, and you can easily test whether this is the case: run your assembly through cd-hit-est and cluster at 90-95% identity, which should drop its size to what is expected. If it does, reduce the coverage to 60-80x (BBnorm can do that) and see if that assembles better.
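As a rough illustration of the coverage arithmetic behind that suggestion, the Python sketch below estimates depth from read count, read length and genome size, and then works out what fraction of reads to keep to hit a target depth. Note the assumption: this describes uniform random subsampling, which is not the same thing as BBnorm's k-mer-depth normalization. The read count, read length and 12 Mb genome size are made-up example values, not figures from this thread.

```python
def estimated_coverage(n_reads, read_length, genome_size):
    """Approximate sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that depth drops to the target.

    Returns 1.0 when the data are already at or below the target.
    """
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


if __name__ == "__main__":
    # Hypothetical numbers: 8 million reads of 150 bp on a 12 Mb genome.
    depth = estimated_coverage(8_000_000, 150, 12_000_000)
    fraction = keep_fraction(depth, 70.0)
    print(f"estimated depth: {depth:.0f}x, keep {fraction:.0%} of reads")
```

For paired-end data, count each mate as a read (or double the read length), and keep mates together when subsampling.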

written 5 weeks ago by Mensur Dlakic

Thank you for the help. As you suggested, I ran the assembly through cd-hit-est with the parameter -c 0.9. Unfortunately, the size is still twice that of the other strains.

written 5 weeks ago by katjanjarosz

Haplotype assembly is the only other thing that comes to mind. How sure are you that your strain is pure and haploid?

I'd still reduce the coverage and see what comes out of that assembly. It may be counter-intuitive that throwing away data could lead to a better assembly, but I have seen it happen many times.

written 5 weeks ago by Mensur Dlakic

According to the flow cytometry results, this strain and all the others are diploid. It should be pure. I recalculated the average coverage by mapping the reads, after trimming and k-mer-based error correction, to the draft assembly. This strain has significantly lower coverage (80x) than the other strains (above 140x). Should I still try to lower the coverage? What do you think about trying to assemble the reads before trimming and correction?
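For reference, a mapping-based coverage estimate like the one described above can be sketched in Python as follows. It assumes three-column, tab-separated input (contig name, position, depth), which is the layout `samtools depth` produces; the function names and the input file name are hypothetical. Per-contig means are reported alongside the overall mean because a single average can hide large differences between contigs.

```python
from collections import defaultdict


def mean_depth_per_contig(lines):
    """Mean depth per contig from 'name<TAB>position<TAB>depth' lines."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, depth = fields[0], int(fields[2])
        totals[name] += depth
        counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


def overall_mean_depth(lines):
    """Mean depth over all reported positions, regardless of contig."""
    total = 0
    positions = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        total += int(fields[2])
        positions += 1
    return total / positions if positions else 0.0


if __name__ == "__main__":
    # "depth.tsv" is a placeholder for a per-position depth table.
    with open("depth.tsv") as handle:
        rows = handle.readlines()
    print(overall_mean_depth(rows))
    print(mean_depth_per_contig(rows))
```

Only positions present in the input are averaged, so if zero-coverage positions were omitted when the table was produced, the means will be overestimates.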

written 4 weeks ago by katjanjarosz

I don't know exactly what the reason is, but did you assess all your assemblies with BUSCO? Maybe the assembly you mentioned contains contamination, or something is missing from your raw data?

written 5 weeks ago by ugurcabuk130

I didn't use BUSCO, but I will check it out. Thank you for the suggestion. I don't know where the problem could be. Maybe I will ask the data providers.

written 5 weeks ago by katjanjarosz
katjanjarosz (Poland) wrote, 24 days ago:

Thank you so much for the help. It turned out that this strain was mixed with another strain, which is why the assembly was 2 times larger than the others. The data providers suggested mapping the reads to a reference genome and then proceeding to the assembly.

written 24 days ago by katjanjarosz