large number of contigs after assembly
1
0
Entering edit mode
2.6 years ago
boymin2020 ▴ 80

Hi all,

Recently I have been working on a genome assembly of a fish. The estimated genome size from GenomeScope is 2,509 Mb, which is a fairly large genome. After the first-stage assembly with WTDBG2, the draft assembly has 47,000+ contigs. That seems odd, as the number is far too large. I am writing to ask for any methods to reduce it.
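For context, the run followed the usual two-step wtdbg2/wtpoa-cns procedure, roughly like the sketch below (file names, thread count and parameter values are placeholders, not my exact command):

# Sketch of a typical two-step wtdbg2 assembly for an ONT dataset of this size
# (file names and thread count are placeholders)
wtdbg2 -x ont -g 2.5g -t 32 -i ont_reads.fq.gz -fo fish_wtdbg2
wtpoa-cns -t 32 -i fish_wtdbg2.ctg.lay.gz -fo fish_wtdbg2.ctg.fa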

Thanks,

Nanopore genome assembly wtdbg2 sequencing • 3.4k views
ADD COMMENT
1
Entering edit mode

Perhaps you could add some more details about your data and the assembly like:

The assembly size with all your contigs? The assembly N50? Etc.

What is the read length distribution? Median? N50?

Has it been previously assembled with short reads?
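If it helps, one quick way to pull those numbers (tool choice and file names here are assumptions, not something you must use):

# Assembly size, largest contig, N50, etc.
seqkit stats -a fish_wtdbg2.ctg.fa
# Read length/quality distributions, median and N50 of the reads
NanoPlot --fastq ont_reads.fq.gz -t 16 -o nanoplot_reads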

ADD REPLY
0
Entering edit mode

I had better results with Flye compared to wtdbg2, but that was for PacBio data. How much coverage do you have? Did you filter your reads somehow?

ADD REPLY
0
Entering edit mode

In order to get ~100X coverage, we hired a sequencing company to produce nearly 250 GB of data across three batches (89 GB, 31 GB and 129 GB). As we always receive the CLEANED data from the company, we have not filtered the long reads ourselves.

ADD REPLY
0
Entering edit mode

What does 'cleaned' mean in this sense? 100X is a lot of coverage. You could downsample to the longest 50X and retry. But again (as I commented above), the stats of your assembly and reads will give hints as to why you have so many contigs.
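For instance, wtdbg2 itself can keep only the best/longest reads up to a given depth via its -g/-X/-L options; a sketch, where the genome size, depth cutoff and length threshold are illustrative values rather than recommendations:

# Sketch: let wtdbg2 select the best ~50X of the longest reads before assembly
# (genome size, depth and minimum read length are illustrative values)
wtdbg2 -x ont -g 2.5g -X 50 -L 5000 -t 32 -i all_batches.fq.gz -fo fish_ds50x
wtpoa-cns -t 32 -i fish_ds50x.ctg.lay.gz -fo fish_ds50x.ctg.fa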

ADD REPLY
0
Entering edit mode

'Cleaned' means that the reads we received from the sequencing company have at least had adapters removed and low-quality reads filtered out.

I did try downsampling to the longest reads as a trial, but it brought no improvement.

Below are the stats of the integrated reads from three sequencing batches generated by Nanoplot:

General summary:         
Mean read length:                  14,130.6
Mean read quality:                      9.5
Median read length:                11,310.0
Median read quality:                    9.5
Number of reads:                9,342,538.0
Read length N50:                   22,537.0
STDEV read length:                 12,281.6
Total bases:              132,015,868,566.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:    9342538 (100.0%) 132015.9Mb
>Q7:    9342529 (100.0%) 132015.9Mb
>Q10:   3234510 (34.6%) 44596.2Mb
>Q12:   96883 (1.0%) 602.9Mb
>Q15:   102 (0.0%) 0.4Mb
Top 5 highest mean basecall quality scores and their read lengths
1:  17.6 (778)
2:  16.7 (453)
3:  16.6 (356)
4:  16.4 (148)
5:  16.4 (609)
Top 5 longest reads and their mean basecall quality score
1:  172042 (8.1)
2:  162767 (9.3)
3:  160974 (9.4)
4:  158567 (7.4)
5:  158502 (9.2)

Below are the commands and stats for the Illumina short reads, generated with Jellyfish and GenomeScope2.

# Count canonical 21-mers from the four short-read FASTQ files
${jellyfish} count -C -m 21 -s 3G -t ${Ncores} -F 4 <(zcat ${in_sr_fq1}) <(zcat ${in_sr_fq2}) <(zcat ${in_sr_fq3}) <(zcat ${in_sr_fq4}) -o /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/mer_counts.jf
# Build the k-mer frequency histogram
${jellyfish} histo -t ${Ncores} /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/mer_counts.jf -o /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/reads.histo
# Estimate genome size, heterozygosity and repeat content from the histogram
${genomescope2} -k 21 -p 2 -i /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/reads.histo -o /public/home/LvZhenMing/min_handle/proj/ds/resultsFromGenomescope2 -n genomeAssemble_ds_Genomescope2

[GenomeScope2 k-mer profile plots showing the estimated genome size and heterozygosity]

ADD REPLY
0
Entering edit mode

What about stats for your highly fragmented assembly? Does downsampling and re-assembling at least improve the assembly somewhat?

ADD REPLY
0
Entering edit mode

The assembly generated from all three sequencing batches (PAG38564, PAG10123, PAG19859) has 53,178 contigs, while the re-assembly with only the PAG38564 batch has 43,837 contigs. It seems that downsampling did improve the assembly. However, the contig number is still far too large. I am quite confused by these stats.

ADD REPLY
0
Entering edit mode

That is a big reduction in contig numbers. However, you should be using all the data and selecting only the longest reads from the combined set.

The stats I think are more important for now are the total assembly size and the L90, N90, etc. Basically: is your assembly size what you expect, and is most of it in large contigs?
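For example, something along these lines prints the total size plus N50/L50 and N90/L90 (a quick awk sketch; the assembly file name is a placeholder):

# Contig lengths, longest first (assembly file name is a placeholder)
awk '/^>/{if(len)print len; len=0; next}{len+=length($0)}END{if(len)print len}' fish_wtdbg2.ctg.fa | sort -rn > ctg_lens.txt
# Total assembly size
TOTAL=$(awk '{s+=$1} END{print s}' ctg_lens.txt)
echo "total assembly size: $TOTAL"
# N50/L50 and N90/L90 from the cumulative contig lengths
awk -v total="$TOTAL" '{cum+=$1; n++}
    !n50 && cum>=0.5*total {print "N50="$1, "L50="n; n50=1}
    !n90 && cum>=0.9*total {print "N90="$1, "L90="n; n90=1}' ctg_lens.txt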

ADD REPLY
0
Entering edit mode

You are right. It is better to make use of all the sequencing data from the different batches.

The expected total genome size is 3-4 Gb with 76 chromosomes. The genome survey shows that this species has highly repetitive sequences and a high heterozygosity rate (>1%, as the figure above shows). Maybe I should try changing the assembly method, for example from WTDBG2 to Canu.

ADD REPLY
1
Entering edit mode

Downsampling with all the data will allow you to get a better set of long reads. You can try the tool Filtlong.

OK, that is the expected size, but what is your assembly size? L90? N90? Again, this will let you determine whether the majority of your assembly is in large contigs.

With a genome that size, Canu will take a very long time. You can try other fast assemblers like Raven, but you will probably not see any significant reduction in the contig number. It also depends on whether you want to try phasing, but your first issue is your contig count.
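A sketch of the Filtlong step mentioned above, assuming the reads from all three batches are pooled and the target is ~125 Gb (50X of a ~2.5 Gb genome); the thresholds and file names are placeholders:

# Keep the best ~50X (~125 Gb) of reads, weighting read length heavily
filtlong --min_length 5000 --length_weight 10 --target_bases 125000000000 all_batches.fq.gz | gzip > longest_50x.fq.gz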

ADD REPLY
0
Entering edit mode

What is the k-mer size used in the genome assembly? In general, de novo assemblies are done at multiple k-mer lengths (AFAIK).

ADD REPLY
0
Entering edit mode

I used Jellyfish and GenomeScope with k-mer = 21 to estimate the genome size.

ADD REPLY
2
Entering edit mode
2.6 years ago

Canu or Shasta will get you a more contiguous assembly. wtdbg2 is fast, but not that contiguous.

You will need to check the excellent Canu docs to see whether your expected heterozygosity rate is outside the tool's expectations.

I would select only the longest reads, up to ~50X coverage, to go into this. Short reads are mainly useful for correcting the longer reads.
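If you go with Canu, a minimal invocation looks roughly like the sketch below (the read flag is -nanopore in Canu >=2.0 and -nanopore-raw in older releases; file and directory names are placeholders, and heterozygosity-related options are covered in the Canu docs):

# Minimal Canu run sketch for a ~2.5 Gb Nanopore dataset (names are placeholders)
canu -p fish -d fish_canu_out genomeSize=2.5g -nanopore ont_reads.fq.gz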

ADD COMMENT
