Large number of contigs after assembly
1
0
Entering edit mode
14 months ago
boymin2020 ▴ 50

Hi all,

Recently I have been working on a genome assembly of a fish. The estimated genome size from GenomeScope is 2,509 Mb, so it is a fairly large genome. After a first-pass assembly with wtdbg2, I got 47,000+ contigs. That seems far too many, and I am writing to ask for methods to reduce the number.

Thanks,

Nanopore genome assembly wtdbg2 sequencing • 1.7k views
1
Entering edit mode

What is the total assembly size across all your contigs? The assembly N50? Etc.

What is the read length distribution? Median? N50?

Has the genome been previously assembled with short reads?

0
Entering edit mode

I had better results with Flye than with wtdbg2, though that was with PacBio data. How much coverage do you have? Did you filter your reads somehow?

0
Entering edit mode

In order to get 100X coverage, we had a company produce nearly 250 GB of data across three batches (89 GB, 31 GB, and 129 GB). Since we always receive the CLEANED data from the company, we have not filtered the long reads ourselves.

0
Entering edit mode

What does 'cleaned' mean in this sense? 100X is a lot of coverage; you could downsample to the longest 50X and retry. But again (as I commented above), the stats of your assembly and reads will give hints as to why you have so many contigs.

0
Entering edit mode

'Cleaned' means that the reads we got from the sequencing company have at least had adapters removed and low-quality reads filtered out.

I tried downsampling to the longest reads as a trial, but it brought no improvement.

Below are the stats for the combined reads from the three sequencing batches, generated by NanoPlot:

General summary:
Total bases:              132,015,868,566.0
Number, percentage and megabases of reads above quality cutoffs
>Q5:    9342538 (100.0%) 132015.9Mb
>Q7:    9342529 (100.0%) 132015.9Mb
>Q10:   3234510 (34.6%) 44596.2Mb
>Q12:   96883 (1.0%) 602.9Mb
>Q15:   102 (0.0%) 0.4Mb
Top 5 highest mean basecall quality scores and their read lengths
1:  17.6 (778)
2:  16.7 (453)
3:  16.6 (356)
4:  16.4 (148)
5:  16.4 (609)
Top 5 longest reads and their mean basecall quality score
1:  172042 (8.1)
2:  162767 (9.3)
3:  160974 (9.4)
4:  158567 (7.4)
5:  158502 (9.2)


Below are the commands used to process the Illumina short reads with Jellyfish and GenomeScope2:

${jellyfish} count -C -m 21 -s 3G -t ${Ncores} -F 4 \
    <(zcat ${in_sr_fq1}) <(zcat ${in_sr_fq2}) <(zcat ${in_sr_fq3}) <(zcat ${in_sr_fq4}) \
    -o /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/mer_counts.jf
${jellyfish} histo -t ${Ncores} /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/mer_counts.jf \
    -o /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/reads.histo
${genomescope2} -k 21 -p 2 -i /public/home/LvZhenMing/min_handle/proj/ds/resultsFromJellyfish/reads.histo \
    -o /public/home/LvZhenMing/min_handle/proj/ds/resultsFromGenomescope2 -n genomeAssemble_ds_Genomescope2


0
Entering edit mode

What about stats for your highly fragmented assembly? Does downsampling and re-assembling at least improve the assembly somewhat?

0
Entering edit mode

The assembly generated from all three sequencing batches (PAG38564, PAG10123, PAG19859) has 53,178 contigs, while a re-assembly using only the PAG38564 batch has 43,837 contigs. So downsampling did improve the assembly, but the contig number is still far too large. I am completely confused by these stats.

0
Entering edit mode

That is a big reduction in contig number. However, you should use all the data and select only the longest reads from it.

The stats I think matter most right now are the total assembly size and the L90, N90, etc. Basically: is your assembly size what you expect, and is most of it in large contigs?
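Those numbers can be computed without any special tooling. A minimal sketch using only awk and sort (the FASTA path in the usage comment is a placeholder):

```shell
# Report total assembly size, N50/L50 and N90/L90 for a FASTA of contigs.
asm_stats() {
  # First awk: print the length of each (possibly multi-line) FASTA record.
  awk '/^>/{if(seq)print length(seq);seq=""}/^[^>]/{seq=seq $0}END{if(seq)print length(seq)}' "$1" \
    | sort -rn \
    | awk '{len[NR]=$1; total+=$1}
           END{
             cum=0
             for(i=1;i<=NR;i++){
               cum+=len[i]
               # N50/L50: first contig where the cumulative length reaches 50% of total.
               if(!n50 && cum>=0.5*total){n50=len[i]; l50=i}
               # N90/L90: same idea at the 90% threshold.
               if(!n90 && cum>=0.9*total){n90=len[i]; l90=i}
             }
             printf "size=%d N50=%d L50=%d N90=%d L90=%d\n", total, n50, l50, n90, l90
           }'
}

# Example (hypothetical file name):
# asm_stats wtdbg2_assembly.ctg.fa
```

If most of the total size sits in a small L90, the many remaining contigs are likely short fragments rather than a truly fragmented assembly.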

0
Entering edit mode

You are right; it is better to make use of all the sequencing data from the different batches.

The expected total genome size is 3-4 Gb with 76 chromosomes. The genome assessments show that this species has highly repetitive sequences and a high heterozygosity rate (>1%, as the above figure shows). Maybe I should try a different assembler, for example switching from wtdbg2 to Canu.

1
Entering edit mode

Downsampling from all the data will give you a better set of long reads. You can try the tool Filtlong.
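As a hedged sketch of such a Filtlong run (file names are placeholders; the 125 Gb target assumes roughly 50X of a ~2.5 Gb genome, and the command is guarded so it only runs where Filtlong is installed):

```shell
# Keep the best ~50X of reads by Filtlong's length/quality score,
# dropping very short reads first. Paths are hypothetical.
command -v filtlong >/dev/null && \
  filtlong --min_length 1000 --target_bases 125000000000 all_batches.fastq.gz \
  | gzip > reads_50x.fastq.gz
```

Filtlong scores reads by length and identity, so this keeps long, higher-quality reads from all three batches rather than simply the longest ones.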

OK, that is the expected size, but what is your assembly size? L90? N90? Again, this will tell you whether the majority of your assembly is in large contigs.

With a genome that size, Canu will take a very long time. You could try other fast assemblers like Raven, but you will probably not see a significant reduction in contig number. It also depends on whether you want to try phasing, but your first issue is the contig count.

0
Entering edit mode

What k-mer size was used in the genome assembly? In general, de novo assemblies are run at multiple k-mer lengths (AFAIK).

0
Entering edit mode

I used Jellyfish and GenomeScope with k=21 to estimate the genome size.

2
Entering edit mode
14 months ago

Canu or Shasta will get you a more contiguous assembly; wtdbg2 is fast, but not that contiguous.

You will need to check the excellent Canu docs to see whether your expected heterozygosity rate is outside the tool's expectations.

I would select only the top 50X longest reads to go into this. Short reads are mainly useful for correcting the longer reads.