Sequence duplication levels in de-novo assemblies
1
1
3.2 years ago
yp19 ▴ 70

Hi all! Is such a result ( https://imgur.com/DGxsFN7 ) concerning if the overall goal is to do de-novo genome assembly?

I continued with the data as is, assembled it, and predicted proteins, and I found that some proteins were duplicated (I am not sure whether this is caused by what we see above). I then used software for assembling heterozygous genomes and saw some improvement in the number of duplicated proteins, but I'm unsure whether this is the proper solution for this failed FastQC module. Any insight is greatly appreciated.

fastqc duplication de-novo • 1.6k views
2

Since you have extremely high coverage, you need to normalize your data before you do the assembly. You can use bbnorm.sh from the BBMap suite to do that. There is a guide available here.

1

Thanks for your suggestion. I tried this out. Only problem is, my assembly statistics are worse after normalization (N50 is decreased by more than half), so I am tempted to avoid this step. Any other suggestion for dealing with this high level of duplication?
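For reference, N50 is the contig length L such that contigs of length L or longer together cover at least half of the total assembly length. The sketch below shows one way to compute it; the contig lengths are made up for illustration and are not from this dataset.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (illustration only, not real data).
before_normalization = [900, 700, 400, 300, 200]
after_normalization = [400, 350, 300, 300, 250, 200, 200]

print(n50(before_normalization), n50(after_normalization))  # prints "700 300"
```

N50 only summarizes contiguity, so it is worth comparing it alongside total assembly size and the number of duplicated proteins rather than on its own.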

2

Is there a related (or same) genome available in public databases? You could try using it to guide your assembly.

As for the other result: even if N50 decreased by half, did it take out the duplications that you were concerned with?

0

Thanks! Yes, it took out the duplications, although I went from ~19 million (paired) reads to ~2 million after normalizing. The command I used was:

bbnorm.sh in=samp.fq.gz out=normalized.fq.gz target=100 min=5

0

So it sounds like read normalization worked.

0

The plot clearly indicates that removing the duplicated sequences would leave only 42.31% of the original data. For comparison, refer to the FastQC example report for bad Illumina data. Please attach your whole FastQC report.

0

Please do not delete posts. The purpose of this site is two-fold: more immediately, to help people with their questions; but in the long run, to serve as a repository of knowledge. The second purpose is defeated if people delete their questions.

2
3.2 years ago
predeus ★ 1.8k

You probably would want to map the reads back to the assembly and then evaluate sequence duplication. FastQC works with raw reads and has limited power to inform you about the nature of your problems.

If you still get a high duplication rate after paired-end mapping, you are probably dealing with PCR duplicates. That's likely to happen if there wasn't enough DNA during the library prep and a few too many PCR cycles were run. If most of the duplicates are optical, then there's a big problem with how your Illumina sequencer is set up. You can get all this info from Picard's MarkDuplicates; the same tool lets you remove duplicates.

But it's much more probable that the sequences are duplicated because of repeats in the genome. I'd suggest trying http://qb.cshl.edu/genomescope/ to evaluate your genome's haploid size, repetitiveness, and heterozygosity.
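GenomeScope works from a k-mer histogram: for each multiplicity, the number of distinct k-mers seen that many times in the reads. The sketch below only illustrates what that histogram is on a toy input. It is not how you would build one for real data, where a dedicated k-mer counter is the practical choice; the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k):
    """Map multiplicity -> number of distinct canonical k-mers with that multiplicity."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing N or other ambiguous bases.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Toy reads (illustration only).
histogram = kmer_histogram(["ACGTA", "ACGTA"], k=3)
for multiplicity in sorted(histogram):
    print(multiplicity, histogram[multiplicity])
```

In such a histogram, repeats show up as k-mers at multiples of the main coverage peak, which is one way to tell repeat-driven duplication apart from library artifacts.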

0

It seems like the duplication is there after mapping as well. Thanks for your suggestions; I will try out MarkDuplicates and GenomeScope to get a better understanding of the data.

1

If your data is not from a patterned flowcell (e.g. HiSeq 4000, X or NovaSeq), then optical duplicates are much less of a concern and you would mostly have PCR duplicates. Take a look at clumpify.sh from the BBMap suite, which lets you identify duplicates (PCR, optical) without doing alignments: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
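To make "optical" concrete: two reads that are already known to be duplicates are usually called optical duplicates when they also come from the same tile and sit very close together on the flowcell. A minimal sketch of that proximity check follows, assuming Illumina-style read names of the form instrument:run:flowcell:lane:tile:x:y. The function names, the example read names, and the per-axis distance rule are assumptions for illustration; real tools apply their own distance rules and thresholds.

```python
def parse_position(read_name):
    """Split an Illumina-style read name into (location key, x, y).

    Assumes the layout instrument:run:flowcell:lane:tile:x:y,
    optionally followed by a space and further fields.
    """
    fields = read_name.lstrip("@").split()[0].split(":")
    location = tuple(fields[:5])  # instrument, run, flowcell, lane, tile
    return location, int(fields[5]), int(fields[6])


def physically_close(name_a, name_b, max_pixel_distance=100):
    """True if both reads are on the same tile and within the given
    distance on both axes. The threshold is an illustrative value."""
    loc_a, xa, ya = parse_position(name_a)
    loc_b, xb, yb = parse_position(name_b)
    if loc_a != loc_b:
        return False
    return abs(xa - xb) <= max_pixel_distance and abs(ya - yb) <= max_pixel_distance


# Hypothetical read names (illustration only).
read_1 = "@M1:12:FCX:1:1101:1000:2000 1:N:0:1"
read_2 = "@M1:12:FCX:1:1101:1050:2040 1:N:0:1"
read_3 = "@M1:12:FCX:1:2205:1050:2040 1:N:0:1"

print(physically_close(read_1, read_2))  # True: same tile, nearby coordinates
print(physically_close(read_1, read_3))  # False: different tile
```

Duplicates that fail this check are the ones normally attributed to PCR during library prep.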

0

Thanks! Could you briefly explain the difference between the two kinds of duplicates (i.e. optical vs. PCR)? I don't think I understand it. Also, is it possible that I am seeing this high level of duplication because of the high coverage?

1

Possibly. But if either Picard or Clumpify marks most of these as PCR duplicates (same start and end for both paired-end reads), then, as predeus hinted, this may be due to over-amplification during library prep.
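The "same start and end for both paired-end reads" criterion can be sketched as grouping aligned pairs by the coordinates of the fragment they span. This is a simplified illustration and not how Picard or Clumpify are implemented; the tuple layout, function name, and example records are hypothetical.

```python
from collections import defaultdict


def flag_duplicate_pairs(pairs):
    """Return the names of read pairs that share fragment coordinates
    with an earlier pair.

    Each pair is (name, chrom, fragment_start, fragment_end, strand).
    The first pair seen for each coordinate key is kept as the original.
    """
    groups = defaultdict(list)
    for name, chrom, frag_start, frag_end, strand in pairs:
        groups[(chrom, frag_start, frag_end, strand)].append(name)

    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # everything after the first is a duplicate
    return duplicates


# Hypothetical aligned pairs (illustration only).
aligned_pairs = [
    ("pair1", "contig_1", 100, 450, "+"),
    ("pair2", "contig_1", 100, 450, "+"),  # same fragment as pair1
    ("pair3", "contig_1", 100, 470, "+"),  # same start, different end
    ("pair4", "contig_2", 100, 450, "+"),  # different contig
]

print(sorted(flag_duplicate_pairs(aligned_pairs)))  # ['pair2']
```

With very deep coverage, independent fragments can share both ends by chance, so some pairs flagged this way are not true PCR duplicates.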

0

Awesome, thank you for the link! To update: I ran Clumpify to remove optical and PCR duplicates, and in each case only a small percentage of reads was removed. The percentage of sequences remaining if deduplicated increased by ~15 points in each case (from ~42% to ~57%).

0

If you removed duplicates, then the number of sequences remaining (if deduplicated) should go down. How did it go up? It looks like you simply have extreme coverage and no library artifacts.

0

The number of sequences did go down. The second sentence (% of sequences remaining if deduplicated) refers to the value shown in the FastQC sequence duplication levels plot (I re-ran FastQC after Clumpify). Since some of the duplicated reads were removed, we get a higher percentage of sequences remaining on that plot. I agree that it seems to be a coverage issue rather than a library artifact.
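A toy calculation shows why the read count can go down while the "% remaining if deduplicated" value goes up. It treats the value simply as distinct sequences divided by total sequences, which is a simplification of what FastQC reports (FastQC estimates it from a subset of the file). The reads and function names are hypothetical.

```python
from collections import Counter


def percent_remaining_if_deduplicated(reads):
    """Distinct sequences as a percentage of all sequences."""
    if not reads:
        return 0.0
    return 100.0 * len(set(reads)) / len(reads)


def cap_copies(reads, max_copies):
    """Keep at most max_copies of each sequence (a partial deduplication)."""
    seen = Counter()
    kept = []
    for read in reads:
        if seen[read] < max_copies:
            seen[read] += 1
            kept.append(read)
    return kept


# Toy data: 10 reads, 3 distinct sequences (illustration only).
reads = ["AAAA"] * 6 + ["CCCC"] * 3 + ["GGGG"]
partially_deduplicated = cap_copies(reads, max_copies=2)

print(len(reads), percent_remaining_if_deduplicated(reads))
# prints "10 30.0"
print(len(partially_deduplicated), percent_remaining_if_deduplicated(partially_deduplicated))
# prints "5 60.0"
```

The number of distinct sequences stays the same while the total shrinks, so the ratio rises even though reads were removed.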