Question: Sequence duplication levels in de-novo assemblies
yp19 wrote:

Hi all! Is such a result ( https://imgur.com/DGxsFN7 ) concerning if the overall goal is to do de-novo genome assembly?

I continued with the data as is, assembled it, and predicted proteins; I found that some proteins were duplicated (I'm not sure whether this is caused by what we see above). I then used a software package for assembling heterozygous genomes and saw some improvement in the number of duplicated proteins, but I'm unsure whether this is the proper solution for this failed FastQC module. Any insight is greatly appreciated.

Additional info: I have 200 bp paired-end reads and high coverage (~1000X).

Tags: duplication, fastqc, de-novo
genomax replied:

Since you have extremely high coverage, you need to normalize your data before you assemble. You can use bbnorm.sh from the BBMap suite to do that. There is a guide available here.
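
For paired files, a minimal invocation might look like the sketch below (file names and depth values are placeholders, not from this thread):

# Normalize to ~100x target depth; discard reads whose k-mer depth is below 5
bbnorm.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz \
    out=norm_R1.fq.gz out2=norm_R2.fq.gz \
    target=100 min=5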

yp19 replied:

Thanks for your suggestion; I tried it out. The only problem is that my assembly statistics are worse after normalization (the N50 decreased by more than half), so I am tempted to skip this step. Any other suggestions for dealing with this high level of duplication?

genomax replied:

Is there a related (or the same) genome available in public databases? You could try using it to guide your assembly.

As for the other result: even though the N50 decreased by half, did normalization take out the duplications you were concerned about?

yp19 replied:

Thanks! Yes, it took out the duplications, although I went from ~19 million (paired) reads to ~2 million after normalizing. The command I used was:

bbnorm.sh in=samp.fq.gz out=normalized.fq.gz target=100 min=5

genomax replied:

So it sounds like read normalization worked.

1234anjalianjali1234 replied:

The plot indicates that if you removed the duplicated sequences from your data, only 42.31% of the original data would remain. Compare it against the FastQC example report for bad Illumina data. Please attach your whole FastQC report.

h.mon replied:

Please do not delete posts. The purpose of this site is two-fold: more immediately, to help people with their questions, and in the long run, to serve as a repository of knowledge. The second purpose is defeated if people delete their questions.

predeus wrote:

You probably want to map the reads back to the assembly and then evaluate sequence duplication there. FastQC works on raw reads and has limited power to tell you about the nature of your problem.
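
If it helps, mapping back and sorting could look roughly like this (a sketch with placeholder file names, assuming bwa and samtools are available):

# Index the draft assembly, map the paired reads, and produce a sorted, indexed BAM
bwa index assembly.fasta
bwa mem -t 8 assembly.fasta reads_R1.fq.gz reads_R2.fq.gz | samtools sort -o mapped_sorted.bam -
samtools index mapped_sorted.bam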

If you still see a high duplication rate after paired-end mapping, you are probably dealing with PCR duplicates. That is likely to happen when there wasn't enough DNA during library prep and a few too many PCR cycles were run. If most of the duplicates are optical, then there is a big problem with how your Illumina sequencer is set up. You can get all of this information from Picard's MarkDuplicates; the same tool also lets you remove the duplicates.
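
A typical MarkDuplicates run (file names are placeholders) might be:

# Mark duplicates on the sorted BAM; dup_metrics.txt reports optical and other (likely PCR) duplicates separately
# Add REMOVE_DUPLICATES=true to drop the duplicates instead of just flagging them
java -jar picard.jar MarkDuplicates I=mapped_sorted.bam O=marked.bam M=dup_metrics.txt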

But it is much more probable that the sequences are duplicated due to the presence of repeats. I'd suggest trying http://qb.cshl.edu/genomescope/ to evaluate your genome's haploid size, repetitiveness, and heterozygosity.
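
GenomeScope takes a k-mer histogram as input; one common way to generate it is with jellyfish (a sketch with illustrative k, hash size, and thread count; note jellyfish wants uncompressed FASTQ):

# Count canonical 21-mers, then write the histogram that GenomeScope accepts as input
jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf reads_R1.fq reads_R2.fq
jellyfish histo -t 8 reads.jf > reads.histo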

yp19 replied:

It seems like the duplication is there after mapping as well. Thanks for your suggestions; I will try out MarkDuplicates and GenomeScope to get a better understanding of the data.

genomax replied:

If your data is not from a patterned flowcell (e.g. HiSeq 4000, HiSeq X, or NovaSeq), then you don't need to worry about optical duplicates; you would only have PCR duplicates. Take a look at clumpify.sh from the BBMap suite, which lets you identify duplicates (PCR and optical) without doing alignments: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
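
For example (a sketch with placeholder file names):

# Remove all exact duplicates (PCR + optical) without alignment
clumpify.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=dedup_R1.fq.gz out2=dedup_R2.fq.gz dedupe
# Or restrict to optical duplicates only (within dupedist pixels on the flowcell)
clumpify.sh in=reads_R1.fq.gz in2=reads_R2.fq.gz out=optdedup_R1.fq.gz out2=optdedup_R2.fq.gz dedupe optical dupedist=40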

yp19 replied:

Thanks! Could you briefly explain the difference between the two kinds of duplicates (i.e. optical vs. PCR)? I don't think I understand it. Also, is it possible that I am seeing this high level of duplication because of the high coverage?

genomax replied:

See: Duplicates on Illumina

Possibly. But if picard/clumpify marks most of these as PCR duplicates (same start and end for both reads of a pair), then, as predeus hinted, this may be due to over-amplification during library prep.

yp19 replied:

Awesome, thank you for the link! To update: I ran clumpify to remove optical and PCR duplicates, and in each case only a small percentage of reads was removed. The "percent of seqs remaining if deduplicated" increased by ~15 percentage points in each case (from ~42% to ~57%).

genomax replied:

If you removed duplicates, then the number of sequences remaining (if deduplicated) should go down. How did it go up? It looks like you simply have extreme coverage, but no library artifacts.

yp19 replied:

The number of sequences did go down. The second sentence ("% of seqs remaining if deduplicated") refers to the value produced in the FastQC sequence duplication plot (I re-ran FastQC after clumpify). Since some of the duplicated reads were removed, we get a higher percentage of sequences remaining on that plot. I agree that this seems to be a coverage issue rather than a library artifact.
