Question: Removing fastq duplicates
C4 wrote, 7 weeks ago:

Hi, I would like to remove all the duplicates that show up in the FastQC report for my DNA libraries. I am using Clumpify from BBMap, which doesn't seem to remove all the duplicates with dedupe=t. I would like to remove all PCR duplicates, optical duplicates, and biological duplicates from my paired-end FASTQ files. What is the best way to do it? Thank you!

chip-seq next-gen sequencing

What are biological duplicates in FASTQ files?

written 7 weeks ago by _r_am

"Technical duplicates arising from PCR artefacts, or biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected." I took this from the explanation of duplicates in fastqc (link below) [1] I would like to basically remove all the duplicates seen in fastqc report, and clumpify doesn't seem to be working so well. Please let me know if something was unclear, I am fairly new to this [1]:https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html

written 7 weeks ago by C4

Are you sure that removing biological duplicates (which I guess are reads duplicated because they represent the meaningful relative abundance of the underlying biological sequences) is something you should be doing with ChIP-Seq data?

It makes sense to exclude such duplicates in whole-genome/exome sequencing, but not in RNA-seq, for example. Not every FastQC "error" needs to be acted on and ameliorated; in some cases they can be explained by perfectly rational experimental reasons.

written 7 weeks ago by _r_am

The library complexity of my ATAC-seq data (not RNA-seq) is quite low, and since we sequenced quite deeply, this has led to many duplicates. The proportion of biological duplicates is actually very low compared to PCR duplicates. Therefore, I would like to remove all the duplicates from the FASTQ files that showed up in FastQC. Could you suggest a tool that can efficiently remove PCR duplicates from paired-end FASTQ files? I tried Clumpify, which doesn't seem to be removing all of them, at least in my experience. Thanks for the advice.

written 7 weeks ago by C4

I would first run MarkDuplicates from the Picard suite; the standard values should be fine here.

picard MarkDuplicates INPUT=<your_input> OUTPUT=<picard_output> METRICS_FILE=<your_txt_file> VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=TRUE REMOVE_DUPLICATES=FALSE

Then you can use samtools view to keep only reads that are not duplicated.

samtools view -q 30 -F 1804 -f 2 -b -h <picard_output> > <deduplicated.bam>

With the -F 1804 flag you exclude reads that are unmapped, have an unmapped mate, are secondary alignments, fail vendor quality checks, or are marked as duplicates.
With the -f 2 flag you keep only reads mapped in proper pairs.
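
For reference, 1804 is simply the sum of the standard SAM flag bits being excluded:

    4     read unmapped
    8     mate unmapped
    256   not primary alignment
    512   read fails vendor quality checks
    1024  read is a PCR or optical duplicate
    ----
    1804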

See the explanation of SAM flags at https://broadinstitute.github.io/picard/explain-flags.html

Note: you could also remove the duplicates directly in Picard by setting the REMOVE_DUPLICATES=TRUE option. However, I prefer to do it with samtools.
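
If you need deduplicated FASTQ files at the end, you could convert the filtered BAM back to paired FASTQ with samtools afterwards. A minimal sketch, with hypothetical output file names:

samtools collate -u -O <deduplicated.bam> | samtools fastq -1 dedup_R1.fastq.gz -2 dedup_R2.fastq.gz -0 /dev/null -s /dev/null -n -

Here samtools collate groups the pairs back together by read name, and samtools fastq writes them out as R1/R2 files (singletons and other reads are discarded in this sketch).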

Hope it helps!

written 7 weeks ago by jordi.planells

I appreciate this, but I was hoping to remove the duplicates from the FASTQ files themselves. Any suggestions for that?

written 6 weeks ago by C4

I am using Clumpify from BBMap, which doesn't seem to remove all the duplicates with dedupe=t

How did you reach that conclusion? What is the clumpify.sh command you are using?

written 7 weeks ago by GenoMax

Hi, I tested the output FASTQ using FastQC and saw that some duplicate reads were removed by Clumpify, but not all of them. This was my command for 100 bp R1/R2 reads:

clumpify.sh in1=fastq.gz in2=fastq.gz out1=clumped1.fastq.gz out2=clumped2.fastq.gz containment=f dedupe=t k=90 subs=2 -da

Should I be merging the paired-end FASTQ files with BBMerge before running Clumpify? Please let me know if I should make any changes to my command. Thanks!

written 6 weeks ago by C4

Clumpify looks at the sequence of the reads end-to-end when deciding whether something is a sequence duplicate. In the case of paired-end reads, it looks at both reads in order to make that decision during comparisons. I hope you were supplying the R1 and R2 files to the in1= and in2= options (it is not clear from your command line, where both are given as fastq.gz). What is the length of your reads? Using k=90 sounds overly long. Is there a reason to change that default?

I would say try the following first and then post the stats you get here.

clumpify.sh in1=R1_reads.fq.gz in2=R2_reads.fq.gz  out1=R1_clumped.fq.gz out2=R2_clumped.fq.gz dedupe

By default Clumpify allows 2 substitutions, but you can make the comparison strict by using the hdist=0 option.
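
For example, the strict version of the command above would be (same file names):

clumpify.sh in1=R1_reads.fq.gz in2=R2_reads.fq.gz out1=R1_clumped.fq.gz out2=R2_clumped.fq.gz dedupe hdist=0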

tested the output FASTQ using FastQC and saw that some duplicate reads were removed by Clumpify, but not all of them

FastQC does not look at the entire dataset when it makes its QC estimates, so you can't rely on it to conclude that Clumpify is not working. Can you post a screenshot of what your pre-Clumpify data looks like?

In case you have not seen it, the authors of FastQC have written a blog post about sequence duplication that you may want to take a look at.

written 6 weeks ago by GenoMax

Hi GenoMax, thanks a lot! I am using k=90 as my reads are 100 bp, so as to remove more exact duplicates. This is my FastQC duplication level before Clumpify: https://ibb.co/yRtkNY8 and this is after Clumpify: https://ibb.co/Bck0Gtm. I will also post the stats that I get after running Clumpify with the defaults.

written 6 weeks ago by C4

The value of k should be smaller than half the length of your reads; otherwise the program cannot identify enough k-mers to do the initial matching. With 100 bp reads, k=90 leaves only 100 - 90 + 1 = 11 possible k-mer positions per read.

so as to remove more exact duplicates

To do that you should just add hdist=0 to your command. That will allow only perfect sequence matches.

written 6 weeks ago by GenoMax

Would you suggest using BBMerge prior to running Clumpify dedupe on paired-end FASTQ files? I take this from the Clumpify guide, where they suggest merging with BBMerge and concatenating the merged and unmerged FASTQ files before Clumpify, since when paired-end reads are provided, Clumpify clumps based only on R1: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/clumpify-guide/ I have noticed that BBMerged paired-end FASTQs work like a charm for duplicate removal, compared to providing the paired-end files via the in1 and in2 streams. However, I would not know how to unmerge the merged FASTQ files; is there a tool for that? Perhaps I could cat the two paired-end FASTQs? (I ask because I realized Clumpify only marks a pair as duplicated if both reads of the pair are duplicates.) Thanks so much for your advice!

written 6 weeks ago by C4

If your reads are able to merge, then by all means follow Brian's advice on the guide page you linked above.

However, I would not know how to unmerge the merged fastq files, is there a tool for that?

There isn't a dedicated tool that will do that, but the following workaround should work (a consolidated sketch follows the list).

  1. Merge your reads with bbmerge. Grab a copy of the headers of the reads that were merged: grep "^@Seq_ID" merged_file | sed 's/@//' > merged_headers.
  2. Add the unmerged reads to the merged ones and then run clumpify.
  3. Get the read headers from the clumped file: grep "^@Seq_ID" clumped_file | sed 's/@//' > clumped_headers.
  4. Find which read headers are common between merged_headers and clumped_headers using comm (note that comm requires sorted input). Something like comm -12 merged_headers clumped_headers > file_with_headers_that_were_common.
  5. Take those merged read headers (which we know survived clumping) and retrieve their original paired-end reads from the original R1 and R2 files (prior to merging) using filterbyname.sh with names=file_with_headers_that_were_common.
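
A consolidated sketch of the steps above, assuming the BBTools scripts are on your PATH; all file names are hypothetical, and details such as the interleaving step may need adjusting for your data:

# 1. Merge overlapping pairs; keep the unmerged pairs separately.
bbmerge.sh in1=R1.fq.gz in2=R2.fq.gz out=merged.fq.gz outu1=unmerged_R1.fq.gz outu2=unmerged_R2.fq.gz
# Record the headers of the merged reads (first field of every 4th FASTQ line), sorted for comm.
zcat merged.fq.gz | awk 'NR % 4 == 1 {print substr($1, 2)}' | sort > merged_headers
# 2. Interleave the unmerged pairs, pool everything, and deduplicate as single-ended reads.
reformat.sh in1=unmerged_R1.fq.gz in2=unmerged_R2.fq.gz out=unmerged_interleaved.fq.gz
cat merged.fq.gz unmerged_interleaved.fq.gz > pooled.fq.gz
clumpify.sh in=pooled.fq.gz out=clumped.fq.gz dedupe interleaved=f
# 3. Record the headers that survived clumping.
zcat clumped.fq.gz | awk 'NR % 4 == 1 {print substr($1, 2)}' | sort > clumped_headers
# 4. Keep the headers common to both lists (comm -12 prints lines present in both sorted files).
comm -12 merged_headers clumped_headers > common_headers
# 5. Pull the surviving pairs back out of the original R1/R2 files.
filterbyname.sh in1=R1.fq.gz in2=R2.fq.gz out1=kept_R1.fq.gz out2=kept_R2.fq.gz names=common_headers include=t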
written 6 weeks ago by GenoMax

Hi GenoMax, thanks a lot for this. Your workaround works really well; however, I am losing many reads, which is lowering my mapping percentage. I believe that is just because I do not have very good data. Could you tell me why you suggested finding the headers common to the merged and clumped files? Why couldn't I just use all the clumped FASTQ headers (both merged and unmerged)?

written 6 weeks ago by C4

Could you tell me why you suggested finding the headers common to the merged and clumped files?

Since you had originally asked

I would not know how to unmerge the merged FASTQ files

As we can't "unmerge" merged data, we have to go back to the original data and get the reads from there. By taking the intersection of the clumped and merged headers, we can find out which merged headers made it into the final clumped data.

written 6 weeks ago by GenoMax

My understanding was that dedupe would work better if I BBMerged R1+R2. However, wouldn't clumped_headers contain all the FASTQ headers that got clumped and deduplicated, merged and unmerged alike? So I could just use clumped_headers to filter R1 and R2, instead of only the merged+clumped headers.

written 6 weeks ago by C4

If you tried that and it works, then no problem. Since you were going to pool the "bbmerged" reads and the singletons, I was being careful.

written 6 weeks ago by GenoMax