Question

Removing fastq duplicates

2

Entering edit mode

3.4 years ago

C4 ▴ 30

Hi I would like to remove all the duplicates that show up in the fastqc report for my DNA libraries. I am using Clumpify from bbmap which doesn't seem to remove all the duplicates with -dedup=T. I would like to remove all PCR duplicates, optical duplicates, and biological duplicates from my paired-end fastq files. What is the best way to do it? Thank you!

sequence next-gen ChIP-Seq • 8.5k views

ADD COMMENT • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

What are biological duplicates in FASTQ files?

ADD REPLY • link 3.4 years ago by Ram 43k

0

Entering edit mode

"Technical duplicates arising from PCR artefacts, or biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected." I took this from the explanation of duplicates in fastqc (link below) [1] I would like to basically remove all the duplicates seen in fastqc report, and clumpify doesn't seem to be working so well. Please let me know if something was unclear, I am fairly new to this [1]:https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html

ADD REPLY • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

Are you sure you should be removing biological duplicates (which I guess are reads duplicated because they represent meaningful relative abundance of underlying biological sequences) is something you should be doing with ChIP-Seq data?

It makes sense to exclude such duplicates in Whole Genome/Exome Sequencing, but not in RNAseq, for example. Not all of FASTQC's "errors" are to be acted on and ameliorated in every case. In some cases, they can be explained by perfectly rational experimental reasons.

ADD REPLY • link 3.4 years ago by Ram 43k

0

Entering edit mode

The library complexity of my ATAC-seq (not RNA-seq) is quite low, and since we sequenced quite deeply, it has led to many duplications. The overall biological duplicates present is actually very low as compared to PCR duplicates. Therefore, I would like to remove all the duplicates from fastq files that showed up on fastqc. Please let me know a tool that can efficiently remove PCR duplicates from paired-end fastq files? I tried clumpify which doesn't seem to be removing all of them, at least in my experience. Thanks for the advise.

ADD REPLY • link 3.4 years ago by C4 ▴ 30

1

Entering edit mode

I would first markDuplicates from picard suite, standard values should be fine here.

picard MarkDuplicates INPUT=<your_input> OUTPUT=<picard_output> METRICS_FILES=<your_txt_file> VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=TRUE REMOVE_DUPLICATES=FALSE

Then you can use samtools view to keep only reads that are not duplicated.

samtools view -q 30 -F 1804 -f 2 -b -h <picard_output> > <deduplicated.bam>

With -F flag you are excluding the reads marked as duplicates, not mapped, not in proper pairs.
With -f flag you are including the reads mapped in proper pairs.

See explanation of sam flags here

Note: You could also remove the duplicates directly from picard by setting the REMOVE_DUPLICATES=TRUE option. However, I prefer to do it with samtools.

Hope it helps!

ADD REPLY • link updated 3.4 years ago by Ram 43k • written 3.4 years ago by jordi.planells ▴ 480

0

Entering edit mode

I appreciate this, but was hoping to remove duplicates from fastqs. Any suggestions for that?

ADD REPLY • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

I am using Clumpify from bbmap which doesn't seem to remove all the duplicates with -dedup=T

How did you reach that conclusion? What is the clumpify.sh command you are using?

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

Hi, I tested the output fastq using fastqc and saw that some reads were removed by clumpify but not all of them. This was my command for 100bp R1/R2:

clumpify.sh in1=fastq.gz in2=fastq.gz out1=clumped1.fastq.gz out2=clumped2.fastq.gz containment=f dedupe=t k=90 subs=2 -da

Should I be merging the paired-end fastq files before clumpify using bbmerge? Please let me know if I could make any changes to my command. Thanks!

ADD REPLY • link updated 3.4 years ago by Ram 43k • written 3.4 years ago by C4 ▴ 30

0

Entering edit mode

clumpify is looking at the sequence of reads end-to-end when it is deciding if something is a sequence duplicate. In case of paired-end reads it looks at both reads in order to make that decision during comparisons. I hope you were supplying R1 and R2 files to in1= and in2= option (it is not clear from your command line). What is the length of your reads? Using k=90 sounds overly long. Is there a reason to change that default?

I would say try following first and then post the stats you get here.

clumpify.sh in1=R1_reads.fq.gz in2=R2_reads.fq.gz  out1=R1_clumped.fq.gz out2=R2_clumped.fq.gz dedupe

By default clumpify allows 2 errors but you can make the comparison strict by using hdist=0 option.

tested the output fastq using fastqc and saw that some reads were removed by clumpify but not all of them

FastQC does not look at entire dataset when it does QC estimates so you can't rely on it and say that clumpify is not working. Can you post a screenshot of what your pre-clumpify data looks like?

In case you have not seen this blog post from authors of FastQC about sequence duplication you may want to take a look.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

Hi geomax, thanks a lot! I am using k=90 as my reads are 100bps, so as to remove more exact duplicates. This is my fastqc duplication before clumpify: https://ibb.co/yRtkNY8 This is after clumpify: https://ibb.co/Bck0Gtm I will also post the stats that I get after using cumpify with default

ADD REPLY • link 3.4 years ago by C4 ▴ 30

1

Entering edit mode

The value of k should be smaller than 1/2 length of your reads otherwise how would the program be able to identify k-mers to do initial matching.

so as to remove more exact duplicates

to do that you should just add hdist=0 to your command. That will only allow perfect sequence matches.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

Would you suggest using bbmerge prior to using clumpify dedup for paired-end fastq files. I take this from here, where they suggest using bbmerge and concatenating merged and unmerged fastq files before clumpify, since when paired-end reads are provided, clumpify clumps only based on R1 https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/clumpify-guide/ I have noticed that bbmerged paired-end fastqs work like a charm for duplicate removal, as comapred to provding paired-end files with in1 and in2 streams. However, I would not know how to unmerge the merged fastq files, is there a tool for that? Perhaps, I could cat the two paired-end fastqs? (This is because I realized clumpify would only mark a read as duplicate if it is present in both read pairs). Thanks so much for your advise!

ADD REPLY • link 3.4 years ago by C4 ▴ 30

1

Entering edit mode

If your reads are able to merge then by all means follow Brian's advice on the guide page you linked above.

However, I would not know how to unmerge the merged fastq files, is there a tool for that?

While there isn't a named tool that would do that but the following workaround should work.

Merge your reads with bbmerge. Grab a copy of the read headers of the reads that were merged by going a grep "^@Seq_ID" merged file | sed 's/@//' > merged_headers.
Add unmerged reads to the merged and then run clumpify.
Get the read headers from the clumped file. grep "^@Seq_ID" clumped file | sed 's/@//' > clumped_headers.
Find which read headers are common betwenn merged_headers and clumped_header by using comm. Something like comm -12 merged_headers clumped_headers > file_with_headers_that_were_common.
Take those merged read headers (which we know survived clumping) and retrieve their original paired-end reads from original R1 and R2 (prior to merging) files using filterbyname.sh using names=file_with_headers_that_were_common.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

Hi GenoMax, thanks a lot for this. Your work-around works really well, however I am losing many reads, because of which my mapping% is decreasing. I believe that is just because I do not have very good data. Could you tell me why you suggested finding common between merged and clumped? Why couldn't I just use all the clumped fastq headers (including merged and unmerged)?

ADD REPLY • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

Could you tell me why you suggested finding common between merged and clumped?

Since you had originally asked

I would not know how to unmerge the merged fastq files

As we can't "unmerge" merged data we have to go back to the original data and get the reads from there. By doing an intersection of clumped and merged headers we can find out which headers made it to final clumped data.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

My understanding was that dedupe would work better if I bbmerge R1+R2. However, wouldn't clumped_headers have all the R1 and R2 fastq headers that got clumped and deduped. So, I could just use those clumped_headers to filter out R1 and R2, instead of only merged+clumped headers.

ADD REPLY • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

If you tried that and it works then no problem. Since you were going to merge "bbmerged" reads and singletons I was being careful.

ADD REPLY • link 3.4 years ago by GenoMax 141k