Duplicate Paired-End Illumina Reads
3
5
Entering edit mode
11.3 years ago
Yannick Wurm ★ 2.4k

Removing duplicate reads from data appears to be important to reduce the impact of a heterogenous PCR step prior to sequencing.

Tools for doing this on single-end reads include fastx_collapser.

And if paired data is mapped to a reference, maq rmdup or others can do the trick.

But if you're doing de novo assembly, you want to remove duplicates before doing anything else. Is there a tool that does this automagically on paired-end reads?

Cheers, yannick

next-gen sequencing paired short • 16k views
1
Entering edit mode

Keith's suggestion works, but fastx_collapser actually needs to be replaced by brentp's code (See Is There A Fastq Alternative To Fastx_Collapser (Outputs Fasta)? ) and galaxy's fastq joiner is in slow as hell python.

0
Entering edit mode

So Keith's suggestion works - but galaxy's fastq joiner/splitter are slow as hell.

3
Entering edit mode
11.3 years ago

I guess that there isn't a suitable reference and your definiton of "duplicate" is simply that the forward and reverse reads are identical. In that case, you could use Galaxy's Fastq Joiner to temporarily merge the two reads, then use fastx_collapser to eliminate duplicates before using Galaxy's Fastq Splitter to restore the separate reads. Both ends have to be the same length to use Splitter/Joiner.

1
Entering edit mode

Most of the galaxy components can be run on the command line if you have a local install. For python scripts make sure the PYTHONPATH environmental variable is set to your lib folder. e.g. PYTHONPATH="/opt/galaxy_dist/lib/"

1
Entering edit mode

Also, it seems to assume an Illumina quality metric. When I ran it on a file using the Sanger (i.e. Phred) quality metric, it exited complaining that the decoded value was out of range,

1
Entering edit mode

What's more, it rejects valid Fastq files that have lower case nucleotide sequence characters.

0
Entering edit mode

hmmm can fastq joiner be run outside of galaxy? (ie in the command line?)

0
Entering edit mode

I don't know whether or not you still have to run Galaxy components via a web server when you have a local installation.

0
Entering edit mode

Thanks guys! Downloading Galaxy and setting "export PYTHONPATH=PYTHONPATH:/my/galaxy-dist/lib/" was indeed sufficient. It's a pity though that galaxy's tools don't offer help in the commandline.

0
Entering edit mode

hmmm let me correct myself on that one... fastx_collapser does not do the job correctly: it invariably converts my fastq to fasta.

0
Entering edit mode

whoopdeedoo another question

2
Entering edit mode
9.5 years ago
Pengfei Yu ▴ 40

Hi Yannick,

I also faced you problem when dealing with pair-end data. And following Keith's idea, I wrote a simple python script to achieve this goal. (Here I only select the first appearance for duplicated reads with same sequences in both directions.)

The script need library of Biopython and itertools.

Hope it could also help others.

#!/usr/bin/env python
# yupf05@gmail.com

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import itertools
import os,sys, argparse

def ParseArg():
p=argparse.ArgumentParser( description = 'Remove duplicated reads which have same sequences for both forward and reverse reads. Choose the one appears first.', epilog = 'Library dependency: Bio, itertools')
if len(sys.argv)==1:
print >>sys.stderr,p.print_help()
exit(0)
return p.parse_args()

def Main():
Unique_seqs=set()
args=ParseArg()
outfile1 = open("Rm_dupPE_"+args.input1,"w")
outfile2 = open("Rm_dupPE_"+args.input2,"w")
fastq_iter1 = SeqIO.parse(open(args.input1),"fastq")
fastq_iter2 = SeqIO.parse(open(args.input2),"fastq")
for rec1, rec2 in itertools.izip(fastq_iter1, fastq_iter2):
if str((rec1+rec2).seq) not in Unique_seqs:
SeqIO.write(rec1,outfile1,"fastq")
SeqIO.write(rec2,outfile2,"fastq")
outfile1.close()
outfile2.close()

if __name__ == '__main__':
Main()

0
Entering edit mode

Thank you Pengfei. I am afraid however that your code may run out of RAM? Do you know how much RAM would be required if you have a Hiseq lane (40 gigabytes * 2)? Cheers yannick

2
Entering edit mode

Hi Yannick, sorry, I never logged in this forum after the previous post. The huge set may consume large memories. I worked on the server of our lab which has RAM of ~150Gb. This process normally takes less than 5% of the RAM. If you want to find better choice, I would recommend "FastUniq" as introduced in this paper: http://www.ncbi.nlm.nih.gov/pubmed/23284954.

An update: the FastUniq is faster than my python script but consuming much more memory. My script usually need the memory of ~60% of the total size of output fastq files after removing PCR duplicates. So if the two fastq files after removing PCR duplicates have total size of 2.5GB, the script takes maximum memory of around 1.5GB. But the FastUniq need more than 5GB to get the final result.

1
Entering edit mode
7.8 years ago

Hi, I quickly wrote the following one-liners for dereplicating FASTQ files and also added them to my one-liners page.

Perhaps this is what you are looking for?

forward.fastq --> dereplication --> forward_derep.fasta

$head forward_derep.fasta >55961afb4a0df6979393aa7f2ad02cac;size=340; CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT >c5a3043fda41c9b4f972826c3bba8154;size=263; CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT >12071fab7db0b36abcc540059f9bb78b;size=119; CCTACGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAG >76d3994572a1c37336f2b4dd897434a7;size=107; CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGAAGACTGTCCTAAGGATTGTAAACTTCTTTTATACGGGAATAACGGGCGATACGAGTATTGCATTGAATGTACCGTAAGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTTGTTCGGTA >c6e34aa95f6b068fcd12e3308e92815a;size=104; CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAGGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGTTTTCCACGTGTGGAATTTTGTATGTACCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTA  forward.fastq + reverse.fastq --> dereplication --> forward_derep.fasta + reverse_derep.fasta Explanation: • I push each line to @a array, and then replace it with a slice of itself. We do @a = @a[@a-4..$#a], which means, replace @a with last 4 elements of @a.
• if ($. % 4 == 0) means that after every fourth line my @a is populated with a read • For paired-end reads I am concatenating both forward and reverse reads together using paste which inserts a \t. For dereplication, I am simply concatenating them using : as a delimiter, and storing the concatenating string in a hash $d{} with counts
• I am using Digest::MD5=md5_hex to generate a checksum as FASTA header name
• size=* is in usearch format where * represents the abundance count of such reads
• for paired-end reads, checksum is same for both forward and reverse reads, the only difference being 1 and 2 in FASTA headers