Duplicate Paired-End Illumina Reads
13.5 years ago
Yannick Wurm

Removing duplicate reads from data appears to be important for reducing the impact of a heterogeneous PCR step prior to sequencing.

Tools for doing this on single-end reads include fastx_collapser.

And if paired data is mapped to a reference, maq rmdup or others can do the trick.

But if you're doing de novo assembly, you want to remove duplicates before doing anything else. Is there a tool that does this automagically on paired-end reads?

Cheers, yannick


Keith's suggestion works, but fastx_collapser actually needs to be replaced by brentp's code (see Is There A Fastq Alternative To Fastx_Collapser (Outputs Fasta)?), and Galaxy's fastq joiner is written in slow-as-hell Python.


So Keith's suggestion works, but Galaxy's fastq joiner/splitter are slow as hell.

13.5 years ago

I guess that there isn't a suitable reference and your definition of "duplicate" is simply that the forward and reverse reads are identical. In that case, you could use Galaxy's Fastq Joiner to temporarily merge the two reads, then use fastx_collapser to eliminate duplicates, and finally use Galaxy's Fastq Splitter to restore the separate reads. Both ends have to be the same length to use Splitter/Joiner.
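
For illustration, here is a minimal stdlib-only sketch of the same join/collapse/split idea done in memory, assuming plain 4-line FASTQ records; the file and function names are hypothetical, not from any of the tools mentioned:

#!/usr/bin/env python
# Sketch: drop duplicate read pairs (identical forward AND reverse sequence),
# keeping the first occurrence. Assumes standard 4-line FASTQ records.
import sys

def fastq_records(path):
    """Yield one 4-line FASTQ record at a time as a list of lines."""
    with open(path) as fh:
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:  # end of file
                return
            yield record

def dedup_pairs(fwd, rev):
    seen = set()
    with open("derep_" + fwd, "w") as out1, open("derep_" + rev, "w") as out2:
        for rec1, rec2 in zip(fastq_records(fwd), fastq_records(rev)):
            key = rec1[1] + rec2[1]  # the "joined" read: forward + reverse sequence
            if key not in seen:      # first occurrence wins
                seen.add(key)
                out1.writelines(rec1)
                out2.writelines(rec2)

if __name__ == "__main__":
    dedup_pairs(sys.argv[1], sys.argv[2])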


Most of the Galaxy components can be run on the command line if you have a local install. For Python scripts, make sure the PYTHONPATH environment variable is set to your lib folder, e.g. PYTHONPATH="/opt/galaxy_dist/lib/".


Also, it seems to assume the Illumina quality encoding. When I ran it on a file using the Sanger (i.e. Phred) quality encoding, it exited complaining that the decoded value was out of range.


What's more, it rejects valid FASTQ files that have lower-case nucleotide sequence characters.


hmmm can fastq joiner be run outside of Galaxy (i.e. on the command line)?


I don't know whether or not you still have to run Galaxy components via a web server when you have a local installation.


Thanks guys! Downloading Galaxy and setting "export PYTHONPATH=$PYTHONPATH:/my/galaxy-dist/lib/" was indeed sufficient. It's a pity, though, that Galaxy's tools don't offer help on the command line.


hmmm let me correct myself on that one... fastx_collapser does not do the job correctly: it invariably converts my FASTQ to FASTA.


whoopdeedoo another question

11.7 years ago
Pengfei Yu

Hi Yannick,

I also faced the same problem when dealing with paired-end data. Following Keith's idea, I wrote a simple Python script to achieve this goal. (Here I keep only the first occurrence of duplicated read pairs whose sequences are identical in both directions.)

The script requires Biopython (itertools is part of the Python standard library).
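
If Biopython is not installed, it is available from PyPI:

$ pip install biopython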

Hope it could also help others.

#!/usr/bin/env python
# yupf05@gmail.com

from Bio import SeqIO
import sys
import argparse


def ParseArg():
    p = argparse.ArgumentParser(
        description='Remove duplicated read pairs that have identical sequences '
                    'for both forward and reverse reads. The first occurrence is kept.',
        epilog='Library dependency: Biopython')
    p.add_argument('input1', type=str, metavar='reads1',
                   help='forward input fastq file')
    p.add_argument('input2', type=str, metavar='reads2',
                   help='reverse input fastq file')
    if len(sys.argv) == 1:
        p.print_help(sys.stderr)
        sys.exit(0)
    return p.parse_args()


def Main():
    unique_seqs = set()  # concatenated forward+reverse sequences seen so far
    args = ParseArg()
    outfile1 = open("Rm_dupPE_" + args.input1, "w")
    outfile2 = open("Rm_dupPE_" + args.input2, "w")
    fastq_iter1 = SeqIO.parse(args.input1, "fastq")
    fastq_iter2 = SeqIO.parse(args.input2, "fastq")
    # iterate over both files in lockstep (use itertools.izip on Python 2)
    for rec1, rec2 in zip(fastq_iter1, fastq_iter2):
        key = str((rec1 + rec2).seq)
        if key not in unique_seqs:  # keep only the first occurrence of a pair
            SeqIO.write(rec1, outfile1, "fastq")
            SeqIO.write(rec2, outfile2, "fastq")
            unique_seqs.add(key)
    outfile1.close()
    outfile2.close()


if __name__ == '__main__':
    Main()
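
A usage example (the script file name and read file names here are hypothetical):

$ python Rm_dupPE.py reads_1.fastq reads_2.fastq

This keeps the first occurrence of each pair and writes the output to Rm_dupPE_reads_1.fastq and Rm_dupPE_reads_2.fastq.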

Thank you Pengfei. I am afraid, however, that your code may run out of RAM. Do you know how much RAM would be required for a HiSeq lane (40 gigabytes * 2)? Cheers, yannick


Hi Yannick, sorry, I never logged in to this forum after the previous post. The huge set may consume a lot of memory. I worked on our lab's server, which has ~150 GB of RAM, and this process normally takes less than 5% of it. If you want a better option, I would recommend FastUniq, as introduced in this paper: http://www.ncbi.nlm.nih.gov/pubmed/23284954.

An update: FastUniq is faster than my Python script but consumes much more memory. My script usually needs memory equal to ~60% of the total size of the output FASTQ files after PCR duplicates are removed. So if the two deduplicated FASTQ files have a total size of 2.5 GB, the script peaks at around 1.5 GB of memory, while FastUniq needs more than 5 GB to get the final result.
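
If memory does become the bottleneck, one possible tweak (a sketch of my own, not from the thread) is to store a fixed-size digest of each concatenated pair instead of the full sequence string, trading a vanishingly small chance of a hash collision for a constant 16 bytes per pair:

import hashlib

seen = set()

def is_new_pair(seq1, seq2):
    """True the first time this exact forward/reverse sequence pair is seen."""
    # 16-byte digest per pair in the set, regardless of read length
    digest = hashlib.md5((seq1 + ":" + seq2).encode()).digest()
    if digest in seen:
        return False
    seen.add(digest)
    return True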

10.0 years ago

Hi, I quickly wrote the following one-liners for dereplicating FASTQ files and also added them to my one-liners page.

Perhaps this is what you are looking for?

forward.fastq --> dereplication --> forward_derep.fasta

$ head forward_derep.fasta

>55961afb4a0df6979393aa7f2ad02cac;size=340;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>c5a3043fda41c9b4f972826c3bba8154;size=263;
CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>12071fab7db0b36abcc540059f9bb78b;size=119;
CCTACGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAG
>76d3994572a1c37336f2b4dd897434a7;size=107;
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGAAGACTGTCCTAAGGATTGTAAACTTCTTTTATACGGGAATAACGGGCGATACGAGTATTGCATTGAATGTACCGTAAGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTTGTTCGGTA
>c6e34aa95f6b068fcd12e3308e92815a;size=104;
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAGGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGTTTTCCACGTGTGGAATTTTGTATGTACCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTA

forward.fastq + reverse.fastq --> dereplication --> forward_derep.fasta + reverse_derep.fasta

Explanation (a Python sketch of this approach follows the list):

  • I push each line onto the @a array, and then replace it with a slice of itself: @a = @a[@a-4..$#a], which means "replace @a with the last 4 elements of @a".
  • if ($. % 4 == 0) means that after every fourth line, @a is populated with one complete read.
  • For paired-end reads I concatenate the forward and reverse reads using paste, which inserts a \t. For dereplication, I simply concatenate them using : as a delimiter and store the concatenated string in a hash $d{} with counts.
  • I am using Digest::MD5=md5_hex to generate a checksum to use as the FASTA header name.
    • size=* is in usearch format, where * represents the abundance count of such reads.
  • For paired-end reads, the checksum is the same for both forward and reverse reads; the only difference is the 1 and 2 in the FASTA headers.
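
The one-liners themselves do not appear in this copy of the thread, so the following is only a sketch, in Python, of the single-end dereplication the explanation describes; the MD5 headers and usearch-style size= annotations match the example output shown above:

#!/usr/bin/env python
# Count identical sequences, then write FASTA with an MD5 checksum as the
# header name and a "size=" abundance count, most abundant first. For the
# paired-end case, the forward and reverse sequences would be concatenated
# with ":" before hashing, as described in the explanation.
import hashlib
from collections import Counter

counts = Counter()
with open("forward.fastq") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
            counts[line.strip()] += 1

with open("forward_derep.fasta", "w") as out:
    for seq, n in counts.most_common():
        name = hashlib.md5(seq.encode()).hexdigest()
        out.write(">%s;size=%d;\n%s\n" % (name, n, seq))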