Question: Duplicate Paired-End Illumina Reads
5
gravatar for Yannick Wurm
7.9 years ago by
Yannick Wurm2.3k
Queen Mary University London
Yannick Wurm2.3k wrote:

Removing duplicate reads from data appears to be important to reduce the impact of a heterogenous PCR step prior to sequencing.

Tools for doing this on single-end reads include fastx_collapser.

And if paired data is mapped to a reference, maq rmdup or others can do the trick.

But if you're doing de novo assembly, you want to remove duplicates before doing anything else. Is there a tool that does this automagically on paired-end reads?

Cheers, yannick

ADD COMMENTlink modified 2.8 years ago by Biostar ♦♦ 20 • written 7.9 years ago by Yannick Wurm2.3k
1

Keith's suggestion works, but fastx_collapser actually needs to be replaced by brentp's code http://biostar.stackexchange.com/questions/2801/is-there-a-fastq-alternative-to-fastx-collapser-outputs-fasta and galaxy's fastq joiner is in slow as hell python.

ADD REPLYlink written 7.9 years ago by Yannick Wurm2.3k

So Keith's suggestion works - but galaxy's fastq joiner/splitter are slow as hell.

ADD REPLYlink written 7.9 years ago by Yannick Wurm2.3k
3
gravatar for iw9oel_ad
7.9 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

I guess that there isn't a suitable reference and your definiton of "duplicate" is simply that the forward and reverse reads are identical. In that case, you could use Galaxy's Fastq Joiner to temporarily merge the two reads, then use fastx_collapser to eliminate duplicates before using Galaxy's Fastq Splitter to restore the separate reads. Both ends have to be the same length to use Splitter/Joiner.

ADD COMMENTlink written 7.9 years ago by iw9oel_ad6.0k
1

Most of the galaxy components can be run on the command line if you have a local install. For python scripts make sure the PYTHONPATH environmental variable is set to your lib folder. e.g. PYTHONPATH="/opt/galaxy_dist/lib/"

ADD REPLYlink written 7.9 years ago by Alastair Kerr5.2k
1

Also, it seems to assume an Illumina quality metric. When I ran it on a file using the Sanger (i.e. Phred) quality metric, it exited complaining that the decoded value was out of range,

ADD REPLYlink written 7.9 years ago by iw9oel_ad6.0k
1

What's more, it rejects valid Fastq files that have lower case nucleotide sequence characters.

ADD REPLYlink written 7.9 years ago by iw9oel_ad6.0k

hmmm can fastq joiner be run outside of galaxy? (ie in the command line?)

ADD REPLYlink written 7.9 years ago by Yannick Wurm2.3k

I don't know whether or not you still have to run Galaxy components via a web server when you have a local installation.

ADD REPLYlink written 7.9 years ago by iw9oel_ad6.0k

Thanks guys! Downloading Galaxy and setting "export PYTHONPATH=PYTHONPATH:/my/galaxy-dist/lib/" was indeed sufficient. It's a pity though that galaxy's tools don't offer help in the commandline.

ADD REPLYlink written 7.9 years ago by Yannick Wurm2.3k

hmmm let me correct myself on that one... fastx_collapser does not do the job correctly: it invariably converts my fastq to fasta.

ADD REPLYlink written 7.9 years ago by Yannick Wurm2.3k

whoopdeedoo another question: http://biostar.stackexchange.com/questions/2801/is-there-a-fastq-alternative-to-fastx-collapser-outputs-fasta

ADD REPLYlink written 7.9 years ago by Yannick Wurm2.3k
2
gravatar for Pengfei Yu
6.0 years ago by
Pengfei Yu40
South Plainfield, NJ
Pengfei Yu40 wrote:

Hi Yannick,

I also faced you problem when dealing with pair-end data. And following Keith's idea, I wrote a simple python script to achieve this goal. (Here I only select the first appearance for duplicated reads with same sequences in both directions.) The script need library of "Biopython" and "itertools".

Hope it could also help others.

    #!/usr/bin/env python
    # yupf05@gmail.com

    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord
    import itertools
    import os,sys, argparse


    def ParseArg():
        p=argparse.ArgumentParser( description = 'Remove duplicated reads which have same sequences for both forward and reverse reads. Choose the one appears first.', epilog = 'Library dependency: Bio, itertools')
        p.add_argument('input1',type=str,metavar='reads1',help='forward input fastq/fasta file')
        p.add_argument('input2',type=str,metavar='reads2',help='reverse input fastq/fasta file')
        if len(sys.argv)==1:
            print >>sys.stderr,p.print_help()
            exit(0)
        return p.parse_args()


    def Main():

        Unique_seqs=set()
        args=ParseArg()
        outfile1 = open("Rm_dupPE_"+args.input1,"w")
        outfile2 = open("Rm_dupPE_"+args.input2,"w")
        fastq_iter1 = SeqIO.parse(open(args.input1),"fastq")
        fastq_iter2 = SeqIO.parse(open(args.input2),"fastq")
        for rec1, rec2 in itertools.izip(fastq_iter1, fastq_iter2):
            if str((rec1+rec2).seq) not in Unique_seqs:
                SeqIO.write(rec1,outfile1,"fastq")
                SeqIO.write(rec2,outfile2,"fastq")
                Unique_seqs.add(str((rec1+rec2).seq))
        outfile1.close()
        outfile2.close()

    if __name__ == '__main__':
        Main()
ADD COMMENTlink written 6.0 years ago by Pengfei Yu40

Thank you Pengfei. I am afraid however that your code may run out of RAM? Do you know how much RAM would be required if you have a Hiseq lane (40 gigabytes * 2)? Cheers yannick

ADD REPLYlink written 6.0 years ago by Yannick Wurm2.3k
2

Hi Yannick, sorry, I never logged in this forum after the previous post. The huge set may consume large memories. I worked on the server of our lab which has RAM of ~150Gb. This process normally takes less than 5% of the RAM. If you want to find better choice, I would recommend "FastUniq" as introduced in this paper: http://www.ncbi.nlm.nih.gov/pubmed/23284954.

An update: the FastUniq is faster than my python script but consuming much more memory. My script usually need the memory of ~60% of the total size of output fastq files after removing PCR duplicates. So if the two fastq files after removing PCR duplicates have total size of 2.5GB, the script takes maximum memory of around 1.5GB. But the FastUniq need more than 5GB to get the final result.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Pengfei Yu40
1
gravatar for umer.zeeshan.ijaz
4.4 years ago by
Glasgow, UK
umer.zeeshan.ijaz1.7k wrote:

Hi, I quickly wrote the following one-liners for dereplicating FASTQ files and also added them to my one-liners page: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html?#DEREP

Perhaps this is what you are looking for?

<script src="&lt;a href=" 10106498"="">10106498"></script>

forward.fastq --> dereplication --> forward_derep.fasta

$ head forward_derep.fasta

55961afb4a0df6979393aa7f2ad02cac;size=340; CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT c5a3043fda41c9b4f972826c3bba8154;size=263; CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT 12071fab7db0b36abcc540059f9bb78b;size=119; CCTACGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAG 76d3994572a1c37336f2b4dd897434a7;size=107; CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGAAGACTGTCCTAAGGATTGTAAACTTCTTTTATACGGGAATAACGGGCGATACGAGTATTGCATTGAATGTACCGTAAGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTTGTTCGGTA c6e34aa95f6b068fcd12e3308e92815a;size=104; CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAGGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGTTTTCCACGTGTGGAATTTTGTATGTACCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTA

<script src="&lt;a href=" 10106653"="">10106653"></script>

forward.fastq + reverse.fastq --> dereplication --> forward_derep.fasta + reverse_derep.fasta

$ head *_derep.fasta ==> forward_derep.fasta <==

948b3e6c0d9c5140a20b3e950baf513b 1;size=45; CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT 84ee339f8f7d47fb5278cc7d211c2856 1;size=32; CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT 087be525fc7401eb8dc965c0b0b20a83 1;size=30; CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT 5046bcb4cbabfb76f0f6966c0ada2695 1;size=30; CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT 792f79536de130a5e6ded001bbe8a191 1;size=28; CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT

==> reverse_derep.fasta <==

948b3e6c0d9c5140a20b3e950baf513b 2;size=45; GACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG 84ee339f8f7d47fb5278cc7d211c2856 2;size=32; GACTACCCGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG 087be525fc7401eb8dc965c0b0b20a83 2;size=30; GACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG 5046bcb4cbabfb76f0f6966c0ada2695 2;size=30; GACTACCGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG 792f79536de130a5e6ded001bbe8a191 2;size=28; GACTACTAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG

Explanation:

-I push each line to @a array, and then replace it with a slice of itself. We do @a = @a[@a-4..$#a], which means, replace @a with last 4 elements of @a.

-if ($. % 4 == 0) means that after every fourth line my @a is populated with a read

-For paired-end reads I am concatenating both forward and reverse reads together using paste which inserts a "\t". For dereplication, I am simply concatenating them using ":" as a delimeter, and storing the concatenating string in a hash $d{} with counts

-I am using Digest::MD5=md5_hex to generate a checksum as FASTA header name

-size=* is in usearch format where * represents the abundance count of such reads

-for paired-end reads, checksum is same for both forward and reverse reads, the only difference being " 1" and " 2" in FASTA headers

ADD COMMENTlink written 4.4 years ago by umer.zeeshan.ijaz1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1599 users visited in the last hour