Question

remove bad reads from fastq.gz file

0

Entering edit mode

10.4 years ago

biotechdiya ▴ 30

Hi,

I have a fastq.gz file and one of the read is causing problem with the alignment. I want to remove the read. I tried the following code but still no success. Have any one used any tools of which can handle the bad reads.

#!/usr/bin/env python
import sys
import gzip

def test():
    f = sys.stdin
    lastname = ''
    i=0
    minread = None
    try:
        while True:
            lastname = ''
            name = f.next().strip()
            seq = f.next().strip()
            plus = f.next().strip()
            qual = f.next().strip()
            i += 4


            if name == '@D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536 D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536':
               continue
            print '%s\n%s\n+\n%s\n'%(name,seq,qual)

    except:
     pass

if __name__ == '__main__':
    test()

Thanks,

RNA-Seq • 6.4k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.4 years ago by biotechdiya ▴ 30

Ram · Answer 1 · 2014-04-23

1

Entering edit mode

10.4 years ago

Pierre Lindenbaum 163k

using a linux cmd-line....

gunzip -c fastq.gz |\
paste - - - - |\
awk -F '    ' '$1!="@D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536"' |\
tr "\t" "\n" |\
gzip > out.fq.gz

ADD COMMENT • link updated 4.7 years ago by Ram 44k • written 10.4 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

Thank you for your reply. I tried the above linux cmd code and it does not work

-bash:  paste: command not found
-bash:  tr: command not found
-bash:  awk: command not found
-bash:  gzip: command not found

Thanks,

Diya

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.4 years ago by biotechdiya ▴ 30

1

Entering edit mode

you did it wrong or, that's a PATH problem, or your linux is missing the core utilities (!!!!!!) http://en.wikipedia.org/wiki/GNU_Core_Utilities

ADD REPLY • link 10.4 years ago by Pierre Lindenbaum 163k

Ram · Answer 2 · 2014-04-23

Assuming you wanted tools to remove the bad reads from the dataset for better analysis, basically the following options are used during data pre-processing. (There might be many other software tools if you to explore more)

removing adapters- Cutadapt, FAR can trim the adaptor regions from the regions. (you can check this Biostar post also)
removing low quality bases, filtering low quality reads - fastq_quality_trimmer,fastq_quality_filter from fastx toolkit serves the purpose
removing PolyA/T/N reads - options -trim_tail_left, -trim_tail_right, -trim_ns_left, -trim_ns_right from Prinseq should be useful

Apart from these is sequence contamination

Decontamination - Deconseq might be useful, If the reads are especially from RNA-Sequencing, then rRNA contamination removal from Silva database is useful and also in the practice.
if the reads are from Illumina Sequencing - One can also try running QUAKE to correct substitution sequencing errors.

Ram · Answer 3 · 2014-04-23

you need to define what you mean by bad read.

above you need to skip three more lines when you hit the incorrect id. Something like:

if name == '@something':
    f.next()
    f.next()
    f.next()
    continue

also iterate directly on the file

for name in f:
    seq = f.next()
    tmp = f.next()
    qual = f.next()

then you don't need the entire catch/except block that silences the error, that's not good practice