remove bad reads from fastq.gz file
10.0 years ago
biotechdiya ▴ 30

Hi,

I have a fastq.gz file, and one of the reads is causing a problem with the alignment. I want to remove that read. I tried the following code but still no success. Has anyone used a tool that can handle bad reads?

#!/usr/bin/env python
import sys
import gzip

def test():
    f = sys.stdin
    lastname = ''
    i=0
    minread = None
    try:
        while True:
            lastname = ''
            name = f.next().strip()
            seq = f.next().strip()
            plus = f.next().strip()
            qual = f.next().strip()
            i += 4


            if name == '@D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536 D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536':
               continue
            print '%s\n%s\n+\n%s\n'%(name,seq,qual)

    except:
        pass

if __name__ == '__main__':
    test()

Thanks,

RNA-Seq • 6.2k views
10.0 years ago

Using the Linux command line:

gunzip -c fastq.gz |\
paste - - - - |\
awk -F '\t' '$1!="@D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536 D93Z3NS1:147:C3KT5ACXX:1:2216:13222:14536"' |\
tr "\t" "\n" |\
gzip > out.fq.gz

Thank you for your reply. I tried the Linux command above, but it does not work:

-bash:  paste: command not found
-bash:  tr: command not found
-bash:  awk: command not found
-bash:  gzip: command not found

Thanks,

Diya


Either you typed it wrong, it's a PATH problem, or your Linux is missing the core utilities (!): http://en.wikipedia.org/wiki/GNU_Core_Utilities

10.0 years ago
Prakki Rama ★ 2.7k

Assuming you want tools that remove bad reads from the dataset for better analysis, the following options are commonly used during data pre-processing. (There are many other software tools if you want to explore more.)

  1. Removing adapters - Cutadapt and FAR can trim the adapter regions from the reads. (You can check this Biostar post as well.)

  2. Removing low-quality bases and filtering low-quality reads - fastq_quality_trimmer and fastq_quality_filter from the fastx toolkit serve this purpose.

  3. Removing poly-A/T/N reads - the options -trim_tail_left, -trim_tail_right, -trim_ns_left and -trim_ns_right from Prinseq should be useful.

Apart from these, there is sequence contamination:

  1. Decontamination - DeconSeq might be useful. If the reads are from RNA sequencing, then removing rRNA contamination using the SILVA database is useful and common practice.

  2. If the reads are from Illumina sequencing - one can also try running Quake to correct substitution sequencing errors.
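As a rough illustration of what the quality filtering in step 2 above does (a hand-rolled sketch, not the fastx toolkit's actual code; the mean-quality cutoff of 20 and the Phred+33 encoding are assumptions):

```python
def mean_phred(qual, offset=33):
    # Convert an ASCII quality string to its mean Phred score,
    # assuming Phred+33 encoding (standard for modern Illumina data).
    return sum(ord(c) - offset for c in qual) / len(qual)

def quality_filter(records, min_mean_q=20):
    # records: iterable of (name, seq, qual) tuples.
    # Yield only the reads whose mean quality reaches the cutoff.
    for name, seq, qual in records:
        if mean_phred(qual) >= min_mean_q:
            yield (name, seq, qual)
```

Tools like fastq_quality_filter use more refined criteria (e.g. the fraction of bases above a quality threshold), but the principle is the same.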

10.0 years ago

You need to define what you mean by a bad read.

In your code above, if you test the name before reading the other three lines of the record, you need to skip those lines when you hit the incorrect id. Something like:

if name == '@something':
    f.next()
    f.next()
    f.next()
    continue

Also, iterate directly over the file:

for name in f:
    seq = f.next()
    tmp = f.next()
    qual = f.next()

Then you don't need the whole try/except block; silencing every error like that is not good practice.


