Question: Found Invalid Nucleotide Sequence
Love wrote:

Hello, I used samtools to generate a FASTQ file (a consensus sequence). Then I used fastx to filter it by quality. The command is:

fastq_quality_filter -i cns.fastq -o cns_Qual20.fastq -q 30 -p 80 -Q 33 -v
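For readers unfamiliar with the flags: in the FASTX-Toolkit usage text, -q is the minimum quality score to keep and -p is the minimum percentage of bases that must reach that score, while -Q 33 selects the Phred+33 quality offset. A minimal Python sketch of that rule follows. The function name and the inclusive comparisons are assumptions made for illustration, not the toolkit's actual code.

```python
def passes_quality_filter(quality_line, min_quality=30, min_percent=80, offset=33):
    """Return True if at least min_percent of bases have quality >= min_quality.

    Hypothetical helper: it mirrors the meaning of -q, -p and -Q only;
    the inclusive boundaries are an assumption.
    """
    if not quality_line:
        return False
    # Decode each quality character to its numeric score.
    scores = [ord(ch) - offset for ch in quality_line]
    good = sum(1 for score in scores if score >= min_quality)
    return 100.0 * good / len(scores) >= min_percent


print(passes_quality_filter("IIIIIIIIII"))  # every base is Q40 at offset 33
print(passes_quality_filter("IIIIIII###"))  # only 7 of 10 bases reach Q30
```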

However, I got an error:

fastq_quality_filter: found invalid nucleotide sequence (gaTCACAGGTCTATCACCCTATTAACCACTCACGGgagctctccatgcatttggtatttt) on line 2

The top lines of the sequence file look like this:

@chrM
gaTCACAGGTCTATCACCCTATTAACCACTCACGGgagctctccatgcatttggtatttt
cgtttggggggtatgcacgcgatagcattgcgagacgctggagccggagcaccctatgtc
gcagtatctgtctttgattcctgcctcatcctattatttatcgcacctacgttcaatatt
acaggcgaacatacttactaaagtgtgttaattaattaatgcttgtaggacataataata
acaattgaatgtctgcacagccgctttccacacagacatcataacaaaaaatttccacca

Thanks for help.

Tags: filter, fastq, quality
written 7.9 years ago by Love (modified 4.9 years ago by Biostar)
Damian Kao wrote:

This page says:

Some functions of FASTX-Toolkit do not work with FASTA-formatted sequences on multiple lines, thus it is sometimes necessary to transform the file (with fasta_formatter) so that each sequence is on a single line.

Pierre is probably right. You need to reformat your FASTQ so that each record takes exactly 4 lines.

Try using this script to reformat your FASTQ into 4 lines per record:

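import sys

def write_record(header, seq, qual):
    # Emit one record as exactly four lines, with the sequence in upper case.
    print("@" + header)
    print(seq.upper())
    print("+" + header)
    print(qual)

header = None
seq = ''
qual = ''
in_qual = False

with open(sys.argv[1]) as in_file:
    for line in in_file:
        line = line.strip()
        if not line:
            continue
        if in_qual and len(seq) > len(qual):
            # Still inside the quality block, where "@" and "+" are
            # ordinary quality characters rather than record markers.
            qual += line
        elif line[0] == "@":
            # A new record starts: flush the previous one first.
            if header is not None:
                write_record(header, seq, qual)
            header = line[1:]
            seq = ''
            qual = ''
            in_qual = False
        elif line[0] == "+" and not in_qual:
            in_qual = True
        else:
            seq += line

# Flush the last record, if the file contained any.
if header is not None:
    write_record(header, seq, qual)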

Save it as yourName.py and run it with: python yourName.py yourFastq.fastq > reformatted.fastq
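To confirm that a file really has the 4-line layout, something like the following check can be run over its lines. This is only a sketch: the function name and the three checks are illustrative assumptions, and passing them does not guarantee that fastq_quality_filter will accept the file.

```python
def check_four_line_fastq(lines):
    """Return a list of problems; an empty list means the layout looks fine.

    Hypothetical helper: it checks only the line count, the '@' and '+'
    markers, and that sequence and quality have the same length.
    """
    problems = []
    lines = [ln.rstrip("\n") for ln in lines]
    if len(lines) % 4 != 0:
        problems.append("line count %d is not a multiple of 4" % len(lines))
    # Walk over every complete group of four lines.
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            problems.append("line %d: header does not start with '@'" % (i + 1))
        if not plus.startswith("+"):
            problems.append("line %d: separator does not start with '+'" % (i + 3))
        if len(seq) != len(qual):
            problems.append("line %d: sequence and quality lengths differ" % (i + 2))
    return problems


record = ["@chr1\n", "GATCACAGGTCTATCACCCTA\n", "+chr1\n", "efcfffffcfeefffcfffff\n"]
print(check_four_line_fastq(record))
```

For a real file, pass the open file handle instead of a list, for example check_four_line_fastq(open("reformatted.fastq")).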

written 7.9 years ago by Damian Kao (modified 7.9 years ago)

NameError: name 'sys' is not defined

written 7.9 years ago by Zhshqzyc

Sorry, my fault.

written 7.9 years ago by Zhshqzyc

I've changed the script to print out upper case sequence letters. Maybe it will help?

written 7.9 years ago by Damian Kao

Pierre Lindenbaum wrote:

Does fastq_quality_filter accept FASTQ files that have more than 4 lines per record (name, seq, name2, qualities)?

written 7.9 years ago by Pierre Lindenbaum

I don't know. But in my previous thread, http://biostar.stackexchange.com/questions/14838/what-is-the-minimum-quality-score, the person who answered said that it is fine.

written 7.9 years ago by Love

I converted it to 4 lines per record, and it still fails. A very simple test file:

@chr1
gaTCACAGGTCTATCACCCTA
+chr1
efcfffffcfeefffcfffff

Then the error:

fastq_quality_filter: found invalid nucleotide sequence (gaTCACAGGTCTATCACCCTA) on line 2

written 7.9 years ago by Love
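One way to narrow this down is to list every character in the sequence line that falls outside a given alphabet, hidden ones included. The sketch below assumes an allowed set of upper-case ACGTN. That set and the function name are illustrative assumptions, since this thread does not establish what fastq_quality_filter actually accepts.

```python
def unexpected_characters(sequence_line, allowed="ACGTN"):
    """Return (position, repr) pairs for every character not in `allowed`.

    Hypothetical helper: `allowed` is an assumed alphabet, not one taken
    from the FASTX-Toolkit source. Positions are 1-based.
    """
    return [(pos, repr(ch))
            for pos, ch in enumerate(sequence_line, start=1)
            if ch not in allowed]


print(unexpected_characters("gaTCACAGG"))    # reports the lower-case g and a
print(unexpected_characters("GATCACAGG\r"))  # reports a trailing carriage return
```

Using repr() makes invisible characters such as a carriage return or a tab show up in the report, which a plain look at the file would not reveal.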

Kind of a long shot, but maybe it doesn't like lower-case letters?

written 7.9 years ago by Damian Kao

But does lower case have a specific meaning?

written 7.9 years ago by Love

It still fails with upper case. Did I download the wrong fastx?

written 7.9 years ago by Love