I have also struggled with this kind of problem several times. Somehow (I don't know why) you may have a non-ASCII character in your FASTQ file.
And for some reason (which I also don't know) bowtie2 reports it as having the ASCII code "-126", which does not exist, since all ASCII codes are non-negative.
Most likely the non-ASCII character is in one of the quality lines of your FASTQ file.
You can search for lines containing a non-ASCII character with the following commands. If the file is gzipped, use this:
zcat FILE.fastq.gz | perl -e '$line = 1; while (<>) {if(/[^[:ascii:]]/) {print "LINE: $line\n$_";} $line++;}'
Otherwise this:
perl -e '$line = 1; while (<>) {if(/[^[:ascii:]]/) {print "LINE: $line\n$_";} $line++;}' FILE.fastq
Let's say you received such an output:
LINE: 21410096
CCCCCGGGGGGGGGG�GGGG
Your non-ASCII character will be displayed as this question-mark symbol.
Once you know the offending line, you can print the whole read around it with sed:
sed -n '21410093,21410096p' FILE.fastq
So we got something like this as an output:
@sequence HEADER
CTGGCTGGGAAGGGGCTGGCT
+quality HEADER
CCCCCGGGGGGGGGG�GGGG
Such a corrupted entry can easily be removed with sed. In this case the corrupted read is located at lines
21410093 to 21410096, so the following command will work (note: no -n here, otherwise sed prints nothing):
sed '21410093,21410096d' FILE.fastq > repaired.fastq
This would be problematic if you have paired-end reads. Either remove the read at the same position in the corresponding FASTQ file, or replace the corrupting character in your initial FASTQ file. I would prefer the latter.
This can be done with sed:
sed '21410096s/CCCCCGGGGGGGGGG�GGGG/CCCCCGGGGGGGGGGIGGGGG/' FILE.fastq >FILE.repaired.fastq
(actually I'm surprised that this one worked).
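If typing the corrupted character into a sed pattern feels fragile, a more general variant is to replace every byte outside the printable ASCII range on the quality lines only. The following is just a sketch under some assumptions: each record is exactly four lines, the demo file and its stray byte are made up for illustration, and "I" is an arbitrary placeholder quality (the same character used in the sed command above).

```shell
# Build a tiny demo FASTQ whose quality line contains one stray byte
# (octal 202 = decimal 130). This stands in for the real FILE.fastq.
demo=$(mktemp -d)
printf '@read1\nACGTA\n+\nCC\202GG\n' > "$demo/FILE.fastq"

# On every 4th line (the quality line), replace each byte outside the
# printable ASCII range '!'..'~' with the placeholder 'I'.
# LC_ALL=C makes awk treat the input as single bytes, not UTF-8 characters.
LC_ALL=C awk 'NR % 4 == 0 { gsub(/[^!-~]/, "I") } { print }' \
    "$demo/FILE.fastq" > "$demo/FILE.repaired.fastq"

# Show the repaired quality line and the total number of lines.
repaired_quality=$(sed -n '4p' "$demo/FILE.repaired.fastq")
repaired_lines=$(wc -l < "$demo/FILE.repaired.fastq" | tr -d ' ')
echo "$repaired_quality ($repaired_lines lines)"
```

Because every stray byte becomes one "I", a multi-byte character would turn into several placeholders, so it is worth checking afterwards that the quality line still has the same length as the sequence line. The read stays in place, so the mate file does not need to be touched.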
I hope this helps.
Sounds like you have a flipped bit somewhere in the fastq file. Try to subset it until you find the approximate line number. You can probably then look through the file and quickly identify the weird character (there's probably a quick awk command to do all of this, but I don't know it off-hand).
ASCII 126 is the tilde (~).
Will this work?
Is that even a good idea?
It's -126, which I guess would be byte value 130 (0x82) if one assumed char to be unsigned. That is outside the 7-bit ASCII range (0 to 127).
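The signed-to-unsigned conversion can be checked quickly on the command line. This is only a sketch: it assumes the reported value is an 8-bit signed char, and the variable names are made up for illustration.

```shell
# Map a signed 8-bit value to the unsigned byte value it represents.
signed=-126
unsigned=$(( (signed + 256) % 256 ))
echo "$unsigned"

# Cross-check with od: print the single byte with octal code 202
# once as unsigned decimal (-tu1) and once as signed decimal (-td1).
as_unsigned=$(printf '\202' | od -An -tu1 | tr -d ' \n')
as_signed=$(printf '\202' | od -An -td1 | tr -d ' \n')
echo "$as_unsigned $as_signed"
```

Both views describe the same byte, which is why a tool that stores characters in a signed char can report a negative "ASCII code".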
I am dealing with 8 paired FASTQ files, so do I need to search all 18 FASTQ files I have?!
You should be able to tell which fastq file is the problem by when the error occurs and what the last alignment written was. Just grep for that last alignment's read name and you'll know the file. The problem read should be within ~128K of that (I think that's the buffer size bowtie2 uses).
OK, I grep-ed the last sequence like that. It suggested two files
using the command
But no result!
Get the line number with grep -n and then use awk to extract the next 100,000 (or so, these are estimated numbers!) reads from each into a new file. One of those files should then cause the error. You can then more easily use bowtie2's --upto and -s options to narrow down the region (in fact, you can use those to avoid using awk to subset the file at all).
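The grep -n plus awk step could look roughly like this. It is a sketch, not a tested recipe for real data: the demo file, the read name "@read2" and the window of 2 reads are placeholders (for a real run the window would be closer to the ~100,000 reads mentioned above), and each record is assumed to span exactly four lines.

```shell
# Demo input: 5 reads of 4 lines each (stand-in for a real FASTQ file).
demo=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$demo/FILE.fastq"

last_read='@read2'   # hypothetical: name of the last read that was aligned
n_reads=2            # hypothetical window size, in reads

# Line number of the first line containing the read name (fixed string).
start=$(grep -n -m 1 -F "$last_read" "$demo/FILE.fastq" | cut -d: -f1)

# Copy n_reads reads (4 lines each), starting at that header line.
awk -v s="$start" -v n="$n_reads" 'NR >= s && NR < s + 4 * n' \
    "$demo/FILE.fastq" > "$demo/subset.fastq"

subset_lines=$(wc -l < "$demo/subset.fastq" | tr -d ' ')
first_header=$(head -n 1 "$demo/subset.fastq")
echo "$start $subset_lines $first_header"
```

Since -F matches a substring, a name like "@read2" would also hit "@read20" in a larger file; taking only the first match (-m 1) and checking the printed header guards against extracting the wrong region.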
That's because I made a mistake. ~ is not the character you are looking for :-)
We should look for a pattern that matches non-Phred64 characters.
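One possible pattern, as a sketch only: it assumes Phred+64 qualities use the characters '@' through 'h' (ASCII 64 to 104), that each record is exactly four lines, and the demo file is made up for illustration. If your data use a different encoding or a wider range, the bracket expression needs adjusting.

```shell
# Two demo reads: the second quality line contains a byte (octal 202)
# that lies outside the assumed Phred+64 range '@'..'h'.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nhhhh\n@r2\nACGT\n+\nhh\202h\n' > "$demo/FILE.fastq"

# Print the line number of every quality line (each 4th line) that holds
# a character outside '@'..'h'. LC_ALL=C makes awk compare raw bytes.
bad_lines=$(LC_ALL=C awk 'NR % 4 == 0 && /[^@-h]/ { print NR }' "$demo/FILE.fastq")
echo "$bad_lines"
```

The printed line numbers can then be fed to the sed -n 'START,ENDp' command from the answer above to look at the affected reads.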
OP, did you find a solution to this?