Hi everybody,
I have aligned BAM files (STAR version 2.6.1) and want to quantify them using htseq-count (version 0.9.1). Some BAMs run through without a problem, but some fail with the following error:
...
13300000 SAM alignment record pairs processed.
13400000 SAM alignment record pairs processed.
Error occured when processing SAM input (record #26816395 in file /folder/file.bam):
'ascii' codec can't decode byte 0xe4 in position 4: ordinal not in range(128)
[Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
When I go back to the BAM file, the specific line (26816395) contains a question mark in the quality part of the read
(displayed as either ? or ^?). So far, I have dealt with the problem by just removing the line, since it was only one or a few reads.
But there's one file with 30 or more failing lines, and I don't want to lose all those reads.
The problem is that it is not REALLY a question mark (I know this because grep '?' returns no result).
This means that the question mark is only a display substitution for a non-ASCII character.
So my question: how do I remove non-ASCII characters from my BAM file?
(Alternatively: do you know where the question marks come from in the first place? Can I re-run STAR with different parameters?)
I've been trying to figure this out for quite a while, so I appreciate any input or workaround :)
Thanks!
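Since the terminal only shows a substitute character, it may help to look at the raw bytes instead. The following is a minimal sketch, not a tested fix for this data set: it assumes the BAM has been converted to SAM text (for example with samtools view) and that the lines are read as bytes. The function name find_non_ascii is hypothetical. It reports which record and which tab-separated column holds a byte of 0x80 or higher, so you can tell whether the bad byte sits in the quality string (column 11), the read name (column 1), or an optional tag.

```python
def find_non_ascii(sam_lines):
    """Yield (record_number, column_number, byte_offset, byte_value) for
    every byte >= 0x80 found in tab-separated SAM records.

    sam_lines is any iterable of bytes objects, one SAM line each.
    Header lines (starting with b"@") are skipped and not counted.
    """
    record_number = 0
    for raw in sam_lines:
        if raw.startswith(b"@"):
            continue  # SAM header line, not an alignment record
        record_number += 1
        fields = raw.rstrip(b"\r\n").split(b"\t")
        for column_number, field in enumerate(fields, start=1):
            # Iterating over bytes yields integers in Python 3.
            for byte_offset, byte_value in enumerate(field):
                if byte_value >= 0x80:
                    yield record_number, column_number, byte_offset, byte_value


# Small in-memory demonstration with one clean and one damaged record.
example = [
    b"@HD\tVN:1.4\n",
    b"r1\t0\tchr1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    b"r2\t0\tchr1\t5\t255\t4M\t*\t0\t0\tACGT\tII\xe4I\n",
]
for rec, col, off, val in find_non_ascii(example):
    print(f"record {rec}, column {col}, offset {off}: byte 0x{val:02x}")
```

On real data, the same function could be fed sys.stdin.buffer with the output of samtools view piped in. Note that the record numbers counted here are plain SAM lines and may not match the record number htseq-count prints for paired-end input.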
A question mark is a perfectly valid character to have in the quality line. A ^ is not. I'd examine the original fastq, see if it's there.
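To make the point about valid quality characters concrete, here is a small sketch assuming the usual Phred+33 encoding, where each quality character is a printable ASCII character with a code from 33 to 126 and the score is the code minus 33. The function name phred33_scores is hypothetical. A real ? decodes to 30, which is an ordinary score. Note also that the two-character sequence ^? is how tools such as cat -v and less display the single DEL byte (0x7f), which falls outside the printable range, as does 0xe4 from the error message.

```python
def phred33_scores(quality):
    """Convert a Phred+33 quality string into a list of integer scores.

    Raises ValueError for any character outside the printable ASCII
    range 33-126, which is what Phred+33 quality strings are limited to.
    """
    scores = []
    for ch in quality:
        code = ord(ch)
        if not 33 <= code <= 126:
            raise ValueError(
                f"character {ch!r} (code {code}) is outside the range 33-126"
            )
        scores.append(code - 33)
    return scores


print(phred33_scores("?"))  # a genuine question mark is Phred 30

# DEL (0x7f) and 0xe4 are both rejected as quality characters.
for bad in ("\x7f", "\xe4"):
    try:
        phred33_scores(bad)
    except ValueError as err:
        print("invalid:", err)
```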
That's true, but I'm guessing that it is not really a question mark; otherwise I would be able to find it with grep '?', but I can't. Correct? The line in the fasta file also looks normal.
So these characters are coming from your original sequence files? Have you checked the offending read identified in the BAM?
As I said, the line in the BAM file looks normal; it just contains a question mark: