I have BAM files aligned with STAR (version 2.6.1) and want to quantify them using htseq-count (version 0.9.1). Some BAMs run through without a problem, but some give the following error:
...
13300000 SAM alignment record pairs processed.
13400000 SAM alignment record pairs processed.
Error occured when processing SAM input (record #26816395 in file /folder/file.bam):
  'ascii' codec can't decode byte 0xe4 in position 4: ordinal not in range(128)
  [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
When I go back to the BAM file, the specific line (26816395) contains a question mark in the quality part of the read
(either ? or ^?). So far, I have dealt with the problem by just removing the line, since it was only one or a few reads.
But there's a file with 30 or more failing lines and I don't want to lose all those reads.
The problem is that it is not REALLY a question mark (I know this because when I grep '?' there's no result).
This means that the question mark is only a placeholder shown in place of a non-ASCII character.
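One way to confirm which bytes are actually there, rather than relying on how the terminal displays them, is to scan the SAM text as raw bytes. The following is only a rough sketch: `find_non_ascii` is a hypothetical helper name, and it assumes the records are fed in as byte lines, for example from `samtools view -h file.bam`. It reports every column containing a byte outside the printable ASCII range (0x20 to 0x7E), so it also shows whether the offending byte really sits in the quality column (column 11) or somewhere else.

```python
def find_non_ascii(lines):
    """Yield (record_number, column_number, bad_bytes) for SAM records
    that contain bytes outside printable ASCII (0x20..0x7E).

    `lines` is any iterable of byte strings, one SAM line each.
    Header lines (starting with '@') are skipped and not counted.
    """
    record = 0
    for line in lines:
        if line.startswith(b"@"):
            continue
        record += 1
        fields = line.rstrip(b"\r\n").split(b"\t")
        for col, field in enumerate(fields, start=1):
            # Collect control characters (including DEL, 0x7F) and bytes >= 0x80.
            bad = bytes(b for b in field if b < 0x20 or b > 0x7E)
            if bad:
                yield record, col, bad

# Possible use when reading from a pipe (assumption: SAM text on stdin):
#   import sys
#   for hit in find_non_ascii(sys.stdin.buffer):
#       print(hit)
```

The record number here counts alignment lines in the input, which is not necessarily the same numbering htseq-count uses in its error message.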
So my question is: how do I remove non-ASCII characters from my BAM file?
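A possible workaround, sketched under the assumption that the bad bytes really are confined to the quality column: the SAM format allows only the characters `!` to `~` in QUAL (or a single `*`), so anything else can be overwritten with a placeholder. The helper name `clean_quality` and the choice of `!` (the lowest Phred+33 quality) as the replacement are assumptions made for illustration, not something STAR or htseq-count prescribes. Note that this hides the symptom without explaining where the bytes came from.

```python
def clean_quality(line, placeholder=b"!"):
    """Return one SAM line (bytes) with invalid QUAL bytes replaced.

    Only column 11 (QUAL) is touched. Header lines, records with fewer
    than 11 columns, and records with QUAL '*' are returned unchanged.
    """
    if line.startswith(b"@"):
        return line
    body = line.rstrip(b"\r\n")
    ending = line[len(body):]  # preserve the original line ending
    fields = body.split(b"\t")
    if len(fields) < 11 or fields[10] == b"*":
        return line
    # Valid QUAL characters are '!' (0x21) through '~' (0x7E).
    fields[10] = bytes(
        b if 0x21 <= b <= 0x7E else placeholder[0] for b in fields[10]
    )
    return b"\t".join(fields) + ending
```

In practice this could be applied to each line of `samtools view -h file.bam` and the result piped back into `samtools view -b` to write a new BAM, leaving the original file untouched.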
(alternatively: do you know where the question marks come from in the first place? Can I re-run STAR with different parameters?)
I've been trying to figure this out for quite a while, so I'd appreciate any input or workaround :)