Question: Non-ascii characters in bam file cause htseq-count error
loui_ wrote (12 days ago):

Hi everybody,

I have aligned BAM files (STAR version 2.6.1) and want to quantify them using htseq-count (version 0.9.1). Some BAMs run through without a problem, but some give the following error:


...
13300000 SAM alignment record pairs processed.
13400000 SAM alignment record pairs processed.
Error occured when processing SAM input (record #26816395 in file /folder/file.bam):
'ascii' codec can't decode byte 0xe4 in position 4: ordinal not in range(128)
[Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]

When I go back to the BAM file, the specific line (26816395) contains a question mark in the quality part of the read
(either ? or ^?). So far, I have dealt with the problem by simply removing the line, since it was only one or a few reads.
But there's a file with 30 or more failing lines, and I don't want to lose all those reads.
The problem is that it is not REALLY a question mark (I know this because grep '?' returns no result).
This means that the question mark is only a substitute for a non-ASCII character.

So my question: How do I remove non-ascii characters from my bam file?
(alternatively: do you know where the question marks come from in the first place? Can I re-run STAR with different parameters?)
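
A minimal Python sketch for locating such bytes, assuming the alignments are streamed as SAM text (for example from `samtools view`). The names `find_non_ascii` and `main` are illustrative, and the record numbering simply counts non-header lines, so it may not match the record number htseq-count reports, particularly for paired-end input. A DEL byte (0x7f, often displayed as `^?`) is still ASCII and is not flagged here; the sketch only reports bytes of 0x80 and above, such as the 0xe4 named in the error.

```python
import sys


def find_non_ascii(lines):
    """Yield (record_number, field_index, field_bytes) for SAM fields
    that contain at least one byte >= 0x80.

    `lines` is an iterable of raw SAM lines as bytes. Header lines
    (starting with b"@") are skipped and are not counted as records.
    Field indices are 1-based, so QUAL is field 11.
    """
    record = 0
    for line in lines:
        if line.startswith(b"@"):
            continue
        record += 1
        fields = line.rstrip(b"\r\n").split(b"\t")
        for idx, field in enumerate(fields, start=1):
            if any(byte > 0x7F for byte in field):
                yield record, idx, field


def main():
    # Read raw bytes from stdin so that no decoding error can occur here.
    for record, idx, field in find_non_ascii(sys.stdin.buffer):
        print(f"record {record}, field {idx}: {field!r}")
```

To run it on a BAM file, one could call `main()` at the end of the script and pipe in the output of `samtools view /folder/file.bam` (the script itself is hypothetical). The reported field index shows whether the byte sits in QUAL (field 11) or in one of the optional tags.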

I've been trying to figure this out for quite a while, so I appreciate any input or workaround :)

Thanks!

Tags: rna-seq, star, htseq-count

A question mark is a perfectly valid character to have in the quality line. A ^ is not. I'd examine the original fastq, see if it's there.

Reply written 12 days ago by swbarnes2
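
For reference on quality characters, here is a small sketch assuming Phred+33 encoding (an assumption, since the thread does not state the encoding, though it is the usual one for current Illumina output). Under the SAM specification the QUAL field is restricted to printable ASCII from `!` to `~`. A literal `^` falls inside that range but would decode to Q61, far higher than Illumina instruments normally emit, while the `^?` shown by viewers such as `less` is typically caret notation for the DEL byte (0x7f), which lies outside the range. The helper names below are illustrative.

```python
def phred33(char):
    """Return the Phred quality score of one Phred+33 encoded character."""
    return ord(char) - 33


def is_valid_qual_char(char):
    """True if `char` is in the printable range '!'..'~' that SAM allows in QUAL."""
    return "!" <= char <= "~"


# Show the decoded score and validity for a few characters from this thread.
for ch in "?^F":
    print(ch, phred33(ch), is_valid_qual_char(ch))
```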

That's true, but I'm guessing that it is not really a question mark; otherwise I would be able to find it with grep '?', but I can't. Correct? The line in the fastq file also looks normal.

Reply written 12 days ago by loui_

So these characters are coming from your original sequence files? Have you checked the offending read identified in the BAM?

Reply written 12 days ago by genomax (modified 12 days ago)

As I said, the line in the BAM file looks normal; it just contains a question mark:

A00125:103:HF5GFDSXX:1:1458:14696:27445 83  14  63373824    255 101M    =   63373767    -158    TGAGATCTGTCTGTCTCAGCCTCCCAAGGGCTGGAATCACAAGCTTGAGCCATTACACCTGTCTCTTACGCCATCTAATTCCACCCTAATCTCCATCTCCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:1  HI:i:1  AS:i:199    nM:i:0  NM:i:1  MD:Z:85?15  jM:B:c,-1   jI:B:i,-1   MC:Z:101M
Reply written 12 days ago by loui_
genomax wrote (12 days ago):

You seem to be using an older version of STAR (current is 2.7.3a). Is there any chance you can try an upgrade and see if that fixes this?


I used this version because I had used it for other experiments as well, but that's a good idea. I will give it a try.

Reply written 12 days ago by loui_

I installed the newest STAR version available on our system (2.7.1a) and reran everything, and this indeed resolved the problem! Thanks a lot!

Do you want to post your comment as a reply so I can accept it?

Reply written 11 days ago by loui_ (modified 11 days ago)
swbarnes2 wrote (12 days ago):

It looks like the problem is that you have a question mark in the MD tag of your BAM record. I'm not sure that's legal. So why did STAR put it there?

I notice that when I blast your sequence, it comes up with a hit that has an N at base 86, right where the MD tag indicates a discrepancy. I wonder if that's the problem: that STAR and/or htseq-count doesn't like having an N in the reference.
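
On whether such an MD value is legal: the SAM optional-fields specification gives a regular expression for MD, and a quick check against it can be sketched in Python as below. The function name `md_is_valid` is illustrative. The check only tests the tag's syntax; it does not verify that the tag agrees with the read or the reference.

```python
import re

# MD grammar from the SAM optional-fields specification:
# a match count, then repeated (mismatched base or ^deleted bases) + match count.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")


def md_is_valid(md_value):
    """Return True if `md_value` (the part after 'MD:Z:') matches the MD grammar."""
    return MD_PATTERN.fullmatch(md_value) is not None


print(md_is_valid("85N15"))  # an N in the reference at that position
print(md_is_valid("85?15"))  # the value as displayed in the record above
```

By this grammar a reference N would be written as `85N15`, so a `?` (or any byte outside `A-Z`) in that position does not conform.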


Good observation, thanks a lot! I checked some other sequences and that seems to be the problem. Nevertheless, I can't come up with a solution. I already allow 6 mismatches in the alignment, but obviously that doesn't change the Ns.

I also checked the reads in IGV, and all the ones I checked are intronic, so I added the option "-m intersection-strict" to htseq-count, but that did not help either. Any other suggestions? Everything is highly appreciated!

Reply written 11 days ago by loui_

As stated above, the solution to all problems was installing the newest version of STAR (2.7.1a in my case). I assume this is a bug in the older versions!

Thanks for your help.

Reply written 11 days ago by loui_