Question

Non-ascii characters in bam file cause htseq-count error

0

4.6 years ago

loui_ • 0

Hi everybody,

I have aligned BAM files (STAR version 2.6.1) and want to quantify them using htseq-count (version 0.9.1). Some BAMs run through without a problem, but some give the following error:

...
13300000 SAM alignment record pairs processed.
13400000 SAM alignment record pairs processed.
Error occured when processing SAM input (record #26816395 in file /folder/file.bam):
'ascii' codec can't decode byte 0xe4 in position 4: ordinal not in range(128)
[Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]

When I go back to the BAM file, the specific line (26816395) contains a question mark in the quality part of the read (shown as either ? or ^?). So far, I have dealt with the problem by simply removing the line, since only one or a few reads were affected. But one file has 30 or more failing lines, and I don't want to lose all those reads. The problem is that it is not really a question mark (I know this because grep '?' returns no result), so the question mark must be a placeholder for a non-ASCII character.
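
One way to locate such bytes is to scan the SAM text as raw bytes instead of relying on how a terminal displays them. The following is a minimal Python sketch and not part of the original post: the function names `find_non_ascii` and `scan` and the demo records are made up for illustration, and the record number it reports counts alignment lines only, which may not match htseq-count's own numbering.

```python
def find_non_ascii(line: bytes):
    """Return (byte offset, byte value) pairs for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(line) if b > 0x7F]


def scan(stream):
    """Yield (record number, read name, offending bytes) for SAM lines with non-ASCII bytes.

    `stream` is any iterable of raw byte lines, e.g. sys.stdin.buffer.
    """
    record_no = 0
    for raw in stream:
        if raw.startswith(b"@"):  # header lines start with '@'; read names cannot
            continue
        record_no += 1
        hits = find_non_ascii(raw)
        if hits:
            # Decode the read name leniently so the report itself cannot fail.
            qname = raw.split(b"\t", 1)[0].decode("ascii", errors="replace")
            yield record_no, qname, hits


# Made-up demo input: one header line, one clean record, one record with byte 0xe4.
demo = [
    b"@HD\tVN:1.6\n",
    b"r1\tFFFF\n",
    b"r2\tMD:Z:85\xe415\n",
]
for record_no, qname, hits in scan(demo):
    report = ", ".join(f"offset {i}: 0x{b:02x}" for i, b in hits)
    print(f"record {record_no}\t{qname}\t{report}")
```

On a real file, the same loop could iterate over `scan(sys.stdin.buffer)` with the output of `samtools view file.bam` piped in. Reading bytes rather than text matters here, because decoding the input first would either fail or silently replace the very bytes being searched for.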

So my question: how do I remove non-ASCII characters from my BAM file?
(alternatively: do you know where the question marks come from in the first place? Can I re-run STAR with different parameters?)

I've been trying to figure this out for quite a while, so I appreciate any input or workaround :)

Thanks!

RNA-Seq STAR HTseq-count • 2.3k views

ADD COMMENT • link updated 4.6 years ago by swbarnes2 14k • written 4.6 years ago by loui_ • 0

1

A question mark is a perfectly valid character to have in the quality line. A ^ is not. I'd examine the original fastq, see if it's there.

ADD REPLY • link 4.6 years ago by swbarnes2 14k

0

That's true, but I'm guessing that it is not really a question mark; otherwise I would be able to find it with grep '?', but I can't. Correct? The line in the fasta file also looks normal.

ADD REPLY • link 4.6 years ago by loui_ • 0

0

So these characters are coming from your original sequence files? Have you checked the offending read identified in the BAM?

ADD REPLY • link 4.6 years ago by GenoMax 141k

0

As I said, the line in the bam file looks normal, it just contains a question mark:

A00125:103:HF5GFDSXX:1:1458:14696:27445 83  14  63373824    255 101M    =   63373767    -158    TGAGATCTGTCTGTCTCAGCCTCCCAAGGGCTGGAATCACAAGCTTGAGCCATTACACCTGTCTCTTACGCCATCTAATTCCACCCTAATCTCCATCTCCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:1  HI:i:1  AS:i:199    nM:i:0  NM:i:1  MD:Z:85?15  jM:B:c,-1   jI:B:i,-1   MC:Z:101M

ADD REPLY • link 4.6 years ago by loui_ • 0

score 2 · Accepted Answer · 2019-10-09

4.6 years ago

GenoMax 141k

You seem to be using an older version of STAR (current is 2.7.3a). Is there any chance you can try an upgrade and see if that fixes this?

ADD COMMENT • link 4.6 years ago by GenoMax 141k

0

I used this version because I had used it for other experiments as well, but that's a good idea. I will give it a try.

ADD REPLY • link 4.6 years ago by loui_ • 0

0

I installed the newest STAR version available on our system (2.7.1a), reran everything, and this indeed resolved the problem! Thanks a lot!

Do you want to post your comment as a reply so I can accept it?

ADD REPLY • link 4.5 years ago by loui_ • 0

score 2 · Accepted Answer · 2019-10-09

4.6 years ago

swbarnes2 14k

It looks like the problem is that you have a question mark in your MD bam tag. I'm not sure that's legal. So why did STAR put it there?

I notice that when I BLAST your sequence, it comes up with a hit that has an N at base 86, right where the MD tag indicates a discrepancy. I wonder if that's the problem: that STAR and/or htseq-count doesn't like having an N in the reference.
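
To check whether an MD value is well formed, it can be tested against the pattern that the SAM optional-fields specification gives for the MD tag. The snippet below is a minimal Python sketch, not something from the thread: the helper names `md_is_valid` and `get_tag` are hypothetical, and the shortened record is made up, with a literal ? standing in for whatever byte is really in the file.

```python
import re

# MD value pattern as given in the SAM optional-fields specification:
# match counts separated by mismatched reference bases or ^-prefixed deletions.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")


def md_is_valid(value: str) -> bool:
    """Return True if the whole MD value matches the specification's pattern."""
    return MD_PATTERN.fullmatch(value) is not None


def get_tag(sam_line: str, tag: str):
    """Return the value of an optional field such as 'MD', or None if absent."""
    # The first 11 tab-separated columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        name, _type, value = field.split(":", 2)
        if name == tag:
            return value
    return None


# Made-up, shortened record carrying the MD value from the post above.
record = "r1\t83\t14\t1\t255\t101M\t=\t1\t-158\tACGT\tFFFF\tNH:i:1\tMD:Z:85?15\n"
md = get_tag(record, "MD")
print(md, md_is_valid(md))            # the value from the post fails the pattern
print("85N15", md_is_valid("85N15"))  # an N as the reference base would pass
```

Under that pattern, only uppercase letters may appear between the match counts, so 85?15 is rejected whether the offending character is a real question mark or a non-ASCII byte.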

ADD COMMENT • link 4.6 years ago by swbarnes2 14k

0

Good observation, thanks a lot! I checked some other sequences and that seems to be the problem. Nevertheless, I can't come up with a solution. I already allow 6 mismatches in the alignment, but obviously that doesn't change the Ns.

I also checked the reads in IGV and all the ones I checked are intronic so I added the option "-m intersection-strict" to htseq-count but that did not help either. Any other suggestions? Everything is highly appreciated!

ADD REPLY • link 4.5 years ago by loui_ • 0

0

As stated above, the solution to all problems was installing the newest version of STAR (2.7.1a in my case). I assume this is a bug in the older versions!

Thanks for your help.

ADD REPLY • link 4.5 years ago by loui_ • 0