Question: Non-ascii characters in bam file cause htseq-count error
loui_ wrote, 9 months ago:

Hi everybody,

I have aligned BAM files (STAR version 2.6.1) and want to quantify them using htseq-count (version 0.9.1). Some BAMs run through without a problem, but some give the following error:


...
13300000 SAM alignment record pairs processed.
13400000 SAM alignment record pairs processed.
Error occured when processing SAM input (record #26816395 in file /folder/file.bam):
'ascii' codec can't decode byte 0xe4 in position 4: ordinal not in range(128)
[Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]

When I go back to the BAM file, the specific line (26816395) contains a question mark in the quality part of the read
(either ? or ^?). So far, I have dealt with the problem by just removing the line, since it was only one or a few reads.
But there's a file with 30 or more failing lines, and I don't want to lose all those reads.
The problem is that it is not really a question mark (I know this because grep '?' returns no result).
This means that the question mark is only a substitute for a non-ASCII character.
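
A viewer may display a byte it cannot decode as ?, which would be consistent with grep '?' finding nothing. One way to see which field actually holds such a byte is sketched below. It assumes the records are available as SAM text read in binary mode (for example the output of samtools view); the names non_ascii_fields and scan are illustrative and not part of htseq-count or samtools.

```python
def non_ascii_fields(sam_line: bytes):
    """Return (field_number, field_bytes) for every tab-separated field
    of one SAM record that contains a byte outside the ASCII range."""
    hits = []
    fields = sam_line.rstrip(b"\n").split(b"\t")
    for number, field in enumerate(fields, start=1):
        if any(byte > 127 for byte in field):
            hits.append((number, field))
    return hits


def scan(lines):
    """Yield (record_number, hits) for each alignment record that has
    at least one non-ASCII field. Header lines (starting with @) are
    skipped and do not count towards record_number."""
    record_number = 0
    for line in lines:
        if line.startswith(b"@"):
            continue
        record_number += 1
        hits = non_ascii_fields(line)
        if hits:
            yield record_number, hits
```

Passing sys.stdin.buffer to scan while piping in the SAM text would list every affected record; field numbers above 11 are optional tags rather than the quality string.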

So my question: How do I remove non-ascii characters from my bam file?
(alternatively: do you know where the question marks come from in the first place? Can I re-run STAR with different parameters?)

I've been trying to figure this out for quite a while, so I appreciate any input or workaround :)

Thanks!

Tags: rna-seq, star, htseq-count

A question mark is a perfectly valid character to have in the quality line. A ^ is not. I'd examine the original fastq, see if it's there.

written 9 months ago by swbarnes2

That's true, but I'm guessing that it is not really a question mark; otherwise I would be able to find it with grep '?', but I can't. Correct? The line in the fasta file also looks normal.

written 9 months ago by loui_

So these characters are coming from your original sequence files? Have you checked the offending read identified in the BAM?

modified 9 months ago • written 9 months ago by genomax

As I said, the line in the BAM file looks normal; it just contains a question mark:

A00125:103:HF5GFDSXX:1:1458:14696:27445 83  14  63373824    255 101M    =   63373767    -158    TGAGATCTGTCTGTCTCAGCCTCCCAAGGGCTGGAATCACAAGCTTGAGCCATTACACCTGTCTCTTACGCCATCTAATTCCACCCTAATCTCCATCTCCC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:1  HI:i:1  AS:i:199    nM:i:0  NM:i:1  MD:Z:85?15  jM:B:c,-1   jI:B:i,-1   MC:Z:101M
written 9 months ago by loui_

genomax (United States) wrote, 9 months ago:

You seem to be using an older version of STAR (current is 2.7.3a). Is there any chance you can try an upgrade and see if that fixes this?


I used this version because I also used it for other experiments, but that's a good idea. I will give it a try.

written 9 months ago by loui_

I installed the newest STAR version available on our system (2.7.1a), reran everything, and this indeed resolved the problem! Thanks a lot!

Do you want to post your comment as a reply so I can accept it?

modified 9 months ago • written 9 months ago by loui_

swbarnes2 (United States) wrote, 9 months ago:

It looks like the problem is that you have a question mark in your MD bam tag. I'm not sure that's legal. So why did STAR put it there?

I notice that when I BLAST your sequence, it comes up with a hit that has an N at base 86, right where the MD tag indicates a discrepancy. I wonder if that's the problem: that STAR and/or htseq-count doesn't like having an N in the reference.
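
On the legality question: the SAM optional-fields specification gives the MD value the pattern [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*, so neither a literal ? nor a non-ASCII byte is allowed between the match counts. The sketch below checks a raw record against that pattern. The helper names md_value and md_is_valid are hypothetical, and the record is assumed to be one line of SAM text read as bytes.

```python
import re

# MD value pattern as listed in the SAM optional-fields specification.
MD_PATTERN = re.compile(rb"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")


def md_value(sam_line: bytes):
    """Return the raw value of the MD:Z tag of one SAM record,
    or None if the record has no MD tag."""
    # Optional tags start after the 11 mandatory SAM columns.
    for field in sam_line.rstrip(b"\n").split(b"\t")[11:]:
        if field.startswith(b"MD:Z:"):
            return field[len(b"MD:Z:"):]
    return None


def md_is_valid(value: bytes) -> bool:
    """True if the whole MD value matches the specification pattern."""
    return MD_PATTERN.fullmatch(value) is not None
```

For the record quoted earlier, MD:Z:85?15 fails this check, while 85 + 1 + 15 = 101 agrees with the 101M CIGAR, so the odd character sits exactly where a single reference base letter is expected.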


Good observation, thanks a lot! I checked some other sequences and that seems to be the problem. Nevertheless, I can't come up with a solution. I already allow 6 mismatches in the alignment, but obviously that doesn't change the Ns.

I also checked the reads in IGV, and all the ones I checked are intronic, so I added the option "-m intersection-strict" to htseq-count, but that did not help either. Any other suggestions? Everything is highly appreciated!

written 9 months ago by loui_
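
For files that cannot be realigned, a possible stopgap is to rewrite the MD tag so that reads are not discarded. The sketch below is not something anyone in this thread ran. It assumes each offending byte is a single byte standing in for a reference N (as the BLAST observation suggests), and the function name patch_md and its placeholder parameter are made up for illustration.

```python
def patch_md(sam_line: bytes, placeholder: bytes = b"N") -> bytes:
    """Replace every non-ASCII byte in the MD:Z tag of one SAM record
    with `placeholder` (assumed to be one byte). Header lines and all
    other fields are returned unchanged."""
    if sam_line.startswith(b"@"):
        return sam_line
    newline = b"\n" if sam_line.endswith(b"\n") else b""
    fields = sam_line.rstrip(b"\n").split(b"\t")
    for i, field in enumerate(fields):
        # Optional tags begin at index 11 (after the mandatory columns).
        if i >= 11 and field.startswith(b"MD:Z:"):
            fields[i] = bytes(
                byte if byte < 128 else placeholder[0] for byte in field
            )
    return b"\t".join(fields) + newline
```

Applied line by line to SAM text, this leaves valid records untouched. Whether N is the right base for every affected read would still need checking against the reference.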

As stated above, installing a newer version of STAR (2.7.1a in my case) solved the problem. I assume this is a bug in the older versions!

Thanks for your help.

written 9 months ago by loui_