Question: htseq-count SAM Processing error "can't decode byte 0xa7"
0
gravatar for Glubbdrubb
9 months ago by
Glubbdrubb0
Glubbdrubb0 wrote:

I am getting strange errors from htseq-count when analyzing bam files created by STAR. They are sorted by coordinate (default output by STAR). I am analysing about 18 bam files, and I get the error for about 40% of them.

I get different errors, depending on whether or not I I read it in directly or if it is piped in via samtools view. Here are my examples:

htseq-count -t exon -s no -m union -r pos -f bam mapped_reads/sample1.bam genome.gtf > htseq_count/sample1.count

27414 GFF lines processed.

Error occured when processing SAM input (record #17507 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0xa7 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]

Here is line 17507 from sample1.bam:

FCC4TF9ACXX:8:1308:20606:31534# 403     PRELSG_01_v1    293     1       75M     =       215     -153    TGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGACAAACCA     bb`bb_cccbccb_]T^Z]Z`\Z]eceec`bVgb^`cbd_]eebefffebaaecec]ffaZffhdddgbbechge     NH:i:3  HI:i:2  AS:i:144        nM:i:0  MD:Z:75

I can't see anything strange in that line.

Here is the error I get when I pipe in via samtools

samtools view mapped_reads/sample1.bam | htseq-count -t exon -s no -m union -r pos - genome.gtf > htseq_count/sample1.count

27414 GFF lines processed.

Error occured when processing SAM input (line 17478): 'utf-8' codec can't decode byte 0xa7 in position 8038: invalid start byte [Exception type: UnicodeDecodeError, raised in codecs.py:321]

And here is line 17478:

FCC4TF9ACXX:8:1314:2076:30660#  339     PRELSG_01_v1    287     0       75M     =       286     -76     TAGTATTGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGAC     aaa`_Ta_]Yaaaaaa``_Z__ZGUZZGTFb^\cdd_Y\VVeeeba\MNXNXI[hfeaXOXSa[e[cca^[efae     NH:i:8  HI:i:7  AS:i:143        nM:i:0  MD:Z:75

Can anyone figure this out?

P.S. I realize STAR will output gene counts, but it assumes strandedness, which my data is not, unfortunately.

EDIT: After converting all my fastq files to ASCII-33 encoding using reformat.sh, I reran the pipeline. Many of the new bam files still give me errors. I will post an example here.

Error occured when processing SAM input (record #166582 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0x82 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]

Here are the first 2 records of that bam file:

FCC4MT3ACXX:7:1205:3160:39383#  419     PRELSG_01_v1    2       1       5S70M   =       85      158     ATGGGTATCGATCCTTTATATTTGCAAAATTACGGAATTTATTGCTTACTGTGCATATAGAGGTGTCTGAAAAGT     G:??E:CCDGH;FGGG?BGF?**9<?EHDBFHIGA=GC=CECC)73=A@;.;;@C@CCDCCD@(5553:@CDC##     NH:i:3  HI:i:3  AS:i:141
    nM:i:1  MD:Z:38T31

FCC4MT3ACXX:7:2106:14022:53098# 419     PRELSG_01_v1    2       1       5S70M   =       17      90      ATGGGTATCGATCCTTTATATTTGCAAAATTACGGAATTTATTTCTTACTGTGCATATAGAGGTGTCTGAAAAGT     IIIJJEGGJJJJJJJJJGIJIJIJJGIJJJJJJJJIJJIJJJHHHHHHHFFDFFEFCEEEEEE;?ACCDC@CCC:     NH:i:3  HI:i:3  AS:i:143
    nM:i:0  MD:Z:70

I can't see any illegal characters...

EDIT: Here is record #166582

FCC4MT3ACXX:7:1115:11607:25097# 339     PRELSG_05_v1    617015  1       66M9S   =       616991  -90     TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCACAGGACGA     DDDDD@DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDFHJJJJJIJJIIGJJJIHJJJ     NH:i:4  HI:i:2  AS:i:139        nM:i:0  MD:Z:66

EDIT: As a new user, I am rate limited to 5 posts per hour, which is why I'm replying to your comment here. It is a Plasmodium relictum genome from PlasmoDB. And I can't explain these oddities in the bam file. Especially since most of the samples work fine in htseq-count.

rna-seq htseq • 581 views
ADD COMMENTlink modified 9 months ago • written 9 months ago by Glubbdrubb0
1
gravatar for Bastien Hervé
9 months ago by
Bastien Hervé4.4k
Limoges, CBRS, France
Bastien Hervé4.4k wrote:

I bet you used an old plateform to sequence your data. Old Illumina sequencers were base on a Phred64 score which is not accepted in a lot of recent software

https://www.drive5.com/usearch/manual/quality_score.html

In your quality sequence you have some ` and [ ] which do not exist anymore in ASCII_BASE 33 (Phred33)

I suggest you to transform your fastq from phred64 to phred33 quality, with BBmap for example

See also : Converting Phred64 fastq to Phred33 fastq

ADD COMMENTlink modified 9 months ago • written 9 months ago by Bastien Hervé4.4k

It was an old MiSeq platform, so I'll do as you say and let know.

ADD REPLYlink written 9 months ago by Glubbdrubb0

Please see my edit :-)

ADD REPLYlink written 9 months ago by Glubbdrubb0

I need this record please :

record #166582 in file mapped_reads/sample1.bam

ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

My mistake. Got the position and record mixed up. Here ya go:

FCC4MT3ACXX:7:1115:11607:25097# 339     PRELSG_05_v1    617015  1       66M9S   =       616991  -90     TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCACAGGACGA     DDDDD@DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDFHJJJJJIJJIIGJJJIHJJJ     NH:i:4  HI:i:2  AS:i:139        nM:i:0  MD:Z:66
ADD REPLYlink written 9 months ago by Glubbdrubb0

What was your command line ? And version software ?

ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

htseq version 0.11.0

htseq-count -q -t exon -f bam -r pos -s no -m union mapped_reads/D8_Bird8_42-3p.bam p.relictum/PlasmoDB_genome/PlasmoDB-39_PrelictumSGS1-like.gtf > htseq_count/D8_Bird8_42-3p.count
ADD REPLYlink written 9 months ago by Glubbdrubb0

Try to remove unmapped reads and not primary alignment

samtools view -f 4 -F 256 input.bam

ADD REPLYlink modified 9 months ago • written 9 months ago by Bastien Hervé4.4k

I get an empty result with that. I don't think STAR produces bam files with unmapped reads using the default parameters.

ADD REPLYlink written 9 months ago by Glubbdrubb0

Oups !

samtools view -F 260 input.bam
ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

I did that. The resulting bam file is about a third smaller. But htseq count produced the same "ordinal not in range" error. Here is the corresponding record in the new bam file:

FCC4MT3ACXX:7:2203:9969:30526#  163     PRELSG_06_v1    268502  255     75M     =       268607  258     CCTACATCACCACATTGTCTCATAATATCTTTTAAATGTTGCCATTTGCAATTGTCTGGCAAATTGCTAACAATA     GJGAFGDGIIJJIJHIJHICHIGCHEGGHGIIJJCHGHFHIGIIEEIIHIGDCHEGEEFGHHFFDCDFFEEE>3;     NH:i:1  HI:i:1  AS:i:143        nM:i:1  MD:Z:75
ADD REPLYlink written 9 months ago by Glubbdrubb0

This is weird

nM:i:1

You have a huge space between AS:i:143 and nM:i:1

Plus the 'n' is 'nM' should be in uppercase

ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

What is your data ? What is PRELSG_05_v1 ?

ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

It is a Plasmodium relictum genome from PlasmoDB. And I can't explain these oddities in the bam file. Especially since most of the samples work fine in htseq-count. When I view the bam file in less (piped in from samtools view), all the white spaces are tabs, but not all tabs are the same length. You can see this for yourself by searching for a tab character in less (but don't search for \t, that will just do a literal search for that string).

It is odd that the QUAL field has a lower value character in it. Can you show me the specifications for that field?

EDIT: I checked the bam files for which htseq-count works. It has the variable length tabs and nM in the QUAL fields.

P.S. As a new user with low reputation, I am limited to 5 posts per 6 hours, which is why I didn't reply to your last message until now. If I hit that limit again, I'll try help you out by editing my main question.

ADD REPLYlink modified 9 months ago • written 9 months ago by Glubbdrubb0

Especially since most of the samples work fine in htseq-count

You got htseq-count output results for some bam of your experiment ?

It is odd that the QUAL field has a lower value character in it

What do you mean by has a lower value character ? Put an example please

How many bam are you struggling with ? And how many succeded htseq-count process ?

ADD REPLYlink modified 9 months ago • written 9 months ago by Bastien Hervé4.4k

I meant lower case, not lower value..

I have 18 bam files, 10 of them work without a problem. Just some info about the nature of the source data: The paired-end reads in the FASTQ files are trimmed.

ADD REPLYlink written 9 months ago by Glubbdrubb0

Be aware !

This is Quality :

GJGAFGDGIIJJIJHIJHICHIGCHEGGHGIIJJCHGHFHIGIIEEIIHIGDCHEGEEFGHHFFDCDFFEEE>3;

This are TAGs :

NH:i:1  HI:i:1  AS:i:143        nM:i:1  MD:Z:75

And Quality as well as TAGs can not have lower case character (except the i for TAGs)

This is why the :

nM:i:1

is very strange to me

Bam files with nM:i:1 (in lower case character) succeded htseq-count ?

ADD REPLYlink modified 9 months ago • written 9 months ago by Bastien Hervé4.4k

Ah yes, my mistake. I misread the description of sam fields.

Yup, those bam files worked in htseq-count and htseq-qa (qa can only accept sam files, so I converted them first). Here's an example line from a bam file that worked:

FCC4MT3ACXX:8:1109:8871:93069#  419     PRELSG_01_v1    1       1       5S70M   =       57      131     ATGTGCTTTCGATCCTTTTTATTTTCATAATTACGGAATTTATTTCTTACTGTGCATATAGAGGTGTCTGAAAAG     <4****:*:4)):0)0BG*?)?FB****/)8C8=BF@(;.7)7@=ECCA>>B;@?C77A################     NH:i:3  HI:i:3  AS:i:135        nM:i:4  MD:Z:2A10A5G2A47
ADD REPLYlink modified 9 months ago • written 9 months ago by Glubbdrubb0

Did you moved these bad bam from window to linux or vice versa ?

You could try to focus on a "bad" bam.

For each error you get extract the alignment, copy it in a text file, and re-run till your bam is working or till you get enought alignment to compare them closely (5 to 10 alignments)

Look at the number of tab, space, space that are not space (non-breaking space, ascii 160), upper case, lower case...

Try to upload this alignment text file with 5 to 10 alignments and put the link here please

ADD REPLYlink modified 9 months ago • written 9 months ago by Bastien Hervé4.4k

The bam files were only worked on in Linux, using linux tools.

I wrote a python script that printed the line number of the first line that produces an error. https://pastebin.com/TEki6jhe This produces exactly the same error as htseq-count if I don't catch the exception. When I examined the offending line, I couldn't detect the problem by eye.

But then I used tr to remove ALL non ascii characters:

tr -cd '\1-\176' < D8_Bird8_42-3p.sam > sample.sam

This new file is 54 bytes smaller than the original, and is parsed by my script and htseq-count without error. Although this doesn't explain the cause of the error, it does seem to fix it. What do you think? If you agree, I'll add this as the answer.

P.S. I got the tr trick from https://alvinalexander.com/linux-unix/how-to-use-unix-tr-command-filter-remove-extended-ascii-characters-files

ADD REPLYlink modified 9 months ago • written 9 months ago by Glubbdrubb0

If you can't see the non ascii characters by eye, it's propably the non-breaking line space as ascii 160. Try your tr command and only remove the character 160, to see if you got multiple non ascii character or just this one

ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

What is the octal encoding for ASCII 160 in octal? Cause my tr command requires octal specifications. I assumed 160 is decimal, and according to http://ascii-table.com/calculator.php , 160 dec encodes to 240 octal.

I deleted all of those characters using

tr -d '\240' < D8_Bird8_42-3p.sam > sample.sam

but the resulting file was identical. I think there must be other invisible ASCII characters, or I converted from 160 to 240 incorrectly.

ADD REPLYlink modified 9 months ago • written 9 months ago by Glubbdrubb0

I think that with this command :

tr -cd '\1-\176' < D8_Bird8_42-3p.sam > sample.sam

You removed what is not in 1-176

ADD REPLYlink written 9 months ago by Bastien Hervé4.4k

Yes, but keep in mind those are octal ascii codes, not decimal ascii codes. Removing only 160 (which is 240 octal) did not make any difference to the file.

ADD REPLYlink written 9 months ago by Glubbdrubb0

Try this on a single sam result that failed and compare the two lines

ADD REPLYlink modified 9 months ago • written 9 months ago by Bastien Hervé4.4k

That didn't reveal anything. I think the invisible character is lost when I extract the lines. But then I did a diff of the original and cleaned files:

diff -c sample.sam D8_Bird8_42-3p.sam | less

This reveals the rather odd <82> near the end of the affected lines. Sometimes it even repeats. You can get the whole thing here : https://pastebin.com/dwkr4THJ . In my text editor and browser, it is represented as unprintable character: �

Very strange. I'd love to know why STAR created these.

Once they're "cleaned" I get much better compression ratios with my bam files.

P.S. Yes, I actually typed a '<' then '82' and then '>'. That's not a strange html rendering issue.

ADD REPLYlink modified 9 months ago • written 9 months ago by Glubbdrubb0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1488 users visited in the last hour