Question: htseq-count SAM Processing error "can't decode byte 0xa7"
0
gravatar for Glubbdrubb
4 days ago by
Glubbdrubb0
Glubbdrubb0 wrote:

I am getting strange errors from htseq-count when analyzing bam files created by STAR. They are sorted by coordinate (default output by STAR). I am analysing about 18 bam files, and I get the error for about 40% of them.

I get different errors, depending on whether or not I I read it in directly or if it is piped in via samtools view. Here are my examples:

htseq-count -t exon -s no -m union -r pos -f bam mapped_reads/sample1.bam genome.gtf > htseq_count/sample1.count

27414 GFF lines processed.

Error occured when processing SAM input (record #17507 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0xa7 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]

Here is line 17507 from sample1.bam:

FCC4TF9ACXX:8:1308:20606:31534# 403     PRELSG_01_v1    293     1       75M     =       215     -153    TGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGACAAACCA     bb`bb_cccbccb_]T^Z]Z`\Z]eceec`bVgb^`cbd_]eebefffebaaecec]ffaZffhdddgbbechge     NH:i:3  HI:i:2  AS:i:144        nM:i:0  MD:Z:75

I can't see anything strange in that line.

Here is the error I get when I pipe in via samtools

samtools view mapped_reads/sample1.bam | htseq-count -t exon -s no -m union -r pos - genome.gtf > htseq_count/sample1.count

27414 GFF lines processed.

Error occured when processing SAM input (line 17478): 'utf-8' codec can't decode byte 0xa7 in position 8038: invalid start byte [Exception type: UnicodeDecodeError, raised in codecs.py:321]

And here is line 17478:

FCC4TF9ACXX:8:1314:2076:30660#  339     PRELSG_01_v1    287     0       75M     =       286     -76     TAGTATTGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGAC     aaa`_Ta_]Yaaaaaa``_Z__ZGUZZGTFb^\cdd_Y\VVeeeba\MNXNXI[hfeaXOXSa[e[cca^[efae     NH:i:8  HI:i:7  AS:i:143        nM:i:0  MD:Z:75

Can anyone figure this out?

P.S. I realize STAR will output gene counts, but it assumes strandedness, which my data is not, unfortunately.

rna-seq • 64 views
ADD COMMENTlink modified 4 days ago by Bastien Hervé2.7k • written 4 days ago by Glubbdrubb0
1
gravatar for Bastien Hervé
4 days ago by
Bastien Hervé2.7k
Limoges, CBRS, France
Bastien Hervé2.7k wrote:

I bet you used an old plateform to sequence your data. Old Illumina sequencers were base on a Phred64 score which is not accepted in a lot of recent software

https://www.drive5.com/usearch/manual/quality_score.html

In your quality sequence you have some ` and [ ] which do not exist anymore in ASCII_BASE 33 (Phred33)

I suggest you to transform your fastq from phred64 to phred33 quality, with BBmap for example

See also : Converting Phred64 fastq to Phred33 fastq

ADD COMMENTlink modified 4 days ago • written 4 days ago by Bastien Hervé2.7k

It was an old MiSeq platform, so I'll do as you say and let know.

ADD REPLYlink written 4 days ago by Glubbdrubb0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1520 users visited in the last hour