Question

htseq-count error utf-8 codec can't decode

0

Entering edit mode

5.7 years ago

alexadeanfitz ▴ 10

Hello,

I am trying to run htseq-count on bam files against a reference bacterial genome in gtf format, downloaded from Ensembl

The RNA-Seq reads were single-end fastq files, I performed BWA-MEM using Galaxy which generated the bam files. I ran flagstat and got a high percentage of mapped reads, and visualized the alignment using IGV. If I am not wrong, I believe the files are already indexed/sorted by Galaxy. I downloaded the .bam (and associated .bai files to the same directory), then used these files to perform ht-seq against my reference genome (I downloaded the gtf file, and made sure chromosome identifiers matched, and removed the header as I saw previously suggested in other answers here).

This is the first few lines of my reference genome.gtf

NC_008463.1 PseudoCAP   CDS 483 2027    .   +   0   gene_id "A0K_RS00005"; transcript_id "5841281"; locus_tag "A0K_RS00005"; name "chromosomal replication initiator protein DnaA"; replicon_xref "NZ_AAQW01000001"
NC_008463.1 PseudoCAP   CDS 2056    3159    .   +   0   gene_id "A0K_RS00010"; transcript_id "50342"; locus_tag "A0K_RS00010"; name "DNA polymerase III subunit beta"; replicon_xref "NZ_AAQW01000001"
NC_008463.1 PseudoCAP   CDS 3169    4278    .   +   0   gene_id "A0K_RS00015"; transcript_id "50345"; locus_tag "A0K_RS00015"; name "recombinase RecF"; replicon_xref "NZ_AAQW01000001"

This is the first few lines of one of my BWA-MEM generated .bam files

@HD VN:1.3 SO:coordinate
@SQ SN:NC_008463.1 LN:6537648
@RG ID:A30_SRR2288354.fastq SM:A30_SRR2288354.fastq PL:ILLUMINA LB:SRP062593vPA14 DS:BWA-MEM on SRP062593vPA14 DT:2018-08-13
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 6 -v 1 -R @RG\tID:A30_SRR2288354.fastq\tSM:A30_SRR2288354.fastq\tPL:ILLUMINA\tLB:SRP062593vPA14\tDS:BWA-MEM on SRP062593vPA14\tDT:2018-08-13 localref.fa /galaxy-repl/main/files/026/475/dataset_26475523.dat
SRR2288354.45585072 2048    NC_008463.1 1   60  67H33M  *   0   0   TTTAAAGAGACCGGCGATTCTAGTGAAATCGAA   DDCB@@C@CDDDDDDDBDDA>:@ACC>@>:?<C   NM:i:0 MD:Z:33 AS:i:33 XS:i:0 RG:Z:A30_SRR2288354.fastq SA:Z:NC_008463.1,6537582,+,67M33S,60,0;
SRR2288354.45612256 0   NC_008463.1 1   60  33S47M  *   0   0   CTGGGTTGACGACTTGAGGTCGCAGTGACCCCGTTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATT    CCCFFFFFGHHGHJIJJJJFHIIJIGHIIJIIIGHHIJJJJIIIJJJHFDDDCCDDCCDEDDDDDDDDBBDDBDACCCDE    NM:i:0 MD:Z:47 AS:i:47 XS:i:0 RG:Z:A30_SRR2288354.fastq SA:Z:NC_008463.1,6537616,+,33M47S,60,0;
SRR2288354.2322100  16  NC_008463.1 1   60  21S60M  *   0   0   CTTGAGGTCGCAGTGACCCCGTTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGAT   DDDDDDDDDDDDDDDDDDDFFFFHHHHHHIIJJJJHJIIJJJJJJJJJJJJJJJJJJJJIIIIJJIJJHHHHHFFFFFBBB   NM:i:0 MD:Z:60 AS:i:60 XS:i:0 RG:Z:A30_SRR2288354.fastq

The MAPQ scores of all the reads are 60. The FLAGs alternate between 0, 16, 2048 and 2064. The POS increases as you move down the file, however there are multiple reads that have the same POS value.

Below is my attempt at performed htseq-count, and the error I received

(ProjectLearn) Maxwells-iMac:Alexa maxwelllab$ htseq-count -m union -s no -r pos -t CDS -i gene_id --nonunique all -f bam /Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq\ /SRP062593_A30_2.bam /Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq\ /PA14_reference_NC_2.gtf > SRP062593_A30_2.counts
Error occured when processing GFF file (line 8700 of file /Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq /PA14_reference_NC_2.gtf):
  'utf-8' codec can't decode byte 0xa0 in position 1259: invalid start byte
  [Exception type: UnicodeDecodeError, raised in codecs.py:321]

I am very new to bioinformatics and would appreciate any advice as to how to move forward. I am not sure why it reads "error when processing GFF file" as I input a GTF file (which I read is recommended for htseq-count). Is there something I am missing to define the reference genome as GTF? Also, in regards to the UnicodeDecode error, I know the default is utf-8, should I be using something else to run htseq-count? Is there additional filtering steps of the bam files that I am missing?

Thank you!

Alexa

RNA-Seq htseq-count • 2.6k views

ADD COMMENT • link 5.7 years ago by alexadeanfitz ▴ 10

1

Entering edit mode

There's probably a special character on line 8700 of the GTF file, have a look around there.

ADD REPLY • link 5.7 years ago by Devon Ryan 104k

0

Entering edit mode

I used linecache to print that line (8700), as well as the line just before to see if it is reading the file correctly however I only get a " symbol

>>> import linecache
>>>linecache.getline("/Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq/PA14_reference_NC_2.gtf", 8700)
''
>>>linecache.getline("/Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq/PA14_reference_NC_2.gtf", 8699)
''

When I count the lines in the file, it actually only goes up to 8699 then gives me the same error I saw with htseq-count

...
8697
8698
8699
Traceback (most recent call last):
  File "<pyshell#54>", line 2, in <module>
    for line in test:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1259: invalid start byte

ADD REPLY • link updated 5.7 years ago by Devon Ryan 104k • written 5.7 years ago by alexadeanfitz ▴ 10

0

Entering edit mode

No one has tested anything in python 3.7, try using 3.6 or another non-bleeding edge version.

ADD REPLY • link 5.7 years ago by Devon Ryan 104k

0

Entering edit mode

I am using python 3.6

ADD REPLY • link 5.7 years ago by alexadeanfitz ▴ 10

score 1 · Accepted Answer · 2018-08-21

Error resolved for anyone having the same issue:

I had downloaded the gff3 file from Ensembl bacteria and converted it to gtf using Galaxy. I did this because at first on Ensembl when I searched the reference genome I wanted, I did not see an option for a gtf file however I was able to go back to this: http://bacteria.ensembl.org/info/website/ftp/index.html and they should have gtf files for any common reference genomes, I downloaded this gtf file and htseq-count is processing just fine

Alexa