HTseq reports missing attribute name
1
0
Entering edit mode
12 months ago
Bjorn • 0

Hello,

I am running this htseq command

htseq-count -r pos -t gene -i gene -s yes -f bam \ 
/Volumes/cachannel/ZebraFinchBrain/CB-4a_genomemapping/sorted_alignmentcb4a.bam \
/Volumes/cachannel/ZebraFinchBrain/GCF_003957565.2/Taeniopygia_guttata.bTaeGut1_v1.p.110.chr.gff3 >  \
/Volumes/cachannel/ZebraFinchBrain/HTSEQ_withautomate/output_counts.txt

However I get this error:

Error processing GFF file (line 75 of file /Volumes/cachannel/ZebraFinchBrain/GCF_003957565.2/Taeniopygia_guttata.bTaeGut1_v1.p.110.chr.gff3):
  Feature gene:ENSTGUG00000013637 does not contain a 'gene' attribute
  [Exception type: ValueError, raised in features.py:387]

But in my gff3 file there is clearly a gene attribute:

1   ensembl gene    34955   72180   .   +   .   ID=gene:ENSTGUG00000013637;Name=DCBLD2;biotype=protein_coding;description=discoidin%2C CUB and LCCL domain containing 2 [Source:NCBI gene%3BAcc:100218192];gene_id=ENSTGUG00000013637;logic_name=ensembl;version=2
1   ensembl mRNA    34955   72180   .   +   .   ID=transcript:ENSTGUT00000014203;Parent=gene:ENSTGUG00000013637;Name=DCBLD2-201;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=ENSTGUT00000014203;version=2

Does anyone know what is going on here?

htseq • 815 views
ADD COMMENT
0
Entering edit mode
12 months ago

it is important to distinguish between the type column and an attribute

the attributes are in last column in the form attribute_name=attribute_value

using those definitions your line does not contain a gene attribute

it only contains a gene_id attribute (for what is worth that is probably what you need in the end)

ADD COMMENT
0
Entering edit mode

Thanks! I was able to get around this error by using feature counts. However, I ended up getting zero reads. This is odd considering that my Bam file had an alignment of about 90%. I think that my .gff3 file and bam file are somehow incompatible.

Here is some of the .bam file. AS you can see it is region CM012081.2

J00138:157:HVCVCBBXX:4:1127:4919:41036  73      CM012081.2      207     60      135M    =       207     0       GGCGTCGCAGACCAAGAGCATAT
GAGTGCATTGGGGTGTTTGTCAGGCGCAATGTTAGAGCCCCATCTCCTCCCTACAGATCCTGTTCTAGTTTCAGTCACAAGGCCAGCCCCAGAGTGCCTGTGCCAGCCCCAG        AAFFFJJJJJJJJJJ
JJJJJJJJJJJFJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJFJFJJJJJJJJJJJJJJJJJJJ7FJJF<JJJFJJJJJJJAFFJJF<JFJ<FJJJJJJFJ        AS:i:0

here is the head of my gff3 file with the alias almost matching the bam file except for the .1 instead of .2

1   ensembl gene    34955   72180   .   +   .   ID=gene:ENSTGUG00000013637;Name=DCBLD2;biotype=protein_coding;description=discoidin%2C CUB and LCCL domain containing 2 [Source:NCBI gene%3BAcc:100218192];gene_id=ENSTGUG00000013637;logic_name=ensembl;version=2
1   ensembl mRNA    34955   72180   .   +   .   ID=transcript:ENSTGUT00000014203;Parent=gene:ENSTGUG00000013637;Name=DCBLD2-201;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=ENSTGUT00000014203;version=2
1   ensembl exon    34955   35066   .   +   .   Parent=transcript:ENSTGUT00000014203;Name=ENSTGUE00000142648;constitutive=1;ensembl_end_phase=1;ensembl_phase=0;exon_id=ENSTGUE00000142648;rank=1;version=2

Here is my output. I do not believe that the gene counts are all zero. This has been a sustained issue for 2 weeks and I am not sure what to do.

Geneid  Chr Start   End Strand  Length  /Volumes/cachannel/ZebraFinchBrain/CB-4a_genomemapping/sorted_alignmentcb4a.bam
gene:ENSTGUG00000013637 1   34955   72180   +   37226   0
gene:ENSTGUG00000013635 1   45219   302771  -   257553  0
gene:ENSTGUG00000020928 1   74112   115790  -   41679   0
gene:ENSTGUG00000013634 1   115954  334268  +   218315  0
gene:ENSTGUG00000027178 1   159330  160474  -   1145    0
gene:ENSTGUG00000013633 1   303920  334270  +   30351   0
gene:ENSTGUG00000013632 1   331197  351031  +   19835   0
gene:ENSTGUG00000013631 1   349213  372968  -   23756   0
ADD REPLY
0
Entering edit mode

your BAM file is mapped against a genome with chromosome called CM012081.2 but your annotation file has the chromosome called 1

the chromosomal names have to match

otherwise, it cannot match a gene to the the alignments

ADD REPLY
0
Entering edit mode

Thank you. I have been trying to build a genome for this GCF file but I keep running into the same error.

.fna genome >NC_044211.2 Taeniopygia guttata isolate Blue55 chromosome 1, bTaeGut1.4.pri, whole genome shotgun sequence AAGCCAGCCATATCAGTATCCCAGCCCGCAGGGATCTGACTGCCACCCAGCAATGGCCAGCCACTATGTGTGCCCCTGTA TCACT

however, I keep getting errors on during the hisat2-build command. That there are no A nucleotides, but when I open my .fna there clearly are. I haven't seen many examples of this on the internet. Any ideas?

Exited GFM loop fchr[A]: 0 fchr[C]: 306050481 fchr[G]: 526435265 fchr[T]: 746725375 fchr[$]: 1052636476 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 355100291 bytes to primary GFM file: prof.1.ht2 Wrote 263159124 bytes to secondary GFM file: prof.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor assert_eq: expected (57130, 0xdf2a) got (57344, 0xe000) ./hgfm.h:2127 Assertion failed: (0), function HGFM, file ./hgfm.h, line 2127. Abort trap: 6

ADD REPLY
0
Entering edit mode

sounds like your genome file is not quite correct, you need to make sure it is in FASTA format all the way through, perhaps get it from a reliable source.

I tried to build the index on a valid data and it worked just fine:

# Install bio
pip install bio

# Fetch the accession number as fasta
bio fetch NC_044211 -format fasta > NC_044211.fa

# The indexing completes with no error
hisat2-build NC_044211.fa NC_044211
ADD REPLY

Login before adding your answer.

Traffic: 1432 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6