Question: htseq-count error Feature does not contain a 'gene_id' attribute
1
gravatar for madzayasodara
2.9 years ago by
UCLA
madzayasodara10 wrote:

I'm trying to do read count using htseq-count

htseq-count --mode=union --stranded=no --idattr=gene_id -f sam ./filteredfirstZBND1X_gff.sam ./filteredsecZBND1X_gff.sam /home/madzays/data/finc h_data/ZF_transcriptome/taeGut2_gff/taeGut2.gff > ZBND1X_gff_counts.sam

but get the error

Error occured when processing GFF file (line 11 of file  /home/madzays/data/finch_data/ZF_transcriptome/taeGut2_gff/taeGut2.gff):
Feature id1 does not contain a 'gene_id' attribute   [Exception type:
ValueError, raised in count.py:77]

this is what my gff file (from NCBI) looks like

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Taeniopygia_guttata-3.2.4
#!genome-build-accession NCBI_Assembly:GCF_000151805.1
##sequence-region NC_011462.1 1 118548696
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=59729
NC_011462.1     RefSeq  region  1       118548696       .       +       .       ID=id0;Dbxref=taxon:59729;Name=1;chromosome=1;gbkey=Src;genome=chromosome;isolate=Black17;mol_type=genomic DNA;sex=male
NC_011462.1     Gnomon  gene    17853   28371   .       +       .       ID=gene0;Dbxref=GeneID:100227449;Name=CLIC6;gbkey=Gene;gene=CLIC6;gene_biotype=protein_coding;partial=true;start_range=.,17853
NC_011462.1     Gnomon  mRNA    17853   28371   .       +       .       ID=rna0;Parent=gene0;Dbxref=GeneID:100227449,Genbank:XM_002186596.2;Name=XM_002186596.2;gbkey=mRNA;gene=CLIC6;partial=true;product=chloride intracellular channel 6;start_range=.,17853;transcript_id=XM_002186596.2
rna-seq read count gff htseq • 4.9k views
ADD COMMENTlink modified 2.9 years ago by WouterDeCoster44k • written 2.9 years ago by madzayasodara10
  1. Try to convert gff to gtf and feed it to htseq-count. try searching for gffread util.
  2. Shift to featurecounts.
ADD REPLYlink written 2.2 years ago by cpad011214k
2
gravatar for WouterDeCoster
2.9 years ago by
Belgium
WouterDeCoster44k wrote:

Yes. There is no gene_id attribute in your gff. The error is pretty self-explanatory in that sense.

Try with --idattr=gene. Based on the few lines you posted that attribute is present.

ADD COMMENTlink written 2.9 years ago by WouterDeCoster44k

K it's doing its thing now. thanks!

ADD REPLYlink written 2.9 years ago by madzayasodara10

Oh wait it run into a line without that attribute. Do you know if NCBI gffs have any "must have" attribute I could use? thanks

ADD REPLYlink written 2.9 years ago by madzayasodara10
1

Nope. You can print out the line which caused the problem using e.g.
sed -n 26p taeGut2.gff for line 26, then try to figure what that line does have in common with the rest.

ADD REPLYlink written 2.9 years ago by WouterDeCoster44k

Yep it runs for "gbkey". However it outputs

-bash-4.2$ head -50 ZBND1X_gff_counts_reverse.sam
C_region        0       0
V_segment       0       0
exon    0       0
mRNA    0       0
misc_RNA        0       0
ncRNA   0       0
precursor_RNA   0       0
rRNA    0       0
tRNA    0       0
__no_feature    21536704        15508244
__ambiguous     0       0
__too_low_aQual 0       0
__not_aligned   6233570 4564356
__alignment_not_unique  2340806 1668760

I think it has to do with the gtf file having these RefSeq ids instead of chromosome locations (e.g. chr1) Any ideas how could I sub them?

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by madzayasodara10

Perhaps ensembl has a gff for your organism?

ADD REPLYlink written 2.9 years ago by WouterDeCoster44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1357 users visited in the last hour