Question

cant find gene_id

0

Entering edit mode

4.4 years ago

meisam.radfar • 0

i want to use htseq-count for counting reads of my RNA-seq project

i use GFF3 file ( wheat ) from plantensemble website but when i write this command:

htseq-count -q -i gene_id -f bam X.bam Triticum_aestivum.IWGSC.49.gff3

i see this:

Feature transcript:ENSRNA050013875-T1 does not contain a ‘gene_id’ attribute
[exception type: ValueError, raised in features.py.329]

i use ID=gene but this is not work too.

in my gff3 file i can see:

ID=gene:ENSRNA050013875
gene_id=ENSRNA050013875
ID=transcript:ENSRNA050013875-T1
parent=gene:ENSRNA050013875
transcript_id=ENSRNA050013875-T1
parent=transcript:ENSRNA050013875-T1
Name:ENSRNA050013875-E1
exon_id=ENSRNA050013875-E1

just Name and exon_id work but i dont need them.

RNA-Seq software error • 1.7k views

ADD COMMENT • link 4.4 years ago by meisam.radfar • 0

0

Entering edit mode

what if you use 'ID' as input for -i of htseq-count ?

alternatively you can consider using FeatureCounts in stead of htseq-count, that one is more lenient on those gene-id things (and also faster)

ADD REPLY • link 4.4 years ago by lieven.sterck 15k

0

Entering edit mode

Feature transcript:ENSRNA050013875-T1 does not contain a ‘ID’ attribute [exception type: ValueError, raised in features.py.329]

ADD REPLY • link 4.4 years ago by meisam.radfar • 0

0

Entering edit mode

ok, then likely your GFF file is mal-formatted.

Try to run of the tools from the AGAT package on it to fix it?

AGAT - Another Gff Analysis Toolkit

ADD REPLY • link 4.4 years ago by lieven.sterck 15k

0

Entering edit mode

thank you but i think i have a different problem

my gff3 file has gene_id=ENSRNA050013875 but this error said that transcript: ENSRNA050013875-T1 dose not contain gene_id

i think the problem is -T1

i can see gene_id=ENSRNA050013875 but i cant see gene_id=ENSRNA050013875-T1

my gff3 file:

##gff-version 3
##sequence-region   1A 1 594102056
##sequence-region   1B 1 689851870
##sequence-region   1D 1 495453186
##sequence-region   2A 1 780798557
##sequence-region   2B 1 801256715
##sequence-region   2D 1 651852609
##sequence-region   3A 1 750843639
##sequence-region   3B 1 830829764
##sequence-region   3D 1 615552423
##sequence-region   4A 1 744588157
##sequence-region   4B 1 673617499
##sequence-region   4D 1 509857067
##sequence-region   5A 1 709773743
##sequence-region   5B 1 713149757
##sequence-region   5D 1 566080677
##sequence-region   6A 1 618079260
##sequence-region   6B 1 720988478
##sequence-region   6D 1 473592718
##sequence-region   7A 1 736706236
##sequence-region   7B 1 750620385
##sequence-region   7D 1 638686055
##sequence-region   Un 1 480980714
#!genome-build International Wheat Genome Sequencing Consortium IWGSC
#!genome-version IWGSC
#!genome-date 2018-07
#!genome-build-accession GCA_900519105.1
1A  IWGSC   chromosome  1   594102056   .   .   .   ID=chromosome:1A;Alias=chr1A,LS992080.1
###
1A  Ensembl_Plants  ncRNA_gene  200 1683    .   +   .   ID=gene:ENSRNA050013875;Name=LSU_rRNA_eukarya;biotype=rRNA;description=Eukaryotic large subunit ribosomal RNA;gene_id=ENSRNA050013875;logic_name=rfam_12.2_gene
1A  Ensembl_Plants  rRNA    200 1683    .   +   .   ID=transcript:ENSRNA050013875-T1;Parent=gene:ENSRNA050013875;biotype=rRNA;transcript_id=ENSRNA050013875-T1
1A  Ensembl_Plants  exon    200 1683    .   +   .   Parent=transcript:ENSRNA050013875-T1;Name=ENSRNA050013875-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSRNA050013875-E1;rank=1
###
1A  Ensembl_Plants  ncRNA_gene  5023    6833    .   +   .   ID=gene:ENSRNA050013847;Name=SSU_rRNA_eukarya;biotype=rRNA;description=Eukaryotic small subunit ribosomal RNA;gene_id=ENSRNA050013847;logic_name=rfam_12.2_gene
1A  Ensembl_Plants  rRNA    5023    6833    .   +   .   ID=transcript:ENSRNA050013847-T1;Parent=gene:ENSRNA050013847;biotype=rRNA;transcript_id=ENSRNA050013847-T1
1A  Ensembl_Plants  exon    5023    6833    .   +   .   Parent=transcript:ENSRNA050013847-T1;Name=ENSRNA050013847-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSRNA050013847-E1;rank=1
###

my x.sam file:

@HD VN:1.0  SO:coordinate
@SQ SN:1A   LN:594102056
@SQ SN:1B   LN:689851870
@SQ SN:1D   LN:495453186
@SQ SN:2A   LN:780798557
@SQ SN:2B   LN:801256715
@SQ SN:2D   LN:651852609
@SQ SN:3A   LN:750843639
@SQ SN:3B   LN:830829764
@SQ SN:3D   LN:615552423
@SQ SN:4A   LN:744588157
@SQ SN:4B   LN:673617499
@SQ SN:4D   LN:509857067
@SQ SN:5A   LN:709773743
@SQ SN:5B   LN:713149757
@SQ SN:5D   LN:566080677
@SQ SN:6A   LN:618079260
@SQ SN:6B   LN:720988478
@SQ SN:6D   LN:473592718
@SQ SN:7A   LN:736706236
@SQ SN:7B   LN:750620385
@SQ SN:7D   LN:638686055
@SQ SN:Un   LN:480980714
@PG ID:TopHat   VN:2.1.1    CL:/usr/local/bin/tophat-2.1.1.Linux_x86_64/tophat -G Triticum_aestivum.IWGSC.49.gff3 Triticum_aestivum 19358_1_1P.fq 19358_1_2P.fq
@PG ID:samtools PN:samtools PP:TopHat   VN:1.11 CL:samtools view -h accepted_hits.bam
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.11 CL:samtools view -h 19358.sam
A00783:167:HTLJ3DSXX:3:1105:3839:19476  419 1A  540 0   135M    =   558 153 CGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTAACGAGATTCCCACTGTCCCTGTCTACTATCCAGCGAAACCACAGCCAAGGGAAAGGGCTTGG F::FFF:FFFFFFFFFFFFFF,FFFFFF:FFFFFFF,FFF:,FFF,:FFFFFF,FFF,,FFFFFFFF,FFFFFFFFFFF:F:F,FF:F,:FFFF:FFFFFFFFFF,::FFFFFF,FFF:FF:,:F:,FFFFFF:F AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:126C8  YT:Z:UU XS:A:+  NH:i:5  CC:Z:1B CP:i:544586970  HI:i:0
A00783:167:HTLJ3DSXX:3:2124:11532:32957 161 1A  540 0   135M    Un  394611870   0   CGGGAGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTAACGAGATTCCCACTGTCCCTGTCTACTATCCAGCGAAACCACAGCCAAGGGAACGGGCTTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:135    YT:Z:UU XS:A:+  NH:i:20 CC:Z:=  CP:i:540    HI:i:0

ADD REPLY • link updated 4.4 years ago by GenoMax 152k • written 4.4 years ago by meisam.radfar • 0

1

Entering edit mode

Try to re-run your htseq-count command but use the -t parameter to indicate which type of feature (third column in GFF) you want to count. This usually is something like 'exon' , 'gene' , 'mRNA' ... in your case perhaps 'ncRNA_gene' . Check the htseq-count manual for details.

ADD REPLY • link 4.4 years ago by lieven.sterck 15k

0

Entering edit mode

yes this is true way to use both of -t and -i:

htseq-count -q -t gene -i ID -f bam x.sorted.bam x.gff3 > result.txt

because of difference GFF3 and GTF files

ADD REPLY • link 4.4 years ago by meisam.radfar • 0

0

Entering edit mode

did you process this GFF file in some way?

from what you posted that does not look like a GFF file (the formatting that is) . Did you open it in windows/DOS or such?

ADD REPLY • link 4.4 years ago by lieven.sterck 15k

0

Entering edit mode

no i open and see gff3 file by "gedit Triticum.ae...49.gff3" on ubuntu terminal

ADD REPLY • link 4.4 years ago by meisam.radfar • 0

0

Entering edit mode

Can you post the exact full lines from your GFF3 file of the ENSRNA050013875 gene/transcript ?

ADD REPLY • link 4.4 years ago by lieven.sterck 15k