GTF/GFF file for feature count
7.1 years ago

I am using a GFF file with featureCounts to produce counts for an RNA-Seq analysis of a non-model organism, but I am unable to get proper counts. The assembly is not good, and the GFF looks like this:

  #!genome-build RproC3                                                           
  #!genome-version RproC3                                                         
  #!genome-date 2015-04                                                           
  #!genome-build-accession GCA_000181055.3                                                                
KQ034291        VectorBase      gene    36335   45838   0       +       0       gene_id "RPRC000679";"
KQ034291        VectorBase      transcript      36335   45838   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA";"
KQ034291        VectorBase      exon    36335   36356   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "1";"
KQ034291        VectorBase      CDS     36335   36356   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "1";"
KQ034291        VectorBase      exon    40565   40684   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "2";"
KQ034291        VectorBase      CDS     40565   40684   0       +       2       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "2";"
KQ034291        VectorBase      exon    40763   40941   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "3";"
KQ034291        VectorBase      CDS     40763   40941   0       +       2       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "3";"
KQ034291        VectorBase      exon    45833   45838   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "4";"
KQ034291        VectorBase      CDS     45833   45835   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "4";"
KQ034291        VectorBase      stop_codon      45836   45838   0       +       0       gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "4";"
KQ034291        VectorBase      gene    48738   55400   0       -       0       gene_id "RPRC003242";"
KQ034291        VectorBase      transcript      48738   55400   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA";"
KQ034291        VectorBase      exon    55216   55400   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "1";"
KQ034291        VectorBase      CDS     55216   55289   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "1";"
KQ034291        VectorBase      start_codon     55287   55289   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "1";"
KQ034291        VectorBase      exon    53297   53592   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "2";"
KQ034291        VectorBase      CDS     53297   53592   0       -       1       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "2";"
KQ034291        VectorBase      exon    52421   52605   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "3";"
KQ034291        VectorBase      CDS     52421   52605   0       -       2       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "3";"
KQ034291        VectorBase      exon    51858   51907   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "4";"
KQ034291        VectorBase      CDS     51858   51907   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "4";"
KQ034291        VectorBase      exon    51146   51248   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "5";"
KQ034291        VectorBase      CDS     51146   51248   0       -       1       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "5";"
KQ034291        VectorBase      exon    50189   50352   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "6";"
KQ034291        VectorBase      CDS     50189   50352   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "6";"
KQ034291        VectorBase      exon    48738   48965   0       -       0       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "7";"
KQ034291        VectorBase      CDS     48884   48965   0       -       1       gene_id "RPRC003242"; transcript_id "RPRC003242-RA"; exon_number "7";

"

Here the ID in the first column is the same for all of the genes, so the count file contains the ID "KQ034291" repeatedly and nothing else. I would like a GTF/GFF file keyed by gene names such as RPRC000679, RPRC003242 and so on, so that I get unique per-gene counts. Is there a way to do this?

RNA-Seq genecounts GFF/GTF • 9.2k views

First column should refer to chromosome name, which in your case seems to be KQ034291. I am not sure why you have (line numbers?) before that name. Where did you acquire this file from?
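A quick way to see which names the first column actually contains is sketched below. The helper name `gtf_seqnames` is made up for illustration, and it assumes a tab-separated file whose header lines start with `#` (possibly indented, as in the excerpt above):

```shell
# gtf_seqnames is a hypothetical helper, not part of any toolkit.
# It prints the distinct sequence names found in column 1 of a GTF/GFF,
# reading the named file or standard input when no file is given.
gtf_seqnames() {
    grep -v '^[[:space:]]*#' "$@" |  # drop header/comment lines
        cut -f 1 |                   # keep only the first tab-separated column
        sort -u                      # one line per distinct name
}
```

Running `gtf_seqnames annotation.gtf` (file name assumed) gives a list that can be compared by eye with the `@SQ` names in the alignment header, for example from `samtools view -H aligned.bam` if samtools is installed.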


I am not sure either, but the file was downloaded from a database. I can get rid of that, though. But can I have the gene name instead of the scaffold ID in the first column?


You can, but then the file will no longer be in GTF/GFF format. featureCounts should understand the gene_id attribute in the file you posted.


Yes, it will recognise it, as the sequences used for the alignment have the same gene_id. So I want to know how to do that.


Only after you fix the first column (chromosome names need to match your alignment file). Have you looked at the manual/in-line help for featureCounts? The two options you want to pay attention to are

  -t <string>         Specify feature type in GTF annotation. `exon' by 
                      default. Features used for read counting will be 
                      extracted from annotation using the provided value.

  -g <string>         Specify attribute type in GTF annotation. `gene_id' by 
                      default. Meta-features used for read counting will be 
                      extracted from annotation using the provided value.
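
For context, here is a minimal sketch of how those two options might fit into a full command. The wrapper name `count_by_gene` and all file names are placeholders, and it assumes the contig names in the BAM file match column 1 of the GTF:

```shell
# count_by_gene is a hypothetical wrapper, not part of featureCounts.
# Arguments: 1 = GTF annotation, 2 = BAM alignments, 3 = output counts file.
count_by_gene() {
    # -a: annotation file          -o: output counts table
    # -t exon: count reads that overlap exon features
    # -g gene_id: group those exons into one row per gene_id value
    featureCounts -a "$1" -o "$3" -t exon -g gene_id "$2"
}
```

Called as `count_by_gene annotation.gtf aligned.bam counts.txt`, this should label the rows of the count table with gene_id values such as RPRC000679 rather than with scaffold names.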

I am aware of the two options you mentioned. I have edited the GTF file shown above, but I get the following warning when running featureCounts, and no output file is produced:

Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id' 
The attributes included in your GTF annotation are 'gene_id "RPRC000679"; transcript_id "RPRC000679-RA"; exon_number "1";"' 

||    Features : 91569                                                        ||
||    Meta-features : 1                                                       ||
||    Chromosomes/contigs : 16843                                             ||
||

According to this, the 9th column has a problem, but that does not seem to be the case. I also ran cut -f 9 *.gtf, and here is the output:

gene_id "RPRC009988";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "1";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "1";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "1";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "2";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "2";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "3";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "3";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "4";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "4";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "5";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "5";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "6";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "6";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "7";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "7";"
gene_id "RPRC009988"; transcript_id "RPRC009988-RA"; exon_number "7";"

So I have no clue what is going wrong here. Any idea?


Closing a post is not an appropriate action when a question has been answered (generally mods use that action to close posts deemed inappropriate, duplicate, etc.). Instead, you should accept an answer (green check mark) to indicate that the question has been answered. I have moved @Devon's post to an answer.

7.1 years ago

All of your lines end with an extra ". Try removing it.
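
Trailing characters like this are easy to miss by eye, so a rough check can help. The sketch below uses a made-up helper name, `flag_bad_attr_lines`, and assumes a tab-separated file in which a well-formed attribute column ends with a semicolon:

```shell
# flag_bad_attr_lines is a hypothetical helper, not part of featureCounts.
# It reports data lines whose 9th (attribute) column does not end in ";",
# showing the last few characters in brackets so stray quotes or spaces
# become visible.
flag_bad_attr_lines() {
    awk -F '\t' '
        /^[[:space:]]*#/ { next }        # skip header/comment lines
        NF >= 9 && $9 !~ /;$/ {          # attribute column should end in ";"
            printf "line %d ends with: [%s]\n", NR, substr($9, length($9) - 4)
        }
    ' "$@"
}
```

If `flag_bad_attr_lines foo.gtf` prints nothing, the attribute column at least ends the way this check expects. Any lines it does print show what needs to be stripped.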


Thanks, Devon. Along with the " there was also a wide space at the end of each line. After removing both, it worked.


Hi, I have a similar problem with my GTF file. featureCounts is giving the error "failed to find the gene identifier attribute in the 9th column of the provided GTF file." Could you please tell me how to remove the " and the wide space? Arumoy.

sed "s/''$//" foo.gtf > foo.fixed.gtf

Note that the middle ticks are two apostrophes, not a ". Assuming you have EXACTLY the problem faced in the original post then that will fix it.
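
If the file differs slightly from that case (a real " instead of two apostrophes, or trailing spaces as mentioned earlier in the thread), a more forgiving variant might look like the sketch below. The helper name `clean_gtf` is made up, and the pattern only touches a quote that directly follows the final semicolon, so the closing quote of a real attribute value should be left alone:

```shell
# clean_gtf is a hypothetical helper, not part of featureCounts.
# It reads the named file (or standard input) and writes cleaned lines
# to standard output.
clean_gtf() {
    sed -E \
        -e 's/[[:space:]]+$//' \
        -e "s/;[[:space:]]*(\"|'')\$/;/" \
        "$@"
    # 1st expression: drop trailing whitespace (spaces, tabs, carriage returns).
    # 2nd expression: drop a stray " or '' that follows the last semicolon.
}
```

Usage would be along the lines of `clean_gtf foo.gtf > foo.fixed.gtf`, followed by a quick look at the last few characters of some lines before rerunning featureCounts.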


Thanks a lot, Devon. I really appreciate your help.
