Question

featureCount error - cannot identify gene_id

0

Entering edit mode

4.4 years ago

seung.ho.steven.choi • 0

Hello all. I am trying to assess differential exon usage using DEXSeq on two datasets - one published in human cells and my own RNA-seq dataset from a mouse cell line. The human cell dataset serves as a positive control to make sure the pipeline is working as expected, with changes in exon usage that I should be able to recapitulate.

My strategy was to use featureCounts (Rsubread) in order to get count data for each exon. I generated the flattened gtf using the python script provided at https://github.com/vivekbhr/Subread_to_DEXSeq for both mouse and human .gtf formatted gene annotations downloaded from ensembl (most compatible with DEXSeq).

I used the following loop script:

module load Python/3.6.4
for X in Homo_sapiens.GRCh38.98.chr Mus_musculus.GRCm38.98.chr
do
python /hpchome/meyerlab/sc486/applications/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py -f $X.DEXSeq.gtf $X.gtf $X.DEXSeq.gff
done

and got these two "flattened" gtf files as output - one for mouse and one for human.

Mouse - GRCm38.98

sc486@dcc-core-51  /work/sc486/transcriptomes $ head Mus_musculus.GRCm38.98.chr.DEXSeq.gff
1   dexseq_prepare_annotation2.py   aggregate_gene  3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693"
1   dexseq_prepare_annotation2.py   exonic_part 3073253 3074322 .   +   .   gene_id "ENSMUSG00000102693"; transcripts "ENSMUST00000193812"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   aggregate_gene  3102016 3102125 .   +   .   gene_id "ENSMUSG00000064842"
1   dexseq_prepare_annotation2.py   exonic_part 3102016 3102125 .   +   .   gene_id "ENSMUSG00000064842"; transcripts "ENSMUST00000082908"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   aggregate_gene  3205901 3671498 .   -   .   gene_id "ENSMUSG00000051951"
1   dexseq_prepare_annotation2.py   exonic_part 3205901 3206522 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000162897"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   exonic_part 3206523 3207317 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000162897"; exonic_part_number "002"
1   dexseq_prepare_annotation2.py   exonic_part 3213439 3213608 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265"; exonic_part_number "003"
1   dexseq_prepare_annotation2.py   exonic_part 3213609 3214481 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000162897"; exonic_part_number "004"
1   dexseq_prepare_annotation2.py   exonic_part 3214482 3215632 .   -   .   gene_id "ENSMUSG00000051951"; transcripts "ENSMUST00000159265+ENSMUST00000070533+ENSMUST00000162897"; exonic_part_number "005"

Human - GRCh38.98

sc486@dcc-core-51  /work/sc486/transcriptomes $ head Homo_sapiens.GRCh38.98.chr.DEXSeq.gff
1   dexseq_prepare_annotation2.py   aggregate_gene  11869   14409   .   +   .   gene_id "ENSG00000223972"
1   dexseq_prepare_annotation2.py   exonic_part 11869   12009   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "001"
1   dexseq_prepare_annotation2.py   exonic_part 12010   12057   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "002"
1   dexseq_prepare_annotation2.py   exonic_part 12058   12178   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "003"
1   dexseq_prepare_annotation2.py   exonic_part 12179   12227   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "004"
1   dexseq_prepare_annotation2.py   exonic_part 12613   12697   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "005"
1   dexseq_prepare_annotation2.py   exonic_part 12698   12721   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "006"
1   dexseq_prepare_annotation2.py   exonic_part 12975   13052   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000450305"; exonic_part_number "007"
1   dexseq_prepare_annotation2.py   exonic_part 13221   13374   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328+ENST00000450305"; exonic_part_number "008"
1   dexseq_prepare_annotation2.py   exonic_part 13375   13452   .   +   .   gene_id "ENSG00000223972"; transcripts "ENST00000456328"; exonic_part_number "009"

The issue arises when I try to use the featureCount function with these gtfs on their respective datasets. Both datasets were processed from fastq to sam (HISAT2) and then to bam (samtools sort function) in the exact same way (same jobs/scripts just with different indices for mouse/human genomes).

However, only the mouse dataset successfully gives back counts. The error that it spits out for human dataset is as follows:

command for featureCounts on the human dataset:

featureCounts -f -O --fracOverlap 0.1 -p -s 1 -T 2 -F GTF -t 'exonic_part' -a Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf -o /work/sc486/Liu2017-hnRNPG_RNAseq/featurecount.output SRR2661793.sorted.bam SRR2661794.sorted.bam SRR2661795.sorted.bam SRR2661796.sorted.bam SRR2661797.sorted.bam SRR2661798.sorted.bam SRR2661799.sorted.bam SRR2661800.sorted.bam SRR2661801.sorted.bam SRR2661802.sorted.bam SRR2661803.sorted.bam SRR2661804.sorted.bam

outputted error:

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.5

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 12 BAM files                                     ||
||                           o SRR2661793.sorted.bam                          ||
||                           o SRR2661794.sorted.bam                          ||
||                           o SRR2661795.sorted.bam                          ||
||                           o SRR2661796.sorted.bam                          ||
||                           o SRR2661797.sorted.bam                          ||
||                           o SRR2661798.sorted.bam                          ||
||                           o SRR2661799.sorted.bam                          ||
||                           o SRR2661800.sorted.bam                          ||
||                           o SRR2661801.sorted.bam                          ||
||                           o SRR2661802.sorted.bam                          ||
||                           o SRR2661803.sorted.bam                          ||
||                           o SRR2661804.sorted.bam                          ||
||                                                                            ||
||             Output file : featurecount.output                              ||
||                 Summary : featurecount.output.summary                      ||
||              Annotation : Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf (GTF)      ||
||      Dir for temp files : /work/sc486/Liu2017-hnRNPG_RNAseq                ||
||                                                                            ||
||                 Threads : 2                                                ||
||                   Level : feature level                                    ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : counted                                          ||
||   Min overlapping bases : 1                                                ||
||   Min overlapping frac. : 10.0% to reads                                   ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file Homo_sapiens.GRCh38.98.chr.DEXSeq.gtf ...             ||

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id' 
An example of attributes included in your GTF annotation is 'gene_id "ENSG00000179818+ENSG00000244617"; transcripts "ENST00000659305+ENST00000413069+ENST00000664754+ENST00000625788+ENST00000628659+ENST00000630685+ENST00000664014+ENST00000449178+ENST00000628147+ENST00000653289+ENST00000628322+ENST00000596028+ENST00000626515+ENST00000627050+ENST00000626509+ENST00000653606+ENST00000630522+ENST00000419963+ENST00000629485+ENST00000625518+ENST00000629184+ENST00000419542+ENST00000416395+ENST00000596573+ENST00000626842+ENST00000670313+ENST00000664533+ENST00000629605+ENST00000657291+ENST00000630520+ENST00000661703+ENST00000628374+ENST00000626558+ENST00000628667+ENST00000670785+ENST00000658454+ENST00000662425+ENST00000630280+ENST00000653064+ENST00000627327+ENST00000654958+ENST00000625490+ENST00000421255+ENST00000612202+ENST00000458698+ENST00000422515+ENST00000425601+ENST00000614826+ENST00000609543+ENST00000655420+ENST00000435880+ENST00000630377+ENST00000627325+ENST00000625864+ENST00000627189+ENST00000439670+ENST00000628305+ENST00000665838+ENST00000661863+ENST00000420309+ENST00000599673+ENST00000654878+ENST00000667307+ENST00000423402+ENST00000668191+ENST00000594376+ENST00000659586+ENST00000628837+ENST00000669570+ENST00000425333+ENST00000666167+ENST00000600002+ENST00000626609+ENST00000653811+ENST00000458686+ENST00000631107+ENST00000629943+ENST00000429599+ENST00000653596+ENST00000669896+ENST00000452431+ENST00000626162+ENST00000599427+ENST00000629415+ENST00000660404+ENST00000659941+ENST00000657575+ENST00000656326+ENST00000626898+ENST00000630975+ENST00000630746+ENST00000670036+ENST00000629825+ENST00000630214+ENST00000661989+ENST00000626370+ENST00000668711+ENST00000662103+ENST00000629603+ENST00000629836+ENST00000627918+ENST00000628263+ENST00000457076+ENST00000655481+ENST00000654601+ENST00000627155+ENST00000660903+ENST00000667952+ENST00000601396+ENST00000434781+ENST00000629178+ENST00000366234+ENST00000629339+ENST00000601431+ENST00000664419+ENST00000602091+ENST00000669812+ENST00000625625+ENST00000442040+ENST00000665708+ENST00000415222+ENST00000418564+ENST00000444320+ENST00000655096+ENST00000655140+ENST00000437019+ENST00000630759+ENST00000627487+ENST00000629374+ENST00000416506+ENST00000663033+ENST00000597318+ENST00000603347+ENST00000631110+ENST00000655058+ENST00000413791+ENST00000656308+ENST00000665172+ENST00000664958+ENST00000625783+ENST00000664049+ENST00000457770+ENST00000629506+ENST00000657164+ENST00000629084+ENST00000664483+ENST00000626495+ENST00000670786+ENST00000625447+ENST00000629407+ENST00000421843+ENST00000667874+ENST00000654716+ENST00000666074+ENST00000411429+ENST00000654336+ENST00000669583+ENST00000625955+ENST00000659531+ENST00000413436+ENST00000665264+ENST00000630985+ENST00000625215+ENST00000628551+ENST00000664773+ENST00000456161+ENST00000432604+ENST00000662224+ENST00000594548+ENST00000626632+ENST00000629533+ENST00000653114+ENST00000625227+ENST00000670845+ENST00000627074+ENST00000629188+ENST00000656852+ENST00000627398+ENST00000418308+ENST00000626440+EN' 
The program has to terminate.

Both gtf's (mouse/human) that were outputted above and used for featureCount definitely appear to have the string 'gene_id' in the 9th column. So now, I am at a loss as to why one works (mouse) and the other does not (human) on two datasets processed the same exact way with two gtf's generated the same way. Anything I might have missed?

Thank you all in advance for your input.

dexseq rsubread featurecount software error • 1.7k views

ADD COMMENT • link 4.4 years ago by seung.ho.steven.choi • 0