Hello
I am trying to convert a .gff file (version 3) downloaded from NCBI to a .gtf file, but I am having some trouble. I expected the 9th column of the outputted .gtf files to be "gene_id", but I can't get it to work.
After trying many times, I was recommended to use the .gtf files provided in the iGenomes website.
However, even though it shows me that "gene_id" is in the 9th column (see below), when I try using it in featureCounts, it shows me a warning message:
This is how the .gtf file from iGenomes looks like:
$ head genes.gtf
1    Gnomon    exon    33312    35627    .    +    .    gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855";
1    Gnomon    CDS    34375    35558    .    +    0    gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855";
1    Gnomon    exon    122904    125014    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    124853    125014    .    -    0    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    135820    136001    .    -    2    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    exon    135820    136001    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    137158    137264    .    -    1    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    exon    137158    137264    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    CDS    137834    137959    .    -    1    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
1    Gnomon    exon    137834    137959    .    -    .    gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";
Here is the warning outputted by featureCounts:
./featureCounts \
  -T 1 \
  -a ~/RNASeq/Bos_taurus/NCBI/UMD_3.1.1/Annotation/Genes/genes.gtf -s 2 -g gene_id \
  -t exon \
  -o charolais_${name}.featCounts.txt \
  ~/RNASeq/RNASeq_STAR/NCBI_UMD3.1.1/charolais_${name}_aligned_Aligned.sortedByCoord.out.bam
        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
    v1.4.6-p5
//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S /home/undergrad1/RNASeq/RNASeq_STAR/NCBI_U ... ||
||                                                                            ||
||             Output file : charolais_145A_TGACCA.featCounts.txt             ||
||             Annotations : /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_ ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : inversed                                         ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||       Read orientations : fr                                               ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_3.1.1 ... ||
Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
The attributes included in your GTF annotation are 'gene_name "CFAP20"; p_id "P19047"; transcript_id "NM_001003905.1"; tss_id "TSS12705";'
It seems like the columns are off by one. Does anyone know a proper way to convert a .gff file to a .gtf file or a way I can fix this gtf file from iGenomes?
I've tried the gffread tool from cuffsuite, and some scripts I found online (such as in here http://blog.nextgenetics.net/?e=27) but none of them seemed to work properly. Any help is appreciated!
Your GTF file doesn't conform to the definition: here they write that gene_id and transcript_id must be at the start of the 9th column: "Any other attributes or comments must appear after these two and will be ignored." By the way, this is also written in the blog post you linked to.
Thank you!
I realized that, however I do not know how to change it.
I expected the file downloaded from iGenomes above to be right, but apparently it is not.
Any ideas on how to edit the gtf file without messing it up?
Thanks!
With a perl one-liner like this:
perl -ne 'chomp; @a=split/\t/; %h=split(/ /,$a[8]); $a[8]=join(" ",("gene_id",$h{"gene_id"},"transcript_id",$h{"transcript_id"})); print join("\t",@a),"\n";' iGenome.gtf > corrected.gtfNote that this keeps only the gene_id and transcript_id attributes. Adapt to your needs.
Thank you so much, Jean! Your code was really helpful!