GFF3 to GTF conversion - 9th column
0
0
Entering edit mode
7.0 years ago

Hello

I am trying to convert a .gff file (version 3) downloaded from NCBI to a .gtf file, but I am having some trouble. I expected the 9th column of the outputted .gtf files to be "gene_id", but I can't get it to work.

After trying many times, I was recommended to use the .gtf files provided in the iGenomes website.

However, even though it shows me that "gene_id" is in the 9th column (see below), when I try using it in featureCounts, it shows me a warning message:

This is how the .gtf file from iGenomes looks like:

$head genes.gtf 1 Gnomon exon 33312 35627 . + . gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855"; 1 Gnomon CDS 34375 35558 . + 0 gene_id "LOC104970773"; gene_name "LOC104970773"; p_id "P29020"; transcript_id "XM_010800890.1"; tss_id "TSS27855"; 1 Gnomon exon 122904 125014 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon CDS 124853 125014 . - 0 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon CDS 135820 136001 . - 2 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon exon 135820 136001 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon CDS 137158 137264 . - 1 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon exon 137158 137264 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon CDS 137834 137959 . - 1 gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167"; 1 Gnomon exon 137834 137959 . - . gene_id "CLIC6"; gene_name "CLIC6"; p_id "P22316"; transcript_id "XM_002684585.4"; tss_id "TSS4167";  Here is the warning outputted by featureCounts: ./featureCounts \ -T 1 \ -a ~/RNASeq/Bos_taurus/NCBI/UMD_3.1.1/Annotation/Genes/genes.gtf -s 2 -g gene_id \ -t exon \ -o charolais_${name}.featCounts.txt \
~/RNASeq/RNASeq_STAR/NCBI_UMD3.1.1/charolais_{name}_aligned_Aligned.sortedByCoord.out.bam ========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v1.4.6-p5 //========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || S /home/undergrad1/RNASeq/RNASeq_STAR/NCBI_U ... || || || || Output file : charolais_145A_TGACCA.featCounts.txt || || Annotations : /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_ ... || || || || Threads : 1 || || Level : meta-feature level || || Paired-end : no || || Strand specific : inversed || || Multimapping reads : not counted || || Multi-overlapping reads : not counted || || Read orientations : fr || || || \\===================== http://subread.sourceforge.net/ ======================// //================================= Running ==================================\\ || || || Load annotation file /home/undergrad1/RNASeq/Bos_taurus/NCBI/UMD_3.1.1 ... || Warning: failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id' The attributes included in your GTF annotation are 'gene_name "CFAP20"; p_id "P19047"; transcript_id "NM_001003905.1"; tss_id "TSS12705";'  It seems like the columns are off by one. Does anyone know a proper way to convert a .gff file to a .gtf file or a way I can fix this gtf file from iGenomes? I've tried the gffread tool from cuffsuite, and some scripts I found online (such as in here http://blog.nextgenetics.net/?e=27) but none of them seemed to work properly. Any help is appreciated! gtf NCBI gff3 featureCounts • 7.6k views ADD COMMENT 1 Entering edit mode Your GTF file doesn't conform to the definition: here they write that gene_id and transcript_id must be at the start of the 9th column: "Any other attributes or comments must appear after these two and will be ignored." By the way, this is also written in the blog post you linked to. ADD REPLY 0 Entering edit mode Thank you! I realized that, however I do not know how to change it. I expected the file downloaded from iGenomes above to be right, but apparently it is not. Any ideas on how to edit the gtf file without messing it up? Thanks! ADD REPLY 4 Entering edit mode With a perl one-liner like this: perl -ne 'chomp; @a=split/\t/; %h=split(/ /,a[8]); $a[8]=join(" ",("gene_id",$h{"gene_id"},"transcript_id",\$h{"transcript_id"})); print join("\t",@a),"\n";' iGenome.gtf > corrected.gtf

Note that this keeps only the gene_id and transcript_id attributes. Adapt to your needs.

0
Entering edit mode