Question

How to repair a gff/gtf file that is missing gene id, gene name, and transcript name?

0

Entering edit mode

6.8 years ago

O.rka ▴ 750

I am trying to run the TagReadWithGeneFunction tool from https://github.com/broadinstitute/Drop-seq and it requires my gtf file to have gene name, gene id, transcript name, and transcript id. My gff file from maker did not have most of these fields.

Is there a tool I can use to fix either my gff file or my converted gtf file?

I used gffread from http://ccb.jhu.edu/software/stringtie/gff.shtml to convert to gtf.

Here is the output of TagReadWithGeneFunction:

-bash-4.1$ cat dropseq_output/Frozen_HJC7MBGX5/logging/star_gene_exon_tagged.bam.log
INFO    2019-01-24 02:24:17 TagReadWithGeneFunction

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    TagReadWithGeneFunction -INPUT dropseq_output/Frozen_HJC7MBGX5/merged.bam -ANNOTATIONS_FILE /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf -OUTPUT dropseq_output/Frozen_HJC7MBGX5/star_gene_exon_tagged.bam
**********


02:24:19.432 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/jespinoz/Algorithms/Mapping/dropseq_pipeline/Drop-seq_tools-2.1.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Jan 24 02:24:19 EST 2019] TagReadWithGeneFunction INPUT=dropseq_output/Frozen_HJC7MBGX5/merged.bam OUTPUT=dropseq_output/Frozen_HJC7MBGX5/star_gene_exon_tagged.bam ANNOTATIONS_FILE=/usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf    GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jan 24 02:24:19 EST 2019] Executing as jespinoz@dell-15.jcvi.org on Linux 2.6.32-696.18.7.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.1.0(e9a342e_1547567906)
[Thu Jan 24 02:24:21 EST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" picard.annotation.AnnotationException: Invalid GTF line:
000000F|arrow   maker   exon    266 1689    .   +   .   transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
Problems:
Missing gene_id
Missing gene_name
Missing transcript_name
    at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97)
    at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
    at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
    at htsjdk.samtools.util.PeekableIterator.<init>(PeekableIterator.java:38)
    at org.broadinstitute.dropseqrna.utils.FilteredIterator.<init>(FilteredIterator.java:37)
    at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:109)
    at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:107)
    at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
    at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:70)
    at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
    at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
    at org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction.doWork(TagReadWithGeneFunction.java:113)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

Here is the top few lines of my gff file:

-bash-4.1$ head /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gff
##gff-version 3
000161F|arrow   .   contig  1   17082   .   .   .   ID=000161F|arrow;Name=000161F|arrow
000161F|arrow   maker   gene    35  1654    .   -   .   ID=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5;Alias=maker-000161F|arrow-snap-gene-0.5;
000161F|arrow   maker   mRNA    35  1654    .   -   .   ID=CylG030441|arrow-snap-gene-0.5-mRNA-1;Parent=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.5-mRNA-1;_AED=0.13;_QI=0|0|0|1|0|0|2|0|510;_eAED=0.07;
000161F|arrow   maker   exon    1406    1654    .   -   .   ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:exon:28;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow   maker   exon    35  1315    .   -   .   ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:exon:27;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow   maker   CDS 1406    1654    .   -   0   ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:cds;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow   maker   CDS 35  1315    .   -   0   ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:cds;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow   maker   gene    15768   16170   .   +   .   ID=CylG030449|arrow-snap-gene-0.4;Name=CylG030449|arrow-snap-gene-0.4;Alias=maker-000161F|arrow-snap-gene-0.4;
000161F|arrow   maker   mRNA    15768   16170   .   +   .   ID=CylG030449|arrow-snap-gene-0.4-mRNA-1;Parent=CylG030449|arrow-snap-gene-0.4;Name=CylG030449|arrow-snap-gene-0.4-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.4-mRNA-1;_AED=0.41;_QI=0|0|0|1|0|0|3|0|94;_eAED=0.41;

Lastly, here is my converted gtf file:

-bash-4.1$ head /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf
000000F|arrow   maker   exon    266 1689    .   +   .   transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow   maker   exon    1781    2574    .   +   .   transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow   maker   CDS 266 634 .   +   0   transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow   maker   exon    326 473 .   +   .   transcript_id "CylT000002-RA|arrow-noncoding-gene-0.302-snoRNA-1";
000000F|arrow   maker   exon    2798    3631    .   -   .   transcript_id "CylT000003-RA|arrow-processed-gene-0.179-mRNA-1";
000000F|arrow   maker   CDS 2798    3631    .   -   0   transcript_id "CylT000003-RA|arrow-processed-gene-0.179-mRNA-1";
000000F|arrow   maker   exon    6084    6181    .   +   .   transcript_id "CylT000004-RA|arrow-noncoding-gene-0.303-snoRNA-1";
000000F|arrow   maker   exon    7306    9415    .   +   .   transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
000000F|arrow   maker   exon    9971    10218   .   +   .   transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
000000F|arrow   maker   CDS 7306    9415    .   +   0   transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";

annotation alignment RNA-Seq gene gtf • 6.1k views

ADD COMMENT • link 6.8 years ago by O.rka ▴ 750

1

Entering edit mode

If your function only needs the missing gene_id, gene_name, and transcript_name in the attributes column, without a real value, you can try to add them to every line with no value (hence gene_id ""; gene_name ""; transcript_name "") after the transcript_id. You can try AWK for such manipulation. No guarantees, I don't know what the function exactly needs to prevent the error.

ADD REPLY • link 6.8 years ago by Benn 8.4k

0

Entering edit mode

It looks like for this record: 000161F|arrow maker mRNA 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5-mRNA-1;Parent=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.5-mRNA-1;_AED=0.13;_QI=0|0|0|1|0|0|2|0|510;_eAED=0.07; The gene id/name is is CylG030441|arrow-snap-gene-0.5 while transcript id/name CylG030441|arrow-snap-gene-0.5-mRNA-1.

ADD REPLY • link 6.8 years ago by O.rka ▴ 750