Entering edit mode
6.8 years ago
O.rka
▴
750
I am trying to run the TagReadWithGeneFunction tool from https://github.com/broadinstitute/Drop-seq and it requires my gtf file to have gene name, gene id, transcript name, and transcript id. My gff file from maker did not have most of these fields.
Is there a tool I can use to fix either my gff file or my converted gtf file?
I used gffread from http://ccb.jhu.edu/software/stringtie/gff.shtml to convert to gtf.
Here is the output of TagReadWithGeneFunction:
-bash-4.1$ cat dropseq_output/Frozen_HJC7MBGX5/logging/star_gene_exon_tagged.bam.log
INFO 2019-01-24 02:24:17 TagReadWithGeneFunction
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** TagReadWithGeneFunction -INPUT dropseq_output/Frozen_HJC7MBGX5/merged.bam -ANNOTATIONS_FILE /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf -OUTPUT dropseq_output/Frozen_HJC7MBGX5/star_gene_exon_tagged.bam
**********
02:24:19.432 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/jespinoz/Algorithms/Mapping/dropseq_pipeline/Drop-seq_tools-2.1.0/jar/lib/picard-2.18.14.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Jan 24 02:24:19 EST 2019] TagReadWithGeneFunction INPUT=dropseq_output/Frozen_HJC7MBGX5/merged.bam OUTPUT=dropseq_output/Frozen_HJC7MBGX5/star_gene_exon_tagged.bam ANNOTATIONS_FILE=/usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jan 24 02:24:19 EST 2019] Executing as jespinoz@dell-15.jcvi.org on Linux 2.6.32-696.18.7.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.1.0(e9a342e_1547567906)
[Thu Jan 24 02:24:21 EST 2019] org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" picard.annotation.AnnotationException: Invalid GTF line:
000000F|arrow maker exon 266 1689 . + . transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
Problems:
Missing gene_id
Missing gene_name
Missing transcript_name
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.<init>(PeekableIterator.java:38)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.<init>(FilteredIterator.java:37)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:109)
at org.broadinstitute.dropseqrna.annotation.GTFReader$FilteringGTFParser.<init>(GTFReader.java:107)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:70)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
at org.broadinstitute.dropseqrna.metrics.TagReadWithGeneFunction.doWork(TagReadWithGeneFunction.java:113)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
Here is the top few lines of my gff file:
-bash-4.1$ head /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gff
##gff-version 3
000161F|arrow . contig 1 17082 . . . ID=000161F|arrow;Name=000161F|arrow
000161F|arrow maker gene 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5;Alias=maker-000161F|arrow-snap-gene-0.5;
000161F|arrow maker mRNA 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5-mRNA-1;Parent=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.5-mRNA-1;_AED=0.13;_QI=0|0|0|1|0|0|2|0|510;_eAED=0.07;
000161F|arrow maker exon 1406 1654 . - . ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:exon:28;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker exon 35 1315 . - . ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:exon:27;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker CDS 1406 1654 . - 0 ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:cds;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker CDS 35 1315 . - 0 ID=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1:cds;Parent=CylT030441-RA|arrow-snap-gene-0.5-mRNA-1;
000161F|arrow maker gene 15768 16170 . + . ID=CylG030449|arrow-snap-gene-0.4;Name=CylG030449|arrow-snap-gene-0.4;Alias=maker-000161F|arrow-snap-gene-0.4;
000161F|arrow maker mRNA 15768 16170 . + . ID=CylG030449|arrow-snap-gene-0.4-mRNA-1;Parent=CylG030449|arrow-snap-gene-0.4;Name=CylG030449|arrow-snap-gene-0.4-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.4-mRNA-1;_AED=0.41;_QI=0|0|0|1|0|0|3|0|94;_eAED=0.41;
Lastly, here is my converted gtf file:
-bash-4.1$ head /usr/local/projdata/0568/projects/PLANKTON/illumina_aallen/jespinoz/References/Cylindrotheca/annotations/cylindrotheca_rd3.all.maker.gtf
000000F|arrow maker exon 266 1689 . + . transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow maker exon 1781 2574 . + . transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow maker CDS 266 634 . + 0 transcript_id "CylT000001-RA|arrow-processed-gene-0.141-mRNA-1";
000000F|arrow maker exon 326 473 . + . transcript_id "CylT000002-RA|arrow-noncoding-gene-0.302-snoRNA-1";
000000F|arrow maker exon 2798 3631 . - . transcript_id "CylT000003-RA|arrow-processed-gene-0.179-mRNA-1";
000000F|arrow maker CDS 2798 3631 . - 0 transcript_id "CylT000003-RA|arrow-processed-gene-0.179-mRNA-1";
000000F|arrow maker exon 6084 6181 . + . transcript_id "CylT000004-RA|arrow-noncoding-gene-0.303-snoRNA-1";
000000F|arrow maker exon 7306 9415 . + . transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
000000F|arrow maker exon 9971 10218 . + . transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
000000F|arrow maker CDS 7306 9415 . + 0 transcript_id "CylT000005-RA|arrow-snap-gene-0.2-mRNA-1";
If your function only needs the missing
gene_id,gene_name, andtranscript_namein the attributes column, without a real value, you can try to add them to every line with no value (hencegene_id ""; gene_name ""; transcript_name "") after thetranscript_id. You can try AWK for such manipulation. No guarantees, I don't know what the function exactly needs to prevent the error.It looks like for this record:
000161F|arrow maker mRNA 35 1654 . - . ID=CylG030441|arrow-snap-gene-0.5-mRNA-1;Parent=CylG030441|arrow-snap-gene-0.5;Name=CylG030441|arrow-snap-gene-0.5-mRNA-1;Alias=maker-000161F|arrow-snap-gene-0.5-mRNA-1;_AED=0.13;_QI=0|0|0|1|0|0|2|0|510;_eAED=0.07;The gene id/name is isCylG030441|arrow-snap-gene-0.5while transcript id/nameCylG030441|arrow-snap-gene-0.5-mRNA-1.