I want to run TopHat with the -G option, which the manual describes as follows:
"Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first."
This sounds great.
According to the GTF 2.2 spec (http://mblab.wustl.edu/GTF22.html), a GTF file can use exon, 3UTR, or 5UTR features to represent exons. The spec also describes start_codon and CDS features, and the attributes field (the ninth column) holds gene_id and transcript_id name-value pairs.
I don't think TopHat cares about translation, so I'm guessing it would work fine with a GTF file that contains exon features only. It probably doesn't need the gene_id attribute either.
Does anyone know the minimal GTF content TopHat needs in order to align reads to a virtual transcriptome?
Would this work?
chr1 BLAH exon 150 200 . + . transcript_id "X";
chr1 BLAH exon 300 401 . + . transcript_id "X";
chr1 BLAH exon 501 650 . + . transcript_id "X";
chr1 BLAH exon 700 800 . + . transcript_id "X";
chr1 BLAH exon 900 1000 . + . transcript_id "X";
Also, how would I test this?
Does the TopHat source code include unit tests I could use to confirm that a given GTF file is read correctly?
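In the meantime, here is a rough sanity check I could run on the file before handing it to TopHat. It is only a sketch and does not exercise TopHat's own GTF parser. The function names (`parse_gtf_line`, `check_exon_records`) are made up for illustration. It assumes the columns are tab-separated, as the GTF spec requires, and it treats `transcript_id` as the only required attribute, which is exactly the assumption I am asking about.

```python
import re


def parse_gtf_line(line):
    """Split one GTF record into its 9 tab-separated columns (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    # Attributes look like: key "value"; key "value";
    attrs = dict(re.findall(r'(\S+)\s+"([^"]*)"', attr_text))
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # raises ValueError if not an integer
        "end": int(end),
        "strand": strand,
        "attrs": attrs,
    }


def check_exon_records(lines, required=("transcript_id",)):
    """Return a list of (line_number, problem) tuples; empty means no problems found.

    `required` is my guess at the minimal attribute set, not something TopHat documents.
    """
    problems = []
    for n, line in enumerate(lines, 1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            rec = parse_gtf_line(line)
        except ValueError as err:
            problems.append((n, str(err)))
            continue
        if rec["feature"] != "exon":
            continue  # only exon records are checked here
        if rec["start"] < 1 or rec["start"] > rec["end"]:
            problems.append((n, "bad coordinates (GTF is 1-based, start <= end)"))
        for key in required:
            if key not in rec["attrs"]:
                problems.append((n, f"missing attribute {key}"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_gtf.py my_annotation.gtf
    with open(sys.argv[1]) as handle:
        for line_number, problem in check_exon_records(handle):
            print(f"line {line_number}: {problem}")
```

A file that passes this check is only well-formed at the column level. Whether TopHat actually accepts it without gene_id is still the open question.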