I need a simplified annotation file that contains a single "complete" annotation for each gene of the human genome. In other words, what I need is similar to what you get when an annotation track in the UCSC Genome Browser is changed from full view to dense (see images below). Does anyone know of a simple way to collapse the individual gene isoforms of a GTF file into single "complete" gene annotations?
Thanks for the reply. From what I understand, the script filters away short isoforms and only keeps the longest. While that works for some genes (like LSM4), this approach could lose the annotation of exons that are only present in shorter isoforms. I would like to obtain an annotation file that has one variant per gene containing all known and putative exons.
This would create a biologically invalid GFF, in my inflexible, database-influenced mind, so I guess you'd have to come up with a solution yourself.
Options for handling gtf/gff:
Thank you for the resources! The collapse_annotation script looks promising.
I think it is a bit harsh to call it biologically invalid. Though we lose the resolution of gene isoforms, we keep all exons that are assigned to a particular gene. If you only need to know whether a sequence can be assigned to a particular gene, then isoforms just complicate the process.
Admitted, the phrase is a bit extreme. I totally see the purpose; a colleague has done the same, though tracing which exon comes from which isoform. Just in case you ever think about annotation of transcript/protein changes.
AGAT does not have any script to collapse features in this way (yet).
I have never seen this done before and am not sure I understand the benefit of doing it. But I guess that's up to you. In any case, this will probably require merging exons and coming up with specific rules to resolve conflicting mRNA models. I'm curious about the downstream purpose of this procedure.
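For illustration, here is a minimal sketch of what "merging exons" per gene could look like in plain Python. It is not part of AGAT or any other tool mentioned in this thread. The function name `collapse_exons` is hypothetical, and the sketch assumes a standard 9-column, tab-separated GTF with `exon` features that carry a `gene_id "..."` attribute. It only merges overlapping exon intervals (1-based, inclusive coordinates); it does not try to resolve conflicting mRNA models or CDS/UTR features.

```python
import re
from collections import defaultdict

# Assumes GTF-style attributes such as: gene_id "ENSG..."; transcript_id "ENST...";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def collapse_exons(gtf_lines):
    """Merge overlapping exons of all isoforms of each gene.

    Returns a dict mapping (gene_id, chrom, strand) to a sorted list of
    merged (start, end) exon intervals.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are collapsed here
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue  # no gene_id attribute, nothing to group by
        key = (match.group(1), fields[0], fields[6])
        exons[key].append((int(fields[3]), int(fields[4])))

    collapsed = {}
    for key, intervals in exons.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        collapsed[key] = [tuple(interval) for interval in merged]
    return collapsed
```

Exons that are merely adjacent (one ends at position N, the next starts at N+1) are kept separate here; whether to join them is one of the rules you would have to decide on.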
I guess one benefit is to reduce the search space while maintaining information of exons associated with individual genes.
Personally I wanted this type of annotation to make it super quick and easy to find overlap between a sequence and the most upstream and downstream exon of genes. Some genes have e.g. one or more isoforms with exon 1 positioned downstream of the most upstream exon of the gene as a whole.
Luckily, since I am only looking at exons at the extreme ends, it is not too complicated to extract these and I have come up with a procedure that works fine. But having an annotation file with collapsed gene isoforms would have simplified the procedure.
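As a rough sketch of the idea (not necessarily the exact procedure I used), picking the extreme exons of a gene from the pooled exons of all its isoforms could look like this. The helper names `extreme_exons` and `overlaps` are made up for this example, coordinates are assumed to be 1-based and inclusive as in GTF, and "upstream" is interpreted relative to the gene's strand.

```python
def extreme_exons(exons, strand):
    """Return (most_upstream_exon, most_downstream_exon) for one gene.

    exons:  iterable of (start, end) tuples pooled from all isoforms.
    strand: "+" or "-"; on the minus strand the most upstream exon is
            the one with the highest genomic coordinate.
    """
    exons = list(exons)
    if not exons:
        raise ValueError("gene has no exons")
    # Ties on the same start (or end) are broken arbitrarily but deterministically.
    leftmost = min(exons, key=lambda e: (e[0], e[1]))
    rightmost = max(exons, key=lambda e: (e[1], e[0]))
    if strand == "-":
        return rightmost, leftmost
    return leftmost, rightmost


def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]
```

With exons pooled per gene, a query interval can then be tested against just the two returned exons with `overlaps`, instead of against every isoform.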