I want to filter the "RefSeq Select" transcript isoforms in the GRCh37.p13 human genome annotation gff (GCF_000001405.25_GRCh37.p13_genomic.gff.gz). Specifically my goal is to retain for each gene a transcript isoform with a
tag=RefSeq Select attribute if exists (as I understood I can consider them canonical), and in case none is present, the longest isoform.
My idea was to use @Juke34's AGAT tools. I planned to
- filter the longest isoforms
- filter a second gff for the tag using
- create an ID list of the genes contained in that second gff
- remove those IDs from the longest isoform gff
- merge the resulting gff with the filtered
On the one hand, the number of steps in this strategy makes it a bit delicate, on the other hand, it falls flat at step 2 anyway:
# get gff (~20mb) wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz # build toy set zgrep "gene=KRAS;" GCF_000001405.25_GRCh37.p13_genomic.gff.gz \ >kras_excerpt.gff agat_sp_filter_feature_by_attribute_presence.pl \ --gff kras_excerpt.gff \ --attribute tag \ --flip \ --output test_flip # empty result cat test_flip ##gff-version 3
It's quite likely my googlefoo failed me this time: Does anybody have some recommendation how to approach this? Or is the only solution writing yet another gff processor tool?
Juke34 identified the original problem to the second step, I didn't add
--type level2, so it correctly filtered out all level one features not containing the tag, like genes, etc. Lovely gff world