I have a de novo annotation (transcripts, and sometimes isoforms) in GTF format, with the following additional fields: locus_id and transcript_id.
I also have a reference annotation (mm10), which contains a gene_id (Ensembl ID).
What I would like to do is import those features as R GenomicFeatures objects (inside a TxDb).
There is a map between the official gene_id and the locus_id from my de novo assembly (and it is unique: no fused genes or anything like that). If a de novo feature and a reference feature intersect, the locus_id of the former and the gene_id of the latter are always supposed to correspond to each other.
Note that some gene_id values do not have a corresponding locus_id because those genes were not expressed in my RNA-Seq samples. Also, there are no novel gene models, so every locus_id should map to a unique gene_id.
=> Once I have imported my custom annotation, how can I assign the correct gene_id to each of its features? I would like the gene_id to be added as a new attribute if possible, rather than replacing the locus_id.
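To make the desired result concrete, here is a minimal sketch in plain Python (not the R/GenomicFeatures solution I am asking for) of what "adding gene_id as a new attribute" would mean for a single GTF line. The IDs, the source column value, the map contents and the function name add_gene_id are all made up for illustration, and it assumes the locus_id to gene_id map is already available as a dictionary.

```python
import re

# Hypothetical locus_id -> gene_id map; the IDs below are placeholders.
LOCUS_TO_GENE = {"LOC_1": "ENSMUSG00000000001"}


def add_gene_id(gtf_line, locus_to_gene):
    """Append a gene_id attribute to a GTF line, keeping its locus_id."""
    line = gtf_line.rstrip("\n")
    fields = line.split("\t")
    if len(fields) != 9:
        # Comments or malformed lines are returned unchanged.
        return line
    attrs = fields[8].rstrip()
    match = re.search(r'locus_id "([^"]+)"', attrs)
    if match is None:
        return line
    gene = locus_to_gene.get(match.group(1))
    if gene is None:
        # No known gene for this locus: leave the line as it is.
        return line
    if not attrs.endswith(";"):
        attrs += ";"
    fields[8] = f'{attrs} gene_id "{gene}";'
    return "\t".join(fields)


example = (
    "chr1\tdenovo\texon\t100\t200\t.\t+\t.\t"
    'locus_id "LOC_1"; transcript_id "TX_1";'
)
print(add_gene_id(example, LOCUS_TO_GENE))
```

The locus_id stays in place and the gene_id is simply appended, which is the behaviour I would like to reproduce on the imported features in R.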
I also know it can be done with bedtools (see "How To Get Annotation For Bed File From Another Bed File").
The ultimate goal is to be able to query that database by gene_id (and sometimes by transcript_id), and to quickly obtain the 3' UTR, CDS and intron positions. I hope R makes my life easier than using a few bedtools scripts for all these tasks.
Thanks a lot if you can help or advise me! I think this may be a rather common task, but I have not found any tutorial.