Question

R Genomic Features: managing de novo annotations and mapping to the reference

0

Entering edit mode

8.9 years ago

cyril-cros ▴ 950

I have a de novo annotation (transcripts, and sometimes isoforms) in the GTF format plus the following fields: locus_id and transcript_id.

I also have a reference annotation (mm10), which contains a gene_id (Ensembl ID).

What I would like to do is to import those features as R GenomicFeatures objects (inside a TxDB).

There is a map between the official gene_id and the locus_id from my de novo assembly (and it is unique, no fused genes or anything like that). If a de novo feature and a reference feature intersect, they are always supposed to have the same locus_id and gene_id respectively.

Note that there are some gene_id which do not have a corresponding locus_id because they were not expressed in my RNASeq samples. Also, there are no novel gene models so all my locus_id should map to a unique gene_id.

=> Once I have imported my custom annotation, how can I assign the correct gene_id to each of its features? I would like the gene_id to be added as a new attribute if possible, rather than replacing the locus_id

I also know it can be done with Bedtools How To Get Annotation For Bed File From Another Bed File

The ultimate goal is to be able to query that database by the gene_id (and sometimes the transcript_id), and to quickly obtain the 3'UTR, CDS and introns positions. I hope R makes my life easier than using a few bedtools script for all these tasks.

Thanks a lot if you can help me or advise me! I think this may be rather common task, but I have not found any tutorial.

RNA-Seq GenomicFeatures • 1.9k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.9 years ago by cyril-cros ▴ 950

Ram · Answer 1 · 2015-06-10

Ok, the answer seems to be bedtools.

I can use bedtools intersect to get: featureRef <TAB> gene_id <TAB> featureDeNovo <TAB> locus_id,transcript_id
I can use awk to be left with a map in this format: gene_id <TAB> locus_id
Then sort and uniq to simplify, and obtain a table gene_id <TAB> locus_id
I can use awk to get my de novo annotation like this: featureDeNovo <TAB> locus_id,transcript_id <TAB> locus_id
a unix join with my table leaves me with: featureDeNovo <TAB> locus_id,transcript_id <TAB> locus_id <TAB> gene_id
And then awk: featureDeNovo <TAB> locus_id,transcript_id,gene_id