R Genomic Features: managing de novo annotations and mapping to the reference
1
0
Entering edit mode
8.9 years ago
cyril-cros ▴ 950

I have a de novo annotation (transcripts, and sometimes isoforms) in the GTF format plus the following fields: locus_id and transcript_id.

I also have a reference annotation (mm10), which contains a gene_id (Ensembl ID).

What I would like to do is to import those features as R GenomicFeatures objects (inside a TxDB).

There is a map between the official gene_id and the locus_id from my de novo assembly (and it is unique, no fused genes or anything like that). If a de novo feature and a reference feature intersect, they are always supposed to have the same locus_id and gene_id respectively.

Note that there are some gene_id which do not have a corresponding locus_id because they were not expressed in my RNASeq samples. Also, there are no novel gene models so all my locus_id should map to a unique gene_id.

=> Once I have imported my custom annotation, how can I assign the correct gene_id to each of its features? I would like the gene_id to be added as a new attribute if possible, rather than replacing the locus_id

I also know it can be done with Bedtools How To Get Annotation For Bed File From Another Bed File

The ultimate goal is to be able to query that database by the gene_id (and sometimes the transcript_id), and to quickly obtain the 3'UTR, CDS and introns positions. I hope R makes my life easier than using a few bedtools script for all these tasks.

Thanks a lot if you can help me or advise me! I think this may be rather common task, but I have not found any tutorial.

RNA-Seq GenomicFeatures • 1.9k views
ADD COMMENT
0
Entering edit mode
8.9 years ago
cyril-cros ▴ 950

Ok, the answer seems to be bedtools.

  • I can use bedtools intersect to get: featureRef <TAB> gene_id <TAB> featureDeNovo <TAB> locus_id,transcript_id
  • I can use awk to be left with a map in this format: gene_id <TAB> locus_id
  • Then sort and uniq to simplify, and obtain a table gene_id <TAB> locus_id
  • I can use awk to get my de novo annotation like this: featureDeNovo <TAB> locus_id,transcript_id <TAB> locus_id
  • a unix join with my table leaves me with: featureDeNovo <TAB> locus_id,transcript_id <TAB> locus_id <TAB> gene_id
  • And then awk: featureDeNovo <TAB> locus_id,transcript_id,gene_id
ADD COMMENT

Login before adding your answer.

Traffic: 2115 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6