Merging/Intersecting Different Gene Annotations - Should I Extend Coordinates?
10.5 years ago
PoGibas 5.1k

I want to build as large a gene dataset as possible, so I am combining several gene annotations. However, the same gene often appears in more than one annotation, so entries overlap. To reduce bias, I intersect the annotations and, where genes overlap, keep only one of them.

Question:

To make sure such overlaps are detected, I was thinking of extending the gene coordinates. Is this necessary? If so, how large should the extension be (5 bp or 100 bp)?

Example:

I want to create a lncRNA dataset (in later steps it will be used to search for genomic features).
Input:

  1. GENCODE lncRNA annotation (version 18 - 04/09/2013);
  2. Cabili lncRNA annotation (Cabili et al., 2011 (CSHLP)).

Workflow:

  1. Extract GENCODE gene start/end coordinates;
  2. Extract Cabili gene start/end coordinates;
  3. Extend Cabili coordinates (-/+ n bp);
  4. Use BEDTools intersect;
  5. If genes intersect, keep the GENCODE gene (as it is the newer annotation, though this choice is really subjective).
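
The workflow above can be sketched in plain Python to make the effect of the extension explicit. This is only a minimal illustration, not the poster's actual pipeline: the `Gene` tuple, the function names, the toy coordinates, and the decision to require matching strands are all assumptions made for the example. Coordinates follow the BED convention (0-based start, exclusive end).

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive
    name: str
    strand: str


def extend(gene, n):
    """Pad a gene by n bp on each side, clamping the start at 0."""
    return gene._replace(start=max(0, gene.start - n), end=gene.end + n)


def overlaps(a, b):
    """True if a and b share at least 1 bp on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def merge_annotations(gencode, cabili, n=0):
    """Keep every GENCODE gene, plus each Cabili gene that does not
    overlap any GENCODE gene after being padded by n bp."""
    kept = list(gencode)
    for gene in cabili:
        padded = extend(gene, n)
        if not any(overlaps(padded, g) for g in gencode):
            kept.append(gene)  # store the original, unpadded coordinates
    return kept


# Hypothetical toy data, not taken from either annotation.
gencode = [Gene("chr1", 1000, 5000, "GENCODE_A", "+")]
cabili = [
    Gene("chr1", 4500, 6000, "CABILI_1", "+"),  # overlaps GENCODE_A directly
    Gene("chr1", 5050, 7000, "CABILI_2", "+"),  # starts 50 bp downstream
    Gene("chr1", 4500, 6000, "CABILI_3", "-"),  # same span, opposite strand
]

no_padding = [g.name for g in merge_annotations(gencode, cabili, n=0)]
padded_100 = [g.name for g in merge_annotations(gencode, cabili, n=100)]
print(no_padding)
print(padded_100)
```

With no padding, only `CABILI_1` is treated as a duplicate; with 100 bp of padding, `CABILI_2` is dropped as well, even though it never touches the GENCODE gene. On real BED files, roughly the same steps should be achievable with `bedtools slop -b` (which needs a genome file) followed by `bedtools intersect -v`, or with `bedtools window -w`; check the options against the documentation for your BEDTools version.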

I do realize that the answer depends on the situation and on how reliable the annotations are, but I still hope someone can suggest something.

bedtools merge • 3.7k views

What do you plan on doing with this dataset?


I updated my question: "in the following steps it will be used to search for genomic features"


You should think about exactly what you want to do with these features. RNA-seq? Wet-lab work (primers, probes, ...)? Phylogenetic studies? The best strategy for merging the features may differ between these purposes. There probably isn't a single merging method that works well for all of them.


This should simply be an enrichment analysis for any feature (e.g., sequence motif, chromatin modification, repeat count).

10.5 years ago
JacobS ▴ 980

My first instinct is that arbitrarily extending coordinates to try to resolve differences between two annotations is a dangerous practice. You wouldn't want to accidentally combine two nearby features of unrelated function just because of their proximity. I'm not sure what kind of organism you are working with, but there are annotation combiners that are specifically designed to use various forms of evidence from several programs to build a final, comprehensive annotation. JIGSAW comes to mind, and a quick web search found this link, but you should search for other combiners to fit your needs. For example, I think JIGSAW is only for eukaryotes, while something like GenePRIMP is only for prokaryotes.
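
One way to see the risk described here is to count how many candidate genes would be absorbed as "duplicates" at each extension size. The sketch below does that on made-up intervals; the function name `n_absorbed`, the coordinates, and the chosen padding values are all hypothetical and serve only to illustrate the idea.

```python
def n_absorbed(reference, candidates, pad):
    """Count candidates that overlap any reference interval
    once padded by `pad` bp on each side (half-open coordinates)."""
    count = 0
    for chrom, start, end in candidates:
        lo, hi = max(0, start - pad), end + pad
        if any(c == chrom and lo < e and s < hi for c, s, e in reference):
            count += 1
    return count


# Hypothetical intervals: (chrom, start, end)
reference = [("chr1", 1000, 5000), ("chr1", 20000, 25000)]
candidates = [
    ("chr1", 4800, 6000),    # genuinely overlaps the first reference gene
    ("chr1", 5004, 7000),    # 4 bp gap
    ("chr1", 5080, 5900),    # 80 bp gap
    ("chr1", 19000, 19500),  # 500 bp gap to the second reference gene
]

counts = {pad: n_absorbed(reference, candidates, pad) for pad in (0, 5, 100, 1000)}
for pad, count in counts.items():
    print(f"pad={pad:>4} bp -> {count} candidate(s) treated as duplicates")
```

In this toy set the count grows from 1 to 4 as the padding increases, and every gene beyond the first is merged on proximity alone. Running a similar sweep on the real annotations would show how strongly the final gene set depends on an arbitrary choice of extension.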
