I apologise for this question about what is probably the most annoying thing in our workflows: annotation.
Yes, I have searched and googled, but I have not found a satisfying answer that avoids re-inventing the wheel again and again and again. It has been years since the last post, so I would like your input on how you do this.
Question: What is a quick, easy and comprehensive way to annotate a large set of genomic intervals?
Input: R GRanges object or BED file
For each interval:
Gene ID / Transcript ID: the gene / transcript that the interval actually overlaps, not the nearest feature!
Gene Type: ncRNA, protein coding, ...
Gene Region: intron, exon, intergenic, 5' UTR, 3' UTR, CDS, ..
Decision logic: ncRNAs > microRNAs > protein coding, exon > intron, ... etc
Other info would also be very nice.
Here are some reasons why I am not very happy with the existing options:
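To make the wish list above concrete, here is a minimal Python sketch of overlap-only annotation with a decision logic. It is only an illustration: the `Feature` record, the rank tables and the toy coordinates are made up for this example, and a real workflow would load features from a GTF/GFF file and use an interval index instead of a linear scan.

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    chrom: str
    start: int      # 0-based, half-open, as in BED
    end: int
    gene_id: str
    gene_type: str  # e.g. "ncRNA", "miRNA", "protein_coding"
    region: str     # e.g. "exon", "intron"


# Hypothetical priority tables (lower rank wins), mirroring
# "ncRNAs > microRNAs > protein coding" and "exon > intron".
GENE_TYPE_RANK = {"ncRNA": 0, "miRNA": 1, "protein_coding": 2}
REGION_RANK = {"exon": 0, "intron": 1}
UNRANKED = 99  # anything not listed loses against listed categories


def overlaps(chrom: str, start: int, end: int, f: Feature) -> bool:
    """True if the half-open query interval shares at least one base with f."""
    return chrom == f.chrom and start < f.end and f.start < end


def annotate(chrom: str, start: int, end: int,
             features: List[Feature]) -> Optional[Feature]:
    """Return the highest-priority overlapping feature, or None if intergenic."""
    hits = [f for f in features if overlaps(chrom, start, end, f)]
    if not hits:
        return None  # no overlap: report intergenic, never a nearby gene
    return min(hits, key=lambda f: (GENE_TYPE_RANK.get(f.gene_type, UNRANKED),
                                    REGION_RANK.get(f.region, UNRANKED)))


# Toy annotation: a miRNA sitting inside an intron of a protein-coding gene.
FEATURES = [
    Feature("chr1", 100, 200, "geneA", "protein_coding", "exon"),
    Feature("chr1", 200, 500, "geneA", "protein_coding", "intron"),
    Feature("chr1", 300, 380, "mirB", "miRNA", "exon"),
]

for query in [("chr1", 350, 360), ("chr1", 150, 160), ("chr1", 10000, 10100)]:
    hit = annotate(*query, FEATURES)
    print(query, "->", hit if hit is not None else "intergenic")
```

The tuple used as the sort key is where the decision logic lives, so changing the priorities only means editing the two rank tables.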
1. ChIP seq tools
ChIP-seq annotation packages all annotate the nearest gene, but that gene can be more than 10 kb away and have nothing to do with the genomic region being annotated. I think this is sometimes misleading. These packages include:
- R package ChIPpeakAnno: does not use strand information for annotation, does wrong annotation, should not be used anymore
- R package ChIPseeker: fixes the ChIPpeakAnno problems, but annotates only the nearest feature, with no annotation of the genomic region. There is no apparent decision logic: it takes the first hit in the TxDb if multiple features are hit, which is fine if the TxDb is tuned (see TxDb).
- HOMER: annotates the nearest feature; I do not know of any decision logic
- PeakAnalyzer: annotates the nearest feature; I do not know of any decision logic
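The difference between "nearest feature" and "overlapping feature" can be shown with a small Python sketch. It does not reproduce what any of the tools above do internally; the gene coordinates and function names are invented for illustration, all intervals are assumed to be on the same chromosome, and strand is ignored for brevity.

```python
def distance(a_start, a_end, b_start, b_end):
    """Distance between two half-open intervals on one chromosome.

    Returns 0 if they share at least one base and 1 if they are directly adjacent.
    """
    return max(0, b_start - a_end + 1, a_start - b_end + 1)


def nearest_gene(start, end, genes):
    """Nearest-feature style: always returns (distance, name), however far away."""
    return min((distance(start, end, g_start, g_end), name)
               for name, g_start, g_end in genes)


def overlapping_gene(start, end, genes):
    """Overlap-only style: returns a gene name only if the distance is 0."""
    dist, name = nearest_gene(start, end, genes)
    return name if dist == 0 else None


# Hypothetical genes on a single chromosome: (name, start, end)
GENES = [("geneA", 1000, 5000), ("geneB", 40000, 60000)]

# An interval far from both genes still gets a gene name from nearest_gene.
print(nearest_gene(20000, 20200, GENES))
print(overlapping_gene(20000, 20200, GENES))
# An interval inside geneA is annotated the same way by both approaches.
print(overlapping_gene(1200, 1300, GENES))
```

Reporting the distance alongside the gene name, or filtering on it, is what separates a usable nearest-gene annotation from a misleading one.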
2. R TxDb
Absolutely a valid option, but the standard ones contain neither gene type nor decision logic. Basically, every user has to build their own TxDb, ordered so that the first hit returned already follows the desired decision order, which keeps processing fast. I personally think creating TxDbs is not well documented, if documented at all. If you know of good documentation, please let me know!
3. Bed tools / Bedops
Same as TxDb: an absolutely valid option. Fast, but you have to download and prepare everything yourself. Re-inventing the wheel.
Here is a nice way to do it with BEDOPS for one annotation type: "C: Annotating Genomic Intervals". Multiple annotation types and the decision logic still have to be implemented. Not complicated, but a time-saver would be nice. If you have good scripts or an efficient workflow, please share them.
4. DAS services
I haven't used them; they are probably slow and not ideal for annotating tens of thousands of intervals.
I would appreciate your comments and help on this. Perhaps we can collect some time-savers here. Thank you.