structural annotation of sequence
1
0
Entering edit mode
7.2 years ago
solo7773 ▴ 90

========

Hi there, let's have a look at the structure of human genome first.

enter image description here

Now, given a sequence, referred to human reference genome hg19,

chromosome, start, end, seq
19, 41808719, 41808745, GGTGGTGGTGGCTTCCGGGGCCGCGGG

I want to know which component of the figure above this sequence belongs to. Could anyone tell me how to? Thanks in advance.

============== update ============

I just found ANNOVAR, it can do region-based annotation, I think it will work for me. I recommend it to others who may face the same problem as me.

genome • 2.1k views
ADD COMMENT
0
Entering edit mode

Maybe blast2GO can help? I do not think it's a trivial task to map a sequence to its structure/function. Also, I'm assuming this is from human chr19.

You could always look for annotations by the chromosome coordinates.

ADD REPLY
0
Entering edit mode

Thanks @Ram , it's from chromosome 19 of hg19. I want to categorise sequences based on their feature. Now I am doing pilot study to find a suitable way to categorise them. Suggestions are welcomed.

ADD REPLY
3
Entering edit mode
7.2 years ago
d-cameron ★ 2.9k

Isn't the sequence redundant given you have the genome coordinates?

For the human genome, these categories are already annotated, it's just a matter of performing lookup on the annotations. If you're familar with R, then you can use BioConductor for the annotations on the left side of your diagram, and load the RepeatMasker annotations to add the annotations on the right side.

If it's just a handful of sequences, you could just manually inspect in something like the UCSC Genome Browser (you'll want to display tracks from the "Genes and Gene Predictions", and "Repeats" categories).

You might want to keep in mind that reality is more complicated than your diagram. Genes sequences can include ancestral transposable elements, UTRs can overlap genes on the other strand, and so on. Each base could be annotated with multiple annotations.

ADD COMMENT
0
Entering edit mode

Hi d-cameron, this sequence is referred to human reference genome hg19, chromosome 19. Its coordinate is known and given in the post. I have thousands of sequence to work with. I need to find our which parts these sequence belong to the genome. I can use R and Bioconductor. I really appreciate if you can provide further suggestions.

ADD REPLY
0
Entering edit mode

Do you have an orientation/direction to your sequence? Do you want to match genes on either strand?

This can be done relatively simply using Bioconductor. For example:

library(VariantAnnotation)
library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(dplyr)

# load your input file using read.table() then construct a new GRanges() object
mytable <- read.csv(file="input.csv", header=TRUE, sep=",")
# Note: you need to check your coordinates as they could be 0-based or 1-based positions
gr <- GRanges(seqnames=mytable$chromosome, ranges=IRanges(start=mytable$start, end=mytable$end), strand="*")
seqlevelsStyle(gr) <- "UCSC"

# Annotate each range with all overlapping genes
gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE))
hits$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL
hits$gene_strand <- as.character(strand(gns[hits$subjectHits]))
hits <- hits %>%
  group_by(queryHits) %>%
  summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
gr$SYMBOL <- ""
gr$geneStrand <- ""
gr$SYMBOL[hits$queryHits] <- hits$SYMBOL
gr$geneStrand[hits$queryHits] <- hits$gene_strand
ADD REPLY
0
Entering edit mode

Hi d-cameron, thanks very much for your useful code. With your code, I will be able to match the gene on the + strand.

ADD REPLY

Login before adding your answer.

Traffic: 2898 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6