Extract surrounding genome sequences given certain locations using R
2
1
Entering edit mode
5.6 years ago
DVA ▴ 610

Hello,

I wonder what is the best way/package to extract ~3k bps regions surrounding a certain given location in R. E.g. the location can be Human Chromosome 4 Position 50000. Then I'd like to have the sequence from chr4:47000-53000 of hg19/hg38.

Thank you.

R package genome sequence • 5.7k views
0
Entering edit mode

Do you have to do this with R specifically?

0
Entering edit mode

Yes (unfortunately).

3
Entering edit mode
5.6 years ago
Chun-Jie Liu ▴ 280

You can use biomaRt

GENES = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "useast.ensembl.org")

YouRange<- getBM(attributes = c("chromosome_name", "start_position", "end_position"), filters = c("chromosome_name","start","end"), values = list(chromosome_name = "4", start = 50000-3000, end = 50000 + 3000), mart = GENES)


You can use getSequence get sequences.

getSequence(chromosome = 4, start = 50000-3000, end = 50000+3000,type='hgnc_symbol',seqType = 'gene_exon_intron', mart = GENES)


getBM helps you get the annotation information from Biomart, getSequence helps you get the sequence you need.

This is a simple example for example question, you can check details of biomaRt to solve your specific question.

0
Entering edit mode

Thank you! I'm going to take a look right now:)

0
Entering edit mode

Your code gives me an empty data.frame. I had to specify chromosome as "4" instead of "chr4".

0
Entering edit mode

Thanks for your correctness. This is a simple example for using biomaRt. The previous version of Ensembl support 'chr' character. You can choose the Ensembl release or genome build by useMart or useEnsembl.

The Ensembl give the tutorial.

0
Entering edit mode

No problem. I never know what type of nomenclature I have to use myself, since sometimes it seems to be "chr4" and sometimes "4". I though Ensembl was using "chr4" whereas UCSF style was "4" but it seems I was mistaken...

6
Entering edit mode
5.6 years ago
ddiez ★ 1.9k

An option is to use the BSgenome packages and the method getSeq.

library(BSgenome.Hsapiens.NCBI.GRCh38)
# this gives you the object Hsapiens, which you can inspect (output not included).
Hsapiens

# get the 3000 bases upstream of position chr4:47000.
getSeq(Hsapiens, "4", start = 47000 - 3000, end = 47000 - 1)
3000-letter "DNAString" instance
seq: TTTTCCTTTAAAATTGGGGTAATAGCCCATATCCTCCACATCTCAGAAGGTTCATTTTTATTGGTCCAAAGGAAG...CTGAGTTCTGTGAGTATTTTAATCAAATTCTCAAACTTGAGAAAGGGGTTATGGGAGTCCTCACTTCTTAGTAGC

0
Entering edit mode

Thank you very much.

0
Entering edit mode

I wrote a small script to extract sequences of multiple SNPs. getSeq function doesn't seem to support it. Any alternate solutions please?

0
Entering edit mode
suppressMessages(library(GenomicFeatures))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))

gr=GRanges(seqnames = paste0('chr',seq(1,3)),IRanges(start=c(1,100000,100000),end=c(2,100001,100001)))
tmp=Views(Hsapiens,gr)
sequence=DNAStringSet(as.character(tmp))