Question: Extract surrounding genome sequences given certain locations using R
1
gravatar for DVA
3.1 years ago by
DVA530
United States
DVA530 wrote:

Hello,

I wonder what is the best way/package to extract ~3k bps regions surrounding a certain given location in R. E.g. the location can be Human Chromosome 4 Position 50000. Then I'd like to have the sequence from chr4:47000-53000 of hg19/hg38.

Thank you.

package sequence R genome • 2.8k views
ADD COMMENTlink modified 3.1 years ago by ddiez1.8k • written 3.1 years ago by DVA530

Do you have to do this with R specifically?

ADD REPLYlink written 3.1 years ago by Joe16k

Yes (unfortunately).

ADD REPLYlink written 3.1 years ago by DVA530
3
gravatar for Chun-Jie Liu
3.1 years ago by
Chun-Jie Liu260
US, Houston
Chun-Jie Liu260 wrote:

You can use biomaRt

GENES = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "useast.ensembl.org")

YouRange<- getBM(attributes = c("chromosome_name", "start_position", "end_position"), filters = c("chromosome_name","start","end"), values = list(chromosome_name = "4", start = 50000-3000, end = 50000 + 3000), mart = GENES)

You can use getSequence get sequences.

getSequence(chromosome = 4, start = 50000-3000, end = 50000+3000,type='hgnc_symbol',seqType = 'gene_exon_intron', mart = GENES)

getBM helps you get the annotation information from Biomart, getSequence helps you get the sequence you need.

This is a simple example for example question, you can check details of biomaRt to solve your specific question.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Chun-Jie Liu260

Thank you! I'm going to take a look right now:)

ADD REPLYlink written 3.1 years ago by DVA530

Your code gives me an empty data.frame. I had to specify chromosome as "4" instead of "chr4".

ADD REPLYlink written 3.1 years ago by ddiez1.8k

Thanks for your correctness. This is a simple example for using biomaRt. The previous version of Ensembl support 'chr' character. You can choose the Ensembl release or genome build by useMart or useEnsembl.

The Ensembl give the tutorial.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Chun-Jie Liu260

No problem. I never know what type of nomenclature I have to use myself, since sometimes it seems to be "chr4" and sometimes "4". I though Ensembl was using "chr4" whereas UCSF style was "4" but it seems I was mistaken...

ADD REPLYlink written 3.1 years ago by ddiez1.8k
6
gravatar for ddiez
3.1 years ago by
ddiez1.8k
Japan
ddiez1.8k wrote:

An option is to use the BSgenome packages and the method getSeq.

library(BSgenome.Hsapiens.NCBI.GRCh38)
# this gives you the object Hsapiens, which you can inspect (output not included).
Hsapiens

# get the 3000 bases upstream of position chr4:47000.
getSeq(Hsapiens, "4", start = 47000 - 3000, end = 47000 - 1)
3000-letter "DNAString" instance
seq: TTTTCCTTTAAAATTGGGGTAATAGCCCATATCCTCCACATCTCAGAAGGTTCATTTTTATTGGTCCAAAGGAAG...CTGAGTTCTGTGAGTATTTTAATCAAATTCTCAAACTTGAGAAAGGGGTTATGGGAGTCCTCACTTCTTAGTAGC
ADD COMMENTlink written 3.1 years ago by ddiez1.8k

Thank you very much.

ADD REPLYlink written 3.1 years ago by DVA530

I wrote a small script to extract sequences of multiple SNPs. getSeq function doesn't seem to support it. Any alternate solutions please?

ADD REPLYlink written 15 months ago by Uday Rangaswamy120
suppressMessages(library(GenomicFeatures))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))

gr=GRanges(seqnames = paste0('chr',seq(1,3)),IRanges(start=c(1,100000,100000),end=c(2,100001,100001)))
tmp=Views(Hsapiens,gr)
sequence=DNAStringSet(as.character(tmp))
ADD REPLYlink written 6 months ago by rob.tesf0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1363 users visited in the last hour