Question: How To Gather Sequences Of Snp Flanking Regions In A Given Range (Alternatives To Biomart)
1
gravatar for Dr.Bunsen
8.8 years ago by
Dr.Bunsen150
Dr.Bunsen150 wrote:

I have a large list of chromosomal regions for which I would like to generate +/-10 bp of flanking sequence surrounding all SNPs for all alleles that are found in these regions. A hypothetical output for two alleles found in a given range on chromosome 1 might look like this:

>chr1|1234500|G
AAAAAAAAAAGAAAAAAAAAA
>chr1|1234500|C
AAAAAAAAAACAAAAAAAAAA

Normally, I generate an XML query and download the data from BioMart. Using BioMart my query always hangs (Biomart seems unusually slow lately). Does anyone know how I can generate the sequences for all alleles of all SNPs found within a given range using something other than BioMart? Thanks in advance for any help you can provide.

I should also mention, that another solution that would work perfectly fine would be a method to download all the DB132 SNPs with flanking sequences. From there I can whip up a script to filter only the sequences I need from my regions of interest.

biomart snp • 5.0k views
ADD COMMENTlink modified 8.8 years ago by Michael Lawrence90 • written 8.8 years ago by Dr.Bunsen150
3
gravatar for Aaron H
8.8 years ago by
Aaron H170
United States/San Francisco/UCSF
Aaron H170 wrote:

I would use bedtools with two steps.

slopBed to add 10 bases to the interval around each snp:

$ slopBed -i snps.bed -g hg19.genome -b 10 > flanking_snps.bed

fastaFromBed to retrieve those regions from the fasta file:

$ fastaFromBed -fi hg19.fasta -bed flanking_snps.bed -fo flanking_snps.fasta
ADD COMMENTlink modified 8 months ago by RamRS28k • written 8.8 years ago by Aaron H170

Thanks Aaron. I tried Pierre's answer and it seems to work, but I definitely didn't get all the DB132 SNPs. Only about 20K. I'm not familiar with bedtools, so thanks for suggesting it. Do I need to provide the snps.bed file?

ADD REPLYlink written 8.8 years ago by Dr.Bunsen150

bedtools is great for this sort of job. You would need to get your snps into a format that bedtools can work with: vcf,bed,gff

ADD REPLYlink written 8.8 years ago by Aaron H170

Thanks Aarron. One additional question. I installed bedtools via Homebrew and can't seem to located the hg19.genome file. In the docs it says, "BEDTools includes predefined genome files for human and mouse in the /genomes directory included in the BEDTools distribution." Do you know what directory these files reside in? Thanks.

ADD REPLYlink written 8.8 years ago by Dr.Bunsen150

I haven't installed with Homebrew but you can get them from the source code, http://code.google.com/p/bedtools/downloads/detail?name=BEDTools.v2.14.3.tar.gz&can=2&q=

ADD REPLYlink written 8.8 years ago by Aaron H170

Thanks Aaron. I did eventually get it working with Homebrew. Just so your aware, there seems to be an issue with the location of the genome directory when installed via Homebrew. This is a fantastic tool! Thanks for making me aware of it's existence.

ADD REPLYlink written 8.8 years ago by Dr.Bunsen150
2
gravatar for Pierre Lindenbaum
8.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

I should also mention, that another solution that would work perfectly fine would be a method to download all the DB132 SNPs with flanking sequences.

The best would be to download the human genome and to extract the flanking sequences for a given SNP/location (using, e.g. samtools faidx )

You could akso do something like this, using ucsc mysql and ncbi efetch.

$ mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select name from snp132 where chrom="chr1" and chromStart>1000 and chromStart<1000000' |\
  sed 's/^rs//' |\
  while read L; do curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.cgi?db=snp&id=${L}&rettype=fasta&retmode=text" | egrep -v '^1\:' ; done

>gnl|dbSNP|rs112750067 rs=112750067|pos=251|len=501|taxid=9606|mol="genomic"|class=1|alleles="C/T"|build=135
CCTAACCCTA ACCCTAACCC TAACCCTAAC CCAACCCTAA CCCTAACCCT AACCCTAACC CTAACCCTAA CCCCTAACCC
TAACCCTAAC CCTAACCCTA ACCTAACCCT AACCCTAACC CTAACCCTAA CCCTAACCCT AACCCTAACC CTAACCCCTA
ACCCTAACCC TAAACCCTAA ACCCTAACCC TAACCCTAAC CCTAACCCTA ACCCCAACCC CAACCCCAAC CCCAACCCCA
ACCCCAACCC 
Y
AACCCCTAAC CCTAACCCTA ACCCTACCCT AACCCTAACC CTAACCCTAA CCCTAACCCT AACCCCTAAC CCCTAACCCT
AACCCTAACC CTAACCCTAA CCCTAACCCT AACCCCTAAC CCTAACCCTA ACCCTAACCC TCGCGGTACC CTCAGCCGGC
CCGCCCGCCC GGGTCTGACC TGAGGAGAAC TGTGCTCCGC CTTCAGAGTA CCACCGAAAT CTGTGCAGAG GACAACGCAG
CTCCGCCCTC

>gnl|dbSNP|rs56289060 rs=56289060|pos=434|len=934|taxid=9606|mol="genomic"|class=2|alleles="-/C"|build=129
TAACCCTAAC CCTAACCCTA ACCCTAACCC TAACCCTAAC CCTAACCCTA ACCCTAACCC TAACCCTAAC CCTAACCCTA
ACCCTAACCC TAACCCTAAC CCTAACCCAA CCCTAACCCT AACCCTAACC CTAACCCTAA CCCTAACCCC TAACCCTAAC
CCTAACCCTA ACCCTAACCT AACCCTAACC CTAACCCTAA CCCTAACCCT AACCCTAACC CTAACCCTAA CCCCTAACCC
TAACCCTAAA CCCTAAACCC TAACCCTAAC CCTAACCCTA ACCCTAACCC CAACCCCAAC CCCAACCCCA ACCCCAACCC
CAACCCTAAC CCCTAACCCT AACCCTAACC CTACCCTAAC CCTAACCCTA ACCCTAACCC TAACCCTAAC CCCTAACCCC
TAACCCTAAC CCTAACCCTA ACCCTAACCC TAA
N
CCCTAACCCC TAACCCTAAC CCTAACCCTA ACCCTCGCGG TACCCTCAGC CGGCCCGCCC GCCCGGGTCT GACCTGAGGA
GAACTGTGCT CCGCCTTCAG AGTACCACCG AAATCTGTGC AGAGGACAAC GCAGCTCCGC CCTCGCGGTG CTCTCCGGGT
CTGTGCTGAG GAGAACGCAA CTCCGCCGGC GCAGGCGCAG AGAGGCGCGC CGCGCCGGCG CAGGCGCAGA CACATGCTAG
CGCGTCGGGG TGGAGGCGTG GCGCAGGCGC AGAGAGGCGC GCCGCGCCGG CGCAGGCGCA GAGACACATG CTACCGCGTC
CAGGGGTGGA GGCGTGGCGC AGGCGCAGAG AGGCGCACCG CGCCGGCGCA GGCGCAGAGA CACATGCTAG CGCGTCCAGG
GGTGGAGGCG TGGCGCAGGC GCAGAGACGC AAGCCTACGG GCGGGGGTTG GGGGGGCGTG TGTTGCAGGA GCAAAGTCGC
ACGGCGCCGG GCTGGGGCGG

>gnl|dbSNP|rs112155239 rs=112155239|pos=251|len=501|taxid=9606|mol="genomic"|class=1|alleles="A/C"|build=132
CCTAACCCTA ACCCTAACCC TAACCCTAAC CCTAACCCTA ACCCCTAACC CTAACCCTAA ACCCTAAACC CTAACCCTAA
CCCTAACCCT AACCCTAACC CCAACCCCAA CCCCAACCCC AACCCCAACC CCAACCCTAA CCCCTAACCC TAACCCTAAC
CCTACCCTAA CCCTAACCCT AACCCTAACC CTAACCCTAA CCCCTAACCC CTAACCCTAA CCCTAACCCT AACCCTAACC
CTAACCCTAA 
M
CCCTAACCCT AACCCTAACC CTAACCCTCG CGGTACCCTC AGCCGGCCCG CCCGCCCGGG TCTGACCTGA GGAGAACTGT
GCTCCGCCTT CAGAGTACCA CCGAAATCTG TGCAGAGGAC AACGCAGCTC CGCCCTCGCG GTGCTCTCCG GGTCTGTGCT
GAGGAGAACG CAACTCCGCC GGCGCAGGCG CAGAGAGGCG CGCCGCGCCG GCGCAGGCGC AGACACATGC TAGCGCGTCG
GGGTGGAGGC

>gnl|dbSNP|rs112766696 rs=112766696|pos=201|len=401|taxid=9606|mol="genomic"|class=2|alleles="-/C"|build=132
CTAACCCTAA ACCCTAAACC CTAACCCTAA CCCTAACCCT AACCCTAACC CCAACCCCAA CCCCAACCCC AACCCCAACC
CCAACCCTAA CCCCTAACCC TAACCCTAAC CCTACCCTAA CCCTAACCCT AACCCTAACC CTAACCCTAA CCCCTAACCC
CTAACCCTAA CCCTAACCCT AACCCTAACC CTAACCCTAA 
N
CCCTAACCCT AACCCTAACC CTAACCCTCG CGGTACCCTC AGCCGGCCCG CCCGCCCGGG TCTGACCTGA GGAGAACTGT
GCTCCGCCTT CAGAGTACCA CCGAAATCTG TGCAGAGGAC AACGCAGCTC CGCCCTCGCG GTGCTCTCCG GGTCTGTGCT
GAGGAGAACG CAACTCCGCC GGCGCAGGCG CAGAGAGGCG
ADD COMMENTlink modified 8 months ago by RamRS28k • written 8.8 years ago by Pierre Lindenbaum129k

Wow, thank you so much Pierre. I did not know about efetch. Two questions:

  1. what does the final grep do?
  2. the flanking SNP sequences seem to be of variable length; do you know why?

Thanks again.

ADD REPLYlink modified 8 months ago by RamRS28k • written 8.8 years ago by Dr.Bunsen150
  1. grep: the NCBI adds an extra line starting with "1:".
  2. in dbSNP, the flanking sequences don't have the same length from one SNP to another (larger sequences=increased specificity)
ADD REPLYlink modified 8 months ago by RamRS28k • written 8.8 years ago by Pierre Lindenbaum129k
1
gravatar for Mary
8.8 years ago by
Mary11k
Boston MA area
Mary11k wrote:

Yeah, I've seen a lot of people report the hanging problem with BioMart. I'll give 2 suggestions there, and then a third:

  1. At "old" BioMart you can chunk the query by chromosome and re-run it for each. Longer, but might not time out. You could do this with Galaxy and set up a workflow that you just re-set for each chromosome. Shouldn't take too long.
  2. The "new" BioMart interface might have less server burden and you can try there? But I haven't tried it for a query like this yet. I'd be interested to know if it works for you.

But you can do the same thing at UCSC, and in the output request for flanking sequence just like you can at BioMart. From the Table Browser you can construct the query, upload your list of SNPs or the ranges, and then in the output pulldown you'll move to a subsequent step where you can set your flanking sizes.

ADD COMMENTlink modified 8 months ago by RamRS28k • written 8.8 years ago by Mary11k

Thanks for the help Mary. Before I posted this question I tried option 1 and it still hung. I'm querying SNP sequences from about 2% of the genome and UCSC only permits 1000 regions to be uploaded. Thanks for the http://central.biomart.org/ tip. I had no clue this existed. Unfortunately, it doesn't seem possible to gather flanking sequence from here though.

ADD REPLYlink modified 8 months ago by RamRS28k • written 8.8 years ago by Dr.Bunsen150

Well, like the Galaxy suggestion for BioMart you could also set a workflow for the UCSC mining. Alternatively--you can query the UCSC server directly: http://genome.ucsc.edu/FAQ/FAQdownloads#download29

ADD REPLYlink modified 10 months ago by RamRS28k • written 8.8 years ago by Mary11k

The SQL query is a good idea, but I don't believe you can retrieve sequences via SQL? I think this is what Pierre is saying here.

ADD REPLYlink modified 8 months ago by RamRS28k • written 8.8 years ago by Dr.Bunsen150
1
gravatar for Michael Lawrence
8.8 years ago by
United States
Michael Lawrence90 wrote:

Something like:

library(rtracklayer)
library(BSgenome.Hsapiens.UCSC.hg19)
snps <- import("snps.bed")
seqs <- getSeq(Hsapiens, ranges(snps) + 10)
ADD COMMENTlink modified 8 months ago by RamRS28k • written 8.8 years ago by Michael Lawrence90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 679 users visited in the last hour