Question

Extract Flanking Region Of A Snp

0

Entering edit mode

12.7 years ago

Patrick ▴ 40

Hello,

I have a file containing a lot of SNP in a tabular format with the chromosome and the position on the chromosome and the allele change

here is an exemple

chr1  2898298 rsxxxxxx A T

The file contain thousands of data like this

I want to extract flanking region of each snp and export both sequences in fasta format for each SNP with both allele, for my example one with A and one sequence with T flanked by 25 bp from each side

I want to do it with UCSC not Ensembl, I browsed biostar and I found how to connect to UCSC via MySQL. Can someone help me to design the query using mysql, I will wrap it in a perl script and try to export both sequences

Regards

Patrick

snp sequence ucsc • 5.9k views

ADD COMMENT • link updated 12.7 years ago by Jorge Amigo 14k • written 12.7 years ago by Patrick ▴ 40

5

Entering edit mode

what is wrong with these solutions? Get Flanking Sequence Given A List Of Positions

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 12.7 years ago by Jeremy Leipzig 22k

0

Entering edit mode

I'd download the latest human genome release and use bio::db::sam modules' access reference functions to grab the flanking regions. It's very very fast.

ADD REPLY • link 12.7 years ago by Vitis ★ 2.5k

Ram · Answer 1 · 2011-08-19

If you do not mind downloading the human reference genome (~900MB), you can:

wget --no-check-certificate https://raw.github.com/lh3/misc/master/seq/seqtk/seqtk.c
wget --no-check-certificate https://raw.github.com/lh3/misc/master/seq/seqtk/khash.h
wget --no-check-certificate https://raw.github.com/lh3/misc/master/seq/seqtk/kseq.h
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gcc -g -Wall -O2 seqtk.c -lz -o seqtk
sed s,^chr,, your.list | awk -v OFS="\t" '{print $1,$2-26,$2+25}' > snps.bed
./seqtk subseq -t hs37d5.fa.gz snps.bed > snps.seq
paste your.list snps.seq > snps.final.txt

As to UCSC, I do know where the reference sequence is kept in MySQL, but for such a task, it is frequently simpler to do locally. The human reference genome is pretty small in comparison to other data sets. BTW, if you want to get the a sequence in a couple of regions, a more convenient way is:

samtools faidx ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz 1:1,000,000-1,010,000 2:2,000,000-2,010,000

But this method becomes inefficient when you want to get thousands of regions.

Ram · Answer 2 · 2011-08-21

2

Entering edit mode

12.7 years ago

Jorge Amigo 14k

for this purpose I used to suggest to my lab colleages this script I found years ago, which queries EMBL and GenBank, from the very interesting bioperl repository. but as Jeremy has pointed out, I see great solutions on this post which I don't see why they wouldn't suit your needs.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.7 years ago by Jorge Amigo 14k