I have a list of ~1000 polymorphic positions (mapped to hg19/b37) spanning a gene region on chromosome 12. Only about half of these are in dbSNP - the rest are either novel or sequencing errors. I want to genotype these SNPs in a second sample, and to design the assays I need flanking sequence.
My question is - given a list of positions, only half of which are in dbSNP, how can I get flanking sequence?
I read the answers to a similar question from Kiriya. I've tried using Biomart, UCSC, and Galaxy, but get no results returned when I enter the positions in the formats required by those tools.
I use R for this:
library('BSgenome.Hsapiens.UCSC.hg19') chr <- 'chr19' position <- 59900243 alleles <- '[T/C]' offset <- 60 seq <- paste(getSeq(Hsapiens,chr,position-offset,position-1), alleles, getSeq(Hsapiens,chr,position+1,position+offset), sep='')
I hope that helps.
You can use bedtools to do this. Say you have a file positions.bed that looks like
chr11 1234 1234 chr11 91234 91234 chr11 191234 191234 chr11 391234 391234
and the corresponding fasta from your genome (here, chr11.fa) then the following script:
FLANK_BP=25 mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ "select chrom, size from hg19.chromInfo" > hg19.genome flankBed -i positions.bed -g hg19.genome -b $FLANK_BP > flank.bed fastaFromBed -fi chr11.fa -bed flank.bed -fo flank.fasta
flank.fasta with the sequences you are looking for. You
can get different up/downstream distances from flankBed, so check out the options
on that as well.
Please try BioMart again, because I just toyed around with the Ensembl Gene 61 mart and it does return parts of the flanking sequence.
The FASTA result is:
If you click on 'XML', you are given a small XML-document that you can use to query the mart programmatically (see http://www.biomart.org/user-docs.pdf). For this example, the XML looks like this:
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE Query> [?]
<Dataset name = "hsapiens_gene_ensembl" interface = "default" > <Filter name = "upstream_flank" value = "10"/> <Filter name = "chromosome_name" value = "12"/> <Filter name = "end" value = "131783000"/> <Filter name = "start" value = "131780000"/> <Attribute name = "ensembl_gene_id" /> <Attribute name = "ensembl_transcript_id" /> <Attribute name = "gene_flank" /> </Dataset>
If you click on 'Perl', BioMart generates some Perl code for you as an alternative to running the XML-queries. It should be fairly easy to modify the Perl code to your needs.
This is also pretty straightforward with the Ensembl Perl API:
#!/usr/bin/perl use strict; use warnings; use Bio::EnsEMBL::Registry; my $reg = 'Bio::EnsEMBL::Registry'; $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $slice_adaptor = $reg->get_adaptor( 'human', 'core', 'slice'); my $chromosome = '7'; my $position = 24966446; my $offset = 60; my $slice = $slice_adaptor->fetch_by_region('chromosome', $chromosome, $position-$offset, $position+$offset); print $slice->seq;