I have a list of ~1000 polymorphic positions (mapped to hg19/b37) spanning a gene region on chromosome 12. Only about half of these are in dbSNP - the rest are either novel or sequencing errors. I want to genotype these SNPs in a second sample, and to design the assays I need flanking sequence.
My question is - given a list of positions, only half of which are in dbSNP, how can I get flanking sequence?
I read the answers to a similar question from Kiriya. I've tried using Biomart, UCSC, and Galaxy, but get no results returned when I enter the positions in the formats required by those tools.
This worked wonderfully. For anyone else who hasn't used bioconductor before, you'll need to run a few lines of code to get this working.
source("http://www.bioconductor.org/biocLite.R");
source("http://www.bioconductor.org/biocLite.R");
biocLite("BSgenome"); biocLite("BSgenome.Hsapiens.UCSC.hg19");
library('BSgenome.Hsapiens.UCSC.hg19')
Adrian, thanks for the tip. This look like it will work. I'll give this a try after downloading the 850 MB human genome with biocLite("BSgenome.Hsapiens.UCSC.hg19").
and the corresponding fasta from your genome (here, chr11.fa)
then the following script:
FLANK_BP=25
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
"select chrom, size from hg19.chromInfo" > hg19.genome
flankBed -i positions.bed -g hg19.genome -b $FLANK_BP > flank.bed
fastaFromBed -fi chr11.fa -bed flank.bed -fo flank.fasta
will create flank.fasta with the sequences you are looking for. You
can get different up/downstream distances from flankBed, so check out the options
on that as well.
click on 'Attributes' and then on the 'Sequences' radio-button
for the example I selected 'Flank (Gene)'
in 'Upstream flank' I entered 10 just as an example
click on 'Filters'
select chromosome 12
as base pairs I entered 131780000 and 131783000 (which spans over ENSG00000248703, 131780658 to 131782517)
click on 'Results'
The FASTA result is:
>ENSG00000248703|ENST00000508505
AAATCCCTGC
If you click on 'XML', you are given a small XML-document that you can use to query the mart programmatically (see http://www.biomart.org/user-docs.pdf). For this example, the XML looks like this:
<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
<Filter name = "upstream_flank" value = "10"/>
<Filter name = "chromosome_name" value = "12"/>
<Filter name = "end" value = "131783000"/>
<Filter name = "start" value = "131780000"/>
<Attribute name = "ensembl_gene_id" />
<Attribute name = "ensembl_transcript_id" />
<Attribute name = "gene_flank" />
</Dataset>
[?]
If you click on 'Perl', BioMart generates some Perl code for you as an alternative to running the XML-queries. It should be fairly easy to modify the Perl code to your needs.
ADD COMMENT
• link
updated 5.1 years ago by
Ram
44k
•
written 13.5 years ago by
Joachim
★
2.9k
1
Entering edit mode
Hm. The strands are encoded as '1' and '-1'. Can you please try these values? I get a result back when I use '-1' instead of '+' in your example.
Thanks for the tip. I followed your instructions but I'm trying to use the "Multiple Chromosomal Regions" checkbox instead of the Chromosome/Base Pair check box. It says to enter (Chr:Start:End:Strand). When I input this:
12:131780658:131780658:+
12:131782517:131782517:+
I get no results returned. I need to provide my assay designer with a list of positions, and the flanking region around each SNP:
Thanks for the tip. I followed your instructions but I'm trying to use the "Multiple Chromosomal Regions" checkbox instead of the Chromosome/Base Pair check box. It says to enter (Chr:Start:End:Strand). When I input this: 12:131780658:131780658:+ [?] 12:131782517:131782517:+ I get no results returned. I need to provide my assay designer with a list of positions, and the flanking region around each SNP:
True. If you are still interested in using BioMart, can you please raise this issue on the BioMart mailing list? Thanks. http://www.biomart.org/contact.html
Unfortunately BioMart will only return values anchored to the gene, if you go through Ensembl Genes, or variation, if you start with database Ensembl variations. When you enter a coordinate in Ensembl Genes, the filter will be applied for genes in that region. The flanking sequence will be exported for the gene, not for the original sequence position you entered.
You can use the Ensembl API to do this query- if you're interestedm it's the Core API (Perl)
how big is the gene region? Can't you just download a subsequence of fasta chr12 and parse that?