Question: Get Flanking Sequence Given A List Of Positions
4
gravatar for Stephen
3.0 years ago by
Stephen1.9k
Nashville, TN
Stephen1.9k wrote:

Hi.

I have a list of ~1000 polymorphic positions (mapped to hg19/b37) spanning a gene region on chromosome 12. Only about half of these are in dbSNP - the rest are either novel or sequencing errors. I want to genotype these SNPs in a second sample, and to design the assays I need flanking sequence.

My question is - given a list of positions, only half of which are in dbSNP, how can I get flanking sequence?

I read the answers to a similar question from Kiriya. I've tried using Biomart, UCSC, and Galaxy, but get no results returned when I enter the positions in the formats required by those tools.

Thanks.

ADD COMMENTlink modified 7 months ago by Biostar ♦♦ 0 • written 3.0 years ago by Stephen1.9k

how big is the gene region? Can't you just download a subsequence of fasta chr12 and parse that?

ADD REPLYlink written 3.0 years ago by Russh1.1k
7
gravatar for Adrian Cortes
3.0 years ago by
Adrian Cortes400
Brisbane, Australia
Adrian Cortes400 wrote:

I use R for this:

library('BSgenome.Hsapiens.UCSC.hg19')
chr <- 'chr19'
position <- 59900243
alleles <- '[T/C]'
offset <- 60

seq <- paste(getSeq(Hsapiens,chr,position-offset,position-1),
             alleles,
             getSeq(Hsapiens,chr,position+1,position+offset),
             sep='')

I hope that helps.

ADD COMMENTlink written 3.0 years ago by Adrian Cortes400
1

This worked wonderfully. For anyone else who hasn't used bioconductor before, you'll need to run a few lines of code to get this working. source("http://www.bioconductor.org/biocLite.R"); source("http://www.bioconductor.org/biocLite.R"); biocLite("BSgenome"); biocLite("BSgenome.Hsapiens.UCSC.hg19"); library('BSgenome.Hsapiens.UCSC.hg19')

ADD REPLYlink written 3.0 years ago by Stephen1.9k

Adrian, thanks for the tip. This look like it will work. I'll give this a try after downloading the 850 MB human genome with biocLite("BSgenome.Hsapiens.UCSC.hg19").

ADD REPLYlink written 3.0 years ago by Stephen1.9k
6
gravatar for brentp
3.0 years ago by
brentp17k
Denver, Colorado
brentp17k wrote:

You can use bedtools to do this. Say you have a file positions.bed that looks like

  chr11 1234    1234
  chr11 91234   91234
  chr11 191234  191234
  chr11 391234  391234

and the corresponding fasta from your genome (here, chr11.fa) then the following script:

FLANK_BP=25
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
    "select chrom, size from hg19.chromInfo"  > hg19.genome
flankBed  -i positions.bed -g hg19.genome -b $FLANK_BP > flank.bed

fastaFromBed -fi chr11.fa -bed flank.bed -fo flank.fasta

will create flank.fasta with the sequences you are looking for. You can get different up/downstream distances from flankBed, so check out the options on that as well.

ADD COMMENTlink written 3.0 years ago by brentp17k
4
gravatar for Joachim
3.0 years ago by
Joachim2.8k
San Francisco, California
Joachim2.8k wrote:

Please try BioMart again, because I just toyed around with the Ensembl Gene 61 mart and it does return parts of the flanking sequence.

  • go to http://www.biomart.org/biomart/martview
  • select 'ENSEMBL GENES 61' as database
  • select 'Homo sapiens genes' as dataset
  • click on 'Attributes' and then on the 'Sequences' radio-button
  • for the example I selected 'Flank (Gene)'
  • in 'Upstream flank' I entered 10 just as an example
  • click on 'Filters'
  • select chromosome 12
  • as base pairs I entered 131780000 and 131783000 (which spans over ENSG00000248703, 131780658 to 131782517)
  • click on 'Results'

The FASTA result is:

>ENSG00000248703|ENST00000508505
AAATCCCTGC

If you click on 'XML', you are given a small XML-document that you can use to query the mart programmatically (see http://www.biomart.org/user-docs.pdf). For this example, the XML looks like this:

<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE Query> [?]

<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "upstream_flank" value = "10"/>
    <Filter name = "chromosome_name" value = "12"/>
    <Filter name = "end" value = "131783000"/>
    <Filter name = "start" value = "131780000"/>
    <Attribute name = "ensembl_gene_id" />
    <Attribute name = "ensembl_transcript_id" />
    <Attribute name = "gene_flank" />
</Dataset>

[?]

If you click on 'Perl', BioMart generates some Perl code for you as an alternative to running the XML-queries. It should be fairly easy to modify the Perl code to your needs.

ADD COMMENTlink written 3.0 years ago by Joachim2.8k
1

Hm. The strands are encoded as '1' and '-1'. Can you please try these values? I get a result back when I use '-1' instead of '+' in your example.

ADD REPLYlink written 3.0 years ago by Joachim2.8k

Thanks for the tip. I followed your instructions but I'm trying to use the "Multiple Chromosomal Regions" checkbox instead of the Chromosome/Base Pair check box. It says to enter (Chr:Start:End:Strand). When I input this:

12:131780658:131780658:+ 12:131782517:131782517:+

I get no results returned. I need to provide my assay designer with a list of positions, and the flanking region around each SNP:

sequence chrom position build GTCTC[T/C]CAGC 19 59900243 36 cctgc[A/G]tcagcc 19 59751810 36 cacatct[T/C]gtacag 19 59814896 36 CCACA[T/C]AGAC 19 59976309 36

One SNP per line.

ADD REPLYlink written 3.0 years ago by Stephen1.9k

That didn't format correctly. I need to give my assay designer something that looks like [?]this[?]:

http://pastebin.com/765gMUi6

ADD REPLYlink written 3.0 years ago by Stephen1.9k

Thanks for the tip. I followed your instructions but I'm trying to use the "Multiple Chromosomal Regions" checkbox instead of the Chromosome/Base Pair check box. It says to enter (Chr:Start:End:Strand). When I input this: 12:131780658:131780658:+ [?] 12:131782517:131782517:+ I get no results returned. I need to provide my assay designer with a list of positions, and the flanking region around each SNP:

http://pastebin.com/765gMUi6

ADD REPLYlink written 3.0 years ago by Stephen1.9k

Ah, that worked, but I'm still getting only a single flanking sequence back, not flanking info for each of the sites I enter.

ADD REPLYlink written 3.0 years ago by Stephen1.9k

True. If you are still interested in using BioMart, can you please raise this issue on the BioMart mailing list? Thanks. http://www.biomart.org/contact.html

ADD REPLYlink written 3.0 years ago by Joachim2.8k

Unfortunately BioMart will only return values anchored to the gene, if you go through Ensembl Genes, or variation, if you start with database Ensembl variations. When you enter a coordinate in Ensembl Genes, the filter will be applied for genes in that region. The flanking sequence will be exported for the gene, not for the original sequence position you entered.

You can use the Ensembl API to do this query- if you're interestedm it's the Core API (Perl)

http://www.ensembl.org/info/docs/api/core/index.html

http://www.ensembl.org/info/docs/api/core/core_tutorial.html

ADD REPLYlink written 3.0 years ago by Giulietta - Ensembl Helpdesk980
2
gravatar for Bert Overduin
3.0 years ago by
Bert Overduin2.6k
EMBL-EBI
Bert Overduin2.6k wrote:

This is also pretty straightforward with the Ensembl Perl API:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $slice_adaptor = $reg->get_adaptor( 'human', 'core', 'slice');

my $chromosome = '7';
my $position = 24966446;
my $offset = 60;

my $slice = $slice_adaptor->fetch_by_region('chromosome', $chromosome, $position-$offset, $position+$offset);

print $slice->seq;
ADD COMMENTlink written 3.0 years ago by Bert Overduin2.6k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 370 users visited in the last hour