I have a list of genomic positions in the human genome, and I want to get a general idea about the genomic context of the flanking regions (GC richness, whether they span a gene (exon/intron), etc.). The file format is "chr pos" (e.g. 13 1234567). Is there a simple way to go about this? Thank you!
Your question is a bit generic; the word "features" is broad. The scale at which you want to proceed is also important: the answer will depend on whether you have one, dozens or millions of potential regions.
The simplest option is to go to the UCSC Genome Browser, find your region, and download the tracks that contain the data you are interested in.
More automated solutions include:
query the UCSC genome browser, for some examples see:
Use a biomart service http://www.biomart.org/martservice.html
download bulk datasets and intersect with BedTools
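For the BedTools route, the "chr pos" file first has to become BED intervals. Here is a minimal sketch of that conversion, assuming 1-based positions in the input, a 50 bp flank, and annotation files that use UCSC-style names such as chr13. The sub name position_to_bed and the prefix argument are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert one "chr pos" line (1-based position) into a BED interval
# (0-based start, half-open end) covering the position plus $flank bases
# on each side. $prefix is prepended to the chromosome name, e.g. 'chr'
# for UCSC-style annotation files, or '' for Ensembl-style names.
sub position_to_bed {
    my ($line, $flank, $prefix) = @_;
    my ($chr, $pos) = split ' ', $line;
    return unless defined $pos && $pos =~ /^\d+$/;   # skip malformed lines
    my $start = $pos - 1 - $flank;
    $start = 0 if $start < 0;                        # clamp at chromosome start
    my $end = $pos + $flank;                         # not clamped at chromosome end
    return join "\t", $prefix . $chr, $start, $end;
}

# Demo on two in-memory lines; in practice, loop over the positions file.
my @lines = ("13 1234567", "1 20");
for my $line (@lines) {
    my $bed = position_to_bed($line, 50, 'chr');
    print "$bed\n" if defined $bed;
}
```

The resulting file can then be intersected with a downloaded annotation, for example `bedtools intersect -a flanks.bed -b genes.bed -wa -wb`, where both file names are placeholders. The sketch does not clamp intervals at the chromosome end, so positions very close to the end of a chromosome would need extra handling.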
There are also a large number of ChIP-Seq tools that aim to automate this step; you might want to look at those.
If you wish to do this kind of thing in a script, you can get many features associated with a particular region of a genome using the Ensembl Perl API. You will need to install the API and the Perl DBI and DBD::mysql modules. Then you can retrieve sequences for your regions of interest (with an arbitrary flank) as follows:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use DBD::mysql;
    use Bio::EnsEMBL::Registry;

    # Connect to the public Ensembl database server
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',   # alternatively 'useastdb.ensembl.org'
        -user => 'anonymous'
    );

    my $chr   = 13;
    my $pos   = 111234567;
    my $flank = 50;

    # Fetch the region from $flank bases upstream to $flank bases downstream
    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $pos-$flank, $pos+$flank );

    my $sequence = $slice->seq();
    print $sequence, "\n";
    exit();
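Once you have the sequence string, GC richness is a simple count. Below is a minimal sketch, assuming you want the G+C fraction over unambiguous bases only (N and other ambiguity codes are ignored). The sub name gc_fraction and the example sequence are made up for illustration and are not part of the Ensembl API.

```perl
use strict;
use warnings;

# Fraction of G/C among unambiguous bases (A, C, G, T), case-insensitive.
# Returns undef (in scalar context) if the sequence has no unambiguous bases.
sub gc_fraction {
    my ($seq) = @_;
    my $gc = ($seq =~ tr/GCgc//);   # tr with an empty replacement only counts
    my $at = ($seq =~ tr/ATat//);
    my $total = $gc + $at;
    return unless $total;
    return $gc / $total;
}

my $example = 'ACGTNNGGCC';         # made-up sequence for illustration
my $gc = gc_fraction($example);
printf "GC fraction: %.3f\n", $gc if defined $gc;   # prints "GC fraction: 0.750"
```

In the script above you would call it as gc_fraction($sequence) in place of the made-up example string.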
Examples for retrieving many types of additional information (including exons, introns, etc.) can be found in the Core API Tutorial.