4.5 years ago by
Seattle, WA USA
As a general example, you could get a GFF file of annotations via GENCODE, and convert GFF to BED via BEDOPS gff2bed. If you're working with a recent build of the human genome, for instance:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| gff2bed - \
> annotations.bed
BEDOPS bedops and bedmap facilitate making associations between BED-formatted reads and annotations. You could do bedops -e 1, for example, to get reads that overlap annotations. You could do further filtering with awk or grep to get subsets of these annotations, such as exons or genes.
To do sequence lookups, you could use samtools faidx against a directory of indexed FASTA files for your genome of interest (for instance, UCSC makes FASTA files available to download for human and other builds).
An example Perl script could parse a BED file containing reads-of-interest into sequence data, via indexed FASTA files:
#!/usr/bin/env perl
use strict;
use warnings;
# contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz";
while (<STDIN>) {
chomp;
my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
my $queryKey = "$chr:$start-$stop";
my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
my @lines = split("\n", $queryResult);
my @seqs = @lines[1..(scalar @lines - 1)];
my $seq = join("", @seqs);
if ($strand eq "-") { $seq = revdnacomp($seq); }
my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
print STDOUT $header."\n".uc($seq)."\n";
}
sub revdnacomp {
my $dna = shift @_;
my $revcomp = reverse($dna);
$revcomp =~ tr/ACGTacgt/TGCAtgca/;
return $revcomp;
}
Hopefully, these are useful starting points.