How to batch save short prespecified sequences from the human genome
2
0
Entering edit mode
4.0 years ago

I would like to save sequences from, say, chr1:100-150, chr1:300-350 etc to a file. I have several hundred such genome regions that I want to do this for using GRCh37. I would be grateful if someone could point me to a quick way of doing this.

Edit: samtools faidx ref.fa chr1:100-150 works nicely. Many thanks to answers below.

sequence • 635 views
ADD COMMENT
1
Entering edit mode
4.0 years ago

If you have a BED file of intervals (in.bed) and a directory of FASTA for your genome assembly of interest (e.g., hg19), you could use a script like bed2faidxsta.pl to write data to a FASTA file (out.fa):

$ ./bed2faidxsta.pl --fastaDir=/path/to/hg19/fasta < in.bed > out.fa

It relies on samtools and indexed FASTA files.

Here's the script:

ADD COMMENT
0
Entering edit mode

Brilliant thank you. I am currently downloading chromFa.tar.gz from https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/. Could you please tell me how I can index these?

ADD REPLY
1
Entering edit mode
4.0 years ago

Use the Ensembl perl API with something like this:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;


my $input = $ARGV[0];

my ($chr,$coords) = split(/:/,$input);
my ($start,$end) = split(/-/,$coords);

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                             -port => 3337,
                             -user => 'anonymous',
                             -passwd => undef,
                             -db_version => '99');
my $slice_adaptor = $registry->get_adaptor( 'Homo sapiens', 'Core', 'Slice' );
# Obtain a slice covering the region
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end);
print ">$chr:$start-$end\n",$slice->seq(),"\n";
ADD COMMENT

Login before adding your answer.

Traffic: 1968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6