I would like to get the sequence of specific regions in the genome (specified by a bed file- 'heterochromatin_mm9.bed'). Unfortunately when I go to golden gate the data is divided into chromosomes where I would prefer everything would be in one fasta file:
Name Last modified Size Description
Parent Directory -
chr1.fa.gz 25-Jul-2007 10:26 60M
chr1_random.fa.gz 25-Jul-2007 10:29 329K
chr2.fa.gz 25-Jul-2007 10:26 56M
chr3.fa.gz
So I wrote a python script to 'wget' all this data and specifically take the fasta sequences that I need.
import subprocess
for i in ['chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chrX','chrY']:
subprocess.call(['wget','http://hgdownload.cse.ucsc.edu/goldenpath/mm9/chromosomes/'+str(i)+'.fa.gz'])
subprocess.call(['gunzip',str(i)+'.fa.gz'])
subprocess.call(['bedrolls',getfasta','-fi', str(i)+'.fa', '-bed', '/home/projects/heterochromatin_mm9.bed', '>', 'LOCK'+str(i)+'.fasta'])
subprocess.call(['rm',str(i)+'.fa'])
counter+=1
However I'm convinced there's a more efficient way of doing this where one could have the final product result in a single fasta file? Thanks