The most efficient way to 'getFastaFromBed' from 'golden gate data'
1
0
Entering edit mode
7.0 years ago

I would like to get the sequence of specific regions in the genome (specified by a bed file- 'heterochromatin_mm9.bed'). Unfortunately when I go to golden gate the data is divided into chromosomes where I would prefer everything would be in one fasta file:

      Name                    Last modified       Size  Description
      Parent Directory                             -   
      chr1.fa.gz              25-Jul-2007 10:26   60M  
      chr1_random.fa.gz       25-Jul-2007 10:29  329K  
      chr2.fa.gz              25-Jul-2007 10:26   56M  
      chr3.fa.gz

So I wrote a python script to 'wget' all this data and specifically take the fasta sequences that I need.

import subprocess
for i in ['chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chrX','chrY']:
    subprocess.call(['wget','http://hgdownload.cse.ucsc.edu/goldenpath/mm9/chromosomes/'+str(i)+'.fa.gz'])
    subprocess.call(['gunzip',str(i)+'.fa.gz'])
    subprocess.call(['bedrolls',getfasta','-fi', str(i)+'.fa', '-bed', '/home/projects/heterochromatin_mm9.bed', '>', 'LOCK'+str(i)+'.fasta'])
    subprocess.call(['rm',str(i)+'.fa'])
    counter+=1

However I'm convinced there's a more efficient way of doing this where one could have the final product result in a single fasta file? Thanks

Assembly sequence • 1.6k views
ADD COMMENT
2
Entering edit mode
7.0 years ago

You can just download the files once, index them with samtools, and do lookups against the indexed files as needed. You can (and should) keep the chromosomes separate, but it is easy to put them into one directory.

Here's a generic shell-based approach, which uses Kent utilities fetchChromSizes to get a list of chromosomes for the build mm9:

$ mkdir mm9
$ cd mm9
$ while read chr; do echo "$chr"; wget -qO- "http://hgdownload.cse.ucsc.edu/goldenpath/mm9/chromosomes/$chr.fa.gz" | gunzip -c - > $chr.fa; samtools faidx $chr.fa; done < <(fetchChromSizes mm9 | cut -f1)

Then each time you want to convert a BED to FASTA, you can use the reference genome files you have already downloaded and indexed:

For example, if you're still in the mm9 directory:

$ bed2faidxsta.pl < in.bed > out.fa

Otherwise, just pass in the path to the indexed files:

$ bed2faidxsta.pl --fastaDir="/path/to/indexed/mm9" < in.bed > out.fa
ADD COMMENT

Login before adding your answer.

Traffic: 2475 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6