Question

The most efficient way to 'getFastaFromBed' from 'golden gate data'

0

Entering edit mode

7.0 years ago

chrisclarkson100 ▴ 150

I would like to get the sequence of specific regions in the genome (specified by a bed file- 'heterochromatin_mm9.bed'). Unfortunately when I go to golden gate the data is divided into chromosomes where I would prefer everything would be in one fasta file:

      Name                    Last modified       Size  Description
      Parent Directory                             -   
      chr1.fa.gz              25-Jul-2007 10:26   60M  
      chr1_random.fa.gz       25-Jul-2007 10:29  329K  
      chr2.fa.gz              25-Jul-2007 10:26   56M  
      chr3.fa.gz

So I wrote a python script to 'wget' all this data and specifically take the fasta sequences that I need.

import subprocess
for i in ['chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chrX','chrY']:
    subprocess.call(['wget','http://hgdownload.cse.ucsc.edu/goldenpath/mm9/chromosomes/'+str(i)+'.fa.gz'])
    subprocess.call(['gunzip',str(i)+'.fa.gz'])
    subprocess.call(['bedrolls',getfasta','-fi', str(i)+'.fa', '-bed', '/home/projects/heterochromatin_mm9.bed', '>', 'LOCK'+str(i)+'.fasta'])
    subprocess.call(['rm',str(i)+'.fa'])
    counter+=1

However I'm convinced there's a more efficient way of doing this where one could have the final product result in a single fasta file? Thanks

Assembly sequence • 1.6k views

ADD COMMENT • link updated 7.0 years ago by Alex Reynolds 35k • written 7.0 years ago by chrisclarkson100 ▴ 150

score 2 · Answer 1 · 2017-04-29

You can just download the files once, index them with samtools, and do lookups against the indexed files as needed. You can (and should) keep the chromosomes separate, but it is easy to put them into one directory.

Here's a generic shell-based approach, which uses Kent utilities fetchChromSizes to get a list of chromosomes for the build mm9:

$ mkdir mm9
$ cd mm9
$ while read chr; do echo "$chr"; wget -qO- "http://hgdownload.cse.ucsc.edu/goldenpath/mm9/chromosomes/$chr.fa.gz" | gunzip -c - > $chr.fa; samtools faidx $chr.fa; done < <(fetchChromSizes mm9 | cut -f1)

Then each time you want to convert a BED to FASTA, you can use the reference genome files you have already downloaded and indexed:

For example, if you're still in the mm9 directory:

$ bed2faidxsta.pl < in.bed > out.fa

Otherwise, just pass in the path to the indexed files:

$ bed2faidxsta.pl --fastaDir="/path/to/indexed/mm9" < in.bed > out.fa