Batch Fetching Fasta Sequences From Bed File
4
3
Entering edit mode
13.0 years ago
Carl ▴ 80

Hi,

I have a number of bed files, for which I would like to extract the corresponding fasta sequences. What is the best way to do this in batch (i.e. without using e.g. the UCSC table browser), since UCSC does not provide (to my knowledge) any web service.. any Bioperl solution ? Or some other web service ? If possible, I would like to avoid installing the genome locally (otherwise the bedtools would be a solution).

Thx!

fasta bed • 21k views
ADD COMMENT
0
Entering edit mode

How many sequences are you talking? a dozen? thousands? millions? The solution may very well differ depending on your needs.

ADD REPLY
0
Entering edit mode

make sure there is a reason for avoiding to download the genome - I've never seen an analysis that did not need to be repeated many times over and that can quickly backfire and you'll end up with far slower solutions

ADD REPLY
9
Entering edit mode
13.0 years ago
brentp 24k

You can do this with bedtools if you download the sequence data here's a shell script for the entire process if you're using hg19:

BED=your.bed
for chr in `seq 1 22` X Y
do
    wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz | gunzip -c >> hg19.fa
done

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

EDIT: argh. Somehow I missed that you explicitly said you didn't want to download the sequence data or bedtools would be a solution.

ADD COMMENT
1
Entering edit mode

".../chr1.fa.gz ..." produced a fasta file with the chr1 sequence concatenated 24 times, I needed to put the variable name in place of the 1 ".../chr1$chr.fa.gz..." to get this to work:

BED=$1
for chr in `seq 1 22` X Y
do
    wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr$chr.fa.gz | gunzip -c >> hg19.fa
done

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

(NB - "BED=$1" - bedfile name is first argument passed to script)

ADD REPLY
4
Entering edit mode
13.0 years ago
lh3 33k

I have a solution for hg18/b36, but not for hg19/b37.

awk '{print $1":"($2+1)"-"$3}' in.bed | xargs samtools faidx ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/technical/reference/human_b36_male.fa.gz > out.fa

For razip compressed and faidx indexed fasta files, samtools is able to retrieve subsequences over network.

In addition, for a couple of regions, you can still use UCSC/Ensembl URL to retrieve sequences. For many regions, I believe Ensembl has APIs, though I have not used that.

ADD COMMENT
0
Entering edit mode

so all that's required for hg19 is an razip'ed file on a public server, yes?

ADD REPLY
0
Entering edit mode

And it is accompanied by a ".fai" file, the fasta index: two files on a public HTTP/FTP server.

ADD REPLY
4
Entering edit mode
13.0 years ago

Yes, there is a web service (DAS) to fetch the genomic DNA sequence: e.g.:

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:10001000,10003000

see this other post for some solutions related to your problem: Perl To Retrieve Sequences From Ucsc Genome Browser

ADD COMMENT
0
Entering edit mode
7.0 years ago

Use a command like this:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa

or for a single region:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000

Requires the UCSC tool twoBitToFa, available from http://hgdownload.cse.ucsc.edu/admin/exe/

If you're not on the hg19 genome, you have to index your .fa file first with faToTwoBit

ADD COMMENT

Login before adding your answer.

Traffic: 1596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6