Batch Fetching Fasta Sequences From Bed File
4
3
Entering edit mode
10.5 years ago
Carl ▴ 80

Hi,

I have a number of bed files, for which I would like to extract the corresponding fasta sequences. What is the best way to do this in batch (i.e. without using e.g. the UCSC table browser), since UCSC does not provide (to my knowledge) any web service.. any Bioperl solution ? Or some other web service ? If possible, I would like to avoid installing the genome locally (otherwise the bedtools would be a solution).

Thx!

fasta bed • 18k views
0
Entering edit mode

How many sequences are you talking? a dozen? thousands? millions? The solution may very well differ depending on your needs.

0
Entering edit mode

make sure there is a reason for avoiding to download the genome - I've never seen an analysis that did not need to be repeated many times over and that can quickly backfire and you'll end up with far slower solutions

9
Entering edit mode
10.5 years ago
brentp 23k

You can do this with bedtools if you download the sequence data here's a shell script for the entire process if you're using hg19:

BED=your.bed
for chr in seq 1 22 X Y
do
done

fastaFromBed -fi hg19.fa -bed $BED -fo$BED.fasta


EDIT: argh. Somehow I missed that you explicitly said you didn't want to download the sequence data or bedtools would be a solution.

1
Entering edit mode

".../chr1.fa.gz ..." produced a fasta file with the chr1 sequence concatenated 24 times, I needed to put the variable name in place of the 1 ".../chr1$chr.fa.gz..." to get this to work: BED=$1
for chr in seq 1 22 X Y
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr$chr.fa.gz | gunzip -c >> hg19.fa done fastaFromBed -fi hg19.fa -bed$BED -fo $BED.fasta  (NB - "BED=$1" - bedfile name is first argument passed to script)

4
Entering edit mode
10.5 years ago
lh3 32k

I have a solution for hg18/b36, but not for hg19/b37.

awk '{print $1":"($2+1)"-"\$3}' in.bed | xargs samtools faidx ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/technical/reference/human_b36_male.fa.gz > out.fa


For razip compressed and faidx indexed fasta files, samtools is able to retrieve subsequences over network.

In addition, for a couple of regions, you can still use UCSC/Ensembl URL to retrieve sequences. For many regions, I believe Ensembl has APIs, though I have not used that.

0
Entering edit mode

so all that's required for hg19 is an razip'ed file on a public server, yes?

0
Entering edit mode

And it is accompanied by a ".fai" file, the fasta index: two files on a public HTTP/FTP server.

4
Entering edit mode
10.5 years ago

Yes, there is a web service (DAS) to fetch the genomic DNA sequence: e.g.:

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:10001000,10003000

see this other post for some solutions related to your problem: Perl To Retrieve Sequences From Ucsc Genome Browser

0
Entering edit mode
4.5 years ago

Use a command like this:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa


or for a single region:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000


If you're not on the hg19 genome, you have to index your .fa file first with faToTwoBit