Question

Converting a BED file to FASTA

0

Entering edit mode

4.6 years ago

niveven • 0

Hi everyone! I need to convert many .bed coordinates to FASTA format so I can use MEME FIMO on them (to my best knowledge it only accepts FASTA format..). I found a method here: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html but it seem to require downloading the entire genome assembly in FASTA format, is there any other way of doing it? Maybe a tool that takes the relevant sequence from the web according to the bed coordinates? Thanks, Niv

conversion bed fasta • 4.3k views

ADD COMMENT • link updated 4.3 years ago by Alex Reynolds 36k • written 4.6 years ago by niveven • 0

score 3 · Answer 1 · 2020-11-20

You could loop through merged BED intervals from N sorted BED files to get their sequence via a RESTful query to the UCSC Genome Browser.

Pre-merging intervals will give a faster result by reducing the number of queries you make to the UCSC service.

Here's a bash script to demonstrate this:

#!/bin/bash 

GENOME=hg38

while read INTERVAL
do
    ELEMENTS=(${INTERVAL//\t/ })
    CHROMOSOME=${ELEMENTS[0]}
    START=${ELEMENTS[1]}
    END=${ELEMENTS[2]}
    echo ">${CHROMOSOME}:${START}-${END}"
    QUERY="https://api.genome.ucsc.edu/getData/sequence?genome=${GENOME}&chrom=${CHROMOSOME}&start=${START}&end=${END}"
    SEQUENCE=`wget -qO- ${QUERY} | jq --raw-output .dna`
    echo "${SEQUENCE}"
done < <( bedops --merge file1.bed file2.bed ... fileN.bed )

To run it, just edit file1.bed file2.bed ... fileN.bed in this script, replacing it with your list of N sorted files.

Also change the value of GENOME if you are working with another assembly other than hg38 (hg19, mm10, etc.).

Then run the script, e.g.:

$ ./query.sh > answer.fa

Your result will be a FASTA-formatted text file called answer.fa.

Tools needed to make this script work include readily-available packages like wget, BEDOPS, and jq:

You can use package managers to install these, if you don't already have them installed.

Note: If you are in Europe or Asia, you might get a faster result using a UCSC mirror site: https://genome.ucsc.edu/goldenPath/help/api.html#Mirrors

In that case, just modify the https URL in the QUERY variable in the script, accordingly.