Converting a BED file to FASTA
1
0
Entering edit mode
3.4 years ago
niveven • 0

Hi everyone! I need to convert many .bed coordinates to FASTA format so I can use MEME FIMO on them (to my best knowledge it only accepts FASTA format..). I found a method here: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html but it seem to require downloading the entire genome assembly in FASTA format, is there any other way of doing it? Maybe a tool that takes the relevant sequence from the web according to the bed coordinates? Thanks, Niv

conversion bed fasta • 3.4k views
ADD COMMENT
3
Entering edit mode
3.4 years ago

You could loop through merged BED intervals from N sorted BED files to get their sequence via a RESTful query to the UCSC Genome Browser.

Pre-merging intervals will give a faster result by reducing the number of queries you make to the UCSC service.

Here's a bash script to demonstrate this:

#!/bin/bash 

GENOME=hg38

while read INTERVAL
do
    ELEMENTS=(${INTERVAL//\t/ })
    CHROMOSOME=${ELEMENTS[0]}
    START=${ELEMENTS[1]}
    END=${ELEMENTS[2]}
    echo ">${CHROMOSOME}:${START}-${END}"
    QUERY="https://api.genome.ucsc.edu/getData/sequence?genome=${GENOME}&chrom=${CHROMOSOME}&start=${START}&end=${END}"
    SEQUENCE=`wget -qO- ${QUERY} | jq --raw-output .dna`
    echo "${SEQUENCE}"
done < <( bedops --merge file1.bed file2.bed ... fileN.bed )

To run it, just edit file1.bed file2.bed ... fileN.bed in this script, replacing it with your list of N sorted files.

Also change the value of GENOME if you are working with another assembly other than hg38 (hg19, mm10, etc.).

Then run the script, e.g.:

$ ./query.sh > answer.fa

Your result will be a FASTA-formatted text file called answer.fa.

Tools needed to make this script work include readily-available packages like wget, BEDOPS, and jq:

You can use package managers to install these, if you don't already have them installed.

Note: If you are in Europe or Asia, you might get a faster result using a UCSC mirror site: https://genome.ucsc.edu/goldenPath/help/api.html#Mirrors

In that case, just modify the https URL in the QUERY variable in the script, accordingly.

ADD COMMENT
0
Entering edit mode

Hi Alex Thank you very much for the detailed answer! I have a question regarding the request to the server - can I request a specific strand from the server? I need to recheck myself, but I think I'll have to specify the strand or I will get the reverse compliment instead of the desired sequence itself.

By the way, do you have an estimation as for the time it will take do convert a large bulk of files? Thanks again, Niv

ADD REPLY
0
Entering edit mode

You just need to read in the strand information from INTERVAL and store it in a variable. In most BED files, the strand will be in the sixth column. Depending on the value of that strand variable (+ or -, say), you would reverse complement the value of SEQUENCE.

I've written ~90% of what you have to do. So that you learn something, I'll leave it as an exercise to you to fill in the rest, if you choose to use this script.

I don't know how long it would take to download sequences, but I don't imagine it would be too long. In the long run, it will almost certainly be faster to download sequences in bulk for your genome, and to do queries to locally-stored files, however, unless you are doing one-off work.

ADD REPLY

Login before adding your answer.

Traffic: 1777 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6