Converting a BED file to FASTA
1
0
Entering edit mode
10 months ago
niveven • 0

Hi everyone! I need to convert many .bed coordinates to FASTA format so I can use MEME FIMO on them (to my best knowledge it only accepts FASTA format..). I found a method here: https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html but it seem to require downloading the entire genome assembly in FASTA format, is there any other way of doing it? Maybe a tool that takes the relevant sequence from the web according to the bed coordinates? Thanks, Niv

conversion bed fasta • 1.1k views
3
Entering edit mode
10 months ago

You could loop through merged BED intervals from N sorted BED files to get their sequence via a RESTful query to the UCSC Genome Browser.

Pre-merging intervals will give a faster result by reducing the number of queries you make to the UCSC service.

Here's a bash script to demonstrate this:

#!/bin/bash

GENOME=hg38

do
ELEMENTS=(${INTERVAL//\t/ }) CHROMOSOME=${ELEMENTS[0]}
START=${ELEMENTS[1]} END=${ELEMENTS[2]}
echo ">${CHROMOSOME}:${START}-${END}" QUERY="https://api.genome.ucsc.edu/getData/sequence?genome=${GENOME}&chrom=${CHROMOSOME}&start=${START}&end=${END}" SEQUENCE=wget -qO-${QUERY} | jq --raw-output .dna
echo "${SEQUENCE}" done < <( bedops --merge file1.bed file2.bed ... fileN.bed )  To run it, just edit file1.bed file2.bed ... fileN.bed in this script, replacing it with your list of N sorted files. Also change the value of GENOME if you are working with another assembly other than hg38 (hg19, mm10, etc.). Then run the script, e.g.: $ ./query.sh > answer.fa


Your result will be a FASTA-formatted text file called answer.fa.

Tools needed to make this script work include readily-available packages like wget, BEDOPS, and jq:

You can use package managers to install these, if you don't already have them installed.

Note: If you are in Europe or Asia, you might get a faster result using a UCSC mirror site: https://genome.ucsc.edu/goldenPath/help/api.html#Mirrors

In that case, just modify the https URL in the QUERY variable in the script, accordingly.

0
Entering edit mode

Hi Alex Thank you very much for the detailed answer! I have a question regarding the request to the server - can I request a specific strand from the server? I need to recheck myself, but I think I'll have to specify the strand or I will get the reverse compliment instead of the desired sequence itself.

By the way, do you have an estimation as for the time it will take do convert a large bulk of files? Thanks again, Niv

0
Entering edit mode

You just need to read in the strand information from INTERVAL and store it in a variable. In most BED files, the strand will be in the sixth column. Depending on the value of that strand variable (+ or -, say), you would reverse complement the value of SEQUENCE.

I've written ~90% of what you have to do. So that you learn something, I'll leave it as an exercise to you to fill in the rest, if you choose to use this script.

I don't know how long it would take to download sequences, but I don't imagine it would be too long. In the long run, it will almost certainly be faster to download sequences in bulk for your genome, and to do queries to locally-stored files, however, unless you are doing one-off work.