You could loop through merged BED intervals from N sorted BED files to get their sequence via a RESTful query to the UCSC Genome Browser.
Pre-merging intervals will give a faster result by reducing the number of queries you make to the UCSC service.
Here's a bash
script to demonstrate this:
#!/bin/bash
GENOME=hg38
while read INTERVAL
do
ELEMENTS=(${INTERVAL//\t/ })
CHROMOSOME=${ELEMENTS[0]}
START=${ELEMENTS[1]}
END=${ELEMENTS[2]}
echo ">${CHROMOSOME}:${START}-${END}"
QUERY="https://api.genome.ucsc.edu/getData/sequence?genome=${GENOME}&chrom=${CHROMOSOME}&start=${START}&end=${END}"
SEQUENCE=`wget -qO- ${QUERY} | jq --raw-output .dna`
echo "${SEQUENCE}"
done < <( bedops --merge file1.bed file2.bed ... fileN.bed )
To run it, just edit file1.bed file2.bed ... fileN.bed
in this script, replacing it with your list of N sorted files.
Also change the value of GENOME
if you are working with another assembly other than hg38
(hg19
, mm10
, etc.).
Then run the script, e.g.:
$ ./query.sh > answer.fa
Your result will be a FASTA-formatted text file called answer.fa
.
Tools needed to make this script work include readily-available packages like wget
, BEDOPS, and jq
:
You can use package managers to install these, if you don't already have them installed.
Note: If you are in Europe or Asia, you might get a faster result using a UCSC mirror site: https://genome.ucsc.edu/goldenPath/help/api.html#Mirrors
In that case, just modify the https URL in the QUERY
variable in the script, accordingly.
Hi Alex Thank you very much for the detailed answer! I have a question regarding the request to the server - can I request a specific strand from the server? I need to recheck myself, but I think I'll have to specify the strand or I will get the reverse compliment instead of the desired sequence itself.
By the way, do you have an estimation as for the time it will take do convert a large bulk of files? Thanks again, Niv
You just need to read in the strand information from
INTERVAL
and store it in a variable. In most BED files, the strand will be in the sixth column. Depending on the value of that strand variable (+
or-
, say), you would reverse complement the value ofSEQUENCE
.I've written ~90% of what you have to do. So that you learn something, I'll leave it as an exercise to you to fill in the rest, if you choose to use this script.
I don't know how long it would take to download sequences, but I don't imagine it would be too long. In the long run, it will almost certainly be faster to download sequences in bulk for your genome, and to do queries to locally-stored files, however, unless you are doing one-off work.