Question: Is there a more efficient way of checking multiple sequences for how many hits they have in the human genome?
gravatar for rmartson
3.5 years ago by
rmartson30 wrote:

Currently I'm running a blast search for each flank sequence and then waiting to get the number of hits so they can be filtered into or out of a new file. There's probably a way to run multiple BLAST searches simultaneously, right? I'm just not sure how it would be done.

"""This script BLASTs every 200bp flank sequence against the human genome to check for matches. Any flanking sequences with more than 1 hit are excluded from the new flank list."""

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

print "Enter dataset"
dataset = raw_input()

flanks = list(SeqIO.parse("flanks" + str(dataset) + ".fasta", "fasta"))
flanksnew = open("flanksnew" + str(dataset) + ".fasta", "w")
i = len(flanks)

for record in flanks:
    if i>1:
        print str(i)+" records left."
        print "1 record left."
    result_handle = NCBIWWW.qblast("blastn", "nt", sequence=record.seq, entrez_query = "txid9605[ORGN]")
    blast_records = list(NCBIXML.parse(result_handle))
    if len(blast_records)<2:
        print >> flanksnew, '>' +
        print >> flanksnew, record.seq

print "Total flanks:"
print len(flanks)
print "True flanks:"
print len(list(SeqIO.parse("flanksnew" + str(dataset) + ".fasta", "fasta")))
blast biopython • 1.9k views
ADD COMMENTlink modified 3.5 years ago by Alex Reynolds31k • written 3.5 years ago by rmartson30

How long are your query sequences? You could set up a local BLAT server and run things pretty quickly that way. No need to use Python.

ADD REPLYlink written 3.5 years ago by Alex Reynolds31k

200bp Wouldn't I need to download the entire human genome? I have the space, actually. I just have no idea how to go about it. That sounds ideal though.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by rmartson30

If you insist in using Python for this you should have a look at the multiprocessing module for parallelizing the blast searches.

ADD REPLYlink written 3.5 years ago by WouterDeCoster45k
gravatar for Alex Reynolds
3.5 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Downloading BLAT

To get BLAT source code:

$ mkdir /tmp/blat && cd /tmp/blat
$ wget
$ unzip

Patching (optional)

I decided to make blat a static binary to avoid missing shared library errors. Here's a patch you can use to modify the blat makefile:

$ cat >> static-blat-makefile.patch
< L += -lm $(SOCKETLIB)
> L += -lm -ldl $(SOCKETLIB) -static-libgcc 
<       ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O $(MYLIBS) $L
>       ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O -static $(MYLIBS) $L

You may need static library packages installed on your system. The names of these packages will depend on your version of Linux.

Then apply the patch:

$ cd /tmp/blat/blatSrc/blat
$ cp makefile makefile.original
$ patch makefile.original -i ../../static-blat-makefile.patch -o makefile

You may decide not to apply this patch. You could probably skip this step. I just don't like dynamic binaries.

Building BLAT

In any case, you will want to go to the top level of the blatSrc directory and run make to build the kit:

$ cd /tmp/blat/blatSrc && make

This will take a few minutes to build binaries.

Installing BLAT

To install them into ${HOME}/bin/${MACHTYPE}, run:

$ make install

This destination is a subdirectory of your home directory.

Once it is built and installed, you can copy the binary to /usr/local/bin or somewhere in your shell's PATH that makes sense to you. For me, my ${MACHTYPE} is x86_64 and I like having binaries in /usr/local/bin:

$ sudo cp ~areynolds/bin/x86_64/blat /usr/local/bin/blat

Adjust this to the particulars of your setup.

Downloading genomes

Once you have installed blat, the next step is to download a FASTA file for your genome of interest. If you wanted hg38, for instance:

$ for chr in `seq 1 22` X Y; do echo $chr; wget -qO-$chr.fa.gz | gunzip -c - >> hg38.fa; done

Optimizing queries

Once you have this file hg38.fa, you can start doing queries, but it can help speed up searches if you first make an OOC file:

$ blat /path/to/hg38.fa /dev/null /dev/null -makeOoc=/path/to/hg38.fa.11.ooc -repMatch=1024

When you do searches, you'd pass this OOC file as an option to skip over regions with over-represented sequences.


Once you have this OOC file made, you can do searches with your FASTA file containing sequences of interest:

$ blat /path/to/hg38.fa /path/to/your-sequences.fa -ooc=/path/to/hg38.fa.11.ooc search-results.psl

The blat binary will write any search results to a PSL-formatted text file called search-results.psl. You can name this whatever you want.

The PSL format is described on the UCSC site.


If you have very many sequences, you can parallelize this by simply splitting up your input sequences file into smaller pieces, and running multiple blat processes, one process for each piece of your original sequences file, writing many PSL files as output.

Set operations

It can help to use a tool like BEDOPS psl2bed to convert PSL to a BED file to do set operations, but that depends on what you want to do with the results. In any case, to convert a PSL file to a sorted BED file:

$ psl2bed < search-results.psl > search-results.bed
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Alex Reynolds31k

Thanks for all the info but I ended up installing BLAT and downloading the hg38 assembly already. It probably just took much longer without all the info in one place.

Any idea which set of fasta files in my "chroms" folder I should now be running BLAT against and how? I'm guessing there are a lot of redundant copies here.

Usually it's just blat database.fasta query.fasta output.psl -out=pslx

But now I need to do something like blat /chroms/[subset of fasta files].fasta query.fasta output.psl -out=pslx

Okay, I just saw you've described that in your reply too sorry. I'm downloading the genome as a single fasta file now.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by rmartson30

[Had a problem but fixed it]

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by rmartson30

psl2bed gives me this

/usr/local/bin/psl2bed: line 140:  7344 Segmentation fault: 11  ${cmd} ${options} - 0<&0
ADD REPLYlink written 3.5 years ago by rmartson30

I have patched psl2bed to fix this issue. It may still be a day or two before a new release is out. If you need something now, you could build this from source.

If you need to install a compiler on your OS X computer, you could do the following:

$ xcode-select --install

Once that's out of the way, or if you already have a compiler toolkit installed, you can do the following

$ cd /tmp
$ git clone
$ cd bedops
$ git checkout v2p4p27_mergedPoolMemory 
$ make
$ make install
$ cp bin/* /usr/local/bin

Then run psl2bed, as described above.

ADD REPLYlink written 3.5 years ago by Alex Reynolds31k

It may be easier to build binaries from source, if you are using pre-built binaries.

ADD REPLYlink written 3.5 years ago by Alex Reynolds31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2456 users visited in the last hour