Question: [Help]: Extracting Sequence From Fasta Using Blast Informations?
gravatar for francois.maginiot
7.9 years ago by
francois.maginiot10 wrote:

Hi there! This is my first post on your forum and I hope you'll be able to help me.

This is my first intership in bioinformatics and I'm kind of learning everything by myself. I learnt how to use python 2 days ago. I'm actually working with a large dataset (200Go) of metagenomics sequences. My boss asked me to blast them against the 16SMicrobial NCBI database and then "copy/paste" in a new file the sequences with hits, assuming they are 16S related. Easier to say than to do when you blast about 413000 sequences.

So I decided to try to make a script with Python that would do it. I also installed Biopython because it seemed to be a very good tool.

My first idea is to use a script i found on the Biopython website. this script is used to retrieve sequences with no blast hits. This would at least help to separate sequences i don't from sequences i need.

from Bio import SeqIO
from Bio.Blast import NCBIXML

q_dict =  SeqIO.to_dict(SeqIO.parse(open("queries.fasta"), "fasta"))

hits = []
for record in NCBIXML.parse(open("BLAST_RESULTS.xml")):
  # As of BLAST 2.2.19 the xml output for multiple-query blast searches
  # skips queries with no hits so we could just append the ID of every blast 
  # record to our 'hit list'. It's possible that the NCBI will change this
  # behaviour in the future so let's make the appending conditional on there
  # being some hits (ie, a non-empty alignments list) recorded in the blast record
  if record.alignments:
    # The blast record's 'query' contains the sequences description as a
    # string. We used the ID as the key in our dictionary so we'll need to
    # split the string and get the first field to remove the right entries 

misses = set(q_dict.keys()) - set(hits)
orphan_records = [q_dict[name] for name in misses]

This is my first and only hint, but I'm sure there is a better way to do it.

Can someone help me please?

ADD COMMENTlink written 7.9 years ago by francois.maginiot10

I think we need a bit more clarification on what you are trying to accomplish. So you've parsed out the IDs you want to keep and now you want to get the fasta sequences that correspond to those ids?

ADD REPLYlink modified 7.9 years ago • written 7.9 years ago by Damian Kao15k

If all you need are the IDs with and without a match, the tabular output is enough and easier to work with than the XML (and smaller on disk too), but you seem to have everything done and working. What are you asking for? How to save these as a FASTA file?

SeqIO.write(orphan_records, "orphans.fasta", "fasta")

If you want just the records with this, something like this would work:

with_hits = [q_dict[name] for name in hits]

The only other refinement I'd suggest is using a generator expression rather than a list expression - it would avoid having everything in memory. i.e.

with_hits = (q_dict[name] for name in hits)


orphan_records = (q_dict[name] for name in misses)
ADD REPLYlink modified 7.9 years ago • written 7.9 years ago by Peter5.8k
gravatar for francois.maginiot
7.9 years ago by
francois.maginiot10 wrote:


thanks for your help!

In fact i need to identify sequences whith hits against the NCBI database and separate them. At the end, i would like the script to create a fasta/txt file containing the entire sequences (not only the ID) having a hit against the NCBI database.

The code i posted kind of do the contrary but it's my first and only lead for the moment so i think i could modify it or use as a solid base for my script.

I don't know if I'm clear enough...

ADD COMMENTlink written 7.9 years ago by francois.maginiot10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1400 users visited in the last hour