Hi there! This is my first post on your forum and I hope you'll be able to help me.
This is my first intership in bioinformatics and I'm kind of learning everything by myself. I learnt how to use python 2 days ago. I'm actually working with a large dataset (200Go) of metagenomics sequences. My boss asked me to blast them against the 16SMicrobial NCBI database and then "copy/paste" in a new file the sequences with hits, assuming they are 16S related. Easier to say than to do when you blast about 413000 sequences.
So I decided to try to make a script with Python that would do it. I also installed Biopython because it seemed to be a very good tool.
My first idea is to use a script i found on the Biopython website. this script is used to retrieve sequences with no blast hits. This would at least help to separate sequences i don't from sequences i need.
from Bio import SeqIO from Bio.Blast import NCBIXML q_dict = SeqIO.to_dict(SeqIO.parse(open("queries.fasta"), "fasta")) hits =  for record in NCBIXML.parse(open("BLAST_RESULTS.xml")): # As of BLAST 2.2.19 the xml output for multiple-query blast searches # skips queries with no hits so we could just append the ID of every blast # record to our 'hit list'. It's possible that the NCBI will change this # behaviour in the future so let's make the appending conditional on there # being some hits (ie, a non-empty alignments list) recorded in the blast record if record.alignments: # The blast record's 'query' contains the sequences description as a # string. We used the ID as the key in our dictionary so we'll need to # split the string and get the first field to remove the right entries hits.append(record.query.split()) misses = set(q_dict.keys()) - set(hits) orphan_records = [q_dict[name] for name in misses]
This is my first and only hint, but I'm sure there is a better way to do it.
Can someone help me please?