Question

Taxonomy Of Blast Hits

9

Entering edit mode

14.2 years ago

Darked89 4.6k

Lets have 200k genomic contigs with some (unknown) bacterial contamination.

I blasted (blastn vs nr) all of them, got tabulated output and passed the uniq acc nos ca 5k to Batch Entrez. Since neither my target genome nor bacterias causing contamination are not sequenced, I got a shotgun of results (3000 Eukaryota, 2000 Bacteria, few viruses).

Now for a tricky part: what I need is: sequence_identifier + taxonomic_id(s) + main_tax_group

something along the line:

A000001 573 Bacteria

Apart from writing a script storing the sequence & taxonomy info into say MySQL, then going through blast top hits output, are there any tools (taverna work flows?) which can do it for me?

re Pierre

Primary input is text blast output of:

blastcl3 -p blastn -m 9 -e 0.00001 -b 1 -i frag01 -o out_blastn_frag01

I grep-ed and awk-ed hit acc numbers from second column. Resulting text file (one acc no per line) was feed to Batch Entrez. As far as I can tell there is no way of selecting output in form: A000001 573 Bacteria The most parsable output seems to be TinyXML, but then I will download full bacterial genomes / eukaryotic chromosomes worth of sequence which at this stage I do not need.

Ideally instead of two extremes (E.coli K12 + Bacteria) getting a whole taxonomic path:

cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli

will be preferred. That way one can zoom in (select more than just species/strain and taxonomic Kingdom).

So at this moment I am split between using (1) just blast tabulated text output or selecting some Batch Entrez output which then I will be able to combine with (1).

re giovanni single line which gets squezzed a bit here:

contig62836  gi|119525916|gb|CP000508.1|     93.18   44      3       0       1109    1152    262350  262393  2e-06   63.9

Before each of the top hits there is blast header with hash sign in front:

# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score

So simple:

grep -A 1 Fields out_blastn_frag0* | grep contig | awk '{ print $2}' | awk 'FS="|" {print $4}' | sort | uniq > all_uniq_hits_100302.txt

gives me list off unique accession numbers of my top hits suitable for Batch Entrez.

re XML: yes, but I tried to avoid too much network traffic. XML for half a million contigs is a lot of data. save for oneliners I am using python.

blast taxonomy • 14k views

ADD COMMENT • link updated 6 months ago by Ram 43k • written 14.2 years ago by Darked89 4.6k

1

Entering edit mode

hum, not sure I understand what is your input... An example ?

ADD REPLY • link 14.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Can you also show an example of the table output from blast? Anyway it is better to use the xml output as it is more stable over time. Also, are you doing this in any particular programming language or tool?

ADD REPLY • link 14.2 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-03-02

6

Entering edit mode

14.2 years ago

Pierre Lindenbaum 161k

I you want to get the TinySeq XML without getting the sequence, I would create a SAX parser that would only get the value of the TaxonId and ignoring the other field (see "class TinySeqHandler" in http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/TwitterOmics.java for an example). Having the taxonId you can get the full lineage from

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=YOUR_TAXON_ID&retmode=xml

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 14.2 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

please don't be shy about continuing your discussion here... If you continue your discussion in private, then it is of no use for the other readers.

ADD REPLY • link 14.2 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

Pierre's Java program did work as promised. For 6.6k accessions it took ca 6 hours. Thank you :-). Now it is my part to combine it with blast output /contig sizes etc.

ADD REPLY • link 14.2 years ago by Darked89 4.6k

0

Entering edit mode

Dear Pierre, I just send you an email to your yahoo address. In short I think it is a great idea. i will look into your code tomorrow. Thank you

ADD REPLY • link 14.2 years ago by Darked89 4.6k

0

Entering edit mode

put the source code I wrote on gist: http://gist.github.com/320585

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 14.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Sorry but the email was non-technical. All about what I possibly can do for Pierre for (in some way at least) doing my homework. Surely it may be interesting in the longer run how do we return the favors (authorship? $$$, invitation to give a talk?) but often person asking the question is not at the helm (can not promise much). Hope it explains a bit.

As for Pierre's program, once it stops running and I check the output I will write about it.

ADD REPLY • link 14.2 years ago by Darked89 4.6k

Ram · Answer 2 · 2010-03-02

6

Entering edit mode

14.2 years ago

Michael 54k

From the description of your input data I guess that you are trying to do a taxonomic classification of sequences in a metagenomics approach. I further assume that you have about 200.000 reads or sequences (or do you alternatively mean assembled contigs of length 200 kB?). I am not sure if I completely understand the question, but whatever you do, filtering out the tax ids with your own script might not be the best option.

I assume further you wish to compute a tree of the taxonomic composition of the data in total.

That way one can zoom in (select more than just species/strain and taxonomic Kingdom)

For this task you might want to try the MEGAN (Metagenome Analysis) software.

Actually, what you are describing looks very much like one of the publications they have in their publications list:

H. N. Poinar, C. Schwarz, Ji Qi, B. Shapiro, R. D. E. MacPhee, B. Buigues, A. Tikhonov, D. H. Huson, L. P. Tomsho, A. Auch, M. Rampp, W. Miller, S. C. Schuster, Metagenomics to Paleogenomics: Large-Scale Sequencing of Mammoth DNA, Science 311:392-394, 2006

There is also a tutorial on setting the right BLAST parameters for use with short reads.

So in principle, this program could do the job or at least you can have a look at the right parameters for blast.

ADD COMMENT • link updated 6 months ago by Ram 43k • written 14.2 years ago by Michael 54k

0

Entering edit mode

Thank you. These are genomic contigs. "Metagenomics" is accidental. Input DNA was from two sources, one of which contained bacterias. No idea if these come from dirty root, lived between plant cells, within them(?) but it does not look like bacterial lab strain jumping to a new bottle.
Sequence produced by 454s, assembled by Newbler. My contigs are anywhere from 100bp to 1Mbp. So I am expecting one plant genome and at least one, possibly many bacterial species. Single, 1Mb large bacterial contig can hit multiple species of bacteria (blastp using predicted genes, 40-70% similarity hits).

ADD REPLY • link 14.2 years ago by Darked89 4.6k

0

Entering edit mode

Many plants undergo symbiotic interactions with soil-bacteria, e.g. legume palnts and nitrogen-fixing rhizobia, these form root nodules. If you sample from the wild, you have possibly discovered the plant and its symbiont(s) in between root cells, otherwise just dirt. Just speculation, depends on how you got the sample.

Anyway, maybe then it's maybe best to sort out the individual reads on the domain level (bacteria vs. eukaryota) and assemble afterwards. At least if mainly interested in a pure assembly.

ADD REPLY • link 14.2 years ago by Michael 54k

0

Entering edit mode

re reassembly: indeed this is what we will do in the end. But since we already have draft assembly reducing the total sequence length I went for contig screen first. As a bonus one can confirm that 454 contigs contain our plant DNA by mapping (blast pre-screened) ESTs, GSS sequences and Illumina short reads from bacteria-free (so far...) leaves / mRNAs.

ADD REPLY • link 14.2 years ago by Darked89 4.6k

Ram · Answer 3 · 2010-04-19

Instead of submitting your IDs to Batch Entrez, you could simply extract tax info from the taxonomy flat file available from NCBI:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz = protein taxonomy info
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz = nucleotide taxonomy info

Saves mucking around with XML format. Also avoids making tons of calls to EUtils over the network.

This might be useful too: http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy