Hello,
I am dealing with short sequence reads (>1,000,000 reads) from a metagenome and a metatranscriptome. I used a BLAST-like program called RAPSearch2, which produces BLAST tabular output like this:
query_id subject_id %_identity alignment_length mismatch gap query_start query_end subject_start subject_end log_evalue bit_score
HWI-ST... 438753.AZC_3721 60.7843 51 20 0 1 51 213 263 -9.38 68.17
and an alignment format like this:
HWI-ST... vs 438753.AZC_3721 bits=68.17 log(E-value)=-9.38 identity=60.7843% aln-len=51 mismatch=20 gap-openings=0 nFrame=0
Query: 1 EGRLASLLTDVAAGRLAPLYNYMKDLPAMEGTPAPFLPRRYIERMLGSSSS 51
EGRL +L D A+G L PLYN+M DLP + GTP PFLP+ Y+ R LG SSS
Sbjct: 213 EGRLDQVLHDAASGTLEPLYNFMNDLPGIGGTPVPFLPKTYVSRTLGLSSS 263
What I want to achieve is to compare the information from the output with mapping files and retrieve the cross-linked information in a tab-delimited file. More precisely, I have an output (searched against the eggNOG database) containing information about:
Protein name (and the basic output like evalue, alignment length,...)
I downloaded mapping files, which contain:
- nog name: taxID.protein name (this is the actual subject_id from the output files!)
- cog name: protein name
- nog name: function
- nog name: description
- species name: tax id
- cog name: functional category
I would like to collect this information for my search results. I have already spent a lot of time browsing through available scripts and programs (e.g. BioPython), but I have not yet found a suitable solution. Maybe I overlooked something? Additionally, I would like to use, for example, the TaxID to retrieve a more detailed taxonomic classification (e.g. Kingdom; Phylum; ...).
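To make the goal concrete, here is a minimal Python sketch of the lookup described above. It reads the tabular hits, splits each subject_id into TaxID and protein name at the first dot, and appends annotation columns taken from mappings loaded into dictionaries. The function names, the two-column layout of the mapping files, and the sample annotation values (`NOG00001`, `example function`) are assumptions for illustration only, not the actual eggNOG file layout or content.

```python
import csv
import io


def load_mapping(handle):
    """Read a two-column, tab-delimited mapping (key, value) into a dict.

    Assumed layout; the real mapping files may have more columns.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


def annotate_hits(hits, protein_to_nog, nog_to_function, missing="NA"):
    """Yield each hit row extended with TaxID, protein name, NOG and function."""
    for row in csv.reader(hits, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        subject_id = row[1]
        # subject_id looks like "taxID.protein name"; split at the first dot only
        tax_id, _, protein = subject_id.partition(".")
        nog = protein_to_nog.get(subject_id, missing)
        function = nog_to_function.get(nog, missing)
        yield row + [tax_id, protein, nog, function]


# Tiny in-memory example; the annotation values are made up.
hits = io.StringIO(
    "HWI-ST...\t438753.AZC_3721\t60.7843\t51\t20\t0\t1\t51\t213\t263\t-9.38\t68.17\n"
)
protein_to_nog = load_mapping(io.StringIO("438753.AZC_3721\tNOG00001\n"))
nog_to_function = load_mapping(io.StringIO("NOG00001\texample function\n"))

annotated = list(annotate_hits(hits, protein_to_nog, nog_to_function))
for row in annotated:
    print("\t".join(row))
```

Because the mappings live in dictionaries, neither file has to be sorted, and many hits can point to the same subject_id. The extracted TaxID column could then serve as the key for a further taxonomy lookup.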
Does anyone have an idea on how to start here? I would appreciate any suggestions on this matter.
Thank you!
Edit: I just noticed that SEED is in my post title. I also searched against the RefSeq protein database and the UniRef90 database and wanted to analyze the SEED content of the output, since mapping files between these databases also exist. However, I still haven't found a way to do this.
Thank you, 5heikki, for this hint. However, as far as I understand, join only brings files together line by line, meaning that one line in file 1 connects to one line in file 2 (so both files need to be sorted), and the output contains only one line for each pair of connected lines.
The output can be customized, e.g. to join the tables on the value in the first column and output field 1 of table2, field 2 of table1, and field 2 of table2.
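For comparison, the same kind of join with selected output fields can be sketched in Python. This is not the `join` command itself but a dictionary-based equivalent: it does not need sorted input and emits one output line for every matching pair, so several lines in table1 may share a key. The function name `join_on_first_column` and the sample rows are hypothetical.

```python
import io
from collections import defaultdict


def join_on_first_column(table1, table2):
    """Join two tab-delimited tables on column 1.

    Yields (table2 field 1, table1 field 2, table2 field 2) for every
    pair of lines that share the same first-column value.
    """
    # Index table2 by its first column; a key may map to several rows.
    rows_by_key = defaultdict(list)
    for line in table2:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            rows_by_key[fields[0]].append(fields)

    # Stream table1 and emit one record per matching table2 row.
    for line in table1:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        for match in rows_by_key.get(fields[0], []):
            yield (match[0], fields[1], match[1])


# Unsorted sample input in which the key "k1" occurs twice in table1.
table1 = io.StringIO("k2\tb\nk1\ta\nk1\tc\n")
table2 = io.StringIO("k1\tX\nk2\tY\n")

joined = list(join_on_first_column(table1, table2))
for record in joined:
    print("\t".join(record))
```

Only table2 is held in memory here, so for a search with millions of hits it makes sense to pass the smaller mapping file as table2 and stream the hits as table1.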