Question: Multiple mapping (Taxonomy, Seed, ...) of blast(-like) tabular output
4.1 years ago by
BSP0
European Union
BSP0 wrote:

Hello,

I am dealing with short sequence reads (>1,000,000 reads) from a metagenome and a metatranscriptome. I used a BLAST-like program called RAPSearch 2, which provides me with a BLAST Tabular output format like this:

# Fields (separated by tab): query_id, subject_id,  %_identity, alignment_length, mismatch, gap, query_start, query_end, subject_start, subject_end, log_evalue, bit score

HWI-ST...    438753.AZC_3721    60.7843    51    20    0    1    51    213    263    -9.38    68.17
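As a starting point, the twelve tab-separated fields listed above can be read into a typed record. This is only a minimal sketch: the `Hit` class, `parse_hit`, `read_hits` and the query ID `read_1` are hypothetical names chosen for illustration, and it assumes every result line has exactly the twelve columns from the header comment.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One result line; field order follows the '# Fields' header comment."""
    query_id: str
    subject_id: str
    identity: float
    aln_len: int
    mismatch: int
    gap: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    log_evalue: float
    bit_score: float


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated result line into a Hit."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]))


def read_hits(lines):
    """Yield hits from an iterable of lines, skipping blanks and '#' comments."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_hit(line)


# Hypothetical query ID; the other values are copied from the example line.
example = "read_1\t438753.AZC_3721\t60.7843\t51\t20\t0\t1\t51\t213\t263\t-9.38\t68.17\n"
hits = list(read_hits(["# Fields (separated by tab): ...\n", example, "\n"]))
```

Because `read_hits` accepts any iterable of lines, an open file handle can be passed directly, so a file with more than a million reads is processed one line at a time.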

and an alignment format like this:

HWI-ST... vs 438753.AZC_3721 bits=68.17 log(E-value)=-9.38 identity=60.7843% aln-len=51 mismatch=20 gap-openings=0 nFrame=0
Query:         1 EGRLASLLTDVAAGRLAPLYNYMKDLPAMEGTPAPFLPRRYIERMLGSSSS 51
                 EGRL  +L D A+G L PLYN+M DLP + GTP PFLP+ Y+ R LG SSS
Sbjct:       213 EGRLDQVLHDAASGTLEPLYNFMNDLPGIGGTPVPFLPKTYVSRTLGLSSS 263

 

What I want to achieve is to compare the information from the output with mapping files and retrieve the cross-referenced information in a tab-delimited file format. More precisely, I have an output (searched against the eggNOG database) containing the following information:

Protein name (and the basic output like evalue, alignment length,...)

I downloaded mapping files, which contain:

1) nog name    taxID.protein name (this is the actual subject_id from the output files!)

2) cog name    protein name

3) nog name    function

4) nog name    description

5) species name    tax id

6) cog name    functional category
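One way to combine two-column mapping files like these with the search results is to load each file into a dictionary and look up every subject_id. The sketch below is an assumption-laden illustration, not a tested pipeline: `load_map` and `annotate` are hypothetical helpers, the NOG identifier `NOG00001` and the function label are invented toy values, and it assumes each protein maps to a single NOG (a later line would overwrite an earlier one).

```python
import io


def load_map(handle, key_col=0, value_col=1):
    """Read a tab-separated mapping file into a dict (key column -> value column)."""
    table = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) <= max(key_col, value_col):
            continue  # skip comments and short or blank lines
        table[fields[key_col]] = fields[value_col]
    return table


def annotate(subject_id, protein_to_nog, nog_to_function, nog_to_description):
    """Return (tax_id, nog, function, description) for one subject_id."""
    # subject_id has the form taxID.protein_name, e.g. "438753.AZC_3721"
    tax_id = subject_id.split(".", 1)[0]
    nog = protein_to_nog.get(subject_id, "NA")
    return (tax_id, nog,
            nog_to_function.get(nog, "NA"),
            nog_to_description.get(nog, "NA"))


# Toy stand-ins for the mapping files (invented values, not real eggNOG data).
# The first file lists "nog name <TAB> taxID.protein name", so the protein is
# in column 1 and becomes the lookup key.
protein_to_nog = load_map(io.StringIO("NOG00001\t438753.AZC_3721\n"),
                          key_col=1, value_col=0)
nog_to_function = load_map(io.StringIO("NOG00001\texample function\n"))
nog_to_description = {}  # left empty to show the "NA" fallback

row = annotate("438753.AZC_3721", protein_to_nog,
               nog_to_function, nog_to_description)
```

Writing `"\t".join(row)` for every hit would give the tab-delimited table described above; dictionary lookups also avoid having to sort any of the files.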

I would like to collect this information for my search results. I have already spent a lot of time browsing through available scripts and programs (e.g. BioPython), but I have not yet found a suitable solution. Maybe I overlooked something? Additionally, I would like to use, for example, the TaxID to retrieve a more detailed taxonomic classification (e.g. Kingdom; Phylum; ...).
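For the taxonomic classification, one possible approach is to walk parent links from a TaxID up to the root. The sketch below assumes three lookup tables (TaxID to parent, to rank, and to name) are already in memory; such tables could, for instance, be built from the NCBI taxonomy dump files (`nodes.dmp` and `names.dmp`), but that loading step is not shown. The function name `lineage` and all IDs and names in the toy tables are invented for illustration.

```python
def lineage(tax_id, parent, rank, name):
    """Return [(rank, name), ...] from the root down to tax_id."""
    path = []
    seen = set()  # guards against accidental cycles in the parent table
    while tax_id in parent and tax_id not in seen:
        seen.add(tax_id)
        path.append((rank.get(tax_id, "no rank"), name.get(tax_id, tax_id)))
        if parent[tax_id] == tax_id:  # the root node is its own parent
            break
        tax_id = parent[tax_id]
    return path[::-1]


# Toy taxonomy with made-up IDs and names (not real NCBI entries).
parent = {"1": "1", "10": "1", "20": "10", "30": "20"}
rank = {"1": "no rank", "10": "kingdom", "20": "phylum", "30": "species"}
name = {"1": "root", "10": "ExampleKingdom", "20": "ExamplePhylum",
        "30": "Example species"}

path = lineage("30", parent, rank, name)
classification = "; ".join(n for r, n in path if r != "no rank")
```

Filtering on the rank column is one way to keep only the levels of interest (Kingdom; Phylum; ...) in the final string.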

Does anyone have an idea of how to start here? I would appreciate any suggestions on this matter.

Thank you!

Edit: I just noticed that Seed is in my post title. I also searched against the RefSeq protein database and the UniRef90 database and wanted to analyze the Seed content of the output, since mapping files between these databases also exist. However, I still haven't found a way to do this.

blast mapping rapsearch2 • 2.0k views
4.1 years ago by
5heikki7.6k
Finland
5heikki7.6k wrote:

Sounds like a job for join.

4.1 years ago by
BSP0
European Union
BSP0 wrote:

Thank you, 5heikki, for this hint. However, as far as I understand, join only brings files together, meaning that one line (file 1) connects to one line (file 2; therefore they need to be sorted), resulting in an output with only one line for each pair of connecting lines.


Output can be customized, e.g.

join -1 1 -2 1 -o 2.1,1.2,2.2 <(sort -k1,1 table1) <(sort -k1,1 table2)

This joins the tables on the value of the first column and outputs field 1 of table2, field 2 of table1, and field 2 of table2.
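For comparison, the same join can be sketched in Python with a dictionary index, which needs no pre-sorting. As with join itself, a key that occurs on several lines of one table is paired with every matching line of the other, so one-to-many mappings are kept. The function name `hash_join` and the toy rows are hypothetical, and the output columns mirror `-o 2.1,1.2,2.2`.

```python
from collections import defaultdict


def hash_join(table1, table2):
    """Inner-join rows on their first column.

    Yields (table2 field 1, table1 field 2, table2 field 2) for every
    matching pair of rows, in the order the keys appear in table1.
    """
    index = defaultdict(list)
    for row in table2:
        index[row[0]].append(row)  # keep all rows that share a key
    for r1 in table1:
        for r2 in index.get(r1[0], []):
            yield (r2[0], r1[1], r2[1])


# Toy rows: key "a" appears twice in table2, keys "b" and "c" have no partner.
table1 = [("a", "x"), ("b", "y")]
table2 = [("a", "1"), ("a", "2"), ("c", "3")]
joined = list(hash_join(table1, table2))
```

Unlike join, which emits results in sorted key order, this sketch follows the row order of the first table.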

Reply written 4.1 years ago by 5heikki