NCBI TaxIds from Diamond Blast output
1
3
Entering edit mode
5.2 years ago
jblommaert92 ▴ 70

Hi,

This question will seem INCREDIBLY similar to the question here, but it's not.

I ran diamond blastx on a large dataset with the output options -f 6 qseqid sseqid bitscore

This gives me a file that looks like

OHJ07_1_contig_1    ELU09376.1  66.6
OHJ07_1_contig_1    KFM76682.1  65.9
OHJ07_1_contig_1    JAT94707.1  63.2
OHJ07_1_contig_1    JAT94707.1  46.6
OHJ07_1_contig_1    XP_002400485.1  62.4
OHJ07_1_contig_1    XP_002400485.1  46.2
OHJ07_1_contig_1    XP_014787375.1  61.6
OHJ07_1_contig_1    XP_014787375.1  40.4
OHJ07_1_contig_1    KOF67573.1  61.6


And I'd like to get taxIds for each of these accession versions. The previous question, linked above, seems to only work on accession numbers, not accession version numbers.

I have also downloaded the NCBI prot.accession2taxid file, and tried to grep the accession version numbers back to this, but the output is in the order of the NCBI file, not mine. The NCBI file looks like

accession   accession.version   taxid   gi
P29373  P29373.2    9606    132401
P22935  P22935.2    10090   132402
P18902  P18902.1    9913    132403
P02753  P02753.3    9606    62298174
P27485  P27485.2    9823    3041715
P06912  P06912.2    9986    1710096


My grep output (head) looks like this

P07201  P07201.2    6584    266918
P21329  P21329.1    7221    134082
P21328  P21328.1    7227    134083
P04571  P04571.1    126592  134316
P04572  P04572.1    6357    134318
P19217  P19217.1    9913    135052
P17248  P17248.3    9913    110283011


so obviously, I can't just paste the grep output with my diamond blast output.

Eventually I need a file that looks like my blast output, but with taxids in the place of accession version numbers, ie.

    OHJ07_1_contig_1    283909  66.6
OHJ07_1_contig_1    407821  65.9


The way I see it, there are two approaches, somehow use the accession version numbers with the script provided here by Pierre and Steve, or I can try and sort my grep output from the NCBI file I downloaded so that I can paste the files and continue with the rest of the analysis.

I have tried the latter, to sort my grep file, but my script (below), is hideously slow and will take about 3 months to finish the whole job

while read line; do grep -m 1 $line prot.accession2taxid; done<test>taxIds  Any help would be appreciated! EDIT: The "test" file I use in the above "solution", is just the column of accession versions pulled out of my blast output file genome ncbi taxid blast • 3.4k views ADD COMMENT 0 Entering edit mode Accession versions should not matter correct? Version differences/changes will not change the taxid. ADD REPLY 0 Entering edit mode I'm not sure exactly how EUtils works, and the example uses accession numbers, not versions. Using my blast output and the bash script from the other question only gave me a blank output, so I thought the problem might be accession numbers? I also just tried entering http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$ELU09376 (and the same with .1 after the accession number which gave me a text file reading:

  ID list is empty! In it there are neither IDs nor accessions.
Error: CEFetchPApplication::proxy_stream(): HTTP/1.1 400 Bad Request


so I guess the problem is not my Accession Versions, but something else?

2
Entering edit mode
5.2 years ago

Versions should not change taxid. This might work for you in Linux using awk:

awk 'NR==FNR{taxid[$1]=$3} \
NR!=FNR{split($2,acc,".");tmp=taxid[acc[1]]; \ if(tmp){print$1, taxid[acc[1]], \$3}}' \
prot.accession2taxid blastx.output > your_output.txt


AWK reads files line by line. NR==FNR is true for the first file and it stores taxid for your accession in memory. Then for the second file NR!=FNR and it removes accession.version and finds taxid for that accession in memory as tmp. If it find it, then it prints line where you have taxid in place of accession.

0
Entering edit mode

This worked perfectly, and I almost understand it too, thanks!