Question: blast results: grepping and replacing
0
gravatar for kinggang
2.3 years ago by
kinggang0
kinggang0 wrote:

Hi, I am stuck with my blast result (outfmt 6). In first column i got the contig id;like second column also i got subject id. But i need subjectid protein and organism.

eg (tabular format)


blast output: (tabular format)

TRINITY_DN2892_c0_g1_i1 gi|743827192|ref|XP_010933525.1|    94.958  238 12  0   742 29  34  271 2.15e-148   433

TRINITY_DN2892_c0_g1_i2 gi|743827192|ref|XP_010933525.1|    95.062  243 12  0   730 2   34  276 7.07e-153   444

TRINITY_DN2873_c0_g1_i1 gi|743848015|ref|XP_010939248.1|    91.447  456 39  0   2   1369    5   460 0.0 850

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1|    90.344  901 85  2   170 2872    2   900 0.0 1504

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1|    91.176  238 21  0   2904    3617    898 1135    2.14e-137   456

etc...

**my expected output: (Tabular format)**

TRINITY_DN2892_c0_g1_i1 gi|743827192|ref|XP_010933525.1| PREDICTED: aspartic proteinase nepenthesin-2-like isoform X2 [Elaeis guineensis]   94.958  238 12  0   742 29  34  271 2.15e-148   433

TRINITY_DN2892_c0_g1_i2 gi|743827192|ref|XP_010933525.1| PREDICTED: aspartic proteinase nepenthesin-2-like isoform X2 [Elaeis guineensis]   95.062  243 12  0   730 2   34  276 7.07e-153   444

TRINITY_DN2873_c0_g1_i1 gi|743848015|ref|XP_010939248.1| PREDICTED: oligopeptide transporter 3-like [Elaeis guineensis] 91.447  456 39  0   2   1369    5   460 0.0 850

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1| PREDICTED: uncharacterized protein LOC105041089 isoform X1 [Elaeis guineensis] 90.344  901 85  2   170 2872    2   900 0.0 1504

TRINITY_DN2838_c0_g1_i1 gi|743771805|ref|XP_010916192.1| PREDICTED: uncharacterized protein LOC105041089 isoform X1 [Elaeis guineensis] 91.176  238 21  0   2904    3617    898 1135    2.14e-137   456

etc...

I have around one lakh sequences, so it is difficult doing again blast and manual editing. For that I have taken all the subject id and greped with standalone database and i got the name list in another text file .

eg textfile

gi|743771814|ref|XP_010916197.1| PREDICTED: uncharacterized protein LOC105041093 [Elaeis guineensis] gi|743771805|ref|XP_010916192.1| PREDICTED: uncharacterized protein LOC105041089 isoform X1 [Elaeis guineensis] gi|743771786|ref|XP_010916183.1| PREDICTED: uncharacterized protein LOC105041085 isoform X2 [Elaeis guineensis]

now i want to incorporate this text file to my blast result like my expected output (greping and replacing in second column ). Is der any script or awk command for doing this job ??

thank you

awk blast grep • 703 views
ADD COMMENTlink modified 2.3 years ago by 5heikki8.6k • written 2.3 years ago by kinggang0

did you add "stitle" in your blast command ?

ADD REPLYlink written 2.3 years ago by ashish270
0
gravatar for Michael Dondrup
2.3 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

I think you have a better source for the annotation, and that is the blastdb you used to search in the first place. Try

cat myblastresultsfile.txt | cut -f2 | cut -f4 -d'|' | \
blastdbcmd -db myblastdb -entry_batch - -outfmt "%t %L"

you need to experiment a little with the -outfmt parameter to get the fields you need because they depend on how the db was generated.

In general, it is advisable not to use tabular blast output as primary output format, as it is just lacking exactly those fields you need most (following Murphy), use ASN1 instead and generate output as needed.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Michael Dondrup47k
0
gravatar for Jake Warner
2.3 years ago by
Jake Warner800
Jake Warner800 wrote:

I had the same problem so I wrote a workaround for it in python. It fetches the descriptions and appends them to the second column of the input file.

import subprocess, sys, os

# getdescriptors.py
# usage: getdescriptors.py infile

infile = sys.argv[1]
outfile = '%s_edited.txt' %(infile[:-4])

openedin = open(infile, 'r')
openedout = open(outfile, 'w')

print "fetching descriptors, this takes a while"
for line in openedin:
    lineParsed = line.split('\t')
    print lineParsed[0]
    entry = lineParsed[1].split('|')[3]
    output = subprocess.check_output("blastdbcmd -db /scratch/db/nr/blastDB/nr -dbtype prot -entry '" + entry + "'", shell=True)
    record2 = output.split(' ',1)[1].split('\n',1)[0]
    record2 = record2.split(']',1)
    if len(record2) > 1:
        record2 = record2[0] + "]"
    else:
        record2 = record2[0]
    results = (lineParsed[0] + lineParsed[1] + record2 + '\t'+ lineParsed[2] + 
    '\t'+ lineParsed[3] + '\t' + lineParsed[4] + '\t' + lineParsed[5] + 
    '\t'+ lineParsed[6] + '\t' + lineParsed[7] + '\t' + lineParsed[8] + 
    '\t'+ lineParsed[9] + '\t' + lineParsed[10] + '\t'+ lineParsed[11] + '\n')
    openedout.write(results)
print "done"

openedin.close() 
openedout.close()
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Jake Warner800
0
gravatar for 5heikki
2.3 years ago by
5heikki8.6k
Finland
5heikki8.6k wrote:

This should work as long as there's no pipe (|) in stitle:

join -t "|" -o 1.1,1.2,1.3,1.4,2.5,1.5 -1 2 -2 2 <(sort -t "|" -k2,2 BlastFile) <(sort -t "|" -k2,2 StitleFile)
ADD COMMENTlink written 2.3 years ago by 5heikki8.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 867 users visited in the last hour