Hi everybody, I ran a blastp with about 7,000 sequences, and now I want to make a table of the results. Some sequences have the same hit (they come from a gene family), so I wrote a Python script to count duplicates. The output of that script looks like this:
Hypothetical protein              400
Hypothetical protein, putative    200
hypothetical Protein               40
Hypotetycal protein                 2
etc.

and so on, with different gene names and different errors.
In my results I get separate counts for the same target because the names are not identical. So what I did was check those errors almost one by one and build a table, which looks like this:
Variants                                            Rename
Hypothetical protein                                Hypothetical protein
Hypothetical protein, putative                      Hypothetical protein
hypothetical Protein                                Hypothetical protein
trans-sialidase                                     trans-sialidase
trans-sialidase, putative,                          trans-sialidase
mucin-associated surface protein (MASP), putative   mucin-associated surface protein (MASP)
mucin-associated surface protein (MASP)             mucin-associated surface protein (MASP)
etc.                                                etc.
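To make the idea concrete, here is a minimal sketch of how a table like the one above could be turned into a lookup and used while counting. It assumes the table is saved as a two-column, tab-separated file (variant, then rename); the inline `TABLE_TSV` string, the `hits` list, and the function names `load_rename_table` and `count_hits` are made up for illustration and are not part of my actual script.

```python
import csv
import io
from collections import Counter

# Stand-in for a tab-separated file (variant<TAB>rename).
# The rows are the example rows from the table above.
TABLE_TSV = (
    "Hypothetical protein\tHypothetical protein\n"
    "Hypothetical protein, putative\tHypothetical protein\n"
    "hypothetical Protein\tHypothetical protein\n"
    "trans-sialidase\ttrans-sialidase\n"
    "trans-sialidase, putative,\ttrans-sialidase\n"
    "mucin-associated surface protein (MASP), putative\t"
    "mucin-associated surface protein (MASP)\n"
)


def load_rename_table(handle):
    """Read variant/rename pairs from a tab-separated handle into a dict."""
    rename = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            rename[row[0].strip()] = row[1].strip()
    return rename


def count_hits(names, rename):
    """Count hit names after mapping every known variant to its rename."""
    cleaned = (name.strip() for name in names)
    return Counter(rename.get(name, name) for name in cleaned)


rename = load_rename_table(io.StringIO(TABLE_TSV))

# Made-up list of hit names, only to show that the variants get merged.
hits = [
    "Hypothetical protein",
    "Hypothetical protein, putative",
    "hypothetical Protein",
    "trans-sialidase",
    "trans-sialidase, putative,",
]
counts = count_hits(hits, rename)
for name, n in counts.most_common():
    print(f"{name}\t{n}")
```

Names that are not in the table are counted under their original name, so new variants still show up and can be added to the table later.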
I have to run another blastp with other sequences but against the same database, so if I don't do anything I will get the same problem. What I want to do is rename the sequences whose product name is one of the variants (Variants column) to a single name that my script can count (Rename column). The FASTA headers look like this:
>TcC.509233.70 | organism=TCCBR_STRA_6789 | product=Spastin, putative | location=TcChr15-S:273711-276341(-) | length=876 | sequence_SO=chromosome | SO=protein_coding
TTKAANNNSTRVTHGNSLLQRVRQSSYCKGIPEETCLAVLQQVVDRACPVSFSGISGLEVC
>TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding
MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG
I have tried to write a Python script to do this but I can't get it to work. I tried splitting the header on '|', but I can't replace the name :( Has anybody done something like this? Does anybody have any advice? Help!
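For reference, this is a rough sketch of the split-on-'|' idea, not a tested solution. It assumes every header has a single `product=` field, that fields are separated by `|`, and that the variants are kept in a plain dict. The names `RENAME`, `rename_header` and `rename_fasta` are made up for illustration, and the lowercase "hypothetical protein" entry is an assumed extra variant taken from the header example rather than from my table.

```python
# Illustrative mapping from variant names to the name to count.
RENAME = {
    "Hypothetical protein, putative": "Hypothetical protein",
    "hypothetical Protein": "Hypothetical protein",
    "hypothetical protein": "Hypothetical protein",  # assumed extra variant
    "trans-sialidase, putative,": "trans-sialidase",
    "mucin-associated surface protein (MASP), putative":
        "mucin-associated surface protein (MASP)",
}


def rename_header(header, rename):
    """Return the header with its product= value swapped if it is a known variant."""
    fields = [field.strip() for field in header.rstrip("\r\n").split("|")]
    for i, field in enumerate(fields):
        if field.startswith("product="):
            product = field[len("product="):]
            # Unknown products are left exactly as they are.
            fields[i] = "product=" + rename.get(product, product)
    return " | ".join(fields)


def rename_fasta(lines, rename):
    """Yield FASTA lines, rewriting header lines and passing sequence lines through."""
    for line in lines:
        if line.startswith(">"):
            yield rename_header(line, rename) + "\n"
        else:
            yield line


example = (
    ">TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein"
    " | location=TcChr40-S:277103-278386(-) | length=427"
    " | sequence_SO=chromosome | SO=protein_coding"
)
print(rename_header(example, RENAME))

# Usage on real files (the file names here are placeholders):
# with open("input.fasta") as src, open("renamed.fasta", "w") as dst:
#     dst.writelines(rename_fasta(src, RENAME))
```

Rebuilding the header with `" | ".join(...)` keeps the other fields in their original order, so only the product name changes.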