Fasta header, search and replace...?
6.3 years ago
Buffo ★ 2.0k

Hi everybody, I ran a blastp (about 7,000 sequences) and now I want to make a summary table. Some sequences have the same hit (they belong to the same gene family), so I wrote a Python script to count duplicates. Its output looks like this:

Hypothetical protein                  400
Hypothetical protein, putative        200
hypothetical Protein                   40
Hypotetycal protein                     2
etc., with different gene names and different errors

In my result the same target gets several separate counts because its name is not written consistently. So I checked those errors almost one by one and built a table, which looks like this:

Variants                                                    Rename
Hypothetical protein                                  Hypothetical protein
Hypothetical protein, putative                        Hypothetical protein  
hypothetical Protein                                  Hypothetical protein
trans-sialidase                                        trans-sialidase
trans-sialidase, putative,                             trans-sialidase
mucin-associated surface protein (MASP), putative      mucin-associated surface protein (MASP)
mucin-associated surface protein (MASP)                mucin-associated surface protein (MASP)
etc.

I have to run another blastp with other sequences against the same database, so if I don't do anything I will have the same problem. What I want to do is rename the sequences whose product name appears in the Variants column to the single name in the Rename column, which my script can count. The FASTA headers look like this:

>TcC.509233.70 | organism=TCCBR_STRA_6789 | product=Spastin, putative | location=TcChr15-S:273711-276341(-) | length=876 | sequence_SO=chromosome | SO=protein_codingT
TKAANNNSTRVTHGNSLLQRVRQSSYCKGIPEETCLAVLQQVVDRACPVSFSGISGLEVC
>TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding
MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG

I have tried to write a Python script to do this but couldn't get it to work. I tried splitting on '|', but I can't replace the name :( Has anybody done something like this? Does anybody have advice on how to do it? Help!

sequence database fasta genome • 2.7k views
6.3 years ago

Try seqkit replace (see the seqkit download page and the usage of the replace subcommand):

$ seqkit replace    --pattern '(product=[^,]+),?[^\|]* \|'    --replacement '$1 |'    seq.fa
>TcC.509233.70 | organism=TCCBR_STRA_6789 | product=Spastin | location=TcChr15-S:273711-276341(-) | length=876 | sequence_SO=chromosome | SO=protein_codingT
TKAANNNSTRVTHGNSLLQRVRQSSYCKGIPEETCLAVLQQVVDRACPVSFSGISGLEVC
>TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding
MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG

You can also edit the table file with csvtk (see the csvtk download page and the usage of the mutate subcommand):

$ head -n 3 table.tsv 
Variants
Hypothetical protein
Hypothetical protein, putative

$ csvtk -t mutate --fields Variants --pattern '(^[^,]+),?' --name Rename table.tsv > renamed_table.tsv

$ csvtk -t pretty renamed_table.tsv 
Variants                                            Rename
Hypothetical protein                                Hypothetical protein
Hypothetical protein, putative                      Hypothetical protein
hypothetical Protein                                hypothetical Protein
trans-sialidase                                     trans-sialidase
trans-sialidase, putative,                          trans-sialidase
mucin-associated surface protein (MASP), putative   mucin-associated surface protein (MASP)
mucin-associated surface protein (MASP)             mucin-associated surface protein (MASP)
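
Note that this pattern only drops what follows the first comma, so case variants such as "hypothetical Protein" keep their own spelling in the Rename column. As a rough sketch of the same idea in Python, assuming case differences are the only remaining inconsistency, the helper below (normalise_name is a hypothetical name, not part of csvtk) builds a lower-cased key to group the counts by. Misspellings such as "Hypotetycal protein" would still need the manual table.

```python
import re
from collections import Counter

def normalise_name(name):
    """Return a grouping key: text before the first comma, lower-cased."""
    # Same idea as the csvtk pattern '(^[^,]+),?': keep everything up to the first comma.
    match = re.match(r"[^,]+", name)
    base = match.group(0) if match else name
    # Lower-casing is an assumption made here so that case variants share one key.
    return base.strip().lower()

# A few of the variants from the question, used only as example input.
variants = [
    "Hypothetical protein",
    "Hypothetical protein, putative",
    "hypothetical Protein",
    "trans-sialidase, putative,",
]

counts = Counter(normalise_name(v) for v in variants)
for key, n in counts.items():
    print(key, n, sep="\t")
```

The key is meant for counting only; the original spelling can be kept for display.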

I had never used those commands, but it works!! Thank you so much for sharing your knowledge and for your time.


Thanks for using my seqkit and csvtk. There are many more functions that you can explore on the websites. Both are open source on GitHub: seqkit and csvtk.

6.3 years ago
Asaf 9.4k

You can split the header on "|", then iterate over the resulting elements and split each one on "=". If element [0] == 'product', look up the second element in your dictionary and replace the string if needed. Then join the pair back on "=", join the entire list on "|", and write the resulting string as the FASTA header to another file.
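
The steps above could be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: the function names, the file names in the usage note, and the small rename dictionary are all hypothetical, and it assumes the headers look like the ones in the question (fields separated by "|", product given as product=...).

```python
def rename_header(header, rename):
    """Rewrite the product= field of one FASTA header using a lookup dict."""
    fields = header.rstrip("\n").split("|")
    for i, field in enumerate(fields):
        key, sep, value = field.partition("=")
        if sep and key.strip() == "product":
            product = value.strip()
            # Keep the whitespace that followed the value so the layout is unchanged.
            trailing = value[len(value.rstrip()):]
            # Names missing from the dict are left as they are.
            fields[i] = key + "=" + rename.get(product, product) + trailing
    return "|".join(fields)

def rename_fasta(lines, rename):
    """Yield FASTA lines with header lines rewritten and sequence lines untouched."""
    for line in lines:
        if line.startswith(">"):
            yield rename_header(line, rename) + "\n"
        else:
            yield line

# Hypothetical lookup table: Variants column -> Rename column.
rename = {
    "Spastin, putative": "Spastin",
    "hypothetical Protein": "Hypothetical protein",
}

example = (">TcC.509233.70 | organism=TCCBR_STRA_6789 | "
           "product=Spastin, putative | length=876")
print(rename_header(example, rename))
```

To process a whole file, something like this should work, with the input and output names replaced by your own:

    with open("db.fasta") as src, open("db_renamed.fasta", "w") as dst:
        dst.writelines(rename_fasta(src, rename))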
