Sorry if this question is not relevant to this forum, but I don't know where else to ask it. I have a fasta file containing thousands of sequences whose headers look like this:
Gene.1019::c44525_g4_i4::g.1019::m.1019 Gene.1019::c44525_g4_i4::g.1019 ORF type:complete len:339 (+) c44525_g4_i4:48-1064(+)
and in another text file, there are two columns with around a hundred rows, like below (this text file contains only some, not all, of the headers from the fasta file):
I want to end up with a text file containing the information below:
I'm familiar with Linux, but as a biologist I don't find a task like this easy. Could you please suggest commands or tools that would help?
Thanks in advance
You can use any scripting language here, such as Python.
Step 1: Open the fasta file.
Step 2: Read each fasta record and store it in a dictionary, with Gene.XXX as the key and the sequence of that gene as the value.
Step 3: Open the second file and read it.
Step 4: For each row, check whether the first column matches a key in the dictionary; if it does, grab the second column and print what you want.
You could even do it without a dictionary by reading the two files at the same time.
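The steps above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions that are not confirmed in the thread: the second file is whitespace-separated, its first column holds the Gene.XXX part of the header, and the function names read_fasta and match_table are made up for illustration. Unlike Step 2, it stores the full header next to the sequence so either can be printed later.

```python
def read_fasta(path):
    """Map the gene ID (header text before the first '::') to (header, sequence)."""
    records = {}
    gene_id, header, chunks = None, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: save the previous one first.
                if gene_id is not None:
                    records[gene_id] = (header, "".join(chunks))
                header = line[1:]
                gene_id = header.split("::", 1)[0]
                chunks = []
            elif line:
                chunks.append(line)
    # Save the last record in the file.
    if gene_id is not None:
        records[gene_id] = (header, "".join(chunks))
    return records


def match_table(records, table_path):
    """Yield (gene_id, second_column, header) for rows whose first column is a known gene ID.

    Assumption: columns are separated by tabs or spaces.
    """
    with open(table_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete rows
            key, value = fields[0], fields[1]
            if key in records:
                yield key, value, records[key][0]
```

To use it, call read_fasta on the fasta file, pass the result and the path of the two-column file to match_table, and print or write whichever of the yielded fields are needed. If two fasta records share the same Gene.XXX prefix, the later one overwrites the earlier one in this sketch.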
I would suggest using a dictionary; it's definitely the most straightforward approach.
While pseudocode is great, @seta is looking for a solution that can be used right away :-)
Do you have any experience with (Bio)Python? Did I understand correctly that you want to keep the part of the identifier up to the second "::" and the part before the first "::" needs to match the field in the second file?
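If that reading is right, the header handling might look like the sketch below. It is not a finished solution: the function name split_header is made up for illustration, and it assumes every header begins with fields separated by "::" as in the example header in the question.

```python
def split_header(header):
    """Return (match_key, short_id) for one fasta header.

    match_key: the part before the first '::' (assumed to match the second file).
    short_id:  the part up to, but not including, the second '::'.
    """
    # Drop a leading '>' if present and keep only the first whitespace-delimited token.
    first_token = header.lstrip(">").split(None, 1)[0]
    parts = first_token.split("::")
    return parts[0], "::".join(parts[:2])


header = ("Gene.1019::c44525_g4_i4::g.1019::m.1019 Gene.1019::c44525_g4_i4::g.1019 "
          "ORF type:complete len:339 (+) c44525_g4_i4:48-1064(+)")
print(split_header(header))  # ('Gene.1019', 'Gene.1019::c44525_g4_i4')
```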
Sorry, I'm still learning to work in this field and have no previous experience. In the example, I would just have
Gene.1019::c44525_g4_i4 as the identifier. Your straightforward help would be highly appreciated.
Could you give me a few more examples? I'll write the script tomorrow morning (CEST).
Sure. Here is an example of the fasta file:
and the other text file looks something like this:
I would like to have identifiers as follows:
Thank you for your help in advance.