Combine fasta files by matching partial header in specific format
4.1 years ago

I have three fasta files containing the protein sequences for each gene in xls format (space-separated). The first column contains the header, while the remaining columns contain the sequence, one residue per column. For example:

File1:

sample  1   2   3   4   5   6
BnaA03g18710D   M   A   A   A   V   S
BnaA03g18710D_S25   M   A   A   A   V   S
BnaA03g18710D_S31   M   A   A   A   V   S

File2:

sample  1   2   3   4   5   6
BnaA03g18710D_a M   A   A   A   V   S
BnaA03g18710D_S25_a M   A   A   A   V   S
BnaA03g18710D_S31_a M   A   A   A   V   S

File3:

sample  1   2   3   4   5   6
BnaA03g18710D_b M   A   A   A   V   S
BnaA03g18710D_S25_b M   A   A   A   V   S
BnaA03g18710D_S31_b M   A   A   A   V   S

I am interested in merging them in the following order:

sample  1   2   3   4   5   6
BnaA03g18710D   M   A   A   A   V   S
BnaA03g18710D_a M   A   A   A   V   S
BnaA03g18710D_b M   A   A   A   V   S
BnaA03g18710D_S25   M   A   A   A   V   S
BnaA03g18710D_S25_a M   A   A   A   V   S
BnaA03g18710D_S25_b M   A   A   A   V   S
BnaA03g18710D_S31   M   A   A   A   V   S
BnaA03g18710D_S31_a M   A   A   A   V   S
BnaA03g18710D_S31_b M   A   A   A   V   S

I have tried cat, sed and other commands but wasn't able to produce the desired format. Any help would be highly appreciated.

RNA-Seq • 786 views

Try to cat the files together, sort them by the first column, and then remove the repeated header lines (the ones starting with sample) with grep -v 'sample'. To get the header line back, simply cat the first line of the first file together with the output of the steps just described. I am sure you can manage that.
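
One caveat with a plain sort: in a byte-wise (C locale) comparison, `_S25` sorts before `_a`, so the sorted order may not match the order requested in the question. As a rough alternative, here is a minimal Python sketch that groups rows by their base ID instead of sorting. It assumes the File2/File3 suffix is always an underscore followed by a single lowercase letter (as in the `_a` and `_b` examples) and that every file starts with the same header line. The function names `merge_tables` and `merge_files` are made up for illustration.

```python
import re


def merge_tables(tables, suffix_pattern=r"_[a-z]$"):
    """Merge whitespace-separated tables, grouping rows by base ID.

    tables: list of tables, each a list of text lines with a header first.
    suffix_pattern: regex for the per-file suffix (assumed: "_a", "_b", ...).
    """
    header = None
    groups = {}  # base ID -> rows; dicts keep insertion order
    for lines in tables:
        rows = [line.rstrip("\n") for line in lines if line.strip()]
        if not rows:
            continue
        if header is None:
            header = rows[0]  # keep the header of the first table only
        for row in rows[1:]:  # skip each table's own header line
            name = row.split(None, 1)[0]
            base = re.sub(suffix_pattern, "", name)
            groups.setdefault(base, []).append(row)

    merged = [header] if header is not None else []
    for rows in groups.values():
        merged.extend(rows)
    return merged


def merge_files(paths):
    """Read each file in `paths` and merge them with merge_tables."""
    tables = []
    for path in paths:
        with open(path) as handle:
            tables.append(handle.read().splitlines())
    return merge_tables(tables)
```

Groups come out in the order their base ID first appears (the order of File1), and within a group the rows follow the order in which the files were passed, so passing File1, File2, File3 should give the unsuffixed row first, then `_a`, then `_b`. If the real suffixes look different, `suffix_pattern` would need adjusting.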


Small nitpick, but these are not FASTA-format files.
