Combine FASTA sequences pairwise from two different files
4.1 years ago
shome ▴ 10

I have two files that look as follows:

file 1

>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3 
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVG

>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1 
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVS

file 2

>sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2 

MGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG

>sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1 

MGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC

I want to combine every FASTA sequence in file 1 with every sequence in file 2 and save the result in a new file, output.fasta.

desired output file: output.fasta

>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3_sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3_>sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1_>sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1_>sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
unix fasta awk shell • 1.8k views

Have you tried cat *.fasta > out.fasta?

how to combine multiple fasta file into a larger fasta file


Hi Arup, actually I want to combine entry 1 from file 1 with every entry of file 2 (and do the same for every other entry of file 1), then save the result in output.fasta. cat *.fasta will simply merge all the FASTA sequences regardless.


You'll need to use custom BioPerl/BioPython code. What you are doing is not a standard operation. In fact, it is odd enough to warrant the question: What are you doing and why are you doing that?


I need the combined FASTA sequences of entries from file 1 and file 2 to do a residue correlation analysis. OK, I will try it with Biopython, but I thought it would be possible with awk/unix.


Maybe with bioawk - but the operation is complicated enough to warrant a more robust, verifiable, reproducible approach, which one-liners are not.
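As one possible shape for such a script, here is a minimal sketch in plain Python (standard library only, no Biopython). The function names read_fasta and combine_all and the "_" separator are assumptions made for illustration, with the separator chosen to mirror the desired output in the question. The parser tolerates wrapped sequences and blank lines, which the sample file 2 appears to contain.

```python
from itertools import product


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Wrapped (multi-line) sequences are joined and blank lines are skipped.
    The leading '>' and surrounding whitespace are removed from the header.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines between or inside records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def combine_all(records1, records2, sep="_"):
    """Pair every record of the first set with every record of the second.

    Each result is a two-line FASTA entry: joined headers, then joined sequences.
    """
    for (h1, s1), (h2, s2) in product(records1, records2):
        yield f">{h1}{sep}{h2}\n{s1}{s2}"
```

To use it, open both input files, pass each file object to read_fasta, and write every string produced by combine_all to output.fasta followed by a newline. The output grows as the product of the two record counts, so large inputs produce very large files.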


sayaneshome.rsg: Take a look at seqkit (https://github.com/shenwei356/seqkit). It may have an option (concat, perhaps) to do something like this.

4.1 years ago

Assuming two lines per FASTA record and no empty lines:

paste - - < input1.fa | while IFS= read -r L1; do paste - - < input2.fa | while IFS= read -r L2; do printf '%s\t%s\n' "$L1" "$L2"; done; done | awk -F '\t' '{printf("%s|%s\n%s%s\n",$1,substr($3,2),$2,$4);}'


>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3 |sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2 
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3 |sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1 
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1 |sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2 
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1 |sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1 
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC

Hi @Pierre, is there any way to do this if there are multiple lines per FASTA record?


Linearize the fasta files. Courtesy of code from @Pierre:
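The referenced snippet is not reproduced in this thread. As a stand-in, and not Pierre's code, here is a minimal Python sketch of what linearizing means: each record is rewritten as exactly two lines, the header and the full sequence, so that two-lines-per-record tools work afterwards. The function names linearize and linearize_file are hypothetical.

```python
def linearize(lines):
    """Yield each FASTA record as two lines: the header, then the joined sequence.

    Blank lines are dropped, so the output has no empty lines.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith(">"):
            if header is not None:
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())  # one piece of a wrapped sequence
    if header is not None:
        yield header
        yield "".join(chunks)


def linearize_file(in_path, out_path):
    """Write a two-lines-per-record copy of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in linearize(src):
            dst.write(line + "\n")
```

Running linearize_file on each input first should produce files that satisfy the "two lines per record, no empty lines" assumption of the one-liner above.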


Thank you, it worked using your and Pierre's input.


Hi, if my FASTA headers look like >A0A2I3MB61_PAPAN/29-158 and >A0A2IB61_HUMAN/29-10, how can I merge only the entries where the string between _ and / matches? For instance, two sequences should be combined only if both are from HUMAN or both are from PAPAN, and otherwise not.


I'll repeat my advice from before:

You'll need to use custom BioPerl/BioPython code
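A rough sketch of what such custom code could look like, in plain Python without Biopython. It assumes the records are already parsed into (header, sequence) tuples with the leading ">" removed, and that the species tag is the text between the last "_" and the "/" in the header, as in the examples from the question. The names species_of and combine_matching and the "|" separator are illustrative choices, not an established API.

```python
import re
from collections import defaultdict

# Captures the text between an underscore and the following slash,
# e.g. "PAPAN" in "A0A2I3MB61_PAPAN/29-158".
SPECIES = re.compile(r"_([^_/\s]+)/")


def species_of(header):
    """Return the species tag of a header, or None if the pattern is absent."""
    match = SPECIES.search(header)
    return match.group(1) if match else None


def combine_matching(records1, records2, sep="|"):
    """Combine records from the two sets only when their species tags agree.

    Yields (combined_header, combined_sequence) tuples.
    """
    by_species = defaultdict(list)
    for h2, s2 in records2:
        key = species_of(h2)
        if key is not None:
            by_species[key].append((h2, s2))
    for h1, s1 in records1:
        # Headers without a recognizable tag never match anything.
        for h2, s2 in by_species.get(species_of(h1), []):
            yield f">{h1}{sep}{h2}", s1 + s2
```

Indexing the second file by species avoids comparing every pair of headers, which matters once the inputs get large.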
