Question: combine fasta sequences in combination from two different files
12 months ago by
shome0
shome0 wrote:

I have two files that look as follows:

file 1

>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3 
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVG

>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1 
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVS

file 2

>sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2 

MGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG

>sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1 

MGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC

I want to combine every FASTA sequence from file 1 with every sequence from file 2 (all pairwise combinations) and save the result in a new file, output.fasta.

desired output file: output.fasta

>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3_sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3_>sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1_>sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1_>sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
awk shell unix fasta • 297 views
ADD COMMENTlink modified 12 months ago by Pierre Lindenbaum134k • written 12 months ago by shome0

Have you tried cat *.fasta > out.fasta?

how to combine multiple fasta file into a larger fasta file

ADD REPLYlink modified 12 months ago • written 12 months ago by Arup Ghosh2.8k

Hi Arup, actually I want to combine entry 1 from file 1 with every entry of file 2 (and do the same for all other entries of file 1), then save the result in output.fasta. cat *.fasta will simply append all the FASTA sequences one after another, which is not what I need.

ADD REPLYlink written 12 months ago by shome0

You'll need to use custom BioPerl/BioPython code. What you are doing is not a standard operation. In fact, it is odd enough to warrant the question: What are you doing and why are you doing that?

ADD REPLYlink written 12 months ago by Ram32k

I need the combined FASTA sequences of entries from file 1 and file 2 for a residue correlation analysis. OK, I will check it out with Biopython, but I thought it would be possible with awk/unix.

ADD REPLYlink written 12 months ago by shome0

Maybe with bioawk - but the operation is complicated enough to warrant a more robust, verifiable, reproducible approach, which one-liners are not.

ADD REPLYlink written 12 months ago by Ram32k

sayaneshome.rsg: Take a look at seqkit (https://github.com/shenwei356/seqkit). It may have an option (concat, perhaps) to do something like this.

ADD REPLYlink written 12 months ago by GenoMax96k
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

Assuming exactly two lines per FASTA record (header, then sequence) and no empty lines:

paste - - < input1.fa | while IFS= read -r L1; do paste - - < input2.fa | while IFS= read -r L2; do printf '%s\t%s\n' "$L1" "$L2"; done; done | awk -F '\t' '{printf("%s|%s\n%s%s\n",$1,substr($3,2),$2,$4);}'


>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3 |sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2 
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|P0|H1_HUMAN dhj OS=Homo sapiens OX=9606 GN=CDH1 PE=1 SV=3 |sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1 
MGPWSRSLSALLLLLQVSSWLCQEPEPCHPGFDAESYTFTVPRRHLERGRVLGRVNFEDCTGRQRTAYFSLDTRFKVGTDGVITVKRPLRFHNPQIHFLVYAWDSTYRKFSTKVTLNTVGMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1 |sp|P641|A1_CHICK link OS=Gallus gallus OX=9031 GN=CDH1 PE=1 SV=2 
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGRRWGSPALQRFPVLVLLLLLQVCGRRCDEAAPCQPGFAAETFSFSVPQDSVAAGRELG
>sp|Q4|C1_RAT C-1 jkjk OS=Rattus norvegicus OX=10116 GN=Cdh1 PE=1 SV=1 |sp|QF2|A2_BOVIN hjh OS=Bos taurus OX=9913 GN=CDH1 PE=2 SV=1 
QIKSNRDKETTVFYSITGPGADKPPVGVFIIERETGWLKVTQPLDREAIDKYLLYSHAVSMGPWSRSLSALCCCCRCNPWLCREPEPCIPGFGAESYTFTVPRRNLERGRVLGRVSFEGC
ADD COMMENTlink modified 12 months ago • written 12 months ago by Pierre Lindenbaum134k
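
The nested shell loop above re-reads input2.fa once for every record of input1.fa, which can get slow on larger files. Below is a minimal awk-only sketch of the same pairing, under the same assumption (exactly two lines per record, no empty lines). The function name fasta_cross is made up for illustration, and the sketch additionally assumes that file 2 is non-empty and fits in memory.

```shell
# Sketch: pair every record of file 1 with every record of file 2 in one awk run.
# Assumes two lines per record (header, then sequence) and no blank lines.
# fasta_cross is a hypothetical helper name, not an existing tool.
fasta_cross() {
  awk '
    # File 2 is passed first so that its records can be kept in memory.
    NR == FNR {
      if (FNR % 2 == 1) { n++; hdr[n] = substr($0, 2) }   # header without ">"
      else              { seq[n] = $0 }
      next
    }
    # File 1: remember the header, then emit all pairs at the sequence line.
    FNR % 2 == 1 { h1 = $0; next }
    {
      for (i = 1; i <= n; i++)
        printf("%s|%s\n%s%s\n", h1, hdr[i], $0, seq[i])
    }
  ' "$2" "$1"
}
```

Usage would be something like fasta_cross input1.fa input2.fa > output.fasta. Headers are joined with "|" and the records come out in the same order as in the answer above.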

Hi @Pierre, is there any way to do this if there are multiple lines per FASTA record?

ADD REPLYlink written 12 months ago by shome0

Linearize the fasta files. Courtesy of code from @Pierre:

ADD REPLYlink modified 12 months ago • written 12 months ago by GenoMax96k
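
The snippet referred to in the comment above did not survive in this copy of the thread. As a rough illustration of what "linearizing" means here (not necessarily the code that was linked), the sketch below joins all sequence lines of a record into one line, so that every record ends up as exactly two lines. The function name linearize_fasta is hypothetical.

```shell
# Sketch: turn a multi-line FASTA file into two lines per record.
# linearize_fasta is a hypothetical helper name.
linearize_fasta() {
  awk '
    /^>/ {
      if (inrec) printf("\n")     # close the sequence line of the previous record
      print                       # header stays on its own line
      inrec = 1
      next
    }
    # Sequence chunk: drop blanks, tabs and carriage returns, append without newline.
    # Blank lines and any text before the first header are skipped.
    inrec && NF { gsub(/[ \t\r]/, ""); printf("%s", $0) }
    END { if (inrec) printf("\n") }
  ' "$@"
}
```

For example, linearize_fasta input1.fa > input1.linear.fa, and the same for the second file; the two-lines-per-record one-liner can then be run on the linearized files.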

Thank you, it worked using your and Pierre's input.

ADD REPLYlink written 12 months ago by shome0

Hi, my FASTA headers look like this: >A0A2I3MB61_PAPAN/29-158 and >A0A2IB61_HUMAN/29-10. How can I merge only those pairs where the string between "_" and "/" matches? For instance, two sequences should be combined only if both are from HUMAN or both are from PAPAN, and skipped otherwise.

ADD REPLYlink written 12 months ago by shome0

I'll repeat my advice from before

You'll need to use custom BioPerl/BioPython code

ADD REPLYlink written 12 months ago by Ram32k
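
A dedicated BioPerl/Biopython script, as advised above, is likely the more maintainable route. For simple headers of the form shown in the question, though, an awk filter may be enough. The sketch below is only that, a sketch: it assumes two lines per record, that the species tag is the text between the last "_" and the first "/" of the header, and that file 2 is non-empty and fits in memory. The names fasta_cross_same_species and species are made up for illustration.

```shell
# Sketch: pair records of file 1 and file 2 only when their species tags match.
# Assumes two lines per record and headers like ">A0A2I3MB61_PAPAN/29-158".
# fasta_cross_same_species and species() are hypothetical names.
fasta_cross_same_species() {
  awk '
    # Extract the tag between the last "_" and the first "/" of a header.
    function species(h,    s) {
      s = h
      sub(/\/.*$/, "", s)    # drop "/29-158" and anything after it
      sub(/^.*_/, "", s)     # drop everything up to the last "_"
      return s
    }
    # File 2 is passed first and kept in memory together with its species tags.
    NR == FNR {
      if (FNR % 2 == 1) { n++; hdr[n] = substr($0, 2); sp[n] = species($0) }
      else              { seq[n] = $0 }
      next
    }
    # File 1: remember header and tag, then emit matching pairs at the sequence line.
    FNR % 2 == 1 { h1 = $0; s1 = species($0); next }
    {
      for (i = 1; i <= n; i++)
        if (sp[i] == s1)
          printf("%s|%s\n%s%s\n", h1, hdr[i], $0, seq[i])
    }
  ' "$2" "$1"
}
```

Called as fasta_cross_same_species input1.fa input2.fa > output.fasta, this would combine a PAPAN entry only with PAPAN entries and a HUMAN entry only with HUMAN entries. Headers that do not follow the assumed pattern are compared as-is after the two substitutions, so it is worth checking the output on a small sample first.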