Question: Concatenate Two .Fasta Files Into One
0
4.1 years ago by
Alice150
USA
Alice150 wrote:

Hello, Biostars! I have two FASTA files for two different genes and want to combine them into one data matrix. Is there a function in R for that, for example if I have two DNAbin objects for those genes? The sequence IDs are identical in both files. The first file:

>sp1
aacc
>sp2
ggtt

The second file:

>sp1
ggaa
>sp2
ttgg

I want:

>sp1
aaccggaa
>sp2
ggttttgg

Python is also OK, but I'm interested in R.

fasta R • 5.9k views
ADD COMMENTlink modified 4.1 years ago by Haluk160 • written 4.1 years ago by Alice150

Could you comment on the rationale behind what you're trying to do?

ADD REPLYlink written 4.1 years ago by Biojl1.6k

In a few words: concatenated sequence matrix -> alignment -> phylogenetic tree

ADD REPLYlink written 4.1 years ago by Alice150

Is this some kind of homework question? I answered the same question 4-5 days ago. See here: C: Combining dna sequences files into one

ADD REPLYlink written 4.1 years ago by Ashutosh Pandey11k

No, it's for my lab work. Your answer is also helpful, thanks.

ADD REPLYlink written 4.1 years ago by Alice150
4
4.1 years ago by
Devon Ryan74k
Freiburg, Germany
Devon Ryan74k wrote:

Just cbind(A,B) to merge the sequences for DNAbin A and DNAbin B:

A.fa:

>sp1
aacc
>sp2
ggtt

B.fa:

>sp1
ggaa
>sp2
ttgg

In R using DNAbin (as you requested):

library(ape)
A <- read.dna("A.fa", format="fasta")
B <- read.dna("B.fa", format="fasta")
C <- cbind(A,B)
write.dna(C, "C.fa", format="fasta")

C.fa:

>sp1
aaccggaa
>sp2
ggttttgg

See help(DNAbin) for more details about options for cbind(), particularly fill.with.gaps and check.names.

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Devon Ryan74k
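
Since the question says Python is also acceptable, the same result can be sketched without ape. This is only a minimal illustration, not part of the answer above: the helper names `read_fasta` and `concat_by_id` are hypothetical, and the sketch assumes the two files contain exactly the same set of IDs. It pairs sequences by ID rather than by position, and it joins wrapped sequence lines while reading.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA lines into {id: sequence}, joining wrapped sequence lines."""
    parts = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current = line[1:].strip()  # the whole header is used as the ID
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def concat_by_id(first, second):
    """Concatenate sequences sharing an ID; fail loudly if the ID sets differ."""
    if set(first) != set(second):
        raise ValueError("IDs differ: %s" % sorted(set(first) ^ set(second)))
    # Output order follows the first file.
    return {name: first[name] + second[name] for name in first}


# The example data from the question; the second file is deliberately reordered.
a = read_fasta(StringIO(">sp1\naacc\n>sp2\nggtt\n"))
b = read_fasta(StringIO(">sp2\nttgg\n>sp1\nggaa\n"))
merged = concat_by_id(a, b)

for name, seq in merged.items():
    print(">" + name)
    print(seq)
```

For real files, `open("A.fa")` and `open("B.fa")` could be passed to `read_fasta` in place of the `StringIO` objects.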

I've already tried that. Error: the 'cbind' method for "DNAbin" accepts only matrices

ADD REPLYlink written 4.1 years ago by Alice150

How did you read in the sequences?

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Devon Ryan74k

read.dna("B.fa", format="fasta") - fail; read.FASTA("B.fasta") - fail

ADD REPLYlink written 4.1 years ago by Alice150

If you get an error message of "fail" or something like that, then you have bigger issues.

ADD REPLYlink written 4.1 years ago by Devon Ryan74k

By "fail" I mean the same error message in both cases: the 'cbind' method for "DNAbin" accepts only matrices

ADD REPLYlink written 4.1 years ago by Alice150

It would be helpful if you posted a reproducible example. The original examples in your question will work fine.

ADD REPLYlink written 4.1 years ago by Devon Ryan74k

I think the problem is line wrapping, i.e. one sequence is split over several lines, like:

>sp1
aattgg
aaggtt

and not

>sp1
aattggaaggtt
ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Alice150
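
If wrapped sequence lines are a concern, the records can be rewritten with one line per sequence before any further processing. The following is a minimal Python sketch of that step; the function name `unwrap_fasta` is hypothetical, and whether wrapping is actually what triggers the cbind error in this case is not established by the thread.

```python
def unwrap_fasta(text):
    """Return FASTA text with every sequence joined onto a single line."""
    out = []
    seq_parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_parts:
                out.append("".join(seq_parts))
                seq_parts = []
            out.append(line)
        else:
            seq_parts.append(line)
    if seq_parts:
        out.append("".join(seq_parts))
    return "\n".join(out) + "\n"


wrapped = ">sp1\naattgg\naaggtt\n>sp2\nccttgg\nccggaa\n"
unwrapped = unwrap_fasta(wrapped)
print(unwrapped, end="")
```

After unwrapping, each record is exactly a header line followed by one sequence line.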

Worked for me.
 

ADD REPLYlink written 2.3 years ago by pescadordigital10
4
4.1 years ago by
Haluk160
Lincoln, Nebraska
Haluk160 wrote:

You can do this with paste and awk:

paste A.fa B.fa | awk '{if (NR%2==0) {print $1 $2} else {print $1}}'

ADD COMMENTlink written 4.1 years ago by Haluk160

Thank you! It works. I have absolutely no experience with awk, so I have one question: does the order of IDs in A.fa have to be the same as in B.fa, or does the concatenation work by comparing IDs between the two files?

ADD REPLYlink written 4.1 years ago by Alice150

The order has to be the same in both files, and each sequence can occupy only one line.

ADD REPLYlink written 4.1 years ago by Devon Ryan74k
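
Both conditions can be checked before running a line-by-line paste. The Python sketch below is only an illustration under stated assumptions: the function name `paste_compatible` is hypothetical, the ID is taken to be the first whitespace-separated token of the header, and the inputs are the full text of each file.

```python
def paste_compatible(text_a, text_b):
    """Return (ok, reason): can the two FASTA texts be joined line by line?"""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    if len(lines_a) != len(lines_b):
        return False, "files have different numbers of lines"
    for number, (line_a, line_b) in enumerate(zip(lines_a, lines_b), start=1):
        # Odd lines must be headers, even lines must be sequence lines.
        is_header_line = number % 2 == 1
        for line in (line_a, line_b):
            if not line.strip():
                return False, "line %d is blank" % number
            if line.startswith(">") != is_header_line:
                return False, "line %d breaks the header/sequence alternation" % number
        if is_header_line and line_a.split()[0] != line_b.split()[0]:
            return False, "line %d: IDs differ" % number
    return True, "ok"


file_a = ">sp1\naacc\n>sp2\nggtt\n"
file_b = ">sp1\nggaa\n>sp2\nttgg\n"
reordered_b = ">sp2\nttgg\n>sp1\nggaa\n"
wrapped_b = ">sp1\ngg\naa\n>sp2\nttgg\n"

print(paste_compatible(file_a, file_b))
print(paste_compatible(file_a, reordered_b))
print(paste_compatible(file_a, wrapped_b))
```

Only the first pair passes; the reordered and wrapped variants are rejected with a reason.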

Ok, thanks, it is really important.

ADD REPLYlink written 4.1 years ago by Alice150

paste -d '\0' File_A File_B | sed 's/^\(>[^>]*\)>.*/\1/' > File_C.fa will also do the same.

ADD REPLYlink written 4.1 years ago by Ashutosh Pandey11k