Question

Combining DNA Sequence Files Into One

1

11.5 years ago

nepgorkhey ▴ 130

If I have two files with sequence data, how can I combine them into one? What I want to do is this:

if File A has

->X

ACTGCA

->Y

ACGTAA

->Z

AGCATA

and File B has

->X

TCAGA

->Y

GACTA

->Z

GCTAA

I want to combine File A and File B into File C, which should have the following output:

->X

ACTGCATCAGA

->Y

ACGTAAGACTA

->Z

AGCATAGCTAA

biopython • 4.5k views

updated 11.5 years ago by Peter 6.0k • written 11.5 years ago by nepgorkhey ▴ 130

0

Can you assume the two files have the same set of sequences (here X, Y, and Z) and that they are in the same order? Also, what file format is this (e.g. FASTA, FASTQ)?

11.5 years ago by Peter 6.0k

0

Yes, the files have the same sequence sets in the same order, and the format I am trying to use is FASTA.

11.5 years ago by nepgorkhey ▴ 130

score 3 · Answer 1 · 2013-12-20

3

11.5 years ago

Ashutosh Pandey 12k

Assuming both files have the same number of sequences, in the same order as in the example above, here is what you should do:

paste -d '\0' File_A File_B | sed 's/>[A-Z]*//' > File_C.fa
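To see what each half of that pipeline does, here is a rough Python imitation of it on in-memory lines. This is only a sketch: `paste_and_strip` is a made-up name, and it assumes sed's `[A-Z]` matches only uppercase ASCII letters (as in the C locale). It suggests the command works when the headers consist solely of uppercase letters, as in the example, and may mangle other header names.

```python
import re

def paste_and_strip(lines_a, lines_b):
    """Imitate the paste | sed pipeline on two lists of lines (hypothetical helper)."""
    out = []
    # zip stops at the shorter input; that is fine for equal-length files.
    for a, b in zip(lines_a, lines_b):
        joined = a + b  # paste with an empty delimiter glues the lines together
        # sed without the g flag removes only the first match on each line
        out.append(re.sub(r">[A-Z]*", "", joined, count=1))
    return out

# Headers made of uppercase letters only: the duplicate header is removed.
print(paste_and_strip([">X", "ACTGCA"], [">X", "TCAGA"]))
# ['>X', 'ACTGCATCAGA']

# A header with lowercase letters or digits: only the ">" is removed.
print(paste_and_strip([">seq1", "ACTGCA"], [">seq1", "TCAGA"]))
# ['seq1>seq1', 'ACTGCATCAGA']
```

If your real headers look like the second case, the sed pattern would need to be adapted to them.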

11.5 years ago by Ashutosh Pandey 12k

0

FYI, I used ">", which is different from what you used in your FASTA header.

11.5 years ago by Ashutosh Pandey 12k

0

It didn't work with my files. Is there anything I need to be aware of apart from using my sequence file names?

8.3 years ago by nepgorkhey ▴ 130

0

Try the first command on its own (the part before the pipe) and check whether it joins your sequences into one. Then try the second command and see whether it works. Tell me which command is giving you the problem.

11.5 years ago by Ashutosh Pandey 12k

score 1 · Answer 2 · 2014-01-08

Here's a Biopython solution. If you want to use it under Python 2, include this at the start:

# Python 2 backward compatibility fixes:
from __future__ import print_function
try:
    # Python 2's default zip function is not an iterator
    from itertools import izip as zip
except ImportError:
    # Under Python 3 the zip function is already an iterator
    pass

# Script proper starts here:
from Bio import SeqIO

def concatenate_matched_sequences(sequences1, sequences2):
    """Concatenate matching records from a pair of SeqRecord iterators."""
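    for r1, r2 in zip(sequences1, sequences2):
        # Records are paired by position, so the IDs must agree.
        assert r1.id == r2.id, "ID mismatch: %s vs %s" % (r1.id, r2.id)
        # Adding two SeqRecord objects joins their sequences.
        yield r1 + r2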

input_file1 = "a.fasta"
input_file2 = "b.fasta"
output_file = "ab.fasta"

in1 = SeqIO.parse(input_file1, "fasta")
in2 = SeqIO.parse(input_file2, "fasta")
count = SeqIO.write(concatenate_matched_sequences(in1, in2), output_file, "fasta")
print("Wrote %i sequences" % count)
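If Biopython is not available, the same pairing idea can be sketched with only the standard library. This is a minimal illustration, not a replacement for a real FASTA parser: `read_fasta` and `concatenate_matched` are hypothetical names, the input is assumed to be plain FASTA with ">" headers, and records are assumed to appear in the same order in both inputs. Unlike a plain `zip`, it also complains if one input has more records than the other.

```python
from itertools import zip_longest

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence may be wrapped over several lines; collect them all.
            # Lines before the first header are ignored.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def concatenate_matched(lines1, lines2):
    """Yield (header, joined sequence) for records paired by position."""
    for rec1, rec2 in zip_longest(read_fasta(lines1), read_fasta(lines2)):
        if rec1 is None or rec2 is None:
            raise ValueError("inputs have different numbers of records")
        if rec1[0] != rec2[0]:
            raise ValueError("header mismatch: %r vs %r" % (rec1[0], rec2[0]))
        yield rec1[0], rec1[1] + rec2[1]

# The example from the question, as in-memory lines with ">" headers.
file_a = [">X", "ACTGCA", ">Y", "ACGTAA", ">Z", "AGCATA"]
file_b = [">X", "TCAGA", ">Y", "GACTA", ">Z", "GCTAA"]
for header, seq in concatenate_matched(file_a, file_b):
    print(">%s\n%s" % (header, seq))
# >X
# ACTGCATCAGA
# >Y
# ACGTAAGACTA
# >Z
# AGCATAGCTAA
```

For real files, open file objects can be passed in place of the lists, since both functions only iterate over lines.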