Combining DNA Sequence Files Into One
10.4 years ago
nepgorkhey ▴ 130

I have two files of sequence data. How can I combine them into one? Here is what I want to do.

If File A has

->X

ACTGCA

->Y

ACGTAA

->Z

AGCATA

and File B has

->X

TCAGA

->Y

GACTA

->Z

GCTAA

I want to combine Files A and B into a File C with the following output:

->X

ACTGCATCAGA

->Y

ACGTAAGACTA

->Z

AGCATAGCTAA

biopython • 4.0k views

Can you assume the two files have the same set of sequences (here X, Y, and Z) and that they are in the same order? Also, what file format is this (e.g. FASTA, FASTQ)?


Yes, the files have the same sequence sets in the same order, and the format I am trying to use is FASTA.

10.4 years ago

Assuming both files have the same number of sequences in the same order, as in the example above, this should do it:

paste -d '\0' File_A File_B | sed 's/>[A-Z]*//' > File_C.fa
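For anyone who would rather see the line-by-line logic spelled out, here is a rough pure-Python sketch of the same idea. It is not a translation of the command above: it assumes each sequence sits on a single line, that both files list the same headers in the same order, and that headers start with ">". The function name `merge_fasta_lines` is made up for this illustration.

```python
from itertools import zip_longest


def merge_fasta_lines(lines_a, lines_b):
    """Pair up lines from two FASTA files and join the sequence lines.

    Assumes single-line sequences and identical headers in identical order.
    """
    merged = []
    for a, b in zip_longest(lines_a, lines_b):
        if a is None or b is None:
            raise ValueError("files have different numbers of lines")
        a, b = a.rstrip("\n"), b.rstrip("\n")
        if a.startswith(">"):
            # Header line: keep a single copy, but only if both files agree.
            if a != b:
                raise ValueError("header mismatch: %r vs %r" % (a, b))
            merged.append(a)
        else:
            # Sequence (or blank) line: glue the two halves together.
            merged.append(a + b)
    return merged


# Example data taken from the question, with ">" headers.
file_a = [">X", "ACTGCA", ">Y", "ACGTAA", ">Z", "AGCATA"]
file_b = [">X", "TCAGA", ">Y", "GACTA", ">Z", "GCTAA"]
result = merge_fasta_lines(file_a, file_b)
print("\n".join(result))
```

With real files you could pass two open file handles instead of lists. If the sequences are wrapped over several lines this pairing can go wrong, so a proper FASTA parser is the safer choice there.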


FYI, I used ">" as the header marker, which is different from the "->" you used in your FASTA headers.


It didn't work with my files. Is there anything I need to be aware of, apart from substituting my own sequence file names?


Try the first command on its own (the part before the pipe) and check whether it works, i.e. whether it joins your sequences together line by line. Then add the second command and see whether that works. Tell me which command is giving you the problem.

10.3 years ago
Peter 6.0k

Here's a Biopython solution. If you want to use it under Python 2, include this at the start:

# Python 2 backward compatibility fixes:
from __future__ import print_function
try:
    # Python 2's default zip function is not an iterator
    from itertools import izip as zip
except ImportError:
    # Under Python 3 the zip function is already an iterator
    pass

# Script proper starts here:
from Bio import SeqIO

def concatenate_matched_sequences(sequences1, sequences2):
    """Concatenate matching records from a pair of SeqRecord iterators."""
    for r1, r2 in zip(sequences1, sequences2):
        assert r1.id == r2.id
        yield r1 + r2

input_file1 = "a.fasta"
input_file2 = "b.fasta"
output_file = "ab.fasta"

in1 = SeqIO.parse(input_file1, "fasta")
in2 = SeqIO.parse(input_file2, "fasta")
count = SeqIO.write(concatenate_matched_sequences(in1, in2), output_file, "fasta")
print("Wrote %i sequences" % count)
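
The script above relies on both files listing their records in the same order, which is what the assert checks. If that ever does not hold, one possible workaround is to look records up by identifier instead. Below is a minimal standard-library sketch of that idea, not part of the Biopython script: `read_fasta` and `concatenate_by_id` are hypothetical helper names, the second file is assumed to fit in memory, and an identifier missing from the second file raises a KeyError.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA file handle.

    Sequences wrapped over several lines are joined into one string.
    """
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after the ">" marker.
            identifier = (line[1:].split() or [""])[0]
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def concatenate_by_id(handle_a, handle_b):
    """Yield (identifier, joined sequence) pairs in the order of the first file."""
    second = dict(read_fasta(handle_b))  # whole second file held in memory
    for identifier, seq in read_fasta(handle_a):
        yield identifier, seq + second[identifier]


# Example: the second file is in a different order and has a wrapped sequence.
file_a = io.StringIO(">X\nACTGCA\n>Y\nACGTAA\n>Z\nAGCATA\n")
file_b = io.StringIO(">Z\nGCTAA\n>X\nTCA\nGA\n>Y\nGACTA\n")
combined = list(concatenate_by_id(file_a, file_b))
for identifier, seq in combined:
    print(">%s\n%s" % (identifier, seq))
```

Within Biopython itself, `SeqIO.to_dict` or `SeqIO.index` on the second file would play the role of the dictionary here.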