Question: Combining DNA Sequence Files Into One
nepgorkhey (United States) wrote 5.3 years ago:

I have two files with sequence data. How can I combine them into one? Here is what I want to do.

If File A has

->X

ACTGCA

->Y

ACGTAA

->Z

AGCATA

and File B has

->X

TCAGA

->Y

GACTA

->Z

GCTAA

I want to combine File A and File B into File C, which should contain the following:

->X

ACTGCATCAGA

->Y

ACGTAAGACTA

->Z

AGCATAGCTAA


Comment by Peter, 5.3 years ago:

Can you assume the two files have the same set of sequences (here X, Y, and Z), in the same order? Also, what file format is this (e.g. FASTA, FASTQ)?

Comment by nepgorkhey, 5.3 years ago:

Yes, the files have the same set of sequences in the same order, and the format I am trying to use is FASTA.
Ashutosh Pandey (Philadelphia) wrote 5.3 years ago:

Assuming both files contain the same number of sequences, in the same order, as in the example above, here is what you can do:

paste -d '\0' File_A File_B | sed 's/>[A-Z]*//' > File_C.fa
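
To make the two stages of this pipeline easier to follow, here is a rough Python emulation (an illustration only, not part of the original answer). `paste -d '\0'` glues line N of File_A to line N of File_B with no separator, and the `sed` expression then deletes the first match of `>[A-Z]*` on each line, which removes the duplicated header. The helper names `paste_lines` and `strip_first_header` are made up for this sketch, and Python's `re` module only approximates sed here (in some locales sed's `[A-Z]` may match more than the ASCII capitals).

```python
import re

def paste_lines(lines_a, lines_b):
    # Approximates `paste -d '\0'`: join corresponding lines with no separator.
    # (Unlike paste, zip stops at the shorter input instead of padding it.)
    return [a + b for a, b in zip(lines_a, lines_b)]

def strip_first_header(line):
    # Approximates `sed 's/>[A-Z]*//'`: delete only the first match on the line.
    return re.sub(r">[A-Z]*", "", line, count=1)

# The example from the question, written with plain ">" headers.
file_a = [">X", "ACTGCA", ">Y", "ACGTAA", ">Z", "AGCATA"]
file_b = [">X", "TCAGA", ">Y", "GACTA", ">Z", "GCTAA"]

pasted = paste_lines(file_a, file_b)   # ">X>X", "ACTGCATCAGA", ...
merged = [strip_first_header(line) for line in pasted]
print("\n".join(merged))

# Headers that are not made up purely of capital letters come out mangled:
print(strip_first_header(">seq1>seq1"))   # seq1>seq1
print(strip_first_header("->X->X"))       # -->X
```

The last two lines suggest that the pattern only suits headers made of capital letters after a plain `>`. The line-by-line pairing also assumes every sequence sits on a single line (or is wrapped identically in both files).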


Comment by Ashutosh Pandey, 5.3 years ago:

FYI, I used ">" in the headers, which is different from the "->" in your example FASTA headers.

Comment by nepgorkhey, written 5.3 years ago, modified 2.1 years ago:

It didn't work with my files. Is there anything I need to be aware of, apart from using my own sequence file names?

Comment by Ashutosh Pandey, 5.3 years ago:

Try the first command on its own (the part before the pipe) and check whether it works for you, i.e. whether it concatenates your sequences into one. Then add the second command and check whether that works. Tell me which command is giving you the problem.
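
Before debugging the commands themselves, it may help to confirm that the two files really line up line by line, since that is what the paste step relies on. The following is a minimal sketch of such a check, assuming headers start with ">"; `header_lines` and `check_pasteable` are hypothetical helper names, not existing tools.

```python
def header_lines(lines):
    # Return (line_number, header_text) for every line starting with ">".
    return [(number, line.strip())
            for number, line in enumerate(lines, start=1)
            if line.startswith(">")]

def check_pasteable(lines_a, lines_b):
    # Collect reasons why a line-by-line paste would misalign the records.
    problems = []
    if len(lines_a) != len(lines_b):
        problems.append("line counts differ: %d vs %d"
                        % (len(lines_a), len(lines_b)))
    headers_a = header_lines(lines_a)
    headers_b = header_lines(lines_b)
    if len(headers_a) != len(headers_b):
        problems.append("record counts differ: %d vs %d"
                        % (len(headers_a), len(headers_b)))
    for (num_a, head_a), (num_b, head_b) in zip(headers_a, headers_b):
        if head_a != head_b:
            problems.append("header mismatch: %r on line %d vs %r on line %d"
                            % (head_a, num_a, head_b, num_b))
        elif num_a != num_b:
            problems.append("%r is on line %d in the first file but line %d in the second"
                            % (head_a, num_a, num_b))
    return problems

good_a = [">X", "ACTGCA", ">Y", "ACGTAA"]
good_b = [">X", "TCAGA", ">Y", "GACTA"]
wrapped_b = [">X", "TCA", "GA", ">Y", "GACTA"]   # first sequence wrapped over two lines

print(check_pasteable(good_a, good_b))       # no problems reported
for problem in check_pasteable(good_a, wrapped_b):
    print(problem)
```

For real files, the line lists could be read with something like `open("File_A").read().splitlines()`. An empty result means the line-by-line assumption holds; it does not by itself guarantee that the sed pattern suits the headers.
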
Peter (Scotland, UK) wrote 5.3 years ago:

Here's a Biopython solution. If you want to run it under Python 2, include this at the start:

# Python 2 backward compatibility fixes:
from __future__ import print_function
try:
    # Python 2's default zip function is not an iterator
    from itertools import izip as zip
except ImportError:
    # Under Python 3 the zip function is already an iterator
    pass

# Script proper starts here:
from Bio import SeqIO

def concatenate_matched_sequences(sequences1, sequences2):
    """Concatenate matching records from a pair of SeqRecord iterators."""
    for r1, r2 in zip(sequences1, sequences2):
        assert r1.id == r2.id
        yield r1 + r2

input_file1 = "a.fasta"
input_file2 = "b.fasta"
output_file = "ab.fasta"

in1 = SeqIO.parse(input_file1, "fasta")
in2 = SeqIO.parse(input_file2, "fasta")
count = SeqIO.write(concatenate_matched_sequences(in1, in2), output_file, "fasta")
print("Wrote %i sequences" % count)
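
For readers without Biopython, or whose files might not list the sequences in the same order, the same idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: it is written for Python 3, `read_fasta` and `concatenate_by_id` are hypothetical helpers (not Biopython functions), the identifier is taken to be the first word after ">", and the second file is assumed small enough to hold in memory.

```python
import io

def read_fasta(handle):
    # Yield (identifier, sequence) pairs from a FASTA file handle.
    # Wrapped sequence lines are joined and blank lines are skipped.
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)

def concatenate_by_id(handle_a, handle_b):
    # Yield (identifier, seq_a + seq_b), following the order of the first file.
    second = dict(read_fasta(handle_b))   # whole second file kept in memory
    for identifier, seq_a in read_fasta(handle_a):
        if identifier not in second:
            raise KeyError("%r is missing from the second file" % identifier)
        yield identifier, seq_a + second[identifier]

# In-memory stand-ins for a.fasta and b.fasta; the second is in a different order.
file_a = io.StringIO(">X\nACTGCA\n>Y\nACGTAA\n>Z\nAGCATA\n")
file_b = io.StringIO(">Z\nGCTAA\n>X\nTCAGA\n>Y\nGACTA\n")

combined = list(concatenate_by_id(file_a, file_b))
for identifier, sequence in combined:
    print(">%s\n%s" % (identifier, sequence))
```

Matching by identifier trades the low memory use of a paired, streaming approach for robustness to record order, and it raises an error for a missing record instead of stopping quietly when one file runs out.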