Merge multiple FASTA sequences in two files into a single file line by line
4
1
Entering edit mode
5.0 years ago

Hello,

I need to combine two FASTA files, each containing thousands of sequences, like:

File1:

>HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 2:N:0:GTGAAACG
NAAGAGGGGAATCAGGAGGGACCGCAAATATGCAGTGCAGCCCCGTGCCGTGTATGCAAC
TGGGGTACACATGTCCCAGAACATAGCCGGGAAGTCAACG
>HWI-700823F:57:C97D4ANXX:8:1101:1587:2235 2:N:0:GTGAAACG
NTCTGCCGCTCTGCGTACAAGCTTGAGAGTTTTTTTGCAGACCTTCTTGCCGGCGAGAGG
CTTAGCTATGGGAGCCAAAGCCATCATCTTCTTCTTCTCT
>HWI-700823F:57:C97D4ANXX:8:1101:1974:2229 2:N:0:NTGAAANN
NCTAAGCATGCTTTGAACTTGATCTTCTCCTTCACGAATGGGAGCGATTGGGATGGTCCT
TACAGATTGCAGTTTCAAGTTCCCAAGGCTTGGCGAAACA

File2:

>HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 1:N:0:GTGAAACG
GTCCCGTGATAATGGAAGTATTTGATTCTCTGCTCCGTCTTGTGCGTTGACTTCCCGGCT
ATGTTCTGGGACATGTGTACCCCAGTTGCATACACGGCAC
>HWI-700823F:57:C97D4ANXX:8:1101:1587:2235 1:N:0:GTGAAACG
CAGAAAGAGAAGAAGAAGATGATGGCTTTGGCTCCCATAGCTAAGCCTCTCGCCGGCAAG
AAGGTCTGCAAAAAAACTCTCAAGCTTGTACGCAGAGCGG
>HWI-700823F:57:C97D4ANXX:8:1101:1974:2229 1:N:0:NTGAAANN
CAACGATCGCCCCCTTCTGCAGACAAGTTACCAACCATGGCACAACTTGTGTCAACAATT
TGTGTGTCCGGAAAGATTGCTCTGTCACACGCGCCTTCT

I want to combine both files line by line, and the expected outcome is:

>HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 2:N:0:GTGAAACG
NAAGAGGGGAATCAGGAGGGACCGCAAATATGCAGTGCAGCCCCGTGCCGTGTATGCAAC
TGGGGTACACATGTCCCAGAACATAGCCGGGAAGTCAACG
>HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 1:N:0:GTGAAACG
GTCCCGTGATAATGGAAGTATTTGATTCTCTGCTCCGTCTTGTGCGTTGACTTCCCGGCT
ATGTTCTGGGACATGTGTACCCCAGTTGCATACACGGCAC
>HWI-700823F:57:C97D4ANXX:8:1101:1587:2235 2:N:0:GTGAAACG
NTCTGCCGCTCTGCGTACAAGCTTGAGAGTTTTTTTGCAGACCTTCTTGCCGGCGAGAGG
CTTAGCTATGGGAGCCAAAGCCATCATCTTCTTCTTCTCT
>HWI-700823F:57:C97D4ANXX:8:1101:1587:2235 1:N:0:GTGAAACG
CAGAAAGAGAAGAAGAAGATGATGGCTTTGGCTCCCATAGCTAAGCCTCTCGCCGGCAAG
AAGGTCTGCAAAAAAACTCTCAAGCTTGTACGCAGAGCGG

That is, I want to interleave the files: the first sequence of the first file, then the first sequence of the second file, and so on.

I tried various commands, but I am not able to parse these multi-FASTA files: each command treats the first line as a whole sequence and does not give the desired output.

Please help me.

sequence fasta line by line • 2.6k views

Hi Pierre,

Should I first linearize my sequences and then use the paste and transform command, or is this one-line command enough to get the desired output?

Thank you.


Please use ADD REPLY/ADD COMMENT to respond to existing posts.

Save the following code in a file called linearize.awk:

/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}

and then run the command as shown by @Pierre.
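For anyone who finds the awk one-liner hard to read, here is a rough Python sketch of what linearizing does: each record becomes a single line with the header, a tab, and the sequence lines joined together. The function name `linearize` and the shortened example reads are my own, not part of the original script, and unlike the awk version this sketch skips any text that appears before the first header.

```python
def linearize(lines):
    """Yield one 'header<TAB>sequence' string per FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence lines are concatenated without separators.
            chunks.append(line)
    if header is not None:
        yield header + "\t" + "".join(chunks)


# Hypothetical, shortened reads used only for illustration.
example = [
    ">read1 2:N:0:GTGAAACG\n", "NAAGAG\n", "TGGGGT\n",
    ">read2 2:N:0:GTGAAACG\n", "NTCTGC\n",
]
linearized = list(linearize(example))
for record in linearized:
    print(record)
```

Because every record ends up on exactly one line, line-oriented tools such as paste can then pair records from two files.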

5.0 years ago

Linearize, paste, and transform:

paste <(awk -f linearize.awk file1.fa) <(awk -f linearize.awk file2.fa) | tr "\t" "\n"
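To make the paste and tr steps concrete, here is a small Python sketch of what they do to already-linearized records (one "header, tab, sequence" line per record). The name `paste_and_split` and the toy records are assumptions for illustration only. Note that each sequence comes out on a single line rather than wrapped as in the input, because linearizing joins the sequence lines.

```python
from itertools import zip_longest


def paste_and_split(linearized_1, linearized_2):
    """Mimic the paste-then-tr step on two iterables of linearized records."""
    out = []
    # paste treats a file that runs out early as supplying empty lines.
    for left, right in zip_longest(linearized_1, linearized_2, fillvalue=""):
        joined = left + "\t" + right    # paste: join corresponding lines with a tab
        out.extend(joined.split("\t"))  # tr: every tab becomes a line break
    return out


# Hypothetical, shortened linearized records for illustration.
file1_records = [">r1 2:N:0\tNAAG", ">r2 2:N:0\tNTCT"]
file2_records = [">r1 1:N:0\tGTCC", ">r2 1:N:0\tCAGA"]
interleaved = paste_and_split(file1_records, file2_records)
print("\n".join(interleaved))
```

If the two files do not contain the same number of records, this sketch (like paste) emits empty lines for the missing mates, so it is worth checking the record counts first.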

Oops, fixed: the paste was at the wrong position.

5.0 years ago
Medhat 8.9k

Using Python (I did not test the code):

# Python 3; reads both files into memory and interleaves whole FASTA records.
file_1 = "path/to/file_1.fasta"
file_2 = "path/to/file_2.fasta"

def read_records(path):
    # Split on header lines so multi-line sequences stay with their header.
    with open(path) as handle:
        return ("\n" + handle.read().strip()).split("\n>")[1:]

with open("result.fasta", "w") as output:
    for rec_1, rec_2 in zip(read_records(file_1), read_records(file_2)):
        output.write(">{}\n>{}\n".format(rec_1, rec_2))
5.0 years ago


Thanks for this answer. It is much faster.

5.0 years ago

You've got file 1 and file 2 mixed up; they should be swapped. File 1 should have headers like:

HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 1:N:0:GTGAAACG

Anyway, you can use the BBMap package's reformat tool like this:

reformat.sh in1=file1.fa in2=file2.fa out=interleaved.fa
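If you want to verify the file order before interleaving, a quick check along these lines might help. It is only a sketch and assumes the headers look like the ones in the question: a read ID, a space, and then a field that starts with the read number (1 or 2). The helper names `split_header` and `mates_are_paired` are made up for this example.

```python
def split_header(header):
    """Split '>ID READ:FILTER:CONTROL:INDEX' into (ID, READ)."""
    read_id, _, description = header.lstrip(">").partition(" ")
    return read_id, description.split(":")[0]


def mates_are_paired(header_1, header_2):
    """True if both headers share a read ID and are read 1 then read 2."""
    id_1, read_1 = split_header(header_1)
    id_2, read_2 = split_header(header_2)
    return id_1 == id_2 and (read_1, read_2) == ("1", "2")


# Headers taken from the question.
first = ">HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 1:N:0:GTGAAACG"
second = ">HWI-700823F:57:C97D4ANXX:8:1101:1295:2240 2:N:0:GTGAAACG"
print(mates_are_paired(first, second))  # read 1 first, read 2 second
print(mates_are_paired(second, first))  # swapped order
```

Running the check over the first few header pairs of both files is usually enough to tell whether they were passed in the right order.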