I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).
I have two FASTA files and I want to merge them so that sequences with the same ID are printed back to back. Any sequence ID that appears in only one of the files should not be printed.
File 1:
>seq1
ATCGTCA
>seq2
AAAAACT
>seq3
AACATCA
>seq71
CCCGA

File 2:
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA
So the desired output should look like:
>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG
I'd prefer to use Python, as that is the language I'm learning, but any solution will suffice.
Thanks for helping.
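Since Python was requested, here is a minimal sketch of one way to do this, not a definitive implementation. It assumes that the sequence ID is the first whitespace-separated word after ">", that both files fit in memory, and that the output should follow the order of IDs in the first file. The function names (read_fasta, group_by_id, merge_shared) are made up for illustration. Every record with a shared ID is printed exactly once, first-file records before second-file records, so an ID repeated in one file does not cause records from the other file to be duplicated.

```python
import sys


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first word of the header line.
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif seq_id is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def group_by_id(records):
    """Map each ID to the list of its sequences, keeping first-seen order."""
    groups = {}
    for seq_id, seq in records:
        groups.setdefault(seq_id, []).append(seq)
    return groups


def merge_shared(records_a, records_b):
    """Return records whose ID occurs in both inputs, grouped by ID."""
    a = group_by_id(records_a)
    b = group_by_id(records_b)
    merged = []
    for seq_id in a:  # dicts keep insertion order in Python 3.7+
        if seq_id in b:
            merged.extend((seq_id, seq) for seq in a[seq_id] + b[seq_id])
    return merged


def main(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        for seq_id, seq in merge_shared(read_fasta(fa), read_fasta(fb)):
            print(f">{seq_id}\n{seq}")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print("usage: python merge_fasta.py FILE1 FILE2", file=sys.stderr)
```

To run it, save the code under a name of your choice (merge_fasta.py is only a placeholder, as are the FASTA file names) and call it from a terminal: python merge_fasta.py file1.fa file2.fa > merged.fa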
Nice one, Pierre Lindenbaum, but for some reason it always gives a 'duplication' of the first entry of the first file?
Nice catch. It's because seq1 is present twice in the second file, so a uniq should be added.
@Pierre thank you for your help. Could you please also share how I can run it? I am very basic with this stuff, sorry.