I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).
I have two FASTA files that I want to merge so that records with the same sequence ID are printed back to back. Any sequence ID that appears in only one of the two files should not be printed.
cat f1.fa
>seq1
ATCGTCA
>seq2
AAAAACT
>seq3
AACATCA
>seq71
CCCGA
cat f2.fa
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA
So the desired output should look like:
>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG
I'd prefer to use Python, as that is the language I'm learning, but any solution will suffice.
Thanks for helping.
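For reference, here is a minimal Python sketch of the merge described in the question. It is separate from the answer discussed in the comments, and it makes a few assumptions that the question does not state: the sequence ID is the first whitespace-delimited word after ">", multi-line sequences are joined into a single line, and output follows the order in which IDs first appear in the first file. The function names read_fasta and merge_shared are made up for illustration.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first word of the header line.
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def merge_shared(lines1, lines2):
    """Return FASTA text with only the IDs present in both inputs.

    For each shared ID, records from the first input come first,
    followed by every record with that ID from the second input.
    """
    first, order = defaultdict(list), []
    for seq_id, seq in read_fasta(lines1):
        if seq_id not in first:
            order.append(seq_id)  # remember first-seen order
        first[seq_id].append(seq)

    second = defaultdict(list)
    for seq_id, seq in read_fasta(lines2):
        second[seq_id].append(seq)

    records = []
    for seq_id in order:
        if seq_id in second:  # skip IDs missing from the second input
            for seq in first[seq_id] + second[seq_id]:
                records.append(f">{seq_id}\n{seq}")
    return "\n".join(records)


# Demo on the example data from the question.
f1 = """>seq1
ATCGTCA
>seq2
AAAAACT
>seq3
AACATCA
>seq71
CCCGA
""".splitlines()

f2 = """>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA
""".splitlines()

print(merge_shared(f1, f2))
```

On the example data this prints the desired output shown in the question: seq3 and seq81 are dropped, and both seq1 records from the second file follow the single seq1 record from the first. To run it on real files, the two lists can be replaced with open file handles, for example with open("f1.fa") as a, open("f2.fa") as b: print(merge_shared(a, b)). Because records are grouped by ID rather than joined pairwise, a repeated ID in the second file does not cause the first file's record to be printed twice.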
Nice one, Pierre Lindenbaum, but for some reason it always duplicates the first entry of the first file?
Nice catch. It's because seq1 is present twice in the second file, so a uniq should be added.
@Pierre, thank you for your help. Could you please also share how I can run it? I am very new to this stuff, sorry.