Remove duplicates in multifasta, where entries are paired
6 weeks ago
SaltedPork ▴ 170

Hi, my input looks like:

>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT


The entries in the fasta file are paired, so that each ref is paired with the sample# below it.

I want to identify the pairs where the nt sequences for sample# and ref are identical, and remove them from the fasta (or put them into a separate fasta file of their own). The output would hopefully be a fasta file containing only the pairs where the nt sequences for ref and sample# differ.

So far I have tried the seqkit rmdup command; however, it doesn't treat the entries as paired. How can I accomplish this, ideally with a bash command or another program?

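I don't know of an existing tool that would do it, but if your fasta files aren't that large, it would be quite easy to do in R. You could create 2 objects, 1 with the refs and the other with the samples, then find matching sequences with something like ref$sequence %in% sample$sequence to pick out the rows with matching entries. Note that %in% matches a ref against any sample, so if both objects are kept in the same order as the file, ref$sequence == sample$sequence compares strictly pair by pair.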

Again, this only really works if the fastas are not large.
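
For anyone who would rather avoid R, the same pairwise comparison can be sketched in plain Python with no extra packages. This is only a rough sketch, assuming the records strictly alternate ref, sample, ref, sample as in the example above, and that comparing sequences case-insensitively is acceptable. The function names and the output file names are made up for illustration, not part of any existing tool.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) records, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_pairs(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Split consecutive (ref, sample) pairs into differing and identical pairs.

    Assumption: records alternate ref, sample, ref, sample, ...
    Assumption: sequences are compared case-insensitively.
    """
    records = list(records)
    if len(records) % 2:
        raise ValueError("expected an even number of records (ref/sample pairs)")
    different, identical = [], []
    for i in range(0, len(records), 2):
        ref, sample = records[i], records[i + 1]
        target = identical if ref[1].upper() == sample[1].upper() else different
        target.extend([ref, sample])
    return different, identical


def write_fasta(records: Iterable[Record], path: str) -> None:
    """Write records as unwrapped fasta (one sequence line per record)."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")


def split_fasta_file(in_path: str, different_path: str, identical_path: str) -> None:
    """Read in_path and write differing and identical pairs to separate files."""
    with open(in_path) as fh:
        different, identical = split_pairs(read_fasta(fh))
    write_fasta(different, different_path)
    write_fasta(identical, identical_path)
```

Calling split_fasta_file("input.fasta", "different.fasta", "identical.fasta") (hypothetical file names) would then leave the pairs with differing sequences in one file and the identical pairs in the other. It loads all records into memory, so like the R suggestion it is best suited to fastas that are not huge.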