Question

Remove duplicates in multifasta, where entries are paired

14 months ago

SaltedPork ▴ 170

Hi, my input looks like:

>ref 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT

The entries in the FASTA file are paired, so each ref is paired with the sample# directly below it.

I want to identify the pairs where the nt sequences of the ref and the sample# are identical, and remove them from the FASTA file (or put them into a separate FASTA file of their own). The output would ideally be a FASTA file containing only the pairs whose ref and sample# nt sequences differ.

So far I have tried the seqkit rmdup command; however, it doesn't treat the entries as paired. How can I accomplish this, ideally with a bash command or another program?

bash python

updated 14 months ago by dthorbur ★ 1.9k • written 14 months ago by SaltedPork ▴ 170

I don't know of an existing tool that would do it, but if your FASTA files aren't that large, it would be quite easy to do in R. You could create 2 objects, 1 with the ref entries and the other with the sample entries, then find matching sequences with something like ref$sequence %in% sample$sequence to identify the rows with matching entries.

Again, this only really works if the FASTA files are not large.
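
Since the question asks for bash or Python, here is a minimal sketch of the same idea in Python rather than R. It assumes the records strictly alternate ref, sample, ref, sample, and that the comparison should ignore case (an assumption, not something stated in the question). The function names read_fasta, split_pairs and write_fasta are made up for this example. It compares each ref only with the sample directly below it, so it is a pairwise check rather than a membership test across the whole file.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) records; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_pairs(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Split consecutive (ref, sample) pairs into differing and identical pairs."""
    records = list(records)
    if len(records) % 2:
        raise ValueError("expected an even number of records (ref/sample pairs)")
    different, identical = [], []
    for ref, sample in zip(records[0::2], records[1::2]):
        # Assumption: sequences are compared case-insensitively.
        same = ref[1].upper() == sample[1].upper()
        (identical if same else different).extend([ref, sample])
    return different, identical


def write_fasta(records: Iterable[Record], path: str) -> None:
    """Write records to a FASTA file, one sequence line per record."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")


# The example input from the question, held in memory.
example = """\
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
"""

different, identical = split_pairs(read_fasta(example.splitlines()))
print([header for header, _ in different])  # ['ref', 'sample2']
print([header for header, _ in identical])  # ['ref', 'sample1']
```

For a real file, pass an open file handle to read_fasta and send the two lists to write_fasta with output names of your choice. The sketch keeps all records in memory, so the same caveat about large files applies.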

14 months ago by dthorbur ★ 1.9k