7 weeks ago
SaltedPork ▴ 170
Hi my input looks like:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
The entries in the fasta file are paired, so that each ref is paired with the sample# below it.
I want to identify the pairs where the nt sequences for ref and sample# are identical, and remove them from the fasta (or put them into another fasta file of their own). The output would hopefully be a fasta file containing only the pairs where the nt sequences for ref and sample# are different.
So far I have tried the seqkit rmdup command; however, this doesn't treat the entries as paired. How can I accomplish this, ideally with a bash command or another program?
I don't have an existing tool that would do it, but if your fasta files aren't that large, it would be quite easy to do in R. You could create 2 objects, 1 with ref and the other with sample, then find overlapping sequences with something like ref$sequence %in% sample$sequence to pick out rows with matching entries.
Again, this only really works if the fastas are not large.
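For what it's worth, here is a minimal sketch of the same idea in Python rather than R, comparing each ref only against the sample directly below it instead of against all samples. It assumes the records strictly alternate ref, sample, ref, sample, and that upper/lower case should not count as a difference. The function names and the file names passed on the command line are hypothetical, not part of any existing tool.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence) tuples; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_pairs(records):
    """Split ref/sample pairs into (identical, different) record lists.

    Assumption: records alternate ref, sample, ref, sample, ...
    Assumption: comparison ignores case.
    """
    records = list(records)
    if len(records) % 2:
        raise ValueError("expected an even number of records (ref/sample pairs)")
    identical, different = [], []
    for i in range(0, len(records), 2):
        ref, sample = records[i], records[i + 1]
        same = ref[1].upper() == sample[1].upper()
        (identical if same else different).extend([ref, sample])
    return identical, different


def write_fasta(records, path):
    """Write records as two-line FASTA entries (header line, sequence line)."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: python split_pairs.py paired.fasta identical.fasta different.fasta
    with open(sys.argv[1]) as handle:
        same, diff = split_pairs(read_fasta(handle))
    write_fasta(same, sys.argv[2])
    write_fasta(diff, sys.argv[3])
```

With the example input from the question, the first ref/sample1 pair would go to the "identical" file and the ref/sample2 pair to the "different" file. Because the file is read record by record and only compared within each pair, this should also cope with larger fastas than the %in% approach, although it still holds all records in memory.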