Remove duplicates in multifasta, where entries are paired
14 months ago
SaltedPork ▴ 170

Hi, my input looks like:

>ref 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT

The entries in the fasta file are paired, so that each ref is paired with the sample# directly below it.

I want to identify the pairs where the nt sequences of sample# and ref are identical, and remove them from the fasta (or put them into a separate fasta file). The output would ideally be a fasta file containing only pairs where the ref and sample# nt sequences differ.

So far I have tried the seqkit rmdup command; however, it doesn't treat the entries as paired. How can I accomplish this, ideally with a bash command or another program?

bash python • 458 views

I don't know of an existing tool that would do it, but if your fasta files aren't that large, it would be quite easy to do in R. You could create 2 objects, 1 with ref and the other with sample, then find matching sequences with something like ref$sequence %in% sample$sequence to flag the rows with matching entries.

Again, this only really works if the fastas are not large.
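
Since the question is tagged python, here is a minimal sketch of the paired comparison in plain Python. It assumes the records strictly alternate ref, sample, ref, sample, as in the example input, and that a sequence may wrap over several lines. The function names (read_fasta, split_pairs, write_fasta) and the script name in the usage comment are hypothetical, not part of any existing tool. Unlike a set-membership test, each ref is compared only with the record directly after it.

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) tuples; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_pairs(records):
    """Split consecutive (ref, sample) pairs into differing and identical lists.

    Assumption: records alternate ref, sample, ref, sample, ...
    The comparison is case-insensitive.
    """
    records = list(records)
    if len(records) % 2:
        raise ValueError("expected an even number of records (ref/sample pairs)")
    differing, identical = [], []
    for ref, sample in zip(records[0::2], records[1::2]):
        target = identical if ref[1].upper() == sample[1].upper() else differing
        target.extend([ref, sample])
    return differing, identical


def write_fasta(records, handle):
    """Write records as two-line fasta entries."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage:
    #   python split_pairs.py input.fasta differing.fasta identical.fasta
    with open(sys.argv[1]) as fin:
        differing, identical = split_pairs(read_fasta(fin))
    with open(sys.argv[2], "w") as fout:
        write_fasta(differing, fout)
    with open(sys.argv[3], "w") as fout:
        write_fasta(identical, fout)
```

With the example input from the question, the ref/sample1 pair would go to the identical file and the ref/sample2 pair to the differing file. The whole file is read into memory, so this shares the caveat above about large fastas.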
