Hello,
I have performed a pan-genome analysis with Roary. It produced around 6000 groups/clusters of genes from different genomes. Each cluster has around 425 aligned gene sequences. The problem is that some clusters contain more than one gene from the same organism.
For example:
cluster 1
>Org1_0001
AGTAAGAGAA
>Org2_0023
AGTAAGAGAA
>Org1_0004
AGAAACGC
>Org3_3400
AGTAAGAGAA
From each cluster, I need to extract the genes (headers + sequences) that come from the same organism.
I could do this by extracting the labels of each cluster separately and then manually identifying the duplicated entries (the organism name is the same but the gene number differs), but that would take a huge amount of time.
Is there any script or program in R, Python, or Linux for this kind of problem? I have tried awk or grepl in Linux but got nowhere. Please guide me in this regard.
Regards
Awan
If the sequences are exactly identical (and are of identical length) then you may simply be able to use dedupe.sh from the BBMap suite. If only the headers contain Org1, which is what needs to be used for deduplication (and the sequences are different), then this solution will not work.
Thanks for the reply, but the sequences are of different lengths. The only identical thing is the organism ID, i.e. Org1.
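Since only the organism ID in the header is shared, one option is to group the records of a cluster by that ID and keep the organisms that occur more than once. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes every header has the form Organism_GeneNumber (as in Org1_0001), so the organism ID is everything before the last underscore; the function names parse_fasta, organism_id and duplicated_by_organism are made up for this illustration.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Drop the ">" and any stray space after it.
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def organism_id(header):
    """Assumption: the first token of the header is <organism>_<gene number>."""
    return header.split()[0].rsplit("_", 1)[0]


def duplicated_by_organism(text):
    """Return {organism: [(header, sequence), ...]} for organisms that
    contribute more than one record to the cluster."""
    groups = defaultdict(list)
    for header, seq in parse_fasta(text):
        groups[organism_id(header)].append((header, seq))
    return {org: recs for org, recs in groups.items() if len(recs) > 1}


# The example cluster from the question.
example = """>Org1_0001
AGTAAGAGAA
>Org2_0023
AGTAAGAGAA
>Org1_0004
AGAAACGC
>Org3_3400
AGTAAGAGAA
"""

dups = duplicated_by_organism(example)
for org, recs in dups.items():
    for header, seq in recs:
        print(f">{header}\n{seq}")
```

To run this over all clusters, read each cluster file in a loop and pass its contents to duplicated_by_organism. If your organism names themselves contain underscores the split still works, because only the last underscore is used, but headers with a different layout would need a different organism_id.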