Question

Extracting the "almost" duplicate IDs and their sequences


4.6 years ago

3335098459 ▴ 30

Hello,

I have performed a pan-genome analysis with Roary. It produced around 6000 groups/clusters of genes from different genomes. Each cluster has around 425 aligned gene sequences. The problem is that some clusters contain more than one gene from the same organism.

For example:

cluster 1
>Org1_0001
AGTAAGAGAA

>Org2_0023
AGTAAGAGAA

>Org1_0004
AGAAACGC

>Org3_3400
AGTAAGAGAA

From each cluster, I need to extract the genes (headers + sequences) that come from the same organism.

I could do this by extracting the labels of each cluster separately and then manually identifying the duplicates (the organism name is the same but the gene number differs), but that would take a huge amount of time.

Is there any script or program in R, Python, or Linux for this kind of problem? I have tried awk and grepl in Linux but got nowhere. Please guide me in this regard.

Regards

Awan

R genome gene • 698 views


If the sequences are exactly identical (and of identical length), then you may simply be able to use dedupe.sh from the BBMap suite. If only the organism part of the header (e.g. Org1) is shared and the sequences themselves differ, then this solution will not work.

4.6 years ago by GenoMax 141k
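
To make the distinction in this reply concrete, here is a minimal sketch of exact-sequence deduplication on the example cluster from the question. It is a plain-Python stand-in for the general idea, not dedupe.sh's actual implementation; the function name `dedupe_exact` is hypothetical.

```python
def dedupe_exact(records):
    """Keep only the first record seen for each distinct sequence string."""
    seen = set()
    kept = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((header, seq))
    return kept


# Example cluster copied from the question.
cluster = [
    ("Org1_0001", "AGTAAGAGAA"),
    ("Org2_0023", "AGTAAGAGAA"),
    ("Org1_0004", "AGAAACGC"),
    ("Org3_3400", "AGTAAGAGAA"),
]

kept = dedupe_exact(cluster)
for header, seq in kept:
    print(f">{header}\n{seq}")
```

On this input, both Org1 records survive because their sequences differ, while the Org2 and Org3 records are dropped as copies of the first sequence. Matching on sequence therefore does not isolate genes that share an organism.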


Thanks for the reply, but the sequences are of different lengths. The only identical thing is the organism ID, i.e. Org1.

4.6 years ago by 3335098459 ▴ 30
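
Since only the organism ID is shared, one possible approach is to group the records of each cluster by the organism part of the header and keep the groups with more than one member. The following is a minimal sketch, assuming every header has the form `Org1_0001`, so that the organism is everything before the last underscore. The function names (`read_fasta`, `organism_of`, `same_organism_records`) are hypothetical, and the parser is a simple hand-written one rather than a library call.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Drop the ">" and any space after it (e.g. "> Org1_0001").
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def organism_of(header):
    """Assumption: the ID looks like Org1_0001; organism = text before the last '_'."""
    parts = header.split()
    seq_id = parts[0] if parts else header
    return seq_id.rsplit("_", 1)[0]


def same_organism_records(lines):
    """Return {organism: [(header, seq), ...]} for organisms seen more than once."""
    groups = defaultdict(list)
    for header, seq in read_fasta(lines):
        groups[organism_of(header)].append((header, seq))
    return {org: recs for org, recs in groups.items() if len(recs) > 1}


# Example cluster copied from the question.
example = """\
> Org1_0001
AGTAAGAGAA

>Org2_0023
AGTAAGAGAA

>Org1_0004
AGAAACGC

>Org3_3400
AGTAAGAGAA
"""

dups = same_organism_records(example.splitlines())
for org, recs in dups.items():
    for header, seq in recs:
        print(f">{header}\n{seq}")
```

To cover all clusters, the same function could be called once per cluster file, for example by passing each open file handle to `same_organism_records`. If the real organism names themselves contain underscores in a way that breaks the "before the last underscore" rule, `organism_of` would need to be adapted.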