Extracting the "almost" duplicate IDs and their sequences
4.6 years ago
3335098459 ▴ 30

Hello,

I have performed a pan-genome analysis with Roary. It produced around 6000 groups/clusters of genes from different genomes, and each cluster contains around 425 aligned gene sequences. The problem is that some clusters contain more than one gene from the same organism.

For example:

cluster 1
>Org1_0001
AGTAAGAGAA

>Org2_0023
AGTAAGAGAA

>Org1_0004
AGAAACGC

>Org3_3400
AGTAAGAGAA

I need to extract the genes (headers + sequences) that come from the same organism within each cluster.

I could do this by extracting the labels of each cluster separately and then manually identifying the duplicates (the organism name is the same but the gene number differs), but that would take a huge amount of time.

Is there any script or program in R, Python, or Linux for this kind of problem? I have tried awk or grepl in Linux but got nowhere. Please guide me in this regard.

Regards

Awan

R genome gene • 698 views

If the sequences are exactly identical (and of identical length), then you may simply be able to use dedupe.sh from the BBMap suite. If only the headers share the Org1 prefix that needs to be used for deduplication (and the sequences are different), then this solution will not work.


Thanks for the reply, but the sequences are of different lengths. The only identical thing is the organism ID, i.e. Org1.
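
Since only the organism ID is shared, one possible approach is to group the records of each cluster by the header prefix and keep the groups with more than one member. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes the organism ID is everything before the last underscore in the header (Org1_0001 -> Org1), as in the example in the question; the function names (read_fasta, organism_of, same_organism_records) are made up for illustration, and organism_of would need adjusting if the real Roary headers follow a different pattern.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # drop '>' and any stray space after it (e.g. '> Org1_0001')
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def organism_of(header):
    """ASSUMPTION: organism ID = text before the last underscore."""
    name = header.split()[0] if header else header
    return name.rsplit("_", 1)[0]


def same_organism_records(lines):
    """Return {organism: [(header, seq), ...]} for organisms that
    occur more than once in a single cluster."""
    groups = defaultdict(list)
    for header, seq in read_fasta(lines):
        groups[organism_of(header)].append((header, seq))
    return {org: recs for org, recs in groups.items() if len(recs) > 1}


# The example cluster from the question
cluster_1 = """\
> Org1_0001
AGTAAGAGAA

>Org2_0023
AGTAAGAGAA

>Org1_0004
AGAAACGC

>Org3_3400
AGTAAGAGAA
""".splitlines()

for org, records in same_organism_records(cluster_1).items():
    for header, seq in records:
        print(f">{header}\n{seq}")
```

For real data, same_organism_records can be given an open file handle instead of a list of lines, so looping over all ~6000 cluster files and writing the hits of each cluster to its own output file should be straightforward. Gap characters in the aligned sequences are kept as they are.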
