Question: Filtering rows based on specific conditions
Asked 3.6 years ago by Promi:


I have a tab-delimited text file with the sequence IDs in column 1 and the corresponding HMM names in column 7, as shown below.

gi|336321007|ref|YP_004600975.1| adh_short_C2

gi|336321007|ref|YP_004600975.1| adh_short

gi|336321007|ref|YP_004600975.1| KR

gi|557685240|ref|YP_008788710.1| PS-DH

gi|557685240|ref|YP_008788710.1| adh_short_C2

gi|557685240|ref|YP_008788710.1| adh_short

gi|557685240|ref|YP_008788710.1| KR

gi|557685240|ref|YP_008788710.1| ketoacyl-synt

gi|557685240|ref|YP_008788710.1| Ketoacyl-synt_C

.   .


I want to select all the rows having 'adh_short_C2', 'adh_short', or 'KR' for every unique sequence ID in column 1, e.g. gi|336321007|ref|YP_004600975.1| in this case.

I also want to delete all the rows for any ID that has other HMM names in addition to 'adh_short_C2', 'adh_short', or 'KR', e.g. gi|557685240|ref|YP_008788710.1| in this case.

Desired output: the rows for IDs that have only 'adh_short_C2', 'adh_short', or 'KR' and no other HMM names.

I tried this code, but it doesn't work well because it also picks up the IDs that have other HMM names:

adh_short_C2_list <- subset(adh_short_C2, select=`seq id`)

adh_short_list <- subset(adh_short, select=`seq id`)

How can I apply these two conditions together or step by step?

pfam data filtering • 866 views
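
The two conditions in the question reduce to a single per-ID test: keep an ID only if every HMM name recorded for it is in the allowed set. Below is a minimal base R sketch of that idea. The toy data frame, the column names `V1` (ID) and `V2` (HMM name), and the variable names are assumptions for illustration, not the asker's real file.

```r
# Toy data mirroring the example rows in the question.
# Column names V1 (sequence ID) and V2 (HMM name) are assumptions.
dat <- data.frame(
  V1 = c(rep("gi|336321007|ref|YP_004600975.1|", 3),
         rep("gi|557685240|ref|YP_008788710.1|", 6)),
  V2 = c("adh_short_C2", "adh_short", "KR",
         "PS-DH", "adh_short_C2", "adh_short", "KR",
         "ketoacyl-synt", "Ketoacyl-synt_C"),
  stringsAsFactors = FALSE
)

allowed <- c("adh_short_C2", "adh_short", "KR")

# For each row: TRUE only if ALL HMM names of that row's ID are allowed.
# ave() applies all() within each ID and recycles the result to every row.
id_is_clean <- as.logical(ave(dat$V2 %in% allowed, dat$V1, FUN = all))

# Keep the rows of IDs that passed the per-ID test.
result <- dat[id_is_clean, ]
print(result)
```

With a real file, `dat` would come from reading the tab-delimited table and selecting the ID and HMM-name columns instead of being built by hand.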


                       V1              V2
 gi|336321007|ref|YP_004600975.1    adh_short_C2
 gi|336321007|ref|YP_004600975.1       adh_short
 gi|336321007|ref|YP_004600975.1              KR
 gi|557685240|ref|YP_008788710.1           PS-DH
 gi|557685240|ref|YP_008788710.1    adh_short_C2
 gi|557685240|ref|YP_008788710.1       adh_short
 gi|557685240|ref|YP_008788710.1              KR
 gi|557685240|ref|YP_008788710.1   ketoacyl-synt
 gi|557685240|ref|YP_008788710.1 Ketoacyl-synt_C


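library(dplyr)  # provides filter()
data1 <- read.csv("test.txt", sep = "\t", header = FALSE)
# Row-wise filter: keeps rows whose HMM name is listed, without checking the ID's other HMM names
filter(data1, V2 %in% c("KR", "adh_short_C2"))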


> filter(data1, V2 %in% c("KR","adh_short_C2"))
                               V1           V2
1 gi|336321007|ref|YP_004600975.1 adh_short_C2
2 gi|336321007|ref|YP_004600975.1           KR
3 gi|557685240|ref|YP_008788710.1 adh_short_C2
4 gi|557685240|ref|YP_008788710.1           KR
Reply written 3.6 years ago by cpad0112

The desired output should be like:

gi|336321007|ref|YP_004600975.1 adh_short_C2

gi|336321007|ref|YP_004600975.1 adh_short

gi|336321007|ref|YP_004600975.1 KR

Reply written 3.6 years ago by Promi
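
One way to get this output with dplyr is to group by ID before filtering, so that the condition is evaluated once per ID rather than once per row. The sketch below assumes the `V1`/`V2` column names from the reply above and builds the example rows inline so it runs on its own. The `allowed` and `kept` names are illustrative.

```r
library(dplyr)

# Example rows from the thread, built inline instead of read from test.txt.
data1 <- data.frame(
  V1 = c(rep("gi|336321007|ref|YP_004600975.1", 3),
         rep("gi|557685240|ref|YP_008788710.1", 6)),
  V2 = c("adh_short_C2", "adh_short", "KR",
         "PS-DH", "adh_short_C2", "adh_short", "KR",
         "ketoacyl-synt", "Ketoacyl-synt_C"),
  stringsAsFactors = FALSE
)

allowed <- c("adh_short_C2", "adh_short", "KR")

# Within each ID group, all() is TRUE only when every HMM name is allowed,
# so a group is kept or dropped as a whole.
kept <- data1 %>%
  group_by(V1) %>%
  filter(all(V2 %in% allowed)) %>%
  ungroup()

print(kept)
```

Because `all()` returns a single value per group, the second ID is dropped entirely even though some of its rows carry allowed HMM names.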