Question: Filtering rows based on specific conditions
Promi10 wrote:

Hi,

I have a tab-delimited text file with the sequence IDs in column 1 and the corresponding HMM names in column 7, as shown below.

gi|336321007|ref|YP_004600975.1| adh_short_C2

gi|336321007|ref|YP_004600975.1| adh_short

gi|336321007|ref|YP_004600975.1| KR

gi|557685240|ref|YP_008788710.1| PS-DH

gi|557685240|ref|YP_008788710.1| adh_short_C2

gi|557685240|ref|YP_008788710.1| adh_short

gi|557685240|ref|YP_008788710.1| KR

gi|557685240|ref|YP_008788710.1| ketoacyl-synt

gi|557685240|ref|YP_008788710.1| Ketoacyl-synt_C

.   .

.
.

I want to select all the rows having 'adh_short_C2', 'adh_short', or 'KR' for every unique sequence ID in column 1, e.g. gi|336321007|ref|YP_004600975.1| in this case.

And delete all the rows for any ID that has other HMM names in addition to 'adh_short_C2', 'adh_short', or 'KR', e.g. gi|557685240|ref|YP_008788710.1| in this case.

Desired output: rows for the IDs that have only 'adh_short_C2', 'adh_short', or 'KR' and no other HMM names.

I tried this code, but it doesn't work well because it also picks up the IDs that have other HMM names:

adh_short_C2_list <- subset(adh_short_C2, select=`seq id`)

adh_short_list <- subset(adh_short, select=`seq id`)

How can I apply these two conditions together or step by step?

pfam data filtering • 732 views
modified 2.6 years ago by genomax76k • written 2.6 years ago by Promi10

data:

                       V1              V2
 gi|336321007|ref|YP_004600975.1    adh_short_C2
 gi|336321007|ref|YP_004600975.1       adh_short
 gi|336321007|ref|YP_004600975.1              KR
 gi|557685240|ref|YP_008788710.1           PS-DH
 gi|557685240|ref|YP_008788710.1    adh_short_C2
 gi|557685240|ref|YP_008788710.1       adh_short
 gi|557685240|ref|YP_008788710.1              KR
 gi|557685240|ref|YP_008788710.1   ketoacyl-synt
 gi|557685240|ref|YP_008788710.1 Ketoacyl-synt_C

Code

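library(dplyr)
# Read the tab-delimited file without a header; columns are auto-named V1, V2, ...
data1 <- read.csv("test.txt", sep = "\t", header = FALSE)
View(data1)
# Keep the rows whose HMM name (V2) is KR or adh_short_C2
filter(data1, V2 %in% c("KR", "adh_short_C2"))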

Result

> filter(data1, V2 %in% c("KR","adh_short_C2"))
                               V1           V2
1 gi|336321007|ref|YP_004600975.1 adh_short_C2
2 gi|336321007|ref|YP_004600975.1           KR
3 gi|557685240|ref|YP_008788710.1 adh_short_C2
4 gi|557685240|ref|YP_008788710.1           KR
modified 2.6 years ago • written 2.6 years ago by cpad011212k

The desired output should be like:

gi|336321007|ref|YP_004600975.1 adh_short_C2

gi|336321007|ref|YP_004600975.1 adh_short

gi|336321007|ref|YP_004600975.1 KR

written 2.6 years ago by Promi10
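
The row-wise filter above keeps the matching rows of every ID, so it cannot drop an ID just because it also carries other HMM names. One possible way to get the desired output is to first collect the IDs that have any HMM name outside the allowed set, then drop all rows for those IDs. The following is a minimal base R sketch under some assumptions: the toy data frame is rebuilt inline from the example in this thread, and the columns are called V1 (ID) and V2 (HMM name) as in the reply above. In the original file the HMM name is in column 7, so the column name would need adjusting (for example V7 after reading with header = FALSE). The names `allowed`, `ids_with_other`, and `result` are illustrative only.

```r
# Toy data mirroring the example in the thread (V1 = sequence ID, V2 = HMM name)
data1 <- data.frame(
  V1 = c(rep("gi|336321007|ref|YP_004600975.1", 3),
         rep("gi|557685240|ref|YP_008788710.1", 6)),
  V2 = c("adh_short_C2", "adh_short", "KR",
         "PS-DH", "adh_short_C2", "adh_short", "KR",
         "ketoacyl-synt", "Ketoacyl-synt_C"),
  stringsAsFactors = FALSE
)

# HMM names that an ID is allowed to have
allowed <- c("adh_short_C2", "adh_short", "KR")

# IDs that have at least one HMM name outside the allowed set
ids_with_other <- unique(data1$V1[!(data1$V2 %in% allowed)])

# Keep only the rows whose ID is not in that list
result <- data1[!(data1$V1 %in% ids_with_other), ]

print(result)
```

With dplyr loaded, the same idea can be written as a grouped filter: `data1 %>% group_by(V1) %>% filter(all(V2 %in% allowed))`.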