Filtering rows based on specific conditions
6.9 years ago
Promi ▴ 10

Hi,

I have a tab-delimited text file with the sequence IDs in column 1 and the corresponding HMM names in column 7, as shown below.

gi|336321007|ref|YP_004600975.1| adh_short_C2
gi|336321007|ref|YP_004600975.1| adh_short
gi|336321007|ref|YP_004600975.1| KR
gi|557685240|ref|YP_008788710.1| PS-DH
gi|557685240|ref|YP_008788710.1| adh_short_C2
gi|557685240|ref|YP_008788710.1| adh_short
gi|557685240|ref|YP_008788710.1| KR
gi|557685240|ref|YP_008788710.1| ketoacyl-synt
gi|557685240|ref|YP_008788710.1| Ketoacyl-synt_C
...
For each unique sequence ID in column 1, I want to keep its rows only when every HMM name for that ID is 'adh_short_C2', 'adh_short', or 'KR', e.g. gi|336321007|ref|YP_004600975.1| in this case.

All rows for an ID should be deleted if that ID has any other HMM name in addition to 'adh_short_C2', 'adh_short', or 'KR', e.g. gi|557685240|ref|YP_008788710.1| in this case.

Desired output: the rows for IDs that have only 'adh_short_C2', 'adh_short', or 'KR' and no other HMM names.

I tried this code, but it doesn't work well because it also picks up IDs that have other HMM names:

adh_short_C2_list <- subset(adh_short_C2, select=`seq id`)

adh_short_list <- subset(adh_short, select=`seq id`)

How can I apply these two conditions together, or step by step?


data:

                       V1              V2
 gi|336321007|ref|YP_004600975.1    adh_short_C2
 gi|336321007|ref|YP_004600975.1       adh_short
 gi|336321007|ref|YP_004600975.1              KR
 gi|557685240|ref|YP_008788710.1           PS-DH
 gi|557685240|ref|YP_008788710.1    adh_short_C2
 gi|557685240|ref|YP_008788710.1       adh_short
 gi|557685240|ref|YP_008788710.1              KR
 gi|557685240|ref|YP_008788710.1   ketoacyl-synt
 gi|557685240|ref|YP_008788710.1 Ketoacyl-synt_C

Code

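library(dplyr)
# Read the tab-delimited file; with no header row, columns are named V1, V2, ...
data1 <- read.csv("test.txt", sep = "\t", header = FALSE)
View(data1)
# Row-wise filter: keeps matching rows regardless of the ID's other HMM names
filter(data1, V2 %in% c("KR","adh_short_C2"))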

Result

> filter(data1, V2 %in% c("KR","adh_short_C2"))
                               V1           V2
1 gi|336321007|ref|YP_004600975.1 adh_short_C2
2 gi|336321007|ref|YP_004600975.1           KR
3 gi|557685240|ref|YP_008788710.1 adh_short_C2
4 gi|557685240|ref|YP_008788710.1           KR

The desired output should look like this:

gi|336321007|ref|YP_004600975.1 adh_short_C2
gi|336321007|ref|YP_004600975.1 adh_short
gi|336321007|ref|YP_004600975.1 KR
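
A row-wise filter cannot give this output, because the decision has to be made per ID: an ID is kept only if all of its HMM names are in the allowed set. Below is a minimal base R sketch of that idea. It assumes a two-column data frame named `data1` with columns `V1` (ID) and `V2` (HMM name), as in the example data above; the data frame is built inline here so the snippet runs on its own, and the names `allowed`, `id_ok`, `keep_ids` and `result` are only illustrative.

```r
# Example data copied from the thread (assumed layout: V1 = ID, V2 = HMM name)
data1 <- data.frame(
  V1 = c(rep("gi|336321007|ref|YP_004600975.1", 3),
         rep("gi|557685240|ref|YP_008788710.1", 6)),
  V2 = c("adh_short_C2", "adh_short", "KR",
         "PS-DH", "adh_short_C2", "adh_short", "KR",
         "ketoacyl-synt", "Ketoacyl-synt_C"),
  stringsAsFactors = FALSE
)

# HMM names an ID is allowed to have
allowed <- c("adh_short_C2", "adh_short", "KR")

# For each ID: TRUE only if every one of its HMM names is in the allowed set
id_ok <- tapply(data1$V2 %in% allowed, data1$V1, all)

# IDs that passed the check
keep_ids <- names(id_ok)[id_ok]

# Keep all rows belonging to those IDs
result <- data1[data1$V1 %in% keep_ids, ]
print(result)
```

With the real file, `data1` would come from `read.csv("test.txt", sep = "\t", header = FALSE)` instead, and the column names would need adjusting if the HMM name is in column 7 (`V7`). If you prefer dplyr, the same per-ID check can be written as `data1 %>% group_by(V1) %>% filter(all(V2 %in% allowed)) %>% ungroup()`.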
