Question

How to find pattern matching over range of columns but not present in other columns?

3

Entering edit mode

4.8 years ago

Inquisitive8995 ▴ 270

Hello,

I have a huge file with 0s and 1s. The position where there is a snp is given 1 and others are 0. Also, I have different populations in the same tsv file. My goal is to find signature SNPs present in a particular population (say population 1 is from column1 to column5). So I need the positions where 1 is present in all 5 columns for population 1 and 0s in all the other columns.

POS     pop1  pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
 746     1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0           
 762     0       0       0       0       0       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0

So my result for the signature snp position for pop1 would be 746 and for pop3 would be 762. Are there any tools to do this ? I tried using awk for this but it gave an error. My awk command is as follows ;

$ awk -F '\t'  '$1,$2,$3,$4.$5 ~ /1/' file.tsv

Is there a way where I can specify range of columns in awk pattern match ?

I hope my question is clear. Any help would be appreciated. Thanks in advance.

snp R shell • 2.5k views

ADD COMMENT • link updated 4.8 years ago by AK ★ 2.2k • written 4.8 years ago by Inquisitive8995 ▴ 270

0

Entering edit mode

Is the number of column per population equal for all population or do this vary? Do you know this number before starting?

ADD REPLY • link 4.8 years ago by finswimmer 16k

0

Entering edit mode

They not equal but I know the number before starting

ADD REPLY • link 4.8 years ago by Inquisitive8995 ▴ 270

2

Entering edit mode

You could start by summing up the numbers of 1s per group and then filter for groups where the sum of 1s is identical to the number of elements per group like for pop1 take only rows where sum is 5. I strongly suggest to solve these kinds of things yourself, you will see it enhances your skills and enables you to abstract the knowledge to other problems. Stuff like that you'll probably encounter repetitively during your career.

ADD REPLY • link 4.8 years ago by ATpoint 82k

score 5 · Answer 1 · 2019-07-08

Using data.table, reshape wide-to-long, then get number of 1s and number of rows per "POS, pop", if numbers equal then subset.

library(data.table)

dt1 <- fread("
POS     pop1  pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
746     1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0           
762     0       0       0       0       0       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0
")

melt(dt1, id.vars = "POS")[, list(countOnes = sum(value == 1), countPops = .N),
                           by = .(POS, variable)][ countOnes == countPops, ]

#    POS variable countOnes countPops
# 1: 746     pop1         5         5
# 2: 762     pop3         5         5

score 2 · Answer 2 · 2019-07-08

Hi Inquisitive8995,

Would there be two pops that are with all 1, or 1 pop with all 1 and another pop with some non-zero data? zx8754's solution is great, but note that:

So I need the positions where 1 is present in all 5 columns for population 1 and 0s in all the other columns.

When there are 2 pops with all 1 (777 see below example) or 1 pop with all 1 but another pop with non-zero data (778), you'll need to further filter them out:

> library(data.table)
> dt1 <- fread("
+ POS     pop1  pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
+ 746     1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0           
+ 762     0       0       0       0       0       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0
+ 777     1       1       1       1       1       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0
+ 778     1       1       1       1       1       1       1       0       0       0       0      0       0       0       0       0       0       0       0       0
+ 779     0       1       1       1       1       0       0       0       0       0       0      0       0       0       0       0       0       0       0       0
+ ")
> melt(dt1, id.vars = "POS")[, list(countOnes = sum(value == 1), countPops = .N),
+                            by = .(POS, variable)][ countOnes == countPops, ]
   POS variable countOnes countPops
1: 746     pop1         5         5
2: 777     pop1         5         5
3: 778     pop1         5         5
4: 762     pop3         5         5
5: 777     pop3         5         5

An alternative but not so fancy way can be:

library(magrittr)

# Get pop size (number of total columns)
df <- read.delim("file.tsv", check.names = FALSE, row.names = 1)
nf <- table(colnames(df))

# Calculate sum for each pop across rows
df_rowsum <- t(rowsum(t(df), group = colnames(df)))
df_rowsum
#     pop1 pop2 pop3 pop4
# 746    5    0    0    0
# 762    0    0    5    0
# 777    5    0    5    0
# 778    5    2    0    0
# 779    4    0    0    0

# Get rows with all 0 sum except for 1 pop
df_rowsum_check <- rowSums(df_rowsum == 0) == (ncol(df_rowsum) - 1)

# Get rows with sum equal to pop size
df_popsize <- df_rowsum %in% nf[colnames(df_rowsum)] %>%
  matrix(ncol = ncol(df_rowsum)) %>%
  as.data.frame()
colnames(df_popsize) <- colnames(df_rowsum)
rownames(df_popsize) <- rownames(df_rowsum)
df_popsize_check <- rowSums(df_popsize) == "1"

# Select the positions that meet both
df_rowsum_check & df_popsize_check
#  746   762   777   778   779 
# TRUE  TRUE FALSE FALSE FALSE