How to find pattern matching over range of columns but not present in other columns?
2
3
Entering edit mode
3.0 years ago

Hello,

I have a huge file with 0s and 1s. The position where there is a snp is given 1 and others are 0. Also, I have different populations in the same tsv file. My goal is to find signature SNPs present in a particular population (say population 1 is from column1 to column5). So I need the positions where 1 is present in all 5 columns for population 1 and 0s in all the other columns.

POS     pop1  pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
746     1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
762     0       0       0       0       0       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0


So my result for the signature snp position for pop1 would be 746 and for pop3 would be 762. Are there any tools to do this ? I tried using awk for this but it gave an error. My awk command is as follows ;

$awk -F '\t' '$1,$2,$3,$4.$5 ~ /1/' file.tsv


Is there a way where I can specify range of columns in awk pattern match ?

I hope my question is clear. Any help would be appreciated. Thanks in advance.

snp R shell • 1.5k views
ADD COMMENT
0
Entering edit mode

Is the number of column per population equal for all population or do this vary? Do you know this number before starting?

ADD REPLY
0
Entering edit mode

They not equal but I know the number before starting

ADD REPLY
2
Entering edit mode

You could start by summing up the numbers of 1s per group and then filter for groups where the sum of 1s is identical to the number of elements per group like for pop1 take only rows where sum is 5. I strongly suggest to solve these kinds of things yourself, you will see it enhances your skills and enables you to abstract the knowledge to other problems. Stuff like that you'll probably encounter repetitively during your career.

ADD REPLY
5
Entering edit mode
3.0 years ago
zx8754 11k

Using data.table, reshape wide-to-long, then get number of 1s and number of rows per "POS, pop", if numbers equal then subset.

library(data.table)

dt1 <- fread("
POS     pop1  pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
746     1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
762     0       0       0       0       0       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0
")

melt(dt1, id.vars = "POS")[, list(countOnes = sum(value == 1), countPops = .N),
by = .(POS, variable)][ countOnes == countPops, ]

#    POS variable countOnes countPops
# 1: 746     pop1         5         5
# 2: 762     pop3         5         5

ADD COMMENT
2
Entering edit mode
3.0 years ago
AK ★ 2.1k

Would there be two pops that are with all 1, or 1 pop with all 1 and another pop with some non-zero data? zx8754's solution is great, but note that:

So I need the positions where 1 is present in all 5 columns for population 1 and 0s in all the other columns.

When there are 2 pops with all 1 (777 see below example) or 1 pop with all 1 but another pop with non-zero data (778), you'll need to further filter them out:

> library(data.table)
> dt1 <- fread("
+ POS     pop1  pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
+ 746     1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
+ 762     0       0       0       0       0       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0
+ 777     1       1       1       1       1       0       0       0       0       0       1      1       1       1       1       0       0       0       0       0
+ 778     1       1       1       1       1       1       1       0       0       0       0      0       0       0       0       0       0       0       0       0
+ 779     0       1       1       1       1       0       0       0       0       0       0      0       0       0       0       0       0       0       0       0
+ ")
> melt(dt1, id.vars = "POS")[, list(countOnes = sum(value == 1), countPops = .N),
+                            by = .(POS, variable)][ countOnes == countPops, ]
POS variable countOnes countPops
1: 746     pop1         5         5
2: 777     pop1         5         5
3: 778     pop1         5         5
4: 762     pop3         5         5
5: 777     pop3         5         5


An alternative but not so fancy way can be:

library(magrittr)

# Get pop size (number of total columns)
df <- read.delim("file.tsv", check.names = FALSE, row.names = 1)
nf <- table(colnames(df))

# Calculate sum for each pop across rows
df_rowsum <- t(rowsum(t(df), group = colnames(df)))
df_rowsum
#     pop1 pop2 pop3 pop4
# 746    5    0    0    0
# 762    0    0    5    0
# 777    5    0    5    0
# 778    5    2    0    0
# 779    4    0    0    0

# Get rows with all 0 sum except for 1 pop
df_rowsum_check <- rowSums(df_rowsum == 0) == (ncol(df_rowsum) - 1)

# Get rows with sum equal to pop size
df_popsize <- df_rowsum %in% nf[colnames(df_rowsum)] %>%
matrix(ncol = ncol(df_rowsum)) %>%
as.data.frame()
colnames(df_popsize) <- colnames(df_rowsum)
rownames(df_popsize) <- rownames(df_rowsum)
df_popsize_check <- rowSums(df_popsize) == "1"

# Select the positions that meet both
df_rowsum_check & df_popsize_check
#  746   762   777   778   779
# TRUE  TRUE FALSE FALSE FALSE

ADD COMMENT
0
Entering edit mode

Thank you very much @SMK

ADD REPLY

Login before adding your answer.

Traffic: 1462 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6