How to find which files in a directory contain the patterns listed in a specific file?
3.9 years ago
newbie ▴ 120

I have a directory named Analysis. Inside this directory I have some files like below:

Analysis
  |_____file1.csv
  |_____file2.csv
  |_____file3.csv
  |_____ReqGenes.csv

file1.csv, file2.csv, and file3.csv contain the following information.

file1.csv looks like below:

LINC01419
AAR2
AC008560.1
ACTRT3
AKAP17A
AL139353.1
ARG2
ATE1
BORA

file2.csv looks like below:

DUSP28
EID2B
ELOVL6
FAM118B
FAM200A
FDXACB1
FKBP1B
FRAT1
FSD1L

file3.csv looks like below:

KDM4D
KLF12
KLLN
LRRC55
LRRIQ3
MBTPS2
MORN2
MRPS17
MRPS6
MTX3

I usually check whether a specific pattern LINC01419 exists in any of the files in the directory like below:

grep -E "LINC01419" *.csv

The output is like below:

file1.csv: LINC01419

But instead of searching for each gene separately, I have a file named ReqGenes.csv (shown below) that lists all the required genes. With one command, I would like to find out which files each of these genes is present in.

Genes
LINC01419
MORN2
MTX3
FSD1L
FAM118B
EID2B
ARG2
KLLN
MRPS6
ATE1

The output I need should be like below:

file1.csv: LINC01419, ARG2, ATE1
file2.csv: FSD1L, FAM118B, EID2B
file3.csv: MORN2, KLLN, MRPS6, MTX3
linux grep find xargs • 1.3k views

Thanks. Yes, I tried the following:

grep -f ReqGenes.csv file*.csv

And the output is like below:

file1.csv: ATE1

I don't know why it gave output for only the last gene in ReqGenes.csv. Why didn't it give output for the other genes?
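
One possible cause, offered as an assumption because the thread never confirms it: if ReqGenes.csv was saved with Windows (CRLF) line endings, every pattern line carries an invisible trailing carriage return, so grep searches for "ARG2" followed by a carriage return rather than for "ARG2". Only a final line with no line ending at all would then match, which fits the symptom of seeing just the last gene. A minimal sketch of a workaround follows; the function name search_genes is hypothetical.

```shell
# Hypothetical helper: search the files given after the first argument for the
# genes listed in the first argument, ignoring Windows carriage returns.
search_genes() {
    patterns=$1
    shift
    # Temporary copy of the pattern list with carriage returns removed.
    cleaned=$(mktemp) || return 1
    tr -d '\r' < "$patterns" > "$cleaned"
    # -F: treat patterns as fixed strings, -w: whole-word matches only,
    # -f: read one pattern per line from the cleaned file.
    grep -F -w -f "$cleaned" "$@"
    status=$?
    rm -f "$cleaned"
    return "$status"
}
```

It would be called as search_genes ReqGenes.csv file*.csv. To check whether the assumption holds at all, the file utility usually reports "with CRLF line terminators" for such files.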

3.9 years ago
gtasource ▴ 60

If you don't mind using R, this will do the trick. Install the packages first:

install.packages("tidyverse")
install.packages("magrittr")

And then run the code:

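library(tidyverse)  # attaches dplyr, whose intersect() works on data frames
library(magrittr)   # provides set_colnames()

# Gene-list files to search (file1.csv, file2.csv, ...)
genes <- list.files(pattern = "file\\d*\\.csv")
genes.read <- lapply(genes, function(x) read.delim(x, header = FALSE))
# Name the single column "Genes" so it matches the header in ReqGenes.csv
genes.read <- lapply(genes.read, function(x) set_colnames(x, "Genes"))
# Reference list of required genes (its first line is the header "Genes")
ref <- list.files(pattern = "Req")
ref.read <- read.delim(ref)
# Rows shared by each gene file and the reference list
intersect <- lapply(seq_along(genes.read), function(x) 
  intersect(genes.read[[x]], ref.read))
# Print each file name followed by the genes found in it
for(i in seq_along(genes.read)) { 
  cat(genes[[i]], ":", intersect[[i]]$Genes, "\n")
}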

Output

file1.csv : LINC01419 ARG2 ATE1 
file2.csv : EID2B FAM118B FSD1L 
file3.csv : KLLN MORN2 MRPS6 MTX3

Thanks for the reply. A small correction to your code:

for(i in 1:length(genes.read)) { 
  cat(genes[[i]],":",intersect[[i]]$gene_name, "\n")
}

Thanks for the catch!! It should still be

intersect[[i]]$Genes

Because that's the name of the column in the intersect data frame.


Yes. In my original file the column is named gene_name; here it is Genes. My mistake.

3.9 years ago
for F in file*.csv; do echo -n "${F}:" && grep -F -f ReqGenes.csv -w "${F}" | tr "\n" "," ; echo ; done
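
For output in exactly the format the question asks for (file1.csv: LINC01419, ARG2, ATE1), the same idea can be extended as sketched below. This is a variation on the one-liner above, not the poster's code. The function name genes_per_file is hypothetical, and the sketch assumes one gene per line in every file and no blank lines in ReqGenes.csv. Carriage returns are stripped first in case any of the files has Windows line endings, which would be one explanation for a loop that prints file names but no genes.

```shell
# Hypothetical helper: print "<file>: gene1, gene2, ..." for each file that
# contains at least one gene from the pattern list given as first argument.
genes_per_file() {
    patterns=$1
    shift
    cleaned=$(mktemp) || return 1
    # Drop carriage returns so CRLF-terminated pattern lines still match.
    tr -d '\r' < "$patterns" > "$cleaned"
    for f in "$@"; do
        # -F fixed strings, -w whole words, -f patterns from file;
        # awk joins the matching lines with ", ".
        hits=$(tr -d '\r' < "$f" | grep -F -w -f "$cleaned" |
               awk 'NR > 1 { printf ", " } { printf "%s", $0 }')
        # Report only files that contain at least one required gene.
        if [ -n "$hits" ]; then
            printf '%s: %s\n' "$f" "$hits"
        fi
    done
    rm -f "$cleaned"
}
```

It would be called as genes_per_file ReqGenes.csv file*.csv. Genes are listed in the order they appear in each searched file, not in the order of ReqGenes.csv.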

Thanks, but with this I only see output like below:

file1.csv:
file2.csv:
file3.csv:

Do I need to remove the echo?


Play around with the code and see if you can come up with a solution yourself. We aren't here to spoon-feed solutions (and to be fair, the code that has been provided is more than enough for you to figure the rest out).


Thanks for the suggestion. I'm not a programmer yet; I'm still learning and couldn't work it out, mainly because I'm very new to Linux and don't know much about it yet.
