How to find matching strings from two files, and if they match take the whole line/row in R
4.7 years ago
Hann ▴ 110

Hello all,

I have never used for loops and if statements in R, so I am struggling to solve my problem.

I have three files. Two of them have a single column of gene IDs:

File #1

subgenA
Dexi1A01G0002580.1
Dexi1A01G0002590.1
Dexi1A01G0002600.1
Dexi1A01G0012800.1
Dexi1A01G0002620.1
Dexi1A01G0002630.1

File #2

subgenB
Dexi1B01G0011330.1
Dexi1B01G0012110.1
Dexi1B01G0011360.1
Dexi1B01G0011370.1
Dexi1B01G0011380.1
Dexi1B01G0011390.1
Dexi1B01G0011400.1

And the last one, File #3:

gene1   specie1 subg1   chr1    start1  end1    gene2   specie2 subg2   chr2    start2  end2    evalue
Dexi1A01G0012690.1  Dexi    A   Dexi01A 12893985    12896212    Dexi1B01G0012090.1  Dexi    B   Dexi01B 13544586    13546816    0
Dexi1A01G0012800.1  Dexi    A   Dexi01A 14119271    14119814    Dexi1B01G0012110.1  Dexi    B   Dexi01B 13958055    13968706    1.00E-161
Dexi1A01G0012810.1  Dexi    A   Dexi01A 14291305    14294110    Dexi1B01G0012120.1  Dexi    B   Dexi01B 14177979    14180783    0
Dexi1A01G0012820.1  Dexi    A   Dexi01A 14309988    14316932    Dexi1B01G0012130.1  Dexi    B   Dexi01B 14218846    14225842    0
Dexi1A01G0012830.1  Dexi    A   Dexi01A 14482802    14483596    Dexi1B01G0012140.1  Dexi    B   Dexi01B 14281307    14281827    0
Dexi1A01G0012850.1  Dexi    A   Dexi01A 14563049    14567456    Dexi1B01G0012150.1  Dexi    B   Dexi01B 14313758    14318354    0
Dexi1A01G0012860.1  Dexi    A   Dexi01A 14568254    14568975    Dexi1B01G0012160.1  Dexi    B   Dexi01B 14319317    14320718    6.00E-48

I want to use a for loop and an if statement to get the following:

If a gene from File 1 (subgenA) is present in column 1 of File 3, and a gene from File 2 (subgenB) is present in column 7 of the same row of File 3, then take the whole line/row from File 3.

So for this example, the result would be:

Dexi1A01G0012800.1  Dexi    A   Dexi01A 14119271    14119814    Dexi1B01G0012110.1  Dexi    B   Dexi01B 13958055    13968706    1.00E-161
R • 1.5k views
4.7 years ago
Brice Sarver ★ 3.8k

A function applied across the genes will be cleaner. This uses lapply() to run the function and rbindlist() from data.table to combine the results. Several specifics will most likely need to be changed (arguments to read.table() such as sep, header, etc.). This is one solution that doesn't explicitly use for loops (lapply() does a similar job).

Hope this helps guide you in the right direction.

Edit: comments and formatting


library(data.table)

# read in files (header = TRUE because each file starts with a column name)
a <- read.table("file1", header = TRUE, stringsAsFactors = FALSE) # or vector, depending on data structure
b <- read.table("file2", header = TRUE, stringsAsFactors = FALSE) # or vector, depending on data structure
d <- read.table("file3", header = TRUE, stringsAsFactors = FALSE)

# works whether a gene shows up once, several times, or not at all in file3

# define function
get_gene <- function(gene) {

  # rows of file3 where the gene is in the first gene column
  pos1 <- which(d$gene1 == gene)

  # of those rows, keep the ones whose gene2 is listed in file2
  pos <- pos1[d$gene2[pos1] %in% b$subgenB]

  # return the matching rows (zero rows if there is no match)
  data.table(d[pos, ])
}

# lapply across the first vector of genes
res <- lapply(a$subgenA, get_gene)
# coerce list of data.tables to a single data.table
final <- rbindlist(res)
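
For comparison, the same filter can probably be written without a helper function by building one logical vector over the rows of file 3. This is only a sketch: the toy data frames below copy a few IDs from the question (with most of the File 3 columns dropped for brevity), and the object names a, b and d simply mirror the ones used above.

```r
# Toy stand-ins for the three files; real data would come from read.table()
a <- data.frame(subgenA = c("Dexi1A01G0002580.1", "Dexi1A01G0012800.1"),
                stringsAsFactors = FALSE)
b <- data.frame(subgenB = c("Dexi1B01G0011330.1", "Dexi1B01G0012110.1"),
                stringsAsFactors = FALSE)
d <- data.frame(
  gene1  = c("Dexi1A01G0012690.1", "Dexi1A01G0012800.1", "Dexi1A01G0012810.1"),
  gene2  = c("Dexi1B01G0012090.1", "Dexi1B01G0012110.1", "Dexi1B01G0012120.1"),
  evalue = c(0, 1e-161, 0),
  stringsAsFactors = FALSE
)

# TRUE for rows where gene1 is in file 1 AND gene2 (same row) is in file 2
keep <- d$gene1 %in% a$subgenA & d$gene2 %in% b$subgenB

# subset file 3 once with the logical vector
final <- d[keep, ]
print(final)
```

Because %in% is vectorized, the row-by-row test happens inside a single expression, which tends to scale better than calling a function once per gene.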

This is pretty close, but it doesn't give what I need. Maybe my explanation wasn't clear.

However, my friend suggested this simple R code, which gave me what I want:

all <- read.table("file3.txt", stringsAsFactors = FALSE, header = TRUE)
a <- read.table("file1.txt", stringsAsFactors = FALSE, header = TRUE)
b <- read.table("file2.txt", stringsAsFactors = FALSE, header = TRUE)
results <- c()
for (r in seq_len(nrow(all))) {
  # keep the row only if gene1 is in file1 and gene2 is in file2 (exact match)
  if (all[r, 1] %in% a$subgenA && all[r, 7] %in% b$subgenB) {
    results <- rbind(results, all[r, ])
  }
}
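
For anyone who wants to keep the explicit for loop and if statement, one possible variation is to record the matching row numbers inside the loop and subset file 3 once at the end, instead of growing a data frame with rbind() on every hit. The sketch below is an illustration under assumptions: the inline text stands in for the three files (only three columns of File 3 are kept), and the names file3 and hits are made up for this example. It uses %in%, which tests exact equality, whereas a pattern-matching function such as grep() would treat the "." in an ID as a wildcard.

```r
# Toy versions of the three files, read from inline text instead of disk
a <- read.table(text = "subgenA
Dexi1A01G0002580.1
Dexi1A01G0012800.1", header = TRUE, stringsAsFactors = FALSE)
b <- read.table(text = "subgenB
Dexi1B01G0011330.1
Dexi1B01G0012110.1", header = TRUE, stringsAsFactors = FALSE)
file3 <- read.table(text = "gene1 gene2 evalue
Dexi1A01G0012690.1 Dexi1B01G0012090.1 0
Dexi1A01G0012800.1 Dexi1B01G0012110.1 1.00E-161
Dexi1A01G0012810.1 Dexi1B01G0012120.1 0", header = TRUE, stringsAsFactors = FALSE)

hits <- integer(0)  # row numbers of file3 that pass both tests
for (r in seq_len(nrow(file3))) {
  in_a <- file3$gene1[r] %in% a$subgenA  # gene1 of this row listed in file 1?
  in_b <- file3$gene2[r] %in% b$subgenB  # gene2 of this row listed in file 2?
  if (in_a && in_b) {
    hits <- c(hits, r)
  }
}

# pull all matching rows out of file3 in one step
results <- file3[hits, ]
print(results)
```

With real files, the three read.table(text = ...) calls would be replaced by reads from disk, and gene1/gene2 would be columns 1 and 7 of the full table.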