Question: comparing every row in a file to find common elements
16 months ago, pushu1bawa 0 wrote:

I want to compare each row of a file to find elements that are common. Input file:

          V1  V2  V3  V4  V5
sample_1  AA  TT  AT  TC  CC
sample_2  TT  AG  CT  GG
sample_3  AA  AT  TT
sample_4  GG  CC  AA  TT  AT

Expected output

          sample_1  sample_2  sample_3  sample_4
sample_1  4         1         3         4
sample_2  1         4         1         2
sample_3  2         1         3         3
sample_4  4         1         3         5
modified 16 months ago by Biostar ♦♦ 20 • written 16 months ago by pushu1bawa 0

Please make the post clearer. You can use the code button for editing.

written 16 months ago by Asaf 8.4k

Please edit your post and add what you've tried so far. As such, this is purely an R question and could be closed for that reason.

Hint: reshape2::melt() should be really useful here. That or tidyr::gather(). You'll need to use melt() and colsplit()/gather() and separate() to get from wide-form data -> long-form data -> analysis -> wide-form results.

modified 16 months ago • written 16 months ago by RamRS 30k
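As an illustration of the wide -> long -> analysis -> wide route suggested in the hint, here is a minimal sketch. It uses base R stand-ins (`data.frame()`, `merge()`, `table()`) instead of `melt()`/`gather()` so that it runs without extra packages. The object names `wide`, `long`, `pairs` and `shared` are made up for this example, and the toy input is copied from the question with empty cells assumed to be `NA`.

```r
# Toy input copied from the question; NA marks the empty cells (an assumption).
wide <- rbind(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG", NA),
  sample_3 = c("AA", "AT", "TT", NA, NA),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)
colnames(wide) <- paste0("V", 1:5)

# Wide -> long: one row per (sample, barcode) pair, in column-major order.
long <- data.frame(
  sample  = rownames(wide)[row(wide)],
  barcode = as.vector(wide),
  stringsAsFactors = FALSE
)
# Drop empty cells and repeated barcodes within a sample.
long <- unique(long[!is.na(long$barcode) & long$barcode != "", ])

# Analysis: self-join on barcode, so each shared barcode gives one row per sample pair.
pairs <- merge(long, long, by = "barcode", suffixes = c("_a", "_b"))

# Long -> wide: count the joined rows for every pair of samples.
shared <- unclass(table(pairs$sample_a, pairs$sample_b))
print(shared)
```

On the toy input this gives 5 on the sample_1 diagonal and 3 for sample_3 vs sample_1, where the expected output in the question lists 4 and 2; all other cells agree. Those two cells may be slips in the hand-written table, or may reflect a rule the post does not state.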

Yes, it is mostly a coding issue. This is what I have tried so far: I binned my BAM file into 10 kb bins and found the barcodes (10 bp sequences) from the BAM file that fall in each bin. So my input file has coordinates as row names and columns containing barcode sequences. I want to compare each bin (row) with every other bin to find the number of barcodes common to the two rows. The desired output is a matrix whose row names and column names are the row names of the input, with each element giving the number of overlapping barcodes.

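library(data.table)

# Note: relies on a global object named `data` (one row per bin/sample).
findMatch <- function(i, n) {
  # Position-wise comparison of row i against every later row
  tmp <- colSums(t(data[-(1:i), ]) == unlist(data[i, ]))
  tmp <- tmp[tmp > n]
  if (length(tmp) > 0) return(data.table(sample = rownames(data)[i], duplicate = names(tmp), match = tmp))
  return(NULL)
}
tab <- rbindlist(lapply(1:(nrow(data) - 1), findMatch, n = 1))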
modified 16 months ago by finswimmer 13k • written 16 months ago by pushu1bawa 0
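One thing worth checking in the attempt above: `==` between two rows compares them position by position, so a barcode only counts as a match when it sits in the same column in both rows. A tiny sketch of the difference, using two made-up vectors `a` and `b` (not taken from the poster's data):

```r
# Two hypothetical rows holding the same three barcodes in a different order.
a <- c("AA", "TT", "AT")
b <- c("AT", "TT", "AA")

# Position-wise comparison: only the middle element lines up.
positional <- sum(a == b)

# Set-wise comparison: every barcode in a also occurs somewhere in b.
setwise <- length(intersect(a, b))

print(c(positional = positional, setwise = setwise))
```

If the goal is "number of barcodes two bins have in common regardless of column", the set-wise count is probably the one that matches the expected output in the question.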

I'm sorry, I cannot invest the time it takes to investigate your custom code and why it doesn't work on your dataset. Like I said, going to long form, aggregating to get your results and transforming those results to wide form will be the reproducible way to go.

The first thing I see when I look at the function findMatch is the undeclared dependency on the object data. The function only takes arguments i and n but operates on i, n and data. This means that it depends on the environment to have a specific type of dataset named data, which breaks reproducibility. Plus, your lapply call passes in a constant value for n, so that parameter is useless in what seems to be a function built specially for this use case.

modified 16 months ago • written 16 months ago by RamRS 30k
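To make the point about hidden dependencies concrete, here is a sketch of a function that receives the table as an argument instead of reading a global `data`. The name `count_shared` and its argument `wide` are hypothetical. It also swaps in set intersection for the position-wise `==` used in `findMatch`, so it is not the original poster's method, and it assumes empty cells are `NA` or `""`.

```r
# Count barcodes shared by every pair of rows.
# `wide` is a matrix or data frame: one row per bin/sample, barcodes in the columns.
count_shared <- function(wide) {
  # Unique, non-missing barcodes of each row.
  sets <- lapply(seq_len(nrow(wide)), function(i) {
    x <- unlist(wide[i, ], use.names = FALSE)
    unique(x[!is.na(x) & x != ""])
  })
  n <- length(sets)
  out <- matrix(0L, n, n, dimnames = list(rownames(wide), rownames(wide)))
  # Fill every cell with the size of the pairwise intersection.
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- length(intersect(sets[[i]], sets[[j]]))
    }
  }
  out
}

# Usage on the toy input from the question (NA for the empty cells).
wide <- rbind(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG", NA),
  sample_3 = c("AA", "AT", "TT", NA, NA),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)
result <- count_shared(wide)
print(result)
```

Because everything the function needs comes in through `wide`, the same call works on any table of this shape. The double loop is quadratic in the number of rows, which is fine for a handful of samples but would need rethinking for genome-wide 10 kb bins.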