comparing every row in a file to find common elements
4.9 years ago
pushu1bawa • 0

I want to compare each row of a file with every other row and count the elements they have in common.

Input file:

          V1  V2  V3  V4  V5
sample_1  AA  TT  AT  TC  CC
sample_2  TT  AG  CT  GG
sample_3  AA  AT  TT
sample_4  GG  CC  AA  TT  AT

Expected output:

          sample_1  sample_2  sample_3  sample_4
sample_1  4         1         3         4
sample_2  1         4         1         2
sample_3  2         1         3         3
sample_4  4         1         3         5
Assembly genome sequence sequencing • 993 views

Please make the post clearer. You can use the code button for editing.


Please edit your post and add what you've tried so far. As such, this is purely an R question and could be closed for that reason.

Hint: reshape2::melt() should be really useful here. That or tidyr::gather(). You'll need to use melt() and colsplit()/gather() and separate() to get from wide-form data -> long-form data -> analysis -> wide-form results.
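The wide -> long -> analysis -> wide pipeline described above can be sketched in base R, without reshape2 or tidyr. This is only an illustration under stated assumptions: the toy data mirrors the input in the question, empty cells are assumed to be read in as NA, and the names `wide`, `long`, `incidence` and `shared` are made up for this example. The counts come straight from the toy rows, so they may not match a hand-written table in every cell.

```r
# Toy input mirroring the question; NA marks empty cells
# (an assumption about how the blanks are read in).
wide <- data.frame(
  V1 = c("AA", "TT", "AA", "GG"),
  V2 = c("TT", "AG", "AT", "CC"),
  V3 = c("AT", "CT", "TT", "AA"),
  V4 = c("TC", "GG", NA,   "TT"),
  V5 = c("CC", NA,   NA,   "AT"),
  row.names = paste0("sample_", 1:4),
  stringsAsFactors = FALSE
)

# Wide -> long: one row per (sample, barcode) pair.
# unlist() walks the data frame column by column, so the row names
# are repeated once per column to stay aligned.
long <- data.frame(
  sample  = rep(rownames(wide), times = ncol(wide)),
  barcode = unlist(wide, use.names = FALSE),
  stringsAsFactors = FALSE
)
# Drop empty cells and duplicated barcodes within a sample.
long <- unique(long[!is.na(long$barcode), ])

# Analysis: 0/1 incidence matrix, samples in rows, barcodes in columns.
incidence <- unclass(table(long$sample, long$barcode))

# Long -> wide result: entry [i, j] is the number of barcodes
# shared by sample i and sample j; the diagonal is each sample's own count.
shared <- tcrossprod(incidence)
print(shared)
```

The cross-product step avoids an explicit loop over row pairs, which matters when there are many bins; the trade-off is that the incidence matrix needs one column per distinct barcode.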


Yes, it is mostly a coding issue. This is what I have tried so far. I have binned my BAM file (10 kb bins) and found the barcodes (10 bp sequences) from my BAM file in each bin. So my input file has coordinates as row names and columns containing barcode sequences. I want to compare each bin (row) with every other bin to find the number of barcodes common to the two rows. The desired output is a matrix whose row names and column names are the row names of the input, and each element of the matrix is the number of overlapping barcodes.

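library(data.table)

# For row i, count position-wise matches against every later row of the global `data`.
findMatch <- function(i, n) {
  tmp <- colSums(t(data[-(1:i), ]) == unlist(data[i, ]))
  tmp <- tmp[tmp > n]  # keep only rows with more than n matches
  if (length(tmp) > 0) return(data.table(sample = rownames(data)[i], duplicate = names(tmp), match = tmp))
  return(NULL)
}

tab <- rbindlist(lapply(1:(nrow(data) - 1), findMatch, n = 1))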

I'm sorry, I cannot invest the time it takes to investigate your custom code and why it doesn't work on your dataset. Like I said, going to long form, aggregating to get your results and transforming those results to wide form will be the reproducible way to go.

The first thing I see when I look at the function findMatch is the undeclared dependency on the object data. The function only takes arguments i and n but operates on i, n and data. This means that it depends on the environment to have a specific type of dataset named data, which breaks reproducibility. Plus, your lapply call passes in a constant value for n, so that parameter is useless in what seems to be a function built specially for this use case.
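To illustrate the point about hidden dependencies, here is a minimal sketch of a function that receives its input as an argument instead of reading an object named data from the environment. It is not a repaired findMatch: it swaps the position-wise comparison for a plain set intersection per pair of rows, which is one reading of "common barcodes". The names `count_shared`, `barcodes` and `toy` are hypothetical, and empty cells are assumed to be NA or empty strings.

```r
# Count barcodes shared between every pair of rows. The input is passed in
# explicitly, so the function does not depend on the calling environment.
count_shared <- function(barcodes) {
  # One vector of unique, non-empty barcodes per row.
  sets <- lapply(seq_len(nrow(barcodes)), function(i) {
    x <- unlist(barcodes[i, ], use.names = FALSE)
    unique(x[!is.na(x) & nzchar(x)])
  })
  names(sets) <- rownames(barcodes)

  # Size of the intersection for every (i, j) pair of rows.
  idx <- seq_along(sets)
  out <- outer(idx, idx, Vectorize(function(i, j) {
    length(intersect(sets[[i]], sets[[j]]))
  }))
  dimnames(out) <- list(names(sets), names(sets))
  out
}

# Toy character matrix mirroring the input in the question.
toy <- rbind(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG", NA),
  sample_3 = c("AA", "AT", "TT", NA,   NA),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)
res <- count_shared(toy)
print(res)
```

Because the data is an argument, the same function can be run on any matrix or data frame of barcodes and tested in isolation.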
