Question

Comparing every row in a file to find common elements

4.9 years ago

pushu1bawa • 0

I want to compare each row of a file to find elements that are common.

Input file:

          V1  V2  V3  V4  V5
sample_1  AA  TT  AT  TC  CC
sample_2  TT  AG  CT  GG
sample_3  AA  AT  TT
sample_4  GG  CC  AA  TT  AT

Expected output

          sample_1  sample_2  sample_3  sample_4
sample_1  4         1         3         4
sample_2  1         4         1         2
sample_3  2         1         3         3
sample_4  4         1         3         5

Assembly genome sequence sequencing • 994 views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 4.9 years ago by pushu1bawa • 0

Please make the post clearer. You can use the code button for editing.

ADD REPLY • link 4.9 years ago by Asaf 10k

Please edit your post and add what you've tried so far. As such, this is purely an R question and could be closed for that reason.

Hint: reshape2::melt() should be really useful here, or alternatively tidyr::gather(). You'll need melt() and colsplit() (or gather() and separate()) to get from wide-form data -> long-form data -> analysis -> wide-form results.
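
As a rough sketch of that wide -> long -> analysis -> wide route, the example below swaps in base R's stack(), table() and crossprod() for the reshape2/tidyr functions named above, so it is not the melt()/gather() approach itself. It assumes the four samples from the question are typed in as a named list; the object names `rows`, `long`, `membership` and `shared` are made up for illustration.

```r
# Toy data from the question: one character vector per sample (row).
rows <- list(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG"),
  sample_3 = c("AA", "AT", "TT"),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)

# Wide -> long: one (values, ind) row per element/sample pair.
# unique() drops repeats so each element counts once per sample.
long <- unique(stack(rows))

# Long -> 0/1 membership table with elements as rows and samples as columns.
membership <- table(long$values, long$ind)

# Analysis -> wide result: shared[i, j] is the number of elements
# that sample i and sample j have in common.
shared <- crossprod(membership)
print(shared)
```

The result is symmetric, and each diagonal entry is the number of distinct elements in that sample.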

ADD REPLY • link 4.9 years ago by Ram 43k

Yes, it is mostly a coding issue. This is what I have tried so far: I binned my BAM file (10 kb bins) and found the barcodes (10 bp sequences) from my BAM file in each bin. So my input file has coordinates as row names and columns containing barcode sequences. I want to compare each bin (row) with every other bin to find the number of barcodes common to the two rows. The desired output is a matrix whose row and column names are the row names of the input, where each element represents the number of overlapping barcodes.

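library(data.table)

# Note: relies on a global object `data` (rows = bins, columns = barcodes).
findMatch <- function(i, n) {
  # Compare row i position by position with every later row; count equal cells.
  tmp <- colSums(t(data[-(1:i), ]) == unlist(data[i, ]))
  # Keep only the rows with more than n matching positions.
  tmp <- tmp[tmp > n]
  if (length(tmp) > 0) return(data.table(sample = rownames(data)[i], duplicate = names(tmp), match = tmp))
  return(NULL)
}

tab <- rbindlist(lapply(1:(nrow(data) - 1), findMatch, n = 1))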

ADD REPLY • link updated 4.9 years ago by finswimmer 16k • written 4.9 years ago by pushu1bawa • 0

I'm sorry, I cannot invest the time it takes to investigate your custom code and why it doesn't work on your dataset. Like I said, going to long form, aggregating to get your results and transforming those results to wide form will be the reproducible way to go.

The first thing I see when I look at the function findMatch is the undeclared dependency on the object data. The function only takes arguments i and n but operates on i, n and data. This means that it depends on the environment to have a specific type of dataset named data, which breaks reproducibility. Plus, your lapply call passes in a constant value for n, so that parameter is useless in what seems to be a function built specially for this use case.
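
One way to remove that hidden dependency, sketched here under the assumption that the input is a character matrix whose short rows are padded with NA or empty strings, is to pass the data set in as an explicit argument. The helper name `countShared` and the toy matrix `barcodes` are hypothetical, and the helper counts shared elements with intersect() instead of reproducing the position-wise comparison in findMatch.

```r
# Hypothetical helper: the data set is an explicit argument, not a global.
countShared <- function(data, i, j) {
  a <- data[i, ]
  b <- data[j, ]
  # Drop padding cells (NA or "") before comparing the two rows as sets.
  a <- a[!is.na(a) & a != ""]
  b <- b[!is.na(b) & b != ""]
  length(intersect(a, b))
}

# Toy matrix from the question; short rows are padded with NA.
barcodes <- rbind(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG", NA),
  sample_3 = c("AA", "AT", "TT", NA, NA),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)

ids <- rownames(barcodes)
# Fill the sample-by-sample matrix by calling the helper on every pair.
shared <- outer(seq_along(ids), seq_along(ids),
                Vectorize(function(i, j) countShared(barcodes, i, j)))
dimnames(shared) <- list(ids, ids)
print(shared)
```

Because the function no longer reads anything from the calling environment, it can be run on any matrix of the same shape without first creating an object called data.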

ADD REPLY • link 4.9 years ago by Ram 43k