4.9 years ago
pushu1bawa
I want to compare each row of a file with every other row and count the elements they have in common.
Input file:
V1 V2 V3 V4 V5
sample_1 AA TT AT TC CC
sample_2 TT AG CT GG
sample_3 AA AT TT
sample_4 GG CC AA TT AT
Expected output:
sample_1 sample_2 sample_3 sample_4
sample_1 4 1 3 4
sample_2 1 4 1 2
sample_3 2 1 3 3
sample_4 4 1 3 5
Please make the post clearer. You can use the code button for editing.
Please edit your post and add what you've tried so far. As such, this is purely an R question and could be closed for that reason.
Hint: reshape2::melt() should be really useful here. That or tidyr::gather(). You'll need to use melt() and colsplit() / gather() and separate() to get from wide-form data -> long-form data -> analysis -> wide-form results.

Yes, it is mostly a coding issue. This is what I have tried so far: I have binned my BAM file (10 kb bins) and found the barcodes (10 bp sequences) from my BAM file in each bin. So my input file has coordinates as row names and columns containing barcode sequences. I want to compare each bin (row) to every other bin to find the number of barcodes common between the two rows. The desired output is a matrix whose row names and column names are the row names of the input, and each element of the matrix represents the number of overlapping barcodes.
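For what it's worth, the wide -> long -> counts -> wide route from the hint can be sketched in base R roughly as below. This is only a sketch on the toy data from the question: the names calls, long, incidence and shared are made up for illustration, and it uses stack() and table() in place of melt()/gather(), so it is not the exact recipe from the hint.

```r
# Toy input copied from the question; samples have different numbers of elements.
calls <- list(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG"),
  sample_3 = c("AA", "AT", "TT"),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)

# Wide -> long: one row per (element, sample) pair.
long <- stack(calls)                  # columns: values, ind
names(long) <- c("element", "sample")
long <- unique(long)                  # count each element once per sample

# Long -> incidence matrix (samples x elements, entries 0/1).
incidence <- unclass(table(long$sample, long$element))

# Incidence -> pairwise counts: entry [i, j] is the number of
# elements that sample i and sample j share.
shared <- tcrossprod(incidence)
print(shared)
```

The diagonal here is simply the number of distinct elements in each sample, so on this toy input sample_1 gets 5 on the diagonal, not the 4 shown in the posted expected output. If the diagonal is supposed to mean something else, that step would need adjusting.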
I'm sorry, I cannot invest the time it takes to investigate your custom code and why it doesn't work on your dataset. Like I said, going to long form, aggregating to get your results and transforming those results to wide form will be the reproducible way to go.
The first thing I see when I look at the function findMatch is the undeclared dependency on the object data. The function only takes arguments i and n but operates on i, n and data. This means that it depends on the environment to have a specific type of dataset named data, which breaks reproducibility. Plus, your lapply call passes in a constant value for n, so that parameter is useless in what seems to be a function built specially for this use case.
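Since the original findMatch is not shown in this thread, the following is only a sketch of what passing the data in explicitly might look like. The function name count_shared, its arguments, and the rows list are all hypothetical and stand in for whatever the real code uses.

```r
# Hypothetical helper: everything it needs arrives as an argument,
# so it no longer relies on a global object called `data`.
count_shared <- function(rows, i, j) {
  length(intersect(rows[[i]], rows[[j]]))
}

# Toy input copied from the question (assumed to be a named list of vectors).
rows <- list(
  sample_1 = c("AA", "TT", "AT", "TC", "CC"),
  sample_2 = c("TT", "AG", "CT", "GG"),
  sample_3 = c("AA", "AT", "TT"),
  sample_4 = c("GG", "CC", "AA", "TT", "AT")
)

ids <- names(rows)

# Compare every row with every other row; vapply() checks that each
# call returns the expected type and length.
shared <- vapply(ids, function(j) {
  vapply(ids, function(i) count_shared(rows, i, j), integer(1))
}, integer(length(ids)))

print(shared)
```

Because rows is an argument, the same function can be run on any other named list without touching the global environment, and there is no leftover parameter that is always passed the same constant.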