Compare two genotype data frames and calculate percentage of differences
5.6 years ago
dorarinyo88 ▴ 20

Hello there, I have two files with the same header and the same first column. I would like to compare them row by row and calculate the percentage of matching genotypes. I tried transposing the data to compare column by column using the compare function in R, and I also tried awk, but I couldn't get either to work. Could anyone please give me some hints or tips? Thanks

Set1

barcode SNP000072   SNP000119   SNP000179   SNP001106   SNP001150
165974-1    A:A A:A A:A G:G C:C
165974-2    A:A A:A A:A G:A C:C
165974-3    A:A A:A G:A G:A C:C
165974-4    A:A A:A A:A A:A C:A
165974-5    A:A A:C A:A G:A ?

Set2

barcode SNP000072   SNP000119   SNP000179   SNP001106   SNP001150
165974-1    A:A A:A A:A G:G C:C
165974-2    A:A A:A A:A G:A C:C
165974-3    A:A A:A A:A G:A C:C
165974-4    A:A A:A A:A A:A C:A
165974-5    A:A A:A A:A G:A C:C

Expected output

barcode     percentage(%)
165974-1    100
165974-2    100
165974-3    80
165974-4    100
165974-5    60
R linux awk • 1.6k views
5.6 years ago
join -t $'\t' -1 1 -2 1 \
    <(sort -t $'\t' -k1,1 input1.txt)\
    <(sort -t $'\t' -k1,1 input2.txt) |\
     grep -v barcode |\
    awk -F '\t'  'BEGIN{printf("barcode\tpercent\n");}{NSAMPLES=(NF-1)/2;T=0.0;for(i=2;i<2+NSAMPLES;i++) {j=i+NSAMPLES; if($i==$j) T++;} printf("%s\t%d\n",$1,100*(T/NSAMPLES));}'

barcode percent
165974-1    100
165974-2    100
165974-3    80
165974-4    100
165974-5    60

Thanks, it works. It is a great help.


I was wondering whether I can add one more condition: whenever there is a "?", the command skips it and does not include it in the final count. For example, for a row with 5 genotype columns where one of them is "?", only 4 should be counted. I know I need to add a counter, but I don't know where to make the changes. Thanks.

This is my modified version, but I know something is missing.

{NSAMPLES=(NF-1)/2;T=0.0;for(i=2;i<2+NSAMPLES;i++){j=i+NSAMPLES;if($i==$j && $i!="?" && $j!="?") T++;printf("%s\t%d\n",$1,100*(T/NSAMPLES));else if($i==$j && $i=="?" && $j!="?") for (N=0;NF=='?';N++);printf("%s\t%d\n",$1,100*(T/NSAMPLES-N));}'
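One possible way to skip "?" calls is to keep a second counter for the columns that were actually compared and divide by that counter instead of by the total number of columns. The sketch below is only an illustration, not the answerer's command: it replaces the join/sort pipeline with a single two-file awk pass, the function name percent_match_skip_missing is hypothetical, and it assumes both files are tab-separated, start with a header line, and list the SNP columns in the same order.

```shell
# Hypothetical helper: percent_match_skip_missing SET1 SET2
# Assumes tab-separated files with a header line and identical column order.
percent_match_skip_missing() {
    awk -F '\t' '
        BEGIN { printf("barcode\tpercent\n") }
        FNR == 1 { next }                      # skip the header line of each file
        NR == FNR { first[$1] = $0; next }     # first file: store each row by barcode
        ($1 in first) {                        # second file: compare with the stored row
            n = split(first[$1], a, FS)
            same = 0; compared = 0
            for (i = 2; i <= NF && i <= n; i++) {
                if (a[i] == "?" || $i == "?") continue   # missing call: do not count it
                compared++
                if (a[i] == $i) same++
            }
            if (compared > 0) printf("%s\t%d\n", $1, 100 * same / compared)
            else              printf("%s\tNA\n", $1)   # nothing comparable in this row
        }
    ' "$1" "$2"
}
```

Called as percent_match_skip_missing input1.txt input2.txt, this should report 75 for barcode 165974-5 in the example data (3 matches out of 4 comparable columns) instead of 60, while the other rows stay the same. Rows are printed in the order of the second file, and a barcode missing from the first file is silently dropped.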
5.6 years ago
ATpoint 81k

In R:

install.packages("data.table")
require(data.table)

## Load data:
set1 <- fread("/path/to/set1", sep=" ", header = T, data.table=F)
set2 <- fread("/path/to/set2", sep=" ", header = T, data.table=F)

## TRUE/FALSE decision:
tmp <- set1[,2:ncol(set1)] == set2[,2:ncol(set2)]

## Output dataframe (matches per row divided by the number of SNP columns):
output <- data.frame(set1[,1], 
                     100 * rowSums(tmp == "TRUE") / ncol(tmp))

## Column names:
colnames(output) <- c("barcode", "percentage (%)")

rowSums(tmp == "TRUE") should be rowSums(tmp), and you can give column names within data.frame(barcode = set1[,1], ...


Output is the same, but you're right. TRUE counts as 1 and FALSE as 0.


This step gives me the error below. Is it caused by comparing different factors between the files?

tmp <- set1[,2:ncol(set1)] == set2[,2:ncol(set2)]
Error in Ops.factor(left, right) : level sets of factors are different


Read your data as character, not factor: see ?read.table and set stringsAsFactors = FALSE. Then you will not get the factor error.
