Question: Convert genotype probabilities (0-1) into genotypes (0,1,2)
0
gravatar for lakemonster
20 days ago by
lakemonster0 wrote:

I have performed an analysis that yields the probability that a diploid DNA sequence is inherited from species A (genotype = AA), inherited from species B (genotype = BB) or one copy from each species is inherited (genotype = AB). The analysis yields three vectors of probabilities ranging from 0 to 1. I'll paste the header below so you can see what the output data look like. I need to convert these probabilities into an assignment of 0, 1, or 2, where 0 = AA, 1 = AB, and 2 = BB.

Can anyone suggest a strategy that uses unix or R tools to convert the genotype probabilities to genotypes that uses all three columns of information? I have to convert 19 files of these into genotypes.

Thanks!!!

Example output:

            snp     probAncestry(1,1)       probAncestry(1,2)       probAncestry(2,2)
            S1_11928        0.98880 0.01117 0.00003
            S1_28339        0.99042 0.00956 0.00002
            S1_30258        0.99061 0.00937 0.00002
            S1_37984        0.99138 0.00860 0.00002
            S1_38081        0.99139 0.00860 0.00002
            S1_39977        0.99157 0.00841 0.00002
            S1_39988        0.99157 0.00841 0.00002

edit: clarified example data as output

snp unix R • 165 views
ADD COMMENTlink modified 19 days ago by zx87544.5k • written 20 days ago by lakemonster0

Which program did you use that produces this data? Obviously the probabilities that are close to 1 are the likely genotypes. For all of the ones that you've listed, the genotype relates to the first column (probAncestry(1,1). A threshold can be typically 0.90 or 0.95. Do you literally want a script that can parse this and make a call at each line?

ADD REPLYlink written 20 days ago by Kevin Blighe21k

PS - double posted: https://stats.stackexchange.com/questions/349738/convert-genotype-probabilities-0-1-into-genotypes-0-1-2

ADD REPLYlink written 20 days ago by Kevin Blighe21k

Is double posting against the rules?

ADD REPLYlink written 19 days ago by lakemonster0

It's not breaking the law or anything, but it's more about me bringing attention to others. If an answer was already provided on StackExchange, then no point posting here

ADD REPLYlink written 19 days ago by Kevin Blighe21k

RASPberry was the program I used. The output above is just the relevant columns I need, there are a few more columns in the output omitted.

ADD REPLYlink written 19 days ago by lakemonster0

Yes, I do need a script to take the probabilities and make a call for each locus. This individual is unadmixed, so the probabilies for every locus are high; later on in this file the probs = 1 for nearly every locus. However, admixed individuals will have probAncestry(1,2) > 0.5 at some loci & probAncestry(2,2) > 0.5 at other loci. I need genotypes of them.

My current thinking is to use nested ifelse statements, but I like the simplicity of zx8754's answer!

ADD REPLYlink written 19 days ago by lakemonster0

Where is the example input?

ADD REPLYlink written 19 days ago by zx87544.5k

Oh yeh, s/he's showing the example output (?)

ADD REPLYlink written 19 days ago by Kevin Blighe21k
1

Not sure, if it is input or expected output. If it is input, then it is as simple as:

round(0.98880 * 0 + 0.01117 * 1 + 0.00003 * 2)
# 0

meaning AA

ADD REPLYlink modified 19 days ago • written 19 days ago by zx87544.5k

nice and simple. I can work that into apply (I think ;)

ADD REPLYlink written 19 days ago by lakemonster0

Yes, neat way of coding it, zx8754.

ADD REPLYlink written 19 days ago by Kevin Blighe21k

This solution works well for some loci, but here's an example where it fails. The output should be a 2, but it's a 1.

> round(0.48 * 0 + 0.01 * 1 + 0.51 * 2)
[1] 1

> 0.48 * 0 + 0.01 * 1 + 0.51 * 2
[1] 1.03

This kind of a distribution is unlikely to be observed from my data. Usually, if a locus has high probability of BB, then the AA probability is low (ie. ~ 0). But I assume it could happen.

ADD REPLYlink modified 19 days ago • written 19 days ago by lakemonster0

What about just using 0.9 as cut-off? Literally scan over each data-point and, if >0.9, set to some value.

I would do this using AWK or tr command in BASH.

ADD REPLYlink written 19 days ago by Kevin Blighe21k
1
gravatar for zx8754
19 days ago by
zx87544.5k
London
zx87544.5k wrote:

To speed up the import bit, use fread from data.table package:

library(data.table)

df1 <- rbindlist(lapply(import, fread)

Assuming your df1 looks like below:

df1 <- read.table(text = "
S1_1 0.9 0.0 0.1
S1_2 0.1 0.7 0.2
S1_3 0.1 0.0 0.9
S1_4 0.3 0.3 0.3
                  ")

Then we can use apply per row, keeping only 3 probs columns:

df1$geno <- apply(df1[, -1], 1, function(i){
  ix <- which(i > 0.5)
  if(length(ix) == 0) NA else c(0, 1, 2)[ ix ]
})

# result
df1
#     V1  V2  V3  V4 geno
# 1 S1_1 0.9 0.0 0.1    0
# 2 S1_2 0.1 0.7 0.2    1
# 3 S1_3 0.1 0.0 0.9    2
# 4 S1_4 0.3 0.3 0.3   NA
ADD COMMENTlink modified 19 days ago • written 19 days ago by zx87544.5k
0
gravatar for lakemonster
19 days ago by
lakemonster0 wrote:

Here's R code I wrote to solve this problem. It's very inefficient and runs slowly.

Any tips to speed this up are appreciated.

# This script will is designed to go into a direcotry, read in all the files into a list, 
# then perform the ifelse function to call genotypes from ancestry probabilities
# and output to a new dataframe that will grow as the function loops through the list.

# Read in all the RASPberry output into a list of data frames
path = "/path/to/chr1"
import <- dir(path, pattern = ".txt")
chr1 <- lapply(import, read.table, header=T, sep="\t")
names(chr1) <- gsub(".txt","", import, fixed=TRUE)

# Perform a threshold on the locus probabilities to call genotypes.
geno <- data.frame(NA)
for( i in 1:length(chr1)){
    for( j in 1:4224){
        if (lapply(chr1, `[[`, 9)[[i]][j] > 0.5) {
            0 -> geno[j,i]
        } else if (lapply(chr1, `[[`, 10)[[i]][j] > 0.5) {
            1 -> geno[j,i]
        } else if (lapply(chr1, `[[`, 11)[[i]][j] > 0.5) {
            2 -> geno[j,i]
        }
    }
 }
ADD COMMENTlink written 19 days ago by lakemonster0

Can you provide example input text file?

Are you sure the logic is correct? if AA > 0.5 then it is 0, else if AB > 0.5 then it is 1, else if BB > 0.5 then it is 2.

What happens when we have 0.3 0.3 0.4?

ADD REPLYlink modified 19 days ago • written 19 days ago by zx87544.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 971 users visited in the last hour