Question

Convert genotype probabilities (0-1) into genotypes (0,1,2)

0

Entering edit mode

6.0 years ago

lakemonster • 0

I have performed an analysis that yields the probability that a diploid DNA sequence is inherited from species A (genotype = AA), inherited from species B (genotype = BB) or one copy from each species is inherited (genotype = AB). The analysis yields three vectors of probabilities ranging from 0 to 1. I'll paste the header below so you can see what the output data look like. I need to convert these probabilities into an assignment of 0, 1, or 2, where 0 = AA, 1 = AB, and 2 = BB.

Can anyone suggest a strategy that uses unix or R tools to convert the genotype probabilities to genotypes that uses all three columns of information? I have to convert 19 files of these into genotypes.

Thanks!!!

Example output:

            snp     probAncestry(1,1)       probAncestry(1,2)       probAncestry(2,2)
            S1_11928        0.98880 0.01117 0.00003
            S1_28339        0.99042 0.00956 0.00002
            S1_30258        0.99061 0.00937 0.00002
            S1_37984        0.99138 0.00860 0.00002
            S1_38081        0.99139 0.00860 0.00002
            S1_39977        0.99157 0.00841 0.00002
            S1_39988        0.99157 0.00841 0.00002

edit: clarified example data as output

R unix SNP • 2.4k views

ADD COMMENT • link updated 6.0 years ago by zx8754 11k • written 6.0 years ago by lakemonster • 0

0

Entering edit mode

Which program did you use that produces this data? Obviously the probabilities that are close to 1 are the likely genotypes. For all of the ones that you've listed, the genotype relates to the first column (probAncestry(1,1). A threshold can be typically 0.90 or 0.95. Do you literally want a script that can parse this and make a call at each line?

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

PS - double posted: https://stats.stackexchange.com/questions/349738/convert-genotype-probabilities-0-1-into-genotypes-0-1-2

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

Is double posting against the rules?

ADD REPLY • link 6.0 years ago by lakemonster • 0

0

Entering edit mode

It's not breaking the law or anything, but it's more about me bringing attention to others. If an answer was already provided on StackExchange, then no point posting here

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

RASPberry was the program I used. The output above is just the relevant columns I need, there are a few more columns in the output omitted.

ADD REPLY • link 6.0 years ago by lakemonster • 0

0

Entering edit mode

Yes, I do need a script to take the probabilities and make a call for each locus. This individual is unadmixed, so the probabilies for every locus are high; later on in this file the probs = 1 for nearly every locus. However, admixed individuals will have probAncestry(1,2) > 0.5 at some loci & probAncestry(2,2) > 0.5 at other loci. I need genotypes of them.

My current thinking is to use nested ifelse statements, but I like the simplicity of zx8754's answer!

ADD REPLY • link 6.0 years ago by lakemonster • 0

0

Entering edit mode

Where is the example input?

ADD REPLY • link 6.0 years ago by zx8754 11k

0

Entering edit mode

Oh yeh, s/he's showing the example output (?)

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

1

Entering edit mode

Not sure, if it is input or expected output. If it is input, then it is as simple as:

round(0.98880 * 0 + 0.01117 * 1 + 0.00003 * 2)
# 0

meaning AA

ADD REPLY • link 6.0 years ago by zx8754 11k

0

Entering edit mode

nice and simple. I can work that into apply (I think ;)

ADD REPLY • link 6.0 years ago by lakemonster • 0

0

Entering edit mode

Yes, neat way of coding it, zx8754.

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

0

Entering edit mode

This solution works well for some loci, but here's an example where it fails. The output should be a 2, but it's a 1.

> round(0.48 * 0 + 0.01 * 1 + 0.51 * 2)
[1] 1

> 0.48 * 0 + 0.01 * 1 + 0.51 * 2
[1] 1.03

This kind of a distribution is unlikely to be observed from my data. Usually, if a locus has high probability of BB, then the AA probability is low (ie. ~ 0). But I assume it could happen.

ADD REPLY • link 6.0 years ago by lakemonster • 0

0

Entering edit mode

What about just using 0.9 as cut-off? Literally scan over each data-point and, if >0.9, set to some value.

I would do this using AWK or tr command in BASH.

ADD REPLY • link 6.0 years ago by Kevin Blighe 88k

score 1 · Answer 1 · 2018-06-05

To speed up the import bit, use fread from data.table package:

library(data.table)

df1 <- rbindlist(lapply(import, fread)

Assuming your df1 looks like below:

df1 <- read.table(text = "
S1_1 0.9 0.0 0.1
S1_2 0.1 0.7 0.2
S1_3 0.1 0.0 0.9
S1_4 0.3 0.3 0.3
                  ")

Then we can use apply per row, keeping only 3 probs columns:

df1$geno <- apply(df1[, -1], 1, function(i){
  ix <- which(i > 0.5)
  if(length(ix) == 0) NA else c(0, 1, 2)[ ix ]
})

# result
df1
#     V1  V2  V3  V4 geno
# 1 S1_1 0.9 0.0 0.1    0
# 2 S1_2 0.1 0.7 0.2    1
# 3 S1_3 0.1 0.0 0.9    2
# 4 S1_4 0.3 0.3 0.3   NA

score 0 · Answer 2 · 2018-06-05

Here's R code I wrote to solve this problem. It's very inefficient and runs slowly.

Any tips to speed this up are appreciated.

# This script will is designed to go into a direcotry, read in all the files into a list, 
# then perform the ifelse function to call genotypes from ancestry probabilities
# and output to a new dataframe that will grow as the function loops through the list.

# Read in all the RASPberry output into a list of data frames
path = "/path/to/chr1"
import <- dir(path, pattern = ".txt")
chr1 <- lapply(import, read.table, header=T, sep="\t")
names(chr1) <- gsub(".txt","", import, fixed=TRUE)

# Perform a threshold on the locus probabilities to call genotypes.
geno <- data.frame(NA)
for( i in 1:length(chr1)){
    for( j in 1:4224){
        if (lapply(chr1, `[[`, 9)[[i]][j] > 0.5) {
            0 -> geno[j,i]
        } else if (lapply(chr1, `[[`, 10)[[i]][j] > 0.5) {
            1 -> geno[j,i]
        } else if (lapply(chr1, `[[`, 11)[[i]][j] > 0.5) {
            2 -> geno[j,i]
        }
    }
 }