Convert genotype probabilities (0-1) into genotypes (0,1,2)
2
0
Entering edit mode
6.0 years ago

I have performed an analysis that yields the probability that a diploid DNA sequence is inherited from species A (genotype = AA), inherited from species B (genotype = BB) or one copy from each species is inherited (genotype = AB). The analysis yields three vectors of probabilities ranging from 0 to 1. I'll paste the header below so you can see what the output data look like. I need to convert these probabilities into an assignment of 0, 1, or 2, where 0 = AA, 1 = AB, and 2 = BB.

Can anyone suggest a strategy that uses unix or R tools to convert the genotype probabilities to genotypes that uses all three columns of information? I have to convert 19 files of these into genotypes.

Thanks!!!

Example output:

            snp     probAncestry(1,1)       probAncestry(1,2)       probAncestry(2,2)
            S1_11928        0.98880 0.01117 0.00003
            S1_28339        0.99042 0.00956 0.00002
            S1_30258        0.99061 0.00937 0.00002
            S1_37984        0.99138 0.00860 0.00002
            S1_38081        0.99139 0.00860 0.00002
            S1_39977        0.99157 0.00841 0.00002
            S1_39988        0.99157 0.00841 0.00002

edit: clarified example data as output

R unix SNP • 2.4k views
ADD COMMENT
0
Entering edit mode

Which program did you use that produces this data? Obviously the probabilities that are close to 1 are the likely genotypes. For all of the ones that you've listed, the genotype relates to the first column (probAncestry(1,1). A threshold can be typically 0.90 or 0.95. Do you literally want a script that can parse this and make a call at each line?

ADD REPLY
0
Entering edit mode

Is double posting against the rules?

ADD REPLY
0
Entering edit mode

It's not breaking the law or anything, but it's more about me bringing attention to others. If an answer was already provided on StackExchange, then no point posting here

ADD REPLY
0
Entering edit mode

RASPberry was the program I used. The output above is just the relevant columns I need, there are a few more columns in the output omitted.

ADD REPLY
0
Entering edit mode

Yes, I do need a script to take the probabilities and make a call for each locus. This individual is unadmixed, so the probabilies for every locus are high; later on in this file the probs = 1 for nearly every locus. However, admixed individuals will have probAncestry(1,2) > 0.5 at some loci & probAncestry(2,2) > 0.5 at other loci. I need genotypes of them.

My current thinking is to use nested ifelse statements, but I like the simplicity of zx8754's answer!

ADD REPLY
0
Entering edit mode

Where is the example input?

ADD REPLY
0
Entering edit mode

Oh yeh, s/he's showing the example output (?)

ADD REPLY
1
Entering edit mode

Not sure, if it is input or expected output. If it is input, then it is as simple as:

round(0.98880 * 0 + 0.01117 * 1 + 0.00003 * 2)
# 0

meaning AA

ADD REPLY
0
Entering edit mode

nice and simple. I can work that into apply (I think ;)

ADD REPLY
0
Entering edit mode

Yes, neat way of coding it, zx8754.

ADD REPLY
0
Entering edit mode

This solution works well for some loci, but here's an example where it fails. The output should be a 2, but it's a 1.

> round(0.48 * 0 + 0.01 * 1 + 0.51 * 2)
[1] 1

> 0.48 * 0 + 0.01 * 1 + 0.51 * 2
[1] 1.03

This kind of a distribution is unlikely to be observed from my data. Usually, if a locus has high probability of BB, then the AA probability is low (ie. ~ 0). But I assume it could happen.

ADD REPLY
0
Entering edit mode

What about just using 0.9 as cut-off? Literally scan over each data-point and, if >0.9, set to some value.

I would do this using AWK or tr command in BASH.

ADD REPLY
1
Entering edit mode
6.0 years ago
zx8754 11k

To speed up the import bit, use fread from data.table package:

library(data.table)

df1 <- rbindlist(lapply(import, fread)

Assuming your df1 looks like below:

df1 <- read.table(text = "
S1_1 0.9 0.0 0.1
S1_2 0.1 0.7 0.2
S1_3 0.1 0.0 0.9
S1_4 0.3 0.3 0.3
                  ")

Then we can use apply per row, keeping only 3 probs columns:

df1$geno <- apply(df1[, -1], 1, function(i){
  ix <- which(i > 0.5)
  if(length(ix) == 0) NA else c(0, 1, 2)[ ix ]
})

# result
df1
#     V1  V2  V3  V4 geno
# 1 S1_1 0.9 0.0 0.1    0
# 2 S1_2 0.1 0.7 0.2    1
# 3 S1_3 0.1 0.0 0.9    2
# 4 S1_4 0.3 0.3 0.3   NA
ADD COMMENT
0
Entering edit mode
6.0 years ago

Here's R code I wrote to solve this problem. It's very inefficient and runs slowly.

Any tips to speed this up are appreciated.

# This script will is designed to go into a direcotry, read in all the files into a list, 
# then perform the ifelse function to call genotypes from ancestry probabilities
# and output to a new dataframe that will grow as the function loops through the list.

# Read in all the RASPberry output into a list of data frames
path = "/path/to/chr1"
import <- dir(path, pattern = ".txt")
chr1 <- lapply(import, read.table, header=T, sep="\t")
names(chr1) <- gsub(".txt","", import, fixed=TRUE)

# Perform a threshold on the locus probabilities to call genotypes.
geno <- data.frame(NA)
for( i in 1:length(chr1)){
    for( j in 1:4224){
        if (lapply(chr1, `[[`, 9)[[i]][j] > 0.5) {
            0 -> geno[j,i]
        } else if (lapply(chr1, `[[`, 10)[[i]][j] > 0.5) {
            1 -> geno[j,i]
        } else if (lapply(chr1, `[[`, 11)[[i]][j] > 0.5) {
            2 -> geno[j,i]
        }
    }
 }
ADD COMMENT
0
Entering edit mode

Can you provide example input text file?

Are you sure the logic is correct? if AA > 0.5 then it is 0, else if AB > 0.5 then it is 1, else if BB > 0.5 then it is 2.

What happens when we have 0.3 0.3 0.4?

ADD REPLY

Login before adding your answer.

Traffic: 1370 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6