Question: Calculate average parental genotype
1
gravatar for coleman_jonathan
13 days ago by
European Union
coleman_jonathan410 wrote:

I am interested in creating an averaged value for genotype across two parents:

E.g. Input (three unrelated couples, 1, 2 and 3):

Individual    SNP1    SNP2    SNP3    SNP4 ...
Father_1      1       0       1       2    ...
Mother_1      0       0       2       1    ...
Father_2      1       0       1       1    ...
Mother_2      2       1       0       1    ...
Father_3      1       2       0       1    ...
Mother_3      1       .       0       2    ...

E.g. Output (the midparental genotypes for couple 1, 2 and 3):

Individual    SNP1    SNP2    SNP3    SNP4 ...
Midparent_1   0.5     0       1.5     1.5  ...
Midparent_2   1.5     0.5     0.5     1    ...
Midparent_3   1       .       0       1.5  ...

I can think of a way to do this using e.g. a PLINK .raw file and manipulating that in R, but am concerned that would become memory- and storage-heavy when assessing 100000s of SNPs and 1000s of pairs.

Does anyone know of an existing tool that can do this?

Thanks

snp gwas • 115 views
ADD COMMENTlink modified 12 days ago by finswimmer10k • written 13 days ago by coleman_jonathan410
1

Can you provide more example rows, add more parent pairs. Pretty sure it is easily done using R. Something like split by ped lapply, then use colMeans.

ADD REPLYlink modified 13 days ago • written 13 days ago by zx87546.5k

Thanks - have added. My concern is less the method to do so in R (if we end up doing that, I'll add the code as an answer), but more whether there's a less computationally-intensive / more sophisticated method that I'm missing.

ADD REPLYlink written 13 days ago by coleman_jonathan410

Do we assume rows 1,2 is one parent pair and rows 3,4 is another parent pair, etc?

ADD REPLYlink modified 13 days ago • written 13 days ago by zx87546.5k

That can be assumed (in that it would be straightforward to ensure the file is formatted in that way)

ADD REPLYlink written 12 days ago by coleman_jonathan410

Hello,

you've also posted this question on stackexchange, but with slightly (but important) differences in your in- and output. You should tell us when you cross-post your questions.

Thanks.

fin swimmer

ADD REPLYlink written 12 days ago by finswimmer10k
0
gravatar for zx8754
13 days ago by
zx87546.5k
London
zx87546.5k wrote:

We could use R:

# example data
df1 <- read.table(text = "
Individual  SNP1    SNP2    SNP3    SNP4
Father_1    1   0   1   2
Mother_1    0   0   2   1
Father_2    1   0   1   1
Mother_2    2   1   0   1
Father_3    1   2   0   1
Mother_3    1   .   0   2
", header = TRUE, na.string = ".")

Using loops

t(sapply(split(df1[, -1], rep(1:(nrow(df1)/2), each = 2)), colMeans))
#   SNP1 SNP2 SNP3 SNP4
# 1  0.5  0.0  1.5  1.5
# 2  1.5  0.5  0.5  1.0
# 3  1.0   NA  0.0  1.5

Or

# subset using recycling then get mean
(df1[ c(TRUE, FALSE), -1] + df1[ c(FALSE, TRUE), -1]) / 2

Using awk:

# get odd rows
sed -n 'p;n' original.txt > file1.txt
# get even rows
sed -n 'n;p' original.txt > file2.txt

Then use the awk script from SO: Average of multiple files in shell:

awk 'FNR == 1 { nfiles++; ncols = NF }
     { for (i = 1; i < NF; i++) sum[FNR,i] += $i
       if (FNR > maxnr) maxnr = FNR
     }
     END {
         for (line = 1; line <= maxnr; line++)
         {
             for (col = 1; col < ncols; col++)
                  printf "  %f", sum[line,col]/nfiles;
             printf "\n"
         }
     }' file*.txt
ADD COMMENTlink modified 12 days ago • written 13 days ago by zx87546.5k
0
gravatar for i.sudbery
12 days ago by
i.sudbery3.8k
Sheffield, UK
i.sudbery3.8k wrote:

In python you can do this with reading the file two lines at a time, so you never hold more than two lines at once. This will use next to no memory, and I'm pretty sure it will be much faster than the R solution. There is probably also a solution with pandas and chunking the input file read.

infile = open("infile.tsv")
header = infile.readline()
header = header.strip().split()[1:]
print("\t".join(header))

while True:

    line1 = infile.readline()
    line2 = infile.readline()

    if (not line1) or (not line2): break # EOF

    assert "Father" in line1 and "Mother" in line2

    genotypes1 = map(float, line1.strip().split()[1:])
    genotypes2 = map(float, line2.strip().split()[1:])

    average_genotypes = [str((x+y)/2) for x, y in zip(genotypes1, genotypes2)]

    print("\t".join(average_genotypes))
ADD COMMENTlink written 12 days ago by i.sudbery3.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2004 users visited in the last hour