Question: separate_rows, but keep related ones together
Asked 15 days ago by _r_am (Baylor College of Medicine, Houston, TX):

Hi,

This appears to be a simple problem that I am unable to solve. I have some data that looks like this:

CHROM    POS    REF    ALT    TYPE       AF
chr1     1      A      T      MISSENSE   0.23
chr2     1      A      T,G    MISSENSE   0.17, 0.09

The above is dummy meaningless data, but it is representative of the problem at hand.

I'd like to separate_rows such that ALT and AF are separated in a coupled manner. Running separate_rows on the 2 columns would give me 4 rows for chr2, not 2. I'd like my output to be:

CHROM    POS    REF    ALT    TYPE       AF
chr1     1      A      T      MISSENSE   0.23
chr2     1      A      T      MISSENSE   0.17
chr2     1      A      G      MISSENSE   0.09

Is there any way I can conserve this pairing while separating the values out? I am too far downstream of the VCF to go back and split multi-allelics.

Answer by antonioggsousa, 15 days ago:

If I run this:

separate_rows(data, ALT, AF, convert = TRUE)

where data is your first data frame, I obtain the second data frame.

How did you run the function separate_rows()?


I ran it with defaults, but I'll try toggling the convert parameter. Thanks, Antonio!

(reply written 15 days ago by _r_am)

You're welcome. Actually, I was lucky: I first tested the example from the function documentation, which sets convert = TRUE. Since the outcome matched what you wanted, I just kept it.

(reply written 15 days ago by antonioggsousa)

OK, moment of truth - I did not run separate_rows, I assumed how it would work based on my experience. It looks like separate_rows does exactly what I need, not a random combination like I thought it would. I really should have tested it before asking here. Sorry about that.

(reply written 15 days ago by _r_am)
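
The difference between the behaviour the asker feared and the one observed in this thread can be illustrated outside R. The following is a minimal Python sketch, not tidyr code: the variable names are made up for illustration, and the values are the dummy ones from the question. Pairing by position keeps each allele with its own frequency, whereas combining every allele with every frequency (which is what splitting each column in a separate step would produce) yields the cross product.

```python
from itertools import product

# One multi-allelic record, using the dummy values from the question.
alt = "T,G".split(",")
af = [float(v) for v in "0.17, 0.09".split(",")]

# Position-wise pairing: the i-th allele stays with the i-th frequency.
paired = list(zip(alt, af))

# Cross product: every allele combined with every frequency.
crossed = list(product(alt, af))

print(len(paired))   # number of rows when pairs are kept together
print(len(crossed))  # number of rows when columns are split independently
```

With two alleles, the paired version gives two rows and the crossed version gives four, which matches the "4 rows, not 2" concern in the question.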
Answer by Alex Reynolds (Seattle, WA, USA), 15 days ago:

You could use a Python script to do this easily:

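#!/usr/bin/env python

import sys

headers = None
for idx, line in enumerate(sys.stdin):
    # Strip only the line ending so that empty trailing fields survive.
    elems = line.rstrip('\r\n').split('\t')
    if idx == 0:
        # The first line holds the column names; echo it unchanged.
        headers = elems
        sys.stdout.write(line)
        continue
    items = dict(zip(headers, elems))
    # ALT and AF are parallel comma-separated lists: the i-th allele
    # belongs with the i-th frequency, so split both and pair by position.
    alleles = items['ALT'].split(',')
    afs = items['AF'].split(',')
    for allele, af in zip(alleles, afs):
        items['ALT'] = allele
        items['AF'] = af
        sys.stdout.write('\t'.join(items[x] for x in headers) + '\n')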

For example:

$ ./split.py < variants.txt
CHROM   POS REF ALT TYPE    AF
chr1    1   A   T   MISSENSE    0.23
chr2    1   A   T   MISSENSE    0.17
chr2    1   A   G   MISSENSE     0.09

Write it out to a file and bring that back into R:

$ ./split.py < variants.txt > variants.split.txt
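
If the splitting needs to happen in memory rather than through a shell pipeline, the same idea can be wrapped in a function. This is only a sketch under assumptions: split_multiallelic and its paired_columns parameter are hypothetical names introduced here, the input is assumed to be tab-separated with a header line, and the function raises an error when ALT and AF cannot be paired one-to-one instead of guessing.

```python
import csv
import io


def split_multiallelic(text, paired_columns=("ALT", "AF")):
    """Expand comma-separated values in paired_columns into one row each.

    Hypothetical helper: the name and signature are illustrative only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for record in reader:
        # Split every paired column and tolerate spaces after commas.
        parts = [
            [v.strip() for v in record[col].split(",")]
            for col in paired_columns
        ]
        # Refuse to guess when the lists cannot be paired one-to-one.
        if len({len(p) for p in parts}) != 1:
            raise ValueError(f"unequal list lengths in record: {record}")
        for values in zip(*parts):
            row = dict(record)
            row.update(zip(paired_columns, values))
            rows.append(row)
    return rows


# The dummy table from the question, tab-separated.
example = (
    "CHROM\tPOS\tREF\tALT\tTYPE\tAF\n"
    "chr1\t1\tA\tT\tMISSENSE\t0.23\n"
    "chr2\t1\tA\tT,G\tMISSENSE\t0.17, 0.09\n"
)
for row in split_multiallelic(example):
    print("\t".join(row.values()))
```

The explicit length check is a design choice: a silent mismatch between alleles and frequencies would otherwise go unnoticed in the output.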

Turns out, I'm an idiot who should really test something and make sure it doesn't work before saying it doesn't work. separate_rows works exactly the way I need my solution to, not the way I thought it would.

(reply written 15 days ago by _r_am)
Answer by zx8754 (London), 15 days ago:

Using data.table:

library(data.table)

x <- fread("CHROM POS REF ALT TYPE AF
chr1 1 A T MISSENSE 0.23
chr2 1 A T,G MISSENSE 0.17,0.09")

x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE))),
    by = .(CHROM, POS, REF, TYPE)
  ][, .(CHROM, POS, REF, ALT, TYPE, AF = as.numeric(AF))]
#    CHROM POS REF ALT     TYPE   AF
# 1:  chr1   1   A   T MISSENSE 0.23
# 2:  chr2   1   A   T MISSENSE 0.17
# 3:  chr2   1   A   G MISSENSE 0.09

The version below should work with automatic type conversion, but it fails. In the first group the single ALT value "T" is converted to logical TRUE, while in the second group "T,G" ends up as character, and data.table then refuses to bind columns of different types across groups:

x[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE, type.convert = TRUE))),
    by = .(CHROM, POS, REF, TYPE)]
# Error in `[.data.table`(x, , lapply(.SD, function(x) unlist(tstrsplit(x,  : 
#   Column 1 of result for group 2 is type 'character' but expecting type
#   'logical'. Column types must be consistent for each group.
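
The type clash can be pictured with a toy model. The sketch below is written in Python and is only a simplified imitation of per-group type guessing, not the actual mechanism of R's type.convert or data.table; infer_type and LOGICAL_STRINGS are hypothetical names. It shows why a group holding only "T" is guessed as logical while a group holding "T" and "G" is guessed as character.

```python
# Strings that this toy model treats as logical values (an assumption
# modelled on how R reads "T" and "F" as TRUE and FALSE).
LOGICAL_STRINGS = {"T", "F", "TRUE", "FALSE", "true", "false", "True", "False"}


def infer_type(values):
    """Guess one type for a whole vector of strings (simplified imitation)."""
    if all(v in LOGICAL_STRINGS for v in values):
        return "logical"
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "character"


# ALT values per group, as in the example data.
group1_alt = ["T"]        # chr1: a single allele
group2_alt = ["T", "G"]   # chr2: two alleles after splitting

print(infer_type(group1_alt))
print(infer_type(group2_alt))
```

Because the guess is made per group, the two groups disagree on the type of ALT. Converting only AF explicitly, as in the first data.table snippet above, avoids the guess altogether.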

Related SO post with other alternative solutions:


Gotta love SO's benchmarked solution list :-)

(reply written 15 days ago by _r_am)