Question

R: extract the sequences from a DNAStringSet

5.0 years ago

User6891 ▴ 330

Hi,

I have a data frame tab5 with 6500 rows that contains a column 'ALT', which is a list of DNAStringSet instances of length 1:

head(tab5$ALT)

[[1]]
 A DNAStringSet instance of length 1
 width seq
[1]     1 C

[[2]]
 A DNAStringSet instance of length 1
 width seq
[1]     1 T

If I do:

head(tab5)

   dbsnp_id GT_parent1 GT_embryo GT_parent2  chr   start REF ALT   QUAL
1 rs7521546        1/1       0/1        0/0 chr1 3518944   T   C 7828.9

The ALT column just shows the nucleotide, like it should.

However, when I try to write this as a table:

write.table(tab5, file = "test.txt", sep = "\t", row.names = FALSE)

I get the following in my 'ALT' column:

new("DNAStringSet", pool = new("SharedRaw_Pool", xp_list = list(<pointer: 0x0>), .link_to_cached_object_list = list(<environment>)), ranges = new("GroupedIRanges", group = 1, start = 606, width = 1, NAMES = NULL, elementType = "ANY", elementMetadata = NULL, metadata = list()), elementType = "DNAString", elementMetadata = NULL, metadata = list())

How is this possible? I've already tried to convert the 'ALT' column to a character, but this doesn't change a thing.

R • 9.4k views


I've already tried to convert the 'ALT' column to a character,

How did you do that? Did you try tab5$ALT <- as.character(tab5$ALT)?

5.0 years ago by Ram 43k

Yes, and if I do that, I see the same for each element:

new("DNAStringSet", pool = new("SharedRaw_Pool", xp_list = list(<pointer: (nil)>), .link_to_cached_object_list = list(<environment>)), ranges = new("GroupedIRanges", group = 1, start = 606, width = 1, NAMES = NULL, elementType = "ANY", elementMetadata = NULL, metadata = list()), elementType = "DNAString", elementMetadata = NULL, metadata = list())

updated 5.0 years ago by GenoMax 141k • written 5.0 years ago by User6891 ▴ 330

Are you trying to write tab5 or tab7?

5.0 years ago by zx8754 11k

tab5. Sorry, that was a mistake in the code above; I've changed it now.

5.0 years ago by User6891 ▴ 330

5.0 years ago

Ram 43k

What you actually need is:

tab5$ALT <- sapply(tab5$ALT, function(x) as.character(x[[1]]))

If you're assigning like so:

tab5$ALT <- tab5$ALT[[1]][[1]]

you may actually be applying the first element's ALT allele to the entire data frame column, because assigning a single (scalar) value to a column recycles it across every row. I'd double-check that.
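The recycling behaviour can be seen without Biostrings at all. The following is a minimal base-R sketch, assuming a toy data frame `df` with made-up IDs and a plain list column standing in for tab5$ALT; none of these values come from the original data.

```r
# Toy data frame standing in for tab5 (hypothetical values).
df <- data.frame(id = c("rs1", "rs2", "rs3"), stringsAsFactors = FALSE)
df$ALT <- list("C", "T", "G")   # list column, one allele per row

# Assigning a single value recycles it down the whole column,
# so every row ends up with the first row's allele.
recycled <- df
recycled$ALT <- recycled$ALT[[1]]
print(recycled$ALT)

# Converting element by element keeps one allele per row.
per_row <- df
per_row$ALT <- vapply(per_row$ALT, function(x) as.character(x), character(1))
print(per_row$ALT)
```

Comparing the two printed vectors shows why the scalar assignment silently loses every allele except the first.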

My sample code:

> library('Biostrings')
> dat <- DNAStringSet("C")
> dat
  A DNAStringSet instance of length 1
    width seq
[1]     1 C
> dat[[1]]
  1-letter "DNAString" instance
seq: C
> as.character(dat[[1]])
[1] "C"
> dat[[1]][[1]]
Error in dat[[1]][[1]] : this S4 class is not subsettable
> as.character(dat[[1]][[1]])
Error in dat[[1]][[1]] : this S4 class is not subsettable

The error in your approach becomes clear here:

> dat2 <- DNAStringSet("T")
> ldat <- list(dat,dat2)
> ldat #this is like your head(tab5$ALT)
[[1]]
  A DNAStringSet instance of length 1
    width seq
[1]     1 C

[[2]]
  A DNAStringSet instance of length 1
    width seq
[1]     1 T

> as.character(ldat[[1]][[1]]) #Just the first DNAString (a single value/scalar) 
[1] "C"

> sapply(ldat, function(x) as.character(x[[1]])) # A vector of equal length with each DNAString's nucleotide
[1] "C" "T"
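Putting the pieces together, here is a rough end-to-end sketch of flattening the column and then writing the table. It assumes a toy two-row data frame `tab` with made-up IDs in place of tab5, and it swaps `sapply()` for `vapply()`, which is my own substitution: it stops with an error if any element does not yield exactly one string. The `quote = FALSE` and `colClasses` arguments are likewise my choices, not part of the original call.

```r
library(Biostrings)

# Toy stand-in for tab5 (hypothetical values): ALT is a list of
# length-1 DNAStringSet objects, as in the question.
tab <- data.frame(dbsnp_id = c("rs1", "rs2"), stringsAsFactors = FALSE)
tab$ALT <- list(DNAStringSet("C"), DNAStringSet("T"))

# Flatten the list column into a plain character vector.
# vapply() is used here instead of sapply() so that a malformed
# element raises an error rather than changing the result type.
tab$ALT <- vapply(tab$ALT, function(x) as.character(x[[1]]), character(1))

# Write to a temporary file and read it back to check the column.
out <- tempfile(fileext = ".txt")
write.table(tab, file = out, sep = "\t", row.names = FALSE, quote = FALSE)

# colClasses = "character" prevents a column of only "T"/"F" alleles
# from being read back as logical TRUE/FALSE.
back <- read.delim(out, colClasses = "character")
print(back)
```

If some ALT entries hold more than one allele, `x[[1]]` keeps only the first, so that case would need separate handling.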


score 1 · Accepted Answer · 2019-05-03

5.0 years ago

User6891 ▴ 330

This works:

as.character(tab5$ALT[[1]][[1]])

The output is then the nucleotide itself:

"C"
