R: extract the sequences from a DNAStringSet
5.0 years ago
User6891 ▴ 330

Hi,

I have a data frame tab5 with 6500 rows. It contains a column 'ALT', which is a list of DNAStringSet instances of length 1:

head(tab5$ALT)

[[1]]
 A DNAStringSet instance of length 1
 width seq
[1]     1 C

[[2]]
 A DNAStringSet instance of length 1
 width seq
 [1]     1 T

If I do:

head(tab5)

dbsnp_id GT_parent1 GT_embryo GT_parent2  chr   start REF ALT   QUAL
1 rs7521546        1/1       0/1        0/0 chr1 3518944   T   C 7828.9

The ALT column just shows the nucleotide, like it should.

However when trying to write this as a table

write.table(tab5, file = "test.txt", sep = "\t", row.names = FALSE)

I get the following in my 'ALT' column

new("DNAStringSet", pool = new("SharedRaw_Pool", xp_list = list(<pointer: 0x0>), .link_to_cached_object_list = list(<environment>)), ranges = new("GroupedIRanges", group = 1, start = 606, width = 1, NAMES = NULL, elementType = "ANY", elementMetadata = NULL, metadata = list()), elementType = "DNAString", elementMetadata = NULL, metadata = list())

How is this possible? I've already tried to convert the 'ALT' column to a character, but this doesn't change a thing.


I've already tried to convert the 'ALT' column to a character,

How did you do that? Did you try tab5$ALT <- as.character(tab5$ALT)?


Yes, and if I do that, I see the same thing for each element:

new("DNAStringSet", pool = new("SharedRaw_Pool", xp_list = list(<pointer: (nil)>), .link_to_cached_object_list = list(<environment>)), ranges = new("GroupedIRanges", group = 1, start = 606, width = 1, NAMES = NULL, elementType = "ANY", elementMetadata = NULL, metadata = list()), elementType = "DNAString", elementMetadata = NULL, metadata = list())

Are you trying to write tab5 or tab7?


tab5, sorry, mistake in the code above, changed it now.

5.0 years ago
User6891 ▴ 330

this works:

as.character(tab5$ALT[[1]][[1]])

the output is then the nucleotide itself

"C"

5.0 years ago
Ram 43k

You actually need

tab5$ALT <- sapply(tab5$ALT, function(x) as.character(x[[1]]))

If you're assigning like so:

tab5$ALT <- tab5$ALT[[1]][[1]]

you might actually be assigning the first element's ALT allele to every row of the column, because a scalar assigned to a data frame column is recycled to fill it. I'd double-check that.
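To make the recycling concern concrete, here is a minimal base-R sketch. The data frame `tab`, its `id` values, and the column names `ALT_scalar` and `ALT_each` are made up for illustration, and plain strings stand in for the DNAStringSet elements so that the example runs without Biostrings.

```r
# Hypothetical three-row table; the ids are placeholders, not real dbSNP ids.
tab <- data.frame(id = c("id1", "id2", "id3"), stringsAsFactors = FALSE)

# Plain strings stand in for the list of DNAStringSet objects.
alt <- list("C", "T", "G")

# Assigning one value: R silently recycles it to every row.
tab$ALT_scalar <- alt[[1]]

# Assigning one value per list element keeps each row's own allele.
tab$ALT_each <- sapply(alt, function(x) x[[1]])

print(tab)
```

In this sketch the `ALT_scalar` column holds the first allele in every row, while `ALT_each` keeps one allele per row, which is the behaviour the sapply call is meant to give.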


My sample code:

> library('Biostrings')
> dat <- DNAStringSet("C")
> dat
  A DNAStringSet instance of length 1
    width seq
[1]     1 C
> dat[[1]]
  1-letter "DNAString" instance
seq: C
> as.character(dat[[1]])
[1] "C"
> dat[[1]][[1]]
Error in dat[[1]][[1]] : this S4 class is not subsettable
> as.character(dat[[1]][[1]])
Error in dat[[1]][[1]] : this S4 class is not subsettable

The error in your approach becomes clear here:

> dat2 <- DNAStringSet("T")
> ldat <- list(dat,dat2)
> ldat #this is like your head(tab5$ALT)
[[1]]
  A DNAStringSet instance of length 1
    width seq
[1]     1 C

[[2]]
  A DNAStringSet instance of length 1
    width seq
[1]     1 T

> as.character(ldat[[1]][[1]]) #Just the first DNAString (a single value/scalar) 
[1] "C"

> sapply(ldat, function(x) as.character(x[[1]])) # A vector of equal length with each DNAString's nucleotide
[1] "C" "T"
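
Putting the pieces together, the following is a rough end-to-end sketch rather than the exact code for the original table. It assumes Biostrings is installed; the data frame `tab` and its ids are hypothetical stand-ins for tab5, and the output goes to a temporary file. `vapply` is used here in place of `sapply` only so that R checks that each element yields exactly one string.

```r
library(Biostrings)

# Hypothetical two-row stand-in for tab5; the ids are placeholders.
tab <- data.frame(id = c("id1", "id2"), stringsAsFactors = FALSE)

# List column of length-1 DNAStringSet objects, like tab5$ALT.
tab$ALT <- list(DNAStringSet("C"), DNAStringSet("T"))

# Flatten the list column to a plain character vector,
# taking the single DNAString from each DNAStringSet.
tab$ALT <- vapply(tab$ALT, function(x) as.character(x[[1]]), character(1))

# Write to a temporary file and read it back to inspect the result.
out <- tempfile(fileext = ".txt")
write.table(tab, file = out, sep = "\t", row.names = FALSE)
back <- read.table(out, header = TRUE, sep = "\t", colClasses = "character")
print(back)
```

Reading the file back with `colClasses = "character"` avoids R interpreting a column of "T" values as logical TRUE.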
