Question: (Closed) subset function limitation in R
23 months ago, demoraesdiogo2017 wrote:

Hello all

I have a very large dataset, with 11,000 columns and 50,000 rows. I have a list of about 2,000 names which can be found among the column names.

To extract only the columns with those 2000 names, I have used the function subset like this

mysubset <- subset(mybigdata, select = c("name1", "name2", ... "name2000"))

I got a few error messages and thought something could be wrong with the names or the code, but after a few tests it seems these errors do not occur if I limit my subsets to about 50 names per line. This, of course, makes the subsetting very laborious (especially since I might have to repeat this process).

Can you recommend an alternative?

subsetting data • 447 views

Are those the actual names? Manually constructing the vector for select= is absolutely not the way to go. Do yourself (and all of us) a favor and go through a few R tutorials. Invest some time in learning the basics of the language; it will pay off in the long run. I can, for example, recommend swirl, but there are literally hundreds of tutorials freely available online.
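A minimal sketch of what avoiding a hand-typed select= vector could look like, assuming mybigdata is a data frame and the wanted names are already held in a character vector. The toy data and the vector name `wanted` are made up for illustration and are not from the original post:

```r
# Toy stand-in for the real table (hypothetical data)
mybigdata <- data.frame(
  sampleA = c(1.2, 0.0, 5.1),
  sampleB = c(3.4, 2.2, 0.7),
  sampleC = c(0.1, 9.8, 4.4),
  row.names = c("gene1", "gene2", "gene3")
)

# Names to keep, stored as a character vector instead of typed into select=
wanted <- c("sampleC", "sampleA", "sampleX")

# Split the wanted names into those that exist as columns and those that do not
present <- intersect(wanted, colnames(mybigdata))
missing <- setdiff(wanted, colnames(mybigdata))

# Index by name; drop = FALSE keeps a data frame even if only one column matches
mysubset <- mybigdata[, present, drop = FALSE]
```

Checking `missing` before subsetting is a design choice here: indexing a data frame with a name that is not a column stops with an error, so it can help to know up front which names did not match.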

I'm going to close this question too, as it is not specific to bioinformatics, it is pure programming and that is not within the scope of biostars.

written 23 months ago by WouterDeCoster

Hello demoraesdiogo2017!

We believe that this post does not fit the main topic of this site.

Pure programming, not exactly bioinformatics

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.


written 23 months ago by WouterDeCoster

Hello WouterDeCoster, I just used those names as examples to shorten the amount of text needed to explain my issue. My 50 thousand rows are gene names, with TPM values across the 11 thousand columns, which are tissue samples. These are not the actual names, of course, but the solution to the problem would still be the same. Constructing a list manually is absolutely necessary because the actual sample IDs are very specific and do not follow a numerical order like "namex". Moreover, I need very specific samples scattered throughout the table, so range selection, which I assume is what you would suggest, would not be the way to go.
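For irregular sample IDs scattered across the table, one possible approach is to keep the IDs in a plain text file, one per line, and match them against the column names. This is only a sketch under assumptions: the IDs, the file, and the table `tpm` below are hypothetical stand-ins, and a temporary file is used so the example runs on its own:

```r
# Hypothetical ID list, one sample ID per line; a temp file stands in for the real one
id_file <- tempfile(fileext = ".txt")
writeLines(c("S-017", "T_204b", "X9"), id_file)
sample_ids <- readLines(id_file)

# Toy expression table (hypothetical); check.names = FALSE keeps IDs such as "S-017" intact
tpm <- data.frame(
  `A1`    = c(0.5, 2.0),
  `S-017` = c(1.1, 0.0),
  `B2`    = c(7.3, 4.2),
  `X9`    = c(0.0, 3.3),
  check.names = FALSE,
  row.names = c("geneA", "geneB")
)

# Logical mask over the columns: TRUE where the column name is in the ID list
keep <- colnames(tpm) %in% sample_ids
mysubset <- tpm[, keep, drop = FALSE]

# IDs from the list that did not match any column
not_found <- setdiff(sample_ids, colnames(tpm))
```

With a logical mask the selected columns keep their original order in the table, and no long command has to be typed at the console.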

Because many questions in this forum are still considered on topic simply because of the nature of the data (say, for example, logistic regression, which can be applied in virtually any field besides bioinformatics), I do disagree with the closure and do believe the question is on topic.


written 23 months ago by demoraesdiogo2017


I agree with the question being closed, as it stands now. But the core problem is that it was not specific enough for anyone to help, and that made it a generic programming question. There is no mention of gene names, only 'names'; no specification of the data type, just 'mybigdata' (now it's TPM); and no error message is given. You need to be specific about all these things, and then we might be able to help or not.


written 23 months ago by Michael Dondrup
The thread is closed. No new answers may be added.

