Question: (Closed) subset function limitation in R
5 weeks ago demoraesdiogo2017 wrote:

Hello all

I have a very large dataset, with 11,000 columns and 50,000 rows. I have a list of about 2,000 names which can be found among the column names.

To extract only the columns with those 2000 names, I have used the function subset like this

mysubset <- subset(mybigdata, select = c("name1", "name2", ... "name2000"))

I got a few error messages and thought something could be wrong with the names or the code, but after a few tests it seems these errors do not occur if I limit my subsets to about 50 names per line. This, of course, makes the subsetting very laborious (especially since I might have to repeat this process).

Can you recommend an alternative?

modified 5 weeks ago • written 5 weeks ago by demoraesdiogo2017

Are those the actual names? Manually constructing the vector for select= is absolutely not the way to go. Do yourself (and all of us) a favor and go through a few R tutorials. Invest some time in learning the basics of the language; it will pay off in the long run. I can, for example, recommend swirl, but there are literally hundreds of tutorials freely available online.

I'm going to close this question too, as it is not specific to bioinformatics: it is pure programming, and that is not within the scope of Biostars.

written 5 weeks ago by WouterDeCoster
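
A minimal sketch of the point about not typing the select= vector by hand. It assumes the wanted sample IDs are kept one per line in a plain-text file; the toy table, the IDs and the temporary file below are all made up for illustration. The names are read into a character vector, which is then used to index the columns directly.

```r
# Toy stand-in for the real table: 5 genes (rows) x 6 samples (columns).
# All names below are hypothetical.
mybigdata <- data.frame(
  smp01 = 1:5,   smp07 = 6:10,  smp12 = 11:15,
  smp20 = 16:20, smp33 = 21:25, smp41 = 26:30,
  row.names = paste0("gene", 1:5)
)

# Assumption: the wanted IDs live in a text file, one ID per line.
# A temporary file is written here only to keep the sketch self-contained.
id_file <- tempfile(fileext = ".txt")
writeLines(c("smp07", "smp33", "smp12"), id_file)

# Read the IDs into a character vector instead of typing them into c(...).
wanted <- readLines(id_file)

# Index the columns by name; drop = FALSE keeps a data frame
# even when only one column is requested.
mysubset <- mybigdata[, wanted, drop = FALSE]

dim(mysubset)
names(mysubset)
```

The columns come back in the order given in the file, so the IDs do not need to follow any pattern or be adjacent in the table. The same character vector should also work as subset(mybigdata, select = wanted).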

Hello demoraesdiogo2017!

We believe that this post does not fit the main topic of this site.

Pure programming, not exactly bioinformatics

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

written 5 weeks ago by WouterDeCoster

Hello WouterDeCoster, I just used those names as examples to shorten the text needed to explain my issue. My 50 thousand rows are genes, holding TPM values for the 11 thousand columns, which are tissue samples. These are not the actual names of course, but the solution to the problem would still be the same. Constructing the list manually is absolutely necessary because the actual sample IDs are very specific and do not follow a numerical pattern like "namex". Moreover, I need very specific samples scattered across the table, so range selection, which I assume is what you would suggest, would not be the way to go.

Because many issues in this forum are still on topic simply because of the nature of the data (say, for example, logistic regression, which can be applied in virtually any field besides bioinformatics), I do disagree with the closure and do believe it is on topic.

Cheers!

written 5 weeks ago by demoraesdiogo2017

Hi,

I agree with the question being closed, as it stands now. But the core problem is that the question is not specific enough for us to help, and that is what turned it into a generic question. There is no mention of gene names, only 'names'; no specification of the data type, just 'mybigdata' (now it's TPM); and no error message is given. You need to be specific about all these things, and then we might be able to help or not.

Cheers

written 5 weeks ago by Michael Dondrup
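
Since no error message was posted, the cause can only be guessed at. One thing worth ruling out, sketched below with made-up IDs and a made-up three-sample table, is that some requested names do not match the column names exactly. For instance, read.table() and read.csv() default to check.names = TRUE, which passes the header through make.names() and turns characters such as "-" into ".".

```r
# Hypothetical sample IDs containing hyphens, as they might appear in a header.
header_ids <- c("S-001-A", "S-002-B", "S-003-C")

# A tiny CSV held in memory so the sketch is self-contained.
csv_text <- paste(c(paste(c("gene", header_ids), collapse = ","),
                    "g1,1.5,0.0,3.2",
                    "g2,0.7,2.1,0.0"),
                  collapse = "\n")

# Default import: check.names = TRUE rewrites syntactically invalid names.
mybigdata <- read.csv(text = csv_text, row.names = 1)
names(mybigdata)   # "S.001.A" "S.002.B" "S.003.C"

# IDs as the user listed them (the last one is not in the table at all).
wanted <- c("S-001-A", "S-003-C", "S-999-Z")

# Which requested names have no exact match among the columns?
missing_ids <- setdiff(wanted, names(mybigdata))

# Importing with check.names = FALSE keeps the header untouched.
mybigdata_raw <- read.csv(text = csv_text, row.names = 1, check.names = FALSE)
still_missing <- setdiff(wanted, names(mybigdata_raw))   # "S-999-Z"

# Select only the names that are really present.
present  <- intersect(wanted, names(mybigdata_raw))
mysubset <- mybigdata_raw[, present, drop = FALSE]
```

Indexing a data frame with a name that is absent stops with "undefined columns selected", so checking setdiff() first shows exactly which IDs are the problem rather than requiring trial and error in batches of 50.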
The thread is closed. No new answers may be added.