Question: How to extract from a large text file in R
e.jabbari wrote:

Hi all,

Very grateful for any help provided for this simple R question being asked by a novice!

I have a txt file with ~8 million rows and 4 columns.

I have a separate txt file with just 1 column of ~2 million rows, and the value in each of its rows corresponds to a value in column 1 of the first txt file.

What would be the correct command in R to extract the ~2 million rows from the original txt file with 8 million rows such that the data in all 4 columns are extracted?

Can you even load 2 separate txt files as 2 separate data frames to make this possible??

Much appreciated, Ed

extract R • 287 views
modified 11 months ago by Brice Sarver • written 11 months ago by e.jabbari

What you're trying to accomplish is not clear. As stated, this is not even a bioinformatics question and may be closed. If the content is too big to fit in RAM, you can process the files line by line. This is not the fastest approach, but it is sometimes the only or the easiest solution.
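A minimal sketch of the line-by-line (here, chunk-by-chunk) idea in base R. The function name `filter_by_id`, the tab separator, the chunk size, and the absence of a header line are all assumptions for illustration, not details given in the question. It never holds the whole large file in memory, only one chunk at a time.

```r
# Hypothetical helper: stream `big_path` in chunks and keep the lines whose
# first field (text before the first `sep`) appears in `ids`.
filter_by_id <- function(big_path, ids, out_path, sep = "\t", chunk_size = 100000L) {
  con_in <- file(big_path, open = "r")
  con_out <- file(out_path, open = "w")
  on.exit({ close(con_in); close(con_out) })
  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break            # end of file reached
    # Position of the first separator in each line (-1 if there is none).
    pos <- regexpr(sep, lines, fixed = TRUE)
    first_field <- ifelse(pos > 0L, substr(lines, 1L, pos - 1L), lines)
    hit <- first_field %in% ids
    writeLines(lines[hit], con_out)
    kept <- kept + sum(hit)
  }
  kept                                         # number of lines written
}

# Toy data standing in for the real files (4 tab-separated columns, no header).
big <- tempfile()
out <- tempfile()
writeLines(c("id1\t1\t100\tA", "id2\t1\t200\tC", "id3\t2\t300\tG"), big)
n_kept <- filter_by_id(big, ids = c("id1", "id3"), out_path = out)
```

With ~2 million IDs the `%in%` lookup is hashed internally, so the per-chunk cost stays modest; the main price of this approach is the repeated I/O.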

modified 11 months ago • written 11 months ago by Jean-Karim Heriche

It is perfectly possible to read millions of lines in R using e.g. scan(). I am guessing that the data contains SNP IDs and that you want to extract a subset of rows from the larger file based on IDs given in another file. I think we already have multiple good solutions for this all over the place that do not involve R.

https://unix.stackexchange.com/questions/110645/select-lines-from-text-file-which-have-ids-listed-in-another-file

https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids

Extracting features from a file from another file with an ID list
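
For completeness, a rough sketch of the scan() route in R. The file layout (whitespace-separated, 4 columns, no header) and the column names `id`, `c2`, `c3`, `c4` are assumptions made for this example; temporary files stand in for the real ones.

```r
# Toy stand-ins for the two files described in the question.
ids_path <- tempfile()
big_path <- tempfile()
writeLines(c("id1", "id3"), ids_path)
writeLines(c("id1 1 100 A", "id2 1 200 C", "id3 2 300 G"), big_path)

# Read the one-column ID file as a character vector.
ids <- scan(ids_path, what = character(), quiet = TRUE)

# Read the 4-column file; `what` as a list makes scan() return one
# component per column (all read as character here to avoid type guessing).
cols <- scan(big_path,
             what = list(id = character(), c2 = character(),
                         c3 = character(), c4 = character()),
             quiet = TRUE)
big <- as.data.frame(cols, stringsAsFactors = FALSE)

# Keep every row whose first column is among the wanted IDs.
subset_rows <- big[big$id %in% ids, ]
```

Reading everything as character is the cautious choice; columns can be converted afterwards with as.integer() or as.numeric() if needed.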

modified 11 months ago • written 11 months ago by Michael Dondrup

P.s. happy to be advised on how to do the above in plink if that's easier...

written 11 months ago by e.jabbari

To add information to your question, use the edit link or the 'add comment' button. Do not use 'add an answer' if what you're posting is not an answer: it makes the question appear answered, and you may not attract actual answers.

written 11 months ago by Jean-Karim Heriche

Indeed, you imply that your dataset is a plink dataset. If so, which plink files are these?

written 11 months ago by Kevin Blighe

Brice Sarver wrote:

Not enough information is given, but it sounds like they want to either subset or merge. Jean-Karim is right that this should likely be closed as off-topic. On the off chance that this is simply subsetting a large results file (say, ignoring headers), see below.

The code below uses data.table, though data frames would work just as well. For that case, look at merge() or subsetting with logicals using which() and the infix operator %in%.

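library(data.table)
# "..." stands for any further fread() arguments your files need (e.g. sep, header)
a <- fread("file1.delim", ...)
b <- fread("file2.delim", ...)
# key the large table on the ID column so lookups use a fast binary search
setkey(a, the_col_you_want)
# the IDs to keep: the single column of the second file
vector_of_vals_from_b <- b[[1]]
# .() forces a keyed join even when the IDs are numeric (a bare numeric vector
# would select row numbers instead); nomatch = NULL drops IDs not found in a
d <- a[.(vector_of_vals_from_b), nomatch = NULL]
fwrite(d, file = "a_subset.delim")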
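
For the plain data-frame route mentioned above, a small sketch with toy data. The column names (`id`, `x`, `y`, `z`) and values are made up for illustration; in practice `a` and `b` would come from read.table() or similar.

```r
# Toy stand-ins: `a` is the large 4-column table, `b` the one-column ID list.
a <- data.frame(id = c("id1", "id2", "id3", "id4"),
                x  = 1:4,
                y  = c(0.1, 0.2, 0.3, 0.4),
                z  = c("A", "C", "G", "T"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("id3", "id1"), stringsAsFactors = FALSE)

# Logical subsetting: keep rows of `a` whose id occurs in `b`.
d_in <- a[a$id %in% b$id, ]

# The same selection via which(), which turns the logical vector into row indices.
d_which <- a[which(a$id %in% b$id), ]

# merge() performs an inner join on the shared column by default.
d_merge <- merge(a, b, by = "id")
```

The two approaches differ in detail: `%in%` keeps the row order of `a` and returns each matching row once, whereas merge() sorts by the key by default and repeats a row if its ID occurs more than once in `b`.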
modified 11 months ago • written 11 months ago by Brice Sarver