Question

How to extract from a large text file in R

0

4.8 years ago

e.jabbari • 0

Hi all,

Very grateful for any help provided for this simple R question being asked by a novice!

I have a txt file with ~8 million rows and 4 columns.

I also have a separate txt file with a single column of ~2 million rows; each value in it corresponds to a value in column 1 of the first txt file.

What would be the correct command in R to extract the ~2 million rows from the original txt file with 8 million rows such that the data in all 4 columns are extracted?

Can you even load 2 separate txt files as 2 separate data frames to make this possible??

Much appreciated, Ed

R extract • 3.1k views

ADD COMMENT • link updated 4.8 years ago by Brice Sarver ★ 3.8k • written 4.8 years ago by e.jabbari • 0

1

What you're trying to accomplish is not clear. As stated, this is not even a bioinformatics question and may be closed. If the content is too big to fit in RAM, you can process the files line by line. This is not the fastest approach, but it is sometimes the only or the easiest solution.
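As a rough illustration of the line-by-line idea, here is a minimal sketch that streams the large file in chunks and keeps only lines whose first field appears in the ID list. The toy file contents, the chunk size, and the assumption that fields are separated by spaces or tabs are all hypothetical stand-ins, not details from the original question.

```r
# Toy stand-ins for the two files (hypothetical contents).
big_path <- tempfile(fileext = ".txt")
ids_path <- tempfile(fileext = ".txt")
out_path <- tempfile(fileext = ".txt")
writeLines(c("rs1 1 100 0.5", "rs2 1 200 0.7",
             "rs3 2 300 0.1", "rs4 2 400 0.9"), big_path)
writeLines(c("rs2", "rs4"), ids_path)

# The ID list is assumed small enough to hold in memory.
ids <- readLines(ids_path)

con_in  <- file(big_path, open = "r")
con_out <- file(out_path, open = "w")
chunk_size <- 2  # tiny for the demo; something like 1e5 suits real data

repeat {
  lines <- readLines(con_in, n = chunk_size)
  if (length(lines) == 0) break
  # Strip everything after the first space or tab to get the ID field.
  first_field <- sub("[ \t].*$", "", lines)
  # Write only the lines whose ID is in the list; all columns are kept.
  writeLines(lines[first_field %in% ids], con_out)
}

close(con_in)
close(con_out)

kept <- readLines(out_path)
```

Only one chunk of the large file is in memory at a time, at the cost of being slower than loading everything at once.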

ADD REPLY • link 4.8 years ago by Jean-Karim Heriche 27k

1

It is perfectly possible to read millions of lines in R using, e.g., scan(). I am guessing that the data contain SNP IDs and that you want to extract a subset of rows from the larger file based on IDs given in another file. There are already multiple good solutions for this elsewhere, not involving R:

https://unix.stackexchange.com/questions/110645/select-lines-from-text-file-which-have-ids-listed-in-another-file

https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids

Extracting features from a file from another file with an ID list

ADD REPLY • link 4.8 years ago by Michael 54k

0

P.s. happy to be advised on how to do the above in plink if that's easier...

ADD REPLY • link 4.8 years ago by e.jabbari • 0

0

To add information to your question, use the edit link or the 'add comment' button but do not use 'add an answer' if what you're posting is not an answer to the question because it makes it appear that the question has been answered and you may not attract actual answers.

ADD REPLY • link 4.8 years ago by Jean-Karim Heriche 27k

0

Indeed, are you implying that your dataset is a plink dataset? If so, which plink files are these?

ADD REPLY • link 4.8 years ago by Kevin Blighe 87k

score 0 · Answer 1 · 2019-06-24

Not enough information is given, but it sounds like they want to either subset or merge. Jean-Karim is right that this should likely be closed as off-topic. On the off chance that this is simply subsetting a large results file (say, ignoring headers), see below.

The example below uses data.table, though data frames would work just as well. For data frames, look at merge() or at logical subsetting with the infix operator %in% (optionally wrapped in which()).

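library(data.table)
# Read both files; '...' stands for whatever fread() options you need
a <- fread("file1.delim", ...)
b <- fread("file2.delim", ...)
# Key the large table on the column that holds the IDs
setkey(a, the_col_you_want)
# Join on the key: keep rows of a whose key appears in b's single column
# (the key column and b[[1]] must have the same type)
d <- a[.(b[[1]]), nomatch = NULL]
# Write the subset, with all columns, to disk
fwrite(d, file = "a_subset.delim")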
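For the plain data frame route mentioned above, a minimal base R sketch might look like the following. The toy files, their contents, and the default V1..V4 column names (from reading without a header) are assumptions for illustration; it also shows that the two files can be loaded as two separate data frames.

```r
# Toy stand-ins for the two files (hypothetical contents).
big_path <- tempfile(fileext = ".txt")
ids_path <- tempfile(fileext = ".txt")
writeLines(c("rs1 1 100 0.5", "rs2 1 200 0.7",
             "rs3 2 300 0.1", "rs4 2 400 0.9"), big_path)
writeLines(c("rs2", "rs4"), ids_path)

# Load each file as its own data frame; columns are named V1, V2, ...
big <- read.table(big_path, header = FALSE, stringsAsFactors = FALSE)
ids <- read.table(ids_path, header = FALSE, stringsAsFactors = FALSE)

# Option 1: logical subsetting with %in% keeps all 4 columns
# of the rows whose first column appears in the ID list.
subset_in <- big[big$V1 %in% ids$V1, ]

# Option 2: merge() does an inner join on the shared first column.
subset_merge <- merge(big, ids, by = "V1")

# Write the result back out as a plain text file.
out_path <- tempfile(fileext = ".txt")
write.table(subset_in, file = out_path, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The %in% version preserves the row order of the large file, whereas merge() sorts the result by the join column by default.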