How to extract from a large text file in R
4.8 years ago
e.jabbari • 0

Hi all,

Very grateful for any help provided for this simple R question being asked by a novice!

I have a txt file with ~8 million rows and 4 columns.

I have a separate txt file with just 1 column of ~2 million rows and the data in each of the rows correspond to the data in column 1 of the first txt file.

What would be the correct command in R to extract the ~2 million rows from the original txt file with 8 million rows such that the data in all 4 columns are extracted?

Can you even load 2 separate txt files as 2 separate data frames to make this possible?

Much appreciated, Ed

R extract • 3.1k views
What you're trying to accomplish is not clear. As stated, this is not even a bioinformatics question and may be closed. If the content is too big to fit in RAM, you can process the files line by line. This is not the fastest approach, but it is sometimes the only or the easiest solution.
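
For reference, the line-by-line idea could be sketched in base R roughly as follows. This is only a sketch under assumptions: the function name filter_by_id_chunked, the tab separator, the chunk size, and the toy rs* IDs are all hypothetical and not taken from the question. The ID list itself is still held in memory; only the large file is streamed.

```r
# Hypothetical helper: stream `big_path` in chunks and keep only the lines
# whose first field (text before the first `sep`) appears in `ids`.
filter_by_id_chunked <- function(big_path, ids, out_path,
                                 sep = "\t", chunk_size = 100000L) {
  con_in <- file(big_path, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "w")
  on.exit(close(con_out), add = TRUE)

  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break  # end of file reached

    # Position of the first separator in each line (-1 if there is none)
    pos <- regexpr(sep, lines, fixed = TRUE)
    first_field <- ifelse(pos > 0, substr(lines, 1, pos - 1), lines)

    hit <- first_field %in% ids
    writeLines(lines[hit], con_out)
    kept <- kept + sum(hit)
  }
  kept  # number of lines written
}

# Toy data standing in for the real files (made-up values)
big <- tempfile()
writeLines(c("rs1\t1\t100\tA", "rs2\t1\t200\tC",
             "rs3\t2\t300\tG", "rs4\t2\t400\tT"), big)
out <- tempfile()

n_kept <- filter_by_id_chunked(big, ids = c("rs2", "rs4"),
                               out_path = out, chunk_size = 2L)
result_lines <- readLines(out)
```

Larger chunks mean fewer read calls at the cost of more memory per step, so chunk_size is the knob to adjust if RAM is the constraint.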

It is perfectly possible to read millions of lines in R using, e.g., scan(). I am guessing that the data contain SNP IDs and that you want to extract a subset of rows from the larger file based on IDs given in another file. I think we already have multiple good solutions for this all over the place, not involving R:

https://unix.stackexchange.com/questions/110645/select-lines-from-text-file-which-have-ids-listed-in-another-file

https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids

Extracting features from a file from another file with an ID list

P.S. Happy to be advised on how to do the above in plink if that's easier...

To add information to your question, use the edit link or the 'add comment' button but do not use 'add an answer' if what you're posting is not an answer to the question because it makes it appear that the question has been answered and you may not attract actual answers.

Indeed, are you implying that your dataset is a plink dataset? If so, which plink files are these?

4.8 years ago
Brice Sarver ★ 3.8k

Not enough information is given, but it sounds like they want to either subset or merge. Jean-Karim is right that this should likely be closed as off-topic. On the off chance that this is simply subsetting a large results file (say, ignoring headers), see below.

The example below uses data tables, though data frames would work just as well. For that case, look at merge() or subsetting with logicals using which() and the infix operator %in%.

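library(data.table)
# Read both files; pass further fread() arguments (sep, header, etc.) as needed
a <- fread("file1.delim")
b <- fread("file2.delim")
# Key the large table on the ID column (the_col_you_want is a placeholder name)
setkey(a, the_col_you_want)
# Keyed join on the IDs in b's single column; wrapping the vector in .() forces
# a join even for numeric IDs, and nomatch = NULL drops IDs not found in a
d <- a[.(b[[1]]), nomatch = NULL]
fwrite(d, file = "a_subset.delim")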
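
For the plain data frame route mentioned above, a minimal base R sketch might look like the following. The temporary files, the toy rs* values, and the column classes are assumptions made up for illustration; V1 to V4 are just the default names read.table() assigns when there is no header.

```r
# Toy stand-ins for the two files (made-up values)
big_path <- tempfile()
ids_path <- tempfile()
writeLines(c("rs1\t1\t100\tA", "rs2\t1\t200\tC",
             "rs3\t2\t300\tG", "rs4\t2\t400\tT"), big_path)
writeLines(c("rs4", "rs2"), ids_path)

# Load each file as its own data frame; columns are named V1, V2, ... by default
big <- read.table(big_path, sep = "\t", header = FALSE,
                  colClasses = c("character", "integer", "integer", "character"))
ids <- read.table(ids_path, header = FALSE, colClasses = "character")

# Option 1: logical subsetting with %in% (which() turns the logicals into
# row indices); rows keep the order they have in the large file
by_in <- big[which(big$V1 %in% ids$V1), ]

# Option 2: inner join on the ID column with merge(); the result is sorted
# by the join column
by_merge <- merge(big, ids, by = "V1")
```

Both options return all four columns for the matching rows. The %in% version tends to be the simpler choice when the second file is only a list of IDs, while merge() is more useful when the second file carries extra columns to attach.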