Hi all,
Very grateful for any help provided for this simple R question being asked by a novice!
I have a txt file with ~8 million rows and 4 columns.
I have a separate txt file with a single column of ~2 million rows, and the value in each row corresponds to a value in column 1 of the first txt file.
What would be the correct command in R to extract the ~2 million rows from the original txt file with 8 million rows such that the data in all 4 columns are extracted?
Can you even load 2 separate txt files as 2 separate data frames to make this possible?
Much appreciated, Ed
What you're trying to accomplish is not clear. As stated, this is not even a bioinformatics question and may be closed. If the content is too big to fit in RAM, you can process the files line by line. This is not the fastest approach, but it is sometimes the only or easiest solution.
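A minimal sketch of the line-by-line idea in R, reading the large file in chunks so it never has to sit in memory all at once. The function name filter_by_id, its arguments, and the chunk size are made up for illustration. It assumes the files are space- or tab-delimited, that the ID is the first field, and that there is no header row worth keeping (a header would simply fail to match and be dropped).

```r
# Hypothetical helper: stream `big_path` in chunks and keep the lines whose
# first whitespace-delimited field appears in `ids_path`.
filter_by_id <- function(big_path, ids_path, out_path, chunk_lines = 100000L) {
  ids <- readLines(ids_path)              # only the ID list is held in memory
  con_in  <- file(big_path, open = "r")
  con_out <- file(out_path, open = "w")
  on.exit({ close(con_in); close(con_out) })
  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_lines)
    if (length(lines) == 0L) break        # end of file reached
    first_col <- sub("[ \t].*$", "", lines)  # text before the first space/tab
    hit <- first_col %in% ids
    writeLines(lines[hit], con_out)       # matching lines are copied unchanged
    kept <- kept + sum(hit)
  }
  invisible(kept)                         # number of lines written
}
```

Matching lines are written out exactly as they were read, so all 4 columns are preserved without parsing them.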
It is perfectly possible to read millions of lines in R using e.g. scan. I am guessing that the data contain SNP IDs and that you want to extract a subset of rows from the larger file based on IDs given in another file. There are already several good solutions for this elsewhere that do not involve R:
https://unix.stackexchange.com/questions/110645/select-lines-from-text-file-which-have-ids-listed-in-another-file
https://stackoverflow.com/questions/13732295/extract-all-lines-from-text-file-based-on-a-given-list-of-ids
Extracting features from a file from another file with an ID list
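If you do want to stay in R, and both files fit in memory, one possible sketch is to load them as two separate objects and subset with %in%. The function name subset_by_id and the file names in the usage comment are made up for illustration. It assumes whitespace-delimited files with no header row and IDs in column 1 of the larger file.

```r
# Hypothetical helper: read both files and keep the rows of the large table
# whose first column is listed in the ID file.
subset_by_id <- function(big_path, ids_path) {
  # Read every column as text so IDs and alleles are not reinterpreted.
  big <- read.table(big_path, header = FALSE, colClasses = "character",
                    quote = "", comment.char = "")
  ids <- scan(ids_path, what = character(), quiet = TRUE)
  big[big[[1]] %in% ids, , drop = FALSE]  # all columns, matching rows only
}

# Usage (file names are placeholders):
# hits <- subset_by_id("big.txt", "ids.txt")
# write.table(hits, "subset.txt", sep = "\t", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)
```

Rows come back in the order of the larger file, not the order of the ID list. For files of this size, data.table::fread is usually a much faster reader than read.table, if installing a package is an option.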
P.S. Happy to be advised on how to do the above in plink if that's easier...
To add information to your question, use the edit link or the 'add comment' button but do not use 'add an answer' if what you're posting is not an answer to the question because it makes it appear that the question has been answered and you may not attract actual answers.
Indeed, you imply that your dataset is a plink dataset? If so, which plink files are these?