Hi everyone!
I am new to text mining in bioinformatics. I am learning to use the pubmed.mineR package for text mining from a multi-abstract text file. I have read a lot about the RISmed and easyPubMed R packages, but each seems to have a disadvantage when its output has to be used in pubmed.mineR.
pubmed.mineR takes a text file containing multiple abstracts in PubMed's "abstract" format (one of several PubMed formats, such as XML). easyPubMed has a function that can produce this format, while RISmed apparently does not. On the other hand, easyPubMed can retrieve only 5000 abstracts at a time, while RISmed has no such upper limit.
I randomly chose a cancer type, "oral cancer", which has more than 1 lakh (100,000) PMIDs. RISmed successfully retrieved the abstracts (1.1 GB of data); however, its output is incompatible with pubmed.mineR. easyPubMed, though its output is compatible, has a retrieval limit because it uses the PubMed API at the backend.
Is there a way to use the CLI to retrieve all ~1 lakh abstracts in "abstract" format from PubMed? Since the June 2020 update, the website itself limits downloads to 10,000 abstracts at a time. Here is the short code I used with easyPubMed:
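library("easyPubMed")
search_topic <- 'oral cancer'
# Run the search; the result includes the total number of hits
my_entrez_id <- get_pubmed_ids(search_topic)
my_entrez_id$Count
# Open the help page to check the arguments and the retrieval limit
?fetch_pubmed_data
# Request the records in "abstract" format; this is where the 5000-record limit applies
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, retmax = 142000, format = "abstract")
# Write the retrieved text to a file for pubmed.mineR
writeLines(my_abstracts_txt, con = "oral_cancer.txt")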
Wow, this is such an effortless way of doing it. However, I noticed that the abstract numbering restarts from 1 after every 100 abstracts. I hope that won't cause any trouble during the analysis!
Thanks a lot for the solution.
I changed the
to
Appending helps, I think. Is that the truncation you were referring to?
I was referring to truncation of the example. I have updated my answer to append data to the file.
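For anyone reading later, the difference between overwriting and appending can be seen in a small shell sketch. This is only an illustration, not the command from the answer: the file name append_demo.txt and the two one-line "batches" are made up.

```shell
#!/bin/sh
# Hypothetical file name and batch contents, for illustration only.
out="append_demo.txt"

# '>' truncates the file on every write, so only the last batch survives.
printf 'PMID: 1\n' > "$out"
printf 'PMID: 2\n' > "$out"
overwrite_lines=$(wc -l < "$out" | tr -d ' ')

# Empty the file once, then use '>>' so each batch is added to the end.
: > "$out"
printf 'PMID: 1\n' >> "$out"
printf 'PMID: 2\n' >> "$out"
append_lines=$(wc -l < "$out" | tr -d ' ')

echo "overwrite kept $overwrite_lines line(s), append kept $append_lines line(s)"
```

When a download is split into several batches, each batch has to be written with '>>' (or into its own file), otherwise earlier batches are lost.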
Wonderful. Thanks for your time, I really appreciate it. From now on I will also look outside R for solutions.
I am using the entrez-direct package from conda. It does not seem to ask for any such key.
Checking how many abstracts were fetched
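One possible way to do this check, sketched under assumptions: every record in the "abstract" text format carries a line starting with "PMID: ", so counting those lines gives the number of records even when the per-batch numbering restarts. The sample file below is a made-up stand-in (count_demo.txt, with placeholder journal names and PMIDs); on a real download you would point grep at your own file, e.g. oral_cancer.txt.

```shell
#!/bin/sh
# Hypothetical sample standing in for a real download; all contents are placeholders.
sample="count_demo.txt"
cat > "$sample" <<'EOF'
1. Example Journal. 2020;1(1):1-2.

First example title.

PMID: 11111111

2. Example Journal. 2020;1(1):3-4.

Second example title.

PMCID: PMC0000000
PMID: 22222222  [Indexed for MEDLINE]

1. Example Journal. 2021;2(1):5-6.

Third example title, where the per-batch numbering has restarted.

PMID: 33333333
EOF

# Count lines that start with "PMID: " followed by a digit.
# "PMCID:" lines do not match, so they are not counted as records.
n_records=$(grep -c '^PMID: [0-9]' "$sample")
echo "records: $n_records"
```

Comparing this count with the hit count reported by the search is a quick way to spot a download that stopped early.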
It won't ask. If you don't use a key, NCBI will start throttling your queries, as described in the link I provided above.
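As a rough sketch of how the key is usually supplied: EDirect reads it from the NCBI_API_KEY environment variable, and NCBI documents a limit of 3 requests per second without a key and 10 with one. The key value below is a placeholder, and fetch_abstracts is a hypothetical helper name; the sketch assumes esearch and efetch from EDirect are on your PATH.

```shell
#!/bin/sh
# Placeholder only: replace with the key from your own NCBI account settings.
export NCBI_API_KEY="your_api_key_here"

# Hypothetical helper: search PubMed and save the hits in "abstract" text format.
# Assumes the EDirect tools esearch and efetch are installed and on PATH.
fetch_abstracts() {
    query=$1
    outfile=$2
    esearch -db pubmed -query "$query" | efetch -format abstract > "$outfile"
}

# Usage (needs network access, so it is left commented out here):
# fetch_abstracts "oral cancer" oral_cancer.txt
```

Whether a single call returns every record of a very large result set may depend on the EDirect version and on NCBI's current limits, so it is worth counting the records in the output file afterwards.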
I cannot access this paper. Could you share the author's copy if it is available? My email ID is rohitsatyam102@gmail.com
That paper has not been published yet. You will find that journals will pre-send paper listings to PubMed before actual publication.