7.2 years ago
nejc ▴ 50
I would like to count the number of unique authors of publications referenced by PubMed in the last 10 years. Any ideas how to do this?
I tried downloading tarballs of abstracts here: http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
But these don't really contain full author names, just their initials (at least the file I looked at did).
If an accurate number would be too difficult or impossible to get, I am looking for other ways to estimate the number of life science researchers who have published something in the last 10 years.
Are you going to account for the fact that there could be tens of authors named Smith, John (as an example)? Or are you going to look deeper and check their affiliations (not foolproof, since people move between institutions)?
An update: I have managed to download the NXML files and have written parsers to extract first and last names. So no, I didn't mean to check affiliations at this point. I assumed there would not be so many duplicates that the estimate would become meaningless.
Anyway, I have now decided to consider only publications published from 2010 on, and only those that contain the string "genom" in the NXML file. This is because I am only interested in researchers who work in genomics (I am interested in other omics too, but I think "genom" will be good enough).
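For readers who want to try the same filtering, here is a minimal sketch of what such a parser might look like. It is not the parser used above. It assumes the NXML files follow the usual JATS layout (`contrib` elements with `contrib-type="author"`, a `name` child holding `surname` and `given-names`, and `pub-date`/`year` for the publication year), and the function name `extract_authors` is hypothetical. Matching "genom" case-insensitively is also an assumption.

```python
import xml.etree.ElementTree as ET


def extract_authors(nxml_text, keyword="genom", min_year=2010):
    """Return a set of (surname, given_names) tuples for one article.

    Returns an empty set if the article does not mention the keyword
    or if any of its publication dates is earlier than min_year.
    """
    # Cheap text filter first, so non-matching files are never parsed.
    # Case-insensitive matching is an assumption, not stated in the thread.
    if keyword not in nxml_text.lower():
        return set()

    root = ET.fromstring(nxml_text)

    # Collect years from <pub-date> only, so years in the reference
    # list are not mistaken for the publication year.
    years = []
    for pub_date in root.iter("pub-date"):
        year = pub_date.findtext("year")
        if year and year.strip().isdigit():
            years.append(int(year))
    if not years or min(years) < min_year:
        return set()

    authors = set()
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        name = contrib.find("name")
        if name is None:
            continue  # e.g. group or consortium authors without a personal name
        surname = (name.findtext("surname") or "").strip()
        given = (name.findtext("given-names") or "").strip()
        if surname:
            authors.add((surname, given))
    return authors
```

Taking the union of the returned sets over all files gives the raw count of "unique" name strings, before any deduplication.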
The problem I have now is this:
I think I may still have far too many duplicates because of "unclean" data. I think affiliations would be even harder to deduplicate, so I don't plan to go there. Current number (after only approximately 20% of publications analysed): 2,073,101 "uniques", which I think is far too many, as there should be 7-8 million researchers in total (in all fields) globally.
What do you think?
Any ideas how to avoid these issues and come up with some reasonable estimate?
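One simple way to see how much of the count comes from spelling variants is to count the same names under two normalisations. This is only a sketch: the helper names (`normalize`, `strict_key`, `loose_key`, `count_range`) are hypothetical, and neither number is a true bound. The loose key (surname plus first initial) merges distinct people who share a surname and initial, while even the strict key (surname plus full given names) cannot separate two people with identical full names.

```python
import unicodedata


def normalize(text):
    """Lower-case, strip accents, and drop dots and hyphens."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = stripped.replace(".", " ").replace("-", " ").lower()
    return " ".join(cleaned.split())  # collapse runs of whitespace


def strict_key(surname, given):
    """Surname plus full normalised given names."""
    return (normalize(surname), normalize(given))


def loose_key(surname, given):
    """Surname plus first initial only."""
    return (normalize(surname), normalize(given)[:1])


def count_range(names):
    """Return (loose_count, strict_count) for an iterable of (surname, given) pairs."""
    names = list(names)
    strict = {strict_key(s, g) for s, g in names}
    loose = {loose_key(s, g) for s, g in names}
    return len(loose), len(strict)
```

If the two counts are far apart, much of the apparent uniqueness comes from initials versus full given names, and the real number of people lies somewhere in that range or below it.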
Hello, did you know about ORCID? It seems to have more than 1,855,000 registered researchers (though that includes humanities academics). A "John Smith" search yields several results, but I have no idea how to link those IDs to papers (or, more importantly, how to link papers to ORCID IDs).
It is entirely possible that all of those examples you listed are in fact individual people. Of course there will be unclean data for one reason or another, but for any specific example, what looks like duplication may not be. If Aarts is a very common surname, and H is the initial of a common given name, you would expect many authors to share that surname and initial. There is research out there that does data mining on authorship; you should start your work by looking at that literature. There may very well be software packages that will make this easier. In addition, you may find that someone has already done almost exactly this analysis.
Thanks, Dan. Any ideas for keywords I should use to find such literature? Any hints for software packages that could help here? Thanks!
NLM has a defunct project for unique author IDs.
Those are all initiatives aimed at coming up with unique ID systems for authors. They are all fantastic projects, but just counting, say, ORCID IDs would underestimate the number of unique authors, because plenty of authors don't have such IDs, as they are not mandatory for the majority of journals. However, looking at their databases, if available, may be useful as part of a larger analysis.
Here is a link to the Google results for a search with the keywords authorship pubmed machine learning. You can see there are lots of relevant-looking studies; that might be a good place to start.