I'm trying to find the number of authors publishing on a given topic per year via an Entrez Direct query to PubMed. That is, I want to give it a query and get back the number of unique author names on publications each year, preferably in an xls or csv spreadsheet. Here's what I have so far:
esearch -db pubmed -query "[query]" | efetch -format xml | xtract -pattern PubmedArticle -block Author -sep " " -element LastName,Initials -block PubDate -sep " " -element Year | sort-uniq-count > filename.xls
Unfortunately, that's just giving me the year and a list of authors, each with a count of 1 next to it. The list looks like this for one of my queries:
1 Bondar SA Feklissowa ME Beloussowa ND 1965
1 BONDAR ZA FEKLISOVA ME BELOUSOVA ND 1965
1 DISANTAGNESE PA 1965
1 Georgi M Winkel K zum Prpic B 1965
1 HOLT PR HASHIM SA VANITALLIE TB 1965
1 KINNEY VR TAUXE WN DEARING WH 1965
1 KUO PT BASSETT DR DIGEORGE AM CARPENTER GG 1965
1 MALDONADO JE HANLON DG 1965
1 STICKLER GB PEYLA TL DOWER JC LOGAN GB 1965
1 Zujović J Milosević V Petrović L 1965
I've also tried moving the year to the first column, and that didn't help, but at least it was a bit neater.
Does anyone know how I can get the count of unique authors for each year?
Thank you in advance.
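For what it's worth, the counts are all 1 because sort-uniq-count is counting whole lines, and each line is one article's full author list plus its year. Counting unique authors per year needs one author per row together with the year. Here is a minimal Python sketch of that idea. It assumes the data has already been reshaped into (year, author) pairs; the function name, the sample pairs, and the choice to upper-case names before comparing are all my own assumptions, not something from the query above.

```python
from collections import defaultdict


def unique_authors_per_year(pairs):
    """Map each year to the number of distinct author names seen in that year."""
    authors_by_year = defaultdict(set)
    for year, author in pairs:
        # Assumption: upper-case so "Kinney VR" and "KINNEY VR" count as one name.
        authors_by_year[year].add(author.upper())
    return {year: len(names) for year, names in sorted(authors_by_year.items())}


# Hypothetical sample input: one (year, author) pair per authorship.
sample_pairs = [
    ("1965", "KINNEY VR"),
    ("1965", "TAUXE WN"),
    ("1965", "Kinney VR"),
    ("1966", "TAUXE WN"),
]
counts = unique_authors_per_year(sample_pairs)
print(counts)  # {'1965': 2, '1966': 1}
```

A set per year does the de-duplication, so the same name on several papers in one year is counted once.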
Can you add checks to see if a name comes from the same affiliation or more than one? Since people with identical names/initials can be from 2 or more institutions.
record['AD'] holds the affiliation so yes that's possible.
I immediately see two ways to do this:
-nasty: concatenate author and affiliation, perhaps with '%' in between so they can be separated afterwards, and check these concatenated strings for uniqueness
-more difficult: generate a tuple (name, affiliation) per author and check these tuples for uniqueness (slightly harder to check)
It would probably be best to wrap this in try-except blocks for cases where the expected fields aren't present in PubMed, e.g. when the author list is empty.
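A rough sketch of the tuple approach, in case it helps. It works on plain dictionaries with 'AU' and 'AD' keys so it runs without any download. The function name and sample records are made up, and it assumes one affiliation value per record shared by all its authors, which real PubMed records may not satisfy.

```python
def unique_author_affiliations(records):
    """Collect distinct (name, affiliation) tuples across all records."""
    seen = set()
    for record in records:
        try:
            authors = record["AU"]
            affiliation = record["AD"]
        except KeyError:
            # Record lacks an author list or an affiliation; skip it.
            continue
        if not isinstance(affiliation, str):
            # Assumption: if several affiliations are given, join them into one string.
            affiliation = "; ".join(affiliation)
        for name in authors:
            seen.add((name, affiliation))
    return seen


# Hypothetical records, shaped like the dictionaries discussed above.
sample_records = [
    {"AU": ["Smith J", "Lee K"], "AD": "Univ A"},
    {"AU": ["Smith J"], "AD": "Univ B"},
    {"AU": ["Smith J"], "AD": "Univ A"},
    {"AD": "Univ C"},  # no 'AU' key
]
pairs = unique_author_affiliations(sample_records)
print(len(pairs))  # 3
```

Tuples are hashable, so a set handles the uniqueness check directly, and "Smith J" at two institutions is counted twice.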
Thank you very much. Also, I'm sorry, I have no clue what this is saying, and I'm not very familiar with Python. If it's not too much to ask, would you be able to comment exactly what each line is doing? That would be a lot of help with any troubleshooting or modification I might have to do. Thank you!
I have added comments to my previous post to clarify the statements and commands used. There is a wealth of information online, and my opinion is that it's more rewarding (for you) to figure things out yourself. But if that's too much trouble and you do not have the ambition to learn some more Python programming, I would be happy to help you further with this script.
I was actually very decent in Python a while ago, but that was before I saw any utility in it, so I eventually forgot the language entirely. I have, though, been playing around a lot (read: googling with some guess-and-check) with the code you gave me (thank you, again), and I think I have a somewhat decent modified version:
However, when I tried to run it, the program gave me an error on the first iteration:
I'm really not sure what that means or what to do with the information it's giving me. I also don't know, therefore, if the program is working, since the first iteration failed. Does this mean anything to you, and does the code look viable?
Also, is there a way to give Biopython my whole bit of code (all 13 lines) at once, so I'm not doing copy and paste 13 times for however many queries?
Thank you once again.
Code seems reasonable. Could you check whether idlist is what you think it is, e.g. by printing the length and/or first items? Makes it easier to track down the error.
As I hinted at earlier, it might be that a record is not properly formatted, e.g. the 'AU' key is not present in the retrieved data. Let's rewrite a part to take that into account:
I change
allauthors = [record['AU'] for record in records]
to:
For your second question: the easiest would be to just save it as a script/text file, e.g. getauthors.py, and execute it as
python getauthors.py
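Regarding the rewrite that tolerates records without an 'AU' key: the sketch below shows one way such a try-except version could look. It is an illustration under assumptions, not necessarily the code originally posted; the function name and sample records are hypothetical.

```python
def collect_authors(records):
    """Gather the author list of every record that has one."""
    allauthors = []
    for record in records:
        try:
            allauthors.append(record["AU"])
        except KeyError:
            # This record has no 'AU' field; skip it instead of crashing.
            continue
    return allauthors


# Hypothetical records: the second one lists no authors.
sample_records = [
    {"AU": ["Smith J"]},
    {"TI": "A paper without authors"},
    {"AU": ["Lee K", "Kim H"]},
]
print(collect_authors(sample_records))  # [['Smith J'], ['Lee K', 'Kim H']]
```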
It looks like some of the articles just don't list authors or don't properly list authors for whatever reason, so that explains the error. I implemented your try piece, and that seems to be going well.
So you're aware, I had this weird bug in which the for loop I put in to break up the lists within the list (I think that's coherent) was breaking the author lists up into letters instead of elements; it ended up being an unnecessary component after switching += for append, so I got rid of it, but regardless, += does seem to behave oddly. I'm assuming that's why you used append yourself?
Anyway, I added a few bells and whistles and streamlined the whole thing a bit (visually, at least). It seems to be working very well. The current version is below. In case it comes up, I noticed that retmax doesn't go above 10000. Is there an easy fix in the event that there are more than 10000 records in a given year?
Also, so I understand what's actually happening a little better, what is "for record in records" doing (line 9 in your original code, line 16 in the version below)?
Current version:
Nicely done. To work around the retmax limitation you could consider splitting years in pieces...
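As a sketch of what splitting a year in pieces could look like, the helper below produces one date range per month. The function name is hypothetical, and the YYYY/MM/DD strings assume the format that, as far as I know, the esearch mindate/maxdate parameters accept. Note that unique authors would have to be merged across the pieces (e.g. in one set per year) rather than summed.

```python
import calendar


def month_ranges(year):
    """Return (first_day, last_day) date strings for each month of a year."""
    ranges = []
    for month in range(1, 13):
        # monthrange returns (weekday of day 1, number of days in the month).
        last_day = calendar.monthrange(year, month)[1]
        ranges.append(
            (f"{year}/{month:02d}/01", f"{year}/{month:02d}/{last_day:02d}")
        )
    return ranges


print(month_ranges(2016)[1])  # ('2016/02/01', '2016/02/29')
```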
records is a list containing all retrieved articles.
for record in records
is a for loop, in which the code loops over each individual record present in the list of records. In that sense it's very similar to
for y in range(1960, 2016)
(Notice that the range excludes the last element, so the final element will be 2015, because that's Python.) Also, I notice now that you used
for i in range(0, len(allauthors)):
which is maybe a JavaScript artefact (?) in your reasoning. You could also have used
for partial_list in allauthors:
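A small runnable illustration of both points, using made-up author lists: the range stops before its end value, and looping by index or looping over the elements directly visits the same items.

```python
# range(1960, 2016) stops before 2016.
years = list(range(1960, 2016))
first_year, last_year = years[0], years[-1]
print(first_year, last_year)  # 1960 2015

# Hypothetical nested author lists, one inner list per article.
allauthors = [["Smith J", "Lee K"], ["Kim H"]]

# Index-based loop, as in Java or C++.
by_index = []
for i in range(0, len(allauthors)):
    by_index.append(allauthors[i])

# Direct loop over the elements, the more usual Python form.
direct = []
for partial_list in allauthors:
    direct.append(partial_list)

print(by_index == direct)  # True
```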
I'm not sure why += didn't work properly in the case you described. I tried a toy example:
And that appears to work correctly.
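For comparison, here is a separate toy case with strings (hypothetical names, not the example referred to above). One possible explanation for the names being split into letters, assuming a bare string ended up on the right-hand side of +=, is that += on a list extends it with every item of the right-hand side, and the items of a string are its characters.

```python
# += with a list on the right: the names are added one by one.
flat = []
flat += ["Smith J", "Lee K"]
print(flat)  # ['Smith J', 'Lee K']

# append: the whole list is added as a single element.
nested = []
nested.append(["Smith J", "Lee K"])
print(nested)  # [['Smith J', 'Lee K']]

# += with a string on the right: the string is iterated character by character.
letters = []
letters += "Lee K"
print(letters)  # ['L', 'e', 'e', ' ', 'K']
```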
I think I saw somewhere that you can do a second iteration with retstart at retmax+1, which makes sense. Would you happen to know of a way to check if the number of records is greater than retmax? I recall that print len(records) returns some sort of error. But it's not very important if you don't happen to know, since it's a special case and has an easy workaround.
That makes sense with the for loop. It seems almost too intuitive. And yes, I'm more familiar with Java and C++ than anything else (far from fluent in either) and have a little (read: minimal but more than with Python) experience in JS.
Maybe you're not having the array issue because you're using numbers/integers, or maybe because it's single-character/digit elements. I'll see if I can replicate the issue to show you exactly what was happening.
Thank you.
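On the retstart idea mentioned above, the arithmetic could be sketched as below. The helper name is hypothetical. To my knowledge retstart is zero-based, so the second batch would start at retstart = retmax rather than retmax+1, and the total number of hits is reported by esearch as Count. Whether the server still serves offsets past the first 10000 PubMed hits is something to verify; if it doesn't, splitting the query is the fallback.

```python
def batch_starts(total_count, retmax=10000):
    """Return the retstart offset of each batch needed to cover total_count hits."""
    # Offsets are zero-based: 0, retmax, 2 * retmax, ...
    return list(range(0, total_count, retmax))


print(batch_starts(25000))  # [0, 10000, 20000]
print(batch_starts(10000))  # [0]
```

If the list has more than one entry, the query returned more hits than a single request with that retmax can hold.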
len(records) doesn't work, because records is not exactly a list but a generator. You could think of it as a more memory-efficient type of list, but one you can only go through once. For example if you would do the following:
You wouldn't be able to get the author information out afterwards, because now the generator is emptied. The next for loop would just yield nothing. I'm sure you can find a lot of things about this online. It depends on your needs.
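A self-contained way to see this behaviour, using a stand-in generator instead of a real download (fake_records and its contents are made up for the illustration):

```python
def fake_records():
    """Stand-in for the parsed records: yields two hypothetical entries."""
    yield {"AU": ["Smith J"]}
    yield {"AU": ["Lee K"]}


records = fake_records()

# First pass: counting the records consumes the generator.
first_pass = sum(1 for _ in records)
print(first_pass)  # 2

# Second pass: nothing is left, so no author lists come out.
second_pass = [record["AU"] for record in records]
print(second_pass)  # []

# len() is not defined for generators and raises TypeError.
try:
    len(fake_records())
    len_failed = False
except TypeError:
    len_failed = True
print(len_failed)  # True
```

If the records are needed more than once, wrapping the generator in list() keeps them all in memory, at the cost of the memory savings.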
A straightforward way to count the number of records while extracting the author information would be to initialize a counter (counter = 0) before starting the loop, and increment that counter (counter += 1) before the try-except block.
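Put together, the counting could look like this sketch. The function name and the stand-in generator are hypothetical; the structure is a plain counter plus the try-except around the 'AU' lookup.

```python
def collect_authors_with_count(records):
    """Return (number of records seen, list of author lists) in a single pass."""
    counter = 0
    allauthors = []
    for record in records:
        # Count every record, including those without an author list.
        counter += 1
        try:
            allauthors.append(record["AU"])
        except KeyError:
            pass
    return counter, allauthors


def fake_records():
    """Stand-in generator with one record that lacks 'AU'."""
    yield {"AU": ["Smith J"]}
    yield {"TI": "No authors listed"}
    yield {"AU": ["Lee K", "Kim H"]}


total, authors = collect_authors_with_count(fake_records())
print(total)    # 3
print(authors)  # [['Smith J'], ['Lee K', 'Kim H']]
```

Comparing the returned count with retmax then shows whether a query may have been cut off.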