Question: edirect: Number of authors per year in Pubmed for a given query
ote123 wrote:

I'm trying to find the number of authors publishing on a given topic per year via an Entrez Direct query to Pubmed. That is, I want to give it a query and get back the number of unique author names on publications each year, preferably in an xls or csv spreadsheet. Here's what I have so far:

esearch -db pubmed -query "[query]" | efetch -format xml | xtract -pattern PubmedArticle -block Author -sep " " -element LastName,Initials -block PubDate -sep " " -element Year | sort-uniq-count > filename.xls

Unfortunately, that's just giving me the year and a list of authors, each with a count of 1 next to it. The list looks like this for one of my queries:

1   Bondar SA   Feklissowa ME   Beloussowa ND   1965
1   BONDAR ZA   FEKLISOVA ME    BELOUSOVA ND    1965
1   DISANTAGNESE PA 1965
1   Georgi M    Winkel K zum    Prpic B 1965
1   HOLT PR HASHIM SA   VANITALLIE TB   1965
1   KINNEY VR   TAUXE WN    DEARING WH  1965
1   KUO PT  BASSETT DR  DIGEORGE AM CARPENTER GG    1965
1   MALDONADO JE    HANLON DG   1965
1   STICKLER GB PEYLA TL    DOWER JC    LOGAN GB    1965
1   Zujović J   Milosević V Petrović L  1965

I've also tried moving the year to the first column, and that didn't help, but at least it was a bit neater.

Does anyone know how I can get the count of unique authors for each year?

Thank you in advance.

modified 4.0 years ago by Pierre Lindenbaum • written 4.0 years ago by ote123

WouterDeCoster wrote:

I would use Biopython for this. Bit of unfinished and untested code below:

    from Bio import Entrez, Medline  # Requires Biopython to be installed

    # Entrez wants to know who you are, so they can contact you if something goes wrong or you overload them with requests
    Entrez.email = "youremail@example.com"
    searchterm = "your topic of interest"  # Placeholder: adapt to your own query
    time = 365  # Placeholder: the number of days you want to go back

    # Search PubMed for searchterm. This retrieves at most 500 records, but you can raise retmax
    handle = Entrez.esearch(db="pubmed", term=searchterm, reldate=time, retmax=500)
    idlist = Entrez.read(handle)["IdList"]  # List of PubMed IDs matching the query
    # Using these IDs, fetch the articles themselves in MEDLINE format (you can probably also ask for XML and other types)
    handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text", retmax=500)
    records = Medline.parse(handle)  # Parse the returned records
    # List comprehension: loop over the records and extract the author (AU) element from each, giving a list of lists
    allauthors = [record['AU'] for record in records]

You need error handling for when the parsing is not successful, and then some set operations to extract the unique authors. Note that you have to adapt searchterm and time in the Entrez.esearch call. Also note that this code comes from an entirely different script, so it may not be the best way to do what you want.
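The set operations mentioned above could look roughly like the sketch below. The author lists are made-up stand-ins for parsed records, and unique_authors is a hypothetical helper name, not part of Biopython.

```python
from itertools import chain


def unique_authors(author_lists):
    """Flatten a list of per-article author lists and return the distinct names."""
    return set(chain.from_iterable(author_lists))


# Made-up data in the same shape as allauthors (one list per article)
allauthors = [["Bondar SA", "Feklissowa ME"], ["Kinney VR"], ["Bondar SA"]]
authors = unique_authors(allauthors)
print(len(authors))
```

A set keeps each name once, so its length is the number of unique author strings. Spelling variants of the same person still count separately.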

More info:

http://biopython.org/DIST/docs/tutorial/Tutorial.html

http://biopython.org/DIST/docs/api/Bio.Medline-pysrc.html

http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

modified 4.0 years ago • written 4.0 years ago by WouterDeCoster

Can you add checks to see if a name comes from the same affiliation or more than one? Since people with identical names/initials can be from 2 or more institutions.

written 4.0 years ago by genomax

record['AD'] holds the affiliation so yes that's possible.

I immediately see two ways to do this:

- Nasty: concatenate author and affiliation, perhaps with '%' in between so you can split them again afterwards, and use these concatenated strings to check for uniqueness.

- More difficult: generate a tuple per author with (name, affiliation) and use this to check uniqueness (slightly harder to check).

It would probably be best to wrap this in try-except blocks for cases where the field isn't properly present in PubMed and, e.g., your author list is empty.
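A minimal sketch of the tuple approach follows, using plain dictionaries with made-up names in place of real records. It assumes that the i-th affiliation belongs to the i-th author, which may not hold for every PubMed record, and unique_author_affiliations is a hypothetical helper name.

```python
def unique_author_affiliations(records):
    """Return the set of distinct (name, affiliation) pairs across all records."""
    pairs = set()
    for record in records:
        names = record.get("AU", [])          # empty list if no authors are listed
        affiliations = record.get("AD", [])
        for i, name in enumerate(names):
            # Assumption: affiliations line up with authors; use None when one is missing
            affiliation = affiliations[i] if i < len(affiliations) else None
            pairs.add((name, affiliation))
    return pairs


# Made-up records for illustration only
records = [
    {"AU": ["Smith J", "Jones K"], "AD": ["Institute A", "Institute B"]},
    {"AU": ["Smith J"], "AD": ["Institute C"]},
    {"AU": ["Jones K"], "AD": ["Institute B"]},
    {"AD": ["Institute D"]},  # no authors listed
]
pairs = unique_author_affiliations(records)
print(len(pairs))
```

Tuples are hashable, so they can go straight into a set without any separator character. Using .get with a default also sidesteps the missing-field case without a try-except block.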

modified 4.0 years ago • written 4.0 years ago by WouterDeCoster

Thank you very much. Also, I'm sorry, I have no clue what this is saying, and I'm not very familiar with Python. If it's not too much to ask, would you be able to comment exactly what each line is doing? That would be a lot of help with any troubleshooting or modification I might have to do. Thank you!

written 4.0 years ago by ote123

I have added comments to my previous post to clarify the statements and commands used. There is a wealth of information online, and my opinion is that it's more rewarding (for you) to figure things out yourself. But if that's too much trouble and you do not have the ambition to learn some more Python programming, I would be happy to help you further with this script.

written 4.0 years ago by WouterDeCoster

I was actually very decent in Python a while ago, but that was before I saw any utility in it, so I eventually forgot the language entirely. I have, though, been playing around a lot (read: googling with some guess-and-check) with the code you gave me (thank you, again), and I think I have a somewhat decent modified version:

from Bio import Entrez
Entrez.email = "<email address>"
from Bio import Medline
for y in range (1960,2016):
    handle = Entrez.esearch(db="pubmed", term="<my query> AND %s[pdat]" % (str(y)), retmax=1000)
    idlist = Entrez.read(handle)["IdList"]
    handle = Entrez.efetch("pubmed", id=idlist, rettype="medline", retmode="text", retmax=1000)
    records = Medline.parse(handle)
    allauthors = [record['AU'] for record in records]
    allauthorslist = []
    for i in range (0,len(allauthors)):
        allauthorslist += allauthors[i]
    print("%s \t %s \n" % (str(y), len(set(allauthorslist))))

However, when I tried to run it, the program gave me an error on the first iteration:

Traceback (most recent call last):
    File "<stdin>", line 6, in <module>
KeyError: 'AU'

I'm really not sure what that means or what to do with the information it's giving me. I also don't know, therefore, if the program is working, since the first iteration failed. Does this mean anything to you, and does the code look viable?

Also, is there a way to give Biopython my whole bit of code (all 13 lines) at once, so I'm not doing copy and paste 13 times for however many queries?

Thank you once again.

written 4.0 years ago by ote123

Code seems reasonable. Could you check whether idlist is what you think it is, e.g. by printing its length and/or first items? That makes it easier to track down the error.

As I hinted at earlier, it might be that a record is not properly formatted, e.g. the 'AU' key is not present in the retrieved data. Let's rewrite a part to take that into account:

I change allauthors = [record['AU'] for record in records] to:

failedparser = [] #List in which we drop all failures, perhaps try to fix this later?
allauthors = [] #The result of this will be a one dimensional list, no longer a list of lists
for record in records:
    try: #Extremely useful statement. Executes code and, if the specified error occurs, executes that action without interrupting the script
        allauthors.extend(record['AU']) #Add the authorlist to the allauthors
    except KeyError: #In the case of a KeyError, don't kill the program
        failedparser.append(record) #Failures go here, depending on the number you may want to try and fix this

For your second question: easiest would be to just save it as a script/text file, e.g. getauthors.py and execute as python getauthors.py

written 4.0 years ago by WouterDeCoster

It looks like some of the articles just don't list authors or don't properly list authors for whatever reason, so that explains the error. I implemented your try piece, and that seems to be going well.

So you're aware, I had this weird bug in which the for loop I put in to break up the lists within the list (I think that's coherent) was breaking the author lists up into letters instead of elements; it ended up being an unnecessary component after switching += for append, so I got rid of it, but regardless, += does seem to behave oddly. I'm assuming that's why you used append yourself?

Anyway, I added a few bells and whistles and streamlined the whole thing a bit (visually, at least). It seems to be working very well. The current version is below. In case it comes up, I noticed that retmax doesn't go above 10000. Is there an easy fix in the event that there are more than 10000 records in a given year?

Also, so I understand what's actually happening a little better, what is "for record in records" doing (line 9 in your original code, line 16 in the version below)?

Current version:

import sys
from Bio import Entrez
Entrez.email = "<email>"
from Bio import Medline
while True:
    query = input("Query: ")
    if query in ("quit", "exit"):
        sys.exit("Program terminated.")
    for y in range(1960, 2016):
        failedparser = []  # records without an 'AU' field
        allauthors = []  # flat list of author names for year y
        handle = Entrez.esearch(db="pubmed", term="%s AND %s[pdat]" % (query, y), retmax=10000)
        idlist = Entrez.read(handle)["IdList"]
        handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text", retmax=10000)
        records = Medline.parse(handle)
        for record in records:
            try:
                allauthors.extend(record['AU'])
            except KeyError:
                failedparser.append(record)
        print("%s \t %s" % (y, len(set(allauthors))))

written 4.0 years ago by ote123

Nicely done. To work around the retmax limitation you could consider splitting years in pieces...
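Splitting a year into pieces could be sketched as follows. This assumes PubMed accepts date ranges of the form YYYY/MM/DD:YYYY/MM/DD[pdat], and monthly_date_terms is a hypothetical helper name. The sketch only builds the query fragments and does not contact NCBI.

```python
import calendar


def monthly_date_terms(year):
    """Build one date-range search term per month of the given year."""
    terms = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]  # number of days in this month
        terms.append("%d/%02d/01:%d/%02d/%02d[pdat]" % (year, month, year, month, last_day))
    return terms


terms = monthly_date_terms(2015)
print(terms[0])
```

Each term would replace the plain year in the search string. The authors from all twelve searches should be pooled into a single set before counting, because the same author can publish in several months of one year.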

records is a list containing all retrieved articles. for record in records is a for loop, in which the code loops over each individual record present in the list of records. In that sense it's very similar to for y in range (1960,2016) (notice that range excludes its end value, so the final year will be 2015; that's just how Python works). Also, I notice now that you used for i in range (0,len(allauthors)):, which is maybe a JavaScript artefact (?) in your reasoning. You could also have used for partial_list in allauthors:

I'm not sure why += didn't work properly in the case you described. I tried a toy example:

a = [1,4,6]
b = [5,8,7]
a += b
a

[1, 4, 6, 5, 8, 7]

And that appears to work correctly.
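One possible explanation for the letter-splitting, offered as a guess rather than a diagnosis of the original script: += on a list adds every element of the right-hand side, and the elements of a string are its characters. The names below are made up.

```python
authors = ["Bondar SA"]
authors += ["Kinney VR"]   # right-hand side is a list: one name is added
print(authors)

letters = []
letters += "Kinney VR"     # right-hand side is a string: each character is added
print(letters)
```

If the loop was at some point applied to an already flat list of names, each allauthors[i] would have been a single string, which would produce exactly this kind of output.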

modified 4.0 years ago • written 4.0 years ago by WouterDeCoster

I think I saw somewhere that you can do a second iteration with retstart at retmax+1, which makes sense. Would you happen to know of a way to check whether the number of records is greater than retmax? I recall that print len(records) returns some sort of error. But it's not very important if you don't happen to know, since it's a special case and has an easy workaround.

That makes sense with the for loop. It seems almost too intuitive. And yes, I'm more familiar with Java and C++ than anything else (far from fluent in either) and have a little (read: minimal but more than with Python) experience in JS.

Maybe you're not having the array issue because you're using numbers/integers, or maybe because it's single-character/digit elements. I'll see if I can replicate the issue to show you exactly what was happening.

Thank you.

modified 4.0 years ago • written 4.0 years ago by ote123

The len(records) doesn't work, because records is not exactly a list but a generator. You could think of it as a more memory-efficient type of list, but one you can only access once. For example if you would do the following:

counter = 0
for record in records:
    counter += 1

You wouldn't be able to get the author information out afterwards, because now the generator is emptied. The next for loop would just yield nothing. I'm sure you can find a lot of things about this online. It depends on your needs.
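The single-use behaviour can be seen in a small sketch, where fake_records is a hypothetical stand-in for Medline.parse(handle) and the records are made up:

```python
def fake_records():
    """Stand-in generator: yields made-up records one at a time."""
    yield {"AU": ["Smith J"]}
    yield {"AU": ["Jones K"]}


records = fake_records()
first_pass = sum(1 for record in records)    # consumes the generator
second_pass = sum(1 for record in records)   # nothing is left to iterate over
print(first_pass, second_pass)

# Converting to a list first keeps the records available, at the cost of memory
records = list(fake_records())
count = len(records)
```

Wrapping the generator in list() makes len() work and allows several passes, but it holds every record in memory at once.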

A straightforward way to count the number of records while extracting the author information would be to initialize a counter = 0 before starting the loop, and increment that counter (counter += 1) before the try-except block.

counter = 0
for record in records:
    counter += 1
    try:
        allauthors.extend(record['AU'])
        lastdate = record['DP']  # Store the date of publication; overwritten every iteration, but we only need the last one in the generator
    except KeyError:
        failedparser.append(record)
if counter >= 10000:
    pass  # Solve the problem here. Perhaps you want to use the date of the last article you parsed and start the search again from there?
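Another option, sketched here under the assumption that the parsed esearch result carries a "Count" entry (the total number of matches, as a string) next to "IdList", is to compare the two before fetching anything. hit_limit is a hypothetical helper name and the result below is a made-up dictionary rather than a real response.

```python
def hit_limit(search_result):
    """Return True if the search matched more records than it returned IDs for."""
    total = int(search_result["Count"])  # assumed to arrive as a string
    return total > len(search_result["IdList"])


# Made-up stand-in for Entrez.read(handle) after an esearch with retmax=10000
search_result = {"Count": "12345", "IdList": ["1"] * 10000}
truncated = hit_limit(search_result)
print(truncated)
```

This check needs no pass over the records, so it can run before the efetch call and decide whether the year has to be split up.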

modified 4.0 years ago • written 4.0 years ago by WouterDeCoster

DCGenomics wrote:

Here's another way (from the author of edirect - http://www.ncbi.nlm.nih.gov/books/NBK179288/)

   AuthorsPerYear() {
     echo "$1" |
     efilter -query "$2 [PDAT]" |
     efetch -format docsum |
     xtract -pattern DocumentSummary \
       -block Author -match AuthType:Author \
         -tab "\n" -element Name |
     sort -f | uniq -i | grep '.' | wc -l
   }

   AuthorLoop() {
     citations=`esearch -db pubmed -query "$1"`
     for (( yr = 2016; yr >= 1960; yr -= 1 ))
     do
       count=`AuthorsPerYear "$citations" "$yr"`
       echo -e "$yr\t$count"
     done
   }

   AuthorLoop "transposition [TITL]"
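
For readers following the Python discussion in this thread, the final stage of AuthorsPerYear (sort -f | uniq -i | grep '.' | wc -l) roughly amounts to counting distinct non-empty lines while ignoring case. A minimal sketch with made-up names, where count_unique_names is a hypothetical helper:

```python
def count_unique_names(lines):
    """Count distinct non-empty lines, ignoring case."""
    return len({line.casefold() for line in lines if line})


# Made-up names; the empty string mimics a blank line in the xtract output
names = ["Smith J", "SMITH J", "", "Jones K"]
unique_count = count_unique_names(names)
print(unique_count)
```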

modified 4.0 years ago by genomax • written 4.0 years ago by DCGenomics

Pierre Lindenbaum wrote:

Using SaxScript (java + XML SAX handler using javascript) : https://github.com/lindenb/jsandbox/blob/master/src/sandbox/SAXScript.md and pubmeddump https://github.com/lindenb/jvarkit/wiki/PubmedDump

The script:

 /** current author */
var author = null;
/** current text */
var text = null;

 /** called when a new element is found, */
function startElement(uri,localName,name,atts)
        {
        if(name=="Author") 
            {
            author={};
            text="";
            }
        }

/** in text node  */
function characters(s)
        {
        if(text!=null) text+=s;
        }

/** end of element */
function endElement(uri,localName,name)
        {
        if(author!=null)
                {
                if(name=="Author")
                    {
                    print( 
                        ("LastName" in author ? author.LastName : ".") +" \t" +
                        ("ForeName" in author ? author.ForeName : ".")
                        );
                     author=null;
                     }
                else if(text!=null && text!="")
                    {
                    author[name]=text;
                    }
                }
        text=(author!=null?"":null);
        }

one-liner:

$ java -jar ~/src/jvarkit-git/dist/pubmeddump.jar "Bioinformatics[JOUR] && 2016[PDAT]" | \
java -jar ~/src/jsandbox/dist/saxscript.jar -f author.js | sort | uniq -c | sort -n

      1 Abecasis    Goncalo R
(...)
      3 Weissman    Tsachy
      3 Wu  Rongling
      3 Zhang   Shihua
      3 Zhang   Wei
      3 Zhang   Yang
      4 Chou    Kuo-Chen
      4 Eils    Roland
      4 Hakenberg   Jörg
      4 Milanesi    Luciano
      4 Wang    Wei
      4 Wolkenhauer     Olaf
      5 .   .
      5 Lu  Zhiyong
      5 Wang    Yadong
      6 Mamitsuka   Hiroshi

modified 4.0 years ago • written 4.0 years ago by Pierre Lindenbaum