The following Java program parses PubMed XML from stdin and prints the number of days between the "received" and "accepted" dates:
import java.io.InputStream;
import java.util.GregorianCalendar;
import java.util.concurrent.TimeUnit;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

public class Biostar54473
{
    /** One PubMedPubDate element; month and day stay at -1 when absent. */
    private static class PubMedPubDate
    {
        int year;
        int month=-1;
        int day=-1;

        @Override
        public String toString()
        {
            String s=String.format("%04d", year);
            if(month!=-1)
            {
                s+="-"+String.format("%02d", month);
                if(day!=-1)
                {
                    s+="-"+String.format("%02d", day);
                }
            }
            return s;
        }

        /** A missing month defaults to January, a missing day to the 1st.
         *  The calendar uses the JVM's default time zone. */
        long getTimeInMillis()
        {
            GregorianCalendar cal=new GregorianCalendar(
                year,
                month==-1?0:month-1, // Calendar months are 0-based
                month==-1 || day==-1?1:day
                );
            return cal.getTimeInMillis();
        }
    }

    private void parse(InputStream in) throws Exception
    {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
        XMLEventReader r= factory.createXMLEventReader(in);
        PubMedPubDate curr=null; // the date being read, when it is one we want
        PubMedPubDate accepted=null;
        PubMedPubDate received=null;
        String medlineTA=null;
        String pmid=null;
        String articleTitle=null;
        QName attPubStatus=new QName("PubStatus");
        while(r.hasNext())
        {
            XMLEvent evt=r.nextEvent();
            if(evt.isStartElement())
            {
                String name=evt.asStartElement().getName().getLocalPart();
                if(name.equals("PubmedArticle"))
                {
                    // new article: forget everything about the previous one
                    pmid=null;
                    accepted=null;
                    received=null;
                    medlineTA=null;
                    articleTitle=null;
                }
                else if(name.equals("ArticleTitle") && articleTitle==null)
                {
                    articleTitle=r.getElementText().trim();
                }
                else if(name.equals("PMID") && pmid==null)
                {
                    pmid=r.getElementText().trim();
                }
                else if(name.equals("MedlineTA") && medlineTA==null)
                {
                    medlineTA=r.getElementText().trim();
                }
                else if(name.equals("PubMedPubDate"))
                {
                    // read the status of this date only, never a previous one
                    Attribute att=evt.asStartElement().getAttributeByName(attPubStatus);
                    String pubStatus=(att==null?null:att.getValue());
                    if("received".equals(pubStatus))
                    {
                        curr=new PubMedPubDate();
                        received=curr;
                    }
                    else if("accepted".equals(pubStatus))
                    {
                        curr=new PubMedPubDate();
                        accepted=curr;
                    }
                    else
                    {
                        curr=null;
                    }
                }
                else if(curr!=null && name.equals("Year"))
                {
                    try
                    {
                        curr.year=Integer.parseInt(r.getElementText().trim());
                    }
                    catch(Exception err)
                    {
                        // unreadable value: discard the date being read
                        if(received==curr) received=null;
                        if(accepted==curr) accepted=null;
                        curr=null;
                    }
                }
                else if(curr!=null && name.equals("Month"))
                {
                    // the month is either a number or an English name
                    String month=r.getElementText().trim().toLowerCase();
                    if(month.equals("jan") || month.equals("january")) month="1";
                    else if(month.equals("feb") || month.equals("february")) month="2";
                    else if(month.equals("mar") || month.equals("march")) month="3";
                    else if(month.equals("apr") || month.equals("april")) month="4";
                    else if(month.equals("may")) month="5";
                    else if(month.equals("jun") || month.equals("june")) month="6";
                    else if(month.equals("jul") || month.equals("july")) month="7";
                    else if(month.equals("aug") || month.equals("august")) month="8";
                    else if(month.equals("sep") || month.equals("september")) month="9";
                    else if(month.equals("oct") || month.equals("october")) month="10";
                    else if(month.equals("nov") || month.equals("november")) month="11";
                    else if(month.equals("dec") || month.equals("december")) month="12";
                    try
                    {
                        curr.month=Integer.parseInt(month);
                    }
                    catch(Exception err)
                    {
                        if(received==curr) received=null;
                        if(accepted==curr) accepted=null;
                        curr=null;
                    }
                }
                else if(curr!=null && name.equals("Day"))
                {
                    try
                    {
                        curr.day=Integer.parseInt(r.getElementText().trim());
                    }
                    catch(Exception err)
                    {
                        if(received==curr) received=null;
                        if(accepted==curr) accepted=null;
                        curr=null;
                    }
                }
            }
            else if(evt.isEndElement())
            {
                String name=evt.asEndElement().getName().getLocalPart();
                if(name.equals("PubmedArticle"))
                {
                    // print only the articles having both dates
                    if(received!=null && accepted!=null)
                    {
                        long n=accepted.getTimeInMillis()-received.getTimeInMillis();
                        System.out.println(
                            pmid+"\t"+
                            articleTitle+"\t"+
                            medlineTA+"\t"+
                            received+"\t"+
                            accepted+"\t"+
                            TimeUnit.DAYS.convert(n, TimeUnit.MILLISECONDS)
                            );
                    }
                    articleTitle=null;
                    medlineTA=null;
                    pmid=null;
                    curr=null;
                    received=null;
                    accepted=null;
                }
                else if(name.equals("PubMedPubDate"))
                {
                    curr=null;
                }
            }
        }
        r.close();
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println("#pmid\t"+
            "ArticleTitle\t"+
            "MedlineTA\t"+
            "Received\t"+
            "Accepted\t"+
            "DiffDays"
            );
        new Biostar54473().parse(System.in);
    }
}
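A note on the day count: the program subtracts two millisecond timestamps built in the JVM's default time zone and truncates the result to whole days, so a daylight-saving shift between the two dates can make the count one day short. For 2012-01-30 to 2012-09-20, for instance, a plain calendar count is 234 days, while the listing below shows 233; a DST change on the machine that produced it would be one possible explanation. Here is a minimal sketch of a time-zone-independent count using java.time (Java 8+). DayDiffSketch, daysBetween and toDate are hypothetical names, not part of the program above:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of Biostar54473: counts calendar days
// between two dates without going through milliseconds or time zones.
class DayDiffSketch
{
    // Same fallback as PubMedPubDate.getTimeInMillis(): a missing month (-1)
    // becomes January, a missing day (-1) becomes the 1st.
    static LocalDate toDate(int year, int month, int day)
    {
        int m = (month == -1 ? 1 : month);
        int d = (month == -1 || day == -1 ? 1 : day);
        return LocalDate.of(year, m, d); // LocalDate months are 1-based
    }

    // Number of whole calendar days from the first date to the second.
    static long daysBetween(int y1, int m1, int d1, int y2, int m2, int d2)
    {
        return ChronoUnit.DAYS.between(toDate(y1, m1, d1), toDate(y2, m2, d2));
    }

    public static void main(String[] args)
    {
        System.out.println(daysBetween(2012, 7, 4, 2012, 9, 4));   // prints 62
        System.out.println(daysBetween(2012, 1, 30, 2012, 9, 20)); // prints 234
    }
}
```

Swapping this into the program would mean returning a LocalDate from PubMedPubDate instead of a millisecond value; the parsing loop itself would not need to change.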
A 'verticalized' example for a few papers with "next generation sequencing" in the title. The program's output is tab-delimited, so you can load it into R or any other tool to get statistics per journal, per subject, etc.
$ javac Biostar54473.java && cat pubmed_result.xml | java Biostar54473
>>> 2
$1 #pmid 23020966
$2 ArticleTitle Transcriptome analysis using next-generation sequencing.
$3 MedlineTA Curr Opin Biotechnol
$4 Received 2012-07-04
$5 Accepted 2012-09-04
$6 DiffDays 62
<<< 2
>>> 3
$1 #pmid 23000871
$2 ArticleTitle Understanding pathogens in the era of next generation sequencing.
$3 MedlineTA J Infect Dev Ctries
$4 Received 2012-09-13
$5 Accepted 2012-09-14
$6 DiffDays 1
<<< 3
>>> 4
$1 #pmid 22994565
$2 ArticleTitle Accurate variant detection across non-amplified and whole genome amplified DNA using targeted next generation sequencing.
$3 MedlineTA BMC Genomics
$4 Received 2012-01-30
$5 Accepted 2012-09-20
$6 DiffDays 233
<<< 4
(...)
>>> 253
$1 #pmid 18604217
$2 ArticleTitle Alta-Cyclic: a self-optimizing base caller for next-generation sequencing.
$3 MedlineTA Nat Methods
$4 Received 2008-03-10
$5 Accepted 2008-06-02
$6 DiffDays 83
<<< 253
>>> 254
$1 #pmid 18262675
$2 ArticleTitle The impact of next-generation sequencing technology on genetics.
$3 MedlineTA Trends Genet
$4 Received 2007-11-15
$5 Accepted 2007-12-17
$6 DiffDays 32
<<< 254
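If you would rather stay in Java than load the table into R, the per-journal statistics can be sketched as below. This assumes the six tab-separated columns printed by the program (journal in column 3, DiffDays in column 6) and uses the median as one possible summary; JournalDelaySketch and medianByJournal are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical companion class: summarizes the tab-delimited output of Biostar54473.
class JournalDelaySketch
{
    // Returns journal (MedlineTA, column 3) -> median DiffDays (column 6),
    // sorted by journal name.
    static Map<String, Double> medianByJournal(List<String> lines)
    {
        Map<String, List<Long>> daysByJournal = new TreeMap<>();
        for (String line : lines)
        {
            if (line.startsWith("#")) continue;      // header line
            String[] tokens = line.split("\t");
            if (tokens.length < 6) continue;         // skip malformed rows
            long days;
            try { days = Long.parseLong(tokens[5].trim()); }
            catch (NumberFormatException err) { continue; }
            daysByJournal.computeIfAbsent(tokens[2], k -> new ArrayList<>()).add(days);
        }
        Map<String, Double> medians = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : daysByJournal.entrySet())
        {
            List<Long> v = e.getValue();
            Collections.sort(v);
            int n = v.size();
            // middle value, or the mean of the two middle values when n is even
            double median = (n % 2 == 1)
                ? v.get(n / 2)
                : (v.get(n / 2 - 1) + v.get(n / 2)) / 2.0;
            medians.put(e.getKey(), median);
        }
        return medians;
    }

    public static void main(String[] args) throws Exception
    {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in)))
        {
            String line;
            while ((line = in.readLine()) != null) lines.add(line);
        }
        for (Map.Entry<String, Double> e : medianByJournal(lines).entrySet())
        {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Once compiled, it can be chained after the parser: java Biostar54473 < pubmed_result.xml | java JournalDelaySketch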
Very useful idea!
I would title this question as "Degree of burden in submitting a paper" :) !
It would be interesting to calculate results per journal and compare to what the publisher claims is turnaround time :)
That's a good point. Journals make a lot of claims about the speed of their review process, but as far as I know nobody checks them. Our experience with some journals has certainly deviated a great deal from their claims.
This is an issue in the wet-lab world for sure: http://www.nature.com/news/2011/110427/full/472391a.html
I wonder if there is a similar phenomenon among bioinformatics journals. "Please provide tests of extra use cases..." that sort of thing. Anyone had that experience?
I've played with my Java program and uploaded the results to figshare: http://dx.doi.org/10.6084/m9.figshare.96403
Wish I had this when I was trying to calculate the embargo-induced delays in publication of the ENCODE papers: http://caseybergman.wordpress.com/2012/09/05/the-cost-to-science-of-the-encode-publication-embargo/