Forum:Programming language use distribution from recent programs / articles
2
5
4.6 years ago

I would be interested to see a breakdown of the programming languages used in recently published bioinformatics programs.

I suspect such a breakdown isn't available, but is there a good enough source, or a quick and dirty way, to assess this? Ideally without having to go through a year of the journal Bioinformatics, downloading each paper and/or program to find out which language it used.

Alternatively, what interesting online sources compile language use by year and, ideally, by sector?

Basically, I'm interested in something like the TIOBE index, but for bioinformatics: https://www.tiobe.com/tiobe-index/

programming_language programs Forum • 4.8k views
4

Not strictly related, as it's not bioinformatics, but I saw this on Twitter a few weeks ago. Kind of interesting. Some of the conclusions are perhaps a bit sketchy: for example, Python > Java > C might reflect an increase in programming skill rather than a trend toward a 'better' language.

You might be able to adapt their workflow idea though?

0

I saw it pass too but didn't really look at it much last time. It is quite interesting :)

0

There may be only a few Java programs: the BBMap suite and FastQC.

Edit: That's the danger of responding to a poll like this. Things I don't generally use fade from memory.

1

GATK, Picard ...

1

Trimmomatic is also in Java.

1

Mauve, Artemis, Qualimap(?)...

6
4.6 years ago

With a simple algorithm, it's difficult to detect languages like 'C' (e.g. 'in C.' vs. 'C. elegans').

output for 'bioinformatics 2017'

* update *

histogram for 'bioinformatics'

(PHP is overcounted because many URLs end with '.php'.)
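
For anyone who wants to experiment, here is a minimal Python sketch of this kind of naive matching. It is not the jvarkit code (which is Java); the language list, the cue words "in/using/with" and the function names are assumptions made up for illustration. Single-letter names must follow a cue word and must not look like an abbreviated genus ('C. elegans'), and URLs are stripped first so that 'index.php' is not counted as PHP.

```python
import re
from collections import Counter

# Hypothetical list for illustration; the hard-coded list in jvarkit may differ.
LANGUAGES = ["Python", "Java", "Perl", "PHP", "C++", "C", "R"]

# Anything that looks like a URL is removed before matching.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def build_pattern(name):
    # The name must not be followed by a word character, '+' or '#',
    # so "C" does not match inside "C++" and "Java" not inside "JavaScript".
    core = re.escape(name) + r"(?![\w+#])"
    if len(name) == 1:
        # Single letters need a cue word in front and must not be followed
        # by ". lowercase", which is how "C. elegans" looks.
        return re.compile(r"\b(?:in|using|with)\s+" + core + r"(?!\.\s+[a-z])")
    return re.compile(r"(?<!\w)" + core, re.IGNORECASE)


PATTERNS = {name: build_pattern(name) for name in LANGUAGES}


def detect_languages(abstract):
    """Return the set of language names that seem to be mentioned."""
    text = URL_RE.sub(" ", abstract)
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def count_languages(abstracts):
    """Count, per language, how many abstracts mention it."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(detect_languages(abstract))
    return counts
```

This is still only a heuristic: it will miss phrasings without a cue word ("a C library") and can be fooled by words such as "python" used in the zoological sense.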

0

That's more like it ;) Nicely done!

0

Yes, it's hard-coded: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/pubmed/PubmedCodingLanguages.java#L201 (feel free to suggest some more). I can send you the table (pmid/lang/title/year/context) if you want.

0

I cloned and compiled (with make) jvarkit. How do I run the code for PubmedCodingLanguages? Java newbie here :)

EDIT: I also tried javac PubmedCodingLanguages.java but got errors. Not sure it is meant to be compiled by itself.

0

I'm refactoring my code these days, which is why I haven't compiled the documentation.

make pubmedcodinglang pubmeddump


(requires Oracle Java 8)

and then something like:

java -jar dist/pubmeddump.jar 'Bioinformatics' | java -jar dist/pubmedcodinglang.jar

1

Working :)

I'll see if I can tweak the code to add languages or add things as needed.

0

Cool! Some caveats: I've commented out 'R', 'PHP' needs to be separated from the URLs, I only look at the abstract (not the title), etc.

1
4.6 years ago
John 13k

Code posted on GitHub has a breakdown of the languages used in the project at the top of the repository page. It should be possible to automate the process of going from a GitHub URL to a CSV of language usage, probably with both relative percentages and absolute lines of code.

Then you could parse PubMed for GitHub URLs.
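
As a rough sketch of that last step, something like the following could pull repository URLs out of abstract text. The regular expression and the function name `github_repos` are my own assumptions, not part of any existing tool, and fetching the abstracts from PubMed is left out entirely.

```python
import re

# Owner and repository names are assumed to use letters, digits, '_', '.' and '-'.
GITHUB_RE = re.compile(
    r"github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)", re.IGNORECASE
)


def github_repos(text):
    """Return the distinct GitHub repository URLs mentioned in text, in order."""
    repos = []
    for owner, repo in GITHUB_RE.findall(text):
        repo = repo.rstrip(".")          # drop a sentence-final period
        if repo.endswith(".git"):
            repo = repo[:-4]             # drop a clone-style suffix
        if not repo:
            continue
        url = "https://github.com/%s/%s" % (owner, repo)
        if url not in repos:
            repos.append(url)
    return repos
```

Anything after the repository name (such as `/releases` or `/blob/master/...`) is ignored, so several links into the same repository collapse to one entry.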

3

5

Your ability to get shit done (in under 5 minutes) will never cease to amaze me Pierre :D

1

I upvoted your answer and Pierre's code snippet because both are awesome, but this is not really what I am looking for. These are GitHub projects mentioning the word "bioinformatics" in the description (EDIT: or somewhere in the file or directory names). It seems the gap between this and published programs is too big for the count to be informative. GitHub has its own bias toward Python, and scripts or random repositories will also differ from published programs.

Still, I really like this! I'll take a look at GitHub's API.

1

Could you scan a repository's README for a DOI? That might be a way to quickly filter for published work.
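
A quick sketch of what that filter might look like, assuming the README text has already been downloaded. The pattern is a common heuristic for modern DOIs rather than a full validator, and the function names are made up for this example.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(readme_text):
    """Return the distinct DOI-like strings found in readme_text, in order."""
    dois = []
    for match in DOI_RE.findall(readme_text):
        doi = match.rstrip(".,;)")       # trailing punctuation from prose
        if doi not in dois:
            dois.append(doi)
    return dois


def looks_published(readme_text):
    """True if the README mentions at least one DOI-like string."""
    return bool(find_dois(readme_text))
```

Note that a DOI in a README may point to a dependency or a dataset rather than to a paper about the repository itself, so this only narrows the list down.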

0

Here's a little Python function that will scrape GitHub for the code usage statistics:

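import requests
import lxml.etree


def get_stats(github_url, pretty=False):
    """Scrape a GitHub repository's search page for per-language file counts."""
    files = {}
    tree = lxml.etree.HTML(requests.get(github_url + '/search?l=markdown').content)

    # Language entries are located through their <span class="count"> element.
    for language in tree.xpath("//span[@class='count']"):
        info = language.getparent().itertext()
        next(info)                    # skip the first text node
        count = int(next(info))       # number of files
        lang = next(info).strip()     # language name
        files[lang] = count

    if not pretty:
        return files

    # Dividing by 100 up front turns counts / total_files into a percentage.
    total_files = sum(files.values()) / 100.
    print('Language    Files    Percentage')
    for language, counts in files.items():
        print('%s %s    %s' % (language.ljust(11), str(counts).rjust(5), counts / total_files))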


It can either return a dict of language_name: raw_count pairs, or just print a table (with percentages):

>>> stats = get_stats('https://github.com/broadinstitute/picard',True)
Language    Files    Percentage
XML            12    1.99004975124
Shell           3    0.497512437811
Java          513    85.0746268657
Text           60    9.95024875622
JavaScript      2    0.331674958541
R               9    1.49253731343
Dockerfile      1    0.16583747927
CSS             1    0.16583747927


It will return an empty dict if the GitHub repo doesn't exist. If you can make a list of bioinformatics repos to scan, pop this function in a loop and aggregate the data :) Unfortunately I couldn't get the code usage stats via the GitHub API, so scraping the HTML was all I could do.
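
The "loop and aggregate" part could look roughly like this. It only assumes that each repository yields a dict of language name to count, like the one the function above returns; `aggregate` and `to_rows` are names invented for this sketch, and no network access happens here.

```python
from collections import Counter


def aggregate(per_repo_counts):
    """Sum language counts over an iterable of {language: count} dicts."""
    total = Counter()
    for counts in per_repo_counts:
        total.update(counts)             # empty dicts (missing repos) add nothing
    return total


def to_rows(total):
    """Return a header row plus (language, count, percent) rows, largest first."""
    grand_total = sum(total.values())
    rows = [("language", "files", "percent")]
    for language, count in total.most_common():
        rows.append((language, count, round(100.0 * count / grand_total, 2)))
    return rows
```

With a list of repository URLs, `aggregate(get_stats(url) for url in urls)` would give the combined counts, and the rows can be written out with `csv.writer(handle).writerows(to_rows(total))`.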