Dear all,
There are still some seats available for the Transmitting Science online course: Web Scraping and Text Analysis in R. The course will be held online, with live sessions on March 21st and 28th and April 4th and 11th, 2025, from 15:00 to 18:30 (Madrid time zone).
From collecting scientific literature to mining biological databases, web scraping and text analysis are powerful research tools. In this course, we will learn how to extract, clean, and analyse text data using R. We will explore techniques such as regular expressions, sentiment analysis, and topic modelling, which make it easier to handle unstructured data in research.
More details here: https://www.transmittingscience.com/courses/statistics-and-bioinformatics/web-scraping-and-text-analysis-in-r/
Best regards,
Ana
Ana Rosa Gómez Cano, PhD.
Transmitting Science - Communication manager
Will you also be teaching the ethics of web scraping and how to get permission to scrape? Web scraping is a grey area where you're attacking the website for data in a way it's not meant to be used. If a website wanted to make its data available through automated means, it would provide an API.
Teaching people how to scrape web sites is propagating techniques for possible abuse of web servers.
Nope. Use APIs.
Wow, that sounds like one hell of an unethical practice. This is kind of why Amazon blocks web scrapers completely - you need to let website owners know before you extract their data wholesale and make their website obsolete.
Thank you very much for your thoughtful comment. You are absolutely right that web scraping comes with important ethical and legal considerations, and this is something we also emphasise in the course.
Our goal is not to promote abusive scraping, but to equip researchers with the skills to responsibly handle situations where no other option exists, particularly in scientific contexts where open-access data is available but not provided in structured form. We also encourage critical reflection on the boundaries between acceptable and unacceptable uses of these techniques.
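As one small illustration of what responsible handling can involve, a script can consult a site's robots.txt before requesting any page. The following is only a minimal sketch in Python (the course itself uses R), not material from the course: the robots.txt rules, the example.org URLs, and the "research-bot" user agent are all hypothetical, and the rules are inlined so that the sketch runs without network access. Note that robots.txt is a convention, not a permission: a site's terms of use and the applicable law still need to be checked separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so no network request is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

USER_AGENT = "research-bot"  # hypothetical name for the researcher's script


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules that are already available as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this user agent may fetch each (hypothetical) URL.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.org/articles/1")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.org/private/data")

# Delay in seconds that the site requests between successive requests, if any.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

In real use the parser would typically be pointed at the site's own file with `set_url()` and `read()`, and the script would wait for the requested delay between requests.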
Thank you again for raising this point; it is an essential part of the conversation.
That's a good approach. Personally, I copy-paste text and then extract info from that text file - you can see how prohibitive it can be to copy dozens of pages. An alternative - and I can use this because I'm a programmer - is to use JS/jQuery to pick just the elements you need. Challenging but worth the effort IMO.
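To make the "pick just the elements you need" idea concrete, here is a rough sketch of the same approach in Python rather than JS/jQuery, using only the standard library on a page that has already been saved locally. The HTML snippet, the `title` class, and the `TitleExtractor` name are invented for illustration and do not come from any real site.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <h2> element whose class list includes "title"."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "title" in classes:
            self._in_title = True
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a matching <h2>.
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False


# Hypothetical saved page content.
SAMPLE_HTML = """
<div class="result"><h2 class="title">First paper</h2><p>Abstract one.</p></div>
<div class="result"><h2 class="title">Second paper</h2><p>Abstract two.</p></div>
<h2>Unrelated heading</h2>
"""

extractor = TitleExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
titles = extractor.titles
print(titles)  # ['First paper', 'Second paper']
```

The heading without the `title` class is skipped, which is the point of selecting elements instead of copying whole pages.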
Thank you for sharing your perspective and experience. Copy-pasting text manually or using scripts like JS/jQuery to isolate elements is indeed another way to approach the challenge, and it shows how diverse the strategies can be depending on skills and goals.
In the course, we aim to show web scraping as one option within that wider toolbox, always framed by the need to weigh efficiency against the ethical and legal responsibilities involved. It is very valuable to have these exchanges, since they broaden the view on what researchers may find practical or more aligned with their principles in different situations.