On the internet, data is read by a browser, which renders the data according to a specification; a website must therefore be machine-readable. HTML specifies the content and layout of web pages and is rendered by the browser into a human-readable document.
|---------|  reads   /-----------\ linked in   /-----------\
| Browser | <------- | HTML Page | <---------- | Resources |
|---------|          \-----------/             | json, jpg |
                                               | csv ...   |
                                               \-----------/
Example HTML content of a German website for a car park ("Parkhaus"; "belegt" means occupied):
<html>
  <head>
  </head>
  <body>
    <h1>Parkhäuser in Gießen</h1>
    <ul>
      <li>Parkhaus A - 30/50 belegt</li>
      <li>Parkhaus B - 35/60 belegt</li>
    </ul>
    <a href="data.csv">Data</a>
    <img src="graph.jpg" />
  </body>
</html>
If you just save the web page, no special program is needed. If you post-process the page first, you need to write a bit of code to extract the relevant data. But most of the time you need to process the HTML at least in the analysis phase.
Exploring the HTML contents of a web page is best done with the dev tools in your browser or with Jupyter Notebooks (see the previous notes).
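For example, a minimal post-processing sketch in Python, assuming the requests and beautifulsoup4 packages and a hypothetical URL standing in for the car park page above:

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/parkhaeuser"  # hypothetical URL for the page above

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Extract the occupancy lines, e.g. "Parkhaus A - 30/50 belegt"
for item in soup.find_all("li"):
    print(item.get_text(strip=True))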
You should think about getting notified when something goes wrong; otherwise, failures can go unnoticed and cause data loss.
What are possible failures? The page structure can change, the server can be temporarily unreachable, or it can start blocking your requests (as happened in the Hesse police example below). If you track the failures of your automated scraping, you can react to them easily.
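A minimal sketch of such failure tracking, assuming the scraper runs as a scheduled job (cron, GitHub Actions, ...); the URL is a placeholder:

import logging
import sys

import requests

logging.basicConfig(level=logging.INFO)

def scrape(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises on 403/404/5xx instead of failing silently
    return response.text

if __name__ == "__main__":
    try:
        html = scrape("https://example.com/parkhaeuser")  # placeholder URL
    except requests.RequestException as exc:
        logging.error("Scrape failed: %s", exc)
        sys.exit(1)  # non-zero exit, so the scheduler reports the run as failed

Most schedulers, GitHub Actions included, report failed runs, so exiting non-zero is usually enough to get notified.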
If you don't want to depend on the structure of the web resource you are scraping, you can save the whole content instead of extracting specific XPaths. Depending on the size of the scraped content, this naturally increases post-processing time in the analysis phase.
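A minimal sketch of that approach, again with a placeholder URL, saving each fetch as a timestamped snapshot:

from datetime import datetime, timezone
from pathlib import Path

import requests

response = requests.get("https://example.com/parkhaeuser", timeout=30)  # placeholder URL
response.raise_for_status()

# Archive the raw HTML; any element can still be extracted later in analysis.
Path("snapshots").mkdir(exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
(Path("snapshots") / f"{stamp}.html").write_text(response.text, encoding="utf-8")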
Note: HTML documents can be transformed to plain text via pandoc (see the notes on the last hacky-hour).
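For example (the file names are placeholders):

pandoc -f html -t plain page.html -o page.txt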
German railway news scraper: this saves the contents of the rail news from the homepage of the German railways. The scraping method is based on the git-history pattern of Simon Willison (and forked from his ca-fires repo), where he uses GitHub Actions as a free cloud computing resource to assemble time series data from git commits. It has about 4.7k commits at the time of writing.
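A minimal GitHub Actions workflow sketch of this pattern (not Simon Willison's exact setup; the URL and schedule are placeholders):

name: scrape
on:
  schedule:
    - cron: "*/30 * * * *"  # every 30 minutes
  workflow_dispatch:
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch the page
        run: curl --fail -o news.html "https://example.com/news"  # placeholder URL
      - name: Commit if the content changed
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add news.html
          git diff --cached --quiet || git commit -m "Latest data"
          git push

The time series then lives in the commit history rather than in a database.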
Traffic news from the publicly available website of the Hesse police. The dataset ends on the 16th of May, when the web server no longer allowed the GitHub IP to access the website.
Simon Willison's homepage also has many more great examples.
Apart from that, I like the CCC talks (in German) "BahnMining" and "SpiegelMining" by David Kriesel, where he scrapes data from the websites of the German railway and of one of the biggest German news outlets to visualize the publicly available data and draw conclusions from it.
Speaker: Martin