HackyHour Giessen

Code - Tools - Science - Help - Social

Topic Webscraping

What is web scraping

On the internet, data is read by a browser, which renders it according to a specification. A website must therefore be machine-readable. HTML specifies the content and layout of web pages, and the browser renders it into a reader-ready document.

|---------|                    linked /-----------\
|         | reads /-----------\  in   | Resources |
| Browser | <---- | HTML Page | <---- | json, jpg |
|         |       \-----------/       | csv ...   |
|---------|                           \-----------/

Example HTML content of a German website for car parks:

<html>
  <head>
  </head>
  <body>
    <h1>Parkhäuser in Gießen</h1>
    <ul>
      <li>Parkhaus A - 30/50 belegt</li>
      <li>Parkhaus B - 35/60 belegt</li>
    </ul>
    <a href="data.csv">Data</a>
    <img src="graph.jpg" />
  </body>
</html>
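
Because the page is machine-readable, the occupancy figures can be extracted with a few lines of code. The following is a minimal sketch that uses only the Python standard library and assumes the page looks exactly like the example above. The names `ListItemParser` and `parse_occupancy` are made up for this illustration, and the pattern "name - used/total belegt" is an assumption derived from the two list items shown.

```python
import re
from html.parser import HTMLParser

# The example page from above, embedded so the sketch runs without a network.
PAGE = """<html>
  <head>
  </head>
  <body>
    <h1>Parkhäuser in Gießen</h1>
    <ul>
      <li>Parkhaus A - 30/50 belegt</li>
      <li>Parkhaus B - 35/60 belegt</li>
    </ul>
    <a href="data.csv">Data</a>
    <img src="graph.jpg" />
  </body>
</html>"""


class ListItemParser(HTMLParser):
    """Collect the text of every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def parse_occupancy(html):
    """Return {car park name: (occupied, total)} for all matching list items."""
    parser = ListItemParser()
    parser.feed(html)
    # Assumed text format: "<name> - <occupied>/<total> belegt"
    pattern = re.compile(r"(.+?) - (\d+)/(\d+) belegt")
    result = {}
    for item in parser.items:
        match = pattern.fullmatch(item.strip())
        if match:
            name, used, total = match.groups()
            result[name] = (int(used), int(total))
    return result


occupancy = parse_occupancy(PAGE)
print(occupancy)
```

List items that do not match the assumed pattern are skipped silently here; a real scraper would probably want to report them.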

How to build a web scraper

If you just save the web page, no special program is needed. If you want to post-process the page before saving it, you need to write a bit of code to extract the relevant data. Most of the time, however, you have to process the HTML at some point, at the latest in the analysis phase.

The HTML contents of a web page are best explored with the dev tools in your browser or with Jupyter Notebooks (see the previous notes).

Caveats

You should think about getting notified if something goes wrong. If you don’t, failures can go unnoticed and cause loss of data.

What are possible failures:

If you track the failures of your automatic scraping, you can react to them easily.
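
One way to track failures is to wrap each scraping run so that an error is logged and reported instead of passing silently. The following is only a sketch under assumptions: `run_scrape`, `scrape` and `notify` are hypothetical names, and `notify` stands in for whatever channel you prefer (e-mail, chat message, ...). The failing job and its error message are invented for the demonstration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")


def run_scrape(scrape, notify):
    """Run one scraping job and report a failure instead of dying silently.

    `scrape` is a callable that returns the scraped data,
    `notify` is a callable that takes a message string.
    Both are placeholders you have to provide yourself.
    """
    try:
        data = scrape()
    except Exception as error:
        # Keep the full traceback in the log and send a short notification.
        log.exception("scrape failed")
        notify(f"scrape failed: {error!r}")
        return None
    if not data:
        # An empty result is treated as a failure in this sketch.
        notify("scrape returned no data")
        return None
    log.info("scrape succeeded")
    return data


# Demonstration with an invented failing job; messages are only collected.
def failing_scrape():
    raise ValueError("page layout changed")


messages = []
result = run_scrape(failing_scrape, messages.append)
print(messages)
```

Catching every `Exception` is deliberate here: the point is that no run ends without either data or a notification.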

If you don’t want to depend on the structure of the web resource you are scraping, you can save the whole content instead of extracting specific XML paths. Depending on the size of the scraped content, this naturally increases the time needed for post-processing in the analysis phase.
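
Saving the whole content can be as simple as writing the raw response to a file named after the time of the download. The sketch below assumes a plain HTTP download via the standard library; `fetch` and `save_snapshot` are hypothetical helper names, and the timestamp file name scheme is just one possible choice. The demonstration writes a fixed byte string to a temporary directory so that it runs without network access.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download the raw bytes of a page without parsing them."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def save_snapshot(content, directory, now=None):
    """Write the raw content to a timestamped file and return its path."""
    now = now or datetime.now(timezone.utc)
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # File name scheme (an assumption): UTC timestamp, e.g. 20240102T030405Z.html
    path = directory / f"{now:%Y%m%dT%H%M%SZ}.html"
    path.write_bytes(content)
    return path


# Demonstration with fixed content and a fixed time instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    fixed_time = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    saved = save_snapshot(b"<html></html>", tmp, now=fixed_time)
    saved_name = saved.name
    saved_bytes = saved.read_bytes()
print(saved_name)
```

In a real run you would combine the two helpers, for example `save_snapshot(fetch(url), "snapshots")` with the URL of the page you want to archive, and leave all parsing to the analysis phase.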

Note: HTML documents can be converted to plain text with pandoc (see the notes from the last hacky-hour).

Examples

Simon Willison's homepage also has many more great examples.

Apart from that, I like the CCC talks (in German) “BahnMining” and “SpiegelMining” by David Kriesel, in which he scrapes data from the website of the German railway and from one of the biggest German news outlets in order to visualize the publicly available data and draw conclusions from it.

Use cases

Tasks you can solve with web scraping are

What is data analysis

Speaker: Martin