Websites contain a lot of useful information, such as job vacancies on job boards, public opinion on social media, or travel times and ticket prices from transport companies. Sometimes information from websites is also crucial for research, audits, and the inspection or supervision of companies (e.g. by government agencies).
Often the only way to obtain this online information is to extract it directly from the websites themselves. Done manually, this is inefficient and time-consuming. More and more parties are therefore turning to web scraping: a technique in which software is used to automatically extract information from web pages.
The web scraping process combines several techniques: crawling websites, parsing their pages, and storing the extracted data in a structured way. The data is usually saved to a database or spreadsheet, after which it can be used and analyzed further.
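The parse-and-store steps can be sketched in Python using only the standard library. The HTML fragment, the `vacancy` class, and the CSV layout below are all hypothetical stand-ins for a downloaded job-board page; a real scraper would first fetch the page with an HTTP client.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a fetched job-board page;
# in practice the HTML would come from an HTTP request.
PAGE = """
<ul>
  <li class="vacancy"><a href="/jobs/1">Data Analyst</a></li>
  <li class="vacancy"><a href="/jobs/2">Web Developer</a></li>
</ul>
"""

class VacancyParser(HTMLParser):
    """Collects (link, title) pairs from <li class="vacancy"> items."""
    def __init__(self):
        super().__init__()
        self.in_vacancy = False
        self.current_href = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "vacancy":
            self.in_vacancy = True
        elif tag == "a" and self.in_vacancy:
            self.current_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_vacancy = False

    def handle_data(self, data):
        # Only text inside a vacancy's <a> tag is a job title.
        if self.in_vacancy and self.current_href:
            self.rows.append((self.current_href, data.strip()))
            self.current_href = None

parser = VacancyParser()
parser.feed(PAGE)

# Store the structured result as CSV (here an in-memory buffer;
# a real pipeline would write to a file or a database).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["url", "title"])
writer.writerows(parser.rows)
print(buf.getvalue())
```

In practice a dedicated parsing library (such as BeautifulSoup) is often used instead of the low-level `HTMLParser`, but the shape of the pipeline, parse then store, stays the same.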
A major advantage of web scraping is its speed. Where manually extracting specific information from websites can be done at most a few hundred times a day, a fully automated process can handle tens of thousands of extractions a day, and the scraping algorithm itself can often be written in just a few hours. This efficiency, combined with the ability to sort, collate, and store data in a structured form, makes web scraping an ideal, and sometimes the only realistic, way to extract large amounts of data from websites.
Beyond extracting useful information quickly, web scraping has other applications as well, such as automatically testing websites for security issues, errors, or missing content.
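To give a flavor of such an automated check, the minimal sketch below scans an HTML snippet for two common defects: anchors without an `href` and images without alt text. The snippet and the defect list are illustrative assumptions, not a complete test suite.

```python
from html.parser import HTMLParser

# Hypothetical snippet of a page under test; a real check would run
# against fetched production pages.
PAGE = '<a>broken link</a><img src="logo.png"><img src="x.png" alt="X">'

class PageChecker(HTMLParser):
    """Flags simple page defects: <a> without href, <img> without alt."""
    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" not in attrs:
            self.issues.append("anchor without href")
        if tag == "img" and not attrs.get("alt"):
            self.issues.append(f"image without alt: {attrs.get('src')}")

checker = PageChecker()
checker.feed(PAGE)
for issue in checker.issues:
    print(issue)
```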
In general, web scraping is well suited to eliminating repetitive administrative work on online sources and to speeding up the data collection process.
Not only websites but also all kinds of online documents (such as PDF, DOC, and XML files) can be searched and downloaded automatically using web scraping. To extract high-quality information from texts and documents, web scraping is generally combined with text mining and text analytics, techniques that extract relevant information and patterns from large volumes of text at high speed. Information locked inside collected documents, such as scanned PDFs, can also be extracted using OCR (optical character recognition).
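As a small illustration of pattern-based text mining, the sketch below pulls e-mail addresses and dd-mm-yyyy dates out of free text with regular expressions. The sample text and both patterns are assumptions made for this example; real text-mining pipelines use much richer techniques (tokenization, entity recognition, and so on).

```python
import re

# Hypothetical text as it might come out of a downloaded document,
# for instance after OCR of a scanned PDF.
TEXT = """
Invoice 2021-0042 was sent to info@example.org on 12-03-2021.
A reminder went to billing@example.org on 28-04-2021.
"""

# Pattern-based extraction: e-mail addresses and dd-mm-yyyy dates.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", TEXT)
dates = re.findall(r"\b\d{2}-\d{2}-\d{4}\b", TEXT)

print(emails)
print(dates)
```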
We regularly scrape publicly accessible websites and online registers for supervisory authorities and audit firms, and we also apply web scraping in research aimed at detecting crime and subversion. A privacy check is always performed beforehand, including an assessment of each website's terms and conditions, to determine what is and is not permitted.