30th January, 2018
A Beautiful Soup
By Kingston Coker
“You might have to scrape it and use beautiful soup afterwards.”
This idea, proposed by my data scientist colleague, sounded all too rational to me. It took me a second to realise I was slipping into the box my mother had put me in when she said, “All you data scientists are weird”. I burst out laughing and asked my colleague not to make statements that might be misconstrued by non-programmers. I could only imagine what the other people in the room must have thought we were talking about. Just to help you out, the topic of discussion was neither how to prepare a meal nor how to eat one.
Scraping, in this context, is a programming technique for combing through web pages to extract information or other data you might want, without having to open each page manually and copy what you are looking for. This kind of operation is a data scientist’s candy because it opens so many doors for easy, quick and inexpensive data analysis. For instance, it can be very helpful for quickly extracting all the links in a web page and checking their validity.
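To make the link example concrete, here is a minimal sketch that uses only Python's standard library. The page content, the address https://example.com/ and the names LinkCollector and looks_valid are all made up for illustration. Note that "validity" here only means the link is a well-formed http(s) address; checking that a link actually works would mean requesting each one over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


def looks_valid(url):
    """Syntax check only: an http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Made-up page content, used here instead of a live download.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/about">About</a>
  <a href="/contact">Contact</a>
  <a href="mailto:someone@example.com">Email</a>
</body></html>
"""

collector = LinkCollector("https://example.com/")
collector.feed(SAMPLE_HTML)
for link in collector.links:
    print(link, looks_valid(link))
```

In this sketch the two web links pass the check and the mailto: link does not, because it has no http(s) scheme.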
There are free, easy-to-use tools written in different programming languages to aid any curious developer on this research path. My two favourites are “Scrapy” and the famous “Beautiful Soup”. Both are Python packages, though they do different jobs. Beautiful Soup is an HTML parser, meaning it breaks the content of a web page down into its logical, syntactic parts so that we can pull out the information we are looking for. Scrapy is a fuller framework for building internet bots (also known as web crawlers) that systematically surf the web, follow links and retrieve the desired content. Beautiful Soup does not download pages by itself, so it is usually paired with a separate tool that fetches them.
Scraping web pages is remarkably simple. With fewer than twenty lines of code, you can often scrape a straightforward website within minutes. Data in web browsers is typically rendered and displayed with the help of HTML tags, e.g. <html>, <body>, <head>, <div>, etc. Considering that a typical page may use up to 30 different tags in different combinations, it would be a tedious process to analyse web pages manually. Each line of the web page would have to be read, tracked and checked to make sure that only what was needed was extracted, and then copied to another format, such as a text file, for further analysis.
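As a rough idea of what those few lines can look like, here is a small sketch with Beautiful Soup (the bs4 package, which needs to be installed separately). The HTML, the class names "recipe" and "advert" and the variable names are invented for this example; a real scraper would first download the page and would have to match that page's actual tags.

```python
from bs4 import BeautifulSoup

# Made-up page content; a real scraper would download this first.
SAMPLE_HTML = """
<html>
  <head><title>Soup of the Day</title></head>
  <body>
    <div class="recipe"><h2>Tomato</h2><p>Serves four.</p></div>
    <div class="recipe"><h2>Leek and Potato</h2><p>Serves two.</p></div>
    <div class="advert"><p>Buy more spoons!</p></div>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed for it.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string

# Keep only the <div> blocks we care about and ignore the rest.
recipes = []
for block in soup.find_all("div", class_="recipe"):
    name = block.h2.get_text(strip=True)
    note = block.p.get_text(strip=True)
    recipes.append((name, note))

print(page_title)
print(recipes)
```

The parser does the reading, tracking and checking of tags for us, so the code only has to say which tags it wants.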
Technology has raised our expectations of friendlier, more responsive websites, which makes their underlying structures quite complex. For a data scientist, “beautifying the soup” is a must.