Double Debug
—
July 20, 2022
Have you ever been in a situation where you want to use some data from a website, but there's no public API available? Well, today web scraping is easier than ever, and you can do it with just a few lines of code. Modern libraries such as Puppeteer, Selenium, Playwright and others do an excellent job of simplifying this process for you and making it super beginner-friendly.
Web scraping is the process of extracting data from a website. This is done programmatically: you can code a bot and tell it where to navigate, which part of the page to grab, where to click, what to type, or even when to take a screenshot. This can be useful in many ways. As I mentioned before, some online services don't have a public API that you can access. So, some people might use web scraping to navigate a website directly, extract the data they need and use it in their web application.
Some people might scrape weather data from a weather channel and use it for their IoT application. Some might use it for cryptocurrency price monitoring. Others might create a Twitter bot that displays the latest Elon Musk tweet on their homepage. The possibilities are endless.
Web scraping is not hard at all. With a modern JavaScript library like Puppeteer, it is very beginner-friendly and can be done with very few lines of code, depending on what exactly you want to scrape.
If you'd like to try it, I encourage you to follow my YouTube tutorial, where I describe the process step by step.
In this example, I created a Node.js API that navigates to google.com and retrieves all autocomplete suggestions for a specified text. I created a function called getGoogleSuggestions that takes the search query as its only argument. As I explained in the video, the bot navigates the website exactly the same way a person would.
First, it opens a new tab and types in the address: google.com. Next, it clicks on the search box and types the search query. After that, the list of autocomplete suggestions pops up below the search bar. This is where we can target these DOM elements using CSS selectors (id, class name, tag name, etc.). In this case, I needed to use tag names to select them because:
Fun fact: For some DOM elements, Google uses a CSS preprocessor that randomizes class names.
I assume this is a way of discouraging web scraping on their website, but don't quote me on that. Either way, I couldn't use IDs or class names to select the suggestions, so I went with their tag names, which isn't always possible.
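The steps above can be sketched roughly as follows. This is a minimal sketch, not the code from the video: the helper name `collectSuggestionTexts`, the `textarea[name="q"]` search-box selector and the `ul li` tag-name selector are all assumptions, and Google's markup changes often enough that they may need adjusting (a cookie consent dialog may also get in the way). The function takes an already-open Puppeteer `page` so the browser setup stays separate.

```javascript
// Assumed selectors: treat these as placeholders, not as Google's actual markup.
const SEARCH_BOX = 'textarea[name="q"]';
const SUGGESTION_ITEMS = 'ul li'; // tag names only, since class names are unreliable

// Drives an open Puppeteer page the way a person would:
// go to google.com, click the search box, type the query,
// wait for the suggestion list, then read the text of each item.
async function collectSuggestionTexts(page, query) {
  await page.goto('https://www.google.com');
  await page.click(SEARCH_BOX);
  await page.type(SEARCH_BOX, query);
  await page.waitForSelector(SUGGESTION_ITEMS);
  // The callback runs inside the browser and returns plain strings.
  return page.$$eval(SUGGESTION_ITEMS, (items) =>
    items.map((item) => item.textContent)
  );
}
```

With Puppeteer installed, a page could come from `const browser = await puppeteer.launch()` followed by `const page = await browser.newPage()`, with `await browser.close()` once the texts are collected.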
Once I had selected all the Google suggestions, I filtered and formatted them in a way that's more consumable. The final result of this function is a string array of Google suggestions, and the function is exposed on a REST API endpoint.
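The filtering step and the endpoint described above might look something like this. It is a hedged sketch using only Node's built-in `http` module: the names `formatSuggestions` and `createSuggestionServer`, the `/suggestions?q=...` route and the exact cleanup rules are assumptions for illustration, not the code from the video.

```javascript
const http = require('node:http');

// Cleans up raw suggestion texts: collapses whitespace, drops empty
// entries, and removes duplicates while keeping the original order.
function formatSuggestions(rawTexts) {
  const cleaned = rawTexts
    .map((text) => (text || '').replace(/\s+/g, ' ').trim())
    .filter((text) => text.length > 0);
  return [...new Set(cleaned)];
}

// Builds an HTTP server that answers /suggestions?q=... with a JSON array.
// `getSuggestions` is whatever async function does the actual scraping
// and resolves to an array of raw strings.
function createSuggestionServer(getSuggestions) {
  return http.createServer(async (req, res) => {
    const url = new URL(req.url, 'http://localhost');
    const query = url.searchParams.get('q');
    if (url.pathname !== '/suggestions' || !query) {
      res.writeHead(400, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'Use /suggestions?q=<search text>' }));
      return;
    }
    try {
      const suggestions = formatSuggestions(await getSuggestions(query));
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(suggestions));
    } catch (err) {
      // The scraped site being down or changed surfaces here.
      res.writeHead(502, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'Scraping failed' }));
    }
  });
}
```

Calling `createSuggestionServer(scrape).listen(3000)`, where `scrape` is your scraping function and 3000 is an arbitrary port, would start serving the endpoint locally.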
One big question mark when it comes to web scraping is: is it legal? The answer is that it depends on the website. Some websites, such as Twitter, disallow web scraping in their Terms of Service.
"...scraping the Services without the prior consent of Twitter is expressly prohibited."
However, scraping publicly accessible data is often considered legal, and there are thousands of services that scrape Twitter data every day. This is the case with many other websites, too, and this is why web scraping is a notorious legal gray area.
Another, perhaps more interesting, disadvantage is that if you're scraping data for your website, you will always rely on third-party websites. If the service you're scraping isn't available today for some reason, your website won't be either. If that website's servers are really slow, your load times will be very high as well.
This is the downside you have to deal with when relying on third-party websites for data.
Speaking of load times, even if the website works completely fine, web scraping is generally a slow process. In the Google suggestions example, I was using Node and Puppeteer, and I was getting my results in about 1.4 seconds on average. In a world where APIs are expected to respond within 200 ms or less, this is really slow. This is definitely a factor you should consider before committing to the web scraping approach in your application.
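To get a comparable number for your own scraper, one simple approach is to time a handful of runs and average them. The helper below is a minimal sketch; `averageDurationMs` is a hypothetical name, and the default of five runs is an arbitrary choice.

```javascript
const { performance } = require('node:perf_hooks');

// Runs an async task several times, one after another, and
// returns the average duration of a run in milliseconds.
async function averageDurationMs(task, runs = 5) {
  let total = 0;
  for (let i = 0; i < runs; i += 1) {
    const start = performance.now();
    await task();
    total += performance.now() - start;
  }
  return total / runs;
}
```

Passing in a function that performs one full scrape, for example `averageDurationMs(() => getGoogleSuggestions('web scraping'))`, gives a rough figure to compare against your response-time budget.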