Web scraping is a technique for retrieving data from the web quickly and at scale. It helps people in all fields capture massive amounts of data and information from the internet.
As more and more people turn to web scraping to acquire data, automated tools like Octoparse are growing popular and helping people quickly turn web data into spreadsheets.
However, the web scraping process does put extra pressure on the target website. When a crawler runs unrestrained and sends an overwhelming number of requests to a website, the server could potentially crash. As a result, many websites “protect” themselves with anti-scraping mechanisms to avoid being “attacked” by web-scraping programs.
Luckily, for those who do need the data and want to scrape responsibly, there are ways to avoid being blocked by anti-scraping systems. In this article, we will look at some common anti-scraping mechanisms and discuss the corresponding solutions for tackling them.
1. Scraping speed matters
Most web scraping bots aim to fetch data as quickly as possible. However, this can easily expose you as a scraping bot, since no real human can surf the web that fast. Websites can track your access speed easily, and once the system finds you are going through the pages too fast, it will suspect that you are not human and block you by default.
Solution: Set random time intervals between requests. You can either add “sleep” calls in the code when writing a script, or set up wait times when building a crawler with Octoparse.
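For a hand-written script, the random-delay idea can be sketched like this; the delay bounds are illustrative, and `fetch` stands in for whatever request function you actually use (e.g. `requests.get`):

```python
import random
import time

def polite_get(url, fetch, min_delay=2.0, max_delay=6.0):
    """Fetch a URL, then pause for a random interval to mimic human pacing.

    `fetch` is whatever request function you use; the delay bounds
    here are placeholders, not recommended values.
    """
    response = fetch(url)
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```

Varying the interval matters: a fixed delay between every request is itself a machine-like pattern that some systems can detect.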
2. Dealing with CAPTCHA
CAPTCHA is by far one of the most effective ways to tell humans and robots apart, and it does a good job of preventing “attacks” from bots. Three types of CAPTCHA are most often used:
Type 1: Click the CAPTCHA option
Type 2: Enter the CAPTCHA code
Type 3: Select the specified images from all the given images.
Solution: With the rise of image-recognition technology, conventional CAPTCHAs can be cracked, though doing so is costly. Tools like Octoparse provide cheaper alternatives, with somewhat compromised results.
3. IP restriction
When a site detects a large number of requests coming from a single IP address, that IP address can easily be blocked. To avoid sending all of your requests through the same IP address, you can use proxy servers. A proxy server is a server (a computer system or an application) that acts as an intermediary for requests from clients seeking resources from other servers (from Wikipedia: Proxy server). It allows you to send requests to websites through the IP you set up, masking your real IP address.
Of course, if you route everything through a single proxy IP, it is still easy to get blocked. You need to create a pool of IP addresses and use them randomly, routing your requests through a series of different IP addresses.
Solution: Many services, such as VPNs, can provide rotating IPs. Octoparse Cloud Service is supported by hundreds of cloud servers, each with a unique IP address.
When an extraction task is set to be executed in the Cloud, requests are sent to the target website through various IPs, minimizing the chances of being traced. Octoparse local extraction allows users to set up proxies to avoid being blocked.
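For a hand-written script, rotating through a pool can be as simple as picking a random proxy per request. The addresses below are placeholders; substitute proxies you actually control:

```python
import random

# Hypothetical pool -- replace with proxy addresses you control.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def pick_proxies():
    """Pick a random proxy from the pool, formatted as the `proxies`
    mapping that the `requests` library expects."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Usage with requests (not run here):
# requests.get(url, proxies=pick_proxies())
```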
4. Scrape behind login
Logging in can be regarded as permission to gain more access to specific web pages, as on Twitter, Facebook, and Instagram. Take Instagram as an example: without logging in, visitors can see only 20 comments under each post.
Solution: Octoparse works by imitating human browsing behavior, so when login is required to access the data you need, you can easily incorporate the login steps, i.e., entering the username and password, as part of the workflow. More details can be found in Extract data behind a login.
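In a hand-written script, the same idea means submitting the login form and carrying the resulting cookies on later requests. A minimal sketch of building the form submission with the standard library follows; the field names `username` and `password` are assumptions, so inspect the site's actual login form to find the real input names:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_login_request(login_url, username, password):
    """Build a POST request for a login form.

    The field names 'username' and 'password' are assumptions --
    check the form's actual <input name="..."> values.
    """
    data = urlencode({"username": username, "password": password}).encode()
    return Request(login_url, data=data, method="POST")

# To keep the login cookies across requests, open this request with an
# opener that uses urllib.request.HTTPCookieProcessor (or use a
# requests.Session, which handles cookies automatically).
```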
5. Be aware of honeypot traps
Honeypots are links that are invisible to normal visitors but present in the HTML code, where web scrapers can find them. They act as traps, detecting scrapers by directing them to blank pages. Once a visitor follows a honeypot link, the website can be fairly sure it is not a human visitor and will start throttling or blocking all requests from that client.
Solution: When building a scraper for a particular site, it is worth checking carefully for hidden links, using a standard browser.
Octoparse uses XPath for precise capturing or clicking actions, which helps avoid clicking fake links.
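One common honeypot pattern is a link hidden with inline CSS. A first-pass check for that pattern can be sketched with the standard library's HTML parser; note that real sites may also hide traps via external stylesheets, which this simple scan will not catch:

```python
from html.parser import HTMLParser

class HoneypotFinder(HTMLParser):
    """Collect hrefs of <a> tags hidden with inline CSS."""

    def __init__(self):
        super().__init__()
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            self.hidden_links.append(attrs.get("href"))

# Example page: one normal link and one hidden trap link.
page = '<a href="/real">ok</a><a href="/trap" style="display: none">x</a>'
finder = HoneypotFinder()
finder.feed(page)
# finder.hidden_links now contains only the hidden link, "/trap"
```

A scraper can then skip any URL that appears in `hidden_links` before queuing requests.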
6. Pages with different layouts
To avoid being scraped easily, some websites are built with slightly different page layouts. For example, pages 1 to 10 of a directory listing may look slightly different than pages 11 to 20 from the same list.
Solution: There are two ways to solve this. For hand-coded crawlers, a new set of code is needed for each layout. For crawlers built with Octoparse, you can easily add a “Branch Judgment” step to the workflow to tell the different layouts apart, then proceed to extract the data precisely.
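In a hand-written crawler, the branching step amounts to detecting which layout a page uses before choosing selectors. The marker string below is a placeholder; pick any attribute or class that reliably distinguishes the layouts on your target site:

```python
def detect_layout(html):
    """Guess which layout variant a page uses so the scraper can branch.

    The marker 'listing-v2' is a placeholder -- substitute whatever
    reliably distinguishes the layouts on the site you are scraping.
    """
    return "v2" if 'class="listing-v2"' in html else "v1"

# Each branch then applies its own selectors, much like Octoparse's
# "Branch Judgment" step (parser names below are hypothetical):
# if detect_layout(page) == "v2":
#     items = parse_new_layout(page)
# else:
#     items = parse_old_layout(page)
```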
I hope the tips above help you build your own solution or improve your current one. You’re welcome to share your ideas with us, or to let us know if you feel anything could be added to the list.