While web scraping is widespread today, nearly everyone who does it shares the same apprehension:
Is web scraping legal?
Or is it illegal, and will it invite legal trouble?
It’s common knowledge that web scraping is a way of extracting data from websites. Many types of businesses depend on scraping data and analyzing it. But it is equally true that many people are not sure whether web scraping is legal.
Well, there are arguments on both sides.
On the one hand, when web scraping is carried out with “good bots”, it allows search engines to index information and data. It enables websites to offer price comparisons across products, which ultimately saves consumers money. Market researchers can gather vital data and measure market sentiment on social media with the help of web scraping.
On the other hand, web scraping can be abused. It can be employed for harmful purposes through malicious automation, better known as “bad bots”. Scraped data can enable a number of unhealthy practices: denial-of-service attacks, competitive data mining, data theft, account hijacking, violation of intellectual property, online fraud, unauthorized vulnerability scans, spam and digital ad fraud. Copyright infringement is another legal issue to consider.
The point is that there’s a thin line between the legal and illegal aspects of web scraping. It is, therefore, worth clarifying where that line falls.
There’s a difference between extracting and stealing data. Whether extracting data is legal depends on a number of factors.
Here are the top 6 criteria to consider before you plunge into web scraping.
- Before you set out to crawl and extract data, robots.txt is the first thing you should consider. It gives you a first indication of whether the site permits what you plan to do.
- Many websites document their rules for how bots should interact with the site in a robots.txt file.
- Some websites block bots completely, which is an indication that you should leave the site alone. Avoid scraping such a site.
- Keep in mind that scraping sites that block bots is unethical and can expose you to legal risk. Besides blocking bots outright, the robots.txt file also spells out what counts as “good behavior” on the site: the areas that are open for access, the restricted web pages and the crawl frequency limit.
- To stay out of legal trouble, respect all these rules and follow them at all times.
- Don’t be aggressive in crawling; use a reasonable crawl rate. Don’t pester the site with requests. Again, robots.txt comes into play: follow the Crawl-delay setting it specifies. If nothing is specified, you should still keep to a fair crawl rate of something like 1 request every 10-15 seconds.
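As a rough illustration of the robots.txt checks above, here is a minimal Python sketch built on the standard library’s `urllib.robotparser`. The sample robots.txt content, the `my-bot` user agent, the example.com URLs and the helper names (`build_parser`, `polite_delay`, `allowed_urls`) are all assumptions made up for this example, and the 10-second fallback simply mirrors the lower end of the 10-15 second suggestion.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Fallback delay (seconds) when robots.txt sets no Crawl-delay.
DEFAULT_DELAY_SECONDS = 10.0


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str) -> float:
    """Return the site's Crawl-delay for this agent, or the fallback."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots.txt lets this agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    candidates = [
        "https://example.com/products",
        "https://example.com/private/data",
    ]
    print(allowed_urls(parser, "my-bot", candidates))
    print(polite_delay(parser, "my-bot"))
```

Against a live site you would typically call `set_url()` with the site’s robots.txt address and then `read()` instead of parsing an inline string, and sleep for the returned delay between requests.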
Use an API if one is provided, instead of scraping data.
- Many of the websites you access already offer an API to their users. If not, one may already be on their roadmap.
- If an API is available, it is far more advisable to use it. An API puts you in a much better position with respect to access to data because you can request exactly the data you want, rather than pulling it out of pages built for human visitors.
- Once your integration with the API is fully functional, it works just fine and needs very little maintenance.
- In all, it keeps you safe in your quest for data and saves you from legal issues.
Respect the Terms of Service (ToS)
- In the quest for data, businesses and individuals at times fail to respect the Terms of Service.
- When a business has built up data on its site, it often doesn’t want you to scrape it. The Terms of Service will usually say so.
- Let’s say you want the data, you consider it publicly available, and you go ahead and set up a web scraper to collect it. That is not strictly illegal. However, you must bear in mind that the website is now in a position to initiate legal action against you for breach of contract. By violating the Terms of Service, you create a situation in which legal issues can arise, and court rulings can go against you.
Don’t hit the servers too frequently
- Web servers have a finite capacity, and they can crash when it is exceeded. Even if they don’t crash, they slow down and cannot function properly.
- If you don’t exercise restraint and send too many requests, the server may crash or at least slow down to the extent that it cannot load web pages efficiently. So refrain from sending requests too frequently.
- When a site slows down in this way, it leads to a poor user experience. Since the website primarily exists for its users, such a scenario defeats the purpose of the website’s existence.
- For a site, human visitors are the top priority, not bots. Therefore, leave a reasonable time gap between requests. Also make sure that you keep the number of parallel requests under control.
- If you exercise restraint in the way you extract data from a site, it will have the headroom it needs for its actual operations. Your scraping should never affect the core functionality of a website.
- As long as you follow the basics, you are unlikely to get into legal trouble. If you only extract data that is public, there’s hardly any reason to worry. If you don’t have permission from the site, don’t persist in extracting the data anyway.
- In other words, if the data can be accessed only by logging in, you must understand and accept that this data is for users and not for bots.
- Stick to scraping public data. In the United States, scraping private data without authorization can put you in violation of the Computer Fraud and Abuse Act (CFAA).
- If you scrape private data against the site’s wishes, it can be illegal and you can be sued.
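The earlier advice on leaving a time gap between requests and keeping parallel requests under control can be sketched as follows. This is only one possible approach, assuming a small thread pool plus a shared throttle; `Throttle`, `crawl` and the `fetch` callable are hypothetical names for this example, and the default values (one request start every 10 seconds, at most 2 workers) are illustrative rather than prescribed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space out request starts by at least `min_interval` seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic time of the next free slot

    def wait(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it
        # so other threads can reserve their own later slots.
        with self._lock:
            now = time.monotonic()
            sleep_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if sleep_for > 0:
            time.sleep(sleep_for)


def crawl(urls, fetch, min_interval=10.0, max_parallel=2):
    """Call fetch(url) for each URL with a capped number of workers
    and a minimum gap between request starts. Results keep input order."""
    throttle = Throttle(min_interval)

    def polite_fetch(url):
        throttle.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(polite_fetch, urls))
```

Here `fetch` stands in for whatever function actually downloads a page; passing it in keeps the throttling logic separate from the HTTP client you choose.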
Let’s take a look at a few key court cases and court rulings that found web scraping on the wrong side of the law:
Many such instances have been noted in the United States and court rulings are interesting to study!
- 2000 eBay v. Bidder’s Edge
- 2009 Facebook v. Power.com
- 2010 Cvent, Inc. v. Eventbrite, Inc.
- 2013 The Associated Press v. Meltwater U.S. Holdings
- May 2017: LinkedIn sent hiQ a cease-and-desist letter demanding that it stop scraping data, claiming that the scraping violated the federal anti-hacking law.