Take your web scraping to a new level: Let's play with scrapy
Key takeaways:
- Web scraping is a useful and unique project that is good for beginners.
- Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing reusable code and features that are useful for web scraping.
- Making the transition to the scrapy framework is not always straightforward, but it will pay dividends.
Web scraping is fundamental to any data science journey. There's a lot of information out there on the world wide web, but very little of it is presented in a pretty interface that lets you simply take it. By scraping information off websites, you get structured data. It's a unique challenge that is doable for a beginner.
There are thus a lot of articles which teach you how to scrape websites — here’s mine.
After spending gobs of time poring over books and web articles, I created a web scraper that did the job right.
The code repository of the original web scraper is available on GitHub: houfu/pdpc-decisions (Data Protection Enforcement Cases in Singapore).
I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.
Levelling up... is not easy.
Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.
To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.
Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.
Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.
Enter scrapy
The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies or throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.
Enter scrapy.
The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age is screaming at you when it recommends you install it in a “virtual environment” (who doesn’t install anything in Python except in a virtual environment?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.
I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.
Transitioning to scrapy
Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:
scrapy startproject tutorial
This reminds me of Angular and the `ng` command. (Do people still do that these days?)
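If you're wondering what that first command gives you, it lays out a project skeleton roughly like this (as in the scrapy tutorial; the outer folder takes the project name you passed in):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders live here
            __init__.py
```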
While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point in a command-line function, which seemed the most obvious place to start.
@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    # 'options' is assembled from the command-line arguments (omitted here for brevity)
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time
    logger.info('Finished. This took {}s.'.format(diff))
The original code was shortened to highlight the process.
The organisation of a scrapy project is different. You can generate a new project with the command above. However, the spider does the web crawling, and you have to create that within your project separately. If you started coding, you would not find this intuitive.
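If you prefer not to create the spider file by hand, scrapy has a generator for this step too. Run inside the project directory, a command along these lines creates a skeleton spider in the project's spiders folder:

scrapy genspider PDPCCommissionDecisions www.pdpc.gov.sg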
For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.
import requests
import scrapy
from scrapy import FormRequest

class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        # CASE_LISTING_URL and create_form_data are defined elsewhere in the project
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1"
        }
        response = requests.post(CASE_LISTING_URL, data=default_form_data)
        if response.status_code == requests.codes.ok:
            response_json = response.json()
            total_pages = response_json["totalPages"]
            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))
Now, you need to write another function that deals with requests and yields items, the standard data format in scrapy.
from datetime import datetime

def parse(self, response, **kwargs):
    response_json = response.json()
    for item in response_json["items"]:
        # DPObligations and DecisionType are defined elsewhere in the project
        nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] if item["nature"] else "None"
        decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] if item["decision"] else "None"
        yield CommissionDecisionItem(
            title=item["title"],
            summary_url=f"https://www.pdpc.gov.sg{item['url']}",
            published_date=datetime.strptime(item["date"], '%d %b %Y'),
            nature=nature,
            decision=decision
        )
You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)
Run the spider using this command in the project directory:
scrapy crawl PDPCCommissionDecisions -o output.csv
Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because scrapy already handles requests concurrently (it’s built on asynchronous networking), so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.
Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.
It’s all in the pipeline.
If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Pipelines process an item after the spider has scraped it, handling anything that doesn’t involve generating more requests. The most common pipeline component checks an item to see if it’s a duplicate and then drops it if that’s true.
Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out, though, that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program run concurrent and asynchronous operations effectively.
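To get a feel for what a pipeline component looks like, here is a minimal duplicate filter along the lines of the example in the scrapy documentation (it assumes items carry an `id` field, which the items in this post don't actually have):

```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    """Drop any item whose id has already been seen in this crawl."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter["id"] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(adapter["id"])
        return item
```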
In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:
import re

import bs4
import requests
from itemadapter import ItemAdapter

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, flags=re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"https://www.pdpc.gov.sg{decision_link['href']}"
        adapter["file_urls"] = [f"https://www.pdpc.gov.sg{decision_link['href']}"]

        return item
This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.
Note also the item has a field called `file_urls`. I did not create this data field. It’s a field used to tell scrapy to download a file from the web.
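For completeness, here is a sketch of what the item declaration could look like, pieced together from the fields used in this post (the real `CommissionDecisionItem` may be defined differently):

```python
import scrapy


class CommissionDecisionItem(scrapy.Item):
    # filled in by the spider's parse method
    title = scrapy.Field()
    summary_url = scrapy.Field()
    published_date = scrapy.Field()
    nature = scrapy.Field()
    decision = scrapy.Field()
    # filled in by the summary page pipeline component
    summary = scrapy.Field()
    respondent = scrapy.Field()
    decision_url = scrapy.Field()
    # read by scrapy's files pipeline, which records its downloads in 'files'
    file_urls = scrapy.Field()
    files = scrapy.Field()
```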
You can activate pipeline components as part of the spider’s settings.
ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}
In this example, the pipeline has two components. Given a priority of 300 (lower numbers run earlier), `CommissionDecisionSummaryPagePipeline` goes first. `PDPCDecisionDownloadFilePipeline` then downloads the files listed in the `file_urls` field we referred to earlier.
Note also that `PDPCDecisionDownloadFilePipeline` is an implementation of the standard `FilesPipeline` component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.
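The downloading subclass itself can be very thin. The sketch below is not the exact code in pdpc-decisions, just an illustration of the idea: the stock `FilesPipeline` already reads `file_urls` and fetches the files, so a subclass usually only tweaks details such as the saved file name.

```python
from scrapy.pipelines.files import FilesPipeline


class PDPCDecisionDownloadFilePipeline(FilesPipeline):
    """A sketch only: FilesPipeline downloads everything in file_urls;
    overriding file_path merely controls the name the file is saved under."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # hypothetical choice: save each PDF under the last segment of its URL
        return request.url.split("/")[-1]
```

Remember to set `FILES_STORE` in the settings (a local folder or a cloud bucket); without it, the files pipeline stays disabled.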
Once again, it’s odd not to write code to download files. Furthermore, writing components for your pipeline and deciding on their seniority in a settings file isn’t very intuitive if you’re not sure what’s going on. Once again, I am grateful that I did not have to write my own pipeline.
I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!
Settings, settings everywhere
Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting to define the seniority of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.
I am not going to go through each of them in this article. To me, though, the number of settings isn’t user-friendly. Still, it hints at the number of things you can do with scrapy, including randomising the delay before downloading files from the same site, logging and settings for storage adapters to common backends like AWS or FTP.
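To give a flavour, the snippet below shows a few real settings with purely illustrative values (the bucket name is made up):

```python
# settings.py: a handful of the many knobs available (values are illustrative)
DOWNLOAD_DELAY = 2               # pause between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary that pause (0.5x to 1.5x) so the crawl looks less robotic
AUTOTHROTTLE_ENABLED = True      # adapt crawl speed to how the site responds
LOG_LEVEL = "INFO"
FILES_STORE = "s3://my-bucket/pdpc-decisions/"  # hypothetical bucket; a local path or FTP URL also works
```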
As a popular and established framework, you will also find an ecosystem. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Proxy services are also available commercially if your operations are very demanding.
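For example, once scrapyd is running and the project has been deployed to it, scheduling a crawl is a single HTTP call to its API (the project and spider names below follow the examples in this post):

curl http://localhost:6800/schedule.json -d project=pdpcSpider -d spider=PDPCCommissionDecisions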
There are lots to explore here!
Conclusion
Do I have any regrets about doing `pdpc-decisions`? Nope. I learned a lot about programming in Python doing it. It made me appreciate what special considerations are involved in web scraping and how scrapy helps to deal with them.
I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly or write tests to show what the code was doing.
Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.
Scrapy affords power and convenience that is specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.
Read more interesting adventures: Data Science with Judgement Data – My PDPC Decisions Journey, an experiment to apply what I learnt in data science to the area of law.
#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy
Love.Law.Robots. – A blog by Ang Hou Fu