Overview¶
arackpy is a simple but powerful web crawler and scraper. While it is good natured and respectful by default, it can be used to do evil. Remember: with great power comes great responsibility.
Some features of arackpy are:
Concurrent page downloads using Python threads
Support for robots.txt to avoid overloading host servers
Different backends for additional capabilities such as:
- Handling JavaScript/AJAX requests
- Anonymous scraping using Tor and proxies
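The robots.txt support mentioned above can be sketched with Python's standard library alone. The rules and URLs below are illustrative examples, not part of arackpy's API:

```python
# Sketch of robots.txt handling using only the standard library.
# A well-behaved crawler checks these rules before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally the rules are fetched from the host's /robots.txt;
# here they are parsed directly for the sake of the example.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("*", "https://example.com/index.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/x"))   # False
print(rp.crawl_delay("*"))                                  # 2
```

A crawler would skip any URL for which can_fetch returns False and sleep for crawl_delay seconds between requests to the same host.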
Join us on the mailing list. Coming soon!
Requirements¶
arackpy currently supports Python 2.7 and 3.6+ out of the box. Depending on the data you want to extract, future releases will require one or more of the dependencies below to be installed to support the various backends:
- BeautifulSoup
- Selenium
- requests
- stem
- fake_useragent
Installation¶
For a vanilla arackpy install, do a simple pip install:
pip install arackpy
The following optional packages may also be installed:
pip install bs4 selenium requests stem fake_useragent
Quickstart¶
Open your favorite text editor and type the following:
# hello_spider.py
from __future__ import division  # for python 2.7

from arackpy.spider import Spider


class HelloSpider(Spider):
    """A simple spider in just five lines of working code"""

    start_urls = ["https://www.python.org"]

    def parse(self, url, html):
        """Extract data from the raw html"""
        print("Crawling url, %s" % url)


if __name__ == "__main__":
    print("Press Ctrl-c to stop crawling")
    spider = HelloSpider()
    spider.crawl()
Run the program using:
python hello_spider.py
Note
Press Ctrl-c to terminate crawling.
Programming Guide¶
The arackpy Programming Guide provides in-depth documentation for writing applications using arackpy. Many topics described here reference the arackpy API reference, which is listed below.
If this is your first time reading about arackpy, we suggest you start at Writing an arackpy application.
API Reference¶
Developer Guide¶
These documents describe how to develop arackpy itself. Read these to gain a more detailed insight into how arackpy is designed, and how to help make arackpy even better. Get in touch if you would like to contribute!
Third Party Libraries¶
Listed here are a few third party libraries that you might find useful when developing your projects. Please direct any questions to the respective authors.
BeautifulSoup - A Python library for pulling data out of HTML and XML files.
Works well with static websites where all content is loaded at one time. If installed, arackpy uses bs4 to extract anchor tags from html pages.
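That anchor-tag extraction can be sketched as follows. The HTML snippet is made up for illustration, and this is a sketch of the general bs4 technique rather than arackpy's internal code:

```python
# Extract anchor tags (links) from an HTML page with BeautifulSoup,
# similar in spirit to what a crawler does to discover new URLs.
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="https://www.python.org">Python</a>
  <a href="/docs">Docs</a>
  <p>No link here</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# href=True skips anchor tags that have no href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['https://www.python.org', '/docs']
```

Relative links such as /docs would still need to be resolved against the page's base URL (e.g. with urllib.parse.urljoin) before they can be crawled.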
Selenium - A webdriver and test automation tool.
Works well with websites that use javascript and AJAX to dynamically update different sections of the website.
For Tor-backed anonymous scraping, the stem and fake_useragent libraries are required.
stem - A Python controller library for Tor.
fake_useragent - Provides randomized user agent strings.