Overview

arackpy is a simple but powerful web crawler and scraper. While it is good-natured and respectful by default, it can be used to do evil. Remember: with great power comes great responsibility.

Some features of arackpy are:

  1. Concurrent page downloads using Python threads

  2. Support for robots.txt to avoid overloading host servers

  3. Different backends for additional capabilities such as:

    1. Dealing with JavaScript/AJAX requests, and
    2. Anonymous scraping using Tor and proxies.
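
To illustrate what honoring robots.txt means in practice, here is a minimal sketch using only the Python 3 standard library (`urllib.robotparser`). It is not arackpy's internal implementation. The robots.txt content, the `hello-spider` user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "hello-spider" is an assumed user agent name, not one defined by arackpy.
allowed = parser.can_fetch("hello-spider", "https://www.example.com/index.html")
blocked = parser.can_fetch("hello-spider", "https://www.example.com/private/data.html")
delay = parser.crawl_delay("hello-spider")

print("index allowed:", allowed)
print("private allowed:", blocked)
print("seconds between requests:", delay)
```

A polite crawler skips any URL for which `can_fetch` returns False and waits at least the crawl delay between requests to the same host.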

Join us on the mailing list. Coming soon!

Requirements

arackpy currently supports Python 2.7 and 3.6+ out of the box. Depending on the data you want to extract, future releases will require one or more of the dependencies below to be installed to support the various backends:

  • BeautifulSoup
  • Selenium
  • requests
  • stem
  • fake_useragent

Installation

For a vanilla arackpy install, a simple pip install is enough:

pip install arackpy

The following packages may also be installed:

pip install bs4 selenium requests stem fake_useragent
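
To see which of the optional packages are already present in your environment, a short check like the one below may help. This is a sketch for Python 3 using the standard library only. The helper name `available_modules` is hypothetical and not part of arackpy. The module names are the import names of the packages listed above.

```python
from importlib.util import find_spec

# Import names of the optional packages listed in the requirements.
OPTIONAL_MODULES = ["bs4", "selenium", "requests", "stem", "fake_useragent"]


def available_modules(names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = available_modules(OPTIONAL_MODULES)
for name, found in sorted(status.items()):
    print("%-15s %s" % (name, "installed" if found else "missing"))
```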

Quickstart

Open up your favorite Python text editor and type the following:

# hello_spider.py

from __future__ import division     # for python 2.7

from arackpy.spider import Spider


class HelloSpider(Spider):
    """A simple spider in just five lines of working code"""

    start_urls = ["https://www.python.org"]

    def parse(self, url, html):
        """Extract data from the raw html"""
        print("Crawling url, %s" % url)


if __name__ == "__main__":
    print("Press Ctrl-c to stop crawling")
    spider = HelloSpider()
    spider.crawl()

Run the program using:

python hello_spider.py

Note

Press Ctrl-c to terminate crawling.
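
The parse method in the quickstart only prints the URL. As a rough sketch of what extracting data from the raw html could look like, the helper below pulls the page title out of an HTML string using the Python 3 standard library. The names `TitleParser` and `extract_title` are hypothetical and not part of arackpy, and the sketch assumes the html argument passed to parse is a text string.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Made-up sample page standing in for the html a spider would receive.
sample = "<html><head><title> Welcome to Python.org </title></head><body></body></html>"
title = extract_title(sample)
print("Title: %s" % title)
```

Inside a spider, the same helper could be called from parse, for example `print(extract_title(html))`, under the assumption stated above about the type of html.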

Programming Guide

The arackpy Programming Guide provides in-depth documentation for writing applications using arackpy. Many topics described here reference the arackpy API reference, which is listed below.

If this is your first time reading about arackpy, we suggest you start at Writing an arackpy application.

API Reference

Developer Guide

These documents describe how to develop arackpy itself. Read them for a more detailed insight into how arackpy is designed and how to help make it even better. Get in touch if you would like to contribute!

Third Party Libraries

Listed here are a few third-party libraries that you might find useful when developing your projects. Please direct any questions about them to their respective authors.