arackpy.spider

A multithreaded web crawler and scraper

class arackpy.spider.Spider(backend=u'Default')

Create a spider.

The spider is implemented using two queues. Reader threads get urls from the active queue and put newly discovered urls into the empty queue. When the active queue is exhausted, it is swapped with the once empty but now full queue. New reader threads are spawned at every level and process urls from the active queue.
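
For illustration only, the following single-threaded sketch mimics the queue swap described above; the names crawl and fetch_links are made up for the example and are not part of the arackpy API.

    from queue import Queue

    MAX_URLS_PER_LEVEL = 1000

    def fetch_links(url):
        """Placeholder for downloading ``url`` and extracting its child urls."""
        return []

    def crawl(start_urls, max_levels=100):
        active, empty = Queue(MAX_URLS_PER_LEVEL), Queue(MAX_URLS_PER_LEVEL)
        for url in start_urls:
            active.put(url)

        for level in range(max_levels):
            # In the real spider, reader threads drain the active queue;
            # this loop stands in for them.
            while not active.empty():
                url = active.get()
                for child in fetch_links(url):
                    if not empty.full():
                        empty.put(child)
            # All readers have joined: the active queue is now empty, so
            # swap it with the freshly filled queue and start the next level.
            active, empty = empty, active
            if active.empty():
                break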

Urls are grouped by host server ip address, and the corresponding html is downloaded sequentially from each ip according to the requirements set in its robots.txt file. If a crawl delay is not explicitly specified in the robots file, the default wait_time_range is used.
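
The grouping step might look roughly like the sketch below, which assumes DNS resolution with socket.gethostbyname and a plain urllib download; the actual grouping and download code in arackpy may differ.

    import random
    import socket
    import time
    from collections import defaultdict
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def group_by_ip(urls):
        """Map each resolvable host ip to the urls served from it."""
        groups = defaultdict(list)
        for url in urls:
            host = urlparse(url).hostname
            try:
                groups[socket.gethostbyname(host)].append(url)
            except (socket.gaierror, TypeError):
                continue   # unresolvable or malformed url: skip it
        return groups

    def download_group(urls, wait_time_range=(1, 5), timeout=5):
        """Fetch urls that share an ip one at a time, waiting in between."""
        pages = {}
        for url in urls:
            with urlopen(url, timeout=timeout) as response:
                pages[url] = response.read()
            time.sleep(random.uniform(*wait_time_range))
        return pages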

The spider terminates when either of two limits, namely max_urls and max_levels, is reached. Pressing Ctrl-C will also interrupt and terminate the process, albeit in a harsh manner.

Parameters:
start_urls : list

The starting point for the spider.

wait_time_range : tuple

A time interval from which a wait time is randomly selected.

follow_external_links : bool

If set to True, the spider will traverse domains outside the starting urls.

visit_history_limit : int

Used to set the cache size of the deque which keeps track of all the visited urls.

respect_server : bool

If set to True, the wait_time_range attribute is applied.

read_robots_file : bool

If set to True, the robots.txt file is parsed and checked. The spider honors the crawl delay and/or request rate, as well as whether the page may be crawled at all. If a delay cannot be determined, the default wait_time_range is used (see the sketch after this parameter list).

timeout : int

The timeout used when reading from a url. If the url cannot be read within the specified time, a timeout exception is raised.

thread_safe_parse : bool

If set to True, the parse method is made thread safe, which allows for easy debugging using print statements.

max_urls_per_level : int

Child urls immediately below the start urls form the first level. Since the number of urls per level can grow exponentially, a limit is imposed to prevent memory bottlenecks by setting a maximum size for the active and empty queues.

max_levels : int

The maximum number of levels to crawl before termination. Each time all the reader threads return (i.e. join), the previous level ends and the next begins.

max_urls : int

The total number of urls to crawl before termination. This is implemented using a counter that each reader thread increments by one after it reads a url.

debug : bool

Log all debug messages to stdout.
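
As a rough sketch of the read_robots_file behaviour described above, the standard library's urllib.robotparser can supply the fetch permission, crawl delay, and request rate; how arackpy itself parses the file is not shown in this section.

    import random
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    DEFAULT_WAIT_TIME_RANGE = (1, 5)

    def robots_policy(url, user_agent="*"):
        """Return (allowed, wait_seconds) for url based on its robots.txt."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        robots = RobotFileParser(root + "/robots.txt")
        robots.read()

        allowed = robots.can_fetch(user_agent, url)
        delay = robots.crawl_delay(user_agent)
        if delay is None:
            rate = robots.request_rate(user_agent)
            if rate is not None:
                delay = rate.seconds / rate.requests
        if delay is None:
            # Nothing specified in robots.txt: fall back to the default range.
            delay = random.uniform(*DEFAULT_WAIT_TIME_RANGE)
        return allowed, delay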

TODO

In order of importance:

  1. Implement a Bloom filter for the visited urls cache instead of a deque.
  2. Implement backends like BeautifulSoup, Tor, Selenium, and proxies.

BUGS

  1. Fix http://example.com not being treated as the same url as https://example.com/.
  2. Fix PyPI not showing code syntax highlighting.

Methods

parse(url, html)

User code used to handle each url and its corresponding html.
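
A minimal usage sketch: subclass Spider, override the class attributes listed below as needed, and implement parse. The crawl() call at the end is hypothetical and is not confirmed by this section; consult the package for the actual method that starts the spider.

    from arackpy.spider import Spider

    class MySpider(Spider):

        start_urls = ["http://example.com"]
        wait_time_range = (2, 4)
        max_levels = 3

        def parse(self, url, html):
            # User code: handle each url and its downloaded html.
            print(url, len(html))

    if __name__ == "__main__":
        spider = MySpider()
        spider.crawl()   # hypothetical entry point, not confirmed by this section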

Attributes

start_urls = []
wait_time_range = (1, 5)
visit_history_limit = 2000
respect_server = True
read_robots_file = True
timeout = 5
max_urls_per_level = 1000
max_levels = 100
debug = False