arackpy.spider

A multithreaded web crawler and scraper

class arackpy.spider.Spider(backend=u'Default')

Create a spider.

The spider is implemented using two queues. Reader threads get urls from the active queue and put newly discovered urls into the empty queue. When the active queue is exhausted, it is swapped with the once empty but now full queue. New reader threads are spawned at every level and process urls from the active queue.
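
For illustration only, the following single-threaded sketch mimics the queue swap described above; the names crawl and fetch_links are made up for the example and are not part of the arackpy API.

    from queue import Queue

    MAX_URLS_PER_LEVEL = 1000

    def fetch_links(url):
        """Placeholder for downloading ``url`` and extracting its child urls."""
        return []

    def crawl(start_urls, max_levels=100):
        active, empty = Queue(MAX_URLS_PER_LEVEL), Queue(MAX_URLS_PER_LEVEL)
        for url in start_urls:
            active.put(url)

        for level in range(max_levels):
            # In the real spider, reader threads drain the active queue;
            # this loop stands in for them.
            while not active.empty():
                url = active.get()
                for child in fetch_links(url):
                    if not empty.full():
                        empty.put(child)
            # All readers have joined: the active queue is now empty, so
            # swap it with the freshly filled queue and start the next level.
            active, empty = empty, active
            if active.empty():
                break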

Urls are grouped by host server ip address, and the corresponding html is downloaded sequentially from each ip according to the requirements set in its robots.txt file. If a crawl delay is not explicitly specified in the robots file, the default wait_time_range is used.
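
The grouping step might look roughly like the sketch below, which assumes DNS resolution with socket.gethostbyname and a plain urllib download; the actual grouping and download code in arackpy may differ.

    import random
    import socket
    import time
    from collections import defaultdict
    from urllib.parse import urlparse
    from urllib.request import urlopen

    def group_by_ip(urls):
        """Map each resolvable host ip to the urls served from it."""
        groups = defaultdict(list)
        for url in urls:
            host = urlparse(url).hostname
            try:
                groups[socket.gethostbyname(host)].append(url)
            except (socket.gaierror, TypeError):
                continue   # unresolvable or malformed url: skip it
        return groups

    def download_group(urls, wait_time_range=(1, 5), timeout=5):
        """Fetch urls that share an ip one at a time, waiting in between."""
        pages = {}
        for url in urls:
            with urlopen(url, timeout=timeout) as response:
                pages[url] = response.read()
            time.sleep(random.uniform(*wait_time_range))
        return pages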

The spider terminates when either of two limits, namely max_urls and max_levels, is reached. Pressing Ctrl-C will also interrupt and terminate the process, albeit in a harsh manner.

Parameters:
start_urls : list

The starting point for the spider.

wait_time_range : tuple

A time interval from which a wait time is randomly selected.

follow_external_links : bool

If set to True, the spider will traverse domains outside the starting urls.

visit_history_limit : int

Used to set the cache size of the deque which keeps track of all the visited urls.

respect_server : bool

If set to True, the wait_time_range attribute is applied.

read_robots_file : bool

If set to True, the robots.txt file is parsed and checked. The spider honors the crawl delay and/or request rate, as well as whether the page may be crawled at all. If a delay cannot be determined, the default wait_time_range is used (see the sketch after this parameter list).

timeout : int

The timeout used when reading from a url. If the url cannot be read within the specified time, a timeout exception is raised.

thread_safe_parse : bool

If set to True, the parse method is made thread safe, which allows for easy debugging using print statements.

max_urls_per_level : int

Child urls immediately below the start urls form the first level. Since the number of urls per level can grow exponentially, a limit is imposed to prevent memory bottlenecks by setting a maximum size for the active and empty queues.

max_levels : int

The maximum number of levels to crawl before termination. Each time all the reader threads return (i.e. join), the previous level ends and the next begins.

max_urls : int

The total number of urls to crawl before termination. This is implemented using a counter that each reader thread increments by one after it reads a url.

debug : bool

Log all debug messages to stdout.
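
As a rough sketch of the read_robots_file behaviour described above, the standard library's urllib.robotparser can supply the fetch permission, crawl delay, and request rate; how arackpy itself parses the file is not shown in this section.

    import random
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    DEFAULT_WAIT_TIME_RANGE = (1, 5)

    def robots_policy(url, user_agent="*"):
        """Return (allowed, wait_seconds) for url based on its robots.txt."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        robots = RobotFileParser(root + "/robots.txt")
        robots.read()

        allowed = robots.can_fetch(user_agent, url)
        delay = robots.crawl_delay(user_agent)
        if delay is None:
            rate = robots.request_rate(user_agent)
            if rate is not None:
                delay = rate.seconds / rate.requests
        if delay is None:
            # Nothing specified in robots.txt: fall back to the default range.
            delay = random.uniform(*DEFAULT_WAIT_TIME_RANGE)
        return allowed, delay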

TODO

In order of importance:

  1. Implement a Bloom filter for the visited urls cache instead of a deque.
  2. Implement backends like BeautifulSoup, Tor, Selenium, and proxies.

BUGS

  1. Fix http://example.com not being treated as the same url as https://example.com/.
  2. Fix PyPI not showing code syntax highlighting.

Methods

parse(url, html)

User code used to handle each url and its corresponding html.
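
A minimal usage sketch: subclass Spider, override the class attributes listed below as needed, and implement parse. The crawl() call at the end is hypothetical and is not confirmed by this section; consult the package for the actual method that starts the spider.

    from arackpy.spider import Spider

    class MySpider(Spider):

        start_urls = ["http://example.com"]
        wait_time_range = (2, 4)
        max_levels = 3

        def parse(self, url, html):
            # User code: handle each url and its downloaded html.
            print(url, len(html))

    if __name__ == "__main__":
        spider = MySpider()
        spider.crawl()   # hypothetical entry point, not confirmed by this section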

Attributes

start_urls = []
wait_time_range = (1, 5)
visit_history_limit = 2000
respect_server = True
read_robots_file = True
timeout = 5
max_urls_per_level = 1000
max_levels = 100
debug = False