arackpy.spider¶
A multithreaded web crawler and scraper.
class arackpy.spider.Spider(backend=u'Default')¶
Create a spider.
The spider is implemented using two queues. Reader threads get from the active queue and put into the empty queue. When the active queue is empty, it is swapped with the once empty but now full queue. New reader threads are spawned at every level and process URLs from the active queue.
URLs are grouped by host server IP address, and the corresponding HTML is downloaded sequentially from each IP according to the requirements set in that server's robots.txt file. If a crawl delay is not explicitly specified in the robots file, the default wait_time_range is used.
The spider can be terminated by adjusting two parameters, namely max_urls and max_levels. Pressing Ctrl-c will also interrupt and terminate the process, albeit in a harsh manner.
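The two-queue level traversal described above can be sketched as follows, single-threaded for clarity. Here `fetch_children` is a hypothetical callable standing in for the reader threads; it returns the child URLs found on a page:

```python
from collections import deque

def crawl_levels(start_urls, max_levels, fetch_children):
    """Single-threaded sketch of the two-queue level traversal.

    `fetch_children` is a hypothetical stand-in for the reader
    threads; it returns the child URLs found at a given URL.
    """
    active, empty = deque(start_urls), deque()
    visited = []
    for level in range(max_levels):
        if not active:
            break
        # Readers get from the active queue and put into the empty one.
        while active:
            url = active.popleft()
            visited.append(url)
            empty.extend(fetch_children(url))
        # All readers have joined: swap queues and start the next level.
        active, empty = empty, active
    return visited
```

In the real spider the inner loop runs concurrently across reader threads, and the join of all readers marks the level boundary where the swap happens.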
Parameters: - start_urls : list
The starting point for the spider.
- wait_time_range : tuple
A time interval from which a wait time is randomly selected.
- follow_external_links : bool
If set to True, the spider will traverse domains outside those of the start urls.
- visit_history_limit : int
Used to set the cache size of the deque which keeps track of all the visited URLs.
- respect_server : bool
If set to True, the wait_time_range attribute is applied.
- read_robots_file : bool
If set to True, the robots.txt file is parsed and checked. The spider honors the crawl delay and/or download rate, as well as whether to crawl the page at all. If a delay cannot be determined, the default wait_time_range is used.
- timeout : int
The timeout used when reading from a URL. If the URL cannot be read within the specified time, a timeout exception is raised.
- thread_safe_parse : bool
If set to True, the parse method is thread safe, which allows for easy debugging using print statements.
- max_urls_per_level : int
Child URLs immediately below the start urls form the first level. Since the number of URLs per level can increase at an exponential rate, a limit is set to prevent memory bottlenecks by defining a max queue size for the active and empty queues.
- max_levels : int
The maximum number of levels to crawl before termination. Every time all the reader threads return (i.e. join), it marks the end of the previous level and the beginning of the next.
- max_urls : int
The total number of URLs to crawl before termination. This is implemented using a counter that each reader thread increments by one after it reads a URL.
- debug : bool
Log all debug messages to stdout.
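As a sketch of the documented interplay between respect_server, read_robots_file, and wait_time_range, the hypothetical helper below (not part of arackpy) uses the standard library's robots.txt parser to pick a wait time, falling back to a random draw from wait_time_range when no crawl delay is specified:

```python
import random
from urllib.robotparser import RobotFileParser

def pick_wait_time(robots_url, user_agent="arackpy", wait_time_range=(1, 5)):
    """Hypothetical helper sketching the documented fallback: honor the
    site's Crawl-delay when present, else draw from wait_time_range."""
    parser = RobotFileParser(robots_url)
    try:
        parser.read()                     # fetch and parse robots.txt
        delay = parser.crawl_delay(user_agent)
    except OSError:
        delay = None                      # robots.txt unreachable
    if delay is not None:
        return float(delay)
    # No explicit delay: fall back to a random wait in the given range.
    return random.uniform(*wait_time_range)
```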
TODO
In order of importance:
- Implement a Bloom filter for the visited URLs cache instead of a deque.
- Implement backends like BeautifulSoup, Tor, Selenium, and proxies.
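The first TODO item could look something like this toy sketch (the hash scheme and sizes here are illustrative, not arackpy's). Unlike the bounded deque, membership tests run in constant memory regardless of how many URLs are added, at the cost of a small false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for a visited-URLs cache.

    Derives k bit positions per URL from salted SHA-256 digests; a URL
    is "seen" only if all k of its bits are set (false positives are
    possible, false negatives are not).
    """

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (i, url)).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```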
BUGS
- Fix http://example.com not being recognized as the same URL as https://example.com/.
- Fix pypi not showing the code syntax highlighting.
Methods
parse(url, html)¶
User code used to handle each URL and its corresponding HTML.
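A typical subclass overrides the class attributes listed below and implements parse. The stub base class in this sketch merely mirrors the documented defaults so it runs standalone; with arackpy installed you would write `from arackpy.spider import Spider` instead:

```python
class Spider:
    """Minimal stand-in mirroring the documented defaults, used here
    only so the sketch runs without arackpy installed."""
    start_urls = []
    wait_time_range = (1, 5)
    follow_external_links = False
    max_levels = 100

    def parse(self, url, html):
        raise NotImplementedError

class MySpider(Spider):
    # Configure the crawl by overriding the documented class attributes.
    start_urls = ["http://example.com"]
    wait_time_range = (2, 8)
    max_levels = 2

    def parse(self, url, html):
        # User code: called once per downloaded page.
        print("fetched %s (%d bytes)" % (url, len(html)))
```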
Attributes
start_urls = []¶
wait_time_range = (1, 5)¶
follow_external_links = False¶
visit_history_limit = 2000¶
respect_server = True¶
read_robots_file = True¶
timeout = 5¶
max_urls_per_level = 1000¶
max_levels = 100¶
debug = False¶