Crawler Engine

class engine.CrawlerEngine.CrawlerEngine[source]
class CustomSpider(*a, **kw)[source]
allowed_domains = ['en.wikipedia.org']
config = {'start_urls': 'http://en.wikipedia.org/wiki/Programming_language', 'allowed_domains': 'en.wikipedia.org'}
config_file = <closed file '/home/docs/checkouts/readthedocs.org/user_builds/iosr-crawler/checkouts/latest/src/engine/conf.crawler', mode 'r'>
config_path = '/home/docs/checkouts/readthedocs.org/user_builds/iosr-crawler/checkouts/latest/src/engine/conf.crawler'
crawler
handles_request(request)
log(message, level=10, **kw)

Log the given message at the given log level. Always use this method to send log messages from your spider.

make_requests_from_url(url)
name = 'spider'
parse(response)
static parse_page(response)[source]
parse_start_url(response)
process_results(response, results)
rules = (<scrapy.contrib.spiders.crawl.Rule object at 0x7fa6cf2d3c50>,)
set_crawler(crawler)
settings
start_requests()
start_urls = ['http://en.wikipedia.org/wiki/Programming_language']
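The config attribute above suggests the spider's start_urls and allowed_domains are read from a plain key/value file (conf.crawler). The actual file format is not shown in this documentation; the sketch below assumes simple key = value lines and is purely illustrative:

```python
def parse_crawler_config(text):
    """Parse simple 'key = value' lines into a dict; blank lines and comments are skipped."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        config[key.strip()] = value.strip()
    return config

sample = """
# hypothetical conf.crawler contents
start_urls = http://en.wikipedia.org/wiki/Programming_language
allowed_domains = en.wikipedia.org
"""
config = parse_crawler_config(sample)
```

Parsed this way, the values come out as plain strings, matching the string-valued config dict shown above.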
CrawlerEngine.add_query(user_id, query)[source]

Add a crawling query for the given user.

Parameters:
  • user_id (int) – ID of the user associated with the query.
  • query (str) – User’s query.
CrawlerEngine.get_urls(query)[source]

Retrieves all URLs associated with the given query from the database.

Returns: list of URLs.
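get_urls maps a query string to the URLs collected for it. A minimal in-memory stand-in illustrating that contract (the real CrawlerEngine reads from a database; the dict-backed storage and the add_url helper here are hypothetical):

```python
class InMemoryUrlStore:
    """Illustrative stand-in: maps a query string to the list of URLs found for it."""

    def __init__(self):
        self._urls = {}

    def add_url(self, query, url):
        """Record a crawled URL under the given query (hypothetical helper)."""
        self._urls.setdefault(query, []).append(url)

    def get_urls(self, query):
        """Retrieve all URLs associated with the given query; empty list if none."""
        return list(self._urls.get(query, []))

store = InMemoryUrlStore()
store.add_url("programming language",
              "http://en.wikipedia.org/wiki/Programming_language")
urls = store.get_urls("programming language")
```

Returning a copy of the stored list keeps callers from mutating the store's internal state.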
CrawlerEngine.get_user_queries(user_id)[source]

Retrieves user queries from the database.

Parameters: user_id (int) – ID of the user associated with the queries.
Returns: list of user queries.
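add_query and get_user_queries form a pair: queries are stored per user and later retrieved by user ID. A minimal in-memory stand-in showing the documented signatures (the real CrawlerEngine persists queries in a database; the dict storage here is purely illustrative):

```python
class InMemoryQueryStore:
    """Illustrative stand-in: maps user_id -> list of that user's query strings."""

    def __init__(self):
        self._queries = {}

    def add_query(self, user_id, query):
        """Add a crawling query for the given user."""
        self._queries.setdefault(user_id, []).append(query)

    def get_user_queries(self, user_id):
        """Retrieve the user's queries; empty list if the user has none."""
        return list(self._queries.get(user_id, []))

store = InMemoryQueryStore()
store.add_query(1, "programming languages")
store.add_query(1, "compilers")
queries = store.get_user_queries(1)
```

A user with no stored queries yields an empty list rather than raising, which keeps the retrieval side simple for callers.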
static CrawlerEngine.notify_agents()[source]

Notifies agents about a new crawling query.

CrawlerEngine.start_crawling()[source]

Notifies all agents and, if the crawling process is not already running, starts it.
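start_crawling combines notify_agents with an idempotent start: agents are always notified, but the crawl is launched only once. A minimal sketch of that notify-then-start-once pattern (the agent callbacks and the crawling flag are assumptions for illustration, not the engine's actual transport):

```python
class CrawlController:
    """Illustrative stand-in for the notify-then-start-once behaviour."""

    def __init__(self):
        self.crawling = False
        self.agents = []        # hypothetical agent-notification callbacks
        self.start_count = 0    # tracks how many times the crawl was actually started

    def notify_agents(self):
        """Notify every registered agent about a new crawling query."""
        for agent in self.agents:
            agent()

    def start_crawling(self):
        """Notify all agents; start the crawl only if it is not already running."""
        self.notify_agents()
        if not self.crawling:
            self.crawling = True
            self.start_count += 1

notified = []
ctl = CrawlController()
ctl.agents.append(lambda: notified.append("new query"))
ctl.start_crawling()
ctl.start_crawling()  # second call notifies again but does not restart the crawl
```

Guarding the start with a flag means repeated calls are safe: every call fans out notifications, while the crawl itself is launched at most once.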