Welcome to iosr-crawler’s documentation!
Contents:
Crawler Engine
Contents:
Crawler Engine
class engine.CrawlerEngine.CrawlerEngine
- class CustomSpider(*a, **kw)
  - allowed_domains = ['en.wikipedia.org']
  - config = {'start_urls': 'http://en.wikipedia.org/wiki/Programming_language', 'allowed_domains': 'en.wikipedia.org'}
  - config_file = <closed file '/home/docs/checkouts/readthedocs.org/user_builds/iosr-crawler/checkouts/latest/src/engine/conf.crawler', mode 'r'>
  - config_path = '/home/docs/checkouts/readthedocs.org/user_builds/iosr-crawler/checkouts/latest/src/engine/conf.crawler'
  - crawler
  - handles_request(request)
  - log(message, level=10, **kw)
    Log the given messages at the given log level. Always use this method to send log messages from your spider.
  - make_requests_from_url(url)
  - name = 'spider'
  - parse(response)
  - parse_start_url(response)
  - process_results(response, results)
  - rules = (<scrapy.contrib.spiders.crawl.Rule object at 0x7f1fd7a48bd0>,)
  - set_crawler(crawler)
  - settings
  - start_requests()
  - start_urls = ['http://en.wikipedia.org/wiki/Programming_language']
- CrawlerEngine.get_urls(query)
  Retrieves all URLs associated with the given query from the database.
  Returns: list of URLs.
DB Engine
class engine.db_engine.DbEngine.DbEngine
- add_keywords(query, keywords, bucket_name='keywords')
  Adds keywords for the given query to the database.
- get_all_queries(bucket_name='all_queries')
  Retrieves all queries from the database.
  Returns: list of all queries.
- get_keywords(query, bucket_name='keywords')
  Retrieves all keywords associated with the given query from the database.
  Returns: list of keywords.
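The three methods above read as a small bucket-keyed store. The sketch below mirrors their signatures with an in-memory stand-in so the call pattern is easier to see. InMemoryDbEngine is a hypothetical name, not part of the project, and the real DbEngine backend and its storage layout are not described here. In particular, recording each query in the 'all_queries' bucket is an assumption.

```python
from collections import defaultdict


class InMemoryDbEngine(object):
    """Hypothetical stand-in that mimics the documented method signatures."""

    def __init__(self):
        # Maps bucket name -> {query: [keywords]}.
        self._buckets = defaultdict(dict)

    def add_keywords(self, query, keywords, bucket_name='keywords'):
        # Append the keywords to whatever is already stored for the query.
        self._buckets[bucket_name].setdefault(query, []).extend(keywords)
        # Assumption: every stored query is also recorded in 'all_queries'.
        self._buckets['all_queries'].setdefault(query, [])

    def get_all_queries(self, bucket_name='all_queries'):
        # Return the known queries in sorted order.
        return sorted(self._buckets[bucket_name])

    def get_keywords(self, query, bucket_name='keywords'):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._buckets[bucket_name].get(query, []))


db = InMemoryDbEngine()
db.add_keywords('python', ['language', 'interpreter'])
stored_keywords = db.get_keywords('python')
known_queries = db.get_all_queries()
```

An unknown query yields an empty list in this sketch; the behaviour of the real class in that case is not documented.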
Search Engine
class engine.search_engine.SearchEngine.SearchEngine
Extractor
Contents:
Extractor
class nlp.extractor.NLPExtractor
- static calculate_word_scores(phrase_list)
  Calculates word scores based on their frequency and degree.
  Parameters: phrase_list (list) – List of phrases to be processed.
  Returns: mapping between each word and its score.
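The entry above does not give the formula. A minimal sketch, assuming the RAKE convention in which a word's score is its degree divided by its frequency, could look like this. word_scores_sketch is a hypothetical name, and splitting phrases on whitespace is a simplification.

```python
from collections import defaultdict


def word_scores_sketch(phrase_list):
    """Score each word as degree / frequency (RAKE convention, assumed)."""
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrase_list:
        words = phrase.lower().split()
        for word in words:
            frequency[word] += 1
            # Degree counts every word in the phrase, the word itself included.
            degree[word] += len(words)
    return {word: degree[word] / float(frequency[word]) for word in frequency}


scores = word_scores_sketch(["programming language", "language", "compiler design"])
```

Under this convention, words that mostly appear inside longer phrases score higher than words that often stand alone.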
- static generate_candidate_keyword_scores(phrase_list, word_score)
  Generates scores for candidate keywords.
  Returns: mapping between phrases and their scores.
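How the phrase score is derived from word_score is not stated. A sketch, assuming the usual RAKE rule that a phrase's score is the sum of its word scores, is shown below. phrase_scores_sketch and the sample scores are hypothetical.

```python
def phrase_scores_sketch(phrase_list, word_score):
    """Score each phrase as the sum of its word scores (assumed RAKE rule)."""
    scores = {}
    for phrase in phrase_list:
        words = phrase.lower().split()
        # Words missing from word_score contribute nothing.
        scores[phrase] = sum(word_score.get(word, 0.0) for word in words)
    return scores


example_word_score = {"programming": 2.0, "language": 1.5}
ranked = phrase_scores_sketch(["programming language", "language"], example_word_score)
```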
- static generate_candidate_keywords(sentence_list, stopword_pattern)
  Generates a list of keyword candidates.
  Returns: list of keywords.
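One plausible reading of the signature is that each sentence is split wherever the stop word pattern matches, and the remaining pieces become candidates. The sketch below follows that reading. Both function names are hypothetical, and treating stopword_pattern as a compiled regular expression is an assumption.

```python
import re


def build_stopword_pattern(stop_words):
    """Compile one regex that matches any stop word as a whole word."""
    alternatives = [r"\b" + re.escape(word) + r"\b" for word in stop_words]
    return re.compile("|".join(alternatives), re.IGNORECASE)


def candidate_keywords_sketch(sentence_list, stopword_pattern):
    """Split each sentence at stop words and keep the non-empty pieces."""
    phrases = []
    for sentence in sentence_list:
        for piece in stopword_pattern.split(sentence):
            piece = piece.strip().lower()
            if piece:
                phrases.append(piece)
    return phrases


pattern = build_stopword_pattern(["is", "a", "of", "the"])
phrases = candidate_keywords_sketch(["Python is a programming language"], pattern)
```

The word-boundary anchors keep a short stop word such as "a" from matching inside longer words like "language".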
- static is_number(word)
  Checks whether word is a number.
  Parameters: word (str) – Word to be checked.
  Returns: True if word is a number, False otherwise.
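The check itself is not shown. A minimal sketch, assuming that "number" means any string Python's float() can parse, is given below. is_number_sketch is a hypothetical name.

```python
def is_number_sketch(word):
    """Return True when the string parses as a float, False otherwise."""
    try:
        float(word)
        return True
    except ValueError:
        return False


checks = [is_number_sketch("42"), is_number_sketch("3.14"), is_number_sketch("python")]
```

Under this assumption, strings such as "inf" and "nan" also count as numbers, which may or may not match the project's intent.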
- load_stop_words()
  Utility function that loads stop words from a file and returns them as a list of words.
  Returns: list of stop words.
- run(text)
  Extracts keywords from the text.
  Parameters: text (str) – Text to be processed.
  Returns: list of keywords.
User Interface
Contents:
Forms
class ui.forms.QueryForm(data=None, files=None, auto_id=u'id_%s', prefix=None, initial=None, error_class=<class 'django.forms.utils.ErrorList'>, label_suffix=None, empty_permitted=False)
- add_error(field, error)
  Update the content of self._errors.
  The field argument is the name of the field to which the errors should be added. If its value is None the errors will be treated as NON_FIELD_ERRORS.
  The error argument can be a single error, a list of errors, or a dictionary that maps field names to lists of errors. What we define as an “error” can be either a simple string or an instance of ValidationError with its message attribute set and what we define as list or dictionary can be an actual list or dict or an instance of ValidationError with its error_list or error_dict attribute set.
  If error is a dictionary, the field argument must be None and errors will be added to the fields that correspond to the keys of the dictionary.
- add_initial_prefix(field_name)
  Adds an ‘initial’ prefix for checking dynamic initial values.
- add_prefix(field_name)
  Returns the field name with a prefix appended, if this Form has a prefix set.
  Subclasses may wish to override.
- as_p()
  Returns this form rendered as HTML <p>s.
- as_table()
  Returns this form rendered as HTML <tr>s – excluding the <table></table>.
- as_ul()
  Returns this form rendered as HTML <li>s – excluding the <ul></ul>.
- base_fields = OrderedDict([('query', <django.forms.fields.CharField object at 0x7f1fd75295d0>)])
- changed_data
- clean()
  Hook for doing any extra form-wide cleaning after Field.clean() has been called on every field. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field named ‘__all__’.
- declared_fields = OrderedDict([('query', <django.forms.fields.CharField object at 0x7f1fd75295d0>)])
- errors
  Returns an ErrorDict for the data provided for the form.
- full_clean()
  Cleans all of self.data and populates self._errors and self.cleaned_data.
- has_changed()
  Returns True if data differs from initial.
- has_error(field, code=None)
- hidden_fields()
  Returns a list of all the BoundField objects that are hidden fields. Useful for manual form layout in templates.
- is_multipart()
  Returns True if the form needs to be multipart-encoded, i.e. it has FileInput. Otherwise, False.
- is_valid()
  Returns True if the form has no errors. Otherwise, False. If errors are being ignored, returns False.
- media
- non_field_errors()
  Returns an ErrorList of errors that aren’t associated with a particular field – i.e., from Form.clean(). Returns an empty ErrorList if there are none.
- visible_fields()
  Returns a list of BoundField objects that aren’t hidden fields. The opposite of the hidden_fields() method.