utils (utility library)
Utility library
- Contains functions for
- Loading of articles
- Loading of clusters
- Creating the database of articles
- Initial cleaning of the raw csv data from crawler
-
class utils.FileList(folder_path)
Bases: object
Object that holds lists of files and creates a database from them.
-
__init__(folder_path)
Builds a list of files (file_list) and a list of csv files (csv_list) from a given folder path.
Parameters: folder_path (str) – Path of the folder to build file_list and csv_list from
-
display_all()
Prints file_list.
-
display_csv()
Prints csv_list.
-
find_csv(key)
Finds the csv files that belong to a specific database. For example, find_csv("NYT") returns every csv file with NYT in its file name.
Parameters: key (str) – A keyword that every file's name must contain
Returns: A list of csv files where every file's name contains the key
Return type: list
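The keyword filter described above can be sketched as follows. This is a minimal illustration, assuming csv_list is a plain list of paths and that the match is a case-sensitive substring test on the file name; find_csv_sketch and the sample paths are hypothetical and not taken from the library.

```python
import os


def find_csv_sketch(csv_list, key):
    """Return the paths in csv_list whose file name contains key.

    Hypothetical stand-in for FileList.find_csv; the matching rule
    (case-sensitive substring on the base name) is an assumption.
    """
    return [path for path in csv_list if key in os.path.basename(path)]


# Made-up sample paths, for illustration only.
csv_list = ["data/NYT_world.csv", "data/NYT_us.csv", "data/Guardian_world.csv"]
nyt_files = find_csv_sketch(csv_list, "NYT")
print(nyt_files)
```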
-
get_csv()
Returns: A list of csv files
Return type: list
-
sqlite_build_nyt_full()
Constructs the database for NYT.
- Format
- name (text) PRIMARY KEY
- section (text)
- date (datetime)
- wordcnt (int)
- summary (text)
- id (int)
name, rather than id, is the primary key for this database, so that articles with duplicate names are removed. All text fields are converted to unicode, with any conversion errors ignored. This leaves some odd characters in the processed text, although it should not affect the results.
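The effect of keying the table on name can be sketched with the standard sqlite3 module. This is only an illustration under stated assumptions: the table name articles, the sample rows, the use of INSERT OR IGNORE to drop duplicate names, and the to_text helper are all hypothetical, since the source does not show how the library performs these steps.

```python
import sqlite3


def to_text(raw):
    """Decode bytes to text, skipping undecodable bytes (assumed behaviour)."""
    return raw.decode("utf-8", errors="ignore")


conn = sqlite3.connect(":memory:")
# Column layout follows the documented format; the table name is an assumption.
conn.execute(
    """
    CREATE TABLE articles (
        name    TEXT PRIMARY KEY,
        section TEXT,
        date    DATETIME,
        wordcnt INTEGER,
        summary TEXT,
        id      INTEGER
    )
    """
)

# Made-up rows; the second one repeats the name of the first.
rows = [
    (to_text(b"Headline A"), "World", "2014-03-01", 850, "First summary", 1),
    (to_text(b"Headline A"), "World", "2014-03-02", 900, "Duplicate name", 2),
    (to_text(b"Headline B"), "US", "2014-03-05", 400, "Second summary", 3),
]

# OR IGNORE skips any row whose primary key (name) is already present.
conn.executemany("INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT name, id FROM articles ORDER BY id").fetchall()
print(stored)
```

With this approach the first article seen under a given name is kept and later ones are silently discarded.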
-
-
utils.create_nyt_cluster_database(database_name, all_clusters)
Creates a database to store computed clusters for subsequent chaining.
Parameters: - database_name (str) – Name of the database to store computed clusters to
- all_clusters (list) – A list of computed clusters
-
utils.get_time()
Gets the current system time.
Returns: Current system time
Return type: str
-
utils.load_csv(file_name)
Loads the CSV file with the given filename.
Parameters: file_name (str) – Name of a csv file
Returns: A list in which each item is a single row of the provided CSV file
Return type: list
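Loading a CSV file into a list of rows can be sketched with the standard csv module. This is a hedged illustration rather than the library's code: load_csv_sketch, the UTF-8 encoding, and the sample file are assumptions.

```python
import csv
import os
import tempfile


def load_csv_sketch(file_name):
    """Return every row of the CSV file as a list of strings.

    Hypothetical stand-in for utils.load_csv; the encoding is assumed.
    """
    with open(file_name, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Write a small made-up file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "NYT_sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["name", "section"])
    writer.writerow(["Headline A", "World"])

rows = load_csv_sketch(sample_path)
print(rows)
```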
-
utils.load_nyt(section_name='World', start_date='2014-01-01', end_date='2015-01-01', keywords='')
Loads data from the SQLite DB.
- SQLite DB requires the following format
- name (text)
- section (text)
- date (datetime)
- wordcnt (int)
- summary (text)
- id (int)
Parameters: - section_name (str) – Name of a section, either “World” or “US”. May also be blank to load from both.
- start_date (date) – Articles loaded will be published after this date
- end_date (date) – Articles loaded will be published before this date
- keywords (str) – Space-separated keywords that restrict the articles loaded, e.g. “china israel”
Returns: Each item in the list is an article loaded from the SQLite DB
Return type: list
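The filtering that these parameters describe can be sketched as a parameterized SQLite query. Several details here are assumptions and not confirmed by the source: the table name articles, the choice to match keywords against the summary column, and the rule that every keyword must match (the library might instead accept any one of them). load_nyt_sketch and the sample rows are hypothetical.

```python
import sqlite3


def load_nyt_sketch(conn, section_name="World", start_date="2014-01-01",
                    end_date="2015-01-01", keywords=""):
    """Return articles filtered by section, date range and keywords."""
    query = ("SELECT name, section, date, wordcnt, summary, id "
             "FROM articles WHERE date > ? AND date < ?")
    params = [start_date, end_date]
    if section_name:  # a blank section loads from every section
        query += " AND section = ?"
        params.append(section_name)
    for word in keywords.split():  # assumed: all keywords must appear
        query += " AND summary LIKE ?"
        params.append("%" + word + "%")
    query += " ORDER BY date"
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (name TEXT PRIMARY KEY, section TEXT, "
             "date DATETIME, wordcnt INTEGER, summary TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("China trade talks", "World", "2014-05-01", 700, "China and trade", 1),
        ("Local vote", "US", "2014-06-01", 300, "Election in Ohio", 2),
        ("Old news", "World", "2013-06-01", 500, "China earlier", 3),
    ],
)

world_china = load_nyt_sketch(conn, keywords="china")
both_sections = load_nyt_sketch(conn, section_name="")
print([row[5] for row in world_china], [row[5] for row in both_sections])
```

Dates stored as ISO 8601 strings compare correctly as text, which is why the plain comparison operators are enough in this sketch.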
-
utils.load_nyt_by_article_id(article_id)
Loads an article from the SQLite DB.
- SQLite DB requires the following format
- name (text)
- section (text)
- date (datetime)
- wordcnt (int)
- summary (text)
- id (int)
Parameters: article_id (int) – The ID of an article
Returns: A list with a single item, the article, if it exists in the database; otherwise an empty list
Return type: list
-
utils.load_nyt_clusters(start_date=None, end_date=None, db_name='NewYT_clustered.db')
Loads clusters from a specified database.
Parameters: - start_date (datetime) – The earliest published article in the cluster must be published on or after this date
- end_date (datetime) – The earliest published article in the cluster must be published on or before this date
- db_name (str) – The name of the database of clusters to load from
Returns: A list containing the clusters that were selected by the query. Each row in the list is a single cluster
Return type: list
-
utils.log_to_file(program_name)
Prepares the logging configuration for a specific program.
Parameters: program_name (str) – Name of a program (usually a .py file)
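A per-program logging setup of this kind can be sketched with the standard logging module. This is an assumption-heavy illustration: the source does not say where the log file goes or what format it uses, so the log_dir argument, the .log naming rule, the format string, and log_to_file_sketch itself are all hypothetical.

```python
import logging
import os
import tempfile


def log_to_file_sketch(program_name, log_dir):
    """Attach a file handler named after the program and return the logger."""
    base_name = os.path.splitext(os.path.basename(program_name))[0]
    log_path = os.path.join(log_dir, base_name + ".log")

    logger = logging.getLogger(program_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, log_path


log_dir = tempfile.mkdtemp()
logger, log_path = log_to_file_sketch("cluster.py", log_dir)
logger.info("clustering started")

# Close and detach the handlers so the file can be read back safely.
for handler in list(logger.handlers):
    handler.close()
    logger.removeHandler(handler)

with open(log_path, encoding="utf-8") as handle:
    log_text = handle.read()
print(log_text)
```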
-
utils.sqlite_test()
Tests if the database has been loaded correctly.
-
utils.stdout_redirect(*args, **kwds)
Redirects stdout to the given stream.
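Redirecting stdout to a stream can be sketched as a context manager. The (*args, **kwds) signature is consistent with a function wrapped by contextlib.contextmanager, but the source does not confirm this, so treat stdout_redirect_sketch below as a hypothetical illustration rather than the library's implementation.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def stdout_redirect_sketch(stream):
    """Temporarily send everything written to sys.stdout into stream."""
    original = sys.stdout
    sys.stdout = stream
    try:
        yield stream
    finally:
        # Always restore the real stdout, even if the body raises.
        sys.stdout = original


buffer = io.StringIO()
with stdout_redirect_sketch(buffer):
    print("captured line")

captured = buffer.getvalue()
```

On Python 3.4 and later, the standard library's contextlib.redirect_stdout provides the same behaviour.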