utils (utility library)
Utility library
Contains functions for:
- Loading articles
- Loading clusters
- Creating the database of articles
- Initial cleaning of the raw CSV data from the crawler

class utils.FileList(folder_path)
Bases: object
Object that holds lists of files and creates a database from them.

__init__(folder_path)
Builds a list of files (file_list) and a list of csv files (csv_list) from a given folder path.
Parameters: folder_path (str) – Path of the folder to build file_list and csv_list from
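
As a rough illustration of the behaviour described above, the two lists could be built along the following lines. This is a minimal sketch, not the library's code: the helper name build_file_lists, the non-recursive scan, and the case-insensitive ".csv" test are all assumptions.

```python
import os


def build_file_lists(folder_path):
    """Hypothetical helper: return (file_list, csv_list) for one folder."""
    # Assumption: only the top level of the folder is scanned, not subfolders.
    file_list = sorted(
        name
        for name in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, name))
    )
    # Assumption: a csv file is any file whose name ends in ".csv",
    # compared case-insensitively.
    csv_list = [name for name in file_list if name.lower().endswith(".csv")]
    return file_list, csv_list
```

The sketch stores bare file names; the real class may store full paths instead.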

display_all()
Prints file_list.

display_csv()
Prints csv_list.

find_csv(key)
Used to load csv files for a specific database. For example, find_csv("NYT") returns every csv file with NYT in its file name.
Parameters: key (str) – A keyword that every file's name must contain
Returns: A list of csv files where every file's name contains the key
Return type: list
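
The keyword filter described above can be pictured as a plain substring test over the csv list. The sketch below is a standalone function, not the FileList method itself; the name find_csv_in, the sample file names, and the case-sensitive match are assumptions.

```python
def find_csv_in(csv_list, key):
    """Hypothetical helper: keep the csv files whose name contains key."""
    # Assumption: the match is a case-sensitive substring test on the name.
    return [name for name in csv_list if key in name]


# Made-up file names, for illustration only.
csv_list = ["NYT_world_2014.csv", "NYT_us_2014.csv", "BBC_world_2014.csv"]
nyt_files = find_csv_in(csv_list, "NYT")
print(nyt_files)  # ['NYT_world_2014.csv', 'NYT_us_2014.csv']
```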

get_csv()
Returns: A list of csv files
Return type: list

sqlite_build_nyt_full()
Constructs the database for NYT.
Format:
- name (text) PRIMARY KEY
- section (text)
- date (datetime)
- wordcnt (int)
- summary (text)
- id (int)
The primary key for this database is name rather than id, so that articles with duplicate names are removed. All text fields are converted to unicode, and any conversion errors are ignored. This leaves some odd characters in the processed text, although it should not affect the results.
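
To make the effect of the name primary key concrete, here is a minimal sketch using the standard sqlite3 module. The table name articles, the use of INSERT OR IGNORE to drop duplicates, the UTF-8 codec, and the sample rows are assumptions; the source only states the column layout and that conversion errors are ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
# Column layout as documented; the table name "articles" is an assumption.
conn.execute(
    """
    CREATE TABLE articles (
        name    TEXT PRIMARY KEY,
        section TEXT,
        date    DATETIME,
        wordcnt INT,
        summary TEXT,
        id      INT
    )
    """
)

# Made-up rows; the second one repeats the name of the first.
rows = [
    (b"Talks Resume", "World", "2014-03-01", 850, b"Summary \xff one", 1),
    (b"Talks Resume", "World", "2014-03-02", 900, b"Duplicate name", 2),
    (b"Budget Vote", "US", "2014-03-03", 640, b"Summary two", 3),
]

for name, section, date, wordcnt, summary, article_id in rows:
    # Decode raw bytes to text, dropping bytes that are not valid UTF-8.
    name = name.decode("utf-8", errors="ignore")
    summary = summary.decode("utf-8", errors="ignore")
    # INSERT OR IGNORE skips a row whose name is already in the table.
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)",
        (name, section, date, wordcnt, summary, article_id),
    )
conn.commit()

stored = conn.execute("SELECT name, summary, id FROM articles ORDER BY id").fetchall()
```

In this sketch the first article with a given name wins, and the invalid byte in the first summary is dropped.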

utils.create_nyt_cluster_database(database_name, all_clusters)
Creates a database to store computed clusters for subsequent chaining.
Parameters:
- database_name (str) – Name of the database to store computed clusters to
- all_clusters (list) – A list of computed clusters

utils.get_time()
Gets the current system time.
Returns: Current system time
Return type: str

utils.load_csv(file_name)
Loads the CSV file with the given file name.
Parameters: file_name (str) – Name of a csv file
Returns: Each item is a single row in the provided CSV file
Return type: list
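
A minimal sketch of loading a CSV file into a list of rows with the standard csv module follows. The helper name load_csv_rows, the UTF-8 encoding, and keeping the header row as the first item are assumptions; the source does not say how the header is treated.

```python
import csv


def load_csv_rows(file_name):
    """Hypothetical helper: return every row of a CSV file as a list of strings."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(file_name, newline="", encoding="utf-8") as handle:
        # Assumption: the header row, if any, is kept as the first item.
        return list(csv.reader(handle))
```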

utils.load_nyt(section_name='World', start_date='2014-01-01', end_date='2015-01-01', keywords='')
Loads articles from the SQLite DB.
The SQLite DB requires the following format:
- name (text)
- section (text)
- date (datetime)
- wordcnt (int)
- summary (text)
- id (int)
Parameters:
- section_name (str) – Name of a section, either “World” or “US”. May also be blank to load from both.
- start_date (date) – Articles loaded will be published after this date
- end_date (date) – Articles loaded will be published before this date
- keywords (str) – Keywords, separated by single spaces, that restrict the articles loaded, e.g. “china israel”
Returns: Each item in the list is an article loaded from the SQLite DB
Return type: list
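
The filters described above map naturally onto a parameterised SQL query. The sketch below is one possible reading, not the library's implementation: the helper name build_query, the table name articles, matching keywords against the summary column, requiring every keyword to match, and the sample rows are all assumptions.

```python
import sqlite3


def build_query(section_name="World", start_date="2014-01-01",
                end_date="2015-01-01", keywords=""):
    """Hypothetical helper: return (sql, params) for the documented filters."""
    # "after" and "before" are read here as strict comparisons.
    clauses = ["date > ?", "date < ?"]
    params = [start_date, end_date]
    if section_name:  # a blank section name loads from every section
        clauses.append("section = ?")
        params.append(section_name)
    for word in keywords.split():
        # Assumption: every keyword must appear in the summary.
        clauses.append("summary LIKE ?")
        params.append("%" + word + "%")
    sql = "SELECT * FROM articles WHERE " + " AND ".join(clauses) + " ORDER BY date"
    return sql, params


conn = sqlite3.connect(":memory:")  # in-memory database with made-up rows
conn.execute(
    "CREATE TABLE articles (name TEXT PRIMARY KEY, section TEXT, "
    "date DATETIME, wordcnt INT, summary TEXT, id INT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Trade Deal", "World", "2014-03-01", 800, "China and Israel sign a trade deal", 1),
        ("Election", "World", "2014-05-01", 500, "Election results in Brazil", 2),
        ("Tariffs", "US", "2014-06-01", 700, "China tariffs debated in Congress", 3),
    ],
)

sql, params = build_query(section_name="World", keywords="china israel")
matches = conn.execute(sql, params).fetchall()
```

Passing values as query parameters, rather than formatting them into the SQL string, avoids quoting problems with keywords that contain apostrophes.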

utils.load_nyt_by_article_id(article_id)
Loads an article from the SQLite DB.
The SQLite DB requires the following format:
- name (text)
- section (text)
- date (datetime)
- wordcnt (int)
- summary (text)
- id (int)
Parameters: article_id (int) – The ID of an article
Returns: A list with a single item, the article, if it exists in the database. Otherwise an empty list is returned
Return type: list
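
The "single item or empty list" return value falls out directly from fetching all rows that match an ID. A minimal sketch, assuming a table named articles and a hypothetical helper load_by_id that takes an open connection:

```python
import sqlite3


def load_by_id(conn, article_id):
    """Hypothetical helper: return the matching article rows as a list."""
    cursor = conn.execute("SELECT * FROM articles WHERE id = ?", (article_id,))
    # fetchall() yields [] when nothing matches, so no special case is needed.
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database with one made-up row
conn.execute(
    "CREATE TABLE articles (name TEXT PRIMARY KEY, section TEXT, "
    "date DATETIME, wordcnt INT, summary TEXT, id INT)"
)
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)",
    ("Trade Deal", "World", "2014-03-01", 800, "A short summary", 7),
)

found = load_by_id(conn, 7)
missing = load_by_id(conn, 99)
```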

utils.load_nyt_clusters(start_date=None, end_date=None, db_name='NewYT_clustered.db')
Loads clusters from a specified database.
Parameters:
- start_date (datetime) – The earliest published article in the cluster must be published on or after this date
- end_date (datetime) – The earliest published article in the cluster must be published on or before this date
- db_name (str) – The name of the database of clusters to load from
Returns: A list containing the clusters that were selected by the query. Each row in the list is a single cluster
Return type: list

utils.log_to_file(program_name)
Prepares the logging configuration for a specific program.
Parameters: program_name (str) – Name of a program (usually a .py file)
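
For readers unfamiliar with per-program logging, the sketch below shows one way such a configuration could look with the standard logging module. The helper name configure_logging, the log_dir parameter, the "<program_name>.log" file name, and the message format are assumptions, not details taken from the library.

```python
import logging
import os


def configure_logging(program_name, log_dir="."):
    """Hypothetical helper: send this program's log messages to its own file."""
    logger = logging.getLogger(program_name)
    logger.setLevel(logging.INFO)
    # Assumption: the log file is named after the program.
    log_path = os.path.join(log_dir, program_name + ".log")
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

A caller would then write, for example, configure_logging("cluster.py").info("started"), where "cluster.py" is a made-up program name.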

utils.sqlite_test()
Tests if the database has been loaded correctly.

utils.stdout_redirect(*args, **kwds)
Redirects stdout to the given stream.
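
The (*args, **kwds) signature is consistent with a function wrapped by contextlib.contextmanager, although the source does not confirm this. Under that assumption, a redirect of this kind can be sketched as follows; the name redirect_stdout_to is hypothetical.

```python
import io
import sys
from contextlib import contextmanager


@contextmanager
def redirect_stdout_to(stream):
    """Hypothetical helper: temporarily replace sys.stdout with stream."""
    previous = sys.stdout
    sys.stdout = stream
    try:
        yield stream
    finally:
        # Always restore the original stdout, even if the body raises.
        sys.stdout = previous


buffer = io.StringIO()
with redirect_stdout_to(buffer):
    print("captured")
captured = buffer.getvalue()
```

On Python 3.4 and later, the standard library's contextlib.redirect_stdout provides the same behaviour.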