utils (utility library)

Utility library containing functions for:

Loading articles
Loading clusters
Creating the database of articles
Initial cleaning of the raw CSV data from the crawler

class utils.FileList(folder_path)

Bases: object

Object that holds lists of files and creates a database from them.

__init__(folder_path)

Builds a list of files (file_list) and csv files (csv_list) from a given folder path.

Parameters:	folder_path (str) – Path of the folder to build file_list and csv_list from

display_all(): Prints file_list.

display_csv(): Prints csv_list.

find_csv(key)

Finds the csv files that belong to a specific database. For example, find_csv("NYT") returns every csv file with NYT in its file name.

Parameters:	key (str) – A keyword which every file’s name must contain
Returns:	A list of csv files where every file’s name contains the key
Return type:	list
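The file discovery and keyword filtering described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the helper names list_files and find_csv_sketch are hypothetical, and it assumes that only files directly inside the folder are listed and that the key match is case-sensitive.

```python
import os


def list_files(folder_path):
    """Return (file_list, csv_list) for the files directly inside folder_path.

    Assumption: subfolders are not searched.
    """
    file_list = sorted(
        name for name in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, name))
    )
    # Assumption: a csv file is recognised by its extension only.
    csv_list = [name for name in file_list if name.lower().endswith(".csv")]
    return file_list, csv_list


def find_csv_sketch(csv_list, key):
    """Keep only the csv files whose file name contains key (case-sensitive)."""
    return [name for name in csv_list if key in os.path.basename(name)]
```

Under these assumptions, find_csv_sketch(["NYT_world.csv", "BBC_world.csv"], "NYT") keeps only "NYT_world.csv".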

get_csv()

Returns:	A list of csv files
Return type:	list

sqlite_build_nyt_full()

Constructs the database for NYT.

Format

name (text) PRIMARY KEY
section (text)
date (datetime)
wordcnt (int)
summary (text)
id (int)

name, rather than id, is the primary key of this database so that articles with duplicate names are removed. All text fields are converted to unicode, and any conversion errors are ignored. As a result, the processed text may contain some stray or missing characters, although this should not affect the results.
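A rough sketch of how a primary key on name can drop duplicate articles while text is decoded with errors ignored. The table name articles, the helper names, and the use of INSERT OR IGNORE are assumptions made for illustration; the source does not state how the library performs the insert. The sketch is written for Python 3, where the unicode conversion becomes a bytes-to-str decode.

```python
import sqlite3


def to_text(value):
    """Decode bytes as UTF-8, silently dropping bytes that cannot be decoded."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="ignore")
    return value


def build_nyt_table(conn, rows):
    """Create the table (hypothetical name: articles) and insert the rows.

    Each row is (name, section, date, wordcnt, summary, id).
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               name TEXT PRIMARY KEY,
               section TEXT,
               date DATETIME,
               wordcnt INT,
               summary TEXT,
               id INT
           )"""
    )
    cleaned = [
        (to_text(name), to_text(section), date, int(wordcnt),
         to_text(summary), int(article_id))
        for name, section, date, wordcnt, summary, article_id in rows
    ]
    # Assumption: a row whose name already exists is skipped, so the first
    # article seen with a given name is the one that is kept.
    conn.executemany(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)", cleaned
    )
    conn.commit()
```

With this approach the duplicate check happens inside SQLite, so no separate pass over the names is needed before inserting.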

utils.create_nyt_cluster_database(database_name, all_clusters)

Creates a database to store computed clusters for subsequent chaining.

Parameters:	database_name (str) – Name of the database to store the computed clusters in
	all_clusters (list) – A list of computed clusters

utils.get_time()

Gets the current system time.

Returns:	Current system time
Return type:	str

utils.load_csv(file_name)

Loads the CSV file with the given filename.

Parameters:	file_name (str) – Name of a csv file
Returns:	A list in which each item is a single row of the provided CSV file
Return type:	list
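A minimal sketch of loading a CSV file into a list of rows with the standard csv module. The function name load_csv_sketch, the UTF-8 encoding, and the choice to ignore undecodable bytes are assumptions; the source only states that each list item is one row.

```python
import csv


def load_csv_sketch(file_name):
    """Return the rows of a CSV file as a list of lists of strings."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(file_name, newline="", encoding="utf-8", errors="ignore") as handle:
        return list(csv.reader(handle))
```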

utils.load_nyt(section_name='World', start_date='2014-01-01', end_date='2015-01-01', keywords='')

Loads articles from the SQLite database.

The SQLite database must have the following format:

name (text)
section (text)
date (datetime)
wordcnt (int)
summary (text)
id (int)

Parameters:	section_name (str) – Name of a section, either "World" or "US". May also be blank to load from both.
	start_date (date) – Articles loaded will be published after this date
	end_date (date) – Articles loaded will be published before this date
	keywords (str) – Keywords, each separated by a single space, that restrict the articles loaded, e.g. "china israel"
Returns:	Each item in the list is an article loaded from the SQLite DB
Return type:	list
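The filtering performed by these parameters could look roughly like the parameterised query below. Several details are assumptions, not documented behaviour: the table name articles, that keywords are matched against the summary column, that every keyword must match, and that dates are stored as ISO-formatted text so that string comparison orders them correctly.

```python
import sqlite3


def load_nyt_sketch(conn, section_name="World", start_date="2014-01-01",
                    end_date="2015-01-01", keywords=""):
    """Return the articles matching the section, date range, and keywords."""
    clauses = ["date > ?", "date < ?"]
    params = [start_date, end_date]
    if section_name:
        # A blank section name means no section filter is applied.
        clauses.append("section = ?")
        params.append(section_name)
    for word in keywords.split():
        # Assumption: each keyword must appear somewhere in the summary.
        clauses.append("summary LIKE ?")
        params.append("%" + word + "%")
    query = (
        "SELECT name, section, date, wordcnt, summary, id FROM articles WHERE "
        + " AND ".join(clauses)
        + " ORDER BY date"
    )
    return conn.execute(query, params).fetchall()
```

Passing the values as query parameters rather than formatting them into the SQL string avoids quoting problems when a keyword contains an apostrophe.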

utils.load_nyt_by_article_id(article_id)

Loads an article from the SQLite database.

The SQLite database must have the following format:

name (text)
section (text)
date (datetime)
wordcnt (int)
summary (text)
id (int)

Parameters:	article_id (int) – The ID of an article
Returns:	A list with a single item, the article, if it exists in the database. Otherwise an empty list is returned
Return type:	list

utils.load_nyt_clusters(start_date=None, end_date=None, db_name='NewYT_clustered.db')

Loads clusters from a specified database.

Parameters:	start_date (datetime) – The earliest published article in the cluster must be published on or after this date
	end_date (datetime) – The earliest published article in the cluster must be published on or before this date
	db_name (str) – The name of the database of clusters to load from
Returns:	A list containing the clusters selected by the query. Each row in the list is a single cluster
Return type:	list

utils.log_to_file(program_name)

Prepares the logging configuration for a specific program.

Parameters:	program_name (str) – Name of a program (usually a .py file)
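One way such a per-program logging setup might look, using the standard logging module. The function name log_to_file_sketch, the log_dir parameter, the log file name (program name with a .log extension), the INFO level, and the message format are all assumptions chosen for illustration.

```python
import logging
import os


def log_to_file_sketch(program_name, log_dir="."):
    """Return a logger that writes to <log_dir>/<program name>.log."""
    # "cluster.py" becomes the logger name "cluster" and the file "cluster.log".
    base = os.path.splitext(os.path.basename(program_name))[0]
    log_path = os.path.join(log_dir, base + ".log")
    logger = logging.getLogger(base)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # Attach the file handler only once, even if called repeatedly.
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A caller would then write messages with, for example, log_to_file_sketch("cluster.py").info("clustering started").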

utils.sqlite_test(): Tests if the database has been loaded correctly.

utils.stdout_redirect(*args, **kwds): Redirects stdout to the given stream.
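A stdout redirect of this kind is commonly written as a context manager. The sketch below is an assumption about the general shape, not the library's code; the name stdout_redirect_sketch and its single stream argument are hypothetical. On Python 3.4 and later, the standard library's contextlib.redirect_stdout provides the same behaviour.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def stdout_redirect_sketch(stream):
    """Temporarily send everything printed to stdout into stream."""
    original = sys.stdout
    sys.stdout = stream
    try:
        yield stream
    finally:
        # Restore stdout even if the body of the with-block raises.
        sys.stdout = original


# Usage: capture printed output in an in-memory buffer.
buffer = io.StringIO()
with stdout_redirect_sketch(buffer):
    print("captured")
```

After the with-block, buffer.getvalue() holds the captured text and sys.stdout points to the original stream again.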