utils (utility library)

Utility library

Contains functions for
  • Loading of articles
  • Loading of clusters
  • Creating the database of articles
  • Initial cleaning of the raw csv data from the crawler
class utils.FileList(folder_path)

Bases: object

Object that holds lists of files and creates a database from them.

__init__(folder_path)

Builds a list of files (file_list) and csv files (csv_list) from a given folder path.

Parameters:folder_path (str) – Path of the folder to build file_list and csv_list from
display_all()

Prints file_list.

display_csv()

Prints csv_list.

find_csv(key)

Used to load csv files for a specific database. For example, find_csv("NYT") will return every csv file with NYT in its file name.

Parameters:key (str) – A keyword which every file’s name must contain
Returns:A list of csv files where every file’s name contains the key
Return type:list
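
The file listing and keyword filter described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names list_files_sketch and find_csv_sketch, the case-sensitive substring match, and the sample file names are all assumptions.

```python
import os
import tempfile


def list_files_sketch(folder_path):
    """Return (file_list, csv_list) for the files directly inside folder_path."""
    file_list = sorted(
        name for name in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, name))
    )
    # Assumption: a csv file is recognised by its ".csv" extension.
    csv_list = [name for name in file_list if name.lower().endswith(".csv")]
    return file_list, csv_list


def find_csv_sketch(csv_list, key):
    """Keep only the csv file names that contain key (case-sensitive)."""
    return [name for name in csv_list if key in name]


# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ("NYT_world.csv", "NYT_us.csv", "other.csv", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    file_list, csv_list = list_files_sketch(folder)
    nyt_files = find_csv_sketch(csv_list, "NYT")
```

Here nyt_files holds the two NYT csv names, while other.csv and notes.txt are filtered out.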
get_csv()
Returns:A list of csv files
Return type:list
sqlite_build_nyt_full()

Construct database for NYT.

Format
  • name (text) PRIMARY KEY
  • section (text)
  • date (datetime)
  • wordcnt (int)
  • summary (text)
  • id (int)

name, rather than id, is the primary key of this database, so that articles with duplicate names are removed. All text fields are converted to unicode, and any conversion errors are ignored. This can leave some odd characters in the processed text, although it should not affect the results.
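
The effect of using name as the primary key, together with the error-ignoring unicode conversion, can be sketched as below. The table name articles, the use of INSERT OR IGNORE to drop duplicates, and the sample rows are assumptions made for illustration; the library may build the database differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE articles (
        name    TEXT PRIMARY KEY,
        section TEXT,
        date    DATETIME,
        wordcnt INT,
        summary TEXT,
        id      INT
    )
    """
)

# Made-up raw rows; the second one repeats the name of the first.
raw_rows = [
    (b"Talks Resume", b"World", "2014-03-01", 850, b"Summary \xff one", 1),
    (b"Talks Resume", b"World", "2014-03-02", 900, b"Duplicate name", 2),
    (b"Budget Vote", b"US", "2014-03-03", 700, b"Summary two", 3),
]

for name, section, date, wordcnt, summary, article_id in raw_rows:
    row = (
        # Bytes that are not valid UTF-8 are silently dropped.
        name.decode("utf-8", errors="ignore"),
        section.decode("utf-8", errors="ignore"),
        date,
        wordcnt,
        summary.decode("utf-8", errors="ignore"),
        article_id,
    )
    # Assumption: a row whose name already exists is skipped.
    conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?)", row)
conn.commit()

stored = conn.execute("SELECT name, id FROM articles ORDER BY id").fetchall()
```

Only the first "Talks Resume" row survives, so stored contains the articles with id 1 and id 3.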

utils.create_nyt_cluster_database(database_name, all_clusters)

Creates a database to store computed clusters for subsequent chaining.

Parameters:
  • database_name (str) – Name of the database to store computed clusters to
  • all_clusters (list) – A list of computed clusters
utils.get_time()

Gets the current system time.

Returns:Current system time
Return type:str
utils.load_csv(file_name)

Loads the CSV file with the given filename.

Parameters:file_name (str) – Name of a csv file
Returns:Each item is a single row in the provided CSV file
Return type:list
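
A row-per-item loader of this kind can be sketched with the standard csv module. The function name load_csv_sketch, the UTF-8 encoding, and the sample file are assumptions; the library's own reader may handle encodings and headers differently.

```python
import csv
import os
import tempfile


def load_csv_sketch(file_name):
    """Read every row of a CSV file into a list of lists."""
    with open(file_name, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Write a small sample file, then load it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(
            [["name", "section"], ["Talks Resume", "World"]]
        )
    rows = load_csv_sketch(path)
```

Each element of rows is one row of the file, with the header row included as the first item.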
utils.load_nyt(section_name='World', start_date='2014-01-01', end_date='2015-01-01', keywords='')

Loads data from SQLite DB

SQLite DB requires the following format
  • name (text)
  • section (text)
  • date (datetime)
  • wordcnt (int)
  • summary (text)
  • id (int)
Parameters:
  • section_name (str) – Name of a section, either “World” or “US”. May also be blank to load from both.
  • start_date (date) – Articles loaded will be published after this date
  • end_date (date) – Articles loaded will be published before this date
  • keywords (str) – Keywords, each separated by a single space to restrict the articles loaded. e.g. “china israel”
Returns:

Each item in the list is an article loaded from the SQLite DB

Return type:

list
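
How the four parameters might translate into a query is sketched below. The table name articles, matching keywords against the summary column, and requiring every keyword to match are assumptions; only the column layout and the parameter meanings come from the description above.

```python
import sqlite3


def load_articles_sketch(conn, section_name="World", start_date="2014-01-01",
                         end_date="2015-01-01", keywords=""):
    """Select articles by date range, optional section and optional keywords."""
    query = ("SELECT name, section, date, wordcnt, summary, id "
             "FROM articles WHERE date > ? AND date < ?")
    params = [start_date, end_date]
    if section_name:  # a blank section loads from every section
        query += " AND section = ?"
        params.append(section_name)
    for word in keywords.split():  # assumption: all keywords must appear
        query += " AND summary LIKE ?"
        params.append("%" + word + "%")
    query += " ORDER BY date"
    return conn.execute(query, params).fetchall()


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (name TEXT PRIMARY KEY, section TEXT, "
             "date DATETIME, wordcnt INT, summary TEXT, id INT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("China Talks", "World", "2014-02-01", 800, "china and israel hold talks", 1),
        ("Senate Vote", "US", "2014-05-01", 600, "senate passes budget", 2),
        ("Old Story", "World", "2013-06-01", 500, "china trade story", 3),
    ],
)

world_2014 = load_articles_sketch(conn)
both_sections = load_articles_sketch(conn, section_name="")
filtered = load_articles_sketch(conn, keywords="china israel")
```

Dates stored as ISO strings compare correctly as text, which is why plain string parameters are enough in this sketch.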

utils.load_nyt_by_article_id(article_id)

Loads an article from SQLite DB

SQLite DB requires the following format
  • name (text)
  • section (text)
  • date (datetime)
  • wordcnt (int)
  • summary (text)
  • id (int)
Parameters:article_id (int) – The ID of an article
Returns:A list with a single item, the article, if it exists in the database. Otherwise an empty list is returned
Return type:list
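
The "single item or empty list" return shape follows naturally from fetching all matching rows, as in this sketch. The table name articles, the helper name load_by_id_sketch, and the LIMIT 1 clause are assumptions for illustration.

```python
import sqlite3


def load_by_id_sketch(conn, article_id):
    """Return a list holding the matching article, or an empty list."""
    return conn.execute(
        "SELECT name, section, date, wordcnt, summary, id "
        "FROM articles WHERE id = ? LIMIT 1",
        (article_id,),
    ).fetchall()


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (name TEXT PRIMARY KEY, section TEXT, "
             "date DATETIME, wordcnt INT, summary TEXT, id INT)")
conn.execute("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)",
             ("Budget Vote", "US", "2014-03-03", 700, "Summary two", 3))

found = load_by_id_sketch(conn, 3)
missing = load_by_id_sketch(conn, 99)
```

Callers can test the result with a plain truth check, since missing is an empty list.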
utils.load_nyt_clusters(start_date=None, end_date=None, db_name='NewYT_clustered.db')

Loads clusters from a specified database.

Parameters:
  • start_date (datetime) – The earliest published article in the cluster must be published on or after this date
  • end_date (datetime) – The earliest published article in the cluster must be published on or before this date
  • db_name (str) – The name of the database of clusters to load from
Returns:

A list containing the clusters that were selected by the query. Each row in the list is a single cluster

Return type:

list

utils.log_to_file(program_name)

Prepares configuration for logging for a specific program.

Parameters:program_name (str) – Name of a program (usually a .py file)
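
One plausible shape for per-program logging is sketched below using the standard logging module. Everything beyond the program_name parameter is an assumption: the log_dir argument, the "<program_name>.log" file name, the INFO level, and the message format are hypothetical choices, not the library's documented behaviour.

```python
import logging
import os
import tempfile


def log_to_file_sketch(program_name, log_dir):
    """Attach a file handler named after the program to its logger."""
    logger = logging.getLogger(program_name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(os.path.join(log_dir, program_name + ".log"))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, handler


# Demonstration in a throwaway folder with a made-up program name.
with tempfile.TemporaryDirectory() as folder:
    logger, handler = log_to_file_sketch("clustering.py", folder)
    logger.info("started")
    # Close the handler so the file can be read and the folder removed.
    handler.close()
    logger.removeHandler(handler)
    with open(os.path.join(folder, "clustering.py.log")) as log_file:
        log_text = log_file.read()
```

The log file then contains one timestamped INFO line for the "started" message.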
utils.sqlite_test()

Tests if the database has been loaded correctly.

utils.stdout_redirect(*args, **kwds)

Redirects stdout to the given stream.
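
The (*args, **kwds) signature suggests a function wrapped by contextlib.contextmanager, so a redirect of this kind might look like the sketch below. The name redirect_stdout_sketch and the single stream argument are assumptions; the library's actual implementation is not shown in this reference.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def redirect_stdout_sketch(stream):
    """Temporarily send everything written to sys.stdout into stream."""
    saved = sys.stdout
    sys.stdout = stream
    try:
        yield stream
    finally:
        # Always restore the original stdout, even if the body raises.
        sys.stdout = saved


buffer = io.StringIO()
with redirect_stdout_sketch(buffer):
    print("captured line")
captured = buffer.getvalue()
```

In Python 3.4 and later, the standard library's contextlib.redirect_stdout offers the same behaviour without a custom helper.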