analyse (article cluster creation)¶

Classes to perform clustering on individual articles.

class analyse.BigClamArticle(data, coms)¶

Bases: analyse.BisectingKmeans

Performs clustering using overlapping community detection

__init__(data, coms)¶

Refer to superclass for details

Parameters:
data (list) – Refer to superclass
coms (int) – Number of communities to find

compute()¶: Prepares the data for the community detection algorithm and saves the detected communities to bigclam_cluster. Each detected community represents a single cluster and is stored as a ComputedCluster.

class analyse.BisectingKmeans(data)¶

Bases: object

Performs clustering using Bisecting Kmeans.

Algorithm

Begin with all items in a single cluster
Use normal kmeans to split the cluster into 2 clusters
Repeat the previous step on every cluster that is still too large, until the termination condition is reached. For this project, the algorithm terminates when no cluster contains more than 5 items.
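The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the helper names two_means and bisecting_kmeans are hypothetical, points are assumed to be tuples of floats, and a small hand-written 2-means stands in for the scikit-learn Kmeans that the class actually uses.

```python
import math
import random
from collections import deque


def two_means(points, max_iter=100, seed=0):
    """Split a list of points (tuples) into two groups with plain k-means, k=2."""
    rng = random.Random(seed)
    centres = rng.sample(points, 2)
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            # Assign each point to the nearer of the two centres.
            nearer = 0 if math.dist(p, centres[0]) <= math.dist(p, centres[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            # Degenerate split (e.g. duplicate points): fall back to halving.
            half = len(points) // 2
            return points[:half], points[half:]
        new_centres = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
        if new_centres == centres:
            break
        centres = new_centres
    return groups


def bisecting_kmeans(points, max_cluster_size=5):
    """Keep splitting clusters in two until none exceeds max_cluster_size (>= 1)."""
    queue = deque([list(points)])  # the full data set is the first cluster
    finished = []
    while queue:
        cluster = queue.popleft()
        if len(cluster) <= max_cluster_size:
            finished.append(cluster)  # small enough: store it
        else:
            queue.extend(two_means(cluster))  # too large: split and re-queue
    return finished
```

Calling bisecting_kmeans(points) on any list of at least one point returns a list of clusters, each holding at most max_cluster_size points.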

__init__(data)¶

Initialises the object. The key parameter to modify is max_cluster_size, which determines when the algorithm terminates. The full data set is inserted into the queue initially and serves as the first cluster.

Parameters:	data (list) – list of articles, 1 article per row

compute()¶

Performs Bisecting Kmeans on the given data.

The parameters for Kmeans are as follows (see the scikit-learn documentation for a detailed explanation):

n_clusters: 2 (split into 2)
max_iter: 500
n_init: 50

find_computed_cluster_metrics()¶: Starts the metric computation for every cluster found by the clustering algorithm.

class analyse.ComputedCluster(data)¶

Bases: object

Stores a computed cluster and its associated data in an easy format for subsequent processing.

__init__(data)¶

Breaks down raw data into different fields for subsequent processing into SQL DB. Also initialises metrics for the cluster.

Parameters:	data (list) – Articles in raw format that are included in the cluster

compute_metrics(corpus, article_pos)¶

Computes metrics for the given cluster. The metrics computed are the diameter, the radius, the centroid, the article closest to the centroid, and the distance from that article to the centroid.

Parameters:
corpus – A corpus in LSI space
article_pos (dict) – Maps the article id to the actual positions of the article in the corpus
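As a rough illustration of these metrics, the sketch below computes them for a handful of dense vectors. The function name cluster_metrics and the input format (a dict from article id to coordinate tuple) are hypothetical. The definitions are also assumptions, since the reference above does not spell them out: the radius is taken as the largest distance from the centroid to any article, and the diameter as the largest distance between any two articles.

```python
import math
from itertools import combinations


def cluster_metrics(vectors):
    """vectors: dict mapping article id -> tuple of coordinates (assumed format)."""
    ids = list(vectors)
    dims = len(vectors[ids[0]])
    # Centroid: the coordinate-wise mean of all article vectors.
    centroid = tuple(sum(vectors[i][d] for i in ids) / len(ids) for d in range(dims))
    dist_to_centroid = {i: math.dist(vectors[i], centroid) for i in ids}
    closest_id = min(dist_to_centroid, key=dist_to_centroid.get)
    return {
        "centroid": centroid,
        # Assumed definition: farthest article from the centroid.
        "radius": max(dist_to_centroid.values()),
        # Assumed definition: largest pairwise distance (0.0 for a single article).
        "diameter": max(
            (math.dist(vectors[a], vectors[b]) for a, b in combinations(ids, 2)),
            default=0.0,
        ),
        "closest_article": closest_id,
        "closest_distance": dist_to_centroid[closest_id],
    }
```

The pairwise diameter is quadratic in the cluster size, which is cheap for clusters capped at a few articles.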

display()¶: Helper function to display the contents of a computed cluster

class analyse.ExtractFeatures¶

Bases: object

Extracts features from data extracted from SQL DB in a specific format.

DB Format

name text
section text
date datetime
wordcnt int
summary text
id int

Usage

load_data() with the desired data in appropriate format
pre_processing()
transform_corpus()
model_lsi()

__init__()¶: Declares variables for later use

load_data(data)¶

Loads data in the format specified above.

The loaded data is stored in self.data.

model_lsi(n_topics=200)¶

Performs dimensionality reduction on the LogEntropy weighted corpus using latent semantic indexing (LSI). The corpus, now converted into LSI space, is located in self.corpus_lsi.

Parameters:	n_topics (int) – the number of topics in the LSI space

parsing()¶

Performs standard text processing on the contents of self.documents.

Steps:

Removal of html tags, punctuation, multiple whitespaces, numbers and stopwords
Addition of bigrams and trigrams to the dictionary (self.dictionary)
Conversion into a bag of words corpus (self.corpus)
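The steps above might look roughly like this in plain Python. It is a simplified sketch, not the module's code: the names STOPWORDS, clean, add_bigrams and build_corpus are hypothetical, the stopword set is a tiny placeholder, every adjacent word pair is added as a bigram (a real phrase detector would keep only frequent ones), and trigrams are left out for brevity.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # placeholder list


def clean(text):
    """Strip HTML tags, punctuation and numbers, then drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # html tags
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation
    text = re.sub(r"\d+", " ", text)       # numbers
    tokens = text.lower().split()          # split() also collapses whitespace
    return [t for t in tokens if t not in STOPWORDS]


def add_bigrams(tokens):
    """Append every adjacent word pair as a single 'a_b' token (simplification)."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def build_corpus(documents):
    """Return (dictionary, corpus); corpus rows are lists of (token id, count)."""
    token_lists = [add_bigrams(clean(d)) for d in documents]
    dictionary = {}
    for tokens in token_lists:
        for t in tokens:
            dictionary.setdefault(t, len(dictionary))  # assign ids in first-seen order
    corpus = [
        sorted(Counter(dictionary[t] for t in tokens).items())
        for tokens in token_lists
    ]
    return dictionary, corpus
```

Each row of the resulting corpus is a sparse bag of words: only the tokens that occur in the document appear, paired with their counts.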

pre_processing()¶

Combines the titles with the summaries of the articles for further processing. The combined titles and summaries are stored in self.documents.

Calling this function also triggers a call to parsing().

pre_processing_line()¶: Helper function for feature extraction in snap_cluster_lib.

print_data()¶: Helper function that prints the contents of self.data

transform_corpus()¶: Transforms bag of words corpus into a LogEntropy weighted corpus, reducing the weight of unimportant terms. The transformed corpus is located in self.corpus_transformed.
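To show why LogEntropy weighting reduces the weight of unimportant terms, here is a small sketch of the usual textbook formulation. The function name log_entropy is hypothetical, the corpus is assumed to be a list of documents given as (term id, count) pairs, and at least two documents are required. The local weight is log(1 + tf) and the global weight is 1 + sum(p * log p) / log(N), where p is the share of a term's occurrences that falls in a document and N is the number of documents. Library implementations may normalise slightly differently.

```python
import math
from collections import defaultdict


def log_entropy(corpus):
    """Weight a bag-of-words corpus: log(1 + tf) times an entropy-based global weight."""
    n_docs = len(corpus)  # must be >= 2, otherwise log(n_docs) is zero
    global_freq = defaultdict(int)
    for doc in corpus:
        for term_id, tf in doc:
            global_freq[term_id] += tf

    # Sum of p * log(p) per term, where p = tf / total occurrences of the term.
    entropy = defaultdict(float)
    for doc in corpus:
        for term_id, tf in doc:
            p = tf / global_freq[term_id]
            entropy[term_id] += p * math.log(p)

    # Terms spread evenly over all documents approach weight 0;
    # terms concentrated in one document keep weight 1.
    global_weight = {t: 1 + e / math.log(n_docs) for t, e in entropy.items()}
    return [
        [(t, math.log(1 + tf) * global_weight[t]) for t, tf in doc]
        for doc in corpus
    ]
```

A term that appears equally often in every document carries no information about any one of them, so its weight drops to zero, while a term confined to a single document keeps its full local weight.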

word_freq()¶

DEPRECATED

Extracts the 75 most common words in the data and adds them to the stopword list.

Returns:	A modified stop word list
Return type:	list

class analyse.QueueCluster(data)¶

Bases: object

Object to store a cluster while it is in the queue. This supports the bisecting kmeans clustering. In that algorithm, a cluster in the queue can either be processed further (broken down into 2 smaller clusters) or stored.

__init__(data)¶

Performs feature extraction and dimensionality reduction on the raw article data that makes up this cluster, ready for later processing. The data is converted into LSI space, which is necessary for Kmeans clustering.

Parameters:	data (list) – Articles in raw format

analyse.get_stop_list()¶

DEPRECATED

Stopword list generator

Returns:	list of stopwords
Return type:	list