analyse (article cluster creation)

Classes to perform clustering on individual articles.

class analyse.BigClamArticle(data, coms)

Bases: analyse.BisectingKmeans

Performs clustering using overlapping community detection

__init__(data, coms)

Refer to superclass for details

Parameters:
  • data (list) – Refer to superclass
  • coms (int) – Number of communities to find
compute()

Prepares the data for the community detection algorithm and saves the detected communities to bigclam_cluster. Each detected community represents a single cluster and is stored as a ComputedCluster.

class analyse.BisectingKmeans(data)

Bases: object

Performs clustering using Bisecting Kmeans.

Algorithm
  • Begin with all items in a single cluster
  • Use normal kmeans to split the cluster into 2 clusters
  • Repeat the previous step until some termination condition is reached. For this project, the algorithm terminates when no cluster contains more than 5 items.
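
The loop above can be sketched as follows. This is a minimal illustration and not the project's implementation: the names bisect and two_means and the 1-D float data are hypothetical, and a small hand-rolled 2-means stands in for the Kmeans step.

```python
from collections import deque


def two_means(points, iterations=20):
    """Split 1-D points into two groups with a tiny 2-means (illustrative only)."""
    lo, hi = min(points), max(points)  # seed the two centroids at the extremes
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if abs(p - lo) <= abs(p - hi)]
        right = [p for p in points if abs(p - lo) > abs(p - hi)]
        if not left or not right:
            break
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return left, right


def bisect(points, max_cluster_size=5):
    """Split clusters until none holds more than max_cluster_size items."""
    queue = deque([list(points)])  # the full data set is the first cluster
    finished = []
    while queue:
        cluster = queue.popleft()
        if len(cluster) <= max_cluster_size:
            finished.append(cluster)  # small enough: store it
            continue
        left, right = two_means(cluster)
        if not left or not right:  # cannot be split (e.g. identical points)
            finished.append(cluster)
            continue
        queue.extend([left, right])  # both halves go back into the queue
    return finished
```

Each queue entry is either stored or split in two, so the loop ends once every remaining cluster is small enough or cannot be split further.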
__init__(data)

Initialises the object. The key parameter to modify is max_cluster_size, which determines when the algorithm terminates. The full data set is inserted into the queue initially and serves as the first cluster.

Parameters: data (list) – List of articles, one article per row
compute()

Performs Bisecting Kmeans on the given data.

Parameters for Kmeans are as follows (see the scikit-learn documentation for a detailed explanation of each):

  • n_clusters: 2 (split into 2)
  • max_iter: 500
  • n_init: 50
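
A single split with these settings might look like the sketch below. The synthetic points are a hypothetical stand-in for articles in LSI space, and random_state is an assumption added only to make the example repeatable; it is not one of the documented parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated groups of 10 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 3)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 3)),
])

# n_clusters, max_iter and n_init follow the list above.
# random_state is an assumption for reproducibility.
kmeans = KMeans(n_clusters=2, max_iter=500, n_init=50, random_state=0)
labels = kmeans.fit_predict(points)

# The two halves that would go back into the queue.
left = points[labels == 0]
right = points[labels == 1]
```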
find_computed_cluster_metrics()

Starts metric computation for every cluster found by the clustering algorithm.

class analyse.ComputedCluster(data)

Bases: object

Stores a computed cluster and its associated data in an easy format for subsequent processing.

__init__(data)

Breaks down the raw data into separate fields for subsequent insertion into the SQL DB. Also initialises the metrics for the cluster.

Parameters: data (list) – Articles in raw format that are included in the cluster
compute_metrics(corpus, article_pos)

Computes metrics for the given cluster. The metrics computed are the diameter, the radius, the centroid, the article closest to the centroid, and that article's distance to the centroid.

Parameters:
  • corpus – A corpus in LSI space
  • article_pos (dict) – Maps each article id to the position of that article in the corpus
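
The documentation does not define these metrics, so the sketch below makes its own assumptions: the radius is taken as the largest distance from the centroid to a member, and the diameter as the largest distance between any two members. The function cluster_metrics and its input format (article id mapped to a dense vector) are hypothetical.

```python
from itertools import combinations
from math import dist


def cluster_metrics(vectors):
    """Return centroid, radius, diameter and the member closest to the centroid.

    vectors maps a (hypothetical) article id to its dense vector.
    """
    ids = list(vectors)
    dims = len(vectors[ids[0]])
    centroid = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dims)]

    to_centroid = {i: dist(vectors[i], centroid) for i in ids}
    closest_id = min(to_centroid, key=to_centroid.get)

    return {
        "centroid": centroid,
        # Assumed definition: farthest member from the centroid.
        "radius": max(to_centroid.values()),
        # Assumed definition: largest pairwise distance (0.0 for one member).
        "diameter": max(
            (dist(vectors[a], vectors[b]) for a, b in combinations(ids, 2)),
            default=0.0,
        ),
        "closest_article": closest_id,
        "closest_distance": to_centroid[closest_id],
    }
```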
display()

Helper function to display the contents of a computed cluster

class analyse.ExtractFeatures

Bases: object

Extracts features from data extracted from SQL DB in a specific format.

DB Format
  • name text
  • section text
  • date datetime
  • wordcnt int
  • summary text
  • id int
Usage
  • load_data() with the desired data in appropriate format
  • pre_processing()
  • transform_corpus()
  • model_lsi()
__init__()

Declares variables for later use

load_data(data)

Loads data in the format specified above.

The loaded data is stored in self.data.

model_lsi(n_topics=200)

Performs dimensionality reduction on the LogEntropy weighted corpus using latent semantic indexing (LSI). The corpus, now converted into LSI space, is located in self.corpus_lsi.

Parameters: n_topics (int) – The number of topics in the LSI space
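
LSI amounts to a truncated singular value decomposition of the weighted term-document matrix. The sketch below shows that idea with plain NumPy; it is an assumption-laden stand-in and not the method's actual code, and lsi_project is a hypothetical name.

```python
import numpy as np


def lsi_project(term_doc, n_topics):
    """Project documents into an n_topics-dimensional space via truncated SVD.

    term_doc has one row per term and one column per document.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(n_topics, len(s))  # cannot keep more topics than singular values
    # Each row of the result is one document in the reduced space.
    return (np.diag(s[:k]) @ vt[:k]).T
```

Documents with identical term counts land on the same point, and distances between documents are preserved whenever n_topics is at least the rank of the matrix.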
parsing()

Performs standard text processing on the contents of self.documents.

Steps:
  • Removal of html tags, punctuation, multiple whitespaces, numbers and stopwords
  • Addition of bigrams and trigrams to the dictionary (self.dictionary)
  • Conversion into a bag of words corpus (self.corpus)
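
The steps above can be sketched with the standard library alone. Everything here is illustrative: the stopword list is a tiny placeholder, the cleaning is ASCII-only, every adjacent word pair is added as a bigram without any frequency filtering, trigrams are left out for brevity, and the function names are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # placeholder list


def clean(text):
    """Strip HTML tags, punctuation, digits, extra whitespace and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # punctuation and numbers
    tokens = text.lower().split()             # split() collapses whitespace runs
    return [t for t in tokens if t not in STOPWORDS]


def add_bigrams(tokens):
    """Append every adjacent pair as a joined token (no frequency filtering)."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def bag_of_words(documents):
    """Build a token -> id dictionary and a per-document list of (id, count)."""
    tokenised = [add_bigrams(clean(doc)) for doc in documents]
    dictionary = {}
    for tokens in tokenised:
        for token in tokens:
            dictionary.setdefault(token, len(dictionary))
    corpus = [
        sorted(Counter(dictionary[t] for t in tokens).items())
        for tokens in tokenised
    ]
    return dictionary, corpus
```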
pre_processing()

Combines the titles with the summaries of the articles for further processing. The combined titles and summaries are stored in self.documents.

Calling this function also triggers a call to parsing().

pre_processing_line()

Helper function for feature extraction in snap_cluster_lib.

print_data()

Helper function that prints the contents of self.data

transform_corpus()

Transforms bag of words corpus into a LogEntropy weighted corpus, reducing the weight of unimportant terms. The transformed corpus is located in self.corpus_transformed.
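
For reference, the textbook log-entropy scheme multiplies a local weight log(1 + tf) by a global weight 1 + sum(p * log(p)) / log(N), where p is a term's count in one document divided by its count in the whole corpus and N is the number of documents. The sketch below implements that textbook form; libraries differ in small details, so treat it as an approximation of what this method produces. The name log_entropy is hypothetical.

```python
import math
from collections import defaultdict


def log_entropy(corpus):
    """Weight a bag-of-words corpus (lists of (term_id, count)) by log-entropy."""
    n_docs = len(corpus)
    if n_docs < 2:
        raise ValueError("log-entropy needs at least two documents")

    totals = defaultdict(int)  # corpus-wide frequency of each term
    for doc in corpus:
        for term, count in doc:
            totals[term] += count

    entropy = defaultdict(float)  # sum over documents of p * log(p)
    for doc in corpus:
        for term, count in doc:
            p = count / totals[term]
            entropy[term] += p * math.log(p)

    global_weight = {t: 1.0 + entropy[t] / math.log(n_docs) for t in totals}
    return [
        [(term, math.log(1 + count) * global_weight[term]) for term, count in doc]
        for doc in corpus
    ]
```

A term spread evenly over all documents gets a global weight of 0, while a term confined to one document keeps a global weight of 1, which is how unimportant terms are down-weighted.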

word_freq()

DEPRECATED

Extracts the 75 most common words in the data and adds them to the stopword list.

Returns: A modified stop word list
Return type: list
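
The idea behind this deprecated helper can be sketched as below. The function name, its arguments, and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter


def extend_stopwords(documents, stopwords, top_n=75):
    """Return a new stopword list that also includes the top_n most frequent words."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    frequent = [word for word, _ in counts.most_common(top_n)]
    # dict.fromkeys keeps the original order and drops duplicates.
    return list(dict.fromkeys(list(stopwords) + frequent))
```
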
class analyse.QueueCluster(data)

Bases: object

Object to store a cluster while it is in the queue. This supports the bisecting kmeans clustering. In that algorithm, a cluster in the queue can either be processed further (broken down into 2 smaller clusters) or stored.

__init__(data)

Performs feature extraction and dimensionality reduction on the raw article data that makes up this cluster, ready for later processing. The data is converted into LSI space, which is necessary for Kmeans clustering.

Parameters: data (list) – Articles in raw format
analyse.get_stop_list()

DEPRECATED

Stopword list generator

Returns: List of stopwords
Return type: list