analyse (article cluster creation)
Classes to perform clustering on individual articles.
-
class analyse.BigClamArticle(data, coms)
Bases: analyse.BisectingKmeans
Performs clustering using overlapping community detection
-
__init__(data, coms)
Refer to superclass for details.
Parameters: - data (list) – Refer to superclass
- coms (int) – Number of communities to find
-
compute()
Prepares the data for the community detection algorithm and saves the detected communities to bigclam_cluster. Each detected community represents a single cluster and is stored as a ComputedCluster.
-
-
class analyse.BisectingKmeans(data)
Bases: object
Performs clustering using Bisecting Kmeans.
- Algorithm
- Begin with all items in a single cluster
- Use normal kmeans to split the cluster into 2 clusters
- Repeat the previous step until a termination condition is reached. In this project, the algorithm terminates once no cluster contains more than 5 items.
-
__init__(data)
Initialises the object. The key parameter to modify is max_cluster_size, which determines when the algorithm terminates. The full data set is inserted into the queue initially and serves as the first cluster.
Parameters: data (list) – list of articles, 1 article per row
-
compute()
Performs Bisecting Kmeans on the given data.
The parameters for Kmeans are as follows (see the scikit-learn documentation for a detailed explanation):
- n_clusters: 2 (split into 2)
- max_iter: 500
- n_init: 50
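The algorithm and the Kmeans settings listed above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name bisecting_kmeans, the use of index arrays, the fixed random_state and the handling of degenerate splits are all assumptions made for the example.

```python
from collections import deque

import numpy as np
from sklearn.cluster import KMeans


def bisecting_kmeans(points, max_cluster_size=5):
    """Split `points` until no cluster has more than `max_cluster_size` rows.

    Hypothetical helper: returns a list of index arrays, one per final cluster.
    """
    points = np.asarray(points, dtype=float)
    # The queue starts with a single cluster that holds every item.
    queue = deque([np.arange(len(points))])
    finished = []

    while queue:
        idx = queue.popleft()
        if len(idx) <= max_cluster_size:
            finished.append(idx)  # small enough: store it
            continue
        # Settings taken from the parameter list above; random_state is
        # added here only to make the example reproducible.
        km = KMeans(n_clusters=2, max_iter=500, n_init=50, random_state=0)
        labels = km.fit_predict(points[idx])
        left, right = idx[labels == 0], idx[labels == 1]
        if len(left) == 0 or len(right) == 0:
            # Degenerate split (for example, identical points): stop here.
            finished.append(idx)
            continue
        queue.append(left)
        queue.append(right)
    return finished
```

Each returned array holds the row indices of one final cluster, so the original items can be recovered with points[indices].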
-
find_computed_cluster_metrics()
Initialises cluster metric computation for every cluster found by the given clustering algorithm.
-
class analyse.ComputedCluster(data)
Bases: object
Stores a computed cluster and its associated data in an easy format for subsequent processing.
-
__init__(data)
Breaks down raw data into different fields for subsequent processing into the SQL DB. Also initialises the metrics for the cluster.
Parameters: data (list) – Articles in raw format that are included in the cluster
-
compute_metrics(corpus, article_pos)
Computes metrics for the given cluster: diameter, radius, centroid, the article closest to the centroid, and the distance from that article to the centroid.
Parameters: - corpus – A corpus in LSI space
- article_pos (dict) – Maps each article id to the position of that article in the corpus
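One way to compute these metrics is sketched below. The definitions are assumptions, since the documentation does not spell them out: the radius is taken as the largest distance from any member to the centroid, the diameter as the largest pairwise distance, and distances are Euclidean. The corpus is assumed to be a list of dense vectors, and the function name cluster_metrics is hypothetical.

```python
import math
from itertools import combinations


def cluster_metrics(corpus, article_pos):
    """corpus: list of dense vectors; article_pos: article id -> index in corpus."""
    vectors = {aid: corpus[pos] for aid, pos in article_pos.items()}
    ids = list(vectors)
    dim = len(vectors[ids[0]])

    # Centroid: coordinate-wise mean of the member vectors.
    centroid = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)]
    dist_to_centroid = {i: math.dist(vectors[i], centroid) for i in ids}
    closest_id = min(dist_to_centroid, key=dist_to_centroid.get)

    return {
        "centroid": centroid,
        # Assumed definition: farthest member from the centroid.
        "radius": max(dist_to_centroid.values()),
        # Assumed definition: largest distance between any two members.
        "diameter": max(
            (math.dist(vectors[a], vectors[b]) for a, b in combinations(ids, 2)),
            default=0.0,
        ),
        "closest_article": closest_id,
        "closest_distance": dist_to_centroid[closest_id],
    }
```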
-
display()
Helper function to display the contents of a computed cluster.
-
-
class analyse.ExtractFeatures
Bases: object
Extracts features from data extracted from SQL DB in a specific format.
- DB Format
- name text
- section text
- date datetime
- wordcnt int
- summary text
- id int
- Usage
- load_data() with the desired data in appropriate format
- pre_processing()
- transform_corpus()
- model_lsi()
-
__init__()
Declares variables for later use.
-
load_data(data)
Loads data in the format specified above.
The loaded data is stored in self.data.
-
model_lsi(n_topics=200)
Performs dimensionality reduction on the LogEntropy weighted corpus using latent semantic indexing (LSI). The corpus, converted into LSI space, is stored in self.corpus_lsi.
Parameters: n_topics (int) – the number of topics in the LSI space
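At its core, LSI is a truncated singular value decomposition of the weighted document-term matrix. The sketch below shows the idea with NumPy on a dense matrix. It is an illustration only: the function name lsi_project is hypothetical, and the library actually used by this module may scale or truncate the result differently.

```python
import numpy as np


def lsi_project(doc_term, n_topics=200):
    """Project documents into an LSI space of at most `n_topics` dimensions.

    doc_term: (n_docs, n_terms) weighted matrix, one row per document.
    Returns an (n_docs, k) array with k = min(n_topics, rank bound).
    """
    doc_term = np.asarray(doc_term, dtype=float)
    # Thin SVD: doc_term = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(doc_term, full_matrices=False)
    k = min(n_topics, len(s))
    # Document coordinates in topic space: the first k columns of u * s.
    return u[:, :k] * s[:k]
```

When n_topics is smaller than the number of singular values, only the strongest k directions are kept, which is where the dimensionality reduction comes from.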
-
parsing()
Performs standard text processing on the contents of self.documents.
- Steps:
- Removal of html tags, punctuation, multiple whitespaces, numbers and stopwords
- Addition of bigrams and trigrams to the dictionary (self.dictionary)
- Conversion into a bag of words corpus (self.corpus)
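The cleaning and bag of words steps above can be sketched with the standard library alone. The stopword set, the regular expressions and the helper names clean, build_dictionary and to_bow are assumptions made for illustration, and the bigram and trigram step is left out to keep the example short.

```python
import re
from collections import Counter

# Illustrative stopword set only; the real list is not given in this document.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def clean(text):
    """Return the lowercase tokens of `text` after basic cleaning."""
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    text = re.sub(r"\d+", " ", text)      # strip numbers
    tokens = text.lower().split()         # split() also collapses whitespace
    return [t for t in tokens if t not in STOPWORDS]


def build_dictionary(token_lists):
    """Assign an integer id to every distinct token."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: i for i, token in enumerate(vocab)}


def to_bow(tokens, dictionary):
    """Convert tokens into a sorted list of (token_id, count) pairs."""
    counts = Counter(dictionary[t] for t in tokens if t in dictionary)
    return sorted(counts.items())
```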
-
pre_processing()
Combines the titles with the summaries of the articles for further processing. The combined titles and summaries are stored in self.documents.
Calling this function also calls the parsing function.
-
pre_processing_line()
Helper function for feature extraction in snap_cluster_lib.
-
print_data()
Helper function that prints the contents of self.data.
-
transform_corpus()
Transforms the bag of words corpus into a LogEntropy weighted corpus, reducing the weight of unimportant terms. The transformed corpus is stored in self.corpus_transformed.
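For reference, a common textbook form of log-entropy weighting multiplies a local weight log(1 + tf) by a global weight 1 + sum_j(p_j * log p_j) / log(N), where p_j is the share of a term's total occurrences that falls in document j and N is the number of documents. The sketch below implements that form. It is an assumption that the module follows it exactly; library implementations can differ in details such as the normalising constant. The function name log_entropy is hypothetical.

```python
import math
from collections import defaultdict


def log_entropy(bow_corpus):
    """bow_corpus: list of documents, each a list of (term_id, count) pairs."""
    n_docs = len(bow_corpus)
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")

    # Total number of occurrences of each term across the corpus.
    global_freq = defaultdict(int)
    for doc in bow_corpus:
        for term_id, count in doc:
            global_freq[term_id] += count

    # Sum of p * log(p) per term, where p is the share of the term's
    # occurrences that falls in a given document.
    entropy_sum = defaultdict(float)
    for doc in bow_corpus:
        for term_id, count in doc:
            p = count / global_freq[term_id]
            entropy_sum[term_id] += p * math.log(p)

    # Terms spread evenly over all documents get a global weight near 0,
    # terms concentrated in one document get a weight of 1.
    global_weight = {t: 1.0 + s / math.log(n_docs) for t, s in entropy_sum.items()}

    return [
        [(t, math.log(1 + c) * global_weight[t]) for t, c in doc]
        for doc in bow_corpus
    ]
```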
-
word_freq()
DEPRECATED
Extracts the 75 most common words in the data and adds them to the stopword list.
Returns: A modified stop word list
Return type: list
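Although the method is deprecated, the idea is simple to sketch: count every token, take the most frequent ones, and merge them into the existing stopword list. The function name extend_stopwords and its arguments are hypothetical, and the sorted return value is a choice made for this example.

```python
from collections import Counter


def extend_stopwords(token_lists, stopwords, top_n=75):
    """Return `stopwords` extended with the `top_n` most frequent tokens."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    most_common = [word for word, _ in counts.most_common(top_n)]
    return sorted(set(stopwords) | set(most_common))
```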
-
class analyse.QueueCluster(data)
Bases: object
Object to store a cluster while it is in the queue. This supports the bisecting kmeans clustering. In that algorithm, a cluster in the queue can either be processed further (broken down into 2 smaller clusters) or stored.
-
__init__(data)
Performs feature extraction and dimensionality reduction on the raw article data that makes up this cluster, for further processing later. The data is converted into LSI space, which is necessary for Kmeans clustering.
Parameters: data (list) – Articles in raw format
-
-
analyse.get_stop_list()
DEPRECATED
Stopword list generator.
Returns: list of stopwords
Return type: list