analyse (article cluster creation)¶
Classes to perform clustering on individual articles.
-
class
analyse.BigClamArticle(data, coms)¶ Bases:
analyse.BisectingKmeans
Performs clustering using overlapping community detection.
-
__init__(data, coms)¶ Refer to superclass for details
Parameters: - data (list) – Refer to superclass
- coms (int) – Number of communities to find
-
compute()¶ Prepares the data for the community detection algorithm. The detected communities are then saved to bigclam_cluster. Each detected community represents a single cluster and is stored as a ComputedCluster.
-
-
class
analyse.BisectingKmeans(data)¶ Bases:
object
Performs clustering using Bisecting Kmeans.
- Algorithm
- Begin with all items in a single cluster
- Use normal kmeans to split the cluster into 2 clusters
- Repeat the previous step until some termination condition is reached. For this project, we terminate when there is no cluster which contains more than 5 items.
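The loop described above can be sketched without any dependencies. This is only an illustration under stated assumptions, not the project's code: bisecting_kmeans and two_means_1d are hypothetical names, the items are plain numbers instead of articles in LSI space, and the splitter seeds its two centres with the minimum and maximum value so that the run is deterministic.

```python
from collections import deque

MAX_CLUSTER_SIZE = 5  # termination threshold described above


def two_means_1d(values, max_iter=500):
    """Split 1-D values into two groups with plain k-means (k=2).

    Hypothetical stand-in for the real splitter: the two centres start
    at the minimum and maximum value.
    """
    lo, hi = min(values), max(values)
    if lo == hi:  # identical points cannot be separated; halve arbitrarily
        mid = len(values) // 2
        return values[:mid], values[mid:]
    for _ in range(max_iter):
        # Assign every value to its nearest centre (ties go left).
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return left, right


def bisecting_kmeans(items, split=two_means_1d, max_size=MAX_CLUSTER_SIZE):
    queue = deque([list(items)])  # the full data set is the first cluster
    finished = []
    while queue:
        cluster = queue.popleft()
        if len(cluster) <= max_size:
            finished.append(cluster)  # small enough: store it
        else:
            queue.extend(split(cluster))  # too large: bisect and re-queue
    return finished


clusters = bisecting_kmeans([1, 2, 3, 4, 5, 6, 7, 50, 51, 52, 53, 54, 55, 56])
```

Every returned cluster holds at most MAX_CLUSTER_SIZE items, and each input item ends up in exactly one cluster.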
-
__init__(data)¶ Initialises the object. The key parameter to modify is max_cluster_size, which determines when the algorithm terminates. The full data set is inserted into the queue initially and serves as the first cluster.
Parameters: data (list) – list of articles, 1 article per row
-
compute()¶ Performs Bisecting Kmeans on the given data.
The Kmeans parameters are as follows (see the scikit-learn documentation for detailed explanations):
- n_clusters: 2 (split into 2)
- max_iter: 500
- n_init: 50
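In scikit-learn's KMeans, max_iter caps the number of iterations of a single run, and n_init is the number of runs started from different centroids, of which the run with the lowest inertia is kept. A dependency-free sketch of that behaviour follows. It assumes randomly sampled data points as starting centroids; lloyd and kmeans are hypothetical helper names, and this is not scikit-learn's implementation.

```python
import random


def lloyd(points, centres, max_iter):
    """Run k-means iterations from the given starting centres."""
    for _ in range(max_iter):
        groups = [[] for _ in centres]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            groups[dists.index(min(dists))].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        new_centres = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centres)
        ]
        if new_centres == centres:  # assignments are stable
            break
        centres = new_centres
    # Inertia: sum of squared distances to the nearest centre.
    inertia = sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres)
        for p in points
    )
    return centres, inertia


def kmeans(points, n_clusters=2, max_iter=500, n_init=50, seed=0):
    rng = random.Random(seed)
    runs = [
        lloyd(points, rng.sample(points, n_clusters), max_iter)
        for _ in range(n_init)
    ]
    return min(runs, key=lambda run: run[1])  # keep the lowest-inertia run


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, inertia = kmeans(points)
```

A larger n_init costs more time but makes it less likely that a poor starting position determines the split.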
-
find_computed_cluster_metrics()¶ Initialises cluster metric computation over every cluster that is found by the given clustering algorithm.
-
class
analyse.ComputedCluster(data)¶ Bases:
object
Stores a computed cluster and its associated data in an easy format for subsequent processing.
-
__init__(data)¶ Breaks down raw data into different fields for subsequent processing into SQL DB. Also initialises metrics for the cluster.
Parameters: data (list) – Articles in raw format that are included in the cluster
-
compute_metrics(corpus, article_pos)¶ Computes metrics for the given cluster. Metrics computed are: diameter, radius, centroid, closest article to centroid, the distance of the closest article to the centroid.
Parameters: - corpus – A corpus in LSI space
- article_pos (dict) – Maps the article id to the actual positions of the article in the corpus
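The source does not define these metrics precisely, so the following is a sketch under assumptions: diameter is taken as the largest pairwise Euclidean distance between members, radius as the largest distance from the centroid to a member, and the cluster is passed as a plain dict of dense vectors rather than a corpus plus article_pos. The name cluster_metrics is hypothetical.

```python
import math
from itertools import combinations


def cluster_metrics(vectors):
    """vectors: dict mapping article id -> dense vector (list of floats)."""
    ids = list(vectors)
    dims = len(vectors[ids[0]])
    centroid = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dims)]
    to_centroid = {i: math.dist(vectors[i], centroid) for i in ids}
    closest = min(to_centroid, key=to_centroid.get)
    return {
        "centroid": centroid,
        "radius": max(to_centroid.values()),  # assumed definition
        "diameter": max(  # assumed definition; 0.0 for a single-article cluster
            (math.dist(vectors[a], vectors[b]) for a, b in combinations(ids, 2)),
            default=0.0,
        ),
        "closest_article": closest,
        "closest_distance": to_centroid[closest],
    }


metrics = cluster_metrics({101: [0.0, 0.0], 102: [4.0, 0.0], 103: [2.0, 3.0]})
```

For these three points the centroid is [2.0, 1.0], article 103 is closest to it at distance 2.0, and the diameter is 4.0.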
-
display()¶ Helper function to display the contents of a computed cluster
-
-
class
analyse.ExtractFeatures¶ Bases:
object
Extracts features from data extracted from the SQL DB in a specific format.
- DB Format
- name text
- section text
- date datetime
- wordcnt int
- summary text
- id int
- Usage
- load_data() with the desired data in appropriate format
- pre_processing()
- transform_corpus()
- model_lsi()
-
__init__()¶ Declares variables for later use
-
load_data(data)¶ Loads data in the format specified above.
The loaded data is stored in self.data.
-
model_lsi(n_topics=200)¶ Performs dimensionality reduction on the LogEntropy weighted corpus using latent semantic indexing (LSI). The corpus, now converted into LSI space, is located in self.corpus_lsi.
Parameters: n_topics (int) – the number of topics in the LSI space
-
parsing()¶ Performs standard text processing on the contents of self.documents.
- Steps:
- Removal of html tags, punctuation, multiple whitespaces, numbers and stopwords
- Addition of bigrams and trigrams to the dictionary (self.dictionary)
- Conversion into a bag of words corpus (self.corpus)
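The steps above can be sketched with the standard library alone. This is an illustration, not the module's implementation. The assumptions are a tiny hypothetical stopword list, lowercasing of tokens, and adding every contiguous bigram and trigram, where a real pipeline would more likely keep only frequent collocations. The function names are hypothetical.

```python
import re
import string
from collections import Counter

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # hypothetical short list
PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)   # html tags
    text = text.translate(PUNCT_TO_SPACE)  # punctuation
    text = re.sub(r"\d+", " ", text)       # numbers
    tokens = text.lower().split()          # split() also collapses whitespace
    return [t for t in tokens if t not in STOPWORDS]


def with_ngrams(tokens, max_n=3):
    """Append every contiguous bigram and trigram, joined with '_'."""
    grams = list(tokens)
    for n in range(2, max_n + 1):
        grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def build_corpus(documents):
    texts = [with_ngrams(clean(doc)) for doc in documents]
    dictionary = {}  # token -> integer id, in order of first appearance
    for text in texts:
        for token in text:
            dictionary.setdefault(token, len(dictionary))
    # Bag of words: one sorted list of (token id, count) pairs per document.
    corpus = [sorted(Counter(dictionary[t] for t in text).items()) for text in texts]
    return dictionary, corpus


dictionary, corpus = build_corpus(["<p>The 3 quick   foxes, in 2020!</p>", "quick dogs"])
```

Here the first document is reduced to the tokens quick, foxes and quick_foxes before it is converted to (id, count) pairs.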
-
pre_processing()¶ Combines the titles with the summaries of the articles for further processing. The combined titles and summaries are stored in self.documents.
Calling this function will result in a call to the parsing function.
-
pre_processing_line()¶ Helper function for feature extraction in snap_cluster_lib.
-
print_data()¶ Helper function that prints the contents of self.data
-
transform_corpus()¶ Transforms bag of words corpus into a LogEntropy weighted corpus, reducing the weight of unimportant terms. The transformed corpus is located in self.corpus_transformed.
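As a rough illustration of why this weighting suppresses unimportant terms, the sketch below applies the common textbook log-entropy definition: a local weight log(1 + tf) multiplied by a global weight 1 + sum(p * log p) / log(N), where p is the share of a term's occurrences that fall in a given document and N is the number of documents. This is an assumption about the scheme; the library used by the project may differ in details such as normalisation. The name log_entropy is hypothetical.

```python
import math
from collections import Counter


def log_entropy(corpus):
    """corpus: list of documents, each a list of (term_id, count) pairs.

    Needs at least two documents, because log(1) == 0 in the denominator.
    """
    n_docs = len(corpus)
    global_freq = Counter()
    for doc in corpus:
        for term, count in doc:
            global_freq[term] += count
    entropy = Counter()
    for doc in corpus:
        for term, count in doc:
            p = count / global_freq[term]
            entropy[term] += p * math.log(p)
    # Terms spread evenly over all documents get a global weight near 0,
    # terms concentrated in one document get a global weight of 1.
    weight = {t: 1 + entropy[t] / math.log(n_docs) for t in global_freq}
    return [[(t, math.log(1 + c) * weight[t]) for t, c in doc] for doc in corpus]


weighted = log_entropy([[(0, 1), (1, 2)], [(0, 1)]])
```

Term 0 occurs equally in both documents and is weighted down to zero, while term 1, which occurs in only one document, keeps its full local weight log(3).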
-
word_freq()¶ DEPRECATED
Extracts the 75 most common words in the data and adds them to the stopword list.
Returns: A modified stop word list
Return type: list
-
class
analyse.QueueCluster(data)¶ Bases:
object
Object to store a cluster while it is in the queue. This supports the bisecting kmeans clustering. In that algorithm, a cluster in the queue can either be processed further (broken down into 2 smaller clusters) or stored.
-
__init__(data)¶ Performs feature extraction and dimensionality reduction on the raw article data that makes up this cluster, for further processing later. The data is converted into LSI space, which is necessary for Kmeans clustering.
Parameters: data (list) – Articles in raw format
-
-
analyse.get_stop_list()¶ DEPRECATED
Stopword list generator
Returns: list of stopwords
Return type: list