sim_lib (similarity query)¶

Class for performing similarity queries.

class sim_lib.SimMatrix(data, refresh=False, clusters=True)¶

Bases: object

__init__(data, refresh=False, clusters=True)¶

Processes the data before a similarity query is performed. This entails converting the data into an LSI space. The class can also save and load indexes and dictionaries, so that when the same set of data is reused (refresh=False) the code runs faster.

Parameters:

data (list) – The cluster data, with each item in the list being a single cluster. The data may instead be articles, with each item in the list being a single article. The program defaults to clusters.
refresh (bool) – Set to True to rebuild all the indexes and save them to disk (this is necessary when loading new data). Otherwise, for speed, the indexes are loaded from disk.
clusters (bool) – Decides whether to pull data from the cluster DB or the article DB.
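
The load-or-rebuild behaviour that refresh controls can be sketched roughly as below. This is only an illustration of the pattern, not the sim_lib implementation: the helper name load_or_build, the use of pickle, and the choice to rebuild when the file is missing are all assumptions.

```python
import os
import pickle


def load_or_build(path, build, refresh=False):
    """Return a cached object from `path`, rebuilding it when asked or missing.

    `build` is a hypothetical zero-argument callable that does the
    expensive work (for example, building an index).
    """
    if not refresh and os.path.exists(path):
        # Fast path: reuse what an earlier run saved to disk.
        with open(path, "rb") as fh:
            return pickle.load(fh)
    # Slow path: rebuild and overwrite the cached copy.
    obj = build()
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj
```

With this shape, a second call with refresh=False never invokes build again, which is where the speed-up comes from.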

addition(clusters)¶

DEPRECATED

Attempts to find similar clusters to a given list of initial clusters in a greedy fashion.

Algorithm

Take an initial list of N clusters
For each of the clusters in the list, generate a list of similar clusters
With N lists of clusters, identify the cluster which is the most similar to all N initial clusters by searching for the cluster with the highest combined similarity score (summed over the N lists)
If that cluster is not in the initial list, add it to the list and repeat; otherwise the algorithm terminates

Returns:	cluster_set – The ids of the clusters in the chain
Return type:	list
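
The greedy procedure above can be sketched in a few lines. This is a minimal reading of the listed steps, not the deprecated method itself: the function name greedy_chain and the similar callable (mapping a cluster id to a dict of {other_id: score}) are hypothetical stand-ins for the per-cluster similarity lists.

```python
from collections import defaultdict


def greedy_chain(initial, similar):
    """Grow `initial` by repeatedly adding the best-scoring cluster.

    `similar` is a hypothetical callable mapping a cluster id to a
    dict {other_cluster_id: similarity_score}.
    """
    chain = list(initial)
    while True:
        # Sum each candidate's similarity over every cluster in the chain.
        totals = defaultdict(float)
        for cid in chain:
            for other, score in similar(cid).items():
                totals[other] += score
        if not totals:
            return chain
        best = max(totals, key=totals.get)
        if best in chain:
            # The top candidate is already a member, so stop.
            return chain
        chain.append(best)
```

Because the chain grows by one new id on every pass and stops as soon as the best candidate is already a member, the loop terminates on any finite set of clusters.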

dijkstra(article_no)¶

DEPRECATED

This method chains articles together.

Runs Dijkstra’s algorithm to find a path in the graph with the lowest average distance between a starting article and other articles in the corpus. When an article is examined, its 25 nearest nodes by similarity are added to the queue and explored (using query). The process ends when there are no new articles to explore. It then prints the best chain of articles.

Parameters:	article_no (int) – The article to start Dijkstra’s algorithm from
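
For orientation, a textbook Dijkstra search over a similarity graph might look like the sketch below. It is not the deprecated method: it minimises total rather than average distance, it returns its results instead of printing a chain, and the neighbours callable, the function name, and the 1 - similarity edge length are all assumptions made for illustration.

```python
import heapq


def shortest_distances(start, neighbours):
    """Dijkstra's algorithm over a similarity graph.

    `neighbours` is a hypothetical callable returning a list of
    (article_id, similarity) pairs with similarity in [0, 1]; the
    edge length is taken to be 1 - similarity.
    """
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was found
        for other, sim in neighbours(node):
            nd = d + (1.0 - sim)
            if nd < dist.get(other, float("inf")):
                dist[other] = nd
                prev[other] = node  # remember how we reached `other`
                heapq.heappush(queue, (nd, other))
    return dist, prev
```

Following prev backwards from any article reconstructs the chain that leads to it from the starting article.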

keyword_query(keyword, n_cluster=5)¶

Selects the top N most similar clusters to the keyword provided and returns their ids in sorted order.

Parameters:

keyword (str) – The desired keywords for the query, space separated
n_cluster (int) – The number of most similar clusters to return
Returns:	res – The ids of the top N most similar clusters in sorted order
Return type:	list
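
The idea of ranking clusters against a space-separated keyword string can be illustrated with plain bag-of-words cosine similarity. This sketch deliberately skips the LSI projection that SimMatrix performs, and the names cosine, top_n and the docs mapping are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_n(keyword, docs, n_cluster=5):
    """Return the ids of the n_cluster documents closest to `keyword`.

    `docs` is a hypothetical dict {cluster_id: text}.
    """
    query = Counter(keyword.lower().split())
    scores = {
        cid: cosine(query, Counter(text.lower().split()))
        for cid, text in docs.items()
    }
    # Highest similarity first, then keep only the requested number.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_cluster]
```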

query(article_no=0)¶

DEPRECATED

This is a helper method for dijkstra. It converts the contents of the given article into the same LSI space as the data (which must be articles) and finds the top 25 most similar articles by cosine similarity. It then returns these articles in res.

Parameters:	article_no (int) – The article which the query is to be performed on
Returns:	res – The ids of the top 25 most similar articles to article_no
Return type:	list