sim_lib (similarity query)
Class for performing similarity queries.
class sim_lib.SimMatrix(data, refresh=False, clusters=True)
Bases: object
__init__(data, refresh=False, clusters=True)
Preprocesses the data before a similarity query is performed, which entails converting the data into an LSI space. Indexes and dictionaries can be saved to and loaded from disk, so that repeated runs on the same data set (refresh=False) are faster.
Parameters:
- data (list) – Contains the cluster data, with each item in the list being a single cluster. The data may instead be articles, with each item in the list being a single article. The program defaults to clusters.
- refresh (bool) – Set to True to rebuild all the indexes and save them to disk (this is necessary when loading new data). Otherwise, for speed, the indexes are loaded from disk.
- clusters (bool) – Decides whether to pull data from the cluster DB or the article DB.
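The refresh behaviour described above amounts to a build-or-load cache. The following is a minimal sketch of that pattern only, not sim_lib's actual implementation: the names `build_index` and `load_or_build_index`, the pickle file, and the toy word-count "index" are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter


def build_index(data):
    """Hypothetical stand-in for the expensive step: one word-count vector per item."""
    return [Counter(text.lower().split()) for text in data]


def load_or_build_index(data, path, refresh=False):
    """Rebuild and save the index when refresh is True or no saved copy exists;
    otherwise load the saved copy from disk."""
    if refresh or not os.path.exists(path):
        index = build_index(data)
        with open(path, "wb") as fh:
            pickle.dump(index, fh)
        return index, "built"
    with open(path, "rb") as fh:
        return pickle.load(fh), "loaded"


# Hypothetical cache location and toy data, for illustration only.
cache_path = os.path.join(tempfile.mkdtemp(), "sim_index.pkl")
data = ["rates rise again", "team wins final"]

_, first = load_or_build_index(data, cache_path, refresh=True)   # rebuilds and saves
_, second = load_or_build_index(data, cache_path)                # reuses the saved copy
```

With this pattern, passing refresh=True is what keeps a stale saved index from being reused after the data changes.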
addition(clusters)
DEPRECATED
Attempts to find clusters similar to a given list of initial clusters in a greedy fashion.
Algorithm:
- Take an initial list of N clusters.
- For each cluster in the list, generate a list of similar clusters.
- Across the N lists, identify the cluster that is most similar to all N initial clusters, i.e. the one with the highest combined similarity score (summed over the N lists).
- If that cluster is not in the initial list, add it to the list and repeat; otherwise the algorithm terminates.
Returns: cluster_set – The ids of the clusters in the chain
Return type: list
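The greedy loop above can be sketched in a few lines. This is an illustrative reading of the listed steps, not the library's code: `greedy_chain`, the `similarity` callback, and the toy scores are hypothetical, and each cluster is assumed to have similarity 1.0 to itself.

```python
def greedy_chain(initial, similarity, candidates):
    """Grow a list of cluster ids greedily.

    initial     -- list of starting cluster ids
    similarity  -- function (a, b) -> float, higher means more similar
    candidates  -- all cluster ids that may be scored
    """
    cluster_set = list(initial)
    while True:
        # Sum each candidate's similarity to every cluster currently in the list.
        scores = {
            c: sum(similarity(c, member) for member in cluster_set)
            for c in candidates
        }
        best = max(scores, key=scores.get)
        if best in cluster_set:
            # The top-scoring cluster is already present, so stop.
            return cluster_set
        cluster_set.append(best)


# Toy pairwise scores between four hypothetical cluster ids.
pair_scores = {
    frozenset((0, 1)): 0.2,
    frozenset((0, 2)): 0.8,
    frozenset((1, 2)): 0.8,
    frozenset((0, 3)): 0.1,
    frozenset((1, 3)): 0.1,
    frozenset((2, 3)): 0.1,
}


def toy_similarity(a, b):
    """Assumed: a cluster is perfectly similar to itself."""
    return 1.0 if a == b else pair_scores[frozenset((a, b))]


chain = greedy_chain([0, 1], toy_similarity, candidates=range(4))
```

In this toy run, cluster 2 is added because it scores highest against both starting clusters; on the next pass it is still the top scorer and already in the list, so the loop stops.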
dijkstra(article_no)
DEPRECATED
This method chains articles together.
Runs Dijkstra’s algorithm to find a path in the graph with the lowest average distance between a starting article and the other articles in the corpus. When an article is examined, its 25 nearest nodes by similarity are added to the queue and explored (using query). The process ends when there are no new articles to explore, and the best chain of articles is then printed.
Parameters: article_no (int) – The article to start Dijkstra’s algorithm from
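For orientation, here is a sketch of textbook Dijkstra over a neighbour function, using Python's heapq. It is not the deprecated method itself: the "lowest average distance" criterion and the printing of the chain are not reproduced, and the names `dijkstra_distances`, `chain_to`, and the toy graph are hypothetical. Treating distance as 1 minus cosine similarity is also an assumption.

```python
import heapq


def dijkstra_distances(start, neighbours):
    """Standard Dijkstra over a graph given as a neighbour function.

    neighbours(node) -> iterable of (other_node, distance) pairs, e.g. the
    nearest articles with distance = 1 - cosine similarity (an assumption).
    Returns (dist, prev): shortest distance to each reachable node and the
    predecessor used to reach it.
    """
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist[node]:
            continue  # stale queue entry, a shorter route was already found
        for other, step in neighbours(node):
            candidate = d + step
            if candidate < dist.get(other, float("inf")):
                dist[other] = candidate
                prev[other] = node
                heapq.heappush(queue, (candidate, other))
    return dist, prev


def chain_to(target, prev):
    """Follow predecessors back from target to recover the chain of articles."""
    chain = [target]
    while chain[-1] in prev:
        chain.append(prev[chain[-1]])
    return chain[::-1]


# Toy graph: hypothetical article id -> [(neighbour id, distance)]
toy_graph = {
    0: [(1, 0.1), (2, 0.5)],
    1: [(2, 0.1), (3, 0.6)],
    2: [(3, 0.2)],
    3: [],
}
dist, prev = dijkstra_distances(0, lambda n: toy_graph[n])
chain = chain_to(3, prev)
```

In the documented method the neighbour function would presumably return the 25 results of query for the article being examined; here it is a plain dictionary lookup.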
keyword_query(keyword, n_cluster=5)
Selects the top N most similar clusters to the keyword provided and returns their ids in sorted order.
Parameters:
- keyword (str) – The desired keywords for the query, space separated
- n_cluster (int) – The number of most similar clusters to return
Returns: res – The ids of the top N most similar clusters in sorted order
Return type: list
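A top-N keyword lookup of this kind can be illustrated with plain word-count vectors and cosine similarity. This sketch skips the LSI projection entirely and is not how sim_lib computes its scores: `cosine`, `top_n_clusters`, and the id-to-text mapping are hypothetical, and "sorted order" is assumed to mean most similar first.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_clusters(keyword, cluster_texts, n_cluster=5):
    """Return the ids of the n_cluster texts most similar to the
    space-separated keywords, most similar first."""
    query = Counter(keyword.lower().split())
    scored = [
        (cosine(query, Counter(text.lower().split())), cluster_id)
        for cluster_id, text in cluster_texts.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cluster_id for _, cluster_id in scored[:n_cluster]]


# Toy data: hypothetical cluster id -> cluster text.
cluster_texts = {
    10: "stock market falls",
    11: "football cup final",
    12: "market rally lifts stock prices",
}
res = top_n_clusters("stock market", cluster_texts, n_cluster=2)
```

Cluster 10 ranks above cluster 12 here because both keywords make up a larger share of its shorter text.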
query(article_no=0)
DEPRECATED
This is a helper method for dijkstra. It converts the contents of the given article into the same LSI space as the data (which must be articles) and finds the top 25 most similar articles by cosine similarity, returning them in res.
Parameters: article_no (int) – The article on which the query is to be performed
Returns: res – The ids of the top 25 most similar articles to article_no
Return type: list