News chain project documentation

Please read the following information before proceeding

Contents

Introduction

Cleaning of data

Creating article clusters

Cluster chaining

Indices and tables

Introduction

Requirements

Language: Python 2.7.6
Additional libraries: gensim, numpy, scikit-learn, snap-py, sqlite3 (sqlite3 ships with the Python standard library)

Structure of the project

/data: Contains all of the data files used by the program: the raw .csv files and the SQLite databases
/docs: Contains all the documentation for the project
/src: Contains the source files, as well as other folders that hold saved indexes and dictionaries

The programs in this project were written for use with a specific data format. It is possible to run this project with different sets of data. However, this would involve rewriting functions that deal with the saving / loading of the raw data.

Module information


Cleaning of data

Crawl raw data from the Internet and save it in the following format (.csv)
Raw article format
- Title (str): Title of the article
- Section (str): Section in which the article was published
- Region (str): Region in which the events described in the article occurred
- Time (datetime): Time of publication of the article
- Word count (int): Number of words in the article
- Summary (str): A short summary of the article
At minimum, the program requires the title, section, time and summary fields. The region and word count fields can be filled with " " and 0 respectively.
The details of how the program builds the database can be found in utils.FileList.sqlite_build_nyt_full()
The database is then stored in ../data/NewYT_all.db. This can be modified by changing the con variable in utils.FileList.sqlite_build_nyt_full()
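
The steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in utils.FileList.sqlite_build_nyt_full(): the table name articles, the column names, the column order and the function build_article_db are assumptions made for the example. It is written for Python 3.

```python
import csv
import io
import sqlite3

# Hypothetical column order, mirroring the raw article format above.
FIELDS = ("title", "section", "region", "time", "word_count", "summary")


def build_article_db(csv_file, con):
    """Copy rows from an open CSV file into a hypothetical `articles` table."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "title TEXT NOT NULL, section TEXT NOT NULL, region TEXT, "
        "time TEXT NOT NULL, word_count INTEGER, summary TEXT NOT NULL)"
    )
    inserted = 0
    for row in csv.reader(csv_file):
        if len(row) != len(FIELDS):
            continue  # skip malformed rows
        title, section, region, time, word_count, summary = row
        # Title, section, time and summary are the required fields.
        if not (title and section and time and summary):
            continue
        # Region and word count fall back to " " and 0 when left empty.
        con.execute(
            "INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)",
            (title, section, region or " ", time, int(word_count or 0), summary),
        )
        inserted += 1
    con.commit()
    return inserted


# Two sample rows: the second lacks required fields and is skipped.
sample = io.StringIO(
    u"Reef dispute widens,World,Asia,2014-05-07 09:30:00,812,Tensions rise.\n"
    u"Untitled row,,,,0,\n"
)
con = sqlite3.connect(":memory:")
count = build_article_db(sample, con)
```

An in-memory database keeps the example self-contained; pointing sqlite3.connect at a file path would persist the table instead.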

Example program: ../src/data_clean.py


Creating article clusters

From this point onwards, we discuss how to run the programs in the context of finding relevant information about events that occurred in the South China Sea.

Load the cleaned data from ../data/NewYT_all.db using utils.load_nyt(). We load all the data in the range from 8 Apr 2013 to 5 July 2015 and set the keyword field to “China”. This restricts the search space significantly while still allowing some breadth in the discussion.
Set the step size to 400. This can be adjusted to increase or decrease the number of clusters generated per step.
Cluster the data for each step using one of the provided clustering methods, either bisecting k-means or overlapping community detection. Assuming we use the former, we then call compute. This function performs the clustering, with parameters set to ensure that no cluster grows too large. The details can be found in the documentation for analyse.
After the clusters have been computed, call find_computed_cluster_metrics to compute metrics for each cluster. The computed metrics include the radius, the diameter and the article closest to the centroid of the cluster.
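
To make the three metrics concrete, here is a small sketch. It is not the body of find_computed_cluster_metrics; the definitions are assumptions: the radius is taken as the largest distance from the centroid to a member, the diameter as the largest distance between any two members, and distances are Euclidean. The project may define or measure these differently.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_metrics(vectors):
    """Return (radius, diameter, index of the vector closest to the centroid)."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / float(n) for i in range(dim)]
    dists = [euclidean(v, centroid) for v in vectors]
    # Assumed definition: radius is the largest member-to-centroid distance.
    radius = max(dists)
    # Assumed definition: diameter is the largest pairwise distance.
    diameter = max(
        euclidean(vectors[i], vectors[j])
        for i in range(n) for j in range(i, n)
    )
    closest = dists.index(min(dists))
    return radius, diameter, closest


# Four article vectors in two dimensions; the centroid is (2, 0).
points = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
radius, diameter, closest = cluster_metrics(points)
```

The pairwise loop is quadratic in the cluster size, which is one reason to keep clusters small before computing the diameter.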

Example program: ../src/top_articles_new.py
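
The stepping and bisecting strategy described in the steps above can be illustrated with the sketch below. It is a simplified stand-in for compute, not its actual implementation: the helper names, the max_size parameter, the deterministic seeding and the reading of the step size as a number of articles per step are all assumptions made for the example.

```python
import math


def split_into_steps(items, step):
    """Cut a list into consecutive chunks of at most `step` items."""
    return [items[i:i + step] for i in range(0, len(items), step)]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    n = float(len(points))
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def two_means(points, iterations=10):
    """Split points into two groups with a plain 2-means loop."""
    # Deterministic seeding: the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if dist(p, c0) <= dist(p, c1)]
        right = [p for p in points if dist(p, c0) > dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, max_size):
    """Keep bisecting the largest cluster until none exceeds max_size."""
    clusters = [list(points)]
    while True:
        clusters.sort(key=len, reverse=True)
        largest = clusters[0]
        if len(largest) <= max_size:
            return clusters
        left, right = two_means(largest)
        if not left or not right:
            return clusters  # no usable split, e.g. identical points
        clusters[0:1] = [left, right]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
steps = split_into_steps(points, 6)
clusters = bisecting_kmeans(steps[0], max_size=3)
```

Capping the cluster size rather than fixing the number of clusters mirrors the stated goal of keeping clusters from growing too large; a smaller step size then yields fewer clusters per step.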


Cluster chaining

Load all the cluster data from ../data/NewYT_clustered_china_bisect_final.db into the variable data using utils.load_nyt_clusters(). As we are using the clusters generated in the previous step, all the clusters will contain articles with content pertaining to China.
Prepare the data for similarity queries by initialising a new SimMatrix object and loading the data into it. It is important that refresh = True for the initial run. The indexes are saved in ../src/tmp. To load cluster data, the folder must be renamed to tmp_clusters.
Perform a similarity query using keyword_query. We query with the keywords “south china sea” and set the number of similar clusters to return to 30 (n_cluster = 30). The results are stored in cluster_id.
From cluster_id, we then extract the data for each of the clusters and load it into a new BigClamChainCluster.
Overlapping community detection is then performed by calling find_community. It is possible to set the number of desired communities by adjusting the opt_com variable.
The chains found via overlapping community detection can then be viewed by calling print_community.
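
The similarity query step above can be pictured with the following sketch. It does not reproduce SimMatrix or keyword_query, whose internals are not described here; instead it ranks clusters by bag-of-words cosine similarity, which is a swapped-in technique chosen for illustration. The function keyword_query_sketch and the sample cluster texts are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_query_sketch(cluster_texts, keywords, n_cluster):
    """Return ids of up to n_cluster clusters most similar to the keywords."""
    query = bag_of_words(keywords)
    scored = [(cosine(query, bag_of_words(text)), cid)
              for cid, text in cluster_texts.items()]
    # Highest similarity first; ties broken by cluster id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Clusters sharing no token with the query are dropped.
    return [cid for score, cid in scored[:n_cluster] if score > 0]


# Hypothetical cluster texts keyed by cluster id.
clusters = {
    0: "china navy patrols south china sea reef",
    1: "beijing stock market rallies",
    2: "philippines files south china sea arbitration case",
}
cluster_id = keyword_query_sketch(clusters, "south china sea", n_cluster=2)
```

The returned ids play the role of cluster_id in the steps above: they select which clusters are passed on to community detection.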

Example program: ../src/chain.py


Indices and tables
