SPIMI Implementation v3.0: The Release Version

By Boulat Oulmachev, Ivan Khrisanov, Sergiy Samus

Overview of the Changes
• Collection of docs on the ENCS domain was built: crawled the domain, collected and parsed the docs, removed duplicates and created an inverted index.
• Clustering: built the document vector space, implemented the K-means algorithm and determined the optimal K for our collection.
• Clustering Search: introduced a new search method that matches the query to a cluster. Also defined a mechanism to determine which search method (BM25 or clustering) is more useful for a given query.
• GUI: a simple but pleasant web interface was created to make the search engine easy to use and to improve user satisfaction by displaying results in a clear, concise, organized way.

Building the Collection
• Crawled the ENCS domain and saved all the docs to disk.
• Parsed all the HTML docs, skipping all other document types.
• Built a vector space representation of the whole collection, with each doc represented by a vector in N dimensions, where N is the size of the vocabulary.
• For efficiency, document representations do not explicitly store the dimensions whose value is zero (sparse vectors).
• Eliminated duplicates by computing the cosine similarity of every possible pair of documents and deleting the docs that were found to be very similar.
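The sparse representation and the duplicate check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the use of raw term counts as weights, and the 0.95 similarity threshold are all assumptions. For brevity it compares each doc only against the docs already kept rather than literally against every pair.

```python
import math
from collections import Counter


def to_sparse_vector(tokens):
    """Map a token list to a sparse vector: only non-zero dimensions are stored."""
    return dict(Counter(tokens))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector for the dot product
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(docs, threshold=0.95):
    """Keep the first doc of every group of very similar docs.

    docs maps doc id -> sparse vector. The threshold is an assumed cutoff,
    not a value taken from the project. Cost is quadratic in the number of docs.
    """
    kept = {}
    for doc_id, vec in docs.items():
        if all(cosine_similarity(vec, other) < threshold for other in kept.values()):
            kept[doc_id] = vec
    return kept
```

Storing vectors as dicts keeps memory proportional to the number of distinct terms in a doc rather than to the vocabulary size N.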

Clustering
• Implemented K-means clustering from scratch
  - First, the doc vectors are normalized.
  - Then the algorithm picks K random doc vectors as seeds, finds the closest seed for each doc, groups the docs accordingly and computes a centroid for each group.
  - It iterates, reassigning the docs and recomputing the centroids, until the RSS (residual sum of squares) stops changing significantly.
  - K was found by running experiments with different values of K, recording the resulting RSS, plotting RSS against K and choosing the K where the slope of the curve approaches zero from the negative side, i.e. where adding more clusters no longer reduces the RSS noticeably.
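A minimal sketch of the K-means loop described above, over sparse {term: weight} vectors. The names, the tolerance value, the iteration cap and the handling of empty clusters (the old centroid is kept) are assumptions made for illustration, not details of the original implementation.

```python
import math
import random


def normalize(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(docs, k, tolerance=1e-4, max_iterations=100, seed=0):
    """Return (assignments, centroids, rss) for a list of sparse vectors."""
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    centroids = rng.sample(docs, k)  # K random docs act as the seeds
    previous_rss = float("inf")
    for _ in range(max_iterations):
        # Assign every doc to its closest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(d, centroids[c]))
            for d in docs
        ]
        # Recompute each centroid; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [d for d, a in zip(docs, assignments) if a == c]
            if members:
                centroids[c] = centroid(members)
        rss = sum(squared_distance(d, centroids[a]) for d, a in zip(docs, assignments))
        if previous_rss - rss < tolerance:  # RSS no longer changes much
            break
        previous_rss = rss
    return assignments, centroids, rss


def rss_for_each_k(docs, k_values):
    """Collect the final RSS per K, ready to be plotted for choosing K."""
    return {k: kmeans(docs, k)[2] for k in k_values}
```

Plotting the values returned by `rss_for_each_k` against K gives the curve from which K is read off; the choice itself is still made by inspecting where the curve flattens.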

Clustering Search and GUI
• Clustering Search
  - The query is represented as a normalized vector in N-dimensional space.
  - It is compared to the normalized centroids of all the clusters using Euclidean distance.
  - The documents in the nearest cluster are retrieved.
  - Good for general queries, not very good for specific ones.
  - Defined a way to choose the best search mode (clustering vs. BM25) by looking at how specific the query is.
• GUI
  - A separate module was created to capture the user's input and to display the results in a clear, concise way.
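The nearest-cluster lookup described above might look roughly like this. It is a sketch under assumptions: the function names and the data layout (one centroid and one list of doc ids per cluster) are hypothetical, and the rule for choosing between clustering and BM25 is left out because the slides do not spell it out.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b)))


def cluster_search(query_terms, centroids, cluster_members):
    """Return the doc ids of the cluster whose centroid is nearest to the query.

    centroids[i] is the sparse centroid of cluster i and cluster_members[i]
    is the list of doc ids assigned to it (both layouts are assumptions).
    """
    query = normalize(dict(Counter(query_terms)))
    nearest = min(
        range(len(centroids)),
        key=lambda c: euclidean_distance(query, normalize(centroids[c])),
    )
    return cluster_members[nearest]
```

Because the whole cluster is returned, a broad query gets many topically related docs, while a narrow query may be better served by a ranked method such as BM25.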
