A web-based tool that compares two documents and reports their similarity, using Porter stemming and a vector space model. Numerical operations run on Numpy for near-instant results. Built on the Django framework with Python libraries. Future plans include file uploading and HTML5 features. Demo: http://imds.alwaysdata.net.
Copy Or Not • Dawei (David) Shi
Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo
Introduction • A web-based document comparator • Calculates the similarity between two documents
Algorithm • Preprocessing • Vector space • Similarity calculation
Preprocessing • Stemming • Porter Stemming Algorithm • E.g. • cat – cats • meet – meeting • agree – agreed • correct – correctness
Vector Space • Build dictionary 1 • word -> frequency • Sort the keys of dictionary 1 • Build dictionary 2 • key -> (index, count) • Build binary vectors • index -> occurrence
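The vector space steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: it assumes dictionary 1 counts words across both documents so that the two vectors share one index space, and the names `build_binary_vectors` and `to_vector` are hypothetical.

```python
from collections import Counter

def build_binary_vectors(words1, words2):
    # Dictionary 1: word -> frequency (assumed here to span both documents).
    freq = Counter(words1) + Counter(words2)
    # Sort the keys so every word gets a stable position.
    keys = sorted(freq)
    # Dictionary 2: word -> (index, count).
    index = {word: (i, freq[word]) for i, word in enumerate(keys)}

    # Binary vector: position i holds 1 if the word at index i occurs.
    def to_vector(words):
        vec = [0] * len(index)
        for word in words:
            vec[index[word][0]] = 1
        return vec

    return index, to_vector(words1), to_vector(words2)

# Words would normally be stemmed first; plain tokens are used here.
doc1 = "the cat sat".split()
doc2 = "the cat ran".split()
index, v1, v2 = build_binary_vectors(doc1, doc2)
print(v1, v2)
```

Sorting the keys before assigning indices keeps the position of each word the same in both vectors, which is what makes them comparable.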
Similarity Calculation • Vectors v1 and v2 • Similarity = dot(v1, v2) / (norm(v1) * norm(v2)) • This is the cosine of the angle between v1 and v2
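As a worked illustration of the similarity formula, here is a small pure-Python sketch. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example, not details from the slides.

```python
import math

def cosine_similarity(v1, v2):
    # Dot product of the two vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    # Euclidean norms of each vector.
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Assumption: an empty document (all-zero vector) has similarity 0.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Two binary vectors sharing 2 of their 3 set positions.
score = cosine_similarity([1, 0, 1, 1], [1, 1, 0, 1])
print(round(score, 4))  # 0.6667
```

For binary vectors the dot product counts the shared words, so the score is 1.0 for identical word sets and 0.0 when no word is shared.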
Performance • Algorithms coded in Python • Dynamic typing • Slow at element-by-element numerical operations • Solution: numpy
Numpy • A Python extension module • Written mostly in C • Defines numerical array and matrix types and basic operations on them
Numpy vs Python • Python code • a = range(10000000) • b = range(10000000) • c = [] • for i in range(len(a)): • c.append(a[i] + b[i]) • Takes up to 10 seconds on a multi-GHz processor
Numpy vs Python • Numpy code • import numpy as np • a = np.arange(10000000) • b = np.arange(10000000) • c = a + b • Almost instant
Numpy Usage • Vector dot product • Vector normalization • Vector zero filling
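The three uses listed above might look like this in Numpy. It is a sketch under assumptions: the slides do not show how the tool calls Numpy, "vector normalization" is read here as scaling a vector to unit length, and the index lists are made-up sample data.

```python
import numpy as np

# Zero filling: start from all-zero vectors, then mark the words that occur.
v1 = np.zeros(4)
v1[[0, 2, 3]] = 1
v2 = np.zeros(4)
v2[[0, 1, 3]] = 1

# Vector dot product: counts the positions set in both vectors.
dot = np.dot(v1, v2)

# Vector normalization: divide each vector by its Euclidean norm.
unit1 = v1 / np.linalg.norm(v1)
unit2 = v2 / np.linalg.norm(v2)

# The dot product of unit vectors is the cosine similarity.
similarity = float(np.dot(unit1, unit2))
print(dot, round(similarity, 4))
```

Each of these calls runs in compiled code over the whole array, so no Python-level loop over the vector elements is needed.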
Framework • Django • The web framework for perfectionists with deadlines
Libraries • Python • Numpy • Porter Stemming • jQuery
Hosting • Alwaysdata • Django 1.3 • Python 2.6
Future Work • Support file uploading and comparison • Add HTML5 features
Demo • http://imds.alwaysdata.net