
Deep Web Mining and Learning for Advanced Local Search

Deep Web Mining and Learning for Advanced Local Search. CS8803 Advisor Prof Liu Yu Liu, Dan Hou Zhigang Hua, Xin Sun Yanbing Yu. Competitors. Yahoo! Local Yelp CitySearch Google Local Yellow Page How to beat them?. Research Background. Deep Web Crawling Sentimental Learning


Presentation Transcript


  1. Deep Web Mining and Learning for Advanced Local Search • CS8803 • Advisor: Prof. Liu • Yu Liu, Dan Hou, Zhigang Hua, Xin Sun, Yanbing Yu

  2. Competitors • Yahoo! Local • Yelp • CitySearch • Google Local • Yellow Pages • How to beat them?

  3. Research Background • Deep Web Crawling • Sentimental Learning • Sentimental Ranking Model • Geo-credit Ranking Model • Social Network for Businesses

  4. Show Time! Local Biz Space

  5. Architecture • Query-based Crawler • HTML Parser • Sentimental Learner • Super Local-Search • Apache Server • JDBC • Database

  6. Tools • Open source social network platform: Elgg, OpenSocial • LAMP server: Linux + Apache + MySQL + PHP • Google Maps API, e.g. Geocode

  7. Crawling Dynamic Pages

  8. Crawling Dynamic Pages

  9. Parsing Dynamic Pages

  10. Database Design

  11. Sentimental Learning

  12. Sentimental Learning

  13. Sentimental Learning Can we use ONE score to show how good/ bad the store is?

  14. Sentimental Learning • Objective • To identify positive and negative opinions of a store • Dataset • Reviews represented as bags of terms • Normalized TF-IDF features • Two ways of sentiment representation • Simply average the scores • but “what you think is good might be bad for me” • Manual labeling • 1 to 5 (“least satisfied” to “most satisfied”) • consensus based • time-accuracy tradeoff
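
As an illustration of the bag-of-terms representation with normalized TF-IDF features, here is a minimal sketch using only the Python standard library. The smoothed IDF variant, the L2 normalization, the function name `tfidf_vectors`, and the two toy reviews are all assumptions; the slides do not say which weighting or normalization was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of reviews that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed variant; not specified in the slides).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        # L2 normalization so that long and short reviews are comparable.
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vocab, vectors

# Toy reviews, invented for illustration only.
reviews = [
    "great service and great prices".split(),
    "bad service and awful staff".split(),
]
vocab, vectors = tfidf_vectors(reviews)
```

Terms that occur in every review (here "service" and "and") receive the lowest IDF, so a repeated, review-specific term such as "great" dominates its vector.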

  15. Dimension Reduction • High dimensionality • 6857 tokens • Memory limitation • Possibly under-fitting • Dimension Reduction • PCA (Principal Component Analysis) • an orthogonal linear transformation • transforms the data to a new coordinate system • retains the characteristics of the data set that contribute most to its variance • Get the most important features without losing generality

  16. Principal Component Analysis • Original Dimension: 6857 • Variance Retained: 95% • Different Granularity • Manual Labeling: • Score Averaging:
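
The 95% threshold can be read as "keep the smallest number of leading principal components whose eigenvalues account for at least 95% of the total variance". The sketch below shows only that selection step, assuming the eigenvalues of the covariance matrix are already available. The function name `components_for_variance` and the toy eigenvalues are hypothetical, not values from the project.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share >= target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # Stop as soon as the retained share of variance reaches the target.
        if cumulative / total >= target:
            return k
    return len(ordered)

# Toy eigenvalues, invented for illustration only.
toy_eigenvalues = [5.0, 3.0, 1.6, 0.2, 0.2]
k = components_for_variance(toy_eigenvalues)
```

With these toy values the first two components cover 80% of the variance and the first three cover 96%, so three components are kept.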

  17. Sentimental Learning • Features used for sentimental learning: • Vector Space Model (reviews/comments) • Some keywords are related to sentiments: • Positive: good, happy, wonderful, excellent, awesome, great, ok, nice, etc. • Negative: bad, sad, ugly, outdated, shabby, stupid, wrong, awful, etc. • Most words are unrelated to sentiments: • e.g. buy, take, go, iPod, apple, comment, etc. • These cause noise for sentimental learning

  18. What do we do? • How to learn sentiments from a large set of features with lots of noise? • Vector Space Model: M×N (entity-term matrix, e.g. 6,000 × 20,000) • Dimensionality reduction (PCA) • Supervised learning for sentimental learning • Human labeling vs. average rating • An online entity usually has many reviews, each containing a rating • The average rating is an alternative label for the entity • Manual labeling: • 1 (least satisfactory) – 5 (most satisfactory) • Three people label each entity; the majority vote is adopted
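
The two labeling schemes can be sketched as follows. The slides do not say what happens when all three labelers disagree, so the median fallback is an assumption, and the function names `majority_label` and `average_label` are hypothetical.

```python
from collections import Counter
from statistics import mean, median

def majority_label(votes):
    """Most-voted label among three votes; median fallback is an assumed tie rule."""
    label, count = Counter(votes).most_common(1)[0]
    if count > 1:
        return label
    # All three labelers disagree: fall back to the middle vote (assumption).
    return median(votes)

def average_label(ratings):
    """Alternative label: the mean of the ratings attached to an entity's reviews."""
    return mean(ratings)

manual = majority_label([4, 4, 2])     # two of three labelers agree on 4
averaged = average_label([5, 5, 1])    # one outlier pulls the mean down
```

The majority vote ignores a single dissenting labeler, while the averaged label moves with every outlier rating.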

  19. Manual labeling vs. Average rating • Machine learning • Around 300 entities from local search, 6800 features after stop-word removal and stemming • Using different SVM kernels • Avoiding overfitting • Leave-one-out estimation • Nonlinearity of features • Polynomial kernel achieves the best performance • Manual labeling • Training more precise • Labeling more consistent • Rate averaging • Training less precise • Rating more random • E.g. average(5, 5, 1) ≈ 3.67
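
Leave-one-out estimation holds out each entity once, trains on the remaining ones, and reports the fraction of held-out entities predicted correctly. The sketch below shows the procedure only. To stay within the standard library it uses a 1-nearest-neighbor classifier as a stand-in for the SVM kernels named on the slide, and the data points are invented.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in classifier (1-NN, squared Euclidean distance), not the SVM from the slides."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

def leave_one_out_accuracy(X, y, predict=nearest_neighbor_predict):
    """Hold out each sample once, fit on the rest, and return the hit rate."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            hits += 1
    return hits / len(X)

# Toy 2-D features with labels 1 (least satisfied) and 5 (most satisfied).
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
y = [1, 1, 1, 5, 5, 5]
accuracy = leave_one_out_accuracy(X, y)
```

Any classifier with the same `predict(train_X, train_y, x)` shape could be passed in instead, which is how different kernels would be compared under the same estimate.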

  20. What did we learn? • Dimensionality reduction is necessary • The term Vector Space Model (VSM) is inherently huge • Human labeling is necessary • Sentimental learning involves subjective rather than objective judgment • Human ratings are very noisy because they are not consistent across different people • More labeled data is needed • Other methods to try: • Unsupervised learning (clustering) • Gaussian Mixture Model (an alternative way to learn sentiments, although it is difficult to know the number of hidden sentiments)

  21. How to use learned sentiments? • Sentimental learning can be used to improve the ranking of local search • Because the sentimental value is an important metric for ranking an entity • Local search is influenced by sentiment • Sentimental ranking model (SRM): • SentiRank = a*ContentSim + (1-a)*SentiValue • The parameter a is set empirically to 0.5 • Similar to PageRank • PageRank = b*ContentSim + (1-b)*PageImportance
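
The SentiRank formula is a weighted sum, so a small sketch makes the effect of the parameter a concrete. It assumes both inputs are scaled to [0, 1]; the slides do not state the scale of SentiValue, so the rescaling of a 1 to 5 score in `normalize_rating` is an assumption, and the store names and numbers are invented.

```python
def normalize_rating(rating, low=1.0, high=5.0):
    """Map a 1..5 sentiment score onto [0, 1] (assumed scaling)."""
    return (rating - low) / (high - low)

def senti_rank(content_sim, senti_value, a=0.5):
    """SentiRank = a*ContentSim + (1-a)*SentiValue, with a = 0.5 as on the slide."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * content_sim + (1.0 - a) * senti_value

# (name, content similarity in [0, 1], learned sentiment on the 1..5 scale)
stores = [("Store A", 0.80, 2.0), ("Store B", 0.70, 5.0), ("Store C", 0.90, 3.0)]
ranked = sorted(
    stores,
    key=lambda s: senti_rank(s[1], normalize_rating(s[2])),
    reverse=True,
)
```

With a = 0.5, Store B ranks first despite having the lowest content similarity, because its sentiment score is the highest.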

  22. Geocoding • Geocoding of addresses • For example, the store AA National Auto Parts is located at 3410 Washington St, Phoenix, AZ 85009 • Using Geocode, we can get its latitude and longitude: (33.447708, -112.13246) • Great-circle distance in miles (3959 is the Earth's radius in miles) from the store to a point (lat, lng), in spherical-law-of-cosines form: 3959 * acos( cos( radians(33.448) ) * cos( radians( lat ) ) * cos( radians( lng ) - radians(-112.132) ) + sin( radians(33.448) ) * sin( radians( lat ) ) )
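
The acos-based expression on the slide is the spherical law of cosines form of the great-circle distance; the haversine form gives the same result and is numerically steadier for very short distances. The sketch below implements both so they can be compared. The store coordinates come from the slide, while the query point and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # same constant as in the slide's expression

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, h)))

def law_of_cosines_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance via the spherical law of cosines."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lng2 - lng1)
    cos_angle = (math.cos(phi1) * math.cos(phi2) * math.cos(dlmb)
                 + math.sin(phi1) * math.sin(phi2))
    # Clamp to the valid acos domain to absorb rounding error.
    return EARTH_RADIUS_MILES * math.acos(max(-1.0, min(1.0, cos_angle)))

store = (33.447708, -112.13246)   # geocoded store location from the slide
query = (33.4484, -112.0740)      # hypothetical query point, for illustration

d_haversine = haversine_miles(*store, *query)
d_cosines = law_of_cosines_miles(*store, *query)
```

For the two points above both functions return a distance of a few miles and agree to well within a thousandth of a mile.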

  23. Geo-Sentimental Ranking Model (GSRM) • Three Measurements • Content Similarity -- term-frequency • Sentimental Value -- sentimental learning • Geo-distance -- Google Map API • GSRM Ranking model

  24. Example

  25. Thank You! • Q&A time
