
MOVIE WEB PORTAL

MOVIE WEB PORTAL . GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN. CONTENT. Web Crawling Data Preprocessing Schema Alignment Entity Resolution Data Fusion Web Portal Demo. OVERVIEW. Database MySQL. Data Sources IMDB TMDB Rotten Tomatoes Programming Languages Python R PHP.

wallaceh

Presentation Transcript


  1. MOVIE WEB PORTAL GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN

  2. CONTENT • Web Crawling • Data Preprocessing • Schema Alignment • Entity Resolution • Data Fusion • Web Portal Demo

  3. OVERVIEW • Database • MySQL • Data Sources • IMDB • TMDB • Rotten Tomatoes • Programming Languages • Python • R • PHP

  4. WEB CRAWLING

  5. METHODOLOGY • 1000 most popular movies between 2006-2016 • HTTP request sender: Requests • HTML/XML parser: BeautifulSoup

  6. WEB CRAWLER EXAMPLE • Data navigation via traversing the DOM tree top-to-bottom • Nodes are recognized by the type of tag and the name of the class • Data extraction • (Figures: crawler in Python; page source)
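The idea of recognizing nodes by tag type and class name can be sketched as below. This is only an illustration: it uses Python's standard-library `html.parser` in place of Requests and BeautifulSoup so that it runs without third-party packages, and the HTML snippet, tag names, and class names (`title`, `actor`) are hypothetical, not taken from any of the three sources.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element with a given tag and class name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self._buffer = None   # list of text chunks while inside a match
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumes matching elements do not nest another element of the same tag.
        if tag == self.tag and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


def extract(html, tag, class_name):
    parser = ClassTextExtractor(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page source, for illustration only.
PAGE = """
<div class="movie">
  <h2 class="title">Movie1</h2>
  <span class="director">Director A</span>
  <ul><li class="actor">Actor1</li><li class="actor">Actor2</li></ul>
</div>
"""

titles = extract(PAGE, "h2", "title")
actors = extract(PAGE, "li", "actor")
```

With BeautifulSoup, the equivalent lookup would be along the lines of `soup.find_all("li", class_="actor")`.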

  7. CRAWLING ILLUSTRATION • Tree: List of Movies contains Movie1, Movie2, Movie3, … • Each movie contains Director, List of Actors (Actor1, Actor2, …), List of Genres (Genre1, Genre2, …) • Order of Traversing: List of Movies -> Movie1 -> Director -> List of Actors -> Actor1 -> Actor2 -> List of Genres -> Genre1 -> Genre2
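The traversal order on this slide is a depth-first, pre-order walk of the tree. A minimal sketch, assuming the tree is held as nested `(name, children)` tuples (a representation chosen here for illustration; only the Movie1 branch is included):

```python
# Each node is (name, list_of_children); the structure mirrors the slide.
TREE = ("List of Movies", [
    ("Movie1", [
        ("Director", []),
        ("List of Actors", [("Actor1", []), ("Actor2", [])]),
        ("List of Genres", [("Genre1", []), ("Genre2", [])]),
    ]),
])


def preorder(node):
    """Yield node names top-to-bottom: a parent first, then each child subtree."""
    name, children = node
    yield name
    for child in children:
        yield from preorder(child)


order = list(preorder(TREE))
print(" -> ".join(order))
```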

  8. DATA PREPROCESSING

  9. EXAMPLE • Date/Time format conflicts • August 7, 1975 vs 1975-8-7 vs 1975-08-07 • 2 hrs. 18 mins vs 138 mins • Gender naming convention • “F” vs “Female” • Regional discrepancies • Release date/country • Currencies
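The format conflicts listed above could be normalized along these lines. This is a sketch under assumptions: the target formats (ISO dates, runtime in minutes, full gender words), the function names, and the "M"/"Male" mapping are not stated in the slides, and month names are assumed to be in English.

```python
import re
from datetime import datetime

# "August 7, 1975" and "1975-8-7" / "1975-08-07"
DATE_FORMATS = ("%B %d, %Y", "%Y-%m-%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, trying each known input format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def normalize_runtime(text):
    """Return the runtime in minutes, e.g. '2 hrs. 18 mins' -> 138."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    if not hours and not minutes:
        raise ValueError(f"unrecognized runtime: {text!r}")
    total = 60 * int(hours.group(1)) if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


# The slide only mentions "F" vs "Female"; the male entries are an assumption.
GENDER = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}


def normalize_gender(text):
    """Return 'Female'/'Male', or None for unknown values."""
    return GENDER.get(text.strip().lower())
```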

  10. SCHEMA ALIGNMENT

  11. METHODOLOGY • Union the attributes among all 3 sources • Example: • S1: {A1,A2,A3,A4} • S2: {A1,A2,A3,A5,A7} • S3: {A1,A3,A6} • Unified S: {A1,A2,A3,A4,A5,A6,A7}
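The union step can be illustrated with the attribute sets from the example. The `align` helper is a hypothetical addition showing how a record from one source might be padded with empty values for attributes it lacks.

```python
S1 = {"A1", "A2", "A3", "A4"}
S2 = {"A1", "A2", "A3", "A5", "A7"}
S3 = {"A1", "A3", "A6"}


def unify(*schemas):
    """Union the attribute names of all source schemas."""
    return sorted(set().union(*schemas))


def align(record, unified):
    """Re-express a source record in the unified schema; missing values become None."""
    return {attr: record.get(attr) for attr in unified}


unified = unify(S1, S2, S3)
s3_record = align({"A1": "x", "A3": "y", "A6": "z"}, unified)
```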

  12. UNIFIED SCHEMA • Movie: mid, title, year, overview, runtime, film_location, budget, global_revenue, us_revenue, us_release_date, other_release_date, other_release_country, dvd_date, user_rating, votes_num • Actor: aid, name, gender, date_of_birth, place_of_birth • Director: did, name, gender, date_of_birth, place_of_birth • Genre: gid, genre_type

  13. ENTITY RESOLUTION

  14. METHODOLOGY • Clustering into Groups by keys • Movie, Genre: by first character of the title name/type name • Actor, Director: by concatenating the first character of actor’s/director’s first name and last name • Pairwise Matching • Distance-based approach • used on Actors, Directors, Genres • edit distance: Levenshtein, Jaro-Winkler • Rule-based approach • used on Movies
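Blocking by key followed by pairwise matching inside each block might look like the sketch below for actor names. The Levenshtein distance is implemented directly; Jaro-Winkler is omitted for brevity. The distance threshold of 2 and the sample names are assumptions for illustration, and the key function assumes non-empty names.

```python
from collections import defaultdict
from itertools import combinations


def person_key(name):
    """Blocking key: first character of the first name + first character of the last name."""
    parts = name.split()
    return (parts[0][0] + parts[-1][0]).upper()


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_within_blocks(names, max_distance=2):
    """Compare names only inside the same block; return pairs judged to match."""
    blocks = defaultdict(list)
    for name in names:
        blocks[person_key(name)].append(name)
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                matches.append((a, b))
    return matches


# Illustrative input only.
names = ["Tom Hanks", "Tom Hankes", "Tim Hardy", "Emma Stone"]
pairs = match_within_blocks(names)
```

Here three names share the block `TH`, so only three comparisons are made instead of the six needed for all pairs.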

  15. PERFORMANCE COMPARISON • The efficiency of group blocking is compared between 2 different solutions • Experiment performed on Actor entities:

  16. BLOCK SIZE DISTRIBUTION • (Figures: block size distributions for Solution 1 and Solution 2)

  17. RULE-BASED MATCHING • Rule used for deciding whether or not two movie entities match • Step 1: IF | year1 - year2 | > 2 years, declare a non-match ELSE go to step 2 • Step 2: IF | runtime1 - runtime2 | > 15 mins, declare a non-match ELSE go to step 3 • Step 3: IF edit-distance between title_name1 and title_name2 > threshold, declare a non-match ELSE consider the entities a match
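The three-step rule can be sketched as a single function. It assumes that a title edit distance above the threshold indicates a non-match (a larger distance means less similar titles), uses Levenshtein as the edit distance, and picks a threshold of 3 purely for illustration; the record layout and sample values are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def is_same_movie(m1, m2, title_threshold=3):
    """Apply the rule steps in order; any failed step declares a non-match."""
    # Step 1: release years more than 2 apart.
    if abs(m1["year"] - m2["year"]) > 2:
        return False
    # Step 2: runtimes more than 15 minutes apart.
    if abs(m1["runtime"] - m2["runtime"]) > 15:
        return False
    # Step 3: titles too far apart in edit distance (threshold is an assumption).
    return levenshtein(m1["title"].lower(), m2["title"].lower()) <= title_threshold


# Hypothetical records for illustration.
a = {"title": "The Example Movie", "year": 2010, "runtime": 120}
b = {"title": "The Example Movi", "year": 2011, "runtime": 118}
c = {"title": "The Example Movie", "year": 2015, "runtime": 120}
d = {"title": "The Example Movie", "year": 2010, "runtime": 140}
```

Ordering the cheap numeric checks first means the more expensive string comparison only runs on pairs that already agree on year and runtime.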

  18. EXAMPLE After Record Linkage…

  19. DATA FUSION

  20. METHODOLOGY • Fusion by voting • Assumption made on trustworthiness of the 3 data sources • IMDB > TMDB > Rotten Tomatoes • Extract the most informative value • Example 1: • For actor’s DOB => S1: 1985, S2: 1985-05-05, S3: 1983 • S2: 1985-05-05 will be chosen, as S1 & S2 share the same year value, and S2 additionally provides the month and day • Example 2:
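One possible reading of this fusion strategy for dates of birth is sketched below: sources vote on the year, ties are broken by the trust order, and within the winning group the most detailed value is kept. The mapping of S1, S2, S3 to IMDB, TMDB, Rotten Tomatoes and the tie-breaking details are assumptions made for the example, not something the slides specify.

```python
from collections import defaultdict

TRUST = ["IMDB", "TMDB", "Rotten Tomatoes"]  # most trusted first


def fuse_date(values):
    """values maps source name -> 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' (or None)."""
    votes = defaultdict(list)
    for source, value in values.items():
        if value:
            votes[value[:4]].append(source)  # each source votes for its year
    if not votes:
        return None

    def group_rank(year):
        sources = votes[year]
        # More votes wins; ties go to the group holding the most trusted source.
        return (-len(sources), min(TRUST.index(s) for s in sources))

    best_year = min(votes, key=group_rank)

    def value_rank(source):
        # More date components (year, month, day) wins; ties go to the more trusted source.
        return (-len(values[source].split("-")), TRUST.index(source))

    best_source = min(votes[best_year], key=value_rank)
    return values[best_source]


# Example 1 from the slide, with an assumed source assignment.
dob = fuse_date({"IMDB": "1985", "TMDB": "1985-05-05", "Rotten Tomatoes": "1983"})
```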

  21. METHODOLOGY

  22. WEB PORTAL

  23. PORTAL APPLICATION • Search for movies to see more details. • Filter and rank movies by criteria such as rating or box office. • Find the movies related to a celebrity.

  24. PORTAL DEMO

  25. Q & A
