
MOVIE WEB PORTAL GROUP 12 CHI HUNG HUNG TONG KA WAI TERENCE YUEN
CONTENT • Web Crawling • Data Preprocessing • Schema Alignment • Entity Resolution • Data Fusion • Web Portal Demo
OVERVIEW • Database • MySQL • Data Sources • IMDB • TMDB • Rotten Tomatoes • Programming Languages • Python • R • PHP
METHODOLOGY • 1000 most popular movies between 2006-2016 • HTTP request sender: Requests • HTML/XML parser: BeautifulSoup
WEB CRAWLER EXAMPLE • Data navigation by traversing the DOM tree top-to-bottom • Nodes are recognized by the type of tag and the name of the class • Data extraction • [Figures: crawler in Python; page source]
CRAWLING ILLUSTRATION List of Movies Movie1 Movie2 Movie3 Director List of Actors List of Genres … … Actor1 Genre1 Genre2 Actor2 Order of Traversing: List of Movies -> Movie1 -> Director -> List of Actors -> Actor1 -> Actor2 -> List of Genres -> Genre1 -> Genre2
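
The traversal order above can be sketched with BeautifulSoup on a small inline page. This is only an illustration: the HTML snippet, the tag names, the class names, and the function `parse_movies` are assumptions made for the example, not the markup of IMDB, TMDB, or Rotten Tomatoes or the group's actual crawler. In a real crawl the HTML would come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; tag and class names are invented for illustration.
PAGE = """
<ul class="movie-list">
  <li class="movie">
    <span class="title">Movie1</span>
    <span class="director">Director1</span>
    <ul class="actors"><li class="actor">Actor1</li><li class="actor">Actor2</li></ul>
    <ul class="genres"><li class="genre">Genre1</li><li class="genre">Genre2</li></ul>
  </li>
</ul>
"""

def parse_movies(html):
    """Walk the tree top-down: movie list -> movie -> director -> actors -> genres."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # Nodes are selected by tag type plus class name.
    for node in soup.find_all("li", class_="movie"):
        movies.append({
            "title": node.find("span", class_="title").get_text(strip=True),
            "director": node.find("span", class_="director").get_text(strip=True),
            "actors": [a.get_text(strip=True)
                       for a in node.find_all("li", class_="actor")],
            "genres": [g.get_text(strip=True)
                       for g in node.find_all("li", class_="genre")],
        })
    return movies

movies = parse_movies(PAGE)
```

Each movie ends up as one dictionary, so the later preprocessing steps can work on plain Python values rather than on HTML nodes.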
EXAMPLE • Date/Time format conflicts • August 7, 1975 vs 1975-8-7 vs 1975-08-07 • 2 hrs. 18 mins vs 138 mins • Gender naming convention • “F” vs “Female” • Regional discrepancies • Release date/country • Currencies
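
A minimal sketch of how the conflicts above could be normalized, using only the Python standard library. The target formats (ISO dates, runtime in minutes, spelled-out gender) and the function names are assumptions for illustration; the slides do not state which canonical forms were chosen.

```python
import re
from datetime import datetime

# Input date layouts seen in the examples; more can be appended as needed.
DATE_FORMATS = ("%B %d, %Y", "%Y-%m-%d")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known layout fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_runtime(text):
    """Return the runtime in minutes from strings like '2 hrs. 18 mins'."""
    hours = re.search(r"(\d+)\s*hrs?", text)
    mins = re.search(r"(\d+)\s*mins?", text)
    if not hours and not mins:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(mins.group(1)) if mins else 0
    return total

GENDER = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}

def normalize_gender(text):
    """Map short and long gender labels onto one convention."""
    return GENDER.get(text.strip().lower())
```

With these helpers, `August 7, 1975`, `1975-8-7`, and `1975-08-07` all map to the same string, and both runtime spellings map to 138.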
METHODOLOGY • Union the attributes among all 3 sources • Example: • S1: {A1,A2,A3,A4} • S2: {A1,A2,A3,A5,A7} • S3: {A1,A3,A6} • Unified S: {A1,A2,A3,A4,A5,A6,A7}
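
The union in the example can be written directly with Python sets. The helper `align`, which pads a source record with `None` for attributes it lacks, is an assumed extra step added for illustration.

```python
# Attribute sets from the example above.
S1 = {"A1", "A2", "A3", "A4"}
S2 = {"A1", "A2", "A3", "A5", "A7"}
S3 = {"A1", "A3", "A6"}

def unify(*schemas):
    """Union all source schemas into one sorted attribute list."""
    return sorted(set().union(*schemas))

def align(record, unified):
    """Project a source record onto the unified schema; missing attributes become None."""
    return {attr: record.get(attr) for attr in unified}

unified = unify(S1, S2, S3)
row = align({"A1": "x", "A3": "y", "A6": "z"}, unified)  # a record from S3
```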
UNIFIED SCHEMA • Movie • mid, title, year, overview, runtime, film_location, budget, global_revenue, us_revenue, us_release_date, other_release_date, other_release_country, dvd_date, user_rating, votes_num • Actor • aid, name, gender, date_of_birth, place_of_birth • Director • did, name, gender, date_of_birth, place_of_birth • Genre • gid, genre_type
METHODOLOGY • Clustering into Groups by keys • Movie, Genre: by first character of the title name/type name • Actor, Director: by concatenating the first character of actor’s/director’s first name and last name • Pairwise Matching • Distance-based approach • used on Actors, Directors, Genres • edit distance: Levenshtein, Jaro-Winkler • Rule-based approach • used on Movies
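
A minimal sketch of blocking followed by distance-based pairwise matching for actor names. It uses a plain Levenshtein distance implemented in the standard library; the cutoff `max_distance=2`, the function names, and the sample names are assumptions for illustration, and Jaro-Winkler is not shown.

```python
from collections import defaultdict
from itertools import combinations

def block_key(name):
    """First character of the first name plus first character of the last name.
    Assumes the name is non-empty."""
    parts = name.split()
    return (parts[0][0] + parts[-1][0]).upper()

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def candidate_matches(names, max_distance=2):
    """Group names into blocks, then compare pairs only within each block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    matches = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                matches.append((a, b))
    return matches

pairs = candidate_matches(["Tom Hanks", "Tom Hankes", "Tim Burton", "Tom Hardy"])
```

Blocking keeps the comparison count down: names are compared only when they share a key, so "Tim Burton" is never compared against the three names in the "TH" block.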
PERFORMANCE COMPARISON • Blocking efficiency is compared between 2 different solutions • Experiment performed on Actor entities:
BLOCK SIZE DISTRIBUTION [Figure: block size distributions for Solution 2 and Solution 1]
RULE-BASED MATCHING • Rule used for deciding whether two movie entities match • Step 1: IF | year1 – year2 | > 2 years, declare a non-match ELSE go to step 2 • Step 2: IF | runtime1 – runtime2 | > 15 mins, declare a non-match ELSE go to step 3 • Step 3: IF edit-distance between title_name1 and title_name2 > threshold, declare a non-match ELSE consider the entities a match
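
The three steps can be sketched as a single function. This sketch reads step 3 as a non-match when the title edit distance exceeds the threshold, since a small distance means similar titles. The threshold value of 3, the dictionary layout of a movie record, and the sample records are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_match(m1, m2, title_threshold=3):
    """Apply the three rules in order; any failed rule declares a non-match."""
    # Step 1: release years more than 2 apart.
    if abs(m1["year"] - m2["year"]) > 2:
        return False
    # Step 2: runtimes more than 15 minutes apart.
    if abs(m1["runtime"] - m2["runtime"]) > 15:
        return False
    # Step 3: titles too far apart in edit distance (assumed threshold).
    if edit_distance(m1["title"].lower(), m2["title"].lower()) > title_threshold:
        return False
    return True

# Hypothetical records from two sources.
a = {"title": "The Dark Knight", "year": 2008, "runtime": 152}
b = {"title": "The Dark Night", "year": 2008, "runtime": 150}
c = {"title": "The Dark Knight", "year": 2012, "runtime": 152}
```

Putting the cheap numeric checks first means the edit distance is computed only for pairs that already agree on year and runtime.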
EXAMPLE After Record Linkage…
METHODOLOGY • Fusion by voting • Assumption made on trustworthiness of the 3 data sources • IMDB > TMDB > Rotten Tomatoes • Extract the most informative value • Example 1: • For actor’s DOB => S1: 1985, S2: 1985-05-05, S3: 1983 • S2: 1985-05-05 will be chosen, as S1 & S2 share the same year value, and S2 also provides the month and day • Example 2:
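
One way to sketch Example 1 in code: group the candidate dates by year, let the year with the most sources win, then keep the most detailed value in that group. Breaking ties by source trust, mapping S1, S2, S3 onto IMDB, TMDB, Rotten Tomatoes, and the function name `fuse_date` are assumptions for illustration; the slides do not spell out these details.

```python
from collections import defaultdict

# Sources ordered from most to least trusted.
TRUST = ["IMDB", "TMDB", "Rotten Tomatoes"]

def fuse_date(values):
    """values maps source name -> date string such as '1985' or '1985-05-05'."""
    groups = defaultdict(list)
    for source in TRUST:
        value = values.get(source)
        if value:
            groups[value[:4]].append((source, value))  # vote by year
    if not groups:
        return None

    def group_rank(members):
        # More votes first; ties go to the group holding the most trusted source.
        best_trust = min(TRUST.index(source) for source, _ in members)
        return (-len(members), best_trust)

    winner = min(groups.values(), key=group_rank)
    # Most informative value: the longest string, ties broken by trust.
    _, value = min(winner, key=lambda sv: (-len(sv[1]), TRUST.index(sv[0])))
    return value

dob = fuse_date({"IMDB": "1985", "TMDB": "1985-05-05", "Rotten Tomatoes": "1983"})
```

When every source disagrees, each group has one vote and the value from the most trusted source is returned.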
PORTAL APPLICATION • Search movies for more details. • Rank and filter movies by attributes such as rating and box office. • Find the movies related to a celebrity.