
Pattern Reduction and Information Retrieval


Presentation Transcript


  1. Pattern Reduction and Information Retrieval, 國立成功大學 電機工程學系 楊竹星 (National Cheng Kung University, Department of Electrical Engineering)

  2. Outline • Pattern Reduction • Information Retrieval • Conclusion

  3. Combinatorial Optimization Problem • Complex Problems • NP-complete problem (Time) • No optimum solution can be found in a reasonable time with limited computing resources. • E.g., Traveling Salesman Problem • Large scale problem (Space) • In general, this kind of problem cannot be handled efficiently with limited memory space. • E.g., Data Clustering Problem, astronomy, MRI

  4. Combinatorial Optimization Problem and Metaheuristics • Traveling Salesman Problem: n! possible tours • Shortest routing path (figure compares two candidate tours, Path 1 and Path 2); a brute-force sketch follows below
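To make the n! blow-up concrete, here is a minimal brute-force TSP sketch (my own illustration, not from the slides); the 4-city distance matrix is invented for the example.

```python
from itertools import permutations

# Hypothetical symmetric distance matrix for 4 cities (illustrative values only).
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

def brute_force_tsp(dist):
    """Enumerate all (n-1)! tours starting at city 0 and return the shortest one."""
    n = len(dist)
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):          # (n-1)! candidate tours
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len

print(brute_force_tsp(dist))   # ((0, 1, 3, 2, 0), 18)
```

Exhaustive enumeration is exact but already hopeless for a few dozen cities, which is why metaheuristics are used instead.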

  5. An Example: Bulls and Cows (figure: the transition, evaluation, and determination steps of a metaheuristic) • Checking candidate solutions: guess → feedback → deduction • Secret number: 9305 • Opponent's try: 1234 → feedback 0A1B • Opponent's try: 5678 → feedback 0A1B • Deduction: each try matches exactly one digit of the secret, so the digits 0 and 9 must both be in the secret number (example from Wikipedia)
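As a small aside (not on the slide), the evaluation step of this game is just a scoring function; a short sketch that reproduces the 0A1B feedback above:

```python
def score(secret: str, guess: str) -> str:
    """Bulls-and-cows feedback: A = right digit and place, B = right digit, wrong place."""
    bulls = sum(s == g for s, g in zip(secret, guess))
    common = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return f"{bulls}A{common - bulls}B"

print(score("9305", "1234"))  # 0A1B  (the '3' is in the secret but misplaced)
print(score("9305", "5678"))  # 0A1B  (the '5' is in the secret but misplaced)
```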

  6. Concept (1/4) • Our observation is that many of the computations performed by most, if not all, metaheuristic algorithms during their convergence process are redundant. (Data courtesy of Su and Chang)

  7. Concept (2/4)

  8. Concept (3/4)

  9. Concept (4/4) (figure) • Plain metaheuristics keep all s = 4 sub-solutions of clusters C1 and C2 at every generation g = 1, 2, ..., n. • Metaheuristics + Pattern Reduction compress the static sub-solutions, so s drops from 4 to 2 by g = 2 and stays at 2 through g = n.

  10. The Proposed Algorithm
      Create the initial solutions P = {p1, p2, . . . , pn}
      While the termination criterion is not met
          Apply the transition, evaluation, and determination operators of the metaheuristic in question to P
          /* Begin PR */
          Detect the sub-solutions R = {r1, r2, . . . , rm} that have a high probability of not being changed
          Compress the sub-solutions in R into a single pattern, say, c
          Remove the sub-solutions in R from P; that is, P = P \ R
          P = P ∪ {c}
          /* End PR */
      End
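A minimal Python sketch of this loop under stated assumptions: the underlying metaheuristic is reduced to a single-solution hill climber, and detection simply freezes positions that have not changed for stable_window iterations. This is not the authors' implementation, only an illustration of where the PR step sits.

```python
import random

def pattern_reduction_loop(fitness, init, transition, iters=100, stable_window=5):
    """Sketch of a metaheuristic with pattern reduction (PR).

    fitness    : callable scoring a solution (lower is better)
    init       : initial solution (list)
    transition : callable(solution, frozen) returning a modified copy
    Positions in `frozen` are the compressed patterns: they are never recomputed again.
    """
    solution = list(init)
    frozen = set()
    static_for = {i: 0 for i in range(len(solution))}   # iterations each position stayed unchanged

    for _ in range(iters):
        # Transition / evaluation / determination of the underlying metaheuristic.
        candidate = transition(solution, frozen)
        if fitness(candidate) <= fitness(solution):
            changed = {i for i, (a, b) in enumerate(zip(solution, candidate)) if a != b}
            solution = candidate
        else:
            changed = set()

        # --- Begin PR: detect sub-solutions with a high probability of not changing. ---
        for i in range(len(solution)):
            static_for[i] = 0 if i in changed else static_for[i] + 1
            if static_for[i] >= stable_window:
                frozen.add(i)            # compress: drop this position from further computation
        # --- End PR ---
    return solution, frozen

# Toy usage: minimize the sum of a vector by random single-position decrements.
def tweak(sol, frozen):
    out = list(sol)
    free = [i for i in range(len(out)) if i not in frozen]
    if free:
        i = random.choice(free)
        out[i] = max(0, out[i] - random.randint(0, 2))
    return out

best, frozen = pattern_reduction_loop(sum, [9, 7, 5, 3], tweak)
print(best, frozen)
```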

  11. Detection • Time-oriented: detect patterns that have not changed for a certain number of iterations, a.k.a. static patterns. Example over iterations: T1: 1352476, T2: 7352614, T3: 7352416, ..., Tn: 7C1416, where pattern C1 compresses the static sub-solution 352. • Space-oriented: detect sub-solutions that are common to several individuals at the same loci. Example: at T1, P1: 1352476 and P2: 7352614 share 352; by Tn they become P1: 1C1476 and P2: 7C1614. • Problem-specific: e.g., for k-means, we assume that patterns near a centroid are unlikely to be reassigned to another cluster. A detection sketch follows below.
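A tiny sketch of the time-oriented rule (my addition): collect snapshots of a solution over a few iterations and flag the loci that never changed, which reproduces the 352 → C1 compression in the example above.

```python
def static_loci(snapshots):
    """Return positions whose value is identical in every snapshot (i.e., static patterns)."""
    first = snapshots[0]
    return [i for i in range(len(first)) if all(s[i] == first[i] for s in snapshots)]

iterations = ["1352476", "7352614", "7352416"]            # T1, T2, T3 from the slide
stable = static_loci(iterations)
print(stable)                                             # [1, 2, 3]
print("C1 =", "".join(iterations[-1][i] for i in stable)) # C1 = 352
```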

  12. An Example

  13. Time Complexity • Standard k-means runs in O(nkld) time, where n is the number of patterns, k the number of clusters, l the number of iterations, and d the number of dimensions. • Ideally, the running time of “k-means with PR” is independent of the number of iterations l. In practice, however, our experimental results show that setting the removal bound to 80% gives the best result.
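For the k-means case, a rough reconstruction (mine, not the authors' code) of how the problem-specific rule could be wired in: after each iteration, patterns that are much closer to their own centroid than to any other are frozen and skipped in later assignment steps, and the removal bound limits how much of the data may ever be frozen. The 0.2 margin threshold below is an assumption made up for the sketch.

```python
import numpy as np

def kmeans_pr(X, k, iters=50, removal_bound=0.8, seed=0):
    """k-means with a simple pattern-reduction step (sketch only)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    active = np.ones(len(X), dtype=bool)            # patterns still taking part in assignment
    labels = np.zeros(len(X), dtype=int)
    max_frozen = int(removal_bound * len(X))        # stop freezing once ~80% of the data is frozen

    for _ in range(iters):
        # Assignment step, restricted to the active (non-compressed) patterns.
        d = np.linalg.norm(X[active, None, :] - centroids[None, :, :], axis=2)
        labels[active] = d.argmin(axis=1)

        # Update step; frozen patterns keep the label they had when they were frozen.
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)

        # PR: freeze patterns far closer to their centroid than to any other (heuristic).
        if (~active).sum() < max_frozen:
            dist = np.sort(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
            near = dist[:, 0] < 0.2 * dist[:, 1]    # assumed threshold, not from the slides
            active &= ~near
    return labels, centroids

# Toy usage with two well-separated blobs.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
print(kmeans_pr(X, k=2)[1])
```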

  14. Outline • Pattern Reduction • Information Retrieval • Conclusion

  15. Introduction • Over the past decade, computers have transformed traditional printed material into digital material. • Internet technology makes most information and knowledge searchable and usable by anyone. • Acquiring knowledge is no longer limited by geography: a search engine can be shared and used by anyone, anywhere, anytime, with any web browser. (Figure: printed material → digital material → databases and web pages)

  16. The History (timeline figure) • 1960: printed material → digital material • 1980 • 1990: digital material → web pages • Internet applications for data analysis and mining: web information retrieval, web information extraction, and web mining

  17. Problem • Due to the growth of online information, a huge number of files have flooded the Internet (figure: the Internet as a library). • We can easily get information, but we spend too much time seeking out the information that is actually relevant.

  18. Problem (cont.) • Users often cannot cope with the sheer volume of information on the Internet.

  19. Web IR • Information Retrieval (IR) • The goal of IR is to retrieve documents whose content is relevant to the user's needs and to find the relationships between documents. • Rieh and Xie pointed out that information retrieval is an interactive and iterative process. • A collection of documents is a set of documents related to a specific context of interest. • Research on information retrieval covers a very broad area, including dependence analysis of a group of files, file clustering, and file classification.

  20. IR and Web IR • From traditional IR to Web IR • New information sources: digital data → web pages, databases, the Internet, etc. • New media types: text → HTML, images, video, audio • New applications: file/data search → web search, video search, audio search • However, the major difference between classic information retrieval (CIR) and web information retrieval (Web IR) is the data sets they face.

  21. A Taxonomy of Information Retrieval Models • (figure, after R. Baeza-Yates)

  22. Document Similarity • Vector Space Model (VSM) • G. Salton and M. E. Lesk, 1968 • The cosine of the angle θ between the term vectors measures the similarity of document dj and query q: sim(dj, q) = (dj · q) / (||dj|| ||q||)
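A minimal term-frequency implementation of that cosine measure (my illustration; real IR systems would use tf-idf weights and proper tokenization):

```python
import math
from collections import Counter

def cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two texts using raw term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine("pattern reduction for clustering", "clustering search results by pattern"))
```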

  23. Web Information Extraction • Information Extraction (IE) • Wrapper induction (human effort): given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns • Automatic IE: given a set of positive pages, or even a single page with multiple data records, generate extraction patterns • Assumption: the data have a structure or a schema • Goal: integrate the data present in different web sites

  24. An Example of IE (figure: a page segmented into repeated information blocks; IB = Information Block)
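As a toy illustration of extracting repeated information blocks from a single structured page (the HTML snippet and the extraction pattern below are invented for the example):

```python
import re

# Hypothetical page with one information block (IB) per product record.
html = """
<li class="item"><b>MP3 Player</b><span>$49</span></li>
<li class="item"><b>MP3 Cable</b><span>$9</span></li>
<li class="item"><b>Headphones</b><span>$19</span></li>
"""

# Extraction pattern derived from the page's repeated structure.
pattern = re.compile(r'<li class="item"><b>(.*?)</b><span>(.*?)</span></li>')
print(pattern.findall(html))
# [('MP3 Player', '$49'), ('MP3 Cable', '$9'), ('Headphones', '$19')]
```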

  25. Web Mining • Data mining extracts knowledge from data; web mining extracts information from the World Wide Web. • Web mining is broadly defined as the discovery and analysis of useful information from the Web. • Web mining can be separated into: • Web usage mining • Web content mining • Web structure mining

  26. Web Mining Process (figure): 1. web pages → 2. wrapper (extraction rules) → 3. patterns → 4. mining → 5. information or knowledge

  27. Web Structure Mining • Generates a structural summary of a web site and its pages • Tries to discover the link structure of the hyperlinks between web pages • Reveals more than the information contained in the web pages themselves

  28. Web Structure Mining – PageRank (Google) • S. Brin • Figure: each page splits its rank evenly among its out-links; e.g., a page receiving 30 + 3 = 33 with four out-links passes 33/4 = 8.25 along each link, while a page receiving 30 + 3 + 3 = 36 with two out-links passes 36/2 = 18 along each.
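A bare-bones power-iteration PageRank over a made-up four-page link graph (my sketch, not Google's implementation), showing the same rank-splitting idea in code:

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict page -> list of pages it links to. Returns page -> rank."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                              # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                share = damping * rank[p] / len(outs) # rank split across out-links
                for q in outs:
                    new[q] += share
        rank = new
    return rank

# Hypothetical 4-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```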

  29. CSES – A Cluster Search Engine • Motivation • Search Engine Operations • Other Types of Search Engines • Meta-search Engine • Clustering Search Engine • CSES: A Clustering Search Engine System • Framework • Clustering Algorithm • User Interface

  30. Motivation • A search engine is an information retrieval system designed to help users find information stored on a computer. • Problem: queries are often ambiguous. For example, if “mp3” is given to a search engine, it could mean an “mp3 music file” or an “mp3 player.” Another example: when the keyword “cat” (meaning the animal) is given as a query to the Google search engine, the first item returned is the company “Caterpillar, Inc.,” which has nothing to do with the animal “cat.”

  31. Search Engine Operations • Web crawling (web spider, web robot): an automated web browser that follows every link it sees and feeds the pages to the indexer. • Indexing: the contents of each page are analyzed to determine how it should be indexed (titles, headings, or meta tags); data about web pages are stored in an index database for use in later queries. • Searching: when a user enters a query into a search engine (typically by using keywords), the engine checks its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
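To make the indexing and searching steps concrete, a toy inverted index (my illustration; real engines add ranking, stemming, stop-word removal, and much more):

```python
from collections import defaultdict

docs = {   # hypothetical crawled pages
    1: "pattern reduction speeds up metaheuristics",
    2: "clustering search engine groups web pages",
    3: "web pages about pattern clustering",
}

# Indexing: map every term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Searching: an AND-query returns documents containing all keywords.
def search(query):
    hits = set(docs)
    for term in query.lower().split():
        hits &= index.get(term, set())
    return sorted(hits)

print(search("pattern clustering"))   # [3]
```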

  32. Google Search Engine • Google stores and indexes data on a shared-nothing architecture (a distributed computing architecture in which each node is independent and self-sufficient), the Google File System. • After Google filed its IPO S-1 form in April 2004, Tristan Louis (the founder of Internet.com) estimated that Google's infrastructure included: • 63,272 computers • 126,544 processors • 253,088 GHz of processing workload • 126,544 GB of memory • 5,062 TB of storage

  33. Search Engine Ranking • The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of Web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. • Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. • Ranking algorithm.

  34. Meta-search Engine • Search engines are measured by • Coverage • Recency (freshness) • How to improve coverage? • Meta-search engine • P2P platform • Crawling hidden pages • Meta-search engine: you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages.

  35. Meta-search Engine (cont.) • Figure (meta-search engine system): the user interface passes the query to a parser, which submits it to Google, Yahoo!, MSN, AltaVista, Overture, and other search engines, and then merges their results.
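A schematic of the dispatch-and-merge step only (my sketch; the two "engines" are stubbed with canned results, whereas a real system would issue HTTP queries and parse each engine's result format):

```python
def search_engine_a(q): return ["url1", "url2", "url3"]   # stand-ins for real backends
def search_engine_b(q): return ["url2", "url4"]

def metasearch(query, backends):
    """Fan the query out to every backend, then merge and de-duplicate the ranked lists."""
    merged, seen = [], set()
    results = [b(query) for b in backends]
    # Simple round-robin interleaving of the individual result lists.
    for rank in range(max(map(len, results))):
        for r in results:
            if rank < len(r) and r[rank] not in seen:
                seen.add(r[rank])
                merged.append(r[rank])
    return merged

print(metasearch("mp3", [search_engine_a, search_engine_b]))
# ['url1', 'url2', 'url4', 'url3']
```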

  36. Clustering Search Engine • A search engine (even a meta-search engine) can return a huge ranked list of Web pages. However, this way of presenting results is highly inefficient: • Search results can be in the millions for a typical query. • The criteria used for the ranking may not reflect the needs of the user. • A majority of queries tend to be short, making them non-specific or imprecise. • By clustering the search results, users can find the ones they really want efficiently and correctly.

  37. Clustering Search Engine (cont.) • The rise of clustering search engines: • Usually built on top of a meta-search engine. • Clustering search results provides a better way to help users find information quickly. • Well-known clustering search engines: Vivisimo, SnakeT, iBoogie, KartOO, Grokker, etc.

  38. Results of Traditional Search Engines (figure): relevant and irrelevant items from different taxonomies (Taxonomy 1, 2, 3) are interleaved in a single ranked list.

  39. Results of Clustering Search Engines (figure): results are grouped by taxonomy into clusters, each with its own summary (Summary A, B, C), so relevant information is separated from irrelevant information.

  40. Vivisimo

  41. iBoogie

  42. KartOO

  43. Clustering Search Engine System: CSES • We propose a novel clustering search engine system, called CSES, which combines • the information coverage provided by search engines with • the relevance of information offered by directory search systems. • We also propose a simple but novel algorithm for clustering web pages. This algorithm is fundamentally different from traditional clustering algorithms, which require a tremendous amount of computation time.

  44. CSES: Framework (figure) • Meta-search engine: Yahoo!, MSN, Google • Meta-directory system: Yahoo! Directory, Google Directory, and ODP directory merged into a directory tree (taxonomy information system) • Flow: query → meta-search → clustering of the returned web sites against the directory tree → result • Runs on a data grid / grid computing infrastructure

  45. CSES: Meta-Directory-Tree-Based Clustering (figure) • Web sites are compared against taxonomy nodes of the merged directory tree (ODP, Yahoo!, and Google), e.g., Taxonomy Tax1 → Sub-taxonomy tax3 (Music → Band). • The similarity computations MA1 and MA2 are based on term frequency.
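A rough sketch of the idea (not the CSES implementation): score a result page against each directory category by term-frequency overlap and attach it to the best-scoring taxonomy node; the category descriptions below are invented.

```python
from collections import Counter

# Hypothetical category descriptions from a merged directory tree.
categories = {
    "Music/Band": "band album rock concert guitar music",
    "Computers/MP3": "mp3 player audio codec music file",
    "Animals/Cats": "cat kitten feline pet animal",
}

def tf_score(page_text, category_text):
    """Term-frequency overlap between a page and a category description."""
    page, cat = Counter(page_text.lower().split()), Counter(category_text.lower().split())
    return sum(page[t] * cat[t] for t in page.keys() & cat.keys())

def assign(page_text):
    """Attach the page to the best-matching taxonomy node."""
    return max(categories, key=lambda c: tf_score(page_text, categories[c]))

print(assign("download free mp3 music file for your player"))   # Computers/MP3
```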

  46. CSES: User Interface (figure): input area, tree structure of clusters, and search results.

  47. Future Work of CSES • Open problems of clustering search engines: • Computation load (response time) • Accuracy (relevance) • Information display (user interface) • Grid computing, distributed computing • Social networks • Applying this framework to other areas
