Query Optimization by Genetic Algorithms

Suhail Owais, Pavel Kromer, Vaclav Snašel. Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava - Poruba, Czech Republic.


Presentation Transcript


  1. Query Optimization by Genetic Algorithms Suhail Owais, Pavel Kromer, Vaclav Snašel Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava - Poruba, Czech Republic

  2. Outline • Introduction • Information Retrieval (IR) • Genetic Algorithms (GA) • Optimization • State of the art • IR and GA • Experiments • Conclusion • Future Work

  3. Internet

  4. Information Retrieval • In principle, suppose there is a set of documents and a person who uses them. The user formulates a question (a request or query), and the answer is the subset of documents that satisfies the information need expressed by that question: the "relevant documents". • Searching covers information in documents, documents in a collection of documents, metadata in documents, … • Searching takes place in databases, or in hypertext networked databases such as the Internet or an intranet.

  5. Information Retrieval System - IRS • An IRS is concerned • with responding to user requests (queries) that seek information in text. • with retrieving all documents relevant to the user query from a collection of documents, while retrieving as few non-relevant documents as possible.

  6. Retrieved - Relevant Documents to the user Query [Diagram labels: Collection of Documents; Relevant Doc.; Relevant Retrieved Doc.; Retrieved Doc.]

  7. IR Evaluation The most common measures of retrieval effectiveness are: • Precision: "the percentage of the retrieved documents that are relevant to the user query" • Recall: "the percentage of the relevant documents that are retrieved"
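
The two measures on this slide can be sketched in a few lines. This is a minimal illustration, not code from the presentation; the function name `precision_recall` and the document identifiers are hypothetical, and the values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) as fractions in [0, 1]."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(p, r)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero.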

  8. Genetic Algorithm • GAs take the optimization strategies that nature uses successfully in Darwinian evolution and adapt them to mathematical optimization theory, in order to find the global optimum in a defined phase space. • GAs are used in IR problems, especially for optimizing Boolean queries. • GA operators: Selection, Fitness function, Crossover, and Mutation.

  9. GA Flowchart Diagram [Flowchart labels: Start; Initialize Population; Encoding; Evaluate Fitness; Condition Satisfied (Yes: Optimized Query, End; No: Selection, Crossover, Mutation, Regenerate New Offspring)]

  10. Optimization • The procedure or procedures used to make a system or design as effective or functional as possible, especially the mathematical techniques involved. • The process of modifying a system to improve its efficiency. The system can be a single computer program, a collection of computers, or even an entire network such as the Internet.

  11. State of the art 1 • Evolutionary Learning of Boolean Queries by Multiobjective Genetic Programming • Authors: Cordon et al., Springer-Verlag GmbH 2002 • Subject: Automatic derivation of Boolean queries by incorporating a Pareto-based multiobjective evolutionary approach, MOGA, into the genetic programming technique. • Notes: • A query is represented as a parse tree with a maximum of 20 nodes. • The Boolean operators used are AND, OR and NOT. • The maximum number of documents is 1400. • Result: The proposed approach performed appropriately on seven queries of the well-known Cranfield collection, in terms of absolute retrieval performance and of the quality of the obtained Paretos.

  12. State of the art 2 • An Appropriate Boolean Query Reformulation Interface for Information Retrieval Based on Adaptive Generalization • Authors: Yoshioka et al., WIRI 2005, in conjunction with IEEE 2005, Tokyo, Japan • Subject: Implement a user query interface that supports reformulation of IR queries by using abstract concepts. • Notes: • The IR interface uses small numbers of query terms and concept categories with Boolean expressions. • A Boolean query is reformulated using only words that exist in the original query. • The Boolean operators used are AND and OR. • Result: Proposed a new IR interface with Boolean query reformulation (ABRIR-AG). It finds complementary query terms that exist in relevant documents and reformulates Boolean query formulas to clarify the information need. • ABRIR-AG: Appropriate Boolean query Reformulation for IR - Adaptive Generalization

  13. IR and GA • Collection or set of documents • Terms for document di • Weighting function: 1 if the term is in the document, 0 if it is not (example: w2 is not in document d2; w8 is in document d2)
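
A Boolean weighting function of this kind can be sketched as follows. This is an illustration under assumptions: documents are plain whitespace-separated strings, and the name `boolean_weights` and the sample documents are hypothetical, not taken from the presentation.

```python
def boolean_weights(documents: dict[str, str], terms: list[str]) -> dict[str, dict[str, int]]:
    """Map each document id to {term: 1 if the term occurs in it, else 0}."""
    weights = {}
    for doc_id, text in documents.items():
        words = set(text.lower().split())  # assumption: whitespace tokenization
        weights[doc_id] = {term: int(term in words) for term in terms}
    return weights


# Hypothetical two-document collection mirroring the slide's example for d2.
docs = {"d1": "w2 w5", "d2": "w8 w5"}
weights = boolean_weights(docs, ["w2", "w5", "w8"])
print(weights["d2"])
```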

  14. Chromosome Encoding • A query is a combination of terms from a set of terms and operators from a set of Boolean operators • A set of queries is encoded as chromosomes for genetic programming, in prefix form, such as • (w2 OR w6) AND (w9 AND w3), prefix form: AND (OR w2 w6) (AND w9 w3) • (w3 AND w4) XOR ((w5 AND w6) OR w8), prefix form: XOR (AND w3 w4) (OR (AND w5 w6) w8)

  15. Tree Structure Representation [Tree diagrams for: XOR (AND w3 w4) (OR (AND w5 w6) w8) and AND (OR w2 w6) (AND w9 w3)]
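
As a rough sketch of this encoding, a query tree can be held as nested tuples and printed in prefix form. The representation, the names `to_prefix` and `matches`, and the sample document term sets are assumptions made for illustration; the presentation does not show its data structures.

```python
# A query is either a term such as "w2" or a tuple (operator, left, right).
OPERATORS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}


def to_prefix(node, top=True):
    """Render a query tree in prefix form, e.g. AND (OR w2 w6) (AND w9 w3)."""
    if isinstance(node, str):
        return node
    op, left, right = node
    body = f"{op} {to_prefix(left, False)} {to_prefix(right, False)}"
    return body if top else f"({body})"


def matches(node, doc_terms):
    """Evaluate the Boolean query against the set of terms in one document."""
    if isinstance(node, str):
        return node in doc_terms
    op, left, right = node
    return OPERATORS[op](matches(left, doc_terms), matches(right, doc_terms))


q1 = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
q2 = ("XOR", ("AND", "w3", "w4"), ("OR", ("AND", "w5", "w6"), "w8"))
print(to_prefix(q1))
print(to_prefix(q2))
print(matches(q1, {"w2", "w3", "w9"}))
```

Immutable tuples are used so that later operators can build new trees without altering their parents.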

  16. Fitness function • Recall and precision functions are used to evaluate chromosomes. Selection Operators • From the population of chromosomes, the two best chromosomes, those with the highest fitness values for the precision or recall measure, are selected. • rd: the relevance of document d (1 if relevant, 0 if non-relevant), • fd: whether document d is retrieved (1 if retrieved, 0 if not), and • α and β are arbitrary weights, added specifically to the precision fitness function.
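
The formulas themselves are not present in this transcript. The sketch below is an assumption: it takes recall as Σ rd·fd / Σ rd and combines recall and precision as α·recall + β·precision, a form that agrees with the symbol definitions on this slide and with the maximum of α + β stated later in the notes on limitations. The function names are hypothetical, and the defaults use the α and β values from the variables slide.

```python
def recall_fitness(r: list[int], f: list[int]) -> float:
    """Assumed form: sum(rd * fd) / sum(rd)."""
    hits = sum(rd * fd for rd, fd in zip(r, f))
    return hits / sum(r) if sum(r) else 0.0


def precision_fitness(r: list[int], f: list[int], alpha: float = 0.25, beta: float = 1.0) -> float:
    """Assumed form: alpha * recall + beta * precision (maximum alpha + beta)."""
    hits = sum(rd * fd for rd, fd in zip(r, f))
    recall = hits / sum(r) if sum(r) else 0.0
    precision = hits / sum(f) if sum(f) else 0.0
    return alpha * recall + beta * precision


# Hypothetical four-document collection: r marks relevance, f marks retrieval.
r = [1, 1, 0, 0]
print(precision_fitness(r, [1, 1, 0, 0]))  # perfect retrieval reaches alpha + beta
print(precision_fitness(r, [1, 0, 1, 0]))
```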

  17. Crossover Operator • Randomly choose one node position in each tree to be exchanged [Figure labels: OR 4; OR 1]

  18. Exchange Subtrees • Exchanging the chosen subtrees creates two new offspring

  19. Reindexing nodes in Offspring
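
The crossover steps on the last three slides can be sketched as below. This is an illustration under assumptions: queries are nested tuples `(operator, left, right)` with string terms as leaves, and `subtrees`, `replace_at` and `crossover` are hypothetical names. Node positions are recomputed by a fresh preorder walk each time, which plays the role of reindexing here.

```python
import random


def subtrees(node, path=()):
    """Yield (path, subtree) for every node in preorder; path lists child slots."""
    yield path, node
    if not isinstance(node, str):
        for slot in (1, 2):  # slots 1 and 2 hold the left and right children
            yield from subtrees(node[slot], path + (slot,))


def replace_at(node, path, new):
    """Return a copy of the tree with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    parts = list(node)
    parts[path[0]] = replace_at(node[path[0]], path[1:], new)
    return tuple(parts)


def crossover(a, b, rng):
    """Pick one random node in each parent and exchange the subtrees."""
    path_a, sub_a = rng.choice(list(subtrees(a)))
    path_b, sub_b = rng.choice(list(subtrees(b)))
    return replace_at(a, path_a, sub_b), replace_at(b, path_b, sub_a)


a = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
b = ("XOR", ("AND", "w3", "w4"), ("OR", ("AND", "w5", "w6"), "w8"))
c1, c2 = crossover(a, b, random.Random(0))
print(c1)
print(c2)
```

Because the subtrees are only swapped, the two offspring together contain exactly as many nodes as the two parents.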

  20. Mutation Operator • One of the Boolean logical operators (AND, OR, XOR), at a randomly chosen position, is randomly changed to another operator. • If no position is selected, no mutation is applied to that offspring.
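
A possible sketch of this mutation operator follows. The tuple representation and the names `operator_paths`, `swap_operator` and `mutate` are assumptions for illustration; the default probability is the mutation probability value listed in the deck.

```python
import random

BOOLEAN_OPERATORS = ("AND", "OR", "XOR")


def operator_paths(node, path=()):
    """Yield the path to every operator node (leaves are plain strings)."""
    if isinstance(node, str):
        return
    yield path
    for slot in (1, 2):
        yield from operator_paths(node[slot], path + (slot,))


def swap_operator(node, path, rng):
    """Return a copy of the tree with the operator at `path` changed to another one."""
    if not path:
        others = [op for op in BOOLEAN_OPERATORS if op != node[0]]
        return (rng.choice(others), node[1], node[2])
    parts = list(node)
    parts[path[0]] = swap_operator(node[path[0]], path[1:], rng)
    return tuple(parts)


def mutate(tree, rng, probability=0.2):
    """With the given probability, change one randomly chosen operator."""
    paths = list(operator_paths(tree))
    if not paths or rng.random() >= probability:
        return tree  # nothing selected, so the offspring is left unchanged
    return swap_operator(tree, rng.choice(paths), rng)


tree = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
print(mutate(tree, random.Random(1), probability=1.0))
```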

  21. Experiments • The implementation of our genetic program was tested under the following conditions and limitations: • Two sets of queries, represented as trees in prefix form, used as two different initial populations • Boolean model of a collection of documents • Different collections of documents • User query / request: w8 OR w2

  22. Initial Populations • The two initial populations differ in the sub queries they contain: • Initial Population 1 contains the sub query • w8 AND w2 • Initial Population 2 contains the sub queries • w8 AND w2 • w8 OR w2 • w8 XOR w2

  23. Variables initialization • Crossover probability value = 0.8 • Mutation probability value = 0.2 • Population size (number of chromosomes) = 8 • Maximum number of generations = 50 • α = 0.25 • β = 1.0
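
To show how these values could drive a run, here is a generic loop sketch. It is not the authors' implementation: the names `GAParams` and `evolve` are hypothetical, the fitness, crossover and mutation functions are passed in by the caller, and the replacement policy (offspring replace the two weakest chromosomes) is an assumption, since the deck does not state one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GAParams:
    crossover_probability: float = 0.8
    mutation_probability: float = 0.2
    population_size: int = 8
    max_generations: int = 50
    alpha: float = 0.25  # weight used by the precision fitness function
    beta: float = 1.0    # weight used by the precision fitness function


def evolve(population, fitness, crossover, mutate, params, rng):
    """Run a fixed number of generations and return the fittest chromosome."""
    for _ in range(params.max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        child_a, child_b = ranked[0], ranked[1]  # the two best chromosomes
        if rng.random() < params.crossover_probability:
            child_a, child_b = crossover(child_a, child_b, rng)
        if rng.random() < params.mutation_probability:
            child_a = mutate(child_a, rng)
        if rng.random() < params.mutation_probability:
            child_b = mutate(child_b, rng)
        # Assumed replacement policy: offspring replace the two weakest.
        population = ranked[:-2] + [child_a, child_b]
    return max(population, key=fitness)


# Toy usage with integers as chromosomes, purely to exercise the loop.
best = evolve(
    population=list(range(1, 9)),
    fitness=lambda x: x,
    crossover=lambda a, b, rng: (b, a),
    mutate=lambda x, rng: x + 1,
    params=GAParams(),
    rng=random.Random(0),
)
print(best)
```

Keeping the parents in the population means the best fitness never decreases from one generation to the next under this policy.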

  24. Document Collections • Three different document collections with varying numbers of words and documents.

  25. Notes on limitations • Single point for crossover • The mutation operator is applied only over the Boolean operators AND, OR and XOR. • The fitness operator must be defined in the input data as either • PrecisionFitness or • RecallFitness. • The maximum value of PrecisionFitness is α + β, • so it may be greater than one ( > 1 ) • and it cannot be interpreted as the probability of retrieving a relevant document.

  26. Experiments • A set of experiments was done over three test cases, depending on: • Initial population used • Initial Population 1 or • Initial Population 2. • Fitness function used • PrecisionFitness or • RecallFitness. • Collection used • Collection 1 or • Collection 2 or • Collection 3.

  27. Experiments Results Using IP. 1 • Legend: IP: Initial Population, FF: Fitness Function, R: Recall, P: Precision, V: Value

  28. Experiments Results Using IP. 2 • Legend: IP: Initial Population, FF: Fitness Function, R: Recall, P: Precision, V: Value

  29. Precision and Recall Diagrams Collections

  30. Precision and Recall Diagrams Initial Populations

  31. Conclusions • The final population contains a set of individuals that have the same fitness values; one of them is chosen at random as the optimized query. • Selecting queries whose sub queries are similar to the user query increases the quality of the initial population, and this gave better results. • The recall values in the final population were low, especially when precision was used as the fitness measure and the experiment was run over the largest collection. • In many experiments almost all members of the population reached the maximum values of precision and recall before the given number of generations was reached.

  32. Future work • Use more of the unweighted Boolean operators, such as the ADJ and OF operators • Let mutation operate over all Boolean operators (AND, OR, XOR, ADJ, OF, and NOT) • Improve the selection method for choosing the best individual from a set of queries with equal values of precision or recall. • Apply a fuzzy theory approach to this problem: use weights for terms in documents instead of Boolean weights.

  33. Thanks for your attention Suhail Owais : suhailowais@yahoo.com
