
Another approach to Information Extraction

Another approach to Information Extraction using Extended Ontologies. Marek Nekvasil, xnekm06@vse.cz. Agenda: gathering information with wrappers; ways to build a wrapper; using and extending an ontology; templates and patterns; suggesting a simple wrapper induction method.


Presentation Transcript


  1. Another approach to Information Extraction using Extended Ontologies Marek Nekvasil xnekm06@vse.cz

  2. agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method

  3. wrapping up a document • synonymous with identifying the relevant information in a document • there are many ways to wrap a document up

  4. wrapper classes • string-based wrappers • Kushmerick's wrapper classes • tree-based wrappers • XPath • Elog • finite automata • methods comparison

  5. LR class • basic class (stands for Left-Right) • 2n parameters (2 for every part of extracted tuple) • example: • suitable wrapper LR(<B>; </B>; <I>; </I>) <HTML> <TITLE>Ceny pobytů</TITLE> <BODY> <B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR> <B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR> <B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR> <B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR> </BODY> </HTML>
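A minimal sketch of how an LR wrapper could be applied to the page above. The function name `lr_extract` and the representation of the wrapper as a list of (left, right) delimiter pairs are illustrative assumptions, not part of the original slides.

```python
def lr_extract(page, delimiters):
    """Apply an LR wrapper: `delimiters` holds one (left, right) pair
    per attribute of the extracted tuple (2n parameters in total)."""
    if not delimiters:
        raise ValueError("an LR wrapper needs at least one delimiter pair")
    tuples = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            # Find the next left delimiter, then the right one after it.
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


PAGE = (
    "<HTML><TITLE>Ceny pobytů</TITLE><BODY>"
    "<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR>"
    "<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR>"
    "<B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR>"
    "<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR>"
    "</BODY></HTML>"
)

# The wrapper LR(<B>; </B>; <I>; </I>) from the slide.
OFFERS = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
for destination, price in OFFERS:
    print(destination, "|", price)
```

The sketch scans the page strictly left to right, which is why the head and tail variants (HLRT and others) exist: they restrict the scan to the part of the page where the delimiters are unambiguous.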

  6. other LR class derivatives • Nicholas Kushmerick's classes • HLRT (Head-Left-Right-Tail) • OCLR (Opening-Closing-Left-Right) • HOCLRT (…) • N-LR or N-HLRT (Nested-…)

  7. XPath wrappers • using XPath queries to identify data in the tree representation of a document • often using just the very basic features of the XPath language • usually building queries from the root of a document
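A small sketch of an XPath wrapper built from the document root, using only the basic XPath subset supported by Python's `xml.etree.ElementTree`. The document is a well-formed variant of the price list from the LR example (`<BR/>` instead of `<BR>`), which is an assumption made so that an XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the example page (assumption: <BR> closed as <BR/>).
DOC = """<HTML><TITLE>Ceny pobytů</TITLE><BODY>
<B>Řecko - Lefkada</B> <I>16 299 Kč</I><BR/>
<B>Mallorca - Santa Ponsa</B> <I>21 100 Kč</I><BR/>
<B>Egypt - Sharm El Sheikh</B> <I>18 500 Kč</I><BR/>
<B>Egypt - Ghiza</B> <I>19 049 Kč</I><BR/>
</BODY></HTML>"""

root = ET.fromstring(DOC)  # root is the HTML element

# Two queries anchored at the root, one per attribute of the tuple.
destinations = [b.text for b in root.findall("./BODY/B")]
prices = [i.text for i in root.findall("./BODY/I")]

# Pair the results by document order.
offers = list(zip(destinations, prices))
print(offers)
```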

  8. Elog • declarative language similar to Prolog • uses predicates to generate instances • used in the Lixto tool • example of Elog wrapper

  9. finite automata • FSM can be used for wrapping in various ways • usually used for searching in the linear representation of a document • Carme shows it is possible to use FSM for searching in the tree structure

  10. methods comparison • Tree-based wrappers are more error-prone than linear string-based wrappers • Elog and N-LR allow extraction not only from tabular data structure but also from a general hierarchical data structure • XPath wrappers reuse a well defined standard

  11. agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method

  12. building a wrapper • by hand • Oracle and PAC analysis • interactive visual pattern design • tree-fragment queries • tree traversal pattern generalization • and many others …

  13. PAC analysis • uses an abstract function called the Oracle to gather enough example instances of the extracted class (assuming the Oracle's role is played by a human) • gathers examples until it has enough of them (N) to suggest a wrapper class with a designated error ε at a given probability level 1-δ, using the formula: • finally searches for the first set of wrapper parameters that matches all the examples

  14. interactive visual pattern design • used in the Lixto tool to craft wrappers in the Elog language • first the user points out example instances, which yields a generating rule (a pattern) • then the user adds conditions (filters) to the patterns to restrict them, which is done visually

  15. interactive condition building in Lixto

  16. tree-fragment queries • searching for a minimal XPath query that forms a tree-prefix of all examples • tree-prefix examples

  17. tree traversal pattern generalization • application of graph theory to the generalized document tree • searching for the shortest path through the document tree and thus forming an efficient XPath query

  18. agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method

  19. ontologies and wrappers • an ontology is a knowledge model • we can make a knowledge model that summarizes what information we are going to extract • with a nifty extension we can use the ontology to identify examples of what we are going to extract • these examples can be used to build a wrapper with any method

  20. ontology in OWL <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:owl="http://www.w3.org/2002/07/owl#"> <owl:Ontology rdf:about=""> <owl:imports rdf:resource="http://www.somedomain.com/x"/> </owl:Ontology> <owl:Class rdf:ID="class_A"> <owl:disjointWith rdf:resource="#class_B"/> </owl:Class> <owl:Class rdf:ID="class_C"> <rdfs:subClassOf rdf:resource="#class_A"/> </owl:Class> <owl:DatatypeProperty rdf:ID="property_A"> <rdfs:domain rdf:resource="#class_A"/> </owl:DatatypeProperty> </rdf:RDF>

  21. extending OWL • in terms of ontologies, we extract values of datatype properties • therefore we need some technique to identify (and rank) possible instances of these values • we suggest a way to define complex templates of typical values of a datatype property

  22. placing a template into the ontology • we establish a new namespace: xmlns:ot="http://st.vse.cz/~XNEKM06/ontologytemplates#" • in the new namespace we use an element <ot:Template> to write a template down • such a template can only be attached to a datatype property <owl:DatatypeProperty rdf:ID="property_A"> <rdfs:domain rdf:resource="#class_B"/> <ot:Template ...> ... </ot:Template> </owl:DatatypeProperty>

  23. agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method

  24. patterns • pattern – a general rule that can be evaluated against any continuous part of a document to see to what degree it matches

  25. template • template – a set of rules that can be evaluated as a whole against any continuous part of a document to see to what degree it matches • a template is a special case of a pattern • thus a template can contain other templates

  26. simple patterns • a pattern has an internal algorithm that can (with some parameters) identify possible matches throughout the document, with a pattern match degree as its output • moreover we need to infer a degree of evidence certainty, i.e. our confidence that the match really is a value the pattern was meant to identify

  27. deriving the degree of evidence certainty 1 • let us define two propositions: • A – the pattern algorithm identified a given part of a document • E – the part really should have been identified by that pattern • A and E are logical propositions and in fuzzy logic their truth value is a real number from the interval $[0, 1]$

  28. deriving the degree of evidence certainty 2 • intuitively there should be a relation $A \Rightarrow E$ • thanks to the modus ponens rule we can write in basic logic $(A \,\&\, (A \Rightarrow E)) \Rightarrow E$ • from that we can derive $\mathrm{val}(E) \ge \mathrm{val}(A \,\&\, (A \Rightarrow E))$ • and, not wanting to overestimate the evidence certainty, we set $\mathrm{val}(E) = \mathrm{val}(A \,\&\, (A \Rightarrow E))$

  29. deriving the degree of evidence certainty 3 • now we introduce a parameter of the pattern, $\mathrm{val}(A \Rightarrow E) = p$ • we call it pattern precision • using for example Łukasiewicz' logic we can derive $e = \max(0, a + p - 1)$, where $e$ stands for $\mathrm{val}(E)$ and $a$ for $\mathrm{val}(A)$

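  30. deriving the degree of evidence certainty 4 • without doubt it's true that $(E \,\&\, A) \Rightarrow E$, and $(\neg A \,\&\, E) \Rightarrow E$ • while in Łukasiewicz' logic we can derive from the above $(\neg A \,\&\, E) \Leftrightarrow \neg(E \Rightarrow A)$ • and therefore $\neg(E \Rightarrow A) \Rightarrow (\neg A \,\&\, E)$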

  31. deriving the degree of evidence certainty 5 • when we substitute $\neg(E \Rightarrow A)$ for $(E \,\&\, \neg A)$ we can derive $\neg(E \Rightarrow A) \Rightarrow E$ • and we introduce a second parameter, $\mathrm{val}(E \Rightarrow A) = c$, which we call pattern completeness

  32. deriving the degree of evidence certainty 6 • combining the two rules above we can derive an ultimate rule $((A \,\&\, (A \Rightarrow E)) \vee \neg(E \Rightarrow A)) \Rightarrow E$ • and, still not wanting to overestimate the evidence certainty, we can write down (in Łukasiewicz' logic) $e = \max(\max(0, a + p - 1), 1 - c)$

  33. simple patterns summary • a pattern identifies a given place in the document with a pattern match degree denoted as $a$ • every pattern has two parameters: $p$ – precision and $c$ – completeness • the degree of pattern evidence certainty can then be calculated as $e = \max(a + p - 1, 1 - c)$
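The summary formula can be sketched as a small function. The name `evidence_certainty` and the range checks are illustrative assumptions; the arithmetic follows the Łukasiewicz-based rule from the slides.

```python
def evidence_certainty(a, p, c):
    """Degree of evidence certainty e for a simple pattern.

    a: pattern match degree, p: precision, c: completeness,
    all truth values from the interval [0, 1].
    """
    for name, value in (("a", a), ("p", p), ("c", c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # e = max(max(0, a + p - 1), 1 - c); the inner 0 is kept for clarity
    # although 1 - c is never negative.
    return max(0.0, a + p - 1.0, 1.0 - c)


# A strong match of a fairly precise pattern.
STRONG = evidence_certainty(a=0.9, p=0.7, c=0.8)
# No match at all: the result falls back to 1 - c.
NO_MATCH = evidence_certainty(a=0.0, p=0.7, c=0.5)
print(STRONG, NO_MATCH)
```

Note that with a match degree of 0 the result is 1 - c rather than 0, which is a direct consequence of the rule as stated.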

  34. composite patterns • to form a template, we can combine simple patterns together • computing the evidence certainty works the same way as for simple patterns; however, we have to derive the pattern match degree somehow

  35. deriving the composite pattern match degree • joining the evidence of two patterns can be viewed as joining two fuzzy sets • for this we can use either a set union (associated with disjunction) or a set intersection (associated with conjunction) • therefore we compute the composite pattern match degree as the conjunction or disjunction of the evidence certainties of all component patterns • so we get two kinds of templates: conjoint and disjoint

  36. the nature of templates • for the calculations we use the formulae of min-conjunction and max-disjunction • the parameters p and c of component patterns now get a new meaning • in a disjoint template a high value of p means that the pattern forms a sufficient condition • in a conjoint template a high value of c means that the pattern forms a necessary condition
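A sketch of how a template's evidence certainty could be computed from its components, under the min-conjunction and max-disjunction described above. The function names and the representation of a component as an (a, p, c) triple are assumptions made for illustration.

```python
def evidence_certainty(a, p, c):
    """e = max(max(0, a + p - 1), 1 - c), as for simple patterns."""
    return max(0.0, a + p - 1.0, 1.0 - c)


def template_match_degree(component_evidences, kind):
    """Combine component evidence certainties into the template's
    match degree: min for conjoint, max for disjoint templates."""
    if kind == "conjoint":
        return min(component_evidences)
    if kind == "disjoint":
        return max(component_evidences)
    raise ValueError(f"unknown template type: {kind}")


def template_evidence(components, kind, p, c):
    """components: list of (a, p, c) triples, one per component pattern.
    p and c are the parameters of the template itself."""
    evidences = [evidence_certainty(*component) for component in components]
    a = template_match_degree(evidences, kind)
    return evidence_certainty(a, p, c)


# One component matches fully, the other not at all.
COMPONENTS = [(1.0, 0.95, 1.0), (0.0, 0.6, 1.0)]
DISJOINT = template_evidence(COMPONENTS, "disjoint", p=1.0, c=1.0)
CONJOINT = template_evidence(COMPONENTS, "conjoint", p=1.0, c=1.0)
print(DISJOINT, CONJOINT)
```

In the disjoint case the single high-precision component is enough to carry the template, which matches the reading of p as a sufficient condition; in the conjoint case the failed component pulls the whole template down.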

  37. writing down the templates • we write the template down so that it fits into the ontology as shown before: <ot:Template ot:p="0.95" ot:c="0.8" ot:type="disjoint"> ... </ot:Template> • the component patterns are written as nested XML tags

  38. a few kinds of patterns • <ot:String ot:p="0.7">Egypt</ot:String> • <ot:Stringlist ot:source="c:\temp\zeme.txt" ot:c="0.62"/> • <ot:Concatenation> ..</..> • <ot:Context ot:side="left" ot:maxdistance="1" ot:c="0.5">..</..> • <ot:Number ot:min="1" ot:max="10"/> • <ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/> • <ot:Regexp> ..</..> • …

  39. example template <ot:Template ot:type="disjoint" ot:c="0.9"> <ot:Concatenation> <ot:Distribution ot:type="gauss" ot:mean="10900" ot:variance="9200000"/> <ot:Stringlist> <ot:String ot:case="any">kc</ot:String> <ot:String ot:case="any">kč</ot:String> <ot:String ot:case="same">,-</ot:String> </ot:Stringlist> </ot:Concatenation> <ot:Context ot:side="left" ot:maxdistance="2" ot:p="0.6"> <ot:Template> <ot:String ot:case="any">cena</ot:String> <ot:String ot:case="any">cena:</ot:String> </ot:Template> </ot:Context> </ot:Template>

  40. agenda • gathering information with wrappers • ways to build a wrapper • using and extending an ontology • templates and patterns • suggesting a simple wrapper induction method

  41. annotating the document • first of all we can use the ontology as a model of the extracted data • then we use the templates included in the ontology to identify possible example instances of the extracted values • these examples can be used with any wrapper induction method

  42. purifying the evidence • since every pattern has the precision attribute p, we can say that up to a fraction (1-p) of the template's evidence can be false • we can group the evidence into segments based on their absolute XPath • then we calculate the sum of confidences of all evidence in each segment and ignore the fraction (1-p) of segments with the lowest sums
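The purification step can be sketched as follows. Evidence is represented here as (XPath, confidence) pairs, segments are formed by grouping on the XPath string exactly as given, and the number of dropped segments is rounded down; all three choices are assumptions, since the slides do not fix them.

```python
import math
from collections import defaultdict


def purify(evidences, p):
    """Drop the weakest (1 - p) fraction of segments.

    evidences: list of (xpath, confidence) pairs
    p: precision of the template that produced the evidence
    """
    # Sum the confidences per segment (one segment per XPath string).
    sums = defaultdict(float)
    for xpath, confidence in evidences:
        sums[xpath] += confidence

    # Rank segments from weakest to strongest and pick the ones to ignore.
    ranked = sorted(sums, key=lambda xpath: sums[xpath])
    to_drop = set(ranked[: math.floor((1.0 - p) * len(ranked))])

    return [(x, conf) for x, conf in evidences if x not in to_drop]


EVIDENCES = [
    ("/HTML/BODY/I", 0.9),
    ("/HTML/BODY/I", 0.8),
    ("/HTML/BODY/B", 0.4),
    ("/HTML/TITLE", 0.1),
    ("/HTML/BODY/P", 0.5),
]

# Four segments and p = 0.75, so one segment (the weakest) is ignored.
KEPT = purify(EVIDENCES, p=0.75)
print(KEPT)
```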

  43. generalizing the segments • we generalize a segment by making the index in its XPath variable • by comparing the number of elements in the generalized segment with the original, we can use the completeness parameter to measure the probable error of such a generalization
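One possible reading of "variable index" is sketched below: positional indices are dropped at exactly those XPath steps where the examples disagree. The function name `generalize` and the restriction to simple steps of the form `NAME` or `NAME[n]` are assumptions for illustration.

```python
import re

# A simple location step: an element name with an optional positional index.
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def generalize(xpaths):
    """Generalize absolute XPaths that differ only in positional indices."""
    split = [xpath.strip("/").split("/") for xpath in xpaths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("examples must have the same depth")

    result = []
    for steps in zip(*split):
        parsed = [STEP.match(step).groups() for step in steps]
        names = {name for name, _ in parsed}
        if len(names) != 1:
            raise ValueError("examples must follow the same element names")
        indices = {index for _, index in parsed}
        # Keep the step as is when all examples agree on it,
        # otherwise drop the index so that it becomes variable.
        result.append(steps[0] if len(indices) == 1 else names.pop())
    return "/" + "/".join(result)


EXAMPLES = ["/HTML/BODY[1]/I[1]", "/HTML/BODY[1]/I[2]", "/HTML/BODY[1]/I[4]"]
GENERAL = generalize(EXAMPLES)
print(GENERAL)
```

Here the generalized query also covers `I[3]`, which was not among the examples; comparing the number of elements it selects with the number of original examples gives the ratio that the slide relates to the completeness parameter.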

  44. matching the segments • we can match the segments found for several datatype properties and thus form complex rules for extracting instances of ontology classes • the matching can be based on the number of their elements or on the conformity of their XPath

  45. future work suggestions • integration with some wrapper generation tool • automatic learning of the patterns • using other properties of ontologies, such as cardinalities

  46. thank you for your time
