XML on Semantic Web


Outline • The Semantic Web • Ontology • XML • Probabilistic DTD • References

The Semantic Web (1/4) • The first generation Web • The second generation Web: the current Web • The third generation Web: the Semantic Web • The conceptual structuring of the Web in an explicit, machine-readable way • Requirements: universal expressive power, support for syntactic interoperability, support for semantic interoperability

The Semantic Web (2/4) • Syntactic interoperability concerns parsing the data; semantic interoperability means defining mappings between unknown terms and known terms in the data • Semantic interoperability requires standards for both the syntactic form of documents and their semantic content • A further representation and inference layer is needed on top of the currently available layers of the WWW: ontology

The Semantic Web (3/4)

The Semantic Web (4/4)

Ontology (1/5) • An explicit, machine-readable specification of a shared conceptualization • Crucial role: representing a shared conceptualization of a particular domain • Reusable • Helps find pages that contain syntactically different but semantically similar words • Constructs: concepts (usually organized in taxonomies), relations, functions, axioms, instances

Ontology (2/5)

Ontology (3/5) • Concepts: • Can be anything about which something is said • Also known as classes (XOL, RDF(S), OIL, DAML+OIL), objects (OML), categories (SHOE) • Taxonomies: • Organize ontological knowledge using generalization and specialization relationships, through which single and multiple inheritance can be applied

Ontology (4/5) • Relations and functions: • An interaction between concepts of the domain and attributes • Called relations in SHOE and OML, roles in OIL • Functions are a special kind of relation • Axioms: • Used for constraining information, verifying correctness, and deducing new information • Also known as assertions (OML), rules, logic

Ontology (5/5) • Instances: • Represent elements in the domain attached to a specific concept • Measurement of the expressiveness: • XOL, RDF(S), SHOE, OML, OIL, DAML+OIL

XML (1/7) • As a serialization syntax for other markup languages, e.g. SMIL, XOL, SHOE • As semantic markup of Web pages • As a uniform data-exchange format

XML (2/7) • Universal expressive power: anything can be encoded in XML if a grammar can be defined for it • Syntactic interoperability: an XML parser can parse any XML data and is usually a reusable component • Semantic interoperability: there is no way of recognizing a semantic unit from a particular domain of interest (not yet widely recognized)

XML (3/7)

XML (4/7) • Data exchange: • Build a model of the domain of interest • From the domain model, a DTD or an XML Schema is constructed • Advantage: reusability of the parsing software components • There are multiple ways to encode a given domain model in a DTD, so the direct connection from the DTD to the domain model is lost and cannot easily be reconstructed
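To illustrate the last point, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The two documents, their tag names, and the helper functions `read_a` and `read_b` are hypothetical: both documents encode the same fact, the same parser handles both, yet each encoding needs its own extraction code because nothing in the XML says which element plays which domain role.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same domain fact: "Alice works at Acme".
DOC_A = "<employee><name>Alice</name><employer>Acme</employer></employee>"
DOC_B = '<company name="Acme"><staff person="Alice"/></company>'


def read_a(xml_text):
    """Extract (person, company) from the element-based encoding."""
    root = ET.fromstring(xml_text)
    return root.findtext("name"), root.findtext("employer")


def read_b(xml_text):
    """Extract (person, company) from the attribute-based encoding."""
    root = ET.fromstring(xml_text)
    return root.find("staff").get("person"), root.get("name")


fact_a = read_a(DOC_A)
fact_b = read_b(DOC_B)
print(fact_a == fact_b)  # both encodings carry the same domain fact
```

The parser is reused unchanged (syntactic interoperability), while the knowledge that `<employer>` and the `name` attribute of `<company>` mean the same thing lives only in the hand-written readers.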

XML (5/7)

XML (6/7) • A direct mapping based on the different DTDs is not possible • So we have to define the mappings between the different domain models first, then between the different DTDs: • Re-engineering the original domain model from the DTD or XML Schema • Establishing mappings between the entities in the domain models • Defining translation procedures for XML documents • Using a more suitable formalism than pure XML can save much of this additional effort
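The three steps above can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `Employment` entity, the source and target encodings, and the functions `from_source` and `to_target` are not taken from any of the referenced papers.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Employment:
    """Hypothetical domain-model entity shared by both encodings."""
    person: str
    company: str


def from_source(xml_text):
    """Step 1: re-engineer the domain entity from the source encoding."""
    root = ET.fromstring(xml_text)
    return Employment(root.findtext("name"), root.findtext("employer"))


def to_target(fact):
    """Steps 2 and 3: map the entity and emit it in the target encoding."""
    company = ET.Element("company", {"name": fact.company})
    ET.SubElement(company, "staff", {"person": fact.person})
    return ET.tostring(company, encoding="unicode")


source = "<employee><name>Alice</name><employer>Acme</employer></employee>"
translated = to_target(from_source(source))
print(translated)
```

The translation goes through the domain model rather than from one tag set to the other directly, which is the extra effort the slide refers to.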

XML (7/7)

Probabilistic DTD (1/11) • Describes the most likely orderings of XML tags and contains statistical properties for each tag • Utilizes association rule discovery algorithms and sequence mining techniques

Probabilistic DTD (2/11) • Objectives: tagging all text documents and deriving an appropriate preliminary flat XML DTD • A knowledge discovery in textual databases (KDT) process builds clusters of semantically similar text units, after which new documents can be converted into XML documents

Probabilistic DTD (3/11) • UML schema: initially conceived by experts, it serves as a reference for the DTD, but there is no guarantee that the final DTD will be contained in or contain this schema • KDT process: • Tagging initial text documents • Domain knowledge, such as a thesaurus and the preliminary UML schema, is input to the process • Pre-processing • Iterative clustering • Post-processing • Establishing a probabilistic DTD

Probabilistic DTD (4/11)

Probabilistic DTD (5/11) • Pre-processing: • Setting the level of granularity • NLP processing such as tokenization, normalization, word stemming • Building text unit descriptors: a reduced feature space (currently chosen by the engineer) • Mapping all text units into Boolean vectors of this feature space • Extracting named entities
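A minimal sketch of the mapping from text units to Boolean vectors follows. The descriptor list, the sample text units, and the crude suffix-stripping `stem` function are assumptions for illustration; a real pipeline would use a proper stemmer, and named entity extraction is left out.

```python
import re

# Hypothetical feature space; on the slide it is chosen by the engineer.
DESCRIPTORS = ["share", "capital", "manag", "director"]


def stem(token):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_vector(text_unit):
    """Tokenize, normalize to lower case, stem, then test each descriptor."""
    tokens = re.findall(r"[a-z]+", text_unit.lower())
    stems = {stem(t) for t in tokens}
    return [d in stems for d in DESCRIPTORS]


units = ["The share capital is 50,000 DM.", "Managing directors: A. Smith"]
vectors = [to_vector(u) for u in units]
for unit, vector in zip(units, vectors):
    print(vector, unit)
```

Each text unit ends up as one Boolean vector with one position per descriptor, which is the input the clustering step works on.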

Probabilistic DTD (6/11) • Clustering: • Performed in multiple iterations; each iteration outputs a set of clusters • All text unit vectors are clustered • Clusters are partitioned into “acceptable” and “unacceptable” according to quality criteria • Members of “unacceptable” clusters are the input data to the next iteration
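The iteration scheme can be sketched as follows. The slide names neither the clustering algorithm nor the quality criteria, so this sketch swaps in assumptions of its own: a greedy leader clustering on Jaccard similarity, a looser similarity threshold in each pass, and a minimum cluster size as the only quality criterion.

```python
def jaccard(a, b):
    """Jaccard similarity of two Boolean vectors (1.0 if both are all-False)."""
    union = sum(x or y for x, y in zip(a, b))
    inter = sum(x and y for x, y in zip(a, b))
    return inter / union if union else 1.0


def leader_clusters(vectors, threshold):
    """Greedy one-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []
    for v in vectors:
        for c in clusters:
            if jaccard(c[0], v) >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters


def iterative_clustering(vectors, thresholds=(1.0, 0.5), min_size=2):
    """Run one clustering pass per threshold, carrying rejected members forward."""
    accepted, remaining = [], list(vectors)
    for threshold in thresholds:
        rejected = []
        for cluster in leader_clusters(remaining, threshold):
            if len(cluster) >= min_size:  # hypothetical quality criterion
                accepted.append(cluster)
            else:
                rejected.extend(cluster)
        remaining = rejected
    return accepted, remaining


data = [(1, 1, 0), (1, 1, 0), (0, 1, 1), (0, 0, 1), (1, 0, 0)]
accepted, leftover = iterative_clustering(data)
print(accepted)
print(leftover)
```

Only the members of rejected clusters are re-clustered in the next pass, so accepted clusters stay fixed once found.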

Probabilistic DTD (7/11) • Post-processing: • “Acceptable” clusters are semi-automatically assigned a label • Ultimately, cluster labels are determined by the engineer • All default cluster labels are derived from text unit descriptors • An XML DTD is automatically derived from the XML tags
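One way to picture the last step is the sketch below, which turns a set of cluster labels into a flat DTD. The shape of the DTD (a root element allowing any tag in any order, each tag holding plain text), the function name `flat_dtd`, and the sample labels are assumptions; the slides do not show the derived DTD itself.

```python
def flat_dtd(root, tags):
    """Build a flat DTD: the root may contain any of the tags, in any order."""
    names = sorted(set(tags))
    if not names:
        raise ValueError("at least one tag is needed")
    lines = [f"<!ELEMENT {root} ({' | '.join(names)})*>"]
    # Every tag is declared as a leaf element holding character data.
    lines += [f"<!ELEMENT {name} (#PCDATA)>" for name in names]
    return "\n".join(lines)


# Hypothetical cluster labels used as XML tags.
dtd = flat_dtd("document", ["ShareCapital", "BusinessPurpose", "ShareCapital"])
print(dtd)
```

Such a DTD records which tags exist but says nothing about their order, which is what the probabilistic DTD adds.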

Probabilistic DTD (9/11) • Establishing a probabilistic DTD: • Deriving the most likely ordering of the tags • Computing the statistical properties of each tag inside the document type definition • Deriving the ordering of the tags: • Backward construction of DTD sequences: builds “maximal” sequences • Forward sequence construction

Probabilistic DTD (10/11) • Backward construction of DTD sequences • Starts with an arbitrary tag y and then identifies the tag most likely to appear before it • If no such tag exists, the algorithm moves on to the next sequence. If there is one, the next iteration starts. If there are k such tags, the incomplete sequence is duplicated into k sequences. • Each tag Xi leads to y with a confidence Ci • If one Ci is larger than the others, then Xi is the predecessor of y in the sequence • If C0, the confidence that y has no predecessor, is the largest, then y is the first element • Confidence is the tag’s TagSupport multiplied by the accuracy
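A simplified sketch of the backward step is given below. It assumes the confidences have already been computed and are passed in as a dictionary, with the key `None` standing for C0; the tag names and numbers are made up. Unlike the procedure on the slide, it follows only one best predecessor and does not duplicate the sequence when several candidates tie.

```python
def backward_sequence(start, predecessors):
    """Extend a sequence backwards from `start` until no predecessor wins.

    `predecessors[tag]` maps each candidate predecessor to its confidence;
    the key None holds C0, the confidence that `tag` has no predecessor.
    """
    sequence = [start]
    current = start
    while True:
        candidates = predecessors.get(current, {None: 1.0})
        best = max(candidates, key=candidates.get)
        if best is None or best in sequence:  # first element reached, or a cycle
            return sequence
        sequence.insert(0, best)
        current = best


# Hypothetical confidences for three tags.
CONF = {
    "capital": {"name": 0.6, "address": 0.3, None: 0.1},
    "address": {"name": 0.7, None: 0.3},
    "name": {"address": 0.1, None: 0.9},
}

print(backward_sequence("capital", CONF))
print(backward_sequence("address", CONF))
```

The loop stops as soon as C0 is the largest confidence for the current tag, which makes that tag the first element of the sequence.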

References • Stefan Decker, Frank van Harmelen, Jeen Broekstra, Michael Erdmann, Dieter Fensel, Ian Horrocks, Michel Klein, Sergey Melnik: The Semantic Web - on the respective Roles of XML and RDF • Weihua Li: Intelligent Information Agent with Ontology on the Semantic Web • Asuncion Gomez-Perez, Oscar Corcho: Ontology Languages for the Semantic Web • Karsten Winkler, Myra Spiliopoulou: Extraction of Semantic XML DTDs from Texts Using Data Mining Techniques
