CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226)

David Meredith
d.meredith@gold.ac.uk
www.titanmusic.com/teaching/cis336-2006-7.html

Lecture 3: XPath
(Based on Møller and Schwartzbach, 2006, Chapter 3)

What is XPath?
• XPath is a language for selecting parts of, and navigating around, an XML tree
• Used in
  • XML Schema for uniqueness and scope descriptions
  • XSLT for pattern matching and selection
  • XQuery for selection and iteration
  • XLink and XPointer
• XPath can also be used to do computations on data values
• Example of an XPath expression:
    //rcp:ingredient[@amount='0.5' and @unit='cup']/@name
• This expression uses abbreviations (// and @) and tacit conventions
• XPath 1.0 was a relatively simple language
• Through interaction with XQuery, it has become a much larger language called XPath 2.0
• XPath 2.0 specification is a W3C proposed recommendation (November 2006), available here: http://www.w3.org/TR/xpath20/

Location steps and paths
• An XPath location path evaluates to a sequence of nodes in an XML tree
• Sequence never contains duplicates of identical nodes
• However, it is possible for two or more distinct nodes to contain the same values and therefore be "equal"
• Location path is a sequence of location steps, separated with the / character, e.g.,
    child::rcp:recipe[attribute::id='117']/child::rcp:ingredient/attribute::amount
• This expression selects…
  • … all the amount attributes…
  • … in rcp:ingredient nodes that are children of…
  • … rcp:recipe nodes with an attribute called id with value '117' that are…
  • … children of the context node (which is assumed here to be the root element)
• Expression consists of three location steps, each with the format
    axis::nodetest[pred1][pred2]…
  where axis is the axis, nodetest is a node test and pred1 and pred2 are predicates, which are XPath expressions
• Axis, node test and predicates are increasingly specific definitions of the sequence of nodes that the location step selects

Location step maps context onto new sequence of nodes
• A location step is always evaluated relative to a context and always evaluates to a sequence of nodes
• The context is itself a sequence of nodes
• Therefore a location step transforms one sequence of XML tree nodes, called the context, into another sequence of XML tree nodes
• The output sequence is generated by concatenating the results of replacing each node, x, in the input sequence with the result of evaluating the location step relative to x as the context node
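
The step-by-step mapping described above can be sketched in Python. This is a simplified model, not an XPath engine: the small document, the helper names (child, descendant, apply_step, evaluate) and the tree shape are all assumptions made for illustration, and the result is not re-sorted into document order as a real implementation would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, loosely modelled on a tree with nodes labelled A to F.
DOC = ET.fromstring(
    "<A><B><C><E><F/><F/></E></C></B><B><C><E><F/></E><D/></C></B></A>"
)

def child(node, name):
    """child::name relative to one context node."""
    return [c for c in node if c.tag == name]

def descendant(node, name):
    """descendant::name relative to one context node (the node itself is excluded)."""
    return [d for d in node.iter(name) if d is not node]

def apply_step(context, axis, name):
    """Evaluate one location step on each context node in turn and
    concatenate the results, dropping duplicates of identical nodes."""
    result = []
    for x in context:
        for n in axis(x, name):
            if not any(n is seen for seen in result):
                result.append(n)
    return result

def evaluate(context, steps):
    """Feed the output sequence of each step in as the context of the next."""
    for axis, name in steps:
        context = apply_step(context, axis, name)
    return context

# descendant::C/child::E/child::F, with the element A as the only context node
f_nodes = evaluate([DOC], [(descendant, "C"), (child, "E"), (child, "F")])
print([n.tag for n in f_nodes])
```

Each call to apply_step plays the role of one location step: its input list is the context and its output list becomes the context of the next step.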

An Example
[Figure: a tree with root A and descendants labelled B, C, D, E and F, shown in five stages: the whole tree; the context node; the nodes selected by descendant::C; by descendant::C/child::E; and by descendant::C/child::E/child::F]

Contexts
• The context of an XPath evaluation consists of
  • a context node (a node in an XML tree)
  • a context position and size
    • If the location path is
        descendant::C/child::E/child::F
      and the first two location steps have been evaluated, then we have the sequence (E1, E2, E3) shown in the figure
    • The child::F location step is then evaluated on E1, E2 and E3 in turn
    • In each case, the context size is 3 because the input sequence for this step contains 3 nodes: (E1, E2, E3)
    • When child::F is evaluated on E1, the context position is 1; when evaluated on E2, the context position is 2; and so on
  • a set of variable bindings (mapping variable names to values)
  • a function library
    • XPath specification guarantees that the context provides a set of core functions
  • a set of namespace declarations
    • For example, in our examples, we assume that the namespace http://www.brics.dk/ixwt/recipes is bound to the namespace prefix rcp
• The application determines the initial context
• If the path starts with '/' then
  • the initial context node is the root node (not the root element)
  • the initial position and size are 1
[Figure: the example tree with the three selected E nodes numbered 1, 2 and 3]

Axes
• An axis is a sequence of nodes evaluated relative to the context node
• First approximation to the sequence of nodes we want to obtain as the result of a location step
• XPath supports 12 different axes:
  • child: children of context node (NB: excludes attribute nodes)
  • descendant: descendants of context node (NB: excludes attribute nodes)
  • parent: unique parent node, or empty sequence if context node is root node
  • ancestor: all ancestors of the context node, from parent to root node
  • following-sibling: right-hand siblings of context node (empty sequence if context node is an attribute node)
  • preceding-sibling: left-hand siblings of context node (empty sequence if context node is an attribute node)
  • following: all nodes appearing later in the document than the context node (excludes descendants of context node)
  • preceding: all nodes appearing before context node in document (excludes ancestors of context node)
  • attribute: every attribute node whose parent is the context node (order is implementation-dependent, but stable, i.e., always same ordering for a given input)
  • self: the context node itself (actually a sequence containing just the context node)
  • descendant-or-self: concatenation of self and descendant axes
  • ancestor-or-self: concatenation of self and ancestor axes

Axis Directions
• Each axis has a direction
• Forwards means document order: child, descendant, following-sibling, following, self, descendant-or-self
• Backwards means reverse document order: parent, ancestor, preceding-sibling, preceding, ancestor-or-self
• Stable (i.e., always same output for same input) but depends on the implementation: attribute
• self, ancestor, descendant, preceding and following together form a disjoint partition of all the nodes in an XML tree
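
The partition property in the last bullet can be checked with a short Python sketch. The document and the helper names are assumptions for illustration only; the model covers element nodes, and preceding is returned in document order here rather than in the reverse direction that the axis itself has.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are arbitrary.
root = ET.fromstring("<a><b><c/><d/></b><e><f><g/></f></e><h/></a>")
order = list(root.iter())                    # all elements in document order
parent = {c: p for p in order for c in p}    # child -> parent map

def ancestors(n):
    """ancestor axis: parent first, root element last."""
    out = []
    while n in parent:
        n = parent[n]
        out.append(n)
    return out

def descendants(n):
    """descendant axis: everything below n, excluding n itself."""
    return [d for d in n.iter() if d is not n]

def preceding(n):
    """Nodes before n in the document, excluding its ancestors."""
    anc = ancestors(n)
    return [x for x in order[:order.index(n)] if x not in anc]

def following(n):
    """Nodes after n in the document, excluding its descendants."""
    desc = descendants(n)
    return [x for x in order[order.index(n) + 1:] if x not in desc]

def is_partition(n):
    """True if self, ancestor, descendant, preceding and following cover
    every element exactly once."""
    parts = [[n], ancestors(n), descendants(n), preceding(n), following(n)]
    flat = [x for part in parts for x in part]
    return len(flat) == len(order) and set(flat) == set(order)

print(all(is_partition(n) for n in order))
```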

Axis examples
[Figures: the same tree, with nodes numbered 1 to 28, repeated once per axis; the numbers in parentheses are the nodes selected by that axis, in axis order]
• The parent Axis: (4)
• The child Axis: (14, 15)
• The descendant Axis: (14, 15, 16, 17, 18)
• The ancestor Axis: (4, 2, 1)
• The following-sibling Axis: (19)
• The preceding-sibling Axis: (9, 5)
• The following Axis: (19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
• The preceding Axis: (12, 11, 10, 9, 8, 7, 6, 5, 3)

Node tests
• Second part of a location step is the node test:
    child::rcp:recipe[attribute::id='117']/child::rcp:ingredient/attribute::amount
• Types of node test:
  • text() selects only the character data nodes
  • comment() selects only the comment nodes
  • processing-instruction() selects only the processing instruction nodes
  • node() selects all nodes
  • * selects all nodes on the axis that precedes the node test
    • If the axis is not attribute, then only element nodes are selected
  • name selects the nodes with the given QName, name
  • *:localname selects the nodes with the given NCName, localname, in any namespace
  • prefix:* selects all nodes in the same way as *, but only in the specified namespace

Resolving names without namespaces
[Figure: two versions of an example widget document; in the lower one every element name is explicitly qualified with the wdg prefix]
• In XPath 1.0, a missing namespace prefix is interpreted as the empty URI, "", not the default namespace
• Bug fixed in XPath 2.0, where an empty prefix is interpreted as the default namespace, not the empty URI
• However, most tools implement XPath 1.0
• Suppose we want to select the ref attribute of the subwidget element
• Could try:
    /child::widget/child::big/child::subwidget/attribute::ref
• But this won't work because each node name in the expression is interpreted as being from the empty namespace
• In fact, no XPath expression will work here!
• To work in both XPath 1.0 and 2.0, every element name has to be explicitly qualified with a namespace prefix, as in the lower example, and the XPath expression must be changed to
    /child::wdg:widget/child::wdg:big/child::wdg:subwidget/attribute::ref
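
The same pitfall shows up in tools that follow the XPath 1.0 convention. The sketch below uses Python's xml.etree.ElementTree, which supports only a small subset of XPath; the widget document and its namespace URI are invented for illustration. Here the prefix is bound in the query context rather than in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical widget document; the namespace URI is an assumption.
XML = """<widget xmlns="http://example.org/widgets">
  <big><subwidget ref="w42"/></big>
</widget>"""

root = ET.fromstring(XML)
ns = {"wdg": "http://example.org/widgets"}   # prefix binding supplied by the caller

# Unprefixed names are looked up in no namespace, so nothing matches.
unqualified = root.find("big/subwidget")

# Explicitly qualified names match the elements in the default namespace.
qualified = root.find("wdg:big/wdg:subwidget", ns)

print(unqualified)
print(qualified.get("ref"))
```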

Predicates
• Final part of a location step consists of zero or more predicates:
    child::rcp:recipe[attribute::id='117']/child::rcp:ingredient/attribute::amount
• A predicate is an XPath expression, evaluated as a boolean condition
• XPath expressions are as rich as expressions in a general purpose programming language like Java
• Can produce values of different types
  • e.g., numbers, booleans, strings, sequences
• When used as a predicate, the value of an XPath expression is coerced into a boolean value:
  • A number is coerced to true when it is equal to the current context position
  • A string is coerced to true when it has non-zero length
  • A sequence is coerced to true when it has non-zero length
• Boolean conditions can be combined using the operators and and or, and the function not
• Variables from the context can be referenced using the syntax $foo where foo is the variable name
• The usual arithmetic (+, -, *, div) and comparison (=, !=, <, <=, etc.) operators are also available
• Sometimes useful to use location paths as predicates:
    /descendant::rcp:recipe[descendant::rcp:ingredient[attribute::name='sugar']]
  selects every recipe node that contains sugar; whereas
    /descendant::rcp:recipe/descendant::rcp:ingredient[attribute::name='sugar']
  selects every ingredient node whose name is sugar

More on predicates
• The predicates in a location step are evaluated left-to-right
  • i.e., the first predicate is evaluated, producing a sequence of nodes which forms the context for the second predicate, and so on
• This means that changing the order of the predicates can change the value of the result of the expression
• For example:
    /descendant::rcp:ingredient[position()=3][position()=1]
  returns the third ingredient in the collection (it is the only node left after the first predicate, so it is at position 1 for the second); whereas
    /descendant::rcp:ingredient[position()=1][position()=3]
  returns the empty sequence because, once the first ingredient has been selected, there is only one element in the context when the second predicate is evaluated (and therefore no third element)
• If we combine two predicates with and, then this also generally gives a different result:
    /descendant::rcp:ingredient[position()=3 and position()=1]
  returns the empty sequence because there is no element whose position within the current context is both 3 and 1
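
Left-to-right predicate evaluation can be modelled in a few lines of Python. This is only a sketch: apply_predicate and the placeholder ingredient names are assumptions, and a plain list stands in for the sequence of nodes selected by the axis and node test.

```python
def apply_predicate(items, pred):
    """Keep the items for which pred(position, size) holds. Positions are
    1-based and are recomputed from whatever sequence is passed in."""
    size = len(items)
    return [x for pos, x in enumerate(items, start=1) if pred(pos, size)]

# Placeholder names standing in for four ingredient nodes.
ingredients = ["i1", "i2", "i3", "i4"]

# [position()=3][position()=1]
third_then_first = apply_predicate(
    apply_predicate(ingredients, lambda pos, size: pos == 3),
    lambda pos, size: pos == 1,
)

# [position()=1][position()=3]
first_then_third = apply_predicate(
    apply_predicate(ingredients, lambda pos, size: pos == 1),
    lambda pos, size: pos == 3,
)

# [position()=3 and position()=1]
both_at_once = apply_predicate(ingredients, lambda pos, size: pos == 3 and pos == 1)

print(third_then_first, first_then_third, both_at_once)
```

The second predicate always sees positions renumbered from 1, which is why the order of the predicates matters.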

Typical location paths
• XPath language is large, but you usually only use a small part of it
• There are a few patterns that are used particularly often
• Most commonly used axes are child, descendant and attribute
• *, text() and QName are the most commonly used node tests

Some examples
• /descendant::rcp:recipe/child::rcp:title
  • Selects every title node in every recipe
• /descendant::rcp:recipe/descendant::rcp:ingredient/attribute::name
  • Selects the name of every ingredient in every recipe
• /descendant::rcp:*/child::text()
  • Selects every character data node in the collection
• [attribute::amount]
  • Selects the nodes in the context that have an attribute called amount
• [attribute::amount='0.5']
  • Selects nodes in the context that have an attribute whose name is amount and whose value is '0.5'
• [attribute::name!='flour']
  • Selects nodes in the context that have a name attribute whose value is not 'flour' (nodes with no name attribute are not selected)
• [attribute::amount<0.5 and attribute::unit='cup']
  • Selects context nodes that have an amount attribute with a value less than 0.5 and a unit attribute whose value is cup
• [position()=2]
  • Selects the second node in the context
• [descendant::rcp:ingredient]
  • Selects those nodes in the context that contain an ingredient node

Using XPath in other languages
• XPath expressions often appear as attribute values in other XML languages (e.g., XML Schema and XSLT)
• When used in other XML languages, all special characters have to be escaped, e.g.
    <xsl:apply-templates select="descendant::rcp:ingredient[attribute::amount&lt;0.5]" />
  instead of
    <xsl:apply-templates select="descendant::rcp:ingredient[attribute::amount<0.5]" />
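
When such attribute values are generated by a program, the escaping can be left to a library. A minimal Python sketch, assuming the stylesheet fragment is assembled as a string; quoteattr from the standard library escapes the value and adds the surrounding quotes:

```python
from xml.sax.saxutils import quoteattr

# The XPath expression as one would write it outside XML.
xpath = "descendant::rcp:ingredient[attribute::amount<0.5]"

# quoteattr escapes &, < and > and wraps the value in quotes.
element = "<xsl:apply-templates select=%s />" % quoteattr(xpath)
print(element)
```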

Abbreviations
• XPath allows certain abbreviations that make expressions easier to write
• If no axis is used, the default axis of child is used
    /child::rcp:collection/child::rcp:recipe/child::rcp:ingredient
  is equivalent to
    /rcp:collection/rcp:recipe/rcp:ingredient
• attribute axis can be replaced with @
    /rcp:collection/rcp:recipe/rcp:ingredient/attribute::amount
  is equivalent to
    /rcp:collection/rcp:recipe/rcp:ingredient/@amount
• /descendant-or-self::node()/ can be replaced with //, e.g.,
    //rcp:recipe[rcp:title='Ricotta Pie']//rcp:ingredient
  selects all ingredient nodes in any recipe whose title is 'Ricotta Pie', no matter how deeply nested within the recipe node the ingredient node might be
• self::node() can be replaced with . and parent::node() can be replaced with .., e.g.,
    /descendant-or-self::node()/child::rcp:nutrition[attribute::calories=349]/parent::node()/child::rcp:title/child::text()
  can be abbreviated to
    //rcp:nutrition[@calories=349]/../rcp:title/text()
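
Abbreviated paths of this kind can be tried out with Python's xml.etree.ElementTree, which implements a limited subset of the abbreviated syntax. The sketch below rests on several assumptions: the miniature recipe document is invented, paths must be written relative to an element (hence the leading .), and attribute values in predicates must be quoted.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature recipe collection, invented for this sketch.
XML = """<rcp:collection xmlns:rcp="http://www.brics.dk/ixwt/recipes">
  <rcp:recipe>
    <rcp:title>Ricotta Pie</rcp:title>
    <rcp:ingredient name="filling">
      <rcp:ingredient name="ricotta cheese"/>
    </rcp:ingredient>
    <rcp:nutrition calories="349"/>
  </rcp:recipe>
  <rcp:recipe>
    <rcp:title>Other</rcp:title>
    <rcp:ingredient name="salt"/>
    <rcp:nutrition calories="100"/>
  </rcp:recipe>
</rcp:collection>"""

root = ET.fromstring(XML)
ns = {"rcp": "http://www.brics.dk/ixwt/recipes"}

# // reaches ingredients at any depth inside the matching recipe.
pie_ingredients = root.findall(
    ".//rcp:recipe[rcp:title='Ricotta Pie']//rcp:ingredient", ns
)
names = [i.get("name") for i in pie_ingredients]

# .. steps from the nutrition element back up to its recipe.
titles = [
    t.text
    for t in root.findall(".//rcp:nutrition[@calories='349']/../rcp:title", ns)
]

print(names, titles)
```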

Some subtleties with abbreviations
• Any expression beginning with / is evaluated with the root node (not the root element) as the context, so
    //rcp:recipe/rcp:ingredient[//rcp:ingredient]
  returns all ingredients that are not inside composite ingredients (the predicate is evaluated from the root node, so it holds whenever the document contains any ingredient at all), whereas
    //rcp:recipe/rcp:ingredient[.//rcp:ingredient]
  returns all composite ingredients
• Note also that
    //rcp:ingredient[1]
  selects all ingredient nodes that are first among their siblings, whereas
    /descendant::rcp:ingredient[1]
  selects the first ingredient in the collection, since the predicate selects from the single sequence of nodes that satisfies the axis and node test

General expressions
• XPath has grown into a large language for expressing computations on sequences
• XPath 2.0 has many features motivated by its use for selection within XQuery
• Keywords and operators in XPath 2.0:
    $  ,  to  |  union  intersect  except  .
    +  -  *  div  idiv  mod
    and  or
    =  !=  >  >=  <  <=
    eq  ne  lt  le  gt  ge
    is  <<  >>
    for  in  if  then  else  some  every  satisfies

Values and atomization
• Every XPath expression evaluates to a sequence of items
  • Sequence may be empty
• Each item in the sequence can be
  • A node
  • An atomic value, which can be
    • A number (integer, decimal, float or double)
    • A boolean value
    • A string of Unicode characters
    • A datatype defined in XML Schema
• Note that the result of an expression is always a sequence, even if the sequence only contains one item
• XPath interprets a single atomic value as a singleton sequence containing that value
• Remember that a single node contains all its descendants and therefore denotes the subtree of which it is the root
• Atomizing a sequence means converting it into a sequence of atomic values
• This is done by converting every node into its string value thus:
  • The string value of a text node is its contents
  • The string value of an element is the concatenation in document order of the string values of all descendant text nodes
  • The string value of an attribute node is the value of the attribute
  • The string value of a comment node is the comment text
  • The string value of a processing instruction node is the processing instruction value
  • The string value of a root node is the concatenation in document order of the string values of all descendant text nodes
• For example, the XPath location path
    /rcp:collection/rcp:recipe[@id='r101']/rcp:ingredient/@name
  returns the sequence containing the names of the ingredients in the first recipe in the collection; its atomized sequence is
    beef cube steak
    onion, sliced into thin rings
    green bell pepper, sliced in rings
    Italian seasoned bread crumbs
    grated Parmesan cheese
    olive oil
    spaghetti sauce
    shredded mozzarella cheese
    angel hair pasta
    minced garlic
    butter
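
The string-value rules for elements and attributes can be reproduced with Python's xml.etree.ElementTree. This is a small sketch under assumptions: the fragment below is invented, and itertext() is used as a stand-in for "concatenate all descendant text nodes in document order".

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with mixed content and one attribute.
elem = ET.fromstring('<p id="x7">XPath is <b>a lot</b> of fun</p>')

# String value of an element: all descendant text, in document order.
element_value = "".join(elem.itertext())

# String value of an attribute node: the attribute's value.
attribute_value = elem.get("id")

print(element_value)
print(attribute_value)
```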

Literal expressions
• A literal expression is a singleton sequence containing a constant atomic value
• Literal numbers written in expected way, thus
    42 is an integer
    3.1415 is a decimal
    6.022E23 is a double
• Literal strings enclosed in single or double quotes, thus
    'XPath is a lot of fun'
    "XPath is a lot of fun"
• When a string contains single or double quotes, there are two solutions: enclose it in the other kind of quote, or double the quote character inside the string
    'The cat said "Meow"'
    "The cat said ""Meow"""
• No literal boolean values
  • Use constant functions, true() and false()

Comments
• Insert a comment into an XPath expression by using the following syntax:
    (: this is a comment :)

Variable references
• Variable references are written as follows:
    $foo refers to the variable foo
• Variable may be
  • bound within context of expression
  • created through a binding in a for expression or quantified expression
• Variable name may be any QName
• Variable name may belong to a specific namespace:
    $bar:foo

Arithmetic expressions
• For integers, decimals, floats and doubles:
    +  -  *  div  -n (unary minus)
• For integers:
    idiv (integer division)  mod (modulo)
• Every argument is actually a sequence
  • if an argument is the empty sequence, then the result is an empty sequence
  • if all arguments are singleton sequences containing numbers of the expected type, then the operation is performed and the result is returned as a sequence
  • otherwise, a runtime error occurs
• Variable names are QNames and can therefore contain a minus sign (-)
• Thus:
    $foo-17 is a reference to a variable called foo-17
• If we want to subtract 17 from foo, then we have to write:
    ($foo)-17
    $foo -17
    $foo+-17
  or anything else that separates the foo from the 17

Sequence expressions
• If each expi is an expression, then
    exp1, exp2, exp3, ..., expn
  constructs a new sequence which is the concatenation of all the expressions, expi
• Sequences are always flattened on concatenation, so it is impossible to produce nested sequences
• Thus,
    (1, (2, 3, 4), ((5)), (), (((6, 7), 8), 9))
  produces the same sequence as
    1, 2, 3, 4, 5, 6, 7, 8, 9
• Expression
    exp1 to exp2
  requires that exp1 and exp2 are both singleton sequences that evaluate to integers, and the whole expression evaluates to the sequence
    exp1, exp1+1, ..., exp2
  • e.g., the sequence above could also be expressed as
      1 to 9
• Node sequences can be combined using set operators:
    union (or |)
    intersect
    except (which means set difference)
  • each performs the set operation and returns the result sequence in document order, containing no duplicates of identical nodes
• For example, a sequence exp can be sorted into document order, with duplicates of identical nodes removed, using the expression
    exp | ()
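
The way the comma operator never produces nested sequences can be modelled in Python. This is only an illustration: the functions seq and to are hypothetical stand-ins for the XPath comma and to operators, with Python tuples playing the role of sequences.

```python
def seq(*items):
    """Concatenate items as the XPath comma operator would: any nested
    sequence (modelled as a tuple) is flattened into the result."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            out.extend(seq(*item))
        else:
            out.append(item)
    return tuple(out)

def to(first, last):
    """Model of 'first to last': consecutive integers, both ends included."""
    return tuple(range(first, last + 1))

# (1, (2, 3, 4), ((5)), (), (((6, 7), 8), 9))
nested = seq(1, (2, 3, 4), ((5,),), (), (((6, 7), 8), 9))

print(nested)
print(nested == to(1, 9))
```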

Path expressions
• Location paths are XPath expressions
• A location path is evaluated in a sequence of location steps, starting with a context
• Can also evaluate a location path relative to any arbitrary node sequence returned by some XPath expression
  • result is the concatenation of the results of evaluating the location path with each node in the input sequence in turn
  • context position of each node in the input sequence is its position within the input sequence
  • context size is the length of the input sequence
• For example
    (fn:doc("veggie.xml"), fn:doc("bbq.xml"))//rcp:title
  returns the titles of all recipes in both files
  • fn:doc function returns the root node of a document

Filter expressions
• Location path predicate is a special type of filter expression
• Filter expression can be applied to any sequence containing nodes and/or atomic values
• Syntax:
    exp[filter]
  where
  • exp is an expression that evaluates to a sequence
  • filter is a filter expression that selects those elements in exp for which filter is true
  • inside filter, the current item in exp is referred to by the symbol .
  • current context position is the position within exp
  • current context size is the size of exp
• Example:
    (30 to 60)[. mod 5 = 0 and position()>20]
  has the same result as the expression
    50, 55, 60
  (remember that , is the symbol for concatenation in XPath)
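
The filter example can be reproduced with a small Python model. The function filter_seq and the lambda's parameter names are assumptions chosen to mirror the XPath notions of the current item (.), position() and context size.

```python
def filter_seq(items, pred):
    """Model of exp[filter]: pred receives the current item, its 1-based
    position within exp, and the size of exp."""
    size = len(items)
    return [x for pos, x in enumerate(items, start=1) if pred(x, pos, size)]

# (30 to 60)[. mod 5 = 0 and position() > 20]
result = filter_seq(
    list(range(30, 61)),
    lambda dot, position, size: dot % 5 == 0 and position > 20,
)
print(result)
```

The value 50 sits at position 21 of the sequence 30 to 60, which is why it is the first item to pass both conditions.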

Comparison expressions
• There are three types of comparison expressions in XPath:
  • Value comparisons
    • used to compare atomic values
  • General comparisons
    • can be used to compare all values
  • Node comparisons
    • used to compare nodes on identity and document order

Value comparisons
• Value comparison operators are used to compare atomic values
• Value comparison operators are
    eq  ne  lt  le  gt  ge
• When applied to two arbitrary values, the following procedure is carried out:
  • the two values are atomized
  • if either resulting sequence is empty, the result is the empty sequence
  • if either sequence has more than one element, the result is a type error
  • if the two atomic values (represented by two singleton sequences) are not comparable (e.g., 7 and "abc"), a runtime error occurs
  • otherwise the result is obtained by comparing the two atomic values
• For example, the following expressions all evaluate to true:
    8 eq 4 + 4
    //rcp:description/text() eq "Some recipes used in the XML tutorial."
    (//rcp:ingredient)[1]/@name eq "beef cube steak"

General comparisons
• General comparison operators are used to compare all values
• General comparison operators are
    =  !=  <  <=  >  >=
• When applied to two values, the following steps are performed:
  • the two arguments are atomized
  • if there exists at least one pair of atomic values, one from each argument, for which the comparison holds, the result is true
  • otherwise the result is false
• For example, the following all evaluate to true:
    8 = 4+4
    (1,2) = (2,4)
    (2,4) = (3,4)
    //rcp:ingredient/@name='salt'
• This type of equality is not transitive: (1,2) = (3,4) is false, even though (1,2) = (2,4) and (2,4) = (3,4) are both true
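
The existential rule behind general comparisons can be sketched in Python. The names general_eq and general_ne are hypothetical, and tuples of numbers stand in for already atomized sequences.

```python
from itertools import product

def general_eq(left, right):
    """Model of 'left = right': true if some pair of items, one taken
    from each sequence, is equal."""
    return any(a == b for a, b in product(left, right))

def general_ne(left, right):
    """Model of 'left != right': true if some pair of items differs."""
    return any(a != b for a, b in product(left, right))

print(general_eq((1, 2), (2, 4)))   # the pair (2, 2) matches
print(general_eq((2, 4), (3, 4)))   # the pair (4, 4) matches
print(general_eq((1, 2), (3, 4)))   # no pair matches
print(general_ne((1, 2), (1, 2)))   # the pair (1, 2) differs
```

Under this model a sequence can be both = and != to another sequence at the same time, so != is not simply the negation of =.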

Node comparisons
• Node comparison operators are used to compare nodes for identity and document order
• Node comparison operators are:
    is   arguments refer to the same node (identity)
    <<   first argument precedes second in document order
    >>   first argument follows second in document order
• When applied to two arguments, the following steps are performed:
  • if either sequence is empty, returns the empty sequence
  • if both arguments are singleton sequences containing nodes, then the result is the boolean value of the comparison
  • otherwise a runtime error occurs
• For example, the following all evaluate to true:
    (//rcp:recipe)[2] is //rcp:recipe[rcp:title/text() eq "Ricotta Pie"]
    /rcp:collection << (//rcp:recipe)[4]
    (//rcp:recipe)[4] >> (//rcp:recipe[3])
• Note that, in an XSLT or XQuery file, the operators << and >> must be rendered as &lt;&lt; and &gt;&gt;, respectively

Comparison confusions
• Must always carefully consider whether a comparison should be a value, general or node comparison
  • making the wrong decision could lead to an unexpected result
• For example, given that the 40th and 53rd ingredients in the recipes.xml file are different amounts of salt:
    (//rcp:ingredient)[40]/@name eq (//rcp:ingredient)[53]/@name
  returns true, but
    (//rcp:ingredient)[40]/@name is (//rcp:ingredient)[53]/@name
  returns false

Boolean expressions
• Operators and and or accept arguments of any type, which are then coerced to effective boolean values
• Following are coerced to boolean value true:
  • boolean value true
  • a non-empty string
  • a non-zero number
  • a sequence in which the first item is a node
• Following are coerced to boolean value false:
  • boolean value false
  • empty string
  • 0
  • empty sequence
• Otherwise, result is undefined or an error
• Boolean values true and false can be constructed using the functions true() and false()
• Boolean value can be negated using the function not(exp)
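
The coercion rules listed above can be written down as a small Python function. This is a sketch under assumptions: Node is a hypothetical placeholder class for XML nodes, lists and tuples model sequences, and the "otherwise" case is modelled by raising TypeError. NaN is treated as false, which matches XPath but is not stated on the slide.

```python
class Node:
    """Hypothetical placeholder standing in for an XML node."""

def effective_boolean_value(value):
    """Coerce a value to a boolean following the rules listed above."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return len(value) > 0
    if isinstance(value, (int, float)):
        # Non-zero numbers are true; value == value is false only for NaN.
        return value != 0 and value == value
    if isinstance(value, (list, tuple)):
        if len(value) == 0:
            return False                      # empty sequence
        if isinstance(value[0], Node):
            return True                       # first item is a node
        if len(value) == 1:
            return effective_boolean_value(value[0])
    raise TypeError("no effective boolean value for %r" % (value,))

print(effective_boolean_value("XPath"))
print(effective_boolean_value(()))
print(effective_boolean_value([Node(), 1]))
```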

Functions
• XPath 2.0 and XQuery 1.0 functions are defined in the proposed W3C recommendation, which is available here: http://www.w3.org/TR/xpath-functions/
• To use functions, the context must contain a declaration of the namespace http://www.w3.org/2005/xpath-functions/
  • This URI is also the URL of a page that summarises all the available XPath 2.0 functions
  • This namespace is traditionally given the prefix fn in a namespace declaration
• The XML Schema namespace, http://www.w3.org/2001/XMLSchema, also defines some useful functions for coercion and constructing data values
  • XML Schema namespace is traditionally given the prefix xs

Arithmetic Functions
    fn:abs(-23.4) = 23.4
    fn:ceiling(-23.4) = -23
    fn:floor(23.4) = 23
    fn:round(23.4) = 23
    fn:round(-23.4) = -23
    fn:round(23.5) = 24
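
For comparison, the same values can be computed in Python. One caveat, noted here as background rather than something the slide states: fn:round sends ties towards positive infinity, whereas Python's built-in round sends ties to the nearest even integer, so the sketch defines a hypothetical helper fn_round instead of using round directly.

```python
import math

def fn_round(x):
    """Round to the nearest integer, with ties going towards positive infinity."""
    return math.floor(x + 0.5)

print(abs(-23.4))         # counterpart of fn:abs(-23.4)
print(math.ceil(-23.4))   # counterpart of fn:ceiling(-23.4)
print(math.floor(23.4))   # counterpart of fn:floor(23.4)
print(fn_round(23.4), fn_round(-23.4), fn_round(23.5))
```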

Boolean Functions
    fn:not(0) = fn:true()
    fn:not(fn:true()) = fn:false()
    fn:not("") = fn:true()
    fn:not((1)) = fn:false()
• Note that (1) evaluates to the number 1

String Functions
    fn:concat("X","ML") = "XML"
    fn:concat("X","ML"," ","book") = "XML book"
    fn:string-join(("XML","book")," ") = "XML book"
    fn:string-join(("1","2","3"),"+") = "1+2+3"
    fn:substring("XML book",5) = "book"
    fn:substring("XML book",2,4) = "ML b"
    fn:string-length("XML book") = 8
    fn:upper-case("XML book") = "XML BOOK"
    fn:lower-case("XML book") = "xml book"
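
The fn:substring examples show that character positions are counted from 1 and that the third argument is a length, not an end index. A Python sketch of the same results, where fn_substring is a hypothetical helper that only covers integer arguments with a start of at least 1:

```python
def fn_substring(s, start, length=None):
    """1-based substring: start is a position counted from 1 and length
    is a number of characters (to the end of the string if omitted)."""
    begin = start - 1
    return s[begin:] if length is None else s[begin:begin + length]

print(fn_substring("XML book", 5))
print(fn_substring("XML book", 2, 4))
print(" ".join(("XML", "book")))     # counterpart of fn:string-join
print("+".join(("1", "2", "3")))
print(len("XML book"), "XML book".upper(), "XML book".lower())
```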

Regexp Functions
    fn:contains("XML book","XML") = fn:true()
    fn:matches("XML book","XM..[a-z]*") = fn:true()
    fn:matches("XML book",".*Z.*") = fn:false()
    fn:replace("XML book","XML","Web") = "Web book"
    fn:replace("XML book","[a-z]","8") = "XML 8888"
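
The same checks can be approximated with Python's re module. This rests on the assumption that these simple patterns mean the same thing in both regular expression dialects; the two dialects are not identical in general. fn:matches succeeds if the pattern matches anywhere in the string, so re.search is the closer counterpart.

```python
import re

text = "XML book"

contains = "XML" in text                                # fn:contains
matches_1 = re.search("XM..[a-z]*", text) is not None   # fn:matches
matches_2 = re.search(".*Z.*", text) is not None
replaced_1 = re.sub("XML", "Web", text)                 # fn:replace
replaced_2 = re.sub("[a-z]", "8", text)

print(contains, matches_1, matches_2, replaced_1, replaced_2)
```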

Cardinality Functions
• The following decide the cardinality of general sequences
• fn:exists returns false iff the argument is the empty sequence
• fn:empty is the negation of fn:exists
    fn:exists(()) = fn:false()
    fn:exists((1,2,3,4)) = fn:true()
    fn:empty(()) = fn:true()
    fn:empty((1,2,3,4)) = fn:false()
    fn:count((1,2,3,4)) = 4
    fn:count(//rcp:recipe) = 5
