AM18 ASA INTERNALS: DATA MANAGEMENT

AM18ASA INTERNALS: DATA MANAGEMENT GLENN PAULLEY, DEVELOPMENT MANAGER paulley@ianywhere.com AUGUST 2005

Goals of this presentation • Overview of data management and query processing in Adaptive Server Anywhere 9.0.2 • Concentrate on performance issues and problem areas • Provide an overview of SQL Anywhere 9.0 technology • Highlight planned features for the Jasper release • Agenda • Section One: SQL language support, data management • Section Two: query execution and optimization

Design goals of SQL Anywhere Studio • Ease of administration • Good out-of-the-box performance • “Embeddability” features  self-tuning • Cross-platform support • Interoperability

Motivation for the ASA 9.0 release • Exploit the new architecture of 8.0 and add support for additional language features, including • GROUP BY ROLLUP • RECURSIVE UNION • Window functions and other OLAP support • XML • Table Functions • INTERSECT and EXCEPT • ORDER BY, SELECT TOP N in any query block, including views • Improve performance

Highlights of the ASA 9.0 releases • HTTP server • ASA Index Consultant • Improved performance, scalability • better scalability in OLTP environments • Query processing improvements • optimization refinements – particularly with the server’s cost model • histograms modified according to update DML statements • alternate, efficient execution methods for complex queries • SNMP support • 9.0.1 EBF build 1828, Windows platforms only • Formally part of the 9.0.2 release

Performance, performance, performance

Contents • Language Support • New SQL constructs supported with 9.0.1 • Data Management in 9.0.1 • Database organization • Table storage organization • Index storage organization • Physical database design tips • Jasper features

New SQL language support in 9.0.1 • Table functions (SELECT over a stored procedure) • ORDER BY clause now supported in all SELECT blocks • Necessary to support SELECT TOP n in derived tables, views, and subqueries with correct semantics • RECURSIVE UNION (bill-of-materials) queries • INTERSECT and EXCEPT query expressions • LATERAL keyword for derived tables • Now necessary for derived tables or table expressions containing outer references • WITH clause (common table expressions) • Essentially in-lined view definitions

New SQL language support in 9.0.1 • SELECT TOP n START AT m • Equivalent functionality to that in MySQL, Postgres • n and m can be variables or host variables • WITH INDEX hint in FROM clause • Named CHECK, PK, FK, UNIQUE constraints • Constraint violation message refers to the constraint name • New catalog tables: • SYSCONSTRAINT contains information about all constraints, even referential integrity constraints • SYSCHECK contains the body of the CHECK constraint; now permit multiple CHECK constraints on the same column(s) • Specific CHECK constraint that is violated appears in error • Not available in older database formats, even if DBUPGRAD is used

New SQL language support in 9.0.1 • OLAP support • VARIANCE, STD_DEV aggregate functions • ORDER BY clause for LIST aggregate function • GROUP BY • ROLLUP, CUBE, GROUPING SETS • Binary set functions (linear regression, co-variance, etc.) • Rank functions • Windowed aggregate functions • Construct “moving average” results in a single SQL statement • Support for multiple DISTINCT aggregate functions in a single SELECT block • Necessitates the use of Hash Group By

New SQL language support in 9.0.1 • Support for SET statement in Transact-SQL dialect stored procedures • Implemented for MS SQL Server compatibility • EXECUTE IMMEDIATE extensions • Procedures can now use EXECUTE IMMEDIATE to execute dynamically-constructed queries which return a result set • WITH ESCAPES ON | OFF • WITH QUOTES ON | OFF • Variable assignment permitted in UPDATE statements (8.0.1) • SELECT INTO base-table

New SQL language support in 9.0.1 • FOR XML AUTO, FOR XML RAW, FOR XML EXPLICIT, OPENXML procedure (supports XPATH queries over XML column values) • SQLX functionality: xmlelement(), xmlforest(), xmlgen(), xmlconcat(), and xmlagg() • EXPRTYPE() function – outputs the type of the expression argument • Useful when defining computed columns • LOCATE() can handle negative offsets • INSERT WITH AUTO NAME (8.0.2)

Table functions • Result set description determined from the catalog; result set must match exactly • Otherwise SQLSTATE ‘WP012’ • Workaround: use the WITH clause to annotate the procedure reference in the FROM clause: SELECT * FROM PROC() WITH( X Integer, Y char(17) ) SELECT * FROM SYS.SYSTABLE as st, sa_table_fragmentation() as tbfrg WHERE st.table_name = tbfrg.tablename

Table functions • Procedure may return only one result set • Statistics regarding cost, result set cardinality of the procedure are captured at run time; used for subsequent requests • Statistics are stored in SYS.SYSPROCEDURE • Minimally requires DBUPGRAD of older databases to 9.0.0

Recursive UNION • SQL-2003 implementation of recursive (bill-of-materials) queries • Only DB2 also offers RECURSIVE UNION support; Oracle implements a ‘cycle’ clause • Uses specialized join operators: recursive hash inner and outer joins • will utilize a nested-loop strategy if inputs are small; done adaptively at run-time during query execution WITH RECURSIVE r (level, emp_id, manager_id) as ( SELECT 1, emp_id, manager_id FROM employee WHERE emp_id = manager_id UNION ALL SELECT level+1, e.emp_id, e.manager_id FROM employee e JOIN r ON (e.manager_id = r.emp_id) WHERE e.emp_id <> e.manager_id and level < 3) SELECT * FROM r

Recursive UNION: restrictions • Query expression must be UNION ALL • Recursive reference must be in a query block that does not contain DISTINCT, aggregation, or an ORDER BY clause • Recursive reference in a LEFT OUTER JOIN is permitted • Schema of WITH clause must match recursive query • Implicit type conversions involving truncation can yield undesired results; SQLSTATE 42WA2 returned if server detects a type mismatch • Use CAST to ensure compatible types • Infinite queries are possible; server kills the query after N recursions • controlled by the new connection option MAX_RECURSIVE_ITERATIONS (default 100)

INTERSECT and EXCEPT • Implement set/bag difference and set/bag intersection • Both ALL and DISTINCT variants are supported; DISTINCT performed by default • Form query expressions in the same fashion as UNION • NULL treated as a special value in each domain, hence NULLs are equivalent to each other • Useful when formulating queries that require counting of identical rows • See the help for order-of-precedence amongst the set operators

EXCEPT and INTERSECT ALL • Rewrite to transform ALL to DISTINCT done automatically by the optimizer • Both EXCEPT and INTERSECT can be computed through either a merge or hashing technique • Also supports an (expensive) nested-loop strategy in case a cache shortage is encountered • With ALL variants: • implicitly performs aggregation to count the number of duplicate rows in each input • A new query execution operator, ROW REPLICATE, generates the required copies of each row SELECT description FROM product EXCEPT ALL SELECT description FROM product as p2 WHERE quantity < 15

GROUP BY ROLLUP • Computes aggregates as usual, but result set contains multiple sets of groups • Logically, grouping is performed N+1 times for N grouping expressions • Essentially implements the functionality of COBOL Report Writer in a single SQL request SELECT state, zip, count(*), grouping(zip), grouping(state) FROM customer GROUP BY ROLLUP (state, zip)

GROUP BY CUBE • Computes aggregates as usual, but result set contains the power set of the N grouping expressions • Expensive to execute for large N • Result can be restricted through the specification of GROUPING SETS SELECT state, zip, count(*), grouping(zip), grouping(state) FROM customer GROUP BY CUBE (state, zip) SELECT state, zip, count(*), grouping(zip), grouping(state) FROM customer GROUP BY GROUPING SETS ( (state, zip), state, zip, () )

WINDOW functions • Part of SQL OLAP extensions • Computes aggregates (except LIST) over a window of rows • Provides an ANSI-compliant way to number the rows of a result set • ROW_NUMBER() rather than NUMBER(*) • Useful to: • Compute cumulative aggregates, or “moving averages” • Eliminate the need for correlated subqueries involving aggregation

WINDOW functions • List employees, by department, in four US states by their start dates, along with their cumulative salaries: SELECT dept_id, emp_lname, start_date, salary, SUM(salary) OVER (PARTITION BY dept_id ORDER BY start_date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS “Sum_Salary" FROM employee WHERE state IN ('CA', 'UT', 'NY', 'AZ') AND dept_id IN ('100', '200') ORDER BY dept_id, start_date;

WINDOW functions • List all orders (with part information) where the part quantity cannot cover the maximum single order for that part: SELECT o.id, o.order_date, p.* FROM sales_order o, sales_order_items s, product p WHERE o.id = s.id and s.prod_id = p.id and p.quantity < (SELECT max(s2.quantity) FROM sales_order_items s2 WHERE s2.prod_id = p.id) ORDER BY p.id, o.id SELECT order_qty.id, o.order_date, p.*, max_q FROM ( SELECT s.id, s.prod_id, MAX(s.quantity) OVER (partition BY s.prod_id order by s.prod_id) AS max_q FROM sales_order_items s) as order_qty, product p, sales_order o WHERE p.id = prod_id and o.id = order_qty.id and p.quantity < max_q ORDER BY p.id, o.id

WINDOW functions • Find the salespeople with the best sales (total amount) for each product, including ties: SELECT s.prod_id, o.sales_rep, SUM(s.quantity) as total_quantity, SUM(s.quantity * p.unit_price) as total_sales FROM sales_order o KEY JOIN sales_order_items s KEY JOIN product p GROUP BY s.prod_id, o.sales_rep HAVING total_sales = (SELECT FIRST SUM(s2.quantity * p2.unit_price) as sum_sales FROM sales_order o2 KEY JOIN sales_order_items s2 KEY JOIN product p2 WHERE s2.prod_id = s.prod_id GROUP BY o2.sales_rep ORDER BY sum_sales DESC ) ORDER BY s.prod_id SELECT v.prod_id, v.sales_rep, v.total_quantity, v.total_sales FROM ( SELECT o.sales_rep, s.prod_id, SUM(s.quantity) as total_quantity, SUM(s.quantity * p.unit_price) as total_sales, RANK() OVER (PARTITION BY s.prod_id ORDER BY SUM(s.quantity * p.unit_price) DESC) as sales_ranking FROM sales_order o KEY JOIN sales_order_items s KEY JOIN product p GROUP BY o.sales_rep, s.prod_id ) as v WHERE sales_ranking = 1 ORDER by v.prod_id

Data Management in 9.0.2

Moving to ASA 9.0.2 • If database is 8.0.2, unload/reload to 9.0 is largely unnecessary • DBUPGRAD to 9.0 required for some catalog schema changes, in particular for the Index Consultant • There should be no consequences of using DBUPGRAD with respect to performance • However: • only 9.0 format databases support named constraints • only 9.0 format databases support cache warming • only 9.0.1 databases support page checksums • 8.0.2 databases do not support index statistics collection by default • Can be turned on when creating the database via CREATE DATABASE (but not dbinit)

Moving to ASA 9.0.2 • Otherwise, unload/reload from 8.0.1 or 8.0.0 recommended • Clustered index support • Better statistics management • Improved histogram organization, statistics collection • Index statistics kept persistent in the database file • Improved histograms • Cache warming on startup • Checksums on database pages • PCTFREE option for base and temporary tables

Moving to SQL Anywhere “Jasper” • The Jasper release of the SQL Anywhere server will not support older database formats • Jasper will ship with a migration tool to convert an existing database into a Jasper-format database

Database organization • A database consists of up to 13 “dbspaces” • Maximum size of each dbspace is limited by the underlying operating system • Maximum database size is also determined by page size • Limit for any dbspace is 2**28 (256 million) pages • Each dbspace, the temporary file, and the transaction log is a simple OS file • Ease of administration, backup • Temporary file is used for temporary tables • A dbspace file grows in 256K extents (512K if 16K pages, 1Mb if 32K pages) • Database files can be copied to/from different endian machines • Can copy database from Wintel to big-endian UNIX systems and back again • Server automatically does data conversion where necessary

Database organization • A database file contains: • table pages • index pages • free pages • rollback log pages • checkpoint log pages • Each dbspace for a database must use the same page size

Physical organization: tables • Each table uses an independent set of table pages • Each table allocates at least one page, even if the table is empty • Server maintains bit-maps for table pages • Supports clustering of table pages in the same portion of the database file • Facilitates large-block I/O – SQL Anywhere reads 64K at a time when doing sequential scans • Result: considerably faster sequential scan performance

Physical organization: tables • New in 8.0.2: ‘scattered read’ support on Windows 2000 and Windows XP • Another mainframe technology being reinvented on PC/UNIX servers • aka “locate-mode I/O” • Improves performance, reduces memory requirements • Coming to other platforms as vendors implement it • Tables cannot span dbspaces • Each secondary index on a table can be stored in a separate dbspace • Recommended if multiple spindles are available (not necessary for RAID devices) • Partition dbspaces on separate devices whenever possible • Brings more disk arms to bear, reducing seek latency

Physical organization: tables • Rows are inserted into pages at a point where, if at all possible, the entire row can be stored contiguously • Caveat: row segments are at most 4K; second or subsequent row segments can appear on different pages • Columns are packed tightly together; only unpadded values are stored on disk • Primary key columns are always at the beginning of each row, in sequence • Server may rewrite all rows if PK added or modified • Rows can be of (almost) unlimited size; are split across pages where necessary • Maximum length of any column is 2Gb • Maximum number of rows per page is 255

Physical organization: tables • Rows are not guaranteed to be placed in pages corresponding to their insertion order • By default, ASA uses a first-fit algorithm for page selection • To guarantee ordering of a result set, specify an ORDER BY clause • Space is not reserved for columns that are null • BLOB values are stored in a separate “arena” of pages • First 255 bytes are stored together with the row • Access to the rest of the BLOB value will almost certainly require a SEEK • Implications for choice of page size • Once inserted, a row identifier is immutable • An updated row must be split if its new length does not allow it to fit on the page

Physical organization: tables • Table pages are allocated in 8 page clusters; cluster allocation depends on page size • 2K: grow 4 clusters at a time • 4K: grow 2 clusters at a time • All other page sizes: one cluster at a time • ASA will re-use database pages for additional inserts if entire pages are freed • Defaults: for 1K pages, free space is 100 bytes; all other page sizes is 200 bytes • DBA can specify freespace percentage to accommodate future table UPDATEs using PCTFREE • PCTFREE characteristic stored in new catalog table SYSATTRIBUTE (and corresponding table SYSATTRIBUTENAME) • Can be specified for temporary tables

Page sizes • Page sizes supported are 1K, 2K, 4K, 8K, 16K, 32K • 2K page size minimum on all UNIX platforms • Default changed to 2K in the 6.0.3 release • A server can support several databases concurrently • Buffer pool page size will be the largest database page size specified on the command line • Consider tradeoffs with your choice of page size • 4K recommended; occasionally 8K may offer improved performance • Default will change to 4K with Jasper release • Do not use 16K or 32K pages unless you have a specialty application • In typical environments, large page sizes cause inefficient use of cache

Choice of page size does matter • Larger rows usually require larger pages (requires fewer split rows) • Random retrieval performance is dependent on the application • Larger pages can pollute the cache with unnecessary data • Often require larger buffer pools to accommodate the application’s working set • Smaller pages are more cache efficient, but • Smaller pages reduce index fanout, and can increase index depth

Choice of page size does matter • Don’t ignore index maintenance costs when considering page size (larger page sizes can mean increased cache pressure) • Test your application with different alternatives • Your mileage may vary • A 4K page size is a typical choice for many applications • My recommendation: use 4K pages unless thorough testing proves that a different page size offers better performance/scalability • See data storage whitepaper • Available at www.ianywhere.com/developer • Recently updated for 9.0.0

Physical organization: indexes • ASA 9.0 supports two different types of indexes: • Hash-based • Key is a one-way order-preserving encoding of at most nine bytes of the data values • Hash-based indexes are still used when the key length does not satisfy the limits for compressed indexes • Compressed • Contains Patricia tries in the index’s internal nodes • Used for keys > 10 bytes and less than • 122 bytes with 1K pages • 248 bytes for all other page sizes • Substantially improved performance with larger keys

Physical index organization: hash-based indexes • Values in an index are “hashed” into a key of at most 10 bytes using an order-preserving encoding function • WITH HASH SIZE is deprecated • Each indexed column encoded separately, with a one-byte length • A 10-byte hash value can hold two 32-bit integer values (including two length bytes) • Hash values in an index are stored separately from the index entry itself • The hash value for an identical secondary key is shared for each index entry (row) in that index page • This improves fanout when data distribution is skewed

Physical index organization: Compressed indexes • Internal nodes in the index contain a Patricia trie • PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric (D. R. Morrison, J. ACM Vol. 15, 1968) • Combines a binary trie with an optimization to skip over bit comparisons that would result from one-way branching • Result: automatic compression of string data • Excellent fanout of internal nodes • Common substrings of key values have a negligible impact on space requirements and performance • Superb performance improvements in many cases, especially with composite primary and foreign keys

Clustered index support • First offered with the 8.0.2 release • At most one clustered index per table (may be a temporary table) • May be secondary index, PK, FK, UNIQUE constraint • Optimizer assumes PK indexes are clustered unless a different clustering index exists • Engine will not attempt to maintain clustering on PK indexes unless they are declared CLUSTERED • May be hash or compressed index • Clustering characteristic stored in SYSATTRIBUTE catalog table • CLUSTERED keyword can be used in both CREATE INDEX and CREATE/ALTER TABLE statements • However, ALTER does not reorganize the table; use REORGANIZE TABLE

Clustered index support • On INSERT/LOAD TABLE, server attempts to keep rows physically adjacent in base table pages • Specification of PCTFREE on LOAD can be critical • Adjacency is NOT guaranteed; ORDER BY still requires a physical sort or indexed retrieval • Can significantly improve performance • Optimizer costs clustered index access differently • Consider their use with queries that involve range predicates • Often useful with DATE or TIMESTAMP columns • Use REORGANIZE TABLE or UNLOAD/RELOAD if clustering degrades over time • ALTER INDEX statement can rename an index or change its clustering attribute

Physical index organization: fanout and page size • Fanout refers to the number of index entries on a page • Lower fanout means greater index depth, and hence more costly random retrieval • Fanout is affected by • Page size • Hash value size/trie compression • Distribution of key values • Index maintenance • Fanout can degrade over time • sa_index_density() procedure

Indexes and query processing • ASA does not store actual data values in the index • implies each base row must be retrieved to • Fetch the values of any attributes, or • To compare keys longer than the maximum hash value size • Indexes are automatically created to enforce referential integrity • Primary keys, foreign keys, unique constraints • All related indexes must be the same type (hash or compressed) • Maximum number of indexes is dependent on page size • <= 4K: 2048 indexes • 8K: 1024 indexes • 16K: 512 indexes • 32K: 256 indexes

Indexes and query processing • Each indexed column can be ascending or descending • Index is scanned backwards if the application scrolls in the opposite direction, or an ORDER BY clause specifies the reverse sequence • Support for merge and hash joins means that ASA will often use sequential scans, rather than indexed retrieval

REORGANIZE Statement – base tables • REORGANIZE TABLE tablename • Defragments rows on-the-fly by removing/inserting groups of rows in clustered index (or PK) order • Exclusive lock held on the table while a group is processed; commits occur periodically to enable other applications to run, checkpoints are suspended while the group is being processed • Performs implicit COMMITs during operation • Rows will be in clustered sequence when operation is complete (except possibly concurrent UPDATES) • Use new procedure sa_table_fragmentation() to discover tables that warrant reorganization

REORGANIZE Statement - indexes • REORGANIZE TABLE tablename [ index specification ] • INDEX indexname • FOREIGN KEY indexname • PRIMARY KEY • Exclusive lock is held throughout • CHECKPOINTs are suspended • Reclaims space lost to update activity • Re-balances the index, especially important after many DELETE operations • Use the new procedure sa_index_density() to identify indexes that require reorganization

Data management improvements in 9.0.1 • Better scalability – new lock-free cache manager • Substantially better performance across the board • Support for page checksums • New option for dbinit and CREATE DATABASE statement • Supported by dbvalid utility, and a new statement VALIDATE CHECKSUM • Overhead: largely depends on CPU speed. Examples: • 2.8 milliseconds per I/O for 32K pages • 0.7 milliseconds per I/O for 8K pages • Improvements to dynamic cache sizing • Sampling rate changes with database growth or the starting of a new database on the same server

Data management improvements in 9.0.1 • Database cache warming feature • Two operational phases, collection and reload • During collection, page IDs are saved in the database as they are accessed at startup • During reload, collected page IDs are read into cache as background processing • Checks and balances used to prevent swamping the server with I/O during server startup • Need to test performance before deploying • Cache warming is *enabled* by default

AM18 ASA INTERNALS: DATA MANAGEMENT