120 likes | 339 Vues
CSC 485E/SENG 480D/CSC 571 Advanced Databases Introduction. DB and DBMS. Database (DB): a collection of information that exists over a long period of time. Database Management System (DBMS): a complex software for handling Large data efficiently and safely. DBMS.
E N D
DB and DBMS • Database (DB): a collection of information that exists over a long period of time. • Database Management System (DBMS): a complex software for handling • Large data efficientlyand safely.
DBMS • Allows users to create new databases and specify their schema, using a data-definition language. • Enables users to query and modify the data, using a query and data-manipulation language. • Supports intelligent storage of very large amounts of data. • Protects data from accident or not proper use. • Example: We can require from the DBMS to not allow the insertion of two different people with the same SIN. • Allows efficient access to the data for queries and modifications. • Example: Indexes over specified fields • Controls access to data from many users at once (concurrency), without allowing “bad” interactions that can corrupt the data accidentally. • Recovers from software failures and crashes.
Database Studies • Design of databases. • What kinds of information go into the database? • How is the information structured? • How do data items connect? • Database programming. • How does one express queries on the database? • How is database programming combined with conventional programming? • Database system implementation. • How does one build a DBMS, including such matters as query processing, transaction processing and organizing storage for efficient access? We’ll focus on this part
Fictitious Megatron DBMS • Stores relations as Unix files • Students(name, sid, dept) is stored in the file /home/megatron/students as Smith#123#CS Jones#533#EE • Schemas are stored in /home/megatron/schemas e.g. Students#name#STR#id#INT#dept#STR Depts#name#STR#office#str
Megatron sample session mayne$ megatron WELCOME TO MEGATRON 2006 megaSQL% SELECT * FROM Students; Name id dept ------------------------------------- Smith 123 CSC Johnson 522 ECE megaSQL%
Megatron sample session II megaSQL% SELECT * FROM Students WHERE id >= 500; Johnson#522#EE megaSQL% quit THANK YOU FOR USING MEGATRON 2006 mayne$
Megatron Implementation • To execute SELECT * FROM R WHERE <COND> • Read file schema to get attributes of R • Check that the <COND> is semantically valid for R • Read file R, • for each line • check condition • if OK, display
What’s wrong with Megatron? • Tuple layout on disk: no flexibility for DB modifications. • Change CSC to ECON and the entire file has to be rewritten. • Search Expensive: no indexes; always read entire relation. • Bruteforce query processing. • No buffer manager: everything comes off of disk all the time. • No concurrency control: several users can modify a file at the same time with unpredictable results. • No reliability: can lose data in a crash or leave operations half done.
Architecture of a DBMS • The “cylindrical” component contains not only data, but also metadata, i.e.info about the structure of data. • If DBMS is relational, metadata includes: • names of relations, • names of attributes of those relations, and • data types for those attributes (e.g., integer or character string). • A database also maintains indexes for the data. • Indexes are part of the stored data. • Description of which attributes have indexes is part of the metadata.
What will be covered • Secondary Storage Management • Disks Mechanics • Disk Computation Model • Handling disk failures • Index Structures • B-Trees, Extensible Hash Tables, etc. • Multidimensional Indexes (for GIS and OLAP) • Query Execution • Algorithms for relational operators • Join methods. • Query Compiler • Algebraic laws for improving query plans. • Cost based plan selection • Join orders
What will be covered • Parallel and Distributed Databases • Parallel algorithms on relations • Distributed query processing • Distributed transactions • Google’s Map-Reduce framework • Peer-to-peer distributed search • Data Mining • Frequent-Itemset Mining • Finding similar items • Clustering of large-scale data • Databases and the Internet • Search engines • PageRank • Data streams • Data mining of streams