Provenance Management & Citations in Curated Databases

Provenance Management & Citations in Curated Databases Kleisarchaki Sophia, HY561, 05/05/09

About the Author – Peter Buneman • He works in the Database Group in the Laboratory for Foundations of Computer Science (University of Edinburgh). • He spent many years in the Database Group of the Department of Computer and Information Science at the University of Pennsylvania. • You can find him... in polynomial time.

Contents • Before all: “Curated Databases” (Peter Buneman, James Cheney, Wang-Chiew Tan) and “Provenance in Databases (Tutorial Outline)” (Peter Buneman, Wang-Chiew Tan) • 1st paper: “Provenance Management in Curated Databases” (Peter Buneman, Adriane P. Chapman, James Cheney) • 2nd paper: “How to cite curated databases and how to make them citable” (Peter Buneman)

Curated Databases • What is a Curated Database? • The term “curated” comes from the Latin curare, “to care for”. • Curated databases are populated and updated with a great deal of human effort through the consultation, verification and aggregation of existing sources and the interpretation of new raw data. • They are the result of a great deal of annotation, correction and transfer of data from other sources.

Curated Databases • What is a Curated Database NOT? • Curated databases are not warehouses. They are manually constructed by highly skilled scientists. • They are not computed automatically from existing datasets. • They are not views.

Curated Databases • Notable examples of curated databases • UniProt (formerly called SwissProt) used in molecular biology. • CIA World Factbook: source of demographic data. • IUPHAR: receptor database. Maintained by volunteers. • Nuclear Protein Database (NPD). • Reference manuals, dictionaries and gazetteers. • Such databases are not confined to biology; they are also being developed in areas such as astronomy and geology. • Wikipedia and other wikis are also curated in that they are the product of direct human effort.

Curated Databases • What are the characteristics of a Curated Database? • Source. Data is copied and edited from existing sources, perhaps other curated databases. Knowing the origin (provenance) is important. • Annotation. In addition to core data, curated databases also contain annotations that carry additional pieces of information such as provenance. • Update. A common practice is to keep a working database up to date and to “publish” versions of it. • Schema and structure. Constructed “on the cheap”, usually stored in a text file. Almost inevitably the structure of the entries evolves over time.

Provenance in Databases (1/2) • Provenance (also called lineage or pedigree) describes the source and derivation of data. • Why is provenance important? It helps to: • Determine the authenticity of a work. • Establish the historical importance of a work by suggesting other artists who might have seen and been influenced by it. • Determine the legitimacy of current ownership. • Trust the data.

Provenance in Databases (2/2) • Overview of provenance • Provenance: describes the source and derivation of data. • Workflow or coarse-grain provenance: records a complete history of the derivation of some data set. • Dataflow or fine-grain provenance: describes the derivation of part of the resulting data set. It comes in two forms: • Why-provenance: keeps the justification for the element appearing in the output. • Where-provenance: identifies the source elements from which the data in the target is copied.

Where-, Why- Provenance • Why? (why-provenance) • Where? (where-provenance)
  Source table (restaurants):
    Restaurant           Cost   Type       Zip
    Peacock Alley        $$$    French     10022
    Bull & Bear          $$$    Seafood    10022
    Pacifica             $      Chinese    10013
    Soho Kitchen & Bar   $      American   10022
  Source table NYHotels:
    Hotel             Zip     Rating
    Waldorf Astoria   10022   4.5
    Holiday Inn DT    10013   4.0
  View (JOIN, PROJECT):
    Restaurant           Cost   Hotel             Rating
    Peacock Alley        $$$    Waldorf Astoria   4.5
    Bull & Bear          $$$    Waldorf Astoria   4.5
    Soho Kitchen & Bar   $      Waldorf Astoria   4.5
    Pacifica             $      Holiday Inn DT    4.0
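A minimal Python sketch of the two notions on the restaurant and hotel example. The table name restaurants, the function view_with_provenance and the layout of the annotations are assumptions made for illustration; they are not taken from the tutorial. Each view row carries the source tuples that justify it (why) and, per output value, the source cell it was copied from (where).

```python
# Illustrative sketch only: all names below are hypothetical.
restaurants = [  # (Restaurant, Cost, Type, Zip)
    ("Peacock Alley", "$$$", "French", "10022"),
    ("Bull & Bear", "$$$", "Seafood", "10022"),
    ("Pacifica", "$", "Chinese", "10013"),
    ("Soho Kitchen & Bar", "$", "American", "10022"),
]
hotels = [  # (Hotel, Zip, Rating)
    ("Waldorf Astoria", "10022", 4.5),
    ("Holiday Inn DT", "10013", 4.0),
]


def view_with_provenance(restaurants, hotels):
    """Join on Zip, project (Restaurant, Cost, Hotel, Rating), annotate each row."""
    rows = []
    for i, (rest, cost, _type, rzip) in enumerate(restaurants):
        for j, (hotel, hzip, rating) in enumerate(hotels):
            if rzip != hzip:
                continue
            rows.append({
                "row": (rest, cost, hotel, rating),
                # Why-provenance: the source tuples that justify this output row.
                "why": {("restaurants", i), ("hotels", j)},
                # Where-provenance: the source cell each output value was copied from.
                "where": {
                    "Restaurant": ("restaurants", i, "Restaurant"),
                    "Cost": ("restaurants", i, "Cost"),
                    "Hotel": ("hotels", j, "Hotel"),
                    "Rating": ("hotels", j, "Rating"),
                },
            })
    return rows


view = view_with_provenance(restaurants, hotels)
for r in view:
    print(r["row"], sorted(r["why"]), r["where"]["Rating"])
```

In this sketch the rating 4.5 in the Peacock Alley row has the Rating cell of the Waldorf Astoria tuple as its where-provenance, while its why-provenance also includes the Peacock Alley tuple, since the join needs both.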

1st paper • “Provenance Management in Curated Databases” • Peter Buneman, Adriane P. Chapman, James Cheney

What is the problem being addressed in the paper? • Database technology is employed not only to provide access to source data, but also to the derived knowledge of scientists who have interpreted the data. • Provenance, or metadata describing creation, recording, ownership, processing, or version history, is essential for assessing the value of such data. • What information should be retained? • How should it be managed?

What is this paper about? • Investigates general-purpose techniques for recording provenance for data that is copied among databases. • Describes an approach that tracks the user’s actions and records them in a convenient, queryable form. • Presents an implementation of this technique and uses it to evaluate the feasibility of database support for provenance management.

Curated Databases - Example • Example: a curator maintains her own database. a) She copies records of some interesting proteins from a SwissProt webpage into her database. b) She fixes the new entries so that the PTM (post-translational modification) found in SwissProt is not confused with her own PTM. c) She copies some publications from OMIM and NCBI. d) One year later she finds a discrepancy between two PTMs.

The Problem • It is necessary to retain provenance information describing the source and version history of the data. • We focus on “fine-grained” provenance, which describes how data has moved through a network of databases. • We need to record both local modifications to the database (insert, delete, update) and global operations such as copying data from external sources. • Constraints: 1. There is no standard for storing or exchanging provenance. 2. Practices for identifying or locating data vary. 3. Past versions may not be archived. 4. Curators employ a variety of application programs that cannot be changed.

Our Approach (1/2) • The user’s actions are captured as a sequence of insert, delete, copy and paste operations by a provenance-aware application. [Figure: provenance architecture, with external source databases, the local database and the auxiliary provenance database.]

Our Approach (2/2) • The authors implemented a naïve approach and several more sophisticated ones. • The naïve approach increases the time to process each update by 28%. The amount of provenance information stored is proportional to the size of the changed data. • Optimization techniques: • Transactional provenance management. • Hierarchical provenance management. • Together these optimizations reduce the added processing cost of provenance tracking to less than 5-10% per operation and reduce the storage cost by a factor of 5-7 relative to the naïve approach. • Typical provenance queries can also be executed more efficiently.

Manual Updates and Provenance (1/2) • “Where does a piece of data come from?” • We need a means of describing the location of any data element. • Two assumptions: • Each database can be viewed as a tree. • A given sequence of edge labels occurs on at most one path from the root, so a path identifies at most one data element (e.g. SwissProt/Release{20}/Q01780 identifies a specific entry).

Manual Updates and Provenance (2/2) • Update operations are of the form:
  u ::= ins {a : v} into p | del a from p | copy q into p
  ins {a : v} into p: inserts an edge labeled a with value v into the subtree at p.
  del a from p: deletes the edge labeled a under p together with its subtree.
  copy q into p: replaces the subtree at p with a copy of the subtree at location q.
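The three update operations can be sketched in Python on a tree held as nested dictionaries, with slash-separated paths as locations. This is only an illustration under assumptions: the function names (resolve, ins, delete, copy_into), the dictionary encoding and the sample data are hypothetical and are not the paper's implementation.

```python
import copy


def resolve(db, path):
    """Return the subtree at a slash-separated path such as 'T/c'."""
    node = db
    for label in path.split("/"):
        node = node[label]
    return node


def ins(db, label, value, p):
    """ins {label : value} into p: add a new edge under the node at p."""
    target = resolve(db, p)
    if label in target:
        raise KeyError(f"{p}/{label} already exists")
    target[label] = value


def delete(db, label, p):
    """del label from p: remove the edge and its whole subtree."""
    del resolve(db, p)[label]


def copy_into(db, q, p):
    """copy q into p: replace the subtree at p with a copy of the subtree at q."""
    parent_path, _, last = p.rpartition("/")
    parent = resolve(db, parent_path) if parent_path else db
    parent[last] = copy.deepcopy(resolve(db, q))


# S plays the role of a source database, T of the curator's target database.
db = {"S": {"a": {"x": 1, "y": 2}}, "T": {}}
ins(db, "c", {}, "T")            # create an empty node T/c
copy_into(db, "S/a", "T/c")      # T/c now holds a copy of S/a
ins(db, "z", 3, "T/c")           # local addition
delete(db, "y", "T/c")           # local deletion
print(db["T"])                   # {'c': {'x': 1, 'z': 3}}
```

The deep copy keeps the source subtree independent of later edits to the target, which matches the copy-paste reading of the operation.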

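Provenance Tracking • Provenance links are stored in a relation Prov(Tid, Op, Loc, Src): the transaction identifier, the operation (I = insert, C = copy, D = delete), the affected location and, for copies, the source location. [Figure: provenance architecture, with external source databases, the local database and the auxiliary provenance database.]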

Naïve Provenance • Store one provenance record for each copied, inserted or deleted node. • Wasteful in terms of space. • Retains the maximum possible information about the user’s actions. [Example table: one transaction per line.]
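A small Python sketch of naïve recording, assuming the tree is held as nested dictionaries and each operation is its own transaction. The helper names (nodes, naive_records) and the sample data are hypothetical; only the record layout (Tid, Op, Loc, Src) comes from the slides.

```python
from itertools import count


def nodes(tree, path):
    """Yield the path of every node in the subtree rooted at `path`."""
    yield path
    if isinstance(tree, dict):
        for label, child in tree.items():
            yield from nodes(child, f"{path}/{label}")


def naive_records(tid, op, loc, subtree, src=None):
    """One (Tid, Op, Loc, Src) row per node touched by the operation."""
    rows = []
    for path in nodes(subtree, loc):
        suffix = path[len(loc):]
        # Only copies have a source; inserts and deletes store no Src.
        rows.append((tid, op, path, src + suffix if op == "C" else None))
    return rows


source_subtree = {"x": 1, "y": {"z": 2}}   # stands for the data at S/a
tids = count(1)                            # every operation gets a fresh transaction id
prov = []
prov += naive_records(next(tids), "C", "T/c", source_subtree, src="S/a")
prov += naive_records(next(tids), "I", "T/c/w", 5)
prov += naive_records(next(tids), "D", "T/c/y", {"z": 2})
for row in prov:
    print(row)
```

Copying a subtree of four nodes already produces four rows here, which is the space cost the slide refers to.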

Transactional Provenance • Actions are grouped into transactions larger than a single operation. • Store only provenance links describing the net changes resulting from a transaction. • Details about intermediate states are not retained. • Less precise than the naïve approach. • Number of transactional provenance records: i + d + c, where i is the number of inserted nodes in the output, d is the number of nodes deleted from the input, and c is the number of copied nodes in the output. [Example table: the entire update as one transaction.]
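One way to picture “net changes” is the following Python sketch. It assumes the operations of a transaction arrive as per-node triples and that the set of paths existing before the transaction is known; the function net_changes and its arguments are hypothetical, and corner cases such as re-inserting a deleted node or overwriting by copy are not handled.

```python
def net_changes(tid, ops, existing):
    """Collapse per-node operations into the net provenance links of one transaction.

    ops: list of (op, loc, src) with op in {"I", "C", "D"}, in execution order.
    existing: set of target paths that existed before the transaction started.
    """
    links = {}
    for op, loc, src in ops:
        if op == "D":
            # Data created and then deleted inside the transaction is temporary
            # and leaves no trace; only deletions of pre-existing nodes are kept.
            links.pop(loc, None)
            if loc in existing:
                links[loc] = ("D", None)
        else:
            links[loc] = (op, src)
    return [(tid, op, loc, src) for loc, (op, src) in links.items()]


ops = [
    ("C", "T/c", "S/a"),
    ("C", "T/c/x", "S/a/x"),
    ("C", "T/c/y", "S/a/y"),
    ("I", "T/c/w", None),
    ("D", "T/c/y", None),    # deletes a node copied earlier in the same transaction
    ("D", "T/old", None),    # deletes a node that existed before the transaction
]
records = net_changes(7, ops, existing={"T", "T/old"})
for row in records:
    print(row)
```

In this example six operations leave four records (c = 2, i = 1, d = 1), in line with the i + d + c count.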

Hierarchical Provenance (1/2) • It is not necessary to store all of the provenance links explicitly. • The provenance of a child of a copied node can often be inferred from its parent’s provenance using a simple rule. • Does not discard any information. • Does not require the user to group operations into transactions. [Example table: hierarchical version of the naïve approach. In the example it is 25% smaller than Prov, but much larger savings are possible.]

Hierarchical Provenance (2/2) • We can define the full provenance table Prov as a view of the hierarchical table HProv as follows: • If the provenance of a path is specified in HProv, it is copied into Prov unchanged. • Otherwise it is inferred from the parent: the provenance of every target path p/a not mentioned in HProv is q/a, provided p was copied from q.
  Infer(t, p) ← ¬(∃x, q. HProv(t, x, p, q))
  Prov(t, op, p, q) ← HProv(t, op, p, q)
  Prov(t, I, p/a, ⊥) ← Prov(t, I, p, ⊥), Infer(t, p/a)
  Prov(t, C, p/a, q/a) ← Prov(t, C, p, q), Infer(t, p/a)
  Prov(t, D, p/a, ⊥) ← Prov(t, D, p, ⊥), Infer(t, p/a)
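The view can be sketched in Python for a single transaction. The function expand, the dictionary encoding of HProv rows and the explicit list of candidate paths are assumptions for illustration; the sketch follows the rule stated in prose, namely that a path not mentioned in HProv inherits its provenance from its parent.

```python
def expand(hprov_t, paths):
    """Compute the full Prov links of one transaction from its HProv links.

    hprov_t: dict loc -> (op, src) with the explicitly stored links.
    paths: all node paths to consider (e.g. the target tree after the transaction).
    """
    prov = {}
    # Visit parents before children so inherited links are available.
    for path in sorted(paths, key=lambda p: p.count("/")):
        if path in hprov_t:                 # explicit link: copy it over
            prov[path] = hprov_t[path]
            continue
        parent, _, label = path.rpartition("/")
        if parent in prov:                  # inferred from the parent's link
            op, src = prov[parent]
            prov[path] = (op, f"{src}/{label}" if op == "C" else None)
    return prov


hprov_t = {"T/c": ("C", "S/a"), "T/c/w": ("I", None)}
paths = ["T", "T/c", "T/c/x", "T/c/y", "T/c/y/z", "T/c/w"]
full = expand(hprov_t, paths)
for loc, link in full.items():
    print(loc, link)
```

Two stored links expand to five here; T itself gets no entry because nothing in the transaction touched it.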

Transactional-Hierarchical Provenance • Combination of the transactional and hierarchical provenance techniques. • Number of stored records: i + d + C, where i is the number of inserted nodes in the output, d is the number of nodes deleted from the input, and C is the number of roots of copied subtrees that appear in the output. [Example table: hierarchical version of the transactional table, with the entire update as one transaction.]

Provenance Queries • Define some convenient views of the raw Prov table:
  “p was unchanged during transaction t”: Unch(t, p) ← ¬(∃x, q. Prov(t, x, p, q))
  “p was inserted during transaction t”: Ins(t, p) ← Prov(t, I, p, ⊥)
  “p was deleted during transaction t”: Del(t, p) ← Prov(t, D, p, ⊥)
  “p was copied from q during transaction t”: Copy(t, p, q) ← Prov(t, C, p, q)

Provenance Queries • Define some convenient views of the raw Prov table:
  From(t, p, q): “node p comes from q during transaction t”
    From(t, p, q) ← Copy(t, p, q)
    From(t, p, p) ← Unch(t, p)
  Trace(p, t, q, u): “the data at location p at the end of transaction t came from the data at location q at the end of transaction u”
    Trace(p, t, p, t).
    Trace(p, t, q, t-1) ← From(t, p, q).
    Trace(p, t, q, u) ← Trace(p, t, r, s), Trace(r, s, q, u).
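From and Trace can be read as “step back one transaction” and “keep stepping back”. The Python sketch below assumes the full Prov table is available as a dictionary keyed by (transaction, location) and that transactions are numbered consecutively; step_back, trace and the sample table are hypothetical names and data.

```python
def step_back(prov, t, p):
    """From(t, p, q): the location q that the data at p (after t) had after t-1.

    Returns None when p has no predecessor (it was inserted or deleted in t).
    """
    row = prov.get((t, p))
    if row is None:
        return p                 # Unch(t, p): the data stayed where it was
    op, src = row
    return src if op == "C" else None


def trace(prov, p, t_now):
    """Yield the pairs (q, u) with Trace(p, t_now, q, u), newest first."""
    q, u = p, t_now
    while q is not None and u >= 0:
        yield q, u
        q = step_back(prov, u, q)
        u -= 1


prov = {
    (1, "T/a"): ("I", None),     # T/a inserted in transaction 1
    (2, "T/b"): ("C", "T/a"),    # T/b copied from T/a in transaction 2
    (4, "T/c"): ("C", "T/b"),    # T/c copied from T/b in transaction 4
}
steps = list(trace(prov, "T/c", 5))
print(steps)
```

The chain stops at the transaction that inserted the data, since an inserted node has no earlier location.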

Let’s answer some… “simple” questions!

Provenance Queries (1/2)
  Q1, Src: What transaction first created the data at a location? (e.g. who entered your telephone number incorrectly?)
    Src(p) = {u | ∃q. Trace(p, tnow, q, u), Ins(u, q)}
  Q2, Hist: What is the sequence of all transactions that copied a node to its current position?
    Hist(p) = {u | ∃q, r. Trace(p, tnow, q, u), Copy(u, q, r)}
  Q3, Mod: What transactions are responsible for the creation or modification of the subtree under a node?
    Mod(p) = {u | ∃q, r. p ≤ q, Trace(q, tnow, r, u), ¬Unch(u, r)}
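A compact Python sketch of the three queries over a full Prov table stored as a dictionary keyed by (transaction, location). The function names (chain, src, hist, mod), the explicit list of current paths and the sample table are assumptions for illustration, not the paper's programs.

```python
def chain(prov, p, t_now):
    """All (q, u) with Trace(p, t_now, q, u), following copies backwards."""
    out, q, u = [], p, t_now
    while q is not None and u >= 0:
        out.append((q, u))
        op, source = prov.get((u, q), (None, q))   # no row means Unch(u, q)
        q = source if op in (None, "C") else None
        u -= 1
    return out


def src(prov, p, t_now):
    """Transactions that inserted the data now found at p."""
    return {u for q, u in chain(prov, p, t_now) if prov.get((u, q), ("",))[0] == "I"}


def hist(prov, p, t_now):
    """Transactions that copied the data on its way to p, newest first."""
    return [u for q, u in chain(prov, p, t_now) if prov.get((u, q), ("",))[0] == "C"]


def mod(prov, p, t_now, paths):
    """Transactions that created or modified anything in the subtree under p."""
    result = set()
    for q in paths:
        if q == p or q.startswith(p + "/"):
            result |= {u for r, u in chain(prov, q, t_now) if (u, r) in prov}
    return result


prov = {
    (1, "T/a"): ("I", None), (1, "T/a/x"): ("I", None),
    (2, "T/b"): ("C", "T/a"), (2, "T/b/x"): ("C", "T/a/x"),
    (3, "T/b/y"): ("I", None),
}
paths = ["T/b", "T/b/x", "T/b/y"]       # nodes currently under T/b
print(src(prov, "T/b", 3), hist(prov, "T/b", 3), mod(prov, "T/b", 3, paths))
```

Mod differs from Src and Hist in that it follows every node in the subtree, which is why it needs the list of current paths.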

Provenance Queries (2/2) • There are many interesting queries that mention both provenance and the raw data. • Q4: Project the A field out of relation R(Id, A, B) along with its current provenance.
  Q(x, Px) ← R(k, x, y), From(tnow, “R/” + k + “/A”, Px)
• Such queries are tricky to write by hand. • Providing advanced support for provenance queries is future work. • Note: if some source databases do not track provenance, then queries stop following the chain of provenance.

Implementation • Target database: MiMI • Source database: OrganelleDB • Auxiliary provenance database • Wrappers for source and target databases [Figure: provenance architecture.]

Implementation Of Provenance Tracking (1/2) • Naïve provenance • A straightforward process of recording the target and source information of every transaction that affects the target database. • For a paste operation we add one record per node in the copied subtree. • Transactional provenance • When a commit action occurs, CPDB stores the provenance links connecting the current version with its predecessor. • No links corresponding to temporary data are stored. • The implementation maintains a provlist of provenance links that will be added to the provenance store when the user commits.

Implementation Of Provenance Tracking (2/2) • Hierarchical Provenance • Stores at most one record per operation. • For a copy, stores the record connecting the root of the copied tree to the root of the source. • Hierarchical Transactional Provenance • Maintains hierarchical provenance records instead of naïve provenance records in provlist. • Checks for and removes redundant links from provlist. E.g. copy S/a to T/a, then copy S/a/b to T/a/b: the second link is redundant, because it can be inferred from the first.
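The redundancy check can be sketched in Python as follows. It assumes provlist is a dictionary from target location to (op, source) and treats a copy link as redundant when the nearest linked ancestor already implies it; the function prune and this encoding are hypothetical, and only copy links are considered.

```python
def prune(provlist):
    """Drop copy links that are implied by the nearest ancestor's copy link.

    provlist: dict loc -> (op, src), the links pending for the current transaction.
    """
    kept = {}
    for loc, (op, src) in provlist.items():
        # Walk up to the nearest ancestor that has an explicit link.
        parent, _, label = loc.rpartition("/")
        suffix = "/" + label
        while parent and parent not in provlist:
            parent, _, label = parent.rpartition("/")
            suffix = "/" + label + suffix
        if parent and op == "C":
            parent_op, parent_src = provlist[parent]
            if parent_op == "C" and src == parent_src + suffix:
                continue         # inferable from the ancestor, so not stored
        kept[loc] = (op, src)
    return kept


provlist = {
    "T/a": ("C", "S/a"),
    "T/a/b": ("C", "S/a/b"),     # redundant: implied by T/a <- S/a
    "T/a/c": ("C", "S/z"),       # not redundant: the inferred source would be S/a/c
}
pruned = prune(provlist)
print(pruned)
```

Checking against the nearest linked ancestor matters: a link under T/a/c would have to be compared with S/z, not with S/a.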

Provenance Queries - Implementation • Src, Mod and Hist are implemented as programs. • For naïve and transactional provenance, the programs query the provenance store directly. • For hierarchical provenance, the provenance store corresponds to the HProv relation, so the programs query the store directly and compute the appropriate provenance links on the fly.

Evaluation • The experiments focused primarily on the storage and processing requirements of provenance tracking for the different approaches. • Query optimization and database tuning are left for future work. • Random sequences of copy-paste operations were used to simulate worst-case behavior.

Experimental Setup • Performed five sets of experiments. • Used six patterns of update operations. [Tables: update patterns; deletion patterns.]

First Two Experiments • First experiment, Figure 7: Number of entries in the provenance store after a variety of update patterns of length 3500. • Second experiment, Figure 8: Number of entries in the provenance store after mix and real update patterns of length 14000. The number at the top of each bar shows the physical size of the table. • N, T store 4 records/copy. H, HT store only 1 record.

Second Experiment • Figure 9 shows the time spent on storing provenance information for all the techniques. • The copy time for T is close to zero, because copies do not involve interaction with the provenance store. • Figure 9: The average amount of time for target database processing and for add, delete, copy and commit operations on the provenance store during the 14000-mix update.

Second Experiment • For the naïve approach, all operations require less than 30% of the processing time needed for interaction with the target DB. • H-provenance requires more time to process inserts than copies. • H-provenance treats deletes in the same way as naïve provenance. • T-provenance: inserts and copies run essentially instantaneously, because no interaction with the target database or provenance store is needed. • Figure 10: The overhead of provenance tracking per operation as a percentage of the time to perform each basic operation.

Third Experiment • Measured the effects of deletes on provenance storage. HT-provenance stores the fewest records among the approaches for each update pattern. Figure 11: The effect of deletion on the provenance store. The notation (ac) indicates provenance table size when only add and copy operations are performed while (acd) includes deletes.

Fourth Experiment Time to process a commit grows approximately linearly with transaction length. Figure 12: The effect of transaction size on provenance processing time.

Fifth Experiment • Measured the time needed to perform basic provenance queries. • All three queries ran fastest with transactional provenance. • Figure 13: The time needed to perform basic provenance queries.

Conclusions • The experimental results affirm that provenance can be tracked and managed efficiently using our approach. • This is a promising first step towards providing powerful, general-purpose tools that will make life easier for scientific data curators and increase the reliability and transparency of the scientific record.

2nd paper • “How to cite curated databases and how to make them citable” • Peter Buneman

What is the problem being addressed in the paper? • The importance of citing databases, which means citing something that: • Has internal structure. • Evolves over time. • The paper proposes a stable citation system for IUPHAR. • It describes: • How to publish the database in a form that can be cited. • How to ensure that the citations remain valid. • How to generate and validate the citations automatically.

Preliminaries (1/4) • Citations are used to identify the source material and provide some additional information. • Example: Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001. • This is much more than we need to identify the work. • Either of the following is sufficient: “Bard JB and Davies JA. Development, Databases and the Internet.” OR “Bioessays 17:999-1001”.

Preliminaries (1/4) • Citations are used to identify the source material and provide some additional information. • Example: the citations “Ann. Phys., Lpz 18 639-641” and “Nature, 171, 737-738”, while adequate for identification, hardly convey the importance of these publications.

Preliminaries (2/4) • A citation does not give us a specific mechanism for retrieving a document. • It is useful to find what we are looking for. • It is a structure that can be used by a variety of mechanisms such as online indexes and search engines. • A citation consists of two kinds of information. Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001. Location
