REGNET A Comparative Analysis Framework For Semi-Structured Documents, With Applications To Government Regulations Gloria Lau Engineering Informatics Group, Stanford University May 14th, 2004
Motivation (screenshots: ADAAG in HTML; UK DDA in HTML; IBC in PDF) • Multiple sources of regulations • Multiple jurisdictions: federal, state, local, etc. • Different formats, terminologies, contexts • Amending rules, conflicting ideas
Motivation • Multiple sources of regulations • Multiple jurisdictions: federal, state, local, etc. • Different formats, terminologies, contexts • Amending rules, conflicting ideas → Need for a repository • Locate relevant information • E.g., small business: penalty fees for violations → Need for analysis tool • Complexity of regulations • Multiple jurisdictions • Understanding of regulations & their relationships
Example 1: Related Provisions ADAAG Appendix 4.6.3 … Such a curb ramp opening must be located within the access aisle boundaries, not within the parking space boundaries. CBC 1129B.4.3 … Ramps shall not encroach into any parking space. Exception: 1. Ramps located at the front of accessible parking spaces may encroach into the length of such spaces … • CBC allows curb ramps encroaching into accessible parking stall access aisles, while ADA disallows encroachment into any portion of the stall.
Example 2: Related but Conflicting Provisions ADAAG 4.7.2 Slope. …Transitions from ramps to walks, gutters, or streets shall be flush and free of abrupt changes… CBC 1127B.5.5 Beveled lip. The lower end of each curb ramp shall have a ½ inch (13mm) lip beveled at 45 degrees as a detectable way-finding edge for persons with visual impairments. • ADAAG focuses on wheelchair traversal; CBC focuses on the visually impaired when using a cane.
Scope • Repository development • Relatedness analysis • Performance evaluation, results and applications
Sources of data • Accessibility standards • Americans with Disabilities Act Accessibility Guidelines (ADAAG) • Drafted chapter for rights-of-way access • Associated public comments • Uniform Federal Accessibility Standards (UFAS) • British Standard BS 8300 • Scottish Technical Standards, Part S • International Building Code (IBC), Chapter 11 • Drinking water standards • Code of Federal Regulations, Title 40 (40 CFR) • California Code of Regulations, Title 22 (22 CCR) • Fire code • International Building Code (IBC), Chapter 9
Computational properties of regulations • Hierarchical tree structure • Referential structure • Discipline-centered, e.g., ADAAG for accessibility → Shallow parser to capture computational properties
Digital publication of regulations • Current standard: HTML, PDF, plain text... • Our system standard: XML • Recreate regulatory structure • Unit of extraction: section/provision • Extract references • Extract features
<regulation id="ibc" name="international building code" type="private">
  <regElement id="ibc.1107" name="special occupancies">
    …
    <regElement id="ibc.1107.2" name="assembly area seating">
      <reference id="ibc.1107.2.4.1" times="1" />
      <concept name="assembl area" times="1" />
      …
      <regText>Assembly areas with fixed seating shall comply … </regText>
      <regElement id="ibc.1107.2.1" name="services">...</regElement>
      <regElement id="ibc.1107.2.2" name="wheelchair …">...</regElement>
    </regElement>
  </regElement>
</regulation>
Shallow parser: feature extraction • Combination of handcrafted rules and software tools • Generic features • Concepts - noun phrases • Exceptions - negated provisions • Definitions - terminologies defined in regulations • Domain-specific features • Non-structural characteristics specific to a corpus • To aid user retrieval of relevant materials • For analysis purpose: domain knowledge • Glossary terms - definitions from reference guides • Author-prescribed indices - concepts from field handbooks • Measurements - e.g., 2 inches max, 4 ppm • Chemicals - list of drinking water contaminants from EPA • Effective dates - provision updates
Example of indexTerm, concept, measurement & exception features
Original Section 4.6.3 from the UFAS
4.6.3* PARKING SPACES. Parking spaces for disabled people shall be at least 96 in (2440 mm) wide and shall have an adjacent access aisle 60 in (1525 mm) wide minimum (see Fig. 9). Parking access aisles shall be part of ... EXCEPTION: … an adjacent access aisle at least 96 in (2440 mm) wide complying with 4.5...
Refined Section 4.6.3 in XML format
<regElement name="ufas.4.6.3" title="parking spaces" asterisk="1">
  <concept name="access aisl" num="3" />
  …
  <indexTerm name="park space" num="4" />
  <measurement unit="inch" magnitude="96" quantifier="min" />
  <ref name="ufas.4.5" num="1" />
  <regText> Parking spaces for disabled people shall ... </regText>
  <exception> If accessible parking spaces for ... </exception>
</regElement>
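The refined XML can be read back into per-provision feature counts before any comparison is run. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `feature_counts` and the treatment of `num` as a frequency count are assumptions made for illustration, not part of the REGNET parser itself.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the UFAS 4.6.3 example, with elided text kept short.
XML = """
<regElement name="ufas.4.6.3" title="parking spaces" asterisk="1">
  <concept name="access aisl" num="3" />
  <indexTerm name="park space" num="4" />
  <measurement unit="inch" magnitude="96" quantifier="min" />
  <ref name="ufas.4.5" num="1" />
  <regText>Parking spaces for disabled people shall ...</regText>
  <exception>If accessible parking spaces for ...</exception>
</regElement>
"""


def feature_counts(element, tag):
    """Map feature name -> frequency for direct children with the given tag.

    Hypothetical helper: assumes the `num` attribute holds the frequency.
    """
    return {child.get("name"): int(child.get("num", "1"))
            for child in element.findall(tag)}


section = ET.fromstring(XML)
concepts = feature_counts(section, "concept")
index_terms = feature_counts(section, "indexTerm")
references = feature_counts(section, "ref")
measurements = [(m.get("magnitude"), m.get("unit"), m.get("quantifier"))
                for m in section.findall("measurement")]

print(section.get("name"), concepts, index_terms, references, measurements)
```

Each dictionary is a sparse feature vector for the provision, which is the form the relatedness analysis needs.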
Scope • Repository development • Relatedness analysis • Performance evaluation, results and applications
Relatedness analysis ADAAG 4.1.6(3)(d) Doors (i) Where it is technically infeasible to comply with clear opening width requirements of 4.13.5, a projection ... UFAS 4.14.1 Minimum Number Entrances required to be accessible by 4.1 shall be part of an accessible route and shall comply with ... Related elements: door and entrance
Relatedness analysis • To utilize the computational properties of regulations for a complete comparison • Measure • Degree of relatedness: similarity score $f(A, U) \in (0, 1)$ • Nodes A and U are provisions from two different regulation trees
Base score $f_0$ computation • Linear combination of feature matching: $f_0(A, U) = \sum_{i=1}^{N} \lambda_i \, F(A, U, i)$ • F(A,U,i) = similarity score between Sections (A,U) based on feature i • N = total number of features • $\lambda_i$ = weighting coefficient of feature i • Feature matching • Based on the vector model, using cosine similarity to compare feature vectors • Similarity between two documents M and N = $\frac{\vec{d}_M \cdot \vec{d}_N}{|\vec{d}_M| \, |\vec{d}_N|}$ • $\vec{d}_M$ and $\vec{d}_N$ are document vectors • E.g., i = concept feature • Concept vectors are formed per provision based on concept frequency in each provision • F(provision M, provision N, i=concept) = cosine between 2 concept vectors
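The base score can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the dictionary layout of the feature vectors, the sample stemmed concepts, and the equal weights are all assumptions, and the weights are assumed to sum to 1 so that the score stays within [0, 1].

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty feature vector matches nothing
    return dot / (norm_u * norm_v)


def base_score(features_a, features_u, weights):
    """Weighted sum of per-feature cosine similarities."""
    return sum(w * cosine(features_a.get(name, {}), features_u.get(name, {}))
               for name, w in weights.items())


# Hypothetical feature vectors for two provisions A and U.
section_a = {"concept": {"door": 2, "access rout": 1},
             "measurement": {"32 inch min": 1}}
section_u = {"concept": {"entranc": 1, "access rout": 2},
             "measurement": {"32 inch min": 1}}
weights = {"concept": 0.5, "measurement": 0.5}  # assumed feature weights

f0 = base_score(section_a, section_u, weights)
print(round(f0, 3))
```

Here the concept vectors share only "access rout", giving a cosine of 2/5, while the measurement vectors match exactly, so the weighted sum is 0.5 × 0.4 + 0.5 × 1.0.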
Axis dependency: non-Boolean matching • Vector model assumes mutual independence between axes • Domain experts do not necessarily agree • A measurement of “2 inches max” can be a 70% match to “2 inches” • Synonyms exist, e.g., ontology defined for chemicals • Limitation observed • Need flexibility to model domain knowledge, such as a 0, 50%, 75% and 100% measurement match:
Proposed non-Boolean matching model • Define a feature matching matrix E • $E_{ij}$ = % match between features i and j • E.g., a 3-dimensional vector space using “2 ppm”, “2 ppm max” and “2 ft” as the first, second and third measurement axes gives a 3 × 3 matrix E • Vector space transformation before cosine computation • Map feature vectors onto an alternate space to form consolidated frequency vectors • E.g., based on measurement features • Cosine similarity = $\frac{\vec{d}_M^{\,T} E \, \vec{d}_N}{\sqrt{\vec{d}_M^{\,T} E \, \vec{d}_M} \, \sqrt{\vec{d}_N^{\,T} E \, \vec{d}_N}}$
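A small numeric sketch shows the effect of the feature matching matrix. It evaluates the cosine in the transformed space as x^T E y over the square roots of x^T E x and y^T E y, which follows from requiring E = D^T D. The 70% entry between "2 ppm" and "2 ppm max" is a hypothetical value chosen for illustration, and the function names are assumptions.

```python
import math


def quad(x, E, y):
    """Compute x^T E y for list-based vectors and a square matrix E."""
    return sum(x[i] * E[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def matched_cosine(x, y, E):
    """Cosine similarity after the transformation implied by E = D^T D."""
    denom = math.sqrt(quad(x, E, x)) * math.sqrt(quad(y, E, y))
    return quad(x, E, y) / denom if denom else 0.0


# Axes: "2 ppm", "2 ppm max", "2 ft". The 0.7 entries are hypothetical.
E = [[1.0, 0.7, 0.0],
     [0.7, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

m = [1, 0, 0]  # provision mentioning "2 ppm" once
n = [0, 1, 0]  # provision mentioning "2 ppm max" once

boolean_match = matched_cosine(m, n, IDENTITY)  # plain vector model
partial_match = matched_cosine(m, n, E)         # non-Boolean matching
print(boolean_match, partial_match)
```

With the identity matrix the two provisions look unrelated; with E they score the 70% partial match that a domain expert might assign.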
Score refinements based on regulation structure • Neighbor inclusion • Diffusion of similarity between clusters of nodes in the tree • Self vs. parent-sibling-child (psc), $f_{s\text{-}psc}$ • psc vs. psc, $f_{psc\text{-}psc}$
Neighbor inclusion: psc vs. psc • Take a linear combination of neighboring pair scores • Formulate a neighbor structure matrix N • Define score matrix $\Pi$ • We have $\Pi_{psc\text{-}psc} = N_A \Pi_0 N_U^T$
Neighbor inclusion: self vs. psc • Take a linear combination of neighbor vs. self scores • Formulate a neighbor structure matrix N • Define score matrix $\Pi$ • We have $\Pi_{s\text{-}psc} = \frac{1}{2} (\Pi_0 N_U^T + N_A \Pi_0)$
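Both neighbor-inclusion products can be checked on a toy case. The sketch below assumes two regulations with two sections each, where each section's only neighbor is the other section of the same regulation. The base scores are made-up numbers, the names are illustrative, and the matrices are plain nested lists so that no external library is needed.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def psc_vs_psc(N_a, base, N_u):
    """Neighbors of A against neighbors of U: N_a * base * N_u^T."""
    return matmul(matmul(N_a, base), transpose(N_u))


def self_vs_psc(N_a, base, N_u):
    """Average of self-vs-neighbor scores taken from both sides."""
    left = matmul(base, transpose(N_u))   # A itself vs. neighbors of U
    right = matmul(N_a, base)             # neighbors of A vs. U itself
    return [[0.5 * (l + r) for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]


# Hypothetical inputs: rows index sections of A, columns sections of U.
N_a = [[0.0, 1.0], [1.0, 0.0]]
N_u = [[0.0, 1.0], [1.0, 0.0]]
base = [[0.2, 0.8], [0.4, 0.6]]

pp = psc_vs_psc(N_a, base, N_u)
sp = self_vs_psc(N_a, base, N_u)
print(pp)
print(sp)
```

In this toy case the psc-vs-psc score for the first pair of sections equals the base score of their neighbors (0.6), which is the diffusion effect the slide describes.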
Score refinements based on regulation structure • Reference distribution • Diffusion of similarity between referencing nodes and referenced nodes in the tree • E.g., f(A5.3, U6.4(a)) updates f(A2.1, U3.3)
Reference distribution: s-ref and ref-ref • Take a linear combination of reference vs. self and reference vs. reference scores • Formulate a reference structure matrix R • Define score matrix $\Pi$ • We have $\Pi_{ref\text{-}ref} = R_A \Pi_0 R_U^T$ and $\Pi_{s\text{-}ref} = \frac{1}{2} (\Pi_0 R_U^T + R_A \Pi_0)$
Final score: linear combination of $\Pi$'s • $\alpha$ = structural weighting coefficient
Scope • Repository development • Relatedness analysis • Performance evaluation, results and applications
Performance evaluation • Conduct a user survey of rankings of similarity • 10 randomly chosen sections from the ADAAG and UFAS • Ranks 1 to 100 in the order of relevance • Root mean square error: $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (u_k - m_k)^2}$, with n ranked entries • $\vec{u}$ = user-generated ranking vector • $\vec{m}$ = machine-predicted ranking vector
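The error measure can be sketched as follows. This uses the standard root mean square error between two equal-length ranking vectors; the exact normalization used in the survey is not recoverable from the slide, and the sample rankings are invented for illustration.

```python
import math


def rmse(user_ranks, machine_ranks):
    """Root mean square error between two equal-length ranking vectors."""
    if len(user_ranks) != len(machine_ranks):
        raise ValueError("ranking vectors must have the same length")
    if not user_ranks:
        raise ValueError("ranking vectors must not be empty")
    squared = sum((u - m) ** 2 for u, m in zip(user_ranks, machine_ranks))
    return math.sqrt(squared / len(user_ranks))


# Hypothetical rankings of four section pairs.
user = [1, 2, 3, 4]
machine = [2, 1, 3, 4]  # the top two pairs are swapped

error = rmse(user, machine)
print(round(error, 4))
```

A lower value means the machine-predicted ordering is closer to the ordering the survey participants produced.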
Survey results - Tabulated RMSE’s • Compared our analysis to Latent Semantic Indexing (LSI) • $\alpha$ = structural weighting coefficient • $\lambda$ = feature weighting coefficient • Average RMSE smaller than LSI • Measurement feature performs best • No improvement in result observed for structural comparison
Results of comparisons: ADAAG vs. UFAS • Related accessible elements: door and entrance • No ontological information • Neighbor inclusion reveals higher similarity • Content of neighbors implies similarity between Section 4.1.6(3)(d) in ADAAG and Section 4.14.1 in UFAS
Results of comparisons: UFAS vs. BS8300 • Terminological differences - revealed through neighbor inclusion
Results of comparisons: 40CFRdw vs. 22CCRdw • Top ranked: Almost identical provisions, change of enforcing agency
Results of comparisons: 40CFRdw vs. 22CCRdw • Use of ontological information • 40 CFR uses chemical acronyms, e.g., TTHM • 22 CCR spells out “total trihalomethanes”
Application: e-rulemaking • Application domain: e-rulemaking • Comparison between draft of rules and the associated public comments • ADAAG Chapter 11, rights-of-way draft • Less than 15 pages • Over 1400 public comments received within 4 months • Comments ~10 MB in size; most are several pages long → A new regulation draft can easily generate a huge amount of data that needs to be reviewed and analyzed • Parsing of the draft and comments • From HTML to XML • Recreate structure of the draft using our shallow parser • Extract features from the draft and comments • Treat individual comments as provisions
E-rulemaking Drafted regulations compared with public comments
Results from e-rulemaking application • Related section in draft and public comment
Results from e-rulemaking application • No related provisions identified • Concern not addressed in the draft
Contributions • A framework for regulatory repository • Structure of regulations recreated in XML • Feature extractions • Prototype for similarity comparisons • Contextual comparisons • Domain knowledge • Structural comparisons • Performance Evaluation, Results and Applications • User survey and comparisons with LSI • Observations of comparisons between Federal, State, non-profit organization mandated codes and European standards • Accessibility • Drinking water control • Application on e-rulemaking
Future research directions • In the legal domain • Regulatory competition • Cross border data transfer laws • Especially in the polyglot countries in EU • Regulatory updates • Track changes in updates • Track cross references between regulations • Extension of application to other domains of semi-structured documents • Software specifications • User manuals • Similarity/relatedness is settled - how about differences and conflicts? • Drinking water example of almost identical provisions
Acknowledgments • Committee members • Prof. Kincho Law • Prof. Gio Wiederhold • Prof. Hans Bjornsson • Prof. Cary Coglianese • Prof. Hector Garcia-Molina, defense chair • Family, friends and everyone in the Engineering Informatics Group • Especially REGNET/REGBASE project members • This research is sponsored by the National Science Foundation
Semantics of relatedness/similarity • Similar: having characteristics in common; strictly comparable; alike in substance or essentials; not differing in shape but only in size or position. • Related: connected by reason of an established or discoverable relation. → Similarity is not static; it can depend on one’s viewpoint and desired outcome. • “Related” provisions are more interesting, e.g., the conflicting cases • Traditionally, it is called a “similarity score”.
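Cosine similarity • A document is represented as an n-entry vector $\vec{d}_M = (w_{1,M}, w_{2,M}, \ldots, w_{n,M})$, where n is the total number of index terms in the corpus. • Similarity between two documents = $\frac{\vec{d}_M \cdot \vec{d}_N}{|\vec{d}_M| \, |\vec{d}_N|} = \frac{\sum_{i=1}^{n} w_{i,M} \, w_{i,N}}{\sqrt{\sum_{i=1}^{n} w_{i,M}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,N}^2}}$ • E.g., we take the frequency count of concept i as the concept weight $w_{i,M}$ in $\vec{d}_M = (w_{1,M}, w_{2,M}, \ldots, w_{n,M})$.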
Example of feature vectors • Traditional term match • Each index term i is assigned a positive and non-binary weight $w_{i,M}$ in each document vector $\vec{d}_M$ • Weight selection • Frequency of term, or • tf × idf model • tf = term frequency; term density • idf = inverse document frequency = $\log(n / n_i)$; term rarity • Excluding stopwords
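The tf × idf weighting can be sketched as below. Several choices here are assumptions rather than details from the slides: tf is the raw term count, n is taken as the number of documents and n_i as the number of documents containing term i, tokens are split on whitespace, and the stopword list is a tiny hypothetical one.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "shall", "be", "and", "to", "in"}  # hypothetical


def tf_idf_vectors(documents):
    """Return one sparse tf x idf vector (dict) per document."""
    tokenized = [[t for t in doc.lower().split() if t not in STOPWORDS]
                 for doc in documents]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors


docs = ["Ramps shall not encroach",
        "Curb ramps shall be flush",
        "Doors shall be accessible"]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

Under this formula a term that appears in every document receives a weight of zero, so rare terms such as "encroach" dominate the vector while common ones contribute little.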
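Vector space transformation • Define D such that $E = D^T D$ is fulfilled • Cosine between the consolidated frequency vectors $D\vec{d}_M$ and $D\vec{d}_N$: $\cos\theta = \frac{(D\vec{d}_M) \cdot (D\vec{d}_N)}{|D\vec{d}_M| \, |D\vec{d}_N|} = \frac{(D\vec{d}_M)^T (D\vec{d}_N)}{\sqrt{(D\vec{d}_M)^T (D\vec{d}_M)} \, \sqrt{(D\vec{d}_N)^T (D\vec{d}_N)}} = \frac{\vec{d}_M^{\,T} D^T D \, \vec{d}_N}{\sqrt{\vec{d}_M^{\,T} D^T D \, \vec{d}_M} \, \sqrt{\vec{d}_N^{\,T} D^T D \, \vec{d}_N}} = \frac{\vec{d}_M^{\,T} E \, \vec{d}_N}{\sqrt{\vec{d}_M^{\,T} E \, \vec{d}_M} \, \sqrt{\vec{d}_N^{\,T} E \, \vec{d}_N}}$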
Boundary case: reduced space • Measurements i and j are synonyms • The following vectors should return the same answer
Neighbor inclusion • Neighbor structure matrix formulation N • Each Section i corresponds to row i and column i of N • Entry $N_{ij}$ is 0 if $i \notin psc(j)$ • For $j \in psc(i)$, entry $N_{ij}$ is 1/k where k is the total number of neighbors of i
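The neighbor structure matrix can be built directly from the regulation tree. The sketch below assumes the tree is given as a mapping from each section id to its parent id, with the root mapped to `None` and every parent also present as a key. The function name and the four-section sample tree are hypothetical.

```python
def neighbor_matrix(parent):
    """Build the psc matrix: row i spreads weight 1/k over i's k neighbors.

    `parent` maps section id -> parent id (None for the root).
    Returns the sorted section ids and the matrix as nested lists.
    """
    ids = sorted(parent)
    index = {section: pos for pos, section in enumerate(ids)}
    children = {section: [] for section in ids}
    for section, par in parent.items():
        if par is not None:
            children[par].append(section)

    N = [[0.0] * len(ids) for _ in ids]
    for section in ids:
        par = parent[section]
        psc = set(children[section])                  # children
        if par is not None:
            psc.add(par)                              # parent
            psc.update(c for c in children[par] if c != section)  # siblings
        for other in psc:
            N[index[section]][index[other]] = 1.0 / len(psc)
    return ids, N


# Hypothetical regulation tree with four sections.
tree = {"4": None, "4.1": "4", "4.2": "4", "4.1.1": "4.1"}
ids, N = neighbor_matrix(tree)
for section, row in zip(ids, N):
    print(section, row)
```

Each nonempty row sums to 1, so multiplying by N averages the scores over a section's parent, siblings and children.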
Matrix representation • Take the average scores of the neighboring pairs • Define • $\Pi$ = similarity scores between two regulations M and N • $\Pi_{ij}$ = similarity score between Section i from regulation M and Section j from regulation N → We have $\Pi_{psc\text{-}psc} = N_A \Pi_0 N_U^T$ and $\Pi_{s\text{-}psc} = \frac{1}{2} (\Pi_0 N_U^T + N_A \Pi_0)$