This presentation focuses on the importance of effective data management and integration through the use of MetriDoc, a tool developed to address challenges faced by legacy solutions like Datafarm. Presented by Joe Zucca and Tommy Barker, the session explores how MetriDoc addresses issues such as maintainability, shareability, and reusability of data. By utilizing open-source technologies like Java, Groovy, and Apache Camel, MetriDoc simplifies data collection across disparate systems, enhances collaboration, and offers robust solutions for transforming and managing data effectively.
Where’s My Data? Using MetriDoc to manage data integration headaches
Joe Zucca – zucca@pobox.upenn.edu
Tommy Barker – tbarker@pobox.upenn.edu
The Problem
• The request seems simple, but the solution is complex
• The usual question is “who did / used x?”, which leads to other questions
  • Where’s the data?
  • What’s the grain of the answer?
• So how do we answer these questions?
  • If lucky, run a script / query against a database and generate a report
  • If not lucky, build an application to answer the question
• This is what MetriDoc is built for
Current Solution – Datafarm
Datafarm = Crontab + Perl + CGI = Spaghetti
[Diagram labels: Gate Count, Voyager, Blackboard, Ezproxy, DLA logs, Penn Community, Borrow Direct, COUNTER, Datafarm, App 1, App 2, App 3, App n]
Datafarm Shortcomings
• Maintainability issues
• Not shareable
• Not reusable
MetriDoc = Datafarm 2.0
• As our system grew, we began creating MetriDoc to address Datafarm’s problems
  • Needed a scheduler more sophisticated than cron
  • Needed languages more maintainable than Perl
  • Needed integration tools to simplify data gathering across disparate systems
• We built prototypes and services to help us evaluate technologies
• Received a grant from IMLS to speed up development
• Hired another programmer
MetriDoc Philosophy
• Keep it simple
  • Sometimes a script is all you need
  • Ease of use is more important than performance
• Don’t reinvent the wheel
• 100% open source
• Shareable data
MetriDoc – How it Works
• MetriDoc’s core is built around database schemas
• A MetriDoc implementation consists of loading tables and normalized tables
  • Loading tables prime the repository
    • The user is responsible for populating these tables
  • Normalized tables are built from the data in the loading tables
    • MetriDoc takes care of this
• Conforming to similar schemas provides interesting possibilities
  • Sharing data is easy
  • Sharing a single repository is easy (think Amazon Web Services)
  • Easier to collaborate
• From a user’s perspective
  • MetriDoc has tools to get your data into the loading tables
  • But ultimately you just need to get it in there, so you can use whatever tool you like
  • Use the MetriDoc tools to manage your integration needs
    • Useful for getting, transforming / resolving, moving and loading data
MetriDoc – Core Technologies
• JVM
  • Java is used for infrastructure
  • Groovy is the primary language
• Master Scheduler
  • Essentially the brains of MetriDoc
  • Using Hudson for now (http://hudson-ci.org/)
• Integration Tooling
  • Tooling built on top of Apache Camel (http://camel.apache.org/)
  • Helps move data from one place to another
  • Really helpful for batch processing
• Resolutions / Transformation Tools
  • Patron anonymization, text normalization, resource id to title resolutions, etc.
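As an illustration of the patron anonymization item above, here is a minimal Java sketch. The slides do not say how MetriDoc anonymizes patrons, so the approach (a keyed HMAC-SHA256 pseudonym), the class name `PatronAnonymizer`, and the demo secret are all assumptions made for the example. A keyed hash keeps the same patron mapped to the same token, so usage can still be counted per patron without storing the login name.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: replace a patron id with a stable keyed pseudonym. */
final class PatronAnonymizer {

    private static final String ALGORITHM = "HmacSHA256";
    private final SecretKeySpec key;

    /** The secret must be non-empty and kept out of the repository. */
    PatronAnonymizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALGORITHM);
    }

    /** Returns a 64-character lowercase hex token for the given patron id. */
    String pseudonym(String patronId) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            byte[] digest = mac.doFinal(patronId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(ALGORITHM + " is not available", e);
        }
    }

    public static void main(String[] args) {
        // Demo secret only; a real deployment would load it from protected config.
        PatronAnonymizer anonymizer =
                new PatronAnonymizer("demo-secret".getBytes(StandardCharsets.UTF_8));
        System.out.println(anonymizer.pseudonym("jsmith"));
    }
}
```

The same input always yields the same token for a given secret, which is what makes joins across normalized tables possible after the login name is dropped.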
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 1 – Fill the loading tables
[Diagram labels: Voyager, Ezproxy, COUNTER, Load Ezproxy, Load Counter, Load Patron Info, Hudson, Loading Tables]
Loading Tables
Example raw Ezproxy row (fields delimited by ||):
00.000.000.000||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01:44 -0500]||GET||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=SFX&_method=citationSearch&_volkey=0264410X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0||http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X&title=Vaccine&volume=29&issue=2&date=20101216&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J.5550217620101216aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma=10244330.1344196133.1295210953.1295404568.1295411821.9; __utmc=10244330; __utmz=10244330.1295411821.9.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv=10244330.|1=User-Type=Current%20Students=1,; __utma=94565761.447912360.1295320755.1295404584.1295411882.4; __utmc=94565761; __utmz=94565761.1295320755.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma=261680716.1522407254.1295392237.1295404624.1295412044.3; __utmc=261680716; __utmz=261680716.1295412044.3.3.utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d33-5139-4dbd-b94f-5d76b01ffbdc@sessionmgr13&k2=dGJyMPGtr0iyqbVIrOPfgeyk44Dt6fIA&k3=dGJyMOPY8Xvt&k4=ehost&k6=en&k7=live&k8=DS:live; __utmb=10244330.4.10.1295411821; __utmb=94565761.6.9.1295413021459; __utmb=261680716.1.10.1295412044; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL
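A row like the one above can be split into raw fields before it goes into a loading table. The following Java sketch is only an illustration: the class name `EzproxyRowSplitter` is hypothetical, the demo row is a shortened version of the sample, and the slides do not name the individual columns. The one detail taken from the sample is that fields are separated by a double pipe while cookie values contain single pipes, so the delimiter has to be matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: split one "||"-delimited Ezproxy row into raw fields. */
final class EzproxyRowSplitter {

    // Quote the delimiter so "|" is not read as regex alternation.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote("||"));

    /** Splits a row; the -1 limit keeps trailing empty fields. */
    static List<String> split(String row) {
        return Arrays.asList(DELIMITER.split(row, -1));
    }

    public static void main(String[] args) {
        // Shortened row in the same shape as the sample; cookie part is abbreviated.
        String row = "00.000.000.000||Philadelphia||PA||United States"
                + "||Default+datasets+documents+pwp+vanwert||jsmith"
                + "||[19/Jan/2011:00:01:44 -0500]||GET||302"
                + "||__utmz=1.utmcsr=google|utmccn=(organic)";
        List<String> fields = split(row);
        System.out.println(fields.size() + " fields; sixth field: " + fields.get(5));
    }
}
```

Each field stays an untyped string at this point, which matches the idea that loading tables only prime the repository and that typing happens during normalization.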
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 2 – Populate the normalized tables
[Diagram labels: Loading Tables, Normalize Ezproxy, Normalize Counter, Normalize Patron Info, Hudson, Repository]
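To make Step 2 concrete, here is a hedged Java sketch of what normalizing two raw Ezproxy fields might involve. The timestamp shape `[19/Jan/2011:00:01:44 -0500]` comes from the sample row earlier in the deck; the class name `EzproxyNormalizer`, the method names, and the choice of which fields to normalize are assumptions, not MetriDoc's actual code.

```java
import java.net.URI;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical sketch: turn raw loading-table strings into typed values. */
final class EzproxyNormalizer {

    // Matches timestamps such as 19/Jan/2011:00:01:44 -0500 (English month names).
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** Strips optional surrounding brackets and parses the timestamp with its offset. */
    static OffsetDateTime parseTimestamp(String raw) {
        String text = raw.trim();
        if (text.startsWith("[") && text.endsWith("]")) {
            text = text.substring(1, text.length() - 1);
        }
        return OffsetDateTime.parse(text, LOG_TIME);
    }

    /** Returns the lowercased host of a URL, or an empty string if it has none. */
    static String resourceHost(String rawUrl) {
        // URI.create throws IllegalArgumentException for malformed input.
        String host = URI.create(rawUrl).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("[19/Jan/2011:00:01:44 -0500]"));
        System.out.println(resourceHost("http://www.sciencedirect.com/science?_ob=GatewayURL"));
    }
}
```

Keeping the offset in the parsed value means rows logged in different time zones can still be compared once they sit in the same normalized table.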
Jenkins – Death to Cron
• Generally used for building software, but a fantastic cron replacement
• Can run arbitrary scripts locally and remotely
• Supports a master / slave distribution model seamlessly
• Can be managed entirely via REST
• Extensible
• Helps with job dependencies
• It is simple and free
• Active community with a huge collection of plugins
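As a small illustration of the REST point above, the Java sketch below builds the request that asks Jenkins to start a job, using Jenkins' remote access convention of a POST to `/job/<name>/build`. The server URL, the job name `load-ezproxy`, and the class name `JenkinsJobTrigger` are hypothetical, and the sketch leaves out authentication and CSRF crumb handling, which most real servers require. It only constructs the request; sending it would need `java.net.http.HttpClient` (Java 11 or later) and a reachable server.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch: build the REST call that triggers a Jenkins job. */
final class JenkinsJobTrigger {

    /** Builds a POST to {base}/job/{jobName}/build without sending it. */
    static HttpRequest buildTriggerRequest(String jenkinsBaseUrl, String jobName) {
        // Restrict names to characters that need no URL encoding (a simplification).
        if (!jobName.matches("[A-Za-z0-9_.-]+")) {
            throw new IllegalArgumentException("unsupported job name: " + jobName);
        }
        String base = jenkinsBaseUrl.endsWith("/")
                ? jenkinsBaseUrl.substring(0, jenkinsBaseUrl.length() - 1)
                : jenkinsBaseUrl;
        URI uri = URI.create(base + "/job/" + jobName + "/build");
        return HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and job name, chosen only for the example.
        HttpRequest request = buildTriggerRequest("http://jenkins.example.org/", "load-ezproxy");
        System.out.println(request.method() + " " + request.uri());
    }
}
```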