Project Omniglean
Kenny Trytek, Joe Briggie, Abby Birkett, Derek Woods
Advisor: Simanta Mitra
Client: Matt Good, Kingland Systems
Problem Statement
• Large companies have many layers of corporate hierarchy.
• Financial and data records sometimes conflict between the various layers and entities.
• Accurate and comprehensive company records are needed for auditing and stock conflict resolution.
• There is a need for “Data Mastering”: taking multiple conflicting sources of data and determining the reality of the matter.
Basic Requirements
• System shall autonomously traverse publicly available websites and collect information
• System shall store parsed information in a flat file
• System shall maintain a normalized database
• System shall expose functionality through web services
• A single run of the system shall complete execution in less than six hours
Design Decisions
• Implementation in C#
• ASP.NET GUI with jQuery UI widgets
• Operable in a Windows environment (XP or later)
Risks
• Site data structures or hierarchies can change at any time
• Reliance on third-party PDF text parser, grid control, and AJAX library
• Inconsistencies in data
System Diagram
[Figure: system diagram. Component labels: WWW Data, Scraper Tool, HTML Parser, PDF Parser, Flat File, ETL Tool, DAL, Database, Normalized, Web Svcs., External Client UI, Kingland Data Analyst UI. Other labels: Create, Read, Update, Delete; No Conflicts?]
Harvester Module
[Figure: World Wide Web, Scraper, Parser (HTML Parser, PDF Parser), Flat File (XML)]
• The harvester performs the work of gathering data from the external sites
• After the data is scraped and parsed, the harvester constructs XML files for each data source
• Finally, the ETL is notified that the data is ready
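The flat-file step above can be illustrated with a minimal sketch. The project itself was written in C#; Python is used here only for brevity, and the element names (`source`, `entity`), the function `build_flat_file`, and the sample record are hypothetical, not the team's actual schema.

```python
import xml.etree.ElementTree as ET


def build_flat_file(source_name, records):
    """Return an XML string holding every parsed record for one data source.

    Assumption: each record is a dict whose keys are valid XML element names.
    """
    root = ET.Element("source", attrib={"name": source_name})
    for record in records:
        entity = ET.SubElement(root, "entity")
        for field, value in record.items():
            # ElementTree escapes &, < and > in text when it serializes.
            ET.SubElement(entity, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, for illustration only.
xml_text = build_flat_file(
    "example-registry",
    [{"name": "Acme & Sons", "city": "Des Moines"}],
)
```

Writing one file per data source keeps the harvester's output independent of the database schema, which fits the deck's separation between harvester and ETL.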
Harvester Difficulties
• Constructing a POST request to retrieve the PDFs required extracting a complex view state
• Difficult to extract text from PDF
• Inconsistencies in extracted text
• City names were occasionally malformed
• Extra formatting characters were present in extracted text
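The view-state difficulty can be sketched as follows. This assumes the target site uses ASP.NET-style hidden form fields such as `__VIEWSTATE`, which the slides do not confirm; the class `HiddenFieldParser`, the function `build_post_body`, the `docId` parameter, and the sample page are all hypothetical. The project was written in C#; Python is used here only as a compact illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of every hidden <input> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Echo the page's hidden fields back, plus the fields for this request."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    form.update(extra_fields)
    return urlencode(form)


# Hypothetical page fragment and request parameter.
sample_page = (
    '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMTA4/abc=" />'
    '<input type="text" name="q" /></form>'
)
body = build_post_body(sample_page, {"docId": "42"})
```

The resulting `body` string would be sent as the payload of the POST request; only hidden fields are echoed back, so visible inputs such as `q` are left out unless passed explicitly.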
ETL (Extract, Transform, Load)
[Figure: Flat File (XML), ETL Tool, DAL]
• The ETL performs cleanup operations on the data from the harvester
• If there are malformed tags or invalid characters, they are escaped here
• Maintains an error log
• Loads data into the database through the DAL (Data Access Layer)
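A minimal sketch of the cleanup step, under stated assumptions: the slides do not say which characters counted as invalid or how they were handled, so this version drops control characters that XML 1.0 forbids, collapses stray whitespace, escapes markup characters, and logs a warning. The names `clean_value` and `_INVALID_XML_CHARS` are hypothetical, and Python stands in for the project's C#.

```python
import logging
import re
from xml.sax.saxutils import escape

log = logging.getLogger("etl")

# Control characters that XML 1.0 does not allow (tab, LF and CR are legal).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_value(raw):
    """Return raw text with invalid characters removed and markup escaped."""
    without_controls = _INVALID_XML_CHARS.sub("", raw)
    if without_controls != raw:
        # Stand-in for the ETL's error log.
        log.warning(
            "removed %d invalid character(s)", len(raw) - len(without_controls)
        )
    # Collapse runs of whitespace left behind by PDF text extraction.
    collapsed = " ".join(without_controls.split())
    # escape() replaces &, < and > with XML entities.
    return escape(collapsed)


cleaned = clean_value("Des\x0c  Moines <HQ> & Co")
```

Here `cleaned` is `Des Moines &lt;HQ&gt; &amp; Co`: the form feed is dropped, the double space is collapsed, and the markup characters are escaped.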
ETL Difficulties
• Implementing multi-threaded execution for better performance
• Dealing with malformed input
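One common way to parallelize a per-record transform is a thread pool. The sketch below is illustrative only: the slides do not describe the team's threading design, the functions `transform` and `transform_all` are hypothetical, and Python stands in for the project's C#. In Python, threads mainly help when the work is I/O-bound, such as database writes.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record cleanup: trim whitespace from every field."""
    return {key: value.strip() for key, value in record.items()}


def transform_all(records, workers=4):
    """Transform records concurrently and return them in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # regardless of which worker finishes first.
        return list(pool.map(transform, records))


results = transform_all([{"city": " Ames "}, {"city": "Des Moines\n"}])
```

Keeping each record's transform free of shared state, as here, avoids the need for locks.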
DAL (Data Access Layer)
[Figure: Database; DAL with Add(), Find(), Update(), Delete(); ETL Tool; User Interface]
• Maintains a normalized MySQL database
• Provides CRUD operations (Create, Read, Update, Delete)
DAL Difficulties
• No particular difficulties encountered in database creation
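The four DAL operations named on the slide (Add, Find, Update, Delete) can be sketched as a small class. This is an assumption-laden illustration: it uses Python's built-in `sqlite3` with an in-memory database in place of MySQL, and the `company` table, its columns, and the `CompanyDal` class are hypothetical rather than the project's schema.

```python
import sqlite3


class CompanyDal:
    """CRUD wrapper around a single hypothetical `company` table."""

    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS company ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
        )

    def add(self, name, city):
        cur = self.conn.execute(
            "INSERT INTO company (name, city) VALUES (?, ?)", (name, city)
        )
        self.conn.commit()
        return cur.lastrowid

    def find(self, company_id):
        # Returns an (id, name, city) tuple, or None if no row matches.
        return self.conn.execute(
            "SELECT id, name, city FROM company WHERE id = ?", (company_id,)
        ).fetchone()

    def update(self, company_id, name, city):
        cur = self.conn.execute(
            "UPDATE company SET name = ?, city = ? WHERE id = ?",
            (name, city, company_id),
        )
        self.conn.commit()
        return cur.rowcount == 1

    def delete(self, company_id):
        cur = self.conn.execute(
            "DELETE FROM company WHERE id = ?", (company_id,)
        )
        self.conn.commit()
        return cur.rowcount == 1


dal = CompanyDal(sqlite3.connect(":memory:"))
new_id = dal.add("Acme", "Ames")
```

Parameterized queries (the `?` placeholders) keep scraped text from being interpreted as SQL, which matters when the input comes from external websites.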
Web Services
[Figure: Services with Read(), Write(), Update(), Progress(), Delete()]
• Expose the DAL for access from external web apps
• Accessed by HTTP GET or POST requests
• Return JSON objects containing data
Web Services Difficulties
• Returning large JSON objects to the UI
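One common way to ease the large-response problem is to return results a page at a time. The slides do not say how the team handled it, so the sketch below is only one possible approach; the function `page_response` and the field names `page`, `pageSize`, `total`, and `rows` are hypothetical, and Python stands in for the project's C#.

```python
import json


def page_response(rows, page, page_size=100):
    """Return one page of rows as a JSON string, with paging metadata."""
    start = page * page_size  # pages are numbered from zero
    chunk = rows[start:start + page_size]
    return json.dumps(
        {"page": page, "pageSize": page_size, "total": len(rows), "rows": chunk}
    )


# Third page (index 2) of 250 hypothetical rows: holds the last 50.
payload = json.loads(page_response(list(range(250)), 2, 100))
```

Including `total` in each response lets a grid control show the overall row count and request further pages only when the user scrolls or pages forward.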
GUI Difficulties
• Implementing auto-complete functionality for query efficiency
• Progress bar updates
• Grid configuration and updating
• Retrieving large amounts of data from web services
Overall Test Plan
• Test each module individually to ensure independent functionality
• As modules are completed, test integration pairs to ensure channel adequacy
• When all modules are integrated, test the system end-to-end using the web app
Harvester / Parser Test Plan
• Ensure harvester can connect to site for scraping and retrieve the appropriate data
• Maintain a list of input files that produce specific output after parsing
• Define corner cases for sub-function robustness evaluation / testing
• Ensure errors are caught and handled appropriately
ETL Test Plan
• Maintain a list of input files that produce specific output after data cleanup
• Ensure errors are caught and handled appropriately
• Confirm ETL can talk to DAL
DAL Test Plan
• Ensure database can have records created, read, updated, and deleted
• Define corner cases and error handling for invalid database operations
• Create list of operations with expected results
Web Services Test Plan
• Call each web service with expected input and check return values
• Call web services with invalid input and check return values
Project Future
• Database model can be generalized to include any number of data sources
• Harvester can be separated from ETL so additional data sources will not require ETL changes
• Optimization / multithreading of harvester and parser for greater efficiency
• User access control features in web application
• Two separate GUIs: one for Kingland clients, and one for Kingland data analysts