Big Data Management and Data Quality: Countermeasures in the US Financial Industry
汪时奇 (Ph.D.)
Processing speed · Capacity limits · Data quality
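Overview
• 数据 (the Chinese term, literally "numeric data") <= Data = information (not just a collection of numbers)
• Data science ≈ information science
• Why study big data?
  • Because prices of related products (e.g. hard disks, memory, CPUs) are falling exponentially
  • Because of the information explosion
  • Because big data causes many new problems
• Big data research is a synthesis of multiple disciplines (IT, DM, BI, BA, …)
• Industry countermeasures to big data problems (see below)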
1. Database strategy
• 1.1 Database (DB) performance
• 1.2 DB space
1.1 DB performance
• Auditing: 2 tables, a small active one and a huge passive one
• Partition
• Index (good/bad; Cluster; Global/Local)
• Lock type (when to apply row locks)
• Transaction: 1-phase or 2-phase commit
• Normalization
• Internal optimization (e.g. execution plan => hints in Oracle)
• Use constraints (e.g. CHECK) in place of triggers
• Tricks (e.g. date functions; search the small table first; …)
1.2 DB space
• Space arrangement for even distribution (e.g. 1 huge table spread over a few data files)
• Cleanup procedure with defragmentation
• Partition design with a cleanup plan
2. Applications (software; Java example)
• Using a high-level language (e.g. Java or C#)
• 2.1 Memory
• 2.2 Disk/network space
• 2.3 Performance
• 2.4 Maintainability
2.1 Memory
• Minimize the creation and coexistence of big objects
• Garbage-collect (GC) or null out big objects once they go out of scope
  • Choose an appropriate GC type
  • gc()
• Try to split one big object into smaller objects
• Use a mutable class for frequently changed big objects (e.g. StringBuilder instead of String)
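The StringBuilder advice can be illustrated with a minimal sketch. The class and method names (ReportBuilder, joinWithString, joinWithBuilder) are hypothetical and chosen only for this example; both methods produce the same text, but the first allocates a fresh String on every iteration while the second appends to one mutable buffer.

```java
// Sketch only: ReportBuilder and its methods are illustrative names, not from the slides.
final class ReportBuilder {

    // Immutable variant: every += creates a new String and copies all earlier characters.
    static String joinWithString(int rows) {
        String out = "";
        for (int i = 0; i < rows; i++) {
            out += "row" + i + "\n";
        }
        return out;
    }

    // Mutable variant: a single buffer is extended in place.
    static String joinWithBuilder(int rows) {
        StringBuilder out = new StringBuilder(rows * 8); // rough initial capacity guess
        for (int i = 0; i < rows; i++) {
            out.append("row").append(i).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinWithBuilder(3));
    }
}
```

For a handful of rows the difference is negligible; it matters when the loop runs many thousands of times over a large text.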
2.2 Disk/network space
• Smart cleanup and archive processes
  • e.g. archive zipped old or unused files to low-speed network storage and delete very old files from that storage
• Smart logging settings
  • e.g. log4j size-based rolling
  • e.g. avoid duplicated or trivial logging info
• Monitor space usage
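The archive-then-purge idea above could look roughly like the following sketch. It assumes the archive location is simply another directory (for instance a mounted low-speed share) and that file age is judged by last-modified time; LogArchiver, its method names, and the thresholds passed in are all hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.GZIPOutputStream;

// Sketch only: names and the age-by-mtime rule are assumptions for illustration.
final class LogArchiver {

    // Gzips files older than archiveAfter into archiveDir, then removes the originals.
    static int archiveOldFiles(Path liveDir, Path archiveDir, Duration archiveAfter, Instant now)
            throws IOException {
        Files.createDirectories(archiveDir);
        int moved = 0;
        for (Path file : listRegularFiles(liveDir)) {
            FileTime modified = Files.getLastModifiedTime(file);
            if (modified.toInstant().isBefore(now.minus(archiveAfter))) {
                Path target = archiveDir.resolve(file.getFileName() + ".gz");
                try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
                    Files.copy(file, out);
                }
                Files.setLastModifiedTime(target, modified); // keep the original age for later purging
                Files.delete(file);
                moved++;
            }
        }
        return moved;
    }

    // Deletes archived files older than deleteAfter.
    static int purgeVeryOldFiles(Path archiveDir, Duration deleteAfter, Instant now)
            throws IOException {
        int deleted = 0;
        for (Path file : listRegularFiles(archiveDir)) {
            if (Files.getLastModifiedTime(file).toInstant().isBefore(now.minus(deleteAfter))) {
                Files.delete(file);
                deleted++;
            }
        }
        return deleted;
    }

    private static List<Path> listRegularFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }
}
```

Passing `now` in as a parameter keeps the two steps testable and lets a scheduler run both with one consistent cut-off time.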
2.3 Performance
• Avoid redundant processing (in big loops)
• Maximize reuse
• Multi-threading
• DB access
• Logging: avoid slow options (e.g. line numbers)
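One common case of redundant work in a big loop is recompiling a regular expression on every iteration, which is what `String.matches` does internally. The sketch below hoists the compiled `Pattern` out of the loop so it is reused. The class name LoopReuse is hypothetical, and the pattern checks only the general shape of an ISIN (two letters, nine alphanumerics, one digit), not its check digit.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: LoopReuse is an illustrative name; the pattern is a shape check, not full ISIN validation.
final class LoopReuse {

    // Compiled once and reused for every element.
    private static final Pattern ISIN_SHAPE = Pattern.compile("[A-Z]{2}[A-Z0-9]{9}[0-9]");

    static long countWellFormed(List<String> ids) {
        long count = 0;
        for (String id : ids) {
            // id.matches("...") here would recompile the pattern on each iteration.
            if (ISIN_SHAPE.matcher(id).matches()) {
                count++;
            }
        }
        return count;
    }
}
```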
2.4 Maintainability
• SOA principles
  • Loose coupling, reusability, granularity, modularity, composability, componentization, interoperability, …
• JEE patterns (DAO, DTO, Business Delegate, …)
• Design patterns (23) and MVC
  • Creational
  • Structural
  • Behavioral (e.g. Visitor)
• OOP principles
  • Abstraction, encapsulation, polymorphism, …
  • Open/Closed
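As an illustration of the Visitor pattern named above, here is a minimal sketch. The instrument types (Bond, Equity) and the DescribeVisitor are hypothetical and not taken from the slides; the point is that a new operation can be added as a new visitor class without modifying the instrument classes, which is also one way to respect the Open/Closed principle.

```java
// Sketch only: all type names are illustrative assumptions.
interface InstrumentVisitor<R> {
    R visitBond(Bond bond);
    R visitEquity(Equity equity);
}

interface Instrument {
    <R> R accept(InstrumentVisitor<R> visitor);
}

final class Bond implements Instrument {
    final String isin;
    final double couponRate;

    Bond(String isin, double couponRate) {
        this.isin = isin;
        this.couponRate = couponRate;
    }

    @Override
    public <R> R accept(InstrumentVisitor<R> visitor) {
        return visitor.visitBond(this);
    }
}

final class Equity implements Instrument {
    final String ticker;

    Equity(String ticker) {
        this.ticker = ticker;
    }

    @Override
    public <R> R accept(InstrumentVisitor<R> visitor) {
        return visitor.visitEquity(this);
    }
}

// One operation over all instrument types; further operations become further visitors.
final class DescribeVisitor implements InstrumentVisitor<String> {
    @Override
    public String visitBond(Bond bond) {
        return "Bond " + bond.isin + " coupon " + bond.couponRate;
    }

    @Override
    public String visitEquity(Equity equity) {
        return "Equity " + equity.ticker;
    }
}
```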
3. Data quality control
• 3.1 Business
• 3.2 Process
  A. Failover & DR (Disaster Recovery)
  B. QA (Quality Assurance) (see <软件质量管理点滴> ("Notes on Software Quality Management") for details)
  C. UAT (User Acceptance Test)
• 3.3 Technology
3.1 Business
• Reduce manual work; increase automation
• Complete the approval system for manual work
  • e.g. 1 level => 2 or 3 levels of approval
• Add more viewpoints to confirm data quality
• Reduce redundant systems (e.g. those left over from mergers or from vendors)
• Schedule Cleansing (see 3.1.E)
• Enhance Reconciliation (see 3.1.F)
• Build Trust level (see 3.1.G)
• Try to cover all rare cases
3.1.E Cleansing
• When
  • At a system merge
  • At a major change
• How
  • Develop detection applications
  • Deliver mismatch reports to IT & business
  • Find solutions on both the IT and business sides
3.1.F Reconciliation
• Where
  • More than one subsystem holds data for the same content.
  • More than one subsystem has independent data change functionality.
• What
  • Run and improve the reconciliation application routinely.
  • Categorize reports by urgency.
  • Analyze reports.
  • Debug, adjust the business rule, or apply Cleansing.
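A reconciliation application of the kind described could, at its core, compare the same keyed data from two subsystems and report every difference. The sketch below assumes each subsystem's data has already been loaded into a map from record key to value; Reconciler, Kind, and Mismatch are hypothetical names. It classifies mismatches by kind only; mapping kinds to urgency levels would be a business rule that the slides do not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: names and the map-based input are assumptions for illustration.
final class Reconciler {

    enum Kind { MISSING_IN_A, MISSING_IN_B, VALUE_MISMATCH }

    static final class Mismatch {
        final String key;
        final Kind kind;
        final String valueA;
        final String valueB;

        Mismatch(String key, Kind kind, String valueA, String valueB) {
            this.key = key;
            this.kind = kind;
            this.valueA = valueA;
            this.valueB = valueB;
        }

        @Override
        public String toString() {
            return key + ": " + kind + " (A=" + valueA + ", B=" + valueB + ")";
        }
    }

    // Returns one entry per key that differs between the two subsystems, sorted by key.
    static List<Mismatch> reconcile(Map<String, String> a, Map<String, String> b) {
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        List<Mismatch> report = new ArrayList<>();
        for (String key : keys) {
            String valueA = a.get(key);
            String valueB = b.get(key);
            if (!a.containsKey(key)) {
                report.add(new Mismatch(key, Kind.MISSING_IN_A, null, valueB));
            } else if (!b.containsKey(key)) {
                report.add(new Mismatch(key, Kind.MISSING_IN_B, valueA, null));
            } else if (!Objects.equals(valueA, valueB)) {
                report.add(new Mismatch(key, Kind.VALUE_MISMATCH, valueA, valueB));
            }
        }
        return report;
    }
}
```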
3.1.G Trust level
• When
  • There is more than one fixed data input
  • Inputs are independent
  • Final details must be decided from the inputs
• How (based on)
  • Provider level (for a detailed data group)
  • Data history
• Examples: Bloomberg, Reuters, Telekurs, DTCC, …; Moody's, S&P, Fitch.
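One way to apply a provider-level trust ranking is to keep, for each data group, an ordered list of providers and take the value from the most trusted provider that actually supplied one. The sketch below is built on that assumption; TrustResolver, the data group name, and the provider names in the usage note are placeholders, and the data-history criterion from the slide is not modeled.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch only: the ranking structure and all names are assumptions for illustration.
final class TrustResolver {

    // Data group -> providers ordered from most to least trusted.
    private final Map<String, List<String>> rankingByGroup;

    TrustResolver(Map<String, List<String>> rankingByGroup) {
        this.rankingByGroup = rankingByGroup;
    }

    // Picks the value from the highest-ranked provider that delivered a non-empty value.
    Optional<String> resolve(String dataGroup, Map<String, String> valuesByProvider) {
        List<String> ranking = rankingByGroup.getOrDefault(dataGroup, Collections.<String>emptyList());
        for (String provider : ranking) {
            String value = valuesByProvider.get(provider);
            if (value != null && !value.isEmpty()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

For example, with a ranking of `["VendorA", "VendorB"]` for a hypothetical group `"coupon"`, a record where only VendorB supplied a value resolves to VendorB's value, and a record with no usable input resolves to an empty result that can be routed to manual review.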
3.2.A Failover & DR
• Failover
  • DB: 2+ at different locations; real-time replication
  • App
    • Active-Active: cluster with load balancing
    • Active-Passive
      • Auto (via SAN)
      • Manual + Auto
• DR
  • DB: e.g. daily, hourly, or real-time replication
  • App: manual switch
3.3 Technology
• DB design
  • CHECK constraints (for sensitive table values)
  • Normalization (to reduce duplication)
  • Validation processes (to find conflicting data)
• Application design
  • Data integrity check
    • e.g. cryptographic signature
    • e.g. CRC check
  • Data display (e.g. Excel dropping leading zeros or converting dates to numbers)
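The CRC check and the leading-zero display issue can both be sketched with the Java standard library. The class and method names below are hypothetical. `java.util.zip.CRC32` computes the standard CRC-32 checksum, which detects accidental corruption but, unlike a cryptographic signature, offers no protection against deliberate tampering. The padding helper assumes a fixed-width, purely numeric identifier.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

// Sketch only: IntegrityCheck and its methods are illustrative names.
final class IntegrityCheck {

    // CRC-32 of the UTF-8 bytes of the payload.
    static long crc32(String payload) {
        CRC32 crc = new CRC32();
        crc.update(payload.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // The receiver recomputes the checksum and compares it with the one sent alongside the data.
    static boolean verify(String payload, long expectedCrc) {
        return crc32(payload) == expectedCrc;
    }

    // Restores leading zeros that were lost when an identifier was treated as a number.
    static String restoreLeadingZeros(long value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }
}
```

Storing and transporting such identifiers as text from the start avoids the need for the padding step altogether.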