The String Join Problem on Hadoop

The String Join Problem on Hadoop • 唐振坤 • July 26, 2012

Outline • 1. Problem definition • 2. Background • 3. Ideas & work • 4. Experiments • 5. Conclusions • 6. Future directions

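1. Problem definition • Given a set of strings, determine which of them are similar. • Applications: • Duplicate-page detection while a web search engine crawls pages • Document clustering after duplicate data has been removed • Detection of copied or plagiarized documents • User recommendation based on query similarity • DNA sequence analysis • ...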

1. Problem definition • Similarity (string metric): • Given strings r and s, • Similarity measures: • Jaccard similarity: O(n) • Edit distance (Levenshtein distance): O(n^2) with dynamic programming • ...
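The two measures named on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are made up here, Jaccard similarity is computed over whatever tokens the caller supplies (characters in the usage example), and treating two empty sets as fully similar is a convention chosen for this sketch.

```python
def jaccard_similarity(r_tokens, s_tokens):
    """Jaccard similarity |R ∩ S| / |R ∪ S| of two token collections."""
    r_set, s_set = set(r_tokens), set(s_tokens)
    if not r_set and not s_set:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(r_set & s_set) / len(r_set | s_set)


def edit_distance(r, s):
    """Levenshtein distance by dynamic programming, O(len(r) * len(s)) time."""
    prev = list(range(len(s) + 1))  # distances from the empty prefix of r
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1,          # delete rc
                            curr[j - 1] + 1,      # insert sc
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(jaccard_similarity("abc", "bcd"))   # 0.5
print(edit_distance("abcde", "bcfde"))    # 2
```

Only two rows of the dynamic-programming table are kept at a time, so memory stays linear in the length of s while the running time remains quadratic.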

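1. Problem definition • String similarity join: • Given a dataset whose records are characterized by strings, find the similar strings and join the records they belong to. • Types: • By operation: Self-Join, R-S Join • By application: Jaccard constraints, edit-distance (ED) constraints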

2. Background • Related algorithms: • All-Pairs • PPJoin, PPJoin+, Ed-Join • Pass-Join • PPJoin based on MapReduce

2. Background • Pass-Join: a partition-based method [1]: • Partition (input: strings) • SubstringSelection (input: strings) • Verify (input: candidate similar pairs) [1] Li, G., Deng, D., Wang, J., & Feng, J. (2011). Pass-Join: A Partition-based Method for Similarity Joins. Proceedings of the VLDB Endowment, 5(3), 253-264.

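2. Background • Pass-Join: a partition-based method • Question: is ed(R = "abcde", S = "bcfde") ≤ τ, with τ = 2? • Partition stage and substring-selection stage: [figure: substrings of "bcfde" are tested against the segments of "abcde"; two selections match (√) and one does not (╳); the highlighting was lost in extraction] • Candidate pairs produced (two): a[bc]de / [bc]fde and abc[de] / bcf[de]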
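The example on this slide can be reproduced with a small sketch. It splits R into τ + 1 near-equal segments (shorter segments first, which yields a | bc | de for "abcde") and then looks for each segment inside S. The position window used here, a shift of at most τ, is a deliberately simple filter assumed for illustration; the Pass-Join paper describes tighter substring-selection methods. The function names are hypothetical.

```python
def partition(r, tau):
    """Split r into tau + 1 contiguous segments of near-equal length.

    Returns (start, segment) pairs with the shorter segments first.
    Assumes len(r) >= tau + 1 so that no segment is empty.
    """
    n = tau + 1
    base, extra = divmod(len(r), n)
    segments, start = [], 0
    for i in range(n):
        length = base + (1 if i >= n - extra else 0)  # last `extra` segments get +1
        segments.append((start, r[start:start + length]))
        start += length
    return segments


def candidate_matches(r, s, tau):
    """Return (segment, pos_in_r, pos_in_s) for every segment of r that
    occurs in s at a start position shifted by at most tau."""
    matches = []
    for p, seg in partition(r, tau):
        lo = max(0, p - tau)
        hi = min(len(s) - len(seg), p + tau)
        for q in range(lo, hi + 1):
            if s[q:q + len(seg)] == seg:
                matches.append((seg, p, q))
    return matches


print(partition("abcde", 2))                   # [(0, 'a'), (1, 'bc'), (3, 'de')]
print(candidate_matches("abcde", "bcfde", 2))  # [('bc', 1, 0), ('de', 3, 3)]
```

The two matches correspond to the two candidate pairs on the slide. The filter is sound because τ edits can touch at most τ of the τ + 1 segments, so at least one segment of R must survive unchanged in S whenever ed(R, S) ≤ τ.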

2. Background • Join operations on Hadoop [1]: • Reduce-Side Join • Map-Side Join • Memory-Backed Join [1] Lin, J., & Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies (Vol. 3, pp. 1-177). Morgan & Claypool Publishers.

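2. Background • Reduce-Side Join (records are [Key, Value] pairs; stages: Mapper -> Grouper -> Reducer)
Table 1 input: K1, Abc • K2, bcd
Table 2 input: K1, 123 • K1, 789 • K2, 456
Mapper output for Table 1 (values tagged with @): K1, Abc@ • K2, bcd@
Mapper output for Table 2 (values tagged with #): K1, 123# • K1, 789# • K2, 456#
Grouper output: K1, (Abc@, 123#, 789#) • K2, (bcd@, 456#)
Reducer output: K1,Abc,123 • K1,Abc,789 • K2,bcd,456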
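The data flow on this slide can be simulated in a single process. The sketch below is not Hadoop code: the three functions merely stand in for the Mapper, the grouping done by the shuffle, and the Reducer, and their names are made up. The @ and # suffix tags follow the slide.

```python
from collections import defaultdict


def map_phase(table1, table2):
    """Tag each value with its source table: '@' for Table 1, '#' for Table 2."""
    for key, value in table1:
        yield key, value + "@"
    for key, value in table2:
        yield key, value + "#"


def group_phase(pairs):
    """Stand-in for the shuffle: collect all tagged values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every Table 1 value with every Table 2 value."""
    joined = []
    for key in sorted(groups):
        # Suffix tagging assumes the raw values never end in '@' or '#'.
        left = [v[:-1] for v in groups[key] if v.endswith("@")]
        right = [v[:-1] for v in groups[key] if v.endswith("#")]
        for lv in left:
            for rv in right:
                joined.append((key, lv, rv))
    return joined


table1 = [("K1", "Abc"), ("K2", "bcd")]
table2 = [("K1", "123"), ("K1", "789"), ("K2", "456")]
print(reduce_phase(group_phase(map_phase(table1, table2))))
# [('K1', 'Abc', '123'), ('K1', 'Abc', '789'), ('K2', 'bcd', '456')]
```

Every record of both tables crosses the shuffle in this scheme, which is why the join cost is dominated by the volume of intermediate output.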

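2. Background • Memory-Backed Join • Before the Mapper starts processing, the small table is loaded entirely into memory and stored in a Map
In-memory Map (small table, Table 1: K1, Abc • K2, bcd): K1 -> Abc • K2 -> bcd
Mapper input (Table 2, streamed once the small table is in memory): K1, 123 • K1, 789 • K2, 456
Mapper output: K1,Abc,123 • K1,Abc,789 • K2,bcd,456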
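A single-process sketch of the same idea, under the assumption that the small table fits in memory: the small table is loaded into a dictionary first, and the large table is then streamed record by record, as a Mapper would receive it. The function name is hypothetical and no Hadoop API is involved.

```python
def memory_backed_join(small_table, large_table):
    """Join two (key, value) tables by holding the small one in memory."""
    lookup = {}
    for key, value in small_table:
        # A list per key keeps the sketch correct if the small table repeats keys.
        lookup.setdefault(key, []).append(value)
    for key, value in large_table:  # plays the role of the Mapper's input stream
        for small_value in lookup.get(key, []):
            yield key, small_value, value


table1 = [("K1", "Abc"), ("K2", "bcd")]                  # small table
table2 = [("K1", "123"), ("K1", "789"), ("K2", "456")]   # large table
print(list(memory_backed_join(table1, table2)))
# [('K1', 'Abc', '123'), ('K1', 'Abc', '789'), ('K2', 'bcd', '456')]
```

Because the join completes inside the map step, no tagged records need to be shuffled to a Reducer; the price is that the small table must fit in each Mapper's memory.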

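3. Ideas • Three-step join: • Partition strings into segments and build the inverted index lists • Select substrings and generate candidate pairs • Verify the candidate pairs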

3. Ideas • Building the inverted index lists

3. Ideas • Selecting substrings and generating candidate pairs

3. Ideas • Joining the strings and verifying (Reduce-Side Join)

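3. Ideas • Problems: • Too many stages: index building and substring selection are independent of each other and can be merged • The generated candidate set is too large to read in at join time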

3. Ideas • Segment partitioning and substring selection • String join and verification

3. Ideas • Segment partitioning and substring selection

3. Ideas • String join and verification (Memory-Backed Join)

3. Ideas • Partition segments and select substrings on the Map side • Join and verify on the Reduce side
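One way such a map-side/reduce-side split might look is simulated below in a single process. This is a sketch under stated assumptions, not the implementation from the talk: the key layout (indexed length, segment index, segment text), the "I"/"P" record tags, and all function names are invented here, the position window is a simple shift of at most τ, and every input string is assumed to have at least τ + 1 characters.

```python
from collections import defaultdict


def edit_distance(r, s):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def partition_layout(length, tau):
    """(start, seg_len) of each of the tau + 1 segments of a string of this length."""
    n = tau + 1
    base, extra = divmod(length, n)
    layout, start = [], 0
    for i in range(n):
        seg_len = base + (1 if i >= n - extra else 0)
        layout.append((start, seg_len))
        start += seg_len
    return layout


def map_string(sid, s, tau):
    """Map side: emit index records for s's own segments and probe records
    for substrings of s that could match a segment of an equal or shorter string."""
    for i, (p, n) in enumerate(partition_layout(len(s), tau)):
        yield (len(s), i, s[p:p + n]), ("I", sid, s)
    for length in range(max(tau + 1, len(s) - tau), len(s) + 1):
        for i, (p, n) in enumerate(partition_layout(length, tau)):
            for q in range(max(0, p - tau), min(len(s) - n, p + tau) + 1):
                yield (length, i, s[q:q + n]), ("P", sid, s)


def reduce_group(records, tau):
    """Reduce side: pair indexed strings with probing strings, then verify."""
    indexed = [(sid, s) for tag, sid, s in records if tag == "I"]
    probes = [(sid, s) for tag, sid, s in records if tag == "P"]
    for rid, r in indexed:
        for sid, s in probes:
            # Equal-length pairs meet in both directions; keep only rid < sid.
            if rid != sid and (len(r) < len(s) or rid < sid):
                if edit_distance(r, s) <= tau:
                    yield tuple(sorted((rid, sid)))


def similarity_self_join(strings, tau):
    groups = defaultdict(list)  # stand-in for the shuffle between Map and Reduce
    for sid, s in enumerate(strings):
        for key, value in map_string(sid, s, tau):
            groups[key].append(value)
    result = set()
    for records in groups.values():
        result.update(reduce_group(records, tau))
    return sorted(result)
```

Calling `similarity_self_join(["abcde", "bcfde", "zzzzz"], 2)` groups "abcde" and "bcfde" under the shared segment keys for "bc" and "de", and the pair survives verification because its edit distance is 2. Keys that collect many records illustrate where the skew mentioned in the conclusions would arise.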

4. Experiments

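5. Conclusions • 1. MapReduce suits batch processing and is a poor fit for complex algorithms that need many iterations • 2. Producing intermediate output hurts MapReduce performance • 3. Make fuller use of Hadoop's built-in Combiner and sorting to improve algorithm performance • 4. Poorly chosen partitioning in MapReduce leads to data skew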

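6. Future directions • 1. First group all strings with prefix filtering, then run the serial PassJoin algorithm on each machine • 2. Study MapReduce join algorithms in more depth, and look more closely at the efficiency and implementation of map-side join • 3. Work out how to distribute candidate pairs evenly across Reducers so that running time is balanced • 4. Consider further optimizations such as LSH, or other programming models such as MPI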

Thank you!
