Efficient String Data Operations and Implementations

Data Structures and Algorithms Lecture notes for String Jiang Lihong Shanghai Jiaotong University Data Structure Lecture notes

Main Topics
• 4.1 Basic concepts of strings
• 4.2 String operations and their implementation
• 4.3 Storage structures for strings
• 4.4 Pattern matching on strings

4.1 Basic concepts of strings
• A string is a finite sequence of zero or more characters. It is usually written as:
• s = 'a1a2...an' (n >= 0)

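String examples
S1 = "ab123"   // a string of length 5
S2 = "100"     // a string of length 3
S3 = "  "      // a string of two space characters, length 2
S4 = ""        // the empty string, length 0
a = 'BEI', b = 'JING', c = 'BEIJING', d = 'BEI JING'
• Their lengths are 3, 4, 7 and 8 respectively, and a and b are both substrings of c and of d.
• Two strings are equal if and only if their values are equal.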

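• a = ' ' is called a blank string. Its length is 1. Because the blank is itself a character, it can appear between other characters, as in "bei jing". For clarity, the blank character is written as "σ" in what follows.
• a = '' is called the empty string. Its length is 0, and it contains no characters.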

4.2 String operations and their implementation
• Definition of the string abstract data type:
• ADT String {
• Data objects: D = { ai | ai ∈ CharacterSet, i = 1, 2, ..., n, n >= 0 }
• Data relations: R1 = { <ai-1, ai> | ai-1, ai ∈ D, i = 2, ..., n }

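Basic operations:
• StrAssign(&T, chars)
    chars is a string constant. Creates a string T whose value equals chars.
• StrCopy(&T, S)
    S exists. Copies S into T.
• StrEmpty(S)
    S exists. Returns true if S is the empty string, false otherwise.
• StrCompare(S, T)
    S and T exist. Returns a value > 0 if S > T, 0 if S = T, and a value < 0 if S < T.
• StrLength(S)
    S exists. Returns the number of elements in S, called the length of the string.
• ClearString(&S)
    S exists. Resets S to the empty string.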

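• Concat(&T, S1, S2)
    S1 and S2 exist. Returns in T the new string formed by joining S1 and S2.
• SubString(&Sub, S, pos, len)
    S exists, 1 <= pos <= StrLength(S) and 0 <= len <= StrLength(S)-pos+1. Returns in Sub the substring of S of length len that starts at the pos-th character.
• Index(S, T, pos)
    S and T exist, T is non-empty, 1 <= pos <= StrLength(S). If the main string S contains a substring equal to T, returns the position of its first occurrence in S starting from the pos-th character; otherwise returns 0.
• Replace(&S, T, V)
    S, T and V exist, T is non-empty. Replaces with V every non-overlapping substring of S that equals T.
• StrInsert(&S, pos, T)
    S and T exist, 1 <= pos <= StrLength(S)+1. Inserts T before the pos-th character of S.
• StrDelete(&S, pos, len)
    S exists, 1 <= pos <= StrLength(S)-len+1. Deletes from S the substring of length len that starts at the pos-th character.
• DestroyString(&S)
    S exists. Destroys S.
• } ADT String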

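4.3 Storage structures for strings
(I) Fixed-length sequential storage
• The character sequence of the string value is stored in a group of storage units with contiguous addresses.
• #define MAXSTRLEN 255
• typedef unsigned char SString[MAXSTRLEN+1];   // element 0 holds the string length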

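• The part of a string value that exceeds the predefined length is discarded.
• The length can be stored in the array element with index 0, or a special marker can be placed after the string value.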

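Implementing concatenation: Concat(&T, S1, S2)
• Suppose S1, S2 and T are all variables of type SString, and T is obtained by joining S2 to S1. The front part of T then equals S1 and the back part equals S2, so the operation only needs to copy the string values and truncate whatever exceeds the maximum length.
• Three cases can arise:
• The combined length of S1 and S2 does not exceed the maximum length
• The combined length of S1 and S2 exceeds the maximum length
• The length of S1 already equals the maximum length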

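Status Concat(SString &T, SString S1, SString S2) {
    // Returns TRUE if T holds all of S1 followed by S2, FALSE if something was cut off.
    // T[a..b] = S[c..d] is shorthand for copying that range of characters.
    if (S1[0] + S2[0] <= MAXSTRLEN) {               // case 1: nothing is truncated
        T[1..S1[0]] = S1[1..S1[0]];
        T[S1[0]+1..S1[0]+S2[0]] = S2[1..S2[0]];
        T[0] = S1[0] + S2[0]; uncut = TRUE;
    }
    else if (S1[0] < MAXSTRLEN) {                   // case 2: S2 is truncated
        T[1..S1[0]] = S1[1..S1[0]];
        T[S1[0]+1..MAXSTRLEN] = S2[1..MAXSTRLEN-S1[0]];
        T[0] = MAXSTRLEN; uncut = FALSE;
    }
    else {                                          // case 3: only S1 fits
        T[0..MAXSTRLEN] = S1[0..MAXSTRLEN];         // T[0] = S1[0] = MAXSTRLEN
        uncut = FALSE;
    }
    return uncut;
}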
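The range assignments in Concat are shorthand rather than real C. A minimal runnable sketch of the same three cases follows, assuming the SString layout from the earlier slide. The names sstr_from_cstr and sstr_concat are illustrative helpers, not part of the lecture, and T is assumed to be a buffer distinct from S2.

```c
#include <string.h>

#define MAXSTRLEN 255
typedef unsigned char SString[MAXSTRLEN + 1];   /* element 0 holds the length */

/* Hypothetical helper: load a C string into an SString, truncating if needed. */
static void sstr_from_cstr(SString s, const char *src) {
    size_t n = strlen(src);
    if (n > MAXSTRLEN) n = MAXSTRLEN;
    s[0] = (unsigned char)n;
    memcpy(s + 1, src, n);
}

/* Concatenate s1 and s2 into t.
 * Returns 1 if t holds everything, 0 if part of s2 was cut off. */
static int sstr_concat(SString t, const SString s1, const SString s2) {
    int len1 = s1[0], len2 = s2[0];
    int room = MAXSTRLEN - len1;                 /* space left after s1 */
    int take = len2 <= room ? len2 : room;       /* characters of s2 that fit */
    memmove(t + 1, s1 + 1, (size_t)len1);
    memmove(t + 1 + len1, s2 + 1, (size_t)take);
    t[0] = (unsigned char)(len1 + take);
    return take == len2;
}
```

Computing how many characters of S2 fit (take) folds the three cases into one code path: take equals the length of S2 in case 1, is smaller in case 2, and is 0 in case 3.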

(II) Singly linked list storage of strings

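(III) Block-linked storage
• The available space is divided into nodes of equal size (say, nodes of size 4). Each node has two fields: the data field holds 4 characters and the link field holds a pointer to the next node. For example, s = 'abcdefghk'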
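The node layout described above can be sketched in C as follows. This is only an illustration under stated assumptions: the type and function names (Chunk, LString, lstr_assign, lstr_node_count, lstr_destroy) are invented here, and '#' is an assumed filler for unused slots in the last node.

```c
#include <stdlib.h>
#include <string.h>

#define CHUNKSIZE 4      /* characters per node, as in the slide's example */
#define PAD '#'          /* assumed filler for unused slots in the last node */

typedef struct Chunk {
    char data[CHUNKSIZE];
    struct Chunk *link;  /* next node, or NULL at the end */
} Chunk;

typedef struct {
    Chunk *head, *tail;
    int length;          /* number of real characters, excluding padding */
} LString;

/* Free every node and reset s to the empty string. */
static void lstr_destroy(LString *s) {
    Chunk *c = s->head;
    while (c != NULL) {
        Chunk *next = c->link;
        free(c);
        c = next;
    }
    s->head = s->tail = NULL;
    s->length = 0;
}

/* Build a block-linked string from a C string.
 * Returns 0 on success, -1 if an allocation fails (s is then left empty). */
static int lstr_assign(LString *s, const char *src) {
    size_t n = strlen(src);
    s->head = s->tail = NULL;
    s->length = 0;
    for (size_t i = 0; i < n; i += CHUNKSIZE) {
        Chunk *c = malloc(sizeof *c);
        if (c == NULL) { lstr_destroy(s); return -1; }
        for (size_t k = 0; k < CHUNKSIZE; k++)
            c->data[k] = (i + k < n) ? src[i + k] : PAD;
        c->link = NULL;
        if (s->tail != NULL) s->tail->link = c; else s->head = c;
        s->tail = c;
    }
    s->length = (int)n;
    return 0;
}

/* Count the nodes in the list. */
static int lstr_node_count(const LString *s) {
    int count = 0;
    for (const Chunk *c = s->head; c != NULL; c = c->link) count++;
    return count;
}
```

For s = 'abcdefghk' this yields three nodes holding "abcd", "efgh" and "k###"; the nine characters occupy twelve character slots plus three pointers.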

(IV) Heap-allocated storage
• A group of contiguous storage units is allocated dynamically
• malloc()
• free()
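A minimal sketch of heap-allocated strings built on malloc() and free() might look like this. The HString layout and the function names are assumptions made for illustration; every HString is assumed to start out as {NULL, 0}.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *ch;      /* heap block holding the characters; NULL for the empty string */
    int length;
} HString;

/* Make t a copy of the C string chars. Returns 0 on success, -1 on allocation failure. */
static int hstr_assign(HString *t, const char *chars) {
    size_t n = strlen(chars);
    char *buf = NULL;
    if (n > 0) {
        buf = malloc(n);
        if (buf == NULL) return -1;
        memcpy(buf, chars, n);
    }
    free(t->ch);               /* release the old value only after the new one exists */
    t->ch = buf;
    t->length = (int)n;
    return 0;
}

/* Store s1 followed by s2 in t. Nothing is truncated: the block is sized to fit. */
static int hstr_concat(HString *t, const HString *s1, const HString *s2) {
    int n = s1->length + s2->length;
    char *buf = NULL;
    if (n > 0) {
        buf = malloc((size_t)n);
        if (buf == NULL) return -1;
        if (s1->length > 0) memcpy(buf, s1->ch, (size_t)s1->length);
        if (s2->length > 0) memcpy(buf + s1->length, s2->ch, (size_t)s2->length);
    }
    free(t->ch);
    t->ch = buf;
    t->length = n;
    return 0;
}

/* Reset s to the empty string and release its storage. */
static void hstr_clear(HString *s) {
    free(s->ch);
    s->ch = NULL;
    s->length = 0;
}
```

Unlike the fixed-length representation, concatenation here never cuts anything off, at the cost of one allocation per result.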

4.4 Pattern matching
• Let s and t be two given strings. The process of finding in s a substring equal to t is called pattern matching.

(I) The BF pattern matching algorithm
• A simple and intuitive pattern matching algorithm is the Brute-Force algorithm, or BF algorithm for short.

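Algorithm 4.5
int Index(SString S, SString T, int pos) {
    // Returns the position of the first occurrence of T in S starting from pos, or 0 if there is none.
    int i = pos, j = 1;
    while (i <= S[0] && j <= T[0]) {
        if (S[i] == T[j]) { ++i; ++j; }     // match: advance both pointers
        else { i = i - j + 2; j = 1; }      // mismatch: back i up and restart the pattern
    }
    if (j > T[0]) return i - T[0];
    else return 0;
} // Index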

Matching pattern t = 'cda' against main string s = 'acdccdae'
• Initial state: i = 1, j = 1
• (1) s1 != t1: i <- 2, j <- 1, compare again
• (2) s4 != t3: i <- 3, j <- 1, compare again
• (3) s3 != t1: i <- 4, j <- 1, compare again
• (4) s5 != t2: i <- 5, j <- 1, compare again

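Matching pattern t = 'cda' against main string s = 'acdccdae'
• This algorithm is simple but inefficient. Its worst-case running time is O(m*n), where m and n are the lengths of the main string and the pattern.
• The BF algorithm is slow because it backtracks, and that backtracking is not actually necessary.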
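The BF search can be tried on the example above with a small self-contained sketch. It keeps the slides' 1-based positions but, as an assumption for convenience, works on ordinary NUL-terminated C strings instead of SString. The name bf_index is illustrative.

```c
#include <string.h>

/* Brute-force search.
 * Returns the 1-based position of the first occurrence of t in s
 * starting from position pos, or 0 if there is none. */
static int bf_index(const char *s, const char *t, int pos) {
    int m = (int)strlen(s), n = (int)strlen(t);
    if (n == 0 || pos < 1 || pos > m) return 0;   /* preconditions of Index */
    int i = pos, j = 1;
    while (i <= m && j <= n) {
        if (s[i - 1] == t[j - 1]) { ++i; ++j; }   /* match: advance both */
        else { i = i - j + 2; j = 1; }            /* mismatch: back i up, restart t */
    }
    return (j > n) ? i - n : 0;
}
```

With s = "acdccdae" and t = "cda", bf_index(s, t, 1) goes through the four mismatches listed in the trace and then returns 5.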

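(II) The KMP pattern matching algorithm
• Named after Knuth, Morris and Pratt.
• The BF algorithm is slow because it backtracks, and that backtracking is not necessary. The goal is that after each matching pass the pointer i does not move back; instead j falls back to some position k such that the k-1 characters of t before position k equal the k-1 characters of s before pointer i. This reduces the number of matching passes (and the number of comparisons per pass) and so improves efficiency. How to obtain k is the key to the improved algorithm.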

• Knuth and his co-authors observed that k depends only on the first j characters of the pattern t itself and not on the main string s, so a function next(j) can give the k that corresponds to j.
• If next(j) = k, then next(j) is the position in the pattern that must be compared next against the same main-string character when the j-th character of t fails to match.
• That is, when si != tj, the pattern is shifted right by j - next(j) characters and comparison continues from si. If next(j) = 0, j moves to the first character of t and comparison continues from si+1.

Algorithm 4.6: outline of the KMP pattern matching algorithm (s, t, pos), where m and n are the lengths of s and t
• 1) i <- 1, j <- 1
• 2) While i <= m and j <= n, repeat:
    if s(i) = t(j) then i <- i+1, j <- j+1
    else if next(j) > 0 then j <- next(j)
    else j <- 1, i <- i+1
• 3) If j > n, output i-n; otherwise output 0

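• In Algorithm 4.6 the value of i only increases. It starts at 1 and the loop runs only while i <= m, so the statement i <- i+1 executes at most m times. Each step j <- next(j) strictly decreases j, and j grows only together with i, so those steps also occur at most m times. The running time is therefore O(m).
• One question remains: how to compute next(j).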

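• (1) next(j) is an integer satisfying 0 <= next(j) < j.
• (2) The value k of next(j) must make the k-1 characters of t before position k equal to the k-1 characters of s before pointer i.
• (3) The chosen k must not let the right shift of t skip any possible successful match. Therefore, when several values of k satisfy property (2), the largest one is taken.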


• $t_1 t_2 \cdots t_{k-1} = t_{j-k+1} \cdots t_{j-1}$  (4-1)
• This holds because property (2) gives two relations:
• $t_1 t_2 \cdots t_{k-1} = s_{i-k+1} \cdots s_{i-1}$  (4-2)
• $t_{j-k+1} \cdots t_{j-1} = s_{i-k+1} \cdots s_{i-1}$  (4-3)
• Equation (4-1) then follows from (4-2) and (4-3).

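Algorithm for computing the next values (n is the length of the pattern t)
• 1) k <- 0, j <- 1, next(1) <- 0
• 2) Repeat the following until j = n:
    if k = 0 or t(j) = t(k) then j <- j+1, k <- k+1, next(j) <- k
    else k <- next(k)
• {end of algorithm}
• The time complexity of this algorithm is O(n).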
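Putting the next computation and the matching outline of Algorithm 4.6 together gives the following sketch. It follows the slides' 1-based indexing but, as an assumption, operates on NUL-terminated C strings. The names kmp_get_next and kmp_index are illustrative, and the search starts at pos rather than at 1.

```c
#include <stdlib.h>
#include <string.h>

/* Fill next[1..n] for the pattern t of length n (next[0] is unused). */
static void kmp_get_next(const char *t, int n, int *next) {
    int j = 1, k = 0;
    next[1] = 0;
    while (j < n) {
        if (k == 0 || t[j - 1] == t[k - 1]) { ++j; ++k; next[j] = k; }
        else k = next[k];                 /* fall back within the pattern */
    }
}

/* KMP search.
 * Returns the 1-based position of the first occurrence of t in s
 * starting from pos, 0 if there is none, or -1 if allocation fails. */
static int kmp_index(const char *s, const char *t, int pos) {
    int m = (int)strlen(s), n = (int)strlen(t);
    if (n == 0 || pos < 1 || pos > m) return 0;
    int *next = malloc(((size_t)n + 1) * sizeof *next);
    if (next == NULL) return -1;
    kmp_get_next(t, n, next);
    int i = pos, j = 1;
    while (i <= m && j <= n) {
        if (s[i - 1] == t[j - 1]) { ++i; ++j; }   /* match: advance both */
        else if (next[j] > 0) j = next[j];        /* mismatch: i stays, j falls back */
        else { j = 1; ++i; }                      /* mismatch at t1: move on in s */
    }
    free(next);
    return (j > n) ? i - n : 0;
}
```

For the pattern "abaabcac" the table comes out as next(1..8) = 0, 1, 1, 2, 2, 3, 1, 2. On the earlier example, kmp_index("acdccdae", "cda", 1) returns 5 without i ever moving backwards.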

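I. True or false (y/n)
• 1. The substring location function has worst-case time complexity O(n*m), so it has no practical value.
• 2. Given two strings p and q where q is a substring of p, the algorithm that takes the position of the first occurrence of q in p as the position of the substring q in p is called matching.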

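• 3. The most important feature of the KMP algorithm is that the pointer into the main string never needs to backtrack.
• 4. Let the pattern have length m and the target string length n. When n ≈ m and the pattern is matched only once, the naive matching algorithm (that is, the substring location function) may actually cost less time.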

II. Single choice (choose one of the options below)
• 1. Let s1 = 'ABCDEFG' and s2 = 'PQRST'. After s = CONCAT(SUB(s1, 2, LEN(s2)), SUB(s1, LEN(s2), 2)), the value of s is:
• 'BCDEF'
• 'BCDEFG'
• 'BCPQRST'
• 'BCDEFEF'
• 'BCQR'

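• 2. Given two strings p and q, the operation that finds the position of the first occurrence of q in p is:
• Concatenation
• Pattern matching
• Taking a substring
• Taking the string length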

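III. Programming exercises
• 1. Let x and y be two strings represented as singly linked lists. Write an algorithm that finds the first character of x that does not occur in y (assume each node stores exactly one character).
• 2. Design an algorithm that implements the string comparison operation strcmp(s, t) on sequentially stored strings.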

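PROCEDURE ds0405(x, y : ctr; VAR p : ctr);
{ Sets p to the first node of x whose character does not occur in y, or to NIL if every character of x occurs in y. }
VAR px, py : ctr;
BEGIN
  px := x;
  WHILE px <> NIL DO
  BEGIN
    py := y;
    { scan y for the character stored in px }
    WHILE (py <> NIL) AND (px^.data <> py^.data) DO
      py := py^.link;
    IF py = NIL THEN px := NIL          { not found in y: stop, x points at the answer }
    ELSE BEGIN
      px := px^.link; x := px           { found in y: move on to the next node of x }
    END
  END;
  p := x;
END;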
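The same search can be sketched in C for comparison. The node type CNode and the function name first_not_in are assumptions made for this illustration; each node stores one character, as the exercise requires.

```c
#include <stddef.h>

typedef struct CNode {
    char data;              /* one character per node */
    struct CNode *link;     /* next node, or NULL at the end */
} CNode;

/* Return the first node of x whose character does not occur in y,
 * or NULL if every character of x occurs in y. */
static const CNode *first_not_in(const CNode *x, const CNode *y) {
    for (const CNode *px = x; px != NULL; px = px->link) {
        const CNode *py = y;
        while (py != NULL && py->data != px->data)   /* scan y for px's character */
            py = py->link;
        if (py == NULL) return px;                   /* ran off the end of y: not found */
    }
    return NULL;
}
```

Returning the node directly avoids the Pascal version's trick of reusing the parameter x to remember the answer. The cost is still proportional to the product of the two list lengths.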

Let the strings x and y be stored in the first m components of vector A[1..maxsize] and the first n components of vector B[1..maxsize] respectively, with 0 < m, n < maxsize.
FUNCTION ds0406(A, B : ARRAY[1..maxsize] OF char; m, n : 1..maxsize) : -1..1;
VAR s, i : integer;
BEGIN
  IF m < n THEN s := m ELSE s := n;          { s = min(m, n) }
  i := 1;
  WHILE i <= s DO
    IF A[i] < B[i] THEN RETURN(-1)           { x < y }
    ELSE IF A[i] > B[i] THEN RETURN(1)       { x > y }
    ELSE IF i = s THEN                       { common prefix exhausted: compare lengths }
    BEGIN
      IF m = n THEN RETURN(0)                { x = y }
      ELSE IF m > n THEN RETURN(1)           { x > y }
      ELSE RETURN(-1)                        { x < y }
    END
    ELSE i := i + 1                          { characters equal: go on to the next pair }
END;
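A C sketch of the same comparison is shown below, under the assumption that each string is passed as a character array plus an explicit length. The name seq_strcmp is illustrative, and unlike the Pascal version it also accepts empty strings.

```c
/* Compare the first m characters of a with the first n characters of b.
 * Returns -1 if a < b, 0 if they are equal, 1 if a > b. */
static int seq_strcmp(const char *a, int m, const char *b, int n) {
    int s = m < n ? m : n;                       /* s = min(m, n) */
    for (int i = 0; i < s; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca < cb) return -1;
        if (ca > cb) return 1;
    }
    /* One string is a prefix of the other: the shorter one is smaller. */
    if (m == n) return 0;
    return m > n ? 1 : -1;
}
```

For example, seq_strcmp("BEI", 3, "BEIJING", 7) returns -1 because 'BEI' is a proper prefix of 'BEIJING'.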
