
Searching


Presentation Transcript


  1. Searching • Given distinct keys k1, k2, …, kn and a collection of n records of the form • (k1, I1), (k2, I2), …, (kn, In) • Search Problem - For key value K, locate the record (kj, Ij) in the collection T such that kj = K. • Searching is a systematic method for locating the record(s) with key value kj = K. • A successful search is one in which a record with key kj = K is found. • An unsuccessful search is one in which no record with kj = K is found (because no such record exists).
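As a minimal illustration of the search problem, here is a sketch of sequential search over such a collection; the record layout (a list of (key, info) pairs named T) is an assumption made for the example, not something fixed by the slides.

    # Sequential search: scan the collection until a record with key K is found.
    # T is a list of (key, info) pairs; returns the matching record or None
    # (an unsuccessful search).
    def sequential_search(T, K):
        for (k, info) in T:
            if k == K:
                return (k, info)   # successful search
        return None                # unsuccessful search: no record has key K

    records = [(42, "apple"), (7, "pear"), (19, "plum")]
    print(sequential_search(records, 7))    # (7, 'pear')
    print(sequential_search(records, 99))   # None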

  2. Searching Ordered Arrays • Binary Search - been there, done that. • Dictionary Search - interpolation search • Estimate how far from the low endpoint your value is likely to be. • Pos = lo + (value - A[lo]) / (A[hi] - A[lo]) * (hi - lo) • Probe this position rather than the midpoint. • Assumes the keys are evenly distributed.
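A sketch of interpolation (dictionary) search on a sorted array, following the position estimate above; it assumes numeric keys that are roughly evenly distributed, and the guard against a zero denominator is an addition for safety.

    # Interpolation search on a sorted list A of numbers.
    # Estimates the probe position from how far the target lies between
    # A[lo] and A[hi], instead of always probing the midpoint.
    def interpolation_search(A, value):
        lo, hi = 0, len(A) - 1
        while lo <= hi and A[lo] <= value <= A[hi]:
            if A[hi] == A[lo]:                      # avoid division by zero
                pos = lo
            else:
                pos = lo + (value - A[lo]) * (hi - lo) // (A[hi] - A[lo])
            if A[pos] == value:
                return pos
            if A[pos] < value:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1                                    # unsuccessful search

    A = [10, 20, 30, 40, 50, 60, 70, 80, 90]
    print(interpolation_search(A, 70))   # 6
    print(interpolation_search(A, 35))   # -1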

  3. Lists Ordered by Frequency • Order lists by (expected) frequency of occurrence. • Perform sequential search. • Cost for the first record: 1 • Cost for the second record: 2 • Expected search cost = 1p1 + 2p2 + 3p3 + … + npn • Worst case (all records equally likely): (n+1)/2 • Works best if a few items are accessed many times.
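A small sketch of the expected-cost formula above; the access probabilities are made up for illustration.

    # Expected sequential-search cost 1*p1 + 2*p2 + ... + n*pn for a list
    # ordered by access frequency (p1 >= p2 >= ... >= pn).
    def expected_cost(probs):
        return sum(i * p for i, p in enumerate(probs, start=1))

    skewed  = [0.5, 0.25, 0.15, 0.07, 0.03]   # a few items accessed often
    uniform = [0.2] * 5                        # all items equally likely
    print(expected_cost(skewed))    # 1.88 -- much better than...
    print(expected_cost(uniform))   # 3.0  -- (n+1)/2 for n = 5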

  4. Self Organizing Lists • 80/20 rule: 80% of the accesses go to 20% of the records. • Under this rule the expected search cost is approximately 0.122n. • Self organizing lists modify the order of records within the list based on the actual pattern of record accesses. • Self organizing lists use a rule called a heuristic for deciding how to reorder the list.

  5. Self Organizing Heuristics • Order by actual frequency - most frequently used first (count) • When a record is found, swap it with the first item in the list • When a record is found, move it to the front of the list (move-to-front) • When a record is found, swap it with the record immediately ahead of it (transpose)
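A sketch of two of the heuristics above, move-to-front and transpose, applied to a plain Python list; the function names and sample data are illustrative.

    # Self-organizing list heuristics: reorder the list on each successful search.
    def search_move_to_front(items, key):
        if key in items:
            items.remove(key)
            items.insert(0, key)       # move the found item to the front
            return True
        return False

    def search_transpose(items, key):
        if key in items:
            i = items.index(key)
            if i > 0:                  # swap with the record just ahead of it
                items[i - 1], items[i] = items[i], items[i - 1]
            return True
        return False

    lst = ["a", "b", "c", "d"]
    search_move_to_front(lst, "c")
    print(lst)    # ['c', 'a', 'b', 'd']
    search_transpose(lst, "d")
    print(lst)    # ['c', 'a', 'd', 'b']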

  6. Hashing • The process of mapping a key value to a position in a table. • A hash function maps key values to positions. • A hash table is an array that holds the records. • The hash table has M slots, numbered 0 to M-1. • For any value K in the key range and some hash function h, • h(K) = i, where 0 ≤ i < M and key(T[i]) = K

  7. Hashing Situations • Hashing is appropriate for unique keys. • Good for both in-memory and disk-based applications. • Answers the question “What record, if any, has key value K?” • Example: Store n records with keys in the range 0 to n-1. • Store the record with key i in slot i. • Uses the hash function h(k) = k (the identity function).

  8. Collisions • A more realistic example: • Store about 1000 records with keys in the range 0-16,383. • It is impractical to keep a table of size 16,384. • We need a hash function to map keys to a smaller range. • Given a hash function h and two different keys k1 and k2, let β be a position in the hash table. • If h(k1) = h(k2) = β, then k1 and k2 have a collision at slot β under h.

  9. Collision Resolution • To search for the record with key K: • Compute the table location h(K). • Starting with slot h(K), locate the record containing key K using (if necessary) a collision resolution policy. • Collisions are inevitable in most applications. • Example: In a group of 23 people, the probability is just over 50% that at least one pair shares a birthday.
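A quick check of the birthday example; assuming 365 equally likely birthdays (no leap years), 23 people give a collision probability of about 50.7%.

    # Probability that at least two of n people share a birthday:
    # 1 - (365/365)*(364/365)*...*((365-n+1)/365).
    def birthday_collision(n, days=365):
        p_no_collision = 1.0
        for i in range(n):
            p_no_collision *= (days - i) / days
        return 1 - p_no_collision

    print(round(birthday_collision(23), 3))   # 0.507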

  10. Hash Functions • Must return a value within the table range. • Should evenly distribute the records to be stored among the table slots. • Ideally, the function should distribute records with equal probability to all positions. In practice, this usually depends on the data. • If we know nothing about the key distribution, evenly distribute the key range among the positions. • If we know about the key distribution, use a distribution-dependent hash function.

  11. Example Hash Functions • h(key) = key % 16 - uses only the last 4 bits. • h(key) = key % 1000 - uses only the last 3 decimal digits. • Use % tablesize to make sure the result is in range. • Mid-square method: square the key and take the middle r bits, for a table of size 2^r. • Sum the ASCII values of the characters and take the result modulo tablesize (a folding technique).
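Sketches of the three kinds of hash functions above (modulus, mid-square, folding); the table sizes, the r value, and the 14-bit key width are assumptions chosen for the example.

    # Modulus: keep the result inside the table range.
    def h_mod(key, tablesize):
        return key % tablesize            # key % 16 keeps only the last 4 bits

    # Mid-square: square the key and take the middle r bits (table size 2**r).
    # keybits=14 assumes keys in 0..16,383, as in the earlier collision example.
    def h_midsquare(key, r, keybits=14):
        square = key * key
        shift = (2 * keybits - r) // 2    # drop low bits so the middle r remain
        return (square >> shift) & ((1 << r) - 1)

    # Folding: sum the ASCII values of a string key, then reduce mod tablesize.
    def h_fold(s, tablesize):
        return sum(ord(ch) for ch in s) % tablesize

    print(h_mod(16383, 16))          # 15
    print(h_midsquare(4567, 8))      # a slot in 0..255
    print(h_fold("hello", 101))      # a slot in 0..100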

  12. Collision Handling Categories • Open hashing - when there is a collision, put the colliding item outside the table. • Closed hashing - when there is a collision, put the colliding item inside the table.

  13. Open Hashing • Treat each table element as the head of a linked list of the items that hash to that position. • The linked lists can be organized in many ways: • Ordered by key: unsuccessful searches are detected quickly. • Ordered by frequency: good if a few records are searched for frequently. • If there are N records to be stored and the table has size M, the average search length is O(N/M). • Good for internal memory; on disk, linked nodes may fall in different blocks and cause many disk accesses.
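A sketch of open hashing (separate chaining) using Python lists as the per-slot chains; the table size M and the simple modulus hash are illustrative choices.

    # Open hashing: each table slot heads a list of the records that hash there.
    class ChainedHashTable:
        def __init__(self, M=11):
            self.M = M
            self.table = [[] for _ in range(M)]   # one chain per slot

        def insert(self, key, info):
            self.table[key % self.M].append((key, info))

        def search(self, key):
            for k, info in self.table[key % self.M]:   # scan only this chain
                if k == key:
                    return info
            return None          # unsuccessful search

    t = ChainedHashTable()
    t.insert(5, "five")
    t.insert(16, "sixteen")      # 16 % 11 == 5: collides with key 5, same chain
    print(t.search(16))          # sixteen
    print(t.search(27))          # None (27 % 11 == 5 too, but not stored)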

  14. Closed Hashing - Linear Probe • If the item you are looking for is not in the home position, look in the next position. • Do the same for insert until you find an empty slot. • When you reach the end of the table, wrap around to the beginning. • The table must keep at least one empty slot or the search loop never terminates. • Tends to cause clustering, since the probe sequence after a collision does not depend on the key (a collision at position 4 goes to position 5, then 6, regardless of the key).
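A sketch of closed hashing with linear probing; the table size and the simple modulus hash are illustrative, and the sketch assumes the table is never allowed to fill completely.

    # Closed hashing with linear probing: on a collision, try the next slot,
    # wrapping around to the start of the table.
    class LinearProbeTable:
        def __init__(self, M=11):
            self.M = M
            self.table = [None] * M

        def insert(self, key, info):
            i = key % self.M
            while self.table[i] is not None:   # requires at least one empty slot
                i = (i + 1) % self.M           # wrap around at the end
            self.table[i] = (key, info)

        def search(self, key):
            i = key % self.M
            while self.table[i] is not None:
                if self.table[i][0] == key:
                    return self.table[i][1]
                i = (i + 1) % self.M
            return None                        # hit an empty slot: not present

    t = LinearProbeTable()
    t.insert(4, "a")
    t.insert(15, "b")      # 15 % 11 == 4: collides, goes to slot 5
    print(t.search(15))    # b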

  15. Better Linear Probe • Instead of going to the next slot, skip ahead by some constant c. • The table size M and c should be relatively prime. • This ensures the probe sequence cycles through every slot of the table. • Still suffers some clustering.
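A sketch of the probe sequence with a constant skip c; the gcd check makes the "relatively prime" requirement explicit, and the sample values are illustrative.

    from math import gcd

    # Probe sequence for linear probing with a constant skip c.
    # If gcd(c, M) == 1 the sequence cycles through every slot of the table.
    def probe_sequence(home, c, M):
        assert gcd(c, M) == 1, "c and M must be relatively prime"
        return [(home + i * c) % M for i in range(M)]

    print(probe_sequence(3, 4, 11))   # visits all 11 slots exactly once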

  16. Quadratic Probe • Instead of adding 1 to the home position, add i^2. • i is the probe number in the sequence, so add 1, 4, 9, 16, ... • Remember to take the result modulo the table size.
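A sketch of the quadratic probe sequence: the i-th probe looks i^2 slots past the home position, modulo the table size. The sample table size and home slot are illustrative.

    # Quadratic probing: the i-th probe looks at (home + i*i) % M,
    # i.e. offsets 0, 1, 4, 9, 16, ... from the home slot.
    def quadratic_probe(home, i, M):
        return (home + i * i) % M

    M = 11
    print([quadratic_probe(3, i, M) for i in range(5)])   # [3, 4, 7, 1, 8]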

  17. Double Hashing • After a collision, use a second hash function. • Eliminates clustering to some degree. • For example, if h(k) causes a collision, use the probe function • p(k, i) = i * h2(k) • where h2 is a different hash function • so colliding keys generate different probe sequences.
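A sketch of double hashing using the probe function above; the particular second hash function h2(k) = 1 + (k % (M-1)) is a common textbook choice, not something specified by the slides.

    # Double hashing: the step size comes from a second hash function, so keys
    # that collide at the same home slot follow different probe sequences.
    M = 11

    def h1(key):
        return key % M

    def h2(key):
        return 1 + (key % (M - 1))        # never zero, so probing always advances

    def double_hash_probe(key, i):
        return (h1(key) + i * h2(key)) % M

    # Keys 3 and 14 share the same home slot but diverge after the first probe.
    print([double_hash_probe(3, i) for i in range(4)])    # [3, 7, 0, 4]
    print([double_hash_probe(14, i) for i in range(4)])   # [3, 8, 2, 7]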

  18. Analysis of Closed Hashing • Load factor: lf = N/M • N is the number of records • M is the size of the table • lf is the fraction of the table that is full • The larger the load factor, the greater the probability of a collision • The average search length is O(1/(1 - lf))
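A quick numeric illustration of how the 1/(1 - lf) estimate grows as the table fills; the load factors are arbitrary sample values.

    # Average probes in closed hashing grow roughly like 1 / (1 - load_factor).
    for lf in (0.25, 0.5, 0.75, 0.9, 0.99):
        print(f"load factor {lf:.2f}: about {1 / (1 - lf):.1f} probes")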

  19. Deletions • If we simply empty a deleted slot, a later search may stop prematurely (breaking the probe chain). • Use a special mark (a tombstone) to indicate that a record was deleted; when searching, continue past this mark rather than stopping as if the slot were empty. • Once there are many deleted slots, we may wish to rehash the remaining records • best to reinsert the most frequently accessed records first.
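A sketch of deletion with a tombstone marker in a linear-probe table, continuing the linear-probing sketch above; the TOMBSTONE sentinel and the pre-filled table are illustrative.

    # Deletion with a tombstone: the mark keeps probe chains intact so later
    # searches do not stop early, and insert may later reuse the marked slot.
    TOMBSTONE = object()

    def delete(table, M, key):
        i = key % M
        while table[i] is not None:
            if table[i] is not TOMBSTONE and table[i][0] == key:
                table[i] = TOMBSTONE          # mark, don't empty, the slot
                return True
            i = (i + 1) % M
        return False

    def search(table, M, key):
        i = key % M
        while table[i] is not None:           # tombstones do NOT stop the scan
            if table[i] is not TOMBSTONE and table[i][0] == key:
                return table[i][1]
            i = (i + 1) % M
        return None

    M = 11
    table = [None] * M
    table[4], table[5] = (4, "a"), (15, "b")   # 15 collided with 4 on insert
    delete(table, M, 4)
    print(search(table, M, 15))   # b -- the tombstone keeps the chain intact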
