CSC 213 – Large Scale Programming Lecture 21: Indexed Files
Today’s Goals • Look at how Dictionarys are used in the real world • Where this would occur & why they are used there • In real-world settings, what problems can & do occur • Indexed file usage presented and shown • How & why we split index & data files • Formatting of each file and how they get used • Describe what problems are solved using indexed files • Java coding techniques that simplify using these files • Idea needed when using multiple indexes shown
Dictionaries in Real World • Often need large database on many machines • Split search terms across machines • Updating & searching work split between machines • Database way too large for any single machine • If you think about it, this is incredibly common • Where?
Splitting Keys From Values • In real world, we often have many indices • Simple units measure where we can find values • Values could be searched for in multiple ways
Index & Data Files • Split information into two (or more) files • Data file uses fixed-size records to store data • Index files contain search terms & data locations • Fixed-size records usually used in data file • Each record will use exactly that much space • Extra space wasted if the value is smaller • But limits data size, cannot get more space • Makes it far easier to reuse space & rebuild index
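The fixed-size idea can be sketched as follows. This is a minimal illustration, not code from the lecture: the class name FixedRecordLayout and both helper methods are hypothetical, the 60-byte record size is taken from the Stock example later in the lecture (50 + 4 + 6), and zero-byte padding is an assumption about how unused space might be filled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper showing why fixed-size records make positions easy to compute.
class FixedRecordLayout {
    // Assumed record size: 50-byte name + 4-byte price + 6-byte ticker.
    static final int RECORD_SIZE = 60;

    /** Byte position in the data file where record number recordNumber (from 0) starts. */
    static long positionOf(long recordNumber) {
        return recordNumber * RECORD_SIZE;
    }

    /**
     * Encodes a value into exactly fieldSize bytes. Shorter values are padded
     * with zero bytes (the wasted space); longer values are rejected because a
     * field cannot get more space.
     */
    static byte[] toFixedSize(String value, int fieldSize) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        if (raw.length > fieldSize) {
            throw new IllegalArgumentException("value needs " + raw.length
                    + " bytes but the field holds only " + fieldSize);
        }
        return Arrays.copyOf(raw, fieldSize); // copyOf pads the tail with zeros
    }
}
```

Because every record occupies the same number of bytes, a freed record can be reused by any new value, and the index can be rebuilt by stepping through the data file one RECORD_SIZE at a time.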
Index File Format • No standard format – depends on type of data • Entries often variable sized, but this is not a requirement • Each entry in index file begins with exact search term • Followed by position containing matching data • As a result, often find indexes smushed together • Can read indexes at start of program execution • Reasonably assumes index file smaller than data file • Changes written immediately, however • When program starts, do NOT read data file
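One possible on-disk layout for such an index is sketched below. Since there is no standard format, everything here is an assumption made for illustration: the class name IndexFileSketch, the choice of writeUTF for the search term and writeLong for the position, and the use of a TreeMap as the in-memory index.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index file: each entry is a search term followed by a position.
class IndexFileSketch {
    /** Writes every (search term, position) pair, one entry after another. */
    static void writeIndex(File indexFile, Map<String, Long> index) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            for (Map.Entry<String, Long> entry : index.entrySet()) {
                out.writeUTF(entry.getKey());    // variable-sized search term
                out.writeLong(entry.getValue()); // position of the record in the data file
            }
        }
    }

    /** Reads the whole index at program start; the data file is not touched. */
    static Map<String, Long> readIndex(File indexFile) throws IOException {
        Map<String, Long> index = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)))) {
            while (true) {
                String term;
                try {
                    term = in.readUTF();
                } catch (EOFException endOfIndex) {
                    break; // no more entries
                }
                index.put(term, in.readLong());
            }
        }
        return index;
    }
}
```

The entries have no separators between them, which matches the "smushed together" look: the length prefix written by writeUTF is what tells the reader where each search term ends.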
Indexed Files • Enables splitting search terms across computers • Alphabetical split searches faster on many servers • [Figure: servers each holding one alphabetical range of keys: A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z]
Indexed Files • Enables splitting search terms across computers • Create indexes for different types of searching • [Figure: one data file with separate indexes by song name and by song length]
How Does This Work? • Using index files simplified using positions • Look in index structure to find position of data in file • With this position can then seek to specific record • Create instance & initialize by reading data from file
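The lookup steps above can be sketched as follows. The class IndexedLookup and the method findPrice are hypothetical names; the in-memory index is assumed to be a Map from search term to record position, and PRC_OFF = 50 follows the record layout used in the Stock class later in the lecture.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical lookup: index gives the position, seek jumps straight to the record.
class IndexedLookup {
    // Assumed offset of the price field inside a record (after a 50-byte name).
    static final int PRC_OFF = 50;

    /** Returns the price stored for key, or null when key is not in the index. */
    static Integer findPrice(Map<String, Long> index, RandomAccessFile dataFile, String key)
            throws IOException {
        Long position = index.get(key);    // step 1: look in the index structure
        if (position == null) {
            return null;                   // no record for this search term
        }
        dataFile.seek(position + PRC_OFF); // step 2: seek to the field in that record
        return dataFile.readInt();         // step 3: read the data from the file
    }
}
```

Only the one record that is needed gets read, so the cost of a search does not depend on the size of the data file.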
Starting with Indexed Files • [Figure: data file holding three fixed-size records (name, price, ticker): IBM, 106, IBM | AT&T, 23, T | Ford, 12, F]
Where Was "Searching" Used? • Indexed files used in Maps and Dictionarys • Read data into searchable object after opening file • For each record, Entry uses indexed data as its key • Single data file has multiple indexes to search it • Not a problem, each index has own Collection • Cannot have multiple instances for each data item • Cannot have single instance for each data item • Then how can we construct each Entry's value?
Proxy Pattern For The Win! • Create proxy instances to use as Entry's value • Proxy pretends it has the data by defining getters & setters • Data's position & file are the only fields these objects have • Whenever a method is called, proxy looks up & returns data from the file • Other classes will think proxy has fields declared • Simplifies using class & ensures up-to-date data used • But little memory needed, since data resides on disk!
Starting with Indexed Files • [Figure: data file holding three fixed-size records (name, price, ticker): IBM, 106, IBM | AT&T, 23, T | Ford, 12, F]
Coding
import java.io.IOException;
import java.io.RandomAccessFile;

public class Stock {
    // *_OFF: offset in record to field start
    // *_SZ:  fixed max. size of each field
    private static final int NAME_OFF = 0;
    private static final int NAME_SZ = 50;
    private static final int PRC_OFF = NAME_OFF + NAME_SZ;
    private static final int PRC_SZ = 4;
    private static final int TICK_OFF = PRC_OFF + PRC_SZ;
    private static final int TICK_SZ = 6;
    // Fixed size of a record in data file
    private static final int SIZE = TICK_OFF + TICK_SZ;

    // The proxy's only fields: where its record starts & which file holds it
    private long position;
    private RandomAccessFile theFile;

    public Stock(long pos, RandomAccessFile file) {
        position = pos;
        theFile = file;
    }
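Coding
public class Stock { // Continues from last time
    // Each method seeks to its field, so the value always comes from the file
    public int getStockPrice() throws IOException {
        theFile.seek(position + PRC_OFF);
        return theFile.readInt();
    }

    public void setStockPrice(int price) throws IOException {
        theFile.seek(position + PRC_OFF);
        theFile.writeInt(price);
    }

    public void setTickerSymbol(String sym) throws IOException {
        theFile.seek(position + TICK_OFF);
        // writeUTF stores a 2-byte length before the text, so sym must
        // encode to at most TICK_SZ - 2 bytes to stay inside its field
        theFile.writeUTF(sym);
    }
    // More getters & setters from here…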
Visualizing Indexed Files • [Figure: data file holding three fixed-size records (name, price, ticker): IBM, 106, IBM | AT&T, 23, T | Ford, 12, F]
How Do We Add Data? • Adding new records takes only a few steps • Add space for record with setLength on data file • Update index structure(s) to include new record • Records in data file updated at each change
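Adding New Data To The Files • [Figure: data file after space is added for a fourth record, which is still empty (shown as 0, Ø): IBM, 106, IBM | AT&T, 23, T | Ford, 12, F | 0, Ø]
Adding New Data To The Files • [Figure: data file after the new record is filled in: IBM, 106, IBM | AT&T, 23, T | Ford, 12, F | Citibank, -2, C]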
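The steps for adding a record can be sketched as below. RecordAppender and addRecord are hypothetical names, the index is assumed to be a Map from search term to position, and only the price field is written to keep the example short; a full version would write the name and ticker fields the same way. SIZE and PRC_OFF follow the Stock record layout.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical helper: grow the data file by one record, then update the index.
class RecordAppender {
    static final int SIZE = 60;    // fixed record size from the Stock layout
    static final int PRC_OFF = 50; // offset of the price field in a record

    /** Appends a record holding price and returns the position it was given. */
    static long addRecord(RandomAccessFile dataFile, Map<String, Long> index,
                          String key, int price) throws IOException {
        long position = dataFile.length();   // new record goes at the current end
        // Add space for the record; Java leaves the contents of the added
        // bytes undefined, so each field should be written before it is read.
        dataFile.setLength(position + SIZE);
        dataFile.seek(position + PRC_OFF);
        dataFile.writeInt(price);            // data file updated immediately
        index.put(key, position);            // index structure now finds the record
        return position;
    }
}
```

With several indexes, the same position would be put into each index structure under that index's own search term.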
How Does This Work? • Removing records even easier • To prevent using record, remove items from indexes • Do NOT update index file(s) until program completes • Use impossible magic numbers for record in data file
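Removing Data As We Go • [Figure: data file holding four records: IBM, 106, IBM | AT&T, 23, T | Ford, 12, F | Citibank, -2, C]
Removing Data As We Go • [Figure: data file after the Ford record is removed and its space marked as unused (shown as 0, Ø): IBM, 106, IBM | AT&T, 23, T | 0, Ø | Citibank, -2, C]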
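Removal can be sketched in the same style. RecordRemover, removeRecord, and isDeleted are hypothetical names, and using Integer.MIN_VALUE in the price field as the "impossible" magic number is an assumption made for this example; the lecture does not say which value to use.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical helper: drop the record from the index & mark its space as unused.
class RecordRemover {
    static final int PRC_OFF = 50;                // offset of the price field
    static final int DELETED = Integer.MIN_VALUE; // assumed magic number, never a real price

    /** Removes key from the index and marks its record; false if key was absent. */
    static boolean removeRecord(RandomAccessFile dataFile, Map<String, Long> index,
                                String key) throws IOException {
        Long position = index.remove(key);        // searches can no longer reach it
        if (position == null) {
            return false;
        }
        dataFile.seek(position + PRC_OFF);
        dataFile.writeInt(DELETED);               // flags the record as reusable space
        return true;
    }

    /** Reports whether the record at position carries the magic number. */
    static boolean isDeleted(RandomAccessFile dataFile, long position) throws IOException {
        dataFile.seek(position + PRC_OFF);
        return dataFile.readInt() == DELETED;
    }
}
```

The magic number lets a later rebuild of the index skip the record, and lets an add operation reuse the space instead of growing the file.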
Using Multiple Indexes • Multiple indexes for data file very often needed • Provides many ways of searching for important data • Since each index file is read individually, this could also create a problem • Multiple proxy instances for the same data could be created • Duplicates of instance are created for each index • Makes removing them all difficult, since not linked • Very easy to solve: use Map while loading indexes • Converts positions in file to proxy instances to solve this
Linking Multiple Indexes • Use one Map instance while reading all indexes • For each position in file, check if already in Map • Use existing proxy instance, if position already in Map • If a search in Map returns null, create new instance • Make sure to call put() when we must create proxy
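The linking steps can be sketched as follows. StockProxy is a stripped-down, hypothetical stand-in for the Stock proxy class (it keeps only the position), and IndexLinker and link are likewise invented names; each raw index is assumed to have been read already as a Map from search term to position.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal proxy: the real one would also hold the RandomAccessFile.
class StockProxy {
    final long position;

    StockProxy(long position) {
        this.position = position;
    }
}

class IndexLinker {
    /**
     * Converts one raw index (search term -> position) into a searchable
     * index (search term -> proxy). The same proxiesByPosition Map must be
     * passed in for every index so each position gets exactly one proxy.
     */
    static Map<String, StockProxy> link(Map<String, Long> rawIndex,
                                        Map<Long, StockProxy> proxiesByPosition) {
        Map<String, StockProxy> linked = new HashMap<>();
        for (Map.Entry<String, Long> entry : rawIndex.entrySet()) {
            Long position = entry.getValue();
            StockProxy proxy = proxiesByPosition.get(position);
            if (proxy == null) {                        // first index to mention this record
                proxy = new StockProxy(position);
                proxiesByPosition.put(position, proxy); // remember it for later indexes
            }
            linked.put(entry.getKey(), proxy);
        }
        return linked;
    }
}
```

After linking a name index and a ticker index with one shared Map, a search for "Ford" and a search for "F" return the same proxy instance, so removing or updating the record through one index is seen through the other.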
For Next Lecture • Angel now has week #9 assignment (due 3/20) • This is after break, but might want to get started now • Angel will also have project #2 available • Has staggered submissions like previous project • Based upon index files, so can start working now! • Will discuss implementing space-efficient BST • Start coloring nodes red & black • Keeps balanced, but limits amount of movement