This lecture explores the real-world applications of indexed files, focusing on how dictionaries are utilized in large databases across multiple machines. It discusses the principles of splitting index and data files, file formatting, and the benefits of fixed-size records. The lecture delves into Java coding techniques that facilitate the use of indexed files, including proxy patterns for efficient data handling. By examining various scenarios where indexed files solve practical problems, participants will gain insights into optimizing data search and organization strategies in large-scale programming.
CSC 213 – Large Scale Programming Lecture 21: Indexed Files
Today’s Goals • Look at how Dictionarys are used in the real world • Where this would occur & why they are used there • In a real-world setting, what problems can & do occur • Indexed file usage presented and shown • How & why we split index & data files • Formatting of each file and how they get used • Describe what problems are solved using indexed files • Java coding techniques that simplify using these files • Idea needed when using multiple indexes shown
Dictionaries in Real World • Often need large database on many machines • Split search terms across machines • Updating & searching work split between machines • Database way too large for any single machine • If you think about it, this is incredibly common • Where?
Splitting Keys From Values • In real world, we often have many indices • Simple units measure where we can find values • Values could be searched for in multiple ways
Index & Data Files • Split information into two (or more) files • Data file uses fixed-size records to store data • Index files contain search terms & data locations • Fixed-size records usually used in data file • Each record will use exactly that much space • Extra space wasted if the value is smaller • But limits data size, cannot get more space • Makes it far easier to reuse space & rebuild index
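A minimal sketch of what fixed-size records buy us: because every record takes the same space, the position of record n is simple arithmetic, and a short value just leaves padding behind. The class name `FixedRecords`, its methods, and the 60-byte record size are assumptions for illustration (the size matches the 50 + 4 + 6 layout used in the coding slides), not code from the lecture.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the lecture's code.
class FixedRecords {
    // Assumed record size: 50-byte name + 4-byte price + 6-byte ticker.
    static final int RECORD_SIZE = 60;

    // Byte position in the data file where record number n (0-based) starts.
    static long positionOf(long recordNumber) {
        return recordNumber * RECORD_SIZE;
    }

    // Encodes a value into exactly `size` bytes, zero-padding any unused space.
    // Values that do not fit are rejected: a fixed-size field cannot grow.
    static byte[] toFixed(String value, int size) {
        byte[] raw = value.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > size) {
            throw new IllegalArgumentException(
                "value needs " + raw.length + " bytes, field holds " + size);
        }
        return Arrays.copyOf(raw, size);
    }
}
```

The padding is the wasted space the slide mentions; the rejection of oversized values is the size limit.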
Index File Format • No standard format – depends on type of data • Often variable sized, but this is not a requirement • Each entry in index file begins with exact search term • Followed by position containing matching data • As a result, index entries are often packed together with no padding • Can read indexes at start of program execution • Reasonably assumes index file smaller than data file • Changes written immediately, however • When program starts, do NOT read data file
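Since there is no standard index format, the following is only one possible layout, sketched under assumptions: each entry is the search term (written with `writeUTF`) followed by an 8-byte position. The class `IndexFile` and its `save` and `load` methods are hypothetical names. It shows the whole index being read into memory at startup without touching the data file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index-file reader/writer; the entry layout is an assumption.
class IndexFile {
    // Writes each entry as: search term, then position of its record.
    static void save(File file, Map<String, Long> index) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (Map.Entry<String, Long> entry : index.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeLong(entry.getValue());
            }
        }
    }

    // Reads every entry into a sorted in-memory structure at program start.
    static Map<String, Long> load(File file) throws IOException {
        Map<String, Long> index = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                String term;
                try {
                    term = in.readUTF();
                } catch (EOFException endOfIndex) {
                    break; // no more entries
                }
                index.put(term, in.readLong());
            }
        }
        return index;
    }
}
```

Entries are variable sized here because the terms differ in length, which is fine: only the data file needs fixed-size records.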
Indexed Files • Enables splitting search terms across computers • Alphabetical split makes searches faster on many servers • [Figure: servers each holding one alphabetical range: A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z]
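One way to picture the alphabetical split is a small routing table from the first letter of a search term to the server holding that range. This is only a sketch: the class `ServerRouter` and the server names are made up for illustration, and the ranges are the ones shown in the figure.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical router; server names are placeholders, not from the lecture.
class ServerRouter {
    // Maps the first letter of each range to the server responsible for it.
    private final TreeMap<Character, String> rangeStartToServer = new TreeMap<>();

    ServerRouter() {
        rangeStartToServer.put('A', "server-1"); // A-C
        rangeStartToServer.put('D', "server-2"); // D-E
        rangeStartToServer.put('F', "server-3"); // F-H
        rangeStartToServer.put('I', "server-4"); // I-P
        rangeStartToServer.put('Q', "server-5"); // Q-R
        rangeStartToServer.put('S', "server-6"); // S-T
        rangeStartToServer.put('U', "server-7"); // U-X
        rangeStartToServer.put('Y', "server-8"); // Y-Z
    }

    // Finds the range whose starting letter is the closest one at or below the term's.
    String serverFor(String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("empty search term");
        }
        char first = Character.toUpperCase(term.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("term must start with a letter: " + term);
        }
        Map.Entry<Character, String> range = rangeStartToServer.floorEntry(first);
        return range.getValue();
    }
}
```

Each server then only needs the index entries, and the searches, for its own range.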
Indexed Files • Enables splitting search terms across computers • Create indexes for different types of searching • [Figure: separate indexes on the same data, one by song name and one by song length]
How Does This Work? • Using index files is simplified by using positions • Look in index structure to find position of data in file • With this position can then seek to specific record • Create instance & initialize by reading data from file
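The lookup steps above can be sketched as follows, assuming the in-memory index is a `Map` from search term to record position and that a 4-byte price sits 50 bytes into each record (the layout used in the coding slides). `PriceLookup` and `priceOf` are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical lookup helper; offsets assume the Stock record layout.
class PriceLookup {
    static final int PRICE_OFFSET = 50; // assumed: price follows a 50-byte name

    // Returns the price stored for the term, or null if the term is not indexed.
    static Integer priceOf(Map<String, Long> index, RandomAccessFile data, String term)
            throws IOException {
        Long position = index.get(term);   // 1. look in index structure
        if (position == null) {
            return null;
        }
        data.seek(position + PRICE_OFFSET); // 2. seek to the specific record
        return data.readInt();              // 3. read the data from the file
    }
}
```

Only the one record that was asked for is ever read from the data file.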
Starting with Indexed Files • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 2, F]
Where Was "Searching" Used? • Indexed files used in Maps and Dictionarys • Read data into searchable object after opening file • For each record, Entry uses indexed data as its key • Single data file has multiple indexes to search it • Not a problem, each index has own Collection • Cannot have multiple instances for each data item • Cannot have single instance for each data item • Then how can we construct each Entry's value?
Proxy Pattern For The Win! • Create proxy instances to use as Entry's value • Proxy pretends it has the data by defining getters & setters • Data's position & file are the only fields these objects have • Whenever a method is called, it looks up & returns the data • Other classes will think proxy has the fields declared • Simplifies using class & ensures up-to-date data used • But little memory needed, since data resides on disk!
Starting with Indexed Files • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F]
Coding
import java.io.RandomAccessFile;

public class Stock {
  // *_OFF: offset in record to field start
  // *_SZ: fixed max. size of each field
  private static final int NAME_OFF = 0;
  private static final int NAME_SZ = 50;
  private static final int PRC_OFF = NAME_OFF + NAME_SZ;
  private static final int PRC_SZ = 4;
  private static final int TICK_OFF = PRC_OFF + PRC_SZ;
  private static final int TICK_SZ = 6;
  // Fixed size of a record in data file
  private static final int SIZE = TICK_OFF + TICK_SZ;

  // The only state a proxy holds: where its record starts and which file holds it
  private long position;
  private RandomAccessFile theFile;

  public Stock(long pos, RandomAccessFile file) {
    position = pos;
    theFile = file;
  }
  // Getters & setters continue on the next slide
}
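Coding
import java.io.IOException;

public class Stock { // Continues from last time
  // Seek to the price field of this record, then read it from disk
  public int getStockPrice() throws IOException {
    theFile.seek(position + PRC_OFF);
    return theFile.readInt();
  }

  // Seek to the price field of this record, then overwrite it on disk
  public void setStockPrice(int price) throws IOException {
    theFile.seek(position + PRC_OFF);
    theFile.writeInt(price);
  }

  // Seek to the ticker field of this record, then overwrite it on disk
  public void setTickerSymbol(String sym) throws IOException {
    theFile.seek(position + TICK_OFF);
    theFile.writeUTF(sym);
  }
  // More getters & setters from here…
}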
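To see the proxy behave as described, here is a self-contained sketch in the same spirit as the Stock class. It is renamed `StockProxy` to mark it as an illustration, and `getTickerSymbol` plus the length check are additions that are not on the slides. One detail worth knowing: `writeUTF` stores a 2-byte length before the characters, so a 6-byte ticker field holds at most 4 single-byte characters.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative variant of the lecture's Stock proxy; names beyond the slides are assumptions.
class StockProxy {
    static final int NAME_OFF = 0;
    static final int NAME_SZ = 50;
    static final int PRC_OFF = NAME_OFF + NAME_SZ;
    static final int PRC_SZ = 4;
    static final int TICK_OFF = PRC_OFF + PRC_SZ;
    static final int TICK_SZ = 6;
    static final int SIZE = TICK_OFF + TICK_SZ;

    private final long position;            // where this record starts
    private final RandomAccessFile theFile; // file holding the actual data

    StockProxy(long pos, RandomAccessFile file) {
        position = pos;
        theFile = file;
    }

    int getStockPrice() throws IOException {
        theFile.seek(position + PRC_OFF);
        return theFile.readInt();
    }

    void setStockPrice(int price) throws IOException {
        theFile.seek(position + PRC_OFF);
        theFile.writeInt(price);
    }

    String getTickerSymbol() throws IOException {
        theFile.seek(position + TICK_OFF);
        return theFile.readUTF();
    }

    void setTickerSymbol(String sym) throws IOException {
        // Assumes ASCII symbols (one byte per character) plus writeUTF's 2-byte length.
        if (sym.length() > TICK_SZ - 2) {
            throw new IllegalArgumentException("ticker does not fit in record: " + sym);
        }
        theFile.seek(position + TICK_OFF);
        theFile.writeUTF(sym);
    }
}
```

Two proxies created for the same position always agree, because neither caches anything: a price set through one is immediately visible through the other.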
Visualizing Indexed Files • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F]
How Do We Add Data? • Adding new records takes only a few steps • Add space for record with setLength on data file • Update index structure(s) to include new record • Records in data file updated at each change
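The steps for adding a record might look like the sketch below, assuming 60-byte records and an in-memory index from search term to position. `RecordAppender` and `append` are hypothetical names. The Java documentation says the contents of the portion added by `setLength` are not defined, so this version also zeroes the new record explicitly.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical helper for growing the data file by one record.
class RecordAppender {
    static final int SIZE = 60; // assumed fixed record size

    // Adds space for one record, indexes it, and returns its position.
    static long append(RandomAccessFile data, Map<String, Long> index, String term)
            throws IOException {
        long position = data.length();  // new record goes at the current end
        data.setLength(position + SIZE); // add space for the record
        data.seek(position);
        data.write(new byte[SIZE]);      // start from a known, blank record
        index.put(term, position);       // update index structure
        return position;
    }
}
```

With more than one index, the same position would be added to each of them under its own search term.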
Adding New Data To The Files • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F | new blank record: 0, Ø]
Adding New Data To The Files • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F | Citibank, -2, C]
How Does This Work? • Removing records even easier • To prevent using record, remove items from indexes • Do NOT update index file(s) until program completes • Use impossible magic numbers for record in data file
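A sketch of removal under stated assumptions: the record is dropped from the in-memory index so it can no longer be found, and its price field is overwritten with a sentinel so the slot can be recognized as free later. The sentinel value `Integer.MIN_VALUE`, the choice of the price field, and the names `RecordRemover`, `remove`, and `isDeleted` are all illustrative; the slides only say to use an impossible magic number.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical removal helper; the magic number is an assumed value.
class RecordRemover {
    static final int PRC_OFF = 50;                // assumed price offset in a record
    static final int DELETED = Integer.MIN_VALUE; // assumed "impossible" price

    // Removes the term from the index and marks its record as unused.
    static boolean remove(RandomAccessFile data, Map<String, Long> index, String term)
            throws IOException {
        Long position = index.remove(term); // record can no longer be found
        if (position == null) {
            return false;
        }
        data.seek(position + PRC_OFF);
        data.writeInt(DELETED);             // mark the slot in the data file
        return true;
    }

    // True if the record at this position has been marked as removed.
    static boolean isDeleted(RandomAccessFile data, long position) throws IOException {
        data.seek(position + PRC_OFF);
        return data.readInt() == DELETED;
    }
}
```

Because records are fixed size, a slot marked this way can be reused for a later addition, and the index can be rebuilt by scanning the data file and skipping marked slots.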
Removing Data As We Go • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | Ford, 12, F | Citibank, -2, C]
Removing Data As We Go • [Figure: data file records (name, price, ticker): IBM, 106, IBM | AT & T, 23, T | removed record: 0, Ø | Citibank, -2, C]
Using Multiple Indexes • Multiple indexes for data file very often needed • Provides many ways of searching for important data • Since each index file is read individually, this could also create a problem • Multiple proxy instances for data could be created • Duplicates of instance are created for each index • Makes removing them all difficult, since not linked • Very easy to solve: use Map while loading indexes • Map converts positions in file to proxy instances to solve this
Linking Multiple Indexes • Use one Map instance while reading all indexes • For each position in file, check if already in Map • Use existing proxy instance, if position already in Map • If a search in Map returns null, create new instance • Make sure to call put() when we must create proxy
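The linking steps can be sketched like this, assuming each index has already been read as a `Map` from search term to position. `IndexLinker`, `proxyFor`, `link`, and the stripped-down `Proxy` class are hypothetical; a real proxy would also hold the data file, as the Stock class does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that shares one proxy per record across all indexes.
class IndexLinker {
    // Minimal stand-in for a proxy such as Stock; only the position matters here.
    static final class Proxy {
        final long position;
        Proxy(long position) { this.position = position; }
    }

    // One Map instance used while reading every index.
    private final Map<Long, Proxy> proxyByPosition = new HashMap<>();

    // Returns the existing proxy for a position, creating and storing it if needed.
    Proxy proxyFor(long position) {
        Proxy proxy = proxyByPosition.get(position);
        if (proxy == null) {
            proxy = new Proxy(position);
            proxyByPosition.put(position, proxy); // remember it for the other indexes
        }
        return proxy;
    }

    // Converts one index from positions to shared proxy instances.
    <K> Map<K, Proxy> link(Map<K, Long> rawIndex) {
        Map<K, Proxy> linked = new HashMap<>();
        for (Map.Entry<K, Long> entry : rawIndex.entrySet()) {
            linked.put(entry.getKey(), proxyFor(entry.getValue()));
        }
        return linked;
    }
}
```

After linking, a search by name and a search by ticker that reach the same record return the very same object, so removing it from every index no longer requires hunting down duplicates.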
For Next Lecture • Angel now has week #9 assignment (due 3/20) • This is after break, but might want to get started now • Angel will also have project #2 available • Has staggered submissions like previous project • Based upon index files, so can start working now! • Will discuss implementing space efficient BST • Start coloring nodes red & black • Keeps balanced, but limits amount of movement