Technical Issues

Technical Issues LIS 466 4/21/2011 Blended Session

Technical Issues • So far this semester, we’ve focused on the basic elements of IR systems and discussed what makes a system effective. Today we’re going to shift gears and think about efficiency and scalability. • In this session, we will examine the concepts of efficiency and scalability, review approaches for increasing system efficiency, and look at methods used to scale up system performance. Finally, we will look at an example of how such methods may be applied in real life. • The full citations for the readings mentioned in the slides are listed in the reference section at the very end. You can find all of them online, except for Grossman and Frieder’s textbook.

Why Efficiency and Scalability? • Ask yourself: how many times have you left a webpage because it responded too slowly (that is, it took seconds to open)? • IR systems by nature deal with very large amounts of data. First, there is the collection, which might include thousands if not millions of documents, videos, images, and other types of files, easily taking up gigabytes if not terabytes. A big collection also means a larger set of retrieved documents, raising memory residency issues when recording document scores. Then, there is the index file, which, unless compressed, can easily exceed the size of the whole document collection and will likely grow as new documents are added.

Why Efficiency and Scalability? • The system will also need to be able to handle multiple queries simultaneously. Think not only about website traffic, but also about the huge amount of data that needs to be structured, stored, and accessed during query processing, and about how the query logs should be kept and used. It is important for the IR system to perform efficiently with its given resources and to be ready to scale up when its current resources are exhausted.

Efficiency Defined • Effectiveness is about accuracy. It examines how well the IR system matches documents to queries. • Concerns relevance. • Measured by recall and precision. • We’ve talked much about effectiveness in class. What are the devices used in IR systems to improve effectiveness / accuracy? • Efficiency is about speed. It measures how fast the results are retrieved. • Efficiency is measured by statistics such as latency (the time between the start of a process and its completion) and throughput (the amount of work done per unit time).
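
To make the contrast between the two kinds of measures concrete, here is a minimal Python sketch. The function names and the `search` callable are hypothetical and chosen only for illustration; they do not come from the readings. It assumes `queries` is a non-empty list.

```python
import time

def precision_recall(retrieved, relevant):
    """Effectiveness: return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def measure(search, queries):
    """Efficiency: return (mean latency in seconds, throughput in queries per second)."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search(query)  # any callable that runs one query
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return sum(latencies) / len(latencies), len(queries) / elapsed
```

For example, if a system retrieves documents 1, 2, 3, and 4 and the relevant documents are 2, 4, and 6, precision is 2/4 and recall is 2/3. Neither number says anything about how long the query took, which is what `measure` looks at.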

Improving Efficiency • General approaches to improve IR system efficiency: • Build a good inverted index • Improve query processing procedure

Improving Efficiency – Inverted Index • Inverted index reviewed: • The goal of an inverted index is to dramatically reduce the amount of input/output processing required to execute an ad hoc query. • Composed of the term index and the posting list: Basic entry: Term → (document ID, term frequency) (document ID, term frequency)… • Other data such as term location and term weight can be included in the posting list. • Very likely to exceed the size of the document collection unless compressed.
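
The basic entry described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: documents are plain strings, terms are split on whitespace, and there is no stemming or stop word removal. The function name and the sample documents are made up.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (document ID, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visit documents in ID order so postings stay sorted
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)

docs = {
    1: "parallel computing and distributed computing",
    2: "cloud computing",
    3: "distributed information retrieval",
}
index = build_inverted_index(docs)
print(index["computing"])  # postings for the term "computing"
```

A query for "computing" now reads one posting list instead of scanning all three documents, which is where the savings in input/output come from.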

Improving Efficiency – Inverted Index • With features such as stemming and stop word removal, the term index is usually much smaller than the posting list and easily fits in memory. • Compressing the bit strings of the inverted index (usually its posting list) could decrease the run time required to find the relevant entries. • Because it records much more data, the posting list is usually the larger part. For that reason, and because only the entries for the relevant terms need to be accessed, the posting list is moved out of memory and written to disk when the inverted index exceeds its allotted memory space. (See the figure in the next slide.)
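
The slides do not say which compression scheme is used. As one common possibility, the sketch below stores the sorted document IDs of a posting list as gaps between neighbours and encodes each gap with a variable-byte code. Both choices, and all function names, are assumptions made for illustration.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits at a time; the high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()      # most significant 7-bit group first
        chunk[-1] |= 0x80    # flag the final byte of this number
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers

def compress_doc_ids(doc_ids):
    """Store a sorted list of document IDs as gaps, which are small and encode compactly."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_doc_ids(data):
    """Rebuild the document IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

With this scheme the IDs 3, 7, 130, and 100000 take 6 bytes instead of the 16 bytes that four 32-bit integers would need, so fewer bytes have to be read from disk per posting list.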

Improving Efficiency – Inverted Index This structure allows the IR system to scan through the term index in its memory, and then only access the posting list of relevant terms. (From Frieder et al., 2000)

Improving Efficiency – Query Processing • Another way to improve system efficiency is to tackle how queries are processed. Example approaches: • Inverted index modifications: segment the inverted index so the system can skip over non-relevant items (as judged by document scores). • Partial result set retrieval: stop processing after a predetermined threshold of computational resources has been used. To use this approach, the query terms must first be sorted by their quality (usually by term frequency or document frequency) so the more important terms are processed first. • Grossman and Frieder (2004) have a whole chapter on IR system efficiency. Check it out for more details.
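
Partial result set retrieval can be sketched as follows. This is a simplified illustration, not the procedure from Grossman and Frieder: it assumes the resource threshold is a cap on the number of posting entries read (`max_postings`), ranks term quality by document frequency (rarer terms first), and scores documents by summing raw term frequencies. All names are hypothetical.

```python
from collections import defaultdict

def partial_retrieval(index, query_terms, max_postings):
    """Score documents term by term, stopping once the posting budget would be exceeded."""
    # Rarer terms (shorter posting lists) are treated as more important and go first.
    terms = sorted((t for t in query_terms if t in index), key=lambda t: len(index[t]))
    scores = defaultdict(int)
    used = 0
    for term in terms:
        postings = index[term]
        if used + len(postings) > max_postings:
            break  # budget exhausted: return what has been scored so far
        for doc_id, tf in postings:
            scores[doc_id] += tf
        used += len(postings)
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = {
    "the": [(1, 3), (2, 2), (3, 4)],
    "cloud": [(2, 1)],
    "index": [(1, 2), (3, 1)],
}
print(partial_retrieval(index, ["the", "cloud", "index"], max_postings=3))
```

With a budget of 3 entries, the common term "the" is never processed, so the ranking is built from "cloud" and "index" alone. The trade-off is speed for a possibly less accurate ranking.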

Scaling Up • The performance of IR systems can be enhanced by the aforementioned file compression and other optimization techniques. But at some point, the maximum capacity of a single machine will be reached and the designers will have to look for other ways to expand the system. • Scalability is the ability of a system to adapt to increased demands and continue to function well. • Scalability can be difficult to measure. It is sometimes measured by whether the scaling up of a system disrupts the stability between a system’s throughput and response time. It can also be measured by the instruction count of every function in the system. • To learn more, see Jogalekar and Woodside’s (2000) proposal for a new scalability evaluation metric.

Scaling Up • In recent years, a few system architectures have been proposed to solve the scalability issue: • Distributed computing • Parallel computing • Cloud computing • Google has pursued these approaches heavily, and for good reason.

Distributed, Parallel, and Cloud Computing • In the most basic sense, distributed and parallel computing are both forms of computing that break a single task into smaller parts to be executed simultaneously by two or more processors (computers). • Cloud computing is the new kid on the block. Where distributed computing and parallel computing have pretty standardized definitions, it is more difficult to pin down exactly what cloud computing means. Broadly defined, it refers to a computer network to which individual computers can connect in order to access data, applications, or other services. • We will look at each term more closely in turn.

Serial Computation • Traditionally, software has been written for serial computation, in which: • Software runs on a single computer having a single Central Processing Unit (CPU) • A problem is broken into a discrete series of instructions which are executed one after another • Only one instruction may execute at any moment in time. From: Introduction to parallel computing. Retrieved from https://computing.llnl.gov/tutorials/parallel_comp/

Serial Computation From: Introduction to parallel computing. Retrieved from https://computing.llnl.gov/tutorials/parallel_comp/

Distributed Computing • A distributed computing system is a network of independent computers interacting with each other to serve a common goal. • The advantages of using a distributed system include: • It can tolerate problems or failures in any single computer (node). • By aggregating the power of lower-end computers, the system can achieve better performance at a lower cost. • It can be easily expanded or reduced by adding or removing a node. • It can make use of non-local resources through the network.

Distributed Computing http://www.naccq.ac.nz/bacit/0203/2004Caukill_OffPeakGrid.htm

Distributed Computing and IR • When designing a distributed computing system, one major task is to decide how the tasks should be distributed. Should each step of the process be executed by different machines? For example, should the queries be processed by some systems, and the document scores handled by others? Or should the document collection be partitioned into many parts and housed on multiple servers? In the latter case, when a query is entered, each individual server runs the query and retrieves its own set of results. The sets of results are then merged together and presented to the user in one list.
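
The partitioned-collection option can be sketched as below. It is a toy model under stated assumptions: each "server" is just a Python dictionary of documents, every server uses the same scoring function (a count of query term occurrences) so that scores are directly comparable, and the servers are queried one after another rather than concurrently over a network. The function names are hypothetical.

```python
import heapq

def search_partition(partition, query_terms):
    """Run the query on one server's sub-collection; return (score, doc_id) pairs."""
    results = []
    for doc_id, text in partition.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score:
            results.append((score, doc_id))
    return results

def distributed_search(partitions, query_terms, k=10):
    """Query every partition and merge the partial result sets into one ranked list."""
    merged = []
    for partition in partitions:
        merged.extend(search_partition(partition, query_terms))
    return heapq.nlargest(k, merged)  # top k by score across all servers

partitions = [
    {"d1": "cloud computing", "d2": "parallel computing computing"},
    {"d3": "information retrieval", "d4": "computing"},
]
print(distributed_search(partitions, ["computing"], k=2))
```

The merge step is easy here only because all servers score documents the same way. When each server computes scores from its own sub-collection statistics, the scores are no longer directly comparable and merging becomes a design problem in its own right.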

Distributed Computing and IR • The system designers will have to decide what data should be shared among the servers: should there be a central index or should each server maintain its own index file? Should there be a central vocabulary system? What about the final results retrieved from different servers? How should they be merged when there may be different numbers of relevant documents within the sub-collections of each server? • For a detailed discussion, see de Kretser, Moffat, Shimmin & Zobel (1998).

Evaluating Performance of Distributed Computing System • The performance of a distributed system can be measured by: • Effectiveness. As with traditional computing, recall, precision, and other traditional IR measurements need to be taken to track the performance of a distributed system. • Response time. Response time is the delay between issuing the query and the return of answers. • The amount of processing involved, the volume of network traffic, and other variables can impact response time. • Resource usage. For example, CPU time and disk space requirements are used to measure the potential overall query throughput the system is able to achieve.

Parallel Computing • Parallel computing and distributed computing are, broadly speaking, very similar. • A task is completed using multiple CPUs at the same time • A problem is broken into discrete parts that can be solved concurrently • Each part is further broken down into a series of instructions • Instructions from each part execute simultaneously on different CPUs From: Introduction to parallel computing. Retrieved from https://computing.llnl.gov/tutorials/parallel_comp/

Parallel Computing From: Introduction to parallel computing. Retrieved from https://computing.llnl.gov/tutorials/parallel_comp/

Comparison of Distributed and Parallel Computing • The difference between distributed and parallel computing lies in what resources are being shared and how information is being exchanged. • Parallel computing: • All processors have access to a shared memory that is used to exchange information between processors. • The focus is on scientific computing • Distributed computing: • Each processor has its own memory. Information is exchanged by passing messages between the processors. • Focus is on cost and scalability, reliability, and resource sharing.

Parallel Computing • The advantages of parallel computing are very similar to those of distributed computing: • A parallel network can be built using cheaper components, leading to cost savings. • With more resources taking on the tasks, a task can be completed faster. • It is able to complete larger or more complex tasks that are beyond the means of a single computer, such as handling search engine traffic. • It provides concurrency as well as access to and use of non-local resources.

Measuring the Performance of a Parallel System • The performance of a parallel system can be difficult to measure due to its complex setup and the relationship between the applications. Parameters similar to those used in distributed system measurements are also used to evaluate parallel system performance: • Execution time: the time interval between the beginning of parallel computation and the time the last processing element finishes execution. • Speedup: the ratio of the time taken to execute a problem on a single machine to the time required to solve the same problem using a parallel system. • Efficiency: the ratio of speedup to the number of processors used. • Scalability: the change in the performance of the system as the problem size and machine size grow. • Read Chhabra & Singh (2007) for more details on parallel system evaluation.
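
The speedup and efficiency definitions above translate directly into arithmetic. The short sketch below uses made-up timings (120 seconds on one machine, 20 seconds on 8 processors) purely as a worked example; the function names are illustrative.

```python
def speedup(serial_time, parallel_time):
    """Ratio of single-machine time to parallel-system time for the same problem."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, processors):
    """Speedup divided by the number of processors used."""
    return speedup(serial_time, parallel_time) / processors

# Hypothetical timings: 120 s serially, 20 s on 8 processors.
print(speedup(120, 20))        # 6.0
print(efficiency(120, 20, 8))  # 0.75
```

A speedup of 6 on 8 processors gives an efficiency of 0.75, meaning a quarter of the added processing power is lost to overhead such as coordination and communication.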

Parallel Computing and IR • An IR system may use a parallel computing setup by: • Replication: create replicas of the index and assign each replica to a server. This allows multiple queries to be processed in parallel. • Index partitioning: split the inverted index into multiple parts. Each server (node) works only on its own portion of the index. In this setup, a query is processed by multiple servers in parallel. • Document partitioning: each node holds an index of a subset of the documents. • Term partitioning: each node holds a subset of the terms in the collection. • Buttcher, Clarke & Cormack (2010) have a chapter dedicated to parallel IR that reviews the topic thoroughly.
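
The difference between document partitioning and term partitioning is easiest to see on a small index. The sketch below assumes integer document IDs and assigns documents and terms to nodes round-robin; real systems may use other assignment rules, and the function names are hypothetical.

```python
def partition_by_document(index, num_nodes):
    """Each node keeps every term it needs, but only postings for its own documents."""
    nodes = [{} for _ in range(num_nodes)]
    for term, postings in index.items():
        for doc_id, tf in postings:
            nodes[doc_id % num_nodes].setdefault(term, []).append((doc_id, tf))
    return nodes

def partition_by_term(index, num_nodes):
    """Each node keeps complete posting lists, but only for its own subset of terms."""
    nodes = [{} for _ in range(num_nodes)]
    for i, term in enumerate(sorted(index)):
        nodes[i % num_nodes][term] = list(index[term])
    return nodes

index = {
    "cloud": [(2, 1)],
    "computing": [(1, 2), (2, 1)],
    "retrieval": [(3, 1)],
}
print(partition_by_document(index, 2))
print(partition_by_term(index, 2))
```

Under document partitioning a query must be sent to every node, and each node returns results for its own documents. Under term partitioning a query only needs the nodes that hold its terms, but each of those nodes sees postings for the entire collection.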

Cloud Computing • Cloud computing is a hot term that is being used a lot in system engineering and business. It is often referred to as an affordable and flexible approach to obtaining resources or services. Because it is a new concept, there are also varying views on what cloud computing is about. Some definitions use the cloud to describe any abstract network. For example, the National Institute of Standards and Technology (http://csrc.nist.gov/groups/SNS/cloud-computing/) defines cloud computing as “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. Other definitions refer strictly to the Internet as the cloud. Either way, the term “cloud computing” refers both to the services and applications provided and to the datacenters that provide the services.

Cloud Computing From: http://bluemilecloud.com/bcloud/cloud-compute/

Cloud Computing • A simplified explanation of the model: http://youtu.be/QJncFirhjPg • Benefits of the cloud include: • Scalability: users can access additional resources on demand. • Reliability and fault tolerance: the cloud enjoys the advantage of the built-in redundancy of multiple servers so that when one node goes down, the data and services are still there. • Utility-based: users pay only for what they use. • Cost effective: users use a common infrastructure, therefore sharing its cost.

Cloud Computing • Examples of cloud computing applications: • Amazon Elastic Compute Cloud (EC2) • Google MapReduce. Constructing and storing an inverted index for IR has been given as an example use of MapReduce (Lin, 2008).
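
To give a feel for how inverted index construction fits the MapReduce model, the sketch below imitates the map, shuffle, and reduce steps in ordinary single-machine Python. It does not use Google's MapReduce or any real framework API. All function names are made up for illustration, and in a real deployment the mappers and reducers would run on many machines.

```python
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    """Mapper: emit one (term, (doc_id, term frequency)) pair per distinct term."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def shuffle(pairs):
    """Group mapper output by term, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, values):
    """Reducer: turn all values for one term into a posting list sorted by document ID."""
    return term, sorted(values)

def build_index(docs):
    """Run the three steps in sequence over a dictionary of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, values) for term, values in shuffle(pairs).items())

docs = {2: "cloud computing", 1: "parallel computing computing"}
print(build_index(docs)["computing"])
```

Each mapper only needs to see one document and each reducer only needs to see one term, which is why the work can be spread across as many machines as are available.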

Hey… We’re about done! Whew! That was a lot of ground we covered. Now it’s time to see some of these ideas in action. Where would we find a search engine that deals with a very large document collection and very heavy traffic? Duh! Google!

Google Example • Here’s how Google handles its hardware architecture and maintains its flexibility: http://youtu.be/bs3Et540-_s • The clip is not the most exciting video in the world, and the audio is a bit muted, but it brings you inside Google’s shipping containers to see how their hardware is set up. A write-up of the configuration is here: http://news.cnet.com/8301-1001_3-10209580-92.html • Activity: How would you describe Google’s setup? Is it distributed computing? Parallel computing? Cloud computing? Other?

Reference • Barney, B. (2010, October 12). Introduction to parallel computing. Retrieved from https://computing.llnl.gov/tutorials/parallel_comp/ • Buttcher, S., Clarke, C.L.A., & Cormack, G.V. (2010). Parallel information retrieval. Information retrieval: Implementing and evaluating search engines. Cambridge, MA: MIT Press. Retrieved from http://www.ir.uwaterloo.ca/book/, March 28, 2011. • Chhabra, A., & Singh, G. (2007). Analysis & integrated modeling of the performance evaluation techniques for evaluating parallel systems. International Journal of Computer Science and Security, 1(1). • Frieder, O., Grossman, D. A., Chowdhury, A., & Frieder, G. (2000). Efficiency considerations for scalable information retrieval servers. Journal of Digital Information, 1(5). Retrieved from http://journals.tdl.org/jodi/article/view/21/21, March 28, 2011. • Grossman, D. A., & Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics. Springer. • Jogalekar, P., & Woodside, M. (2000). Evaluating the scalability of distributed systems. IEEE Transactions on Parallel and Distributed Systems, 11(6), 589-603.

Reference • de Kretser, O., Moffat, A., Shimmin, T., & Zobel, J. (1998). Methodologies for distributed information retrieval. In ICDCS ’98 Proceedings of the 18th International Conference on Distributed Computing Systems. • Lin, J. (2008). Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 419-428. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.714&rep=rep1&type=pdf, April 4, 2011. • Macfarlane, A., Robertson, S.E., & McCann, J.A. (1997). Parallel computing in information retrieval – An updated review. Journal of Documentation, 53(3). Retrieved from Emerald, March 28, 2011. • Manning, C.D., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval. Online content companion: http://nlp.stanford.edu/IR-book/information-retrieval-book.html • Ziviani, N., de Moura, E.S., Navarro, G., & Baeza-Yates, R. (2000). Compression: A key for next-generation text retrieval systems. Computer, 33(11). doi:10.1109/2.881693
