Limitations and Optimization of Virtual Machines

Explore the limitations of virtual machines, such as machine capacity, and learn about various optimization techniques including multiprocessing, instruction set optimization, and pipelining.



  1. General Architecture Issues: Real Computers Chapter 5

  2. 5.1 The limitations of a Virtual Machine • The JVM • Very simple and easily understandable architecture. • It ignores some of the real-world limitations of actual computer chips. • On a physical computer, there is normally only one CPU and one bank of main memory, which means that two functions running at the same time inside the CPU might compete for registers, memory storage, and so forth.

  3. 5.1 The limitations of a Virtual Machine • Machine capacity • The PowerPC has only 32 registers. • The Windows (Pentium) PC has even fewer.

  4. 5.2.1 Building a Better Mousetrap • Increasing the word size of the computer will increase the overall performance numbers. • Increasing the clock speed should result in increased machine performance. • In practical terms, this is rarely effective. • Almost all machines today are 32 bits. • Increasing to a 64-bit register would let the programmer do operations involving numbers in the quadrillions more quickly, but how often do you need a quadrillion of anything? • Making a faster CPU chip might not help if the CPU can now process data faster than the memory and bus can deliver it.

  5. 5.2.1 Building a Better Mousetrap • Increasing performance this way is expensive and difficult. • Performance improvements can be made within the same general technological framework.

  6. 5.2.2 Multiprocessing • One fundamental way to make computers more useful is to allow them to run more than one program at a time. • CPU time-sharing: you get to use the equipment for one time slice; after that slice is done, someone else's program gets a turn. • The computer must be prepared to stop the program at any point, copy all the program-relevant information (the state of the stack, local variables, the current program counter, etc.) into main memory somewhere, then load another program's relevant information from a different area, as in the sketch below.
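
As a rough illustration of that swap, here is a minimal round-robin sketch in Java. The ProgramState fields and the Scheduler class are hypothetical stand-ins for the kernel data structures a real operating system would use:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // The per-program context that must be copied out and back in on each swap.
    class ProgramState {
        int programCounter;            // where to resume execution
        int[] registers = new int[32]; // CPU register contents
        int stackPointer;              // the state of the stack
    }

    class Scheduler {
        private final Queue<ProgramState> ready = new ArrayDeque<>();
        private ProgramState running;

        // Called at the end of each time slice: save the running program's
        // context to memory, then load the next program's saved context.
        void timeSliceExpired() {
            if (running != null) {
                ready.add(running);    // copy the current program's state out...
            }
            running = ready.poll();    // ...and swap the next program's state in
        }
    }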

  7. 5.2.2 Multiprocessing • As long as the time slices and the memory areas are kept separate, the computer appears to be running several different programs at once. • Each program needs to be able to run independently of the others, and each needs to be prevented from influencing the other programs. • The computer needs a programmatic way to swap user programs in and out of the CPU at appropriate times.

  8. 5.2.2 Multiprocessing • The operating system's primary job is to act as a control program and enforcer of the security rules. • The operating system is granted privileges and powers including the ability to interrupt a running program, the ability to write to an area of memory irrespective of the program using it, and so forth. • These powers are often formalized in the programming model and define the difference between supervisor-level and user-level privileges and capacities.

  9. 5.2.3 Instruction Set Optimization • A particular instruction that occurs very frequently might be “tuned” in hardware to run faster than the rest of the instruction set would lead you to expect. • E.g., iload_0 is shorter (one byte vs. two) and faster than the equivalent iload 0. • On a multiprogramming system, “save all local variables to main memory” might be a commonly performed action.
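
For instance, compiling a trivial method and disassembling it with javap shows the tuned one-byte form (the exact output may vary with compiler version):

    public class LoadDemo {
        // x lives in local variable 0 of this static method.
        static int identity(int x) {
            return x;
        }
        // "javap -c LoadDemo" shows the method body as:
        //    0: iload_0    <- the one-byte short form of "iload 0"
        //    1: ireturn
    }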

  10. 5.2.3 Instruction Set Optimization • Good graphics performance demands a fast way of moving a large block of data directly from memory to the graphics card. • The ability to perform arithmetic operations on entire blocks of memory (for example, to turn the entire screen orange in a single operation) is part of the basic instruction set of some of the later Intel chips.

  11. 5.2.3 Instruction Set Optimization • By permitting parallel operations to proceed at the same time (this kind of parallelism is called SIMD parallelism, an acronym for “Single Instruction, Multiple Data”), the effective speed of a program can be greatly increased.
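
As a concrete illustration (using a facility far newer than the chips this chapter describes), modern Java exposes SIMD through the incubating Vector API; this sketch processes several array elements per hardware instruction and needs the --add-modules jdk.incubator.vector flag to compile:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SimdScale {
        static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

        // Multiply every element of a[] by 2, several "lanes" per instruction.
        static void scale(float[] a) {
            int i = 0;
            for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
                FloatVector v = FloatVector.fromArray(SPECIES, a, i);
                v.mul(2.0f).intoArray(a, i);   // single instruction, multiple data
            }
            for (; i < a.length; i++) {
                a[i] *= 2.0f;                  // scalar cleanup for the tail
            }
        }
    }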

  12. 5.2.4 Pipelining • To work on more than one instruction at a time, the CPU has a much more complex, pipelined, fetch-execute cycle that allows it to process several different instructions at once. • Operations can be processed in assembly-line fashion: instead of putting cars together one at a time, everyone has a single well-defined job, and thousands of cars are put together via tiny steps.

  13. 5.2.4 Pipelining • While part of the CPU is actually executing one instruction, a different part of the CPU can already be fetching a different instruction. • By the time the first instruction finishes executing, the next instruction is already there and available to be executed. • Instruction pre-fetch means fetching an instruction before the CPU actually needs it, so that it is available at once.

  14. 5.2.4 Pipelining • This doesn't improve the latency (each operation still takes the same amount of time from start to finish) but can substantially improve the throughput, the number of instructions that can be handled per second by the CPU as a whole.
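
For example (illustrative numbers, not from the slides): in a four-stage pipeline where each stage takes 1 ns, every instruction still needs 4 ns from fetch to writeback, so latency is unchanged; but once the pipeline is full, one instruction completes every 1 ns, roughly quadrupling throughput over an unpipelined design that finishes one instruction every 4 ns.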

  15. 5.2.4 Pipelining • Unpipelined laundry

  16. 5.2.4 Pipelining • Pipelined laundry

  17. 5.2.4 Pipelining • Fetch stage • Instruction is loaded from main memory. • Dispatch stage • Analyze what kind of instruction it is. • Get the source arguments from the appropriate locations. • Prepare the instruction for actual execution by the third, execute, stage of the pipeline. • Complete/writeback stage • Transfer the results of the computation to the appropriate registers. • Update the overall machine state as necessary.
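
A minimal software sketch of these stages advancing in lock-step, one stage per clock tick (an illustrative model with a placeholder Instruction class, not a real CPU):

    class Instruction {
        final String name;
        Instruction(String name) { this.name = name; }
    }

    class Pipeline {
        private Instruction fetched, dispatched, executing; // instructions in flight

        // One clock tick: every instruction in flight moves one stage forward.
        void tick(Instruction next) {
            if (executing != null) {
                // complete/writeback: commit the finished instruction's results
                System.out.println("writeback: " + executing.name);
            }
            executing  = dispatched; // dispatch stage hands off to execute
            dispatched = fetched;    // fetch stage hands off to dispatch
            fetched    = next;       // fetch: load the next instruction from memory
        }
    }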

  18. 5.2.4 Pipelining

  19. 5.2.4 Pipelining • A pipeline can only run as fast as its slowest stage. • When a slow instruction needs to be executed, it can cause a blockage (sometimes called a “bubble”) in the pipeline as other instructions pile up behind it. • Ideally, each pipeline stage should take the same amount of time.

  20. 5.2.4 Pipelining • “Jump if less than” • Once this instruction has been encountered, the next instruction will come either from the next instruction in sequence, or else from the instruction at the target of the jump, and we may not know which. • The condition depends on the results of a computation somewhere ahead of us in the pipeline and is therefore unavailable.

  21. 5.2.4 Pipelining • Returns from subroutines create their own problems. • The computer may have no choice but to stall the pipeline until it is empty. • Branch prediction is the art of guessing whether or not the computer will take a given branch (and to where). • If the guess is wrong, these locations (and the pipeline) are flushed and the computer restarts with an empty pipeline, which is no worse than having to stall the pipeline.

  22. 5.2.4 Pipelining • Since most loops are executed more than once, the branch will be taken many, many times and not taken only once. • A guess of “take the branch” in this case could be accurate 99.9% of the time without much effort. • By adjusting the amount and kind of information available, engineers have gotten very good (well above 90% accuracy) at this guessing, enough to make pipelining a crucial aspect of modern design.
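
The classic mechanism behind such guessing is a two-bit saturating counter; here is a sketch of one (a single counter shown for illustration; real predictors keep a table of them indexed by branch address):

    class TwoBitPredictor {
        private int counter = 3;  // 0-1 predict "not taken", 2-3 predict "taken"

        boolean predictTaken() {
            return counter >= 2;
        }

        // Once the branch resolves, nudge the counter toward what actually
        // happened; a loop branch taken thousands of times stays at 3, so a
        // single loop exit does not flip the prediction.
        void update(boolean taken) {
            counter = taken ? Math.min(3, counter + 1)
                            : Math.max(0, counter - 1);
        }
    }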

  23. 5.2.5 Superscalar Architecture • Superscalar processing performs multiple different instructions at once. • Instruction queue • Instead of just loading one instruction at a time, we instead have a queue of instructions waiting to be processed. • This is an example of MIMD (Multiple Instruction, Multiple Data) parallelism: while one pipeline is performing one instruction (perhaps a floating-point multiplication) on a piece of data, another pipeline can be doing an entirely different operation on entirely different data.

  24. 5.3 Optimizing Memory • Data the computer needs should be available as quickly as possible. • The memory should be protected from accidental re-writing.

  25. 5.3.1 Cache Memory • With a 32-bit word size, each register can hold any of 2^32 values. • This allows up to about four gigabytes of memory to be addressed. • A program generally uses only a small fraction of that memory at any given instant. • Because speed is valuable, the fastest memory chips also cost the most. • Most real computers therefore use a multi-level memory structure. • CPUs run at 2 or 3 gigahertz, while most memory chips are substantially slower, as much as four hundred times slower than the CPU.

  26. 5.3.1 Cache Memory • Cache memory • Frequently and recently used memory locations are copied into cache memory so that they are available more quickly when the CPU needs them. • Level one (L1) cache is built into the CPU chip itself and runs at CPU speed. • Level two (L2) cache is a special set of high-speed memory chips placed next to the CPU on the motherboard.
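
The bookkeeping behind “frequently and recently used locations are copied into the cache” can be sketched as a direct-mapped lookup (sizes are illustrative; real L1/L2 caches are usually set-associative):

    class DirectMappedCache {
        static final int LINES = 256;       // number of cache lines
        static final int LINE_BYTES = 64;   // bytes per line
        private final long[] tags = new long[LINES];
        private final boolean[] valid = new boolean[LINES];

        // A hit: the line is valid and its stored tag matches the address's tag.
        boolean lookup(long address) {
            long block = address / LINE_BYTES;   // which memory block this is
            int index  = (int) (block % LINES);  // the one line it may occupy
            long tag   = block / LINES;          // identifies the block uniquely
            return valid[index] && tags[index] == tag;
        }

        // On a miss, install the block (evicting whatever was there before).
        void fill(long address) {
            long block = address / LINE_BYTES;
            int index  = (int) (block % LINES);
            tags[index]  = block / LINES;
            valid[index] = true;
        }
    }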

  27. 5.3.2 Memory Management • Rather than referring to specific physical locations in memory, the program refers to a particular logical address, which is reinterpreted by the memory manager as a particular physical location, or possibly even as a location on the hard disk. • Many computers provide hardware support for memory management in the interest of speed, portability, and security.

  28. 5.3.2 Memory Management • User-level programs can simply assume that logical addresses are identical to physical addresses, and that any bit pattern of the appropriate length represents a memory location somewhere in physical memory, even if the actual physical memory is considerably larger or smaller than the logical address space. • Under the hood is a sophisticated way of converting logical memory addresses into appropriate physical addresses.

  29. 5.3.3 Direct Address Translation • Direct address translation occurs when hardware address translation has been turned off (only the supervisor can do this). • Only 4GB of memory can be accessed. • This is done only in the interest of speed, on a special-purpose computer expected to run only one program at a time.

  30. 5.3.4 Page Address Translation • Virtual address space • We could define a set of 24-bit segment registers to extend the address value. • The top four bits of the logical address select a particular segment register. • The value stored in this register defines a particular 24-bit virtual segment identifier (VSID). • The virtual address is obtained by concatenating the 24-bit VSID with the lower 28 bits of the logical address. • This creates a new 52-bit address.

  31. 5.3.4 Page Address Translation • Example: take the 32-bit logical address 0x13572468. • Its top four bits (0x1) select segment register #1. • Suppose that register holds the VSID 0xAAAAAA. • Concatenating this 24-bit VSID with the lower 28 bits of the logical address (0x3572468) yields the 52-bit virtual address 0xAAAAAA3572468. • Two different programs accessing the same logical location would nevertheless get two separate VSIDs, and hence two separate virtual addresses.
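
The same computation expressed as bit operations (a sketch; loading 0xAAAAAA into segment register #1 is just the example's assumption):

    public class SegmentTranslate {
        static final long[] segmentRegisters = new long[16]; // sixteen 24-bit VSIDs

        // 32-bit logical address -> 52-bit virtual address.
        static long toVirtual(int logical) {
            int segment = (logical >>> 28) & 0xF;    // top 4 bits pick a register
            long vsid   = segmentRegisters[segment]; // the 24-bit VSID stored there
            long offset = logical & 0x0FFFFFFF;      // lower 28 bits pass through
            return (vsid << 28) | offset;            // 24 + 28 = 52 bits
        }

        public static void main(String[] args) {
            segmentRegisters[1] = 0xAAAAAA;          // the example's assumption
            System.out.printf("%X%n", toVirtual(0x13572468)); // prints AAAAAA3572468
        }
    }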

  32. 5.3.4 Page Address Translation • This gives 2^52 bytes of virtual memory (4 million gigabytes, or 4 petabytes). • Physical memory is divided into pages of 4096 (2^12) bytes each. • Each 52-bit virtual address can thus be thought of as a 40-bit page identifier plus a 12-bit offset within the page. • The computer stores a set of “page tables,” in essence a hash table that stores the physical location of each page as a 20-bit number. • The 40-bit page identifier is thus converted, via a table lookup, to a 20-bit physical page address. • The 32-bit physical address is the page address concatenated with the offset.
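
Continuing the sketch from the previous slide, here is the page-table step, with a HashMap standing in for the hardware's hash table (illustrative, not the real data structure):

    import java.util.HashMap;
    import java.util.Map;

    public class PageTranslate {
        static final int PAGE_BITS = 12;  // 4096-byte pages
        static final Map<Long, Integer> pageTable = new HashMap<>();

        // 52-bit virtual address -> 32-bit physical address.
        static int toPhysical(long virtual) {
            long pageId = virtual >>> PAGE_BITS;          // 40-bit page identifier
            int offset  = (int) (virtual & 0xFFF);        // 12-bit offset in the page
            Integer physicalPage = pageTable.get(pageId); // 20-bit physical page
            if (physicalPage == null) {
                throw new IllegalStateException("page fault: page not resident");
            }
            return (physicalPage << PAGE_BITS) | offset;  // 20 + 12 = 32 bits
        }
    }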

  33. 5.3.4 Page Address Translation

  34. 5.4.1 The Problem with Busy-Waiting • To get the best performance out of peripherals, they should not be permitted to prevent the CPU from doing other useful work. • A good human typist can type at about 120 words per minute. • A 1GHz computer can add 100,000,000 numbers together between two keystrokes. • Polling checks at periodic intervals to see whether anything useful has happened.

  35. 5.4.1 The Problem with Busy-Waiting

    while (no key is pressed) {
        // wait a little bit
    }
    // figure out what the key was and do something with it

  36. 5.4.1 The Problem with Busy-Waiting • Polling (or busy-waiting) is an inefficient use of the CPU, because the computer is kept “busy” waiting for the key to be pressed and can't do anything else useful.

  37. 5.4.2 Interrupt Handling • The alternative is to set up a procedure to follow when the event occurs, and then do whatever else needs doing in the meantime. • When the event happens, the CPU interrupts the current task to deal with the event using the previously established procedure. • The CPU defines several different kinds of interrupt signals that are generated under pre-established circumstances, such as the press of a key.

  38. 5.4.2 Interrupt Handling • The normal fetch-execute cycle is changed slightly. • Instead of loading and executing the “next” instruction, the CPU consults a table of interrupt handler locations. Control is then transferred to the appropriate location, and the special interrupt handler is executed to do whatever is needed. • At the end of the interrupt handler, the computer returns to the main task at hand.

  39. 5.4.2 Interrupt Handling • The possible interrupts for a given chip are numbered from zero to a small value like 10. • These numbers also correspond to locations programmed into the interrupt vector: when interrupt number 0 occurs, the CPU will jump to location 0x00 and execute whatever code is stored there. • Usually all that is stored at the actual interrupt location itself is a single JMP instruction to transfer control (still inside the interrupt handler) to a larger block of code that does the real work.
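
A sketch of this dispatch in software, with Runnable handlers standing in for the machine's interrupt vector (the table size and handler body are illustrative):

    public class InterruptVector {
        static final Runnable[] vector = new Runnable[11]; // interrupts 0..10

        // Hardware analogue: on interrupt n, jump to the code at entry n,
        // run the handler, then resume the interrupted task.
        static void raise(int n) {
            Runnable handler = vector[n];
            if (handler != null) {
                handler.run();
            }
        }

        public static void main(String[] args) {
            vector[0] = () -> System.out.println("key pressed: handle it");
            raise(0); // as if the keyboard had signalled interrupt 0
        }
    }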

  40. 5.4.2 Interrupt Handling • The interrupt-handling mechanism can also handle system-internal events. • For example, the time-sharing aspect of the CPU can be controlled by setting an internal timer. • When the timer expires, an interrupt is generated, causing the machine, first, to switch from user to supervisor mode, and second, to branch to an interrupt handler that swaps the programming context for the current program out and the context for the next program in. • The timer can then be reset and computation resumed for the new program.

  41. 5.4.3 Communicating with the Peripherals: Using the Bus • Data must move between the CPU, memory, and peripherals using one or more buses, and you would like this to be as fast as possible. • A bus is usually just a set of wires, and so it connects all the components together at the same time: every peripheral gets the same message at the same time. • Only one device can be using the bus at once. • To use a bus successfully requires discipline from all parties involved.

  42. 5.4.3 Communicating with the Peripherals: Using the Bus • A typical bus protocol might involve the CPU sending a START message and then an identifier for a particular device. • Only that specific device will respond, with some sort of ACKNOWLEDGE message. • All other devices have been warned by the START message not to attempt to communicate until the CPU finishes and sends a similar STOP message.
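
One way to model that discipline in software (the message sequence comes from the slide; the class and method names are illustrative):

    public class Bus {
        private int selectedDevice = -1;  // -1 means the bus is idle

        // CPU: claim the bus and name the device it wants to talk to (START).
        void start(int deviceId) {
            if (selectedDevice != -1) {
                throw new IllegalStateException("bus already in use");
            }
            selectedDevice = deviceId;    // all other devices must stay quiet
        }

        // Only the addressed device may answer (ACKNOWLEDGE).
        boolean acknowledge(int deviceId) {
            return deviceId == selectedDevice;
        }

        // CPU: release the bus so another conversation can begin (STOP).
        void stop() {
            selectedDevice = -1;
        }
    }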

  43. 5.5 Chapter Review • The JVM is freed from some practical limitations. • Engineers have found many techniques to squeeze better performance out of their chips. • One way to get better user-level performance is by improving the basic numbers of the chip, but this is usually a difficult and expensive process.

  44. 5.5 Chapter Review • Another way to improve the performance of the system is to allow it to run more than one program at a time. • Engineers create special-purpose instructions and hardware specifically to support those programs. • Performance can also be increased by parallelism. • We can distinguish SIMD parallelism from MIMD parallelism in terms of the flexibility of what kind of instruction can be simultaneously executed.

  45. 5.5 Chapter Review • In pipelining, the fetch-execute cycle is broken down into several stages, each of which is executed independently. • Superscalar architecture provides another way to speed up processing, by duplicating pipeline hardware so that several instructions can be processed at once. • Memory access times can be improved by using cache memory.

  46. 5.5 Chapter Review • Virtual memory and paging can provide computers with access to greater amounts of memory more quickly and securely. • The use of interrupts can give substantial performance increases when using peripherals. • A suitable design of a bus protocol can speed up how fast data moves around the computer.
