QQ: Nanoscale Timing and Profiling

James Frye †*, James G. King †*, Christine J. Wilson *◊, Frederick C. Harris, Jr. †*
† Department of Computer Science and Engineering
* Brain Computation Lab
◊ Biomedical Engineering
University of Nevada, Reno, NV 89557

What is QQ • QQ is a simple and efficient tool for measuring timing and memory use • Developed for the examination of a massively parallel program (NCS) • Easily extensible to inspect other programs

The Place: The Human Brain • Goal: • create the first large-scale, synaptically realistic cortical computational model. • Purpose: • Simulation Experiments • Drug Trials • Alzheimer’s Research • Robotics

The Science: • Neurons: excitatory; interneurons (inhibitory) • Columns: high connectivity within columns, less connectivity across columns

The Science (cont): • Channels: potassium family (M, A, AHP channels); suppressing behavior on parent cell • Synapses: analog converter of binary spike event; contextual filters

The Science (cont): • Neurons

NCS Biology • The membrane voltage determines the cell’s firing rate • Once the threshold voltage is reached, the cell sends an action potential to its connected synapses [Figure: action potential, membrane voltage (mV; axis marks 30, 0, -45) versus time (ms)]

2-Cell Model [Figure: pre-synaptic and post-synaptic cell traces, 0.2 mV scale, time axis 0 to 500 ms]

No Channels Sustained firing at maximum rate during a continuous stimulus

Ka Channel Slows the initial response during a sustained stimulus

Km Channel Prevents continuous bursting during a continuous stimulus

Kahp Channel Dampens the effect while still allowing for some action potentials during a sustained stimulus

QQ Development • QQ was developed to optimize a parallel program used to simulate cortical neurons – the NeoCortical Simulator (NCS) • Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time • Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day per simulated second of synaptic activity • Clearly, optimization of NCS was needed

QQ Design • QQ is designed so that all of its routines can be selectively compiled into a program • In the QQ.h header file, each routine is defined under a preprocessor directive, so that if profiling is not enabled, the call reduces to an empty statement.
    #ifdef QQ_ENABLE
    void QQInit (int);
    #else
    #define QQInit(dummy)
    #endif
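A minimal sketch of the compile-out pattern described on this slide. Only the QQInit name and the QQ_ENABLE switch come from the slides; qq_init_calls and run_step are hypothetical names added for illustration. One detail worth showing: a function-like macro needs its parameter list directly after the name, with no space, otherwise the preprocessor treats it as an object-like macro.

```c
/* Hypothetical stand-in for QQ.h. QQInit and QQ_ENABLE come from the slides;
   qq_init_calls and run_step are illustrative names only. */
static int qq_init_calls = 0;

#ifdef QQ_ENABLE
/* Profiling build: a real function that would set up the timing buffers. */
static void QQInit(int node)
{
    (void)node;
    qq_init_calls++;
}
#else
/* Normal build: the call expands to an empty expression statement.
   Note there is no space between the macro name and '('. */
#define QQInit(dummy) ((void)0)
#endif

/* Application code calls QQInit unconditionally in both builds. */
static int run_step(void)
{
    QQInit(0);
    return qq_init_calls;
}
```

Built without QQ_ENABLE, run_step() leaves the counter at zero because the call has been removed by the preprocessor; built with -DQQ_ENABLE, the same source records the call.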

QQ Design • Memory profiling routines also use the C preprocessor to intercept library calls
    #ifdef QQ_ENABLE
    #define malloc(arg) MemMalloc (MEM_KEY, arg)
    #endif
• The MemMalloc function records allocation information, calls the malloc function to do the actual allocation, and returns the result to the caller
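The interception described above might look roughly like the following sketch. MemMalloc and MEM_KEY are names from the slides; the key enum, the MemStat table, and make_channel_state are assumptions made for illustration and are not QQ's actual bookkeeping.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical keys and bookkeeping table; only MemMalloc and MEM_KEY
   are names taken from the slides. */
enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };

typedef struct {
    size_t calls;  /* number of allocation calls recorded for this key */
    size_t bytes;  /* total bytes requested under this key */
} MemStat;

static MemStat mem_stats[KEY_COUNT];

/* Defined before the macro below, so the malloc called here is the
   library function, not the wrapper. */
static void *MemMalloc(int key, size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        mem_stats[key].calls++;
        mem_stats[key].bytes += size;
    }
    return p;
}

/* From here on, every malloc in this file is charged to MEM_KEY. */
#define MEM_KEY KEY_CHANNEL
#define malloc(arg) MemMalloc(MEM_KEY, (arg))

/* Example caller: the source is unchanged, but the allocation is recorded. */
static double *make_channel_state(size_t n)
{
    return malloc(n * sizeof(double));
}
```

The ordering matters in this sketch: the wrapper is defined first, and the macro is introduced only afterwards, so the wrapper does not call itself.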

QQ Timing • Extremely accurate measurement of execution speed • In theory, fine-grained resolution down to a single clock cycle, using the IA-32 instruction RDTSC • In practice, measurements are accurate to tens of cycles, because of instruction reordering and multiple pipelines in the CPU
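As a rough illustration of cycle-level timing, the sketch below reads the x86 time-stamp counter through the __rdtsc() intrinsic that GCC and Clang provide in x86intrin.h, and converts a cycle count to nanoseconds. The function names are hypothetical, the fallback to clock() on other architectures is an assumption of this sketch, and none of it is taken from the QQ source.

```c
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* provides __rdtsc() on GCC and Clang */
#endif

/* Read a raw tick count. On x86 this is the time-stamp counter; elsewhere
   the sketch falls back to the much coarser standard clock(). */
static unsigned long long qq_read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return (unsigned long long)clock();
#endif
}

/* Convert a cycle count to nanoseconds for a clock rate given in MHz. */
static double qq_cycles_to_ns(unsigned long long cycles, double mhz)
{
    return (double)cycles / mhz * 1000.0;
}
```

By this conversion, an uncertainty of ten cycles corresponds to 12.5 ns on an 800 MHz processor and to roughly 4.5 ns at 2200 MHz, which is the scale the talk's title refers to.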

Timing Measurements • Measuring the impact of a one-line change in the calculation for the Km channel
    From: I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);
    To:   I = unitaryG * strength * (ReversePot - CmpV);
• For a Km-type channel, mPower is always 1, so we were able to simplify the equation to streamline execution • Wrapping the line in calls to QQ, we measure the effect of this single change
    QQStateOn (QQ_Km);
    I = unitaryG * strength * (ReversePot - CmpV);
    QQStateOff (QQ_Km);

Timing Measurements • Both code versions give similar cycle counts on the two processors, though counts on the P4 are more consistent and somewhat lower than on the P3 • For similar cycle counts, elapsed times scale inversely with processor clock speed, as expected • The function call pays a heavy penalty on its first call: it is called only by the Km channel code in this program, so that time represents the first load of the code into cache

Timing Measurements PIII – 800 MHz

Timing Measurements P4 – 2200 MHz

Expanding Timing Information • QQ allows the user to record an additional item of information with the normal timing. • QQCount records an integer with the key • QQCount( eventKey, integer_of_interest ); • QQValue records a double precision floating point value with the key • QQValue( eventKey, double_of_interest ); • QQState records a state of ON or OFF with the key • QQStateOn( eventKey ); QQStateOff( eventKey ); • These will be described during discussion of the output format
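One way to picture these three calls is as writers of a small tagged record. The sketch below is an assumption-laden illustration: QQCount, QQValue, QQStateOn, QQStateOff, and QQItem are names from the slides, but the record layout, the fixed-size log, and the fake clock standing in for the cycle counter are all hypothetical.

```c
#include <stddef.h>

/* One logged event: key, one optional item, and the event time.
   The union is the size of a double, as the output format describes. */
typedef struct {
    int key;
    union {
        int state;     /* QQStateOn / QQStateOff: 1 or 0 */
        int count;     /* QQCount */
        double value;  /* QQValue */
    } info;
    unsigned long long time;
} QQItem;

#define QQ_MAX_ITEMS 1024

static QQItem qq_log[QQ_MAX_ITEMS];
static size_t qq_used = 0;
static unsigned long long qq_fake_clock = 0;  /* stand-in for the cycle counter */

/* Reserve the next record and stamp it; returns NULL when the log is full. */
static QQItem *qq_next(int key)
{
    QQItem *item;
    if (qq_used == QQ_MAX_ITEMS)
        return NULL;
    item = &qq_log[qq_used++];
    item->key = key;
    item->time = qq_fake_clock++;
    return item;
}

static void QQCount(int key, int n)
{
    QQItem *item = qq_next(key);
    if (item != NULL) item->info.count = n;
}

static void QQValue(int key, double v)
{
    QQItem *item = qq_next(key);
    if (item != NULL) item->info.value = v;
}

static void QQStateOn(int key)
{
    QQItem *item = qq_next(key);
    if (item != NULL) item->info.state = 1;
}

static void QQStateOff(int key)
{
    QQItem *item = qq_next(key);
    if (item != NULL) item->info.state = 0;
}
```

Because each record carries only the key, the reader of the log needs a separate table of key types to know which member of the union is meaningful.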

QQ Memory • Records memory allocation dedicated to the code-block, rather than the total allocation due to code and library calls, to single-byte accuracy

QQ Memory Example • NCS implementation of ion channels • Suppose we want to know the total memory used by all channels. Each channel function would need the channel key defined: #define MEM_KEY KEY_CHANNEL • Then, at any point in the program's execution, just call the MemPrint function to display memory use

Memory Usage Output
Memory Allocation: Total Allocated = 988 KBytes

Item          Object  Number   Number   Object  Alloc  Total  Max
              Size    Created  Deleted  KB      KB     KB     KB
Brain         120     1        0        1       0      1      1
CellManager   44      1        0        1       1      1      1
Cell          16      100      0        2       0      2      2
Channel       252     300      0        74      0      74     74
Compartment   324     100      0        32      2      33     33
MessageMgr    16      1        0        1       205    205    205
MessageBus    0       0        0        0       1      1      1
Report        80      1        0        1       1      1      1
Stimulus      252     1        0        1       1      1      1
Synapse       44      10000    0        430     118    547    547
------------------------------------------------------------------
1             2       3        4        5       6      7      8

Key
1 - Internal name given to recording category
2 - The size of the object being allocated; it is valid only if all allocations are the same size, as with "new Object"
3 - Number of allocation calls made: new, malloc, calloc, etc.
4 - Number of free or delete calls made
5 - KBytes allocated via object creation (new)
6 - KBytes allocated via *alloc calls
7 - Total memory currently allocated
8 - Max memory ever allocated (high-water mark)
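The difference between columns 7 and 8 comes down to a running total and a high-water mark. A minimal sketch of that bookkeeping, with a hypothetical MemRow type and function names that are not taken from QQ:

```c
#include <stddef.h>

/* Hypothetical per-category counters, named after the report columns. */
typedef struct {
    size_t created;        /* column 3: allocation calls */
    size_t deleted;        /* column 4: free or delete calls */
    size_t current_bytes;  /* column 7: memory currently allocated */
    size_t max_bytes;      /* column 8: high-water mark */
} MemRow;

/* Record an allocation and raise the high-water mark if needed. */
static void mem_row_alloc(MemRow *row, size_t bytes)
{
    row->created++;
    row->current_bytes += bytes;
    if (row->current_bytes > row->max_bytes)
        row->max_bytes = row->current_bytes;
}

/* Record a release; the high-water mark never goes back down. */
static void mem_row_free(MemRow *row, size_t bytes)
{
    row->deleted++;
    row->current_bytes -= bytes;
}
```

In the report above no objects had been deleted yet, which is why the Total and Max columns agree on every row.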

QQ Applications • Brain Communication Server (BCS) • NCS

Brain Communication Server • Further experimentation with the simulator required that another application be developed to coordinate communication between NCS and numerous potential clients: • virtual creatures • physical robots • visualization tools [Diagram: NCS and BCS]

Optimizing BCS • Different applications make non-sequential requests • No single function is called in a loop with many iterations, so time had to be measured over the whole course of execution and then analyzed from QQ’s final output

Parsing QQ’s output • QQ uses a straightforward layout for the final output file • The data can be easily extracted and displayed in a text report as shown on the previous slide or sent to a graphical display • The following slides describe the output format and how to manage the information

QQ file format
Header: Number of keys (int), Key name string length (int)
Key Table: for each key, Key ID (int), Key type (int), Key name (char *)
Node Information: Number of nodes (int)
Node Table: for each node, Byte offset to data (size_t), Number of entries (int), Starting base time (unsigned long long), MHz (double)
Data: for each node, for each entry, item (QQItem)

QQ Format – Data Close Up
Previous Sections
Data:
Node 0 byte offset: Node 0, for each entry: Key (int), [Optional Info], Event Time (unsigned long long)
Node 1 byte offset: Node 1, for each entry: Key (int), [Optional Info], Event Time (unsigned long long)
Node 2 byte offset: Node 2, for each entry: …
Where Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)
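A reader for this layout could start from something like the sketch below, which decodes the header and one node-table record from a byte buffer. The field order follows the slide; everything else is an assumption: fields are taken to be stored back to back with no padding, in the byte order and type sizes of the machine that wrote them, and all struct and function names are hypothetical. The key table and node count, which sit between these two records in the file, are not handled here.

```c
#include <stddef.h>
#include <string.h>

/* Field order follows the slide; packing and byte order are assumptions. */
typedef struct {
    int num_keys;
    int key_name_len;
} QQHeader;

typedef struct {
    size_t data_offset;            /* byte offset to this node's data */
    int num_entries;
    unsigned long long base_time;  /* starting base time */
    double mhz;
} QQNodeInfo;

/* Copy `size` bytes out of the buffer at *pos and advance *pos.
   Returns 0 on success, -1 if the buffer is too short. */
static int qq_take(const unsigned char *buf, size_t len, size_t *pos,
                   void *out, size_t size)
{
    if (*pos > len || size > len - *pos)
        return -1;
    memcpy(out, buf + *pos, size);
    *pos += size;
    return 0;
}

static int qq_read_header(const unsigned char *buf, size_t len, size_t *pos,
                          QQHeader *h)
{
    if (qq_take(buf, len, pos, &h->num_keys, sizeof h->num_keys) != 0)
        return -1;
    return qq_take(buf, len, pos, &h->key_name_len, sizeof h->key_name_len);
}

static int qq_read_node(const unsigned char *buf, size_t len, size_t *pos,
                        QQNodeInfo *n)
{
    if (qq_take(buf, len, pos, &n->data_offset, sizeof n->data_offset) != 0)
        return -1;
    if (qq_take(buf, len, pos, &n->num_entries, sizeof n->num_entries) != 0)
        return -1;
    if (qq_take(buf, len, pos, &n->base_time, sizeof n->base_time) != 0)
        return -1;
    return qq_take(buf, len, pos, &n->mhz, sizeof n->mhz);
}
```

Reading field by field, rather than copying a whole struct at once, keeps the sketch independent of whatever padding the compiler adds inside QQNodeInfo.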

Gathering the Results • After reading a node’s data section, entries with the same key can be gathered • Using the key table, the user knows what is contained in the second block of a timing entry • Example: • Key 2 has type “State” • The second block contains integer 1 for “on” or integer 0 for “off” • By subtracting the event times, the length of time spent in the “on” state is determined
    2  1  109342759
    2  0  109342768

Another example • Example: • Key 4 has type “Value” • The second block contains a double precision value passed in during execution • The value can be saved and displayed with timing information, or sent to a separate graph • Timing is obtained the same as before, by subtracting the event times
    4  -65.3477  109342735
    4  -58.2367  109342819
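For the two State entries shown, the subtraction gives 109342768 - 109342759 = 9 cycles in the "on" state. The sketch below generalizes that step to a list of events; the StateEvent type and the function name are hypothetical, and pairing each "on" with the next "off" for the same key is an assumption about how the entries are meant to be matched.

```c
#include <stddef.h>

/* A decoded State entry: key, on/off flag, and event time in cycles. */
typedef struct {
    int key;
    int state;                /* 1 = on, 0 = off */
    unsigned long long time;
} StateEvent;

/* Sum the cycles spent "on" for one key, pairing each on event with the
   next off event for that key. Entries for other keys are skipped. */
static unsigned long long qq_on_cycles(const StateEvent *ev, size_t n, int key)
{
    unsigned long long total = 0;
    unsigned long long on_time = 0;
    int is_on = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (ev[i].key != key)
            continue;
        if (ev[i].state == 1 && !is_on) {
            on_time = ev[i].time;
            is_on = 1;
        } else if (ev[i].state == 0 && is_on) {
            total += ev[i].time - on_time;
            is_on = 0;
        }
    }
    return total;
}
```

Dividing the resulting cycle count by the node's MHz value from the node table converts it to microseconds.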

NCS Performance Measurement • QQ was able to home in on specific blocks of code and measure them at a resolution fine enough for easy interpretation

Optimization Targets • QQ analysis quickly identified two major targets within the code • Synapses • Message Passing

Synapses • Synapses were by far the most common element of any NCS model with the most memory usage • Active only when an action potential was processed through the synapse • Pass information between the nodes via message passing

Message Parsing Overhead • Using QQ, we were able to identify areas for improvement within NCS 3 • Many unneeded fields, requiring better encoding of their destination • Fixed number of messages pre-allocated, far more than needed by the program – implemented a shared pool, with buffers allocated as needed • Messages sent individually and processed multiple times – implemented a packet scheme: process each packet once for send, once for receive • Process messages only when used

Optimization Results

Execution Time Measurements after Optimization

Conclusions • QQ allows profiling of nanoscale timing of code segments and memory usage analysis • Fine grained measurements of specific events • Ability to measure memory at an object or event level with a small memory and performance footprint • Simple and effective tool

Future Work • New Opteron cluster • BlueGene migration • NCS is currently being installed at our sister lab, the Brain Mind Institute at EPFL in Switzerland, on their new machine • Robotic integration

Acknowledgements • Office of Naval Research • 6 years of funding for people (3 year renewable) • 4 DURIP grants for hardware


QQ API
