
Presentation Transcript


  1. National Sun Yat-sen University Embedded System Laboratory HQEMU: A Multi-Threaded and Retargetable Dynamic Binary Translator on Multicores Presenter: Zong-Ze Huang Cite count: 7 Ding-Yong Hong, Chun-Chen Hsu, Pen-Chung Yew, Jan-Jan Wu Wei-Chung Hsu, Pangfeng Liu, Chien-Min Wang, Yeh-Ching Chung Proceedings of the Tenth International Symposium on Code Generation and Optimization, 2012

  2. Abstract (1) • Dynamic binary translation (DBT) is a core technology for many important applications such as system virtualization, dynamic binary instrumentation and security. However, several factors often impede its performance: (1) emulation overhead before translation; (2) translation and optimization overhead; and (3) translated code quality. For the dynamic binary translator itself, the issues also include its retargetability to support guest applications from different instruction-set architectures (ISAs) on host machines that also have different ISAs, an important feature for system virtualization. • In this work, we take advantage of ubiquitous multicore platforms and use a multi-threaded approach to implement DBT. By running the translators and the dynamic binary optimizer on different threads on different cores, we can off-load the overhead caused by the DBT from the target applications and thus afford the DBT more sophisticated optimization techniques as well as support for retargetability.

  3. Abstract (2) • Using QEMU (a popular retargetable DBT for system virtualization) and LLVM (Low Level Virtual Machine) as our building blocks, we demonstrated in a multi-threaded DBT prototype, called HQEMU, that it could improve QEMU performance by a factor of 2.4X and 4X on the SPEC 2006 integer and floating point benchmarks for x86 to x86-64 emulations, respectively, i.e. it is only 2.5X and 2.1X slower than native execution of the same benchmarks on x86-64, as opposed to 6X and 8.4X slowdown on QEMU. For ARM to x86-64 emulation, HQEMU could gain a factor of 2.4X speedup over QEMU for the SPEC 2006 integer benchmarks.

  4. What is the Problem • Three factors often impede DBT performance: • Emulation overhead before translation • Translation and optimization overhead • Translated code quality • Proposed methods: • Develop a multi-threaded, retargetable DBT prototype called HQEMU. • Propose a novel trace combination technique to improve existing trace selection algorithms.

  5. Related work • NET algorithm [16]: used to choose the hot TBs (trace selection). • QEMU+LLVM [17]: uses TCG IR to achieve retargetability, but can only send one trace at a time to LLVM. • COREMU [28]: runs multiple QEMU instances mapped to multiple threads. • PQEMU [11]: keeps one instance of QEMU but parallelizes it internally. • HPM-based sampling profiles [19][7]: propose techniques to improve the accuracy of HPM sampling. • This work: HQEMU, a multi-threaded and retargetable dynamic binary translator on multicores.

  6. Overview of QEMU • TCG (Tiny Code Generator) is QEMU's core translation engine; it provides a small set of IR operations (about 142 opcodes). • TCG began as a generic backend for a C compiler and is now used in QEMU. • Flow: Guest code → (translate) → TCG IR → (compile) → Host code, as sketched below. • Advantages: • Fast translation (compared with LLVM). • Performs some code optimization, e.g. dead code elimination (compared with Dyngen). • Defects: • Generated code quality is low (compared with LLVM). • Without further optimization, many redundant load and store operations are left in the generated host code.
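A minimal C sketch of this lookup-translate-execute flow follows: on a code-cache miss, the guest block is translated into host code (via TCG IR) and cached, then executed. The cache layout, the CPUState fields and the stub "translated blocks" are invented for illustration and do not match QEMU's real data structures or API.

```c
/* Conceptual sketch only: a toy dispatch loop in the style of the flow on
 * this slide (guest code -> TCG IR -> host code). */
#include <stdio.h>
#include <stdint.h>

#define TB_CACHE_SIZE 1024

typedef struct { uint64_t pc; int halted; } CPUState;   /* toy guest CPU    */
typedef uint64_t (*HostFn)(CPUState *);                 /* translated block */

static struct { uint64_t pc; HostFn fn; } tb_cache[TB_CACHE_SIZE];

/* Toy "translated blocks": each returns the next guest PC. */
static uint64_t block_1000(CPUState *s) { (void)s; return 0x2000; }
static uint64_t block_2000(CPUState *s) { s->halted = 1; return 0; }

/* Stage 1 + 2: decode the guest block into TCG IR, then compile the IR to
 * host code.  Both stages are stubbed out here. */
static HostFn translate_block(uint64_t pc)
{
    printf("translate 0x%llx: guest code -> TCG IR -> host code\n",
           (unsigned long long)pc);
    return pc == 0x1000 ? block_1000 : block_2000;
}

static HostFn tb_find(uint64_t pc)
{
    unsigned idx = pc % TB_CACHE_SIZE;
    if (tb_cache[idx].fn && tb_cache[idx].pc == pc)
        return tb_cache[idx].fn;                 /* code cache hit  */
    HostFn fn = translate_block(pc);             /* miss: translate */
    tb_cache[idx].pc = pc;
    tb_cache[idx].fn = fn;
    return fn;
}

int main(void)
{
    CPUState cpu = { .pc = 0x1000, .halted = 0 };
    while (!cpu.halted)
        cpu.pc = tb_find(cpu.pc)(&cpu);          /* dispatch loop */
    return 0;
}
```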

  7. Multi-Threaded Hybrid DBT System (1) • The goal is to design a DBT that not only emits high-quality host code but also imposes low overhead on the running application. • Two translators are included, designed for different purposes: • the TCG translator for fast translation; • the LLVM translator for generating high-quality host code. • When the LLVM optimizer receives an optimization request from the FIFO queue (sketched below), it converts the trace's TCG IRs to LLVM IRs. • This keeps the system retargetable and simplifies the backend translator tremendously.
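As a concrete illustration of the hand-off between the two translators, here is a minimal thread-safe FIFO of optimization requests using POSIX threads. The request contents (a trace id plus a buffer of TCG IR) and all names are assumptions made for this sketch, not HQEMU's actual structures.

```c
/* Minimal sketch of the FIFO optimization-request queue between the
 * execution side (TCG) and the LLVM optimizer thread. */
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int   trace_id;       /* which trace to optimize                */
    void *tcg_ir;         /* its TCG IR, to be converted to LLVM IR */
    int   num_ops;
} OptRequest;

typedef struct OptNode {
    OptRequest      req;
    struct OptNode *next;
} OptNode;

typedef struct {
    OptNode        *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} OptQueue;

void optq_init(OptQueue *q)
{
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Called from the execution thread: hand a hot trace to the optimizer and
 * keep running from the unoptimized code cache in the meantime. */
void optq_push(OptQueue *q, OptRequest req)
{
    OptNode *n = malloc(sizeof(*n));
    n->req = req;
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Called from the LLVM optimizer thread: block until a request arrives. */
OptRequest optq_pop(OptQueue *q)
{
    pthread_mutex_lock(&q->lock);
    while (!q->head)
        pthread_cond_wait(&q->nonempty, &q->lock);
    OptNode *n = q->head;
    q->head = n->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    OptRequest req = n->req;
    free(n);
    return req;
}
```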

  8. Multi-Threaded Hybrid DBT System (2) • Multi-threading is used to hide the optimization overhead. • The LLVM translator runs on a separate thread, so optimization does not interfere with the execution of the program (see the sketch below).
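Continuing the sketch above, the LLVM optimizer can run as a detached worker thread that blocks on the FIFO, converts each request's TCG IR to LLVM IR, optimizes it, and patches the trace to jump to the new code. The helpers tcg_ir_to_llvm_ir(), llvm_optimize_and_emit() and patch_trace_entry() are hypothetical placeholders for the steps named on this slide, not real HQEMU or LLVM APIs.

```c
/* The optimizer runs in its own thread so its cost is hidden from the
 * emulated program.  Builds on the OptQueue sketch above. */
#include <pthread.h>

extern OptQueue opt_queue;                              /* owned by the execution side */
void *tcg_ir_to_llvm_ir(void *tcg_ir, int num_ops);     /* hypothetical */
void *llvm_optimize_and_emit(void *llvm_ir);            /* hypothetical */
void  patch_trace_entry(int trace_id, void *host_code); /* hypothetical */

static void *llvm_optimizer_thread(void *arg)
{
    (void)arg;
    for (;;) {
        OptRequest req = optq_pop(&opt_queue);    /* blocks until work arrives */
        void *ir   = tcg_ir_to_llvm_ir(req.tcg_ir, req.num_ops);
        void *code = llvm_optimize_and_emit(ir);
        /* Redirect the trace head to the optimized code; the execution
         * thread keeps running the unoptimized code until this happens. */
        patch_trace_entry(req.trace_id, code);
    }
    return NULL;
}

/* At start-up, the main (execution) thread spawns the optimizer thread. */
void start_llvm_optimizer(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, llvm_optimizer_thread, NULL);
    pthread_detach(tid);
}
```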

  9. Trace Optimization Support • Problem definition: • The binary translator needs to save and restore the program context when execution switches from one TB to another, because the register mappings may differ. • This happens even if two TBs have a direct transition path (e.g. through block chaining) and use the same guest-to-host register mapping. • This frequent storing and reloading of registers hurts performance. • Proposed method: • Merge many small TBs into larger ones, called traces. • Eliminate the redundant loads and stores by promoting those memory operations to register accesses within a trace (illustrated below). [Figure: three chained TBs (TB1, TB2, TB3), each of which loads the CPU state before its body and saves it afterwards.]
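The effect of promoting those memory operations can be seen in this C analogy: each chained TB reloads and spills the guest register around its body, while the merged trace keeps the value in a host register (a local) and touches memory only at trace entry and exit. The GuestRegs struct and the toy TB bodies are invented for illustration, not actual generated host code.

```c
/* C analogy for the host code before and after trace formation. */
typedef struct { int r0; } GuestRegs;   /* guest CPU state kept in memory */

/* Block chaining: TB1 -> TB2 -> TB3, each with its own load/store of r0. */
static void tb1(GuestRegs *env) { int r0 = env->r0; r0 += 1; env->r0 = r0; }
static void tb2(GuestRegs *env) { int r0 = env->r0; r0 *= 2; env->r0 = r0; }
static void tb3(GuestRegs *env) { int r0 = env->r0; r0 -= 3; env->r0 = r0; }

void run_chained(GuestRegs *env)
{
    tb1(env); tb2(env); tb3(env);   /* 3 loads + 3 stores of r0 */
}

/* Trace TB1->TB2->TB3: the intermediate loads/stores are promoted to a
 * register access; only one load at entry and one store at exit remain. */
void run_trace(GuestRegs *env)
{
    int r0 = env->r0;               /* load the CPU state once  */
    r0 += 1;                        /* TB1 */
    r0 *= 2;                        /* TB2 */
    r0 -= 3;                        /* TB3 */
    env->r0 = r0;                   /* store the CPU state once */
}
```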

  10. Trace detection and algorithm • When the dispatcher's lookup hits the directory, the basic block has been translated before, which means a cyclic execution path has been found. • The profile routine is then enabled to count each time this head block is executed. • Once the count exceeds the profiling threshold, the predict routine records the blocks that follow the head block into a recording list, forming a trace. • After the trace is optimized, a direct jump is patched in to redirect execution to the optimized code. • See the sketch below.
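A hedged sketch of this detection loop is shown below, using the paper's settings of a profiling threshold of 50 and at most 16 TBs per trace (from the experimental setup slide). The TB directory, the execution helper and the hand-off to the optimizer are hypothetical stubs, not the actual HQEMU code.

```c
/* NET-style trace detection: count executions of a potential trace head,
 * and once the count reaches the threshold, record the following TBs into
 * a trace and submit it for optimization. */
#include <stdint.h>

#define PROFILE_THRESHOLD 50   /* paper's trace profiling threshold   */
#define MAX_TRACE_LEN     16   /* paper's maximum trace length in TBs */

typedef struct TB {
    uint64_t pc;
    unsigned exec_count;       /* profile routine: per-head counter */
} TB;

TB      *tb_lookup(uint64_t pc);            /* NULL if not yet translated */
TB      *translate_tb(uint64_t pc);         /* TCG fast translation       */
uint64_t execute_tb(TB *tb);                /* returns the next guest pc  */
void     submit_trace(TB *blocks[], int n); /* FIFO to the LLVM optimizer */

/* Predict routine: record the TBs executed after the hot head until the
 * path cycles back to the head or the length limit is reached.  The
 * optimizer later patches a direct jump to the optimized trace code. */
static uint64_t record_trace(TB *head)
{
    TB *blocks[MAX_TRACE_LEN];
    int n = 0;
    TB *tb = head;
    uint64_t pc;
    do {
        blocks[n++] = tb;
        pc = execute_tb(tb);       /* blocks keep executing while recording */
        tb = tb_lookup(pc);
    } while (tb && tb != head && n < MAX_TRACE_LEN);
    submit_trace(blocks, n);
    return pc;                     /* dispatcher resumes from here */
}

void dispatcher(uint64_t pc)
{
    for (;;) {
        TB *tb = tb_lookup(pc);
        if (!tb) {
            tb = translate_tb(pc);            /* first visit: TCG translate */
        } else if (++tb->exec_count >= PROFILE_THRESHOLD) {
            /* Directory hit: translated before, so a cyclic path exists --
             * this block is a trace head; record the next executing tail. */
            tb->exec_count = 0;
            pc = record_trace(tb);
            continue;
        }
        pc = execute_tb(tb);
    }
}
```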

  11. Trace merging • Problem definition: • The traces that the trace optimization can handle are either a straight path or a simple loop; it cannot deal with more complex control flow graphs (CFGs). • Proposed method: trace merging. • Force the merging of problematic traces that frequently jump among themselves. • Use a feedback-directed approach, with the help of the on-chip hardware performance monitor (HPM), to perform trace merging (see the sketch below).
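One plausible way to express "traces that frequently jump among themselves" is sketched below: transition counts between traces (attributed from HPM feedback) drive a union-find grouping, and each resulting group is re-formed into one larger trace. The counters, the threshold value and how samples are attributed are assumptions for illustration, not the paper's exact mechanism.

```c
/* Hedged sketch: decide which traces to merge, given trace-to-trace
 * transition counts gathered from HPM feedback. */
#define MAX_TRACES      256
#define MERGE_THRESHOLD 8      /* illustrative cut-off for "frequent" jumps */

static int transitions[MAX_TRACES][MAX_TRACES]; /* filled from HPM samples */
static int group_of[MAX_TRACES];                /* union-find parent array */

static int find_group(int t)
{
    while (group_of[t] != t) {
        group_of[t] = group_of[group_of[t]];    /* path halving */
        t = group_of[t];
    }
    return t;
}

static void merge_groups(int a, int b)
{
    group_of[find_group(a)] = find_group(b);
}

/* Called periodically by the dynamic optimizer after an HPM sampling pass;
 * traces that end up in the same group are merged and recompiled together. */
void plan_trace_merging(int num_traces)
{
    for (int t = 0; t < num_traces; t++)
        group_of[t] = t;

    for (int a = 0; a < num_traces; a++)
        for (int b = a + 1; b < num_traces; b++)
            if (transitions[a][b] + transitions[b][a] >= MERGE_THRESHOLD)
                merge_groups(a, b);  /* a and b frequently jump to each other */
}
```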

  12. Dynamic Binary Optimizer • A trace has to meet three criteria to be considered a hot trace: • The trace is in a stable state. • Example: assume 100 traces and collect the most recent N = 10 sampling intervals in a circular queue. • A trace is considered to be in a stable state if it appears in all entries of the circular queue. • The sample count of the trace must also be greater than a threshold. • The stability test is sketched below. [Figure: circular queue whose entries record the traces observed in each sampling interval, e.g. traces 100~91, 90~81, 80~71, 70~61, ...]
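A minimal sketch of that stability test, assuming one flag per trace per sampling interval: a trace is reported hot only if it was sampled in all of the N most recent intervals and its accumulated count exceeds a threshold. N = 10 follows the slide; the other constants are illustrative.

```c
/* Stability test over a circular buffer of the most recent sampling intervals. */
#include <stdbool.h>
#include <string.h>

#define N_INTERVALS   10          /* keep the 10 most recent intervals   */
#define MAX_TRACES    1024
#define HOT_THRESHOLD 100         /* illustrative sample-count threshold */

static bool     sampled[N_INTERVALS][MAX_TRACES]; /* circular buffer */
static unsigned sample_count[MAX_TRACES];
static int      cur_interval;

/* Called for every HPM sample that falls inside trace t. */
void record_sample(int t)
{
    sampled[cur_interval][t] = true;
    sample_count[t]++;
}

/* Called at the end of each sampling interval. */
void end_interval(void)
{
    cur_interval = (cur_interval + 1) % N_INTERVALS;
    memset(sampled[cur_interval], 0, sizeof(sampled[cur_interval]));
}

/* Hot = stable (seen in all N most recent intervals) and over the threshold. */
bool is_hot_trace(int t)
{
    for (int i = 0; i < N_INTERVALS; i++)
        if (!sampled[i][t])
            return false;
    return sample_count[t] > HOT_THRESHOLD;
}
```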

  13. Before the Experiment • How does HQEMU's performance compare with QEMU's? • How does multi-threaded HQEMU compare with single-threaded HQEMU? • How many memory operations are removed by trace formation and trace merging? • What is the overhead of trace formation?

  14. Experimental environment (1) • Host platform • 3.3 GHz quad-core Intel Core i7 processor • 12 GBytes main memory • 64-bit Gentoo Linux with kernel version 2.6.30 • Target (guest) platforms • Two different ISAs: ARM and x86 • LLVM version 2.8 • The SPEC2006 benchmark suite is tested. • The trace profiling threshold is set to 50 and the maximum length of a trace is 16 TBs. • The trace-merging parameter in the dynamic optimizer is set to 8.

  15. Experimental environment (2) • Four configurations are used to evaluate the effectiveness of HQEMU: • QEMU: QEMU version 0.13 with the fast TCG translator. • LLVM: the same modules as QEMU, except that the TCG translator is replaced by the LLVM translator. • HQEMU-S: single-threaded HQEMU, with the TCG and LLVM translators running on the same thread. • HQEMU-M: multi-threaded HQEMU, with the TCG and LLVM translators running on separate threads.

  16. Performance of HQEMU-M • For the SPEC2006 CINT benchmarks with the test input set: • HQEMU-M is faster than both the QEMU and the LLVM configurations. • The average slowdown over native is 7.7X for QEMU, 12.8X for LLVM, and 4.X for HQEMU-M. • For the SPEC2006 CFP benchmarks with the test input set: • HQEMU-M is faster than both the QEMU and the LLVM configurations. • The average slowdowns of QEMU and LLVM are both 9.95X, while HQEMU-M is only 3.3X. • From the four benchmarks that require a large amount of translation, we can see that the proposed multi-threaded HQEMU is beneficial compared with single-threaded HQEMU. [Figures: (A) CINT (test input), (B) CFP (test input)]

  17. Performance of HQEMU • With the ref input sets, programs spend much more time running in the optimized code caches. • The LLVM configuration outperforms QEMU: the optimization overhead is very much amortized. • HQEMU shows a significant improvement over both QEMU and LLVM. [Figures: (C) CINT (ref input), (D) CFP (ref input)]

  18. Results of Trace Generation and Merging • Trace formation and trace merging greatly reduce the number of memory operations.

  19. Overhead of Trace Generation • The translation time represents the time spent on trace generation by the LLVM translator thread. • As the table shows, most benchmarks spend less than 1% of their total time conducting trace translation.

  20. Conclusion • The paper presented a multi-threaded QEMU+LLVM hybrid approach (HQEMU) that achieves low translation overhead and good translated code quality for the target binary applications. • It also proposed a novel trace merging technique that removes redundant memory operations. • My Comment • This paper taught me other methods of improving QEMU's performance. • The experimental evaluation is very detailed.
