Efficient User-Level Networking in Java
Chi-Chao Chang
Dept. of Computer Science, Cornell University
(joint work with Thorsten von Eicken and the Safe Language Kernel group)
Goal
High-performance cluster computing with safe languages
• parallel and distributed applications
• communication support for operating systems
Use off-the-shelf technologies
• User-level network interfaces (UNIs)
  • direct, protected access to network devices
  • inexpensive clusters
  • U-Net (Cornell), Shrimp (Princeton), FM (UIUC), Hamlyn (HP)
  • Virtual Interface Architecture (VIA): emerging UNI standard
• Java
  • safe: “better C++”
  • “write once, run everywhere”
  • growing interest for high-performance applications (Java Grande)
Make the performance of UNIs available from Java
• Javia: a Java interface to VIA
Why a Java Interface to UNI?
[Layer diagram: Apps over RMI, RPC, Sockets, Active Messages, MPI, FM over UNI over Networking Devices; Java above the UNI boundary, C below]
Different approach for providing communication support for Java
Traditional “front-end” approach
• pick a favorite abstraction (sockets, RMI, MPI) and Java VM
• write a Java front-end to custom or existing native libraries
• good performance, re-uses proven code
• magic in native code, no common solution
Javia: exposes the UNI to Java
• minimizes the amount of unverified code
• isolates the bottlenecks in data transfer:
  1. automatic memory management
  2. object serialization
Contribution I
PROBLEM: lack of control over object lifetime/location due to GC
EFFECT: conventional techniques (data copying and buffer pinning) take a 10% to 40% hit in array throughput
SOLUTION: jbufs (explicit, safe buffer management in Java)
SUPPORT: modifications to the GC
RESULT: bandwidth within 1% of hardware, independent of transfer size
Contribution II
PROBLEM: linked, typed objects
EFFECT: serialization cost far exceeds send/recv overheads (~1000 cycles)
SOLUTION: jstreams (in-place object unmarshaling)
SUPPORT: object layout information
RESULT: serialization cost comparable to send/recv overheads; unmarshaling overhead independent of object size
Outline
Background
• UNI: the Virtual Interface Architecture
• Java
• Experimental setup
Javia Architecture
• Javia-I: native buffers (baseline)
• Javia-II: jbufs (buffer management) and jstreams (marshaling)
Summary and Conclusions
UNI in a Nutshell
[Diagram: traditional networking, where all traffic passes through the OS, vs. VIA, where applications reach the NI directly through virtual interfaces (V), with the OS involved only in setup]
Enabling technology for networks of workstations
• direct, protected access to networking devices
Traditional
• all communication goes through the OS
VIA
• connections between virtual interfaces (Vi)
• apps send/recv through a Vi, simple mux in the NI
• OS only involved in setting up Vis
Generic architecture
• can be implemented in hardware, software, or both
VI Structures
[Diagram: application memory holding buffers, descriptors, and send/recv queues; doorbells in the adapter; DMA between the adapter and memory]
Key data structures
• user buffers
• buffer descriptors <addr, len>: layout exposed to the user
• send/recv queues: accessed only through API calls
Structures are
• pinned to physical memory
• address translation in the adapter
Key points
• direct DMA access to buffers/descriptors in user space
• the application must allocate, use, re-use, and free all buffers/descriptors
• alloc&pin and unpin&free are expensive operations, but re-use is cheap
Java Storage Safety
  class Buffer {
      byte[] data;
      Buffer(int n) { data = new byte[n]; }
  }
No control over object placement
  Buffer buf = new Buffer(1024);
• cannot pin after allocation: the GC can move objects
No control over de-allocation
  buf = null;
• drop all references, then call or wait for the GC
Result: additional data copying in the communication path
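A minimal sketch of the extra copy this forces on the send path; the native staging buffer and post calls below are hypothetical placeholders for a pinned VIA buffer and a descriptor post, not part of any real API.
  // Hypothetical sketch: since the GC may move the byte[], the data must be
  // copied into a pinned native staging buffer before the NI can DMA it.
  class CopyingSend {
      static native long pinnedStagingBuffer();               // hypothetical: address of a pinned buffer
      static native void copyToNative(byte[] src, int len, long dst);
      static native void postSend(long addr, int len);        // hypothetical: post a send descriptor

      static void send(Buffer buf, int len) {
          copyToNative(buf.data, len, pinnedStagingBuffer());  // the copy forced by GC
          postSend(pinnedStagingBuffer(), len);
      }
  }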
Java Type Safety
[Diagram: object layout of a Buffer and its byte[] (vtable, lock, fields, array length, array body)]
Cannot forge a reference to a Java object
• e.g. cannot cast between byte arrays and objects
No control over object layout
• field ordering is up to the Java VM
• objects carry runtime metadata
• casting incurs runtime checks
  Object o = (Object) new Buffer(1024);  /* up cast: OK */
  Buffer buf = (Buffer) o;               /* down cast: runtime check */
• array bounds checks
  for (int i = 0; i < 1024; i++) buf.data[i] = (byte) i;
Result: expensive object marshaling
Marmot
Java system from Microsoft Research
• not a VM
• static compiler: bytecode (.class) to x86 (.asm)
• linker: asm files + runtime libraries -> executable (.exe)
• no dynamic loading of classes
• most Dragon-book opts, some OO and Java-specific opts
Advantages
• source code
• good performance
• two types of non-concurrent GC (copying, conservative)
• native interface “close enough” to JNI
Example: Cluster @ Cornell
Configuration
• 8 P-II 450 MHz, 128 MB RAM
• 8 Giganet GNN-1000 adapters (1.25 Gbps)
• one Giganet switch
• total cost: ~$30,000 (with university discount)
GNN-1000 adapter
• mux implemented in hardware
• device driver for VI setup
• VIA interface in a user-level library (Win32 DLL)
• no support for interrupt-driven reception
Baseline point-to-point performance
• 14us round-trip latency, 16us with the switch
• over 100 MBytes/s peak, 85 MBytes/s with the switch
Outline
Background
Javia Architecture
• Javia-I: native buffers (baseline)
• Javia-II: jbufs and jstreams
Summary and Conclusions
Javia: General Architecture
[Diagram: Java apps on top of the Javia classes (Marmot), over the Javia C library and the Giganet VIA library, down to the GNN-1000 adapter]
Java classes + C library
Javia-I
• baseline implementation
• array transfers only
• no modifications to Marmot
• native library: buffer management + wrapper calls to VIA
Javia-II
• array and object transfers
• buffer management in Java
• special support from Marmot
• native library: wrapper calls to the VI
Javia-I: Exploiting Native Buffers
[Diagram: Java-side send/recv ticket rings referencing byte arrays in the GC heap; C-side descriptors, send/recv queues, and buffers handed to VIA]
Basic asynchronous send/recv
• buffers and descriptors live in the native library
• Java send/recv ticket rings mirror the VI queues
• # of descriptors/buffers == # of tickets in the ring
Send critical path
• get a free ticket from the ring
• copy from the array into the buffer
• free the ticket
Recv critical path
• obtain the corresponding ticket from the ring
• copy the data from the buffer into the array
• free the ticket in the ring
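A sketch of the send critical path just described; the TicketRing helper and the native copy/post entry points are illustrative names, not the actual Javia-I internals.
  // Illustrative only: real Javia-I keeps the buffers and descriptors in the C library.
  class JaviaISendSketch {
      interface TicketRing { int getFreeTicket(); }           // assumed helper

      static native void copyIntoNativeBuffer(byte[] src, int len, int ticket);
      static native void postSendDescriptor(int ticket, int len);

      void send(byte[] data, int len, TicketRing ring) {
          int ticket = ring.getFreeTicket();                  // 1. get a free ticket from the ring
          copyIntoNativeBuffer(data, len, ticket);            // 2. copy the array into the pinned buffer
          postSendDescriptor(ticket, len);                    // 3. post; the ticket is freed on completion
      }
  }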
Javia-I: Variants
[Same diagram as the previous slide]
Two send variants
Sync send + copy
• goal: bypass the send ring
• one ticket
• array -> buffer copy
• waits until the send completes
Sync send + pin
• goal: bypass the send ring, avoid the copy
• pins the array on the fly
• waits until the send completes
• unpins after the send
One recv variant
No-post recv + alloc
• goal: bypass the recv ring
• allocates the array on the fly, copies the data
Javia-I: Performance
Basic costs
• VIA: pin + unpin = (10 + 10)us
• Marmot: native call = 0.28us, locks = 0.25us, array alloc = 0.75us
Latency (N = transfer size in bytes)
• raw:              16.5us + (25ns) * N
• pin(s):           38.0us + (38ns) * N
• copy(s):          21.5us + (42ns) * N
• copy(s)+alloc(r): 18.0us + (55ns) * N
Bandwidth: 75% to 85% of raw; switch-over between copy and pin around 6 KByte
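To make the copy-vs-pin trade-off concrete, plugging transfer sizes into the fitted lines above (illustrative arithmetic only):
• N = 1 KByte: copy(s) ≈ 21.5 + 0.042 * 1024 ≈ 65us, pin(s) ≈ 38.0 + 0.038 * 1024 ≈ 77us (copying wins)
• N = 8 KByte: copy(s) ≈ 21.5 + 0.042 * 8192 ≈ 366us, pin(s) ≈ 38.0 + 0.038 * 8192 ≈ 349us (pinning wins)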
jbufs
Lessons from Javia-I
• managing buffers in C introduces copying and/or pinning overheads
• can be implemented in any off-the-shelf JVM
Motivation
• eliminate excess per-byte costs in latency
• improve throughput
jbuf: exposes communication buffers to Java programmers
1. lifetime control: explicit allocation and de-allocation of jbufs
2. efficient access: direct access to a jbuf as primitive-typed arrays
3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap
jbufs: Lifetime Control
  public class jbuf {
      public static jbuf alloc(int bytes); /* allocates jbuf outside of GC heap */
      public void free() throws CannotFreeException; /* frees jbuf if it can */
  }
1. jbuf allocation does not result in a Java reference to it
• cannot directly access the jbuf through the wrapper object
2. a jbuf is not automatically freed when there are no Java references to it
• free has to be called explicitly
[Diagram: a C pointer refers to the jbuf, which lives outside the GC heap]
jbufs: Efficient Access
  public class jbuf { /* alloc and free omitted */
      public byte[] toByteArray() throws TypedException; /* hands out byte[] ref */
      public int[] toIntArray() throws TypedException;   /* hands out int[] ref */
      . . .
  }
3. (Memory safety) the jbuf must remain allocated as long as there are array references to it
• when can we ever free it?
4. (Type safety) a jbuf cannot have two differently typed references to it at any given time
• when can we ever re-use it (e.g. change its reference type)?
[Diagram: a Java byte[] reference from the GC heap points into the jbuf]
jbufs: Location Control
  public class jbuf { /* alloc, free, toArrays omitted */
      public void unRef(CallBack cb); /* app intends to free/re-use jbuf */
  }
Idea: use the GC to track references
unRef: the application claims it has no references into the jbuf
• the jbuf is added to the GC heap
• the GC verifies the claim and notifies the application through the callback
• the application can now free or re-use the jbuf
Required GC support: change the scope of the GC heap dynamically
[Diagram: after unRef, the jbuf is scanned as part of the GC heap; once the GC verifies that no references remain, the callBack fires]
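A minimal usage sketch of the jbuf lifecycle using the three API slices above; the CallBack interface's method name and the way the jbuf is passed back are assumptions, so treat them as placeholders.
  class JbufLifecycle implements CallBack {         // CallBack method name is assumed
      void use() throws Exception {
          jbuf buf = jbuf.alloc(8 * 1024);          // allocated outside the GC heap
          byte[] data = buf.toByteArray();          // direct, typed access (buf is now ref<byte>)
          data[0] = 42;                             // ... fill, post for send/recv, etc. ...
          buf.unRef(this);                          // claim that no references remain
      }

      public void done(jbuf buf) throws Exception { // invoked once the GC has verified the claim
          buf.free();                               // now safe to free ...
          // buf.toIntArray();                      // ... or to re-use with a different type
      }
  }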
jbufs: Runtime Checks
[State diagram: alloc puts a jbuf in the unref state; to<p>Array moves it to ref<p>; unRef moves it to to-be-unref<p>; the GC* transition returns it to unref; free is only allowed in unref]
Type safety: the ref and to-be-unref states are parameterized by the primitive type <p>
The GC* transition depends on the type of garbage collector
• non-copying: transition only if all refs to the array are dropped before the GC runs
• copying: transition occurs after every GC
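A sketch of how these runtime checks could be enforced inside the jbuf wrapper; the state field, the native helpers, and the exact transition logic are illustrative, not the Marmot implementation.
  class JbufStateSketch {
      private static final int UNREF = 0, REF = 1, TO_BE_UNREF = 2;
      private int state = UNREF;                   // alloc() leaves the jbuf in unref
      private Class refType;                       // the primitive array type handed out

      public synchronized byte[] toByteArray() throws TypedException {
          if (state == UNREF) { state = REF; refType = byte[].class; }
          else if (refType != byte[].class)
              throw new TypedException();          // type safety: one array type at a time
          return viewAsByteArray();                // hands out a reference into the jbuf
      }

      public synchronized void unRef(CallBack cb) {
          state = TO_BE_UNREF;                     // the GC* transition later returns it to UNREF
      }

      public synchronized void free() throws CannotFreeException {
          if (state != UNREF) throw new CannotFreeException();
          releaseMemory();
      }

      private native byte[] viewAsByteArray();     // hypothetical native helpers
      private native void releaseMemory();
  }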
Javia-II: Exploiting jbufs
[Diagram: Java-side send/recv ticket rings track jbuf state and array references; the C side keeps only descriptors and send/recv queues for VIA]
Send/recv with jbufs
• explicit pinning/unpinning of jbufs
• tickets point to pinned jbufs
• critical path: synchronized access to the rings, but no copies
Additional checks
• send posts are allowed only if the jbuf is in the ref<p> state
• recv posts are allowed only if the jbuf is in the unref or ref<p> state
• no outstanding send/recv posts in the to-be-unref<p> state
Javia-II: Performance
Basic costs
• allocation = 1.2us, to*Array = 0.8us, unRef = 2.5us
Latency (N = transfer size in bytes)
• raw:     16.5us + (25ns) * N
• jbufs:   20.5us + (25ns) * N
• pin(s):  38.0us + (38ns) * N
• copy(s): 21.5us + (42ns) * N
Bandwidth: within the margin of error (< 1%) of raw
Parallel Matrix Multiplication
Goal: validate the flexibility and performance of jbufs in Java apps
• matrices are represented as arrays of jbufs (each jbuf accessed as an array of doubles)
• A, B, C are distributed across processors (block columns)
• comm phase: each processor sends its local portion of A to its right neighbor and receives a new A from its left neighbor (sketched below)
• comp phase: Cloc = Cloc + Aloc * Bloc’
Preliminary results
• no fancy instruction scheduling in Marmot
• no fancy cache-conscious optimizations
• single processor, 128x128: only 15 Mflops
• cluster, 128x128: comm time is about 10% of total time
• the impact of jbufs will increase as the flop count increases
[Diagram: C += A * B with block columns of A, B, C spread across processors p0..p3]
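A sketch of one communication step of this algorithm on top of the Javia-II interface listed near the end of the talk; the neighbor connections, timeouts, and per-step registration are illustrative.
  class MatMulCommSketch {
      // Send my block column of A to the right neighbor, receive the next one from the left.
      // (In practice the jbufs would be registered once, outside the iteration loop.)
      void exchangeStep(Vi left, Vi right, ViJbuf aBlock, ViJbuf nextBlock) {
          ViJbufTicket r = nextBlock.register(left);   // pin + register the recv jbuf
          left.recvBufPost(r);                         // pre-post the receive
          ViJbufTicket s = aBlock.register(right);
          right.sendBufPost(s);                        // ship the local A block
          right.sendBufWait(1000000);                  // illustrative timeouts (usecs)
          left.recvBufWait(1000000);
          // comp phase: Cloc += Aloc * Bloc on the newly received block
      }
  }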
Active Messages
Goal: exercise jbuf management
Implemented a subset of AM-II over Javia + jbufs:
• maintains a pool of free recv jbufs
• when a message arrives, the jbuf is passed to the handler
• AM calls unRef on the jbuf after the handler invocation
• if the pool is empty, either alloc more jbufs or invoke the GC
• no copying in the critical path; deferred to GC time if needed
  class First extends AMHandler {
      private int first;
      void handler(AMJbuf buf, …) {
          int[] tmp = buf.toIntArray();
          first = tmp[0];
      }
  }
  class Enqueue extends AMHandler {
      private Queue q;
      void handler(AMJbuf buf, …) {
          int[] tmp = buf.toIntArray();
          q.enq(tmp);
      }
  }
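A sketch of the receive-side dispatch implied by the bullets above; the pool abstraction, the handler-table lookup, and invoking the handler with only the jbuf are assumptions, not the actual AM-II code.
  class AMDispatchSketch {
      interface JbufPool extends CallBack { boolean isEmpty(); void grow(); } // assumed pool abstraction

      private AMHandler[] handlerTable;
      private JbufPool pool;

      void dispatch(AMJbuf buf, int handlerId) {
          AMHandler h = handlerTable[handlerId];   // handler id lookup
          h.handler(buf);                          // handler reads the message in place
          buf.unRef(pool);                         // the pool's callback recycles or frees the jbuf
          if (pool.isEmpty())
              pool.grow();                         // alloc more recv jbufs (or invoke the GC)
      }
  }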
AM: Preliminary Numbers
Summary
• AM latency is about 15us higher than Javia’s
  • synchronized access to the buffer pool, endpoint header, flow-control checks, handler id lookup
  • room for improvement
• AM bandwidth is within 5% of peak for 16 KByte messages
jstreams
[Diagram: writeObject at the sender; a “typical” readObject copies and allocates on receive, while the “in-place” readObject works directly on the received buffer]
Goal: efficient transmission of arbitrary objects
• assumption: optimizing for homogeneous hosts and Java systems
Idea: “in-place” unmarshaling
• defer copying and allocation to GC time if needed
jstream
• read/write access to a jbuf through an object stream API
• no changes to the Javia-II architecture
jstream: Implementation
writeObject
• deep copy of the object, breadth-first
• deals with cyclic data structures
• replaces object metadata (e.g. vtable) with a 64-bit class descriptor
readObject
• depth-first traversal from the beginning of the stream
• swizzles pointers, type checking, array-bounds checking
• replaces class descriptors with metadata
Required support
• some object layout information (e.g. per-class pointer-tracking info)
Minimal changes to existing stub compilers (e.g. rmic)
• jstream implements the JDK 2.0 ObjectStream API
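A usage sketch of the jstream stream API implied above; how a jstream is obtained on each side and where the stream is cleared afterwards are assumptions based on the ObjectStream-style interface and the safety slide that follows.
  class JstreamSketch {
      // Sender side: marshal an object graph directly into the jbuf-backed stream.
      void marshal(jstream out, Object graph) throws Exception {
          out.writeObject(graph);         // deep copy, breadth-first; vtables -> class descriptors
          // ... post the underlying jbuf for sending via Javia-II ...
      }

      // Receiver side: unmarshal in place; no copying or allocation on the critical path.
      Object unmarshal(jstream in) throws Exception {
          Object graph = in.readObject(); // swizzle pointers, type and array-bounds checks
          // later: clearRead/unRef so the jstream can be re-used or freed
          return graph;
      }
  }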
jstreams: Safety
[State diagram, analogous to the jbuf state machine. States: unref, unref-with-object (after writeObject), ref (after readObject), and to-be-unref. Transitions: alloc, writeObject, readObject, clearWrite, clearRead, and the GC* transition back to unref; free is allowed from unref. Annotations: only send posts allowed after writeObject, only recv posts allowed before readObject, no send/recv posts allowed once object references have been handed out, and no outstanding send/recv posts in to-be-unref]
Status
Implementation status
• Javia-I and Javia-II are complete
• jbufs and jstreams are integrated with Marmot’s copying collector
Current work
• finish the implementation of AM-II
• full implementation of Java RMI
• integrate jbufs and jstreams with the conservative collector
• more investigation into deferred copying in higher-level protocols
Related Work
Fast Java RMI implementations
• Manta (Vrije U): compiler support for marshaling, Panda communication system
  • 34us null latency, 51 MBytes/s (85% of raw) on PII-200/Myrinet, JDK 1.4
• KaRMI (Karlsruhe): ground-up implementation
  • 117us null latency, Alpha 500, ParaStation, JDK 1.4
Other front-end approaches
• Java front-end for MPI (IBM), Java-to-PVM interface (GaTech)
Microsoft J-Direct
• “pinned” arrays defined using source-level annotations
• the JIT produces code to “redirect” array accesses: expensive
Communication system design in safe languages (e.g. ML)
• Fox project (CMU): TCP/IP layer in ML
• Ensemble (Cornell): Horus in ML, buffering strategies, data path optimizations
Summary
High-performance communication in Java: two problems
• buffer management in the presence of GC
• object marshaling
Javia: a Java interface to VIA
• uses native buffers as the baseline implementation
• jbufs: safe, explicit control over buffer placement and lifetime; eliminates bottlenecks in the critical path
• jstreams: a jbuf extension for fast, in-place unmarshaling of objects
Concluding remarks
• building blocks for Java apps and communication software
• should be an integral part of a high-performance Java system
Javia-I: Interface
  package cornell.slk.javia;

  public class ViByteArrayTicket {
      private byte[] data;
      private int len, off, tag;
      /* public methods to set/get fields */
  }

  public class Vi { /* connection to a remote Vi */
      public void sendPost(ViByteArrayTicket t);       /* async send */
      public ViByteArrayTicket sendWait(int timeout);
      public void recvPost(ViByteArrayTicket t);       /* async recv */
      public ViByteArrayTicket recvWait(int timeout);

      public void send(byte[] b, int len, int off, int tag); /* sync send */
      public byte[] recv(int timeout);                 /* post-less recv */
  }
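A usage sketch of the synchronous send and the post-less receive from the interface above; connection setup is omitted, and the tag and timeout values are illustrative.
  class JaviaIUsage {
      void pingPong(Vi vi) {
          byte[] msg = new byte[1024];
          vi.send(msg, msg.length, 0, 7);          // sync send: copies into a native buffer (tag = 7)
          byte[] reply = vi.recv(1000);            // post-less recv: allocates the array on the fly
          // the async path (recvPost/recvWait with ViByteArrayTickets) avoids this allocation
      }
  }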
Javia-II: Interface
  package cornell.slk.javia;

  public class ViJbuf extends jbuf {
      public ViJbufTicket register(Vi vi);      /* register + pin jbuf */
      public void deregister(ViJbufTicket t);   /* deregister + unpin jbuf */
  }

  public class ViJbufTicket {
      private ViJbuf buf;
      private int len, off, tag;
  }

  public class Vi {
      public void sendBufPost(ViJbufTicket t);      /* async send */
      public ViJbufTicket sendBufWait(int usecs);
      public void recvBufPost(ViJbufTicket t);      /* async recv */
      public ViJbufTicket recvBufWait(int usecs);
  }
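A sketch that registers a jbuf once and re-uses it across many sends, since pinning is expensive while re-use is cheap; the loop, the payload, and the timeout are illustrative.
  class JaviaIIUsage {
      void sendMany(Vi vi, ViJbuf buf, int rounds) throws Exception {
          ViJbufTicket t = buf.register(vi);     // pin + register once, off the critical path
          int[] payload = buf.toIntArray();      // direct access; buf is now in the ref<int> state
          for (int i = 0; i < rounds; i++) {
              payload[0] = i;                    // fill in place: no copy on the critical path
              vi.sendBufPost(t);                 // allowed only while buf is in a ref<p> state
              vi.sendBufWait(1000);              // usecs
          }
          buf.deregister(t);                     // unpin when done; unRef/free can follow
      }
  }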
jbufs: Implementation
[Diagram: jbuf layout: baseAddr, vtable, lock, native descriptor pointer, array length, array body]
alloc/free: Win32 VirtualAlloc, VirtualFree
to{Byte,Int,...}Array: no allocation or copying
unRef:
• modification to the stop-and-copy Cheney-scan GC
• unRef adds the jbuf to a list
• after the GC, traverse the list to invoke the callbacks, then delete the list
[Diagram: before GC, unRef’d jbufs are treated as part of from-space; after GC, jbufs still reachable from the stack and globals remain ref’d, while the unreferenced ones become unref’d]
State-of-the-Art Matrix Multiplication
[Performance figure courtesy of IBM Research]