Reliable Datagram IPC

Presentation Transcript


  1. Reliable Datagram IPC
     Richard.Frank@oracle.com, Zach.Brown@oracle.com
     Oracle Corporation @ OpenIB 8/05

  2. Vision Statement
     • A low-overhead, low-latency, high-bandwidth, ultra-reliable, supportable IPC protocol and transport system
     • Matches Oracle's existing IPC models for RAC communication
     • Optimized for transfers from 200 bytes to 8 MB

  3. Goal and Objective
     • Support for a reliable datagram IPC in OpenIB
     • Based on the socket API
     • Minimal code change / testing for Oracle
     • Failover between HCAs and between ports within an HCA
     • Runs over IB, Ethernet, iWARP, etc.
     • Two-month validation / certification for RAC

  4. Today's Situation
     • TCP streams are used for connections to the database by external clients, app servers, etc.
     • Reliable datagrams are used for internal database IPC (RAC)
     • Thousands of processes
     • 200k+ associations (not connections)
     • 64 nodes

  5. Parallel Query
     • SQL is decomposed into an execution plan / tree
     • The plan is a set of pipelined producer / consumer stages
     • Stages are based on the data accessed (number of rows, physical organization, logical operations such as hash or index)
     • Each execution stage has producer / consumer slave groups (source, sink)
     • Each group can contain many slaves (e.g. 32)

  6. Parallel Query
     • The operation tree / plan is not aware of slave locality: communication could be local via shared memory or remote via IPC.
     • N:N, 1:N, and N:1 communication between groups (16 source, 16 sink, 16 nodes = ~65k associations for N:N communication in 1 query)
     • Group organization / communication model may change at each stage of the plan.
     • Message sizes up to 64 KB are supported; 16 KB is typical today
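
The ~65k figure above can be reproduced with a short calculation. This is a sketch that assumes the slide means 16 source and 16 sink slaves on each of 16 nodes, with every source slave associated with every sink slave; the function name is illustrative and not from the presentation.

```python
def n_to_n_associations(sources_per_node: int, sinks_per_node: int, nodes: int) -> int:
    """Count source-to-sink pairs when every source talks to every sink.

    Assumption: the slave counts are per node, as the slide's arithmetic implies.
    """
    total_sources = sources_per_node * nodes
    total_sinks = sinks_per_node * nodes
    return total_sources * total_sinks


print(n_to_n_associations(16, 16, 16))  # 65536
```

With 256 sources and 256 sinks, the product is 65,536, which matches the slide's "~65k" for a single query.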

  7. Oracle Buffer Cache
     • Distributed cache
     • Client / server
     • Client sends a request for a buffer
     • Server sends back the buffer (DDP)
     • Each node has a pool of servers
     • Any client can ask any server

  8. Oracle Buffer Cache
     • Buffer size is 8k by default but can range from 2k up to 32k
     • Associations per server are (n - 1) * C
     • C = clients per node, n = nodes
     • (16 - 1) * 800 = 12,000 (12k) associations per server process
     • 8 servers per node = 96k associations
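
The association counts above follow directly from the formula. A minimal sketch, using the slide's own numbers (16 nodes, 800 clients per node, 8 servers per node); the function name is illustrative only.

```python
def associations_per_server(nodes: int, clients_per_node: int) -> int:
    """(n - 1) * C: each server is associated with every client on every other node."""
    return (nodes - 1) * clients_per_node


per_server = associations_per_server(nodes=16, clients_per_node=800)
per_node = 8 * per_server  # 8 server processes per node

print(per_server)  # 12000
print(per_node)    # 96000
```

Note that the parentheses matter: read literally, "16-1*800" would evaluate to -784, not 12,000.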

  9. Oracle IPC Usage
     • New database functionality will significantly increase IPC utilization
     • IPC rates approach database I/O rates
     • Very large messages: 8 MB and up

  10. Reliable Datagram IPC
     • UDP: Oracle adds reliable delivery via a user-mode wire protocol engine.
     • Two sockets per process, thousands of messages on the wire
     • Slow send times (windowing, acks, retransmissions)
     • Holds together but degrades under CPU load
     • Well tested!
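
To illustrate the kind of work a user-mode reliability layer has to do on top of an unreliable datagram service, here is a toy simulation. It is not Oracle's wire protocol: it uses stop-and-wait (a window of one) instead of real windowing, and `LossyChannel` and `send_reliably` are hypothetical names invented for this sketch.

```python
class LossyChannel:
    """Deterministic stand-in for an unreliable datagram path: drops every Nth datagram."""

    def __init__(self, drop_every: int):
        self.drop_every = drop_every
        self.sent = 0

    def deliver(self, datagram):
        self.sent += 1
        return None if self.sent % self.drop_every == 0 else datagram


def send_reliably(messages, channel, max_tries=5):
    """Stop-and-wait: retransmit each message until its ack arrives.

    Returns (payloads handed to the receiving application, data transmissions).
    """
    delivered = []
    next_expected = 0      # receiver state: next sequence number to accept
    transmissions = 0
    for seq, payload in enumerate(messages):
        for _ in range(max_tries):
            transmissions += 1
            arrived = channel.deliver((seq, payload))
            if arrived is None:
                continue   # data lost: the sender times out and retransmits
            if arrived[0] == next_expected:
                delivered.append(arrived[1])  # first copy goes to the application
                next_expected += 1
            # Duplicates are acked again but not delivered a second time.
            if channel.deliver(("ack", seq)) is not None:
                break      # ack received: move on to the next message
        else:
            raise RuntimeError(f"message {seq} not acknowledged after {max_tries} tries")
    return delivered, transmissions


delivered, transmissions = send_reliably(["a", "b", "c"], LossyChannel(drop_every=4))
print(delivered, transmissions)
```

Every lost datagram or lost ack costs an extra round trip in user mode, which is the overhead the slide attributes to windowing, acks, and retransmissions.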

  11. Available Options
     • uDAPL / itAPI: not supporting
     • IPoIB: high CPU overhead, same unreliable delivery (UDP)
     • SDP: connection oriented
     • We want to take our existing, well-tested UDP module, shut off most of it, and run it over an O/S-provided RD IPC

  12. Recommendation
     • RD: Reliable Datagram IPC over IB
     • 50% less CPU than IPoIB, UDP
     • Half the latency of UDP (no user-mode acks)
     • Within 5% of uDAPL throughput using Oracle
     • Minimal code change: reduced our UDP module by 70% (removed windowing, acks, retransmissions, etc.)
     • RDS driver is ~1k lines of C (b-copy)
     • Decoupled from user-mode CPU loading
     • Passes all Oracle regression tests in under 2 weeks
     • Supports failover across and within HCAs.

  13. RDS IPC over IB
     • Uses IB reliable connection (RC)
     • Node-to-node level connection
     • User-mode sockets share a small pool of node-to-node RCs.
     • Connections are formed either dynamically at send time or at system startup
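
The multiplexing idea above, many sockets sharing one connection per remote node, can be sketched as follows. This is an illustration only, not the RDS driver's design: the class names, the `connect` callable, and the use of a plain list as a "connection" are all assumptions made for the sketch.

```python
class NodeConnectionPool:
    """Keeps at most one connection per remote node, opened on first use."""

    def __init__(self, connect):
        self._connect = connect      # hypothetical callable: node -> connection
        self._by_node = {}

    def get(self, node):
        if node not in self._by_node:
            self._by_node[node] = self._connect(node)  # formed dynamically at send
        return self._by_node[node]

    def __len__(self):
        return len(self._by_node)


class DatagramSocket:
    """A per-process endpoint; many of these share the pool's connections."""

    def __init__(self, port, pool):
        self.port = port
        self.pool = pool

    def sendto(self, payload, node, dest_port):
        conn = self.pool.get(node)
        # The connection is node-to-node, so each message carries its ports.
        conn.append((self.port, dest_port, payload))


opened = []

def fake_connect(node):
    """Stand-in for opening an RC: returns a list that records what was sent."""
    opened.append(node)
    return []

pool = NodeConnectionPool(fake_connect)
a, b = DatagramSocket(4000, pool), DatagramSocket(4001, pool)
a.sendto(b"req", "node2", 5000)
b.sendto(b"req", "node2", 5001)   # reuses the connection opened for socket a
b.sendto(b"req", "node3", 5000)
print(len(pool), opened)
```

The number of connections grows with the number of nodes rather than with the number of socket-to-socket associations, which is why a small pool can serve thousands of processes.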

  14. Oracle Block Service Rate

  15. Service Response Time

  16. CPU Cost Per Block Served

  17. (No text on this slide.)

  18. RDS IPC
     • Implemented in 3 phases:
     • B-copy
     • Zero copy (z-copy)
     • Z-copy directed sends / recvs (ES-API additions)

  19. B-Copy
     • Sends are copied and completed immediately
     • Sends are not guaranteed to have reached the remote application.
     • If a send fails asynchronously after submission, the application must detect the loss
     • A send can only fail if there is no path to the destination (the remote port / process is gone, or the path has failed and there is no alternate path)

  20. B-Copy Send/Recv
     • Recvs are buffered in the kernel / queued to the remote socket.
     • If the total buffers queued to the remote socket exceed a threshold, the sending socket is back-pressured (EWOULDBLOCK) when sending to the blocked remote socket.
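
The back-pressure rule above can be modelled in a few lines. This is a sketch under stated assumptions: the threshold is expressed in bytes, a send is refused when it would push the queue over the threshold, and `ReceiveQueue` and `try_send` are hypothetical names, not part of any RDS API.

```python
import errno
from collections import deque


class ReceiveQueue:
    """Models the kernel-side queue of buffers waiting on a receiving socket."""

    def __init__(self, threshold_bytes: int):
        self.threshold_bytes = threshold_bytes
        self.queued = deque()
        self.queued_bytes = 0

    def enqueue(self, payload: bytes) -> None:
        # Assumption: refuse the send if it would take the queue over the threshold.
        if self.queued_bytes + len(payload) > self.threshold_bytes:
            raise BlockingIOError(errno.EWOULDBLOCK, "receive queue over threshold")
        self.queued.append(payload)
        self.queued_bytes += len(payload)

    def recv(self) -> bytes:
        payload = self.queued.popleft()
        self.queued_bytes -= len(payload)
        return payload


def try_send(queue: ReceiveQueue, payload: bytes) -> bool:
    """Return True if queued, False if the sender was back-pressured."""
    try:
        queue.enqueue(payload)
        return True
    except BlockingIOError:
        return False


queue = ReceiveQueue(threshold_bytes=16)
first = try_send(queue, b"x" * 10)    # fits under the threshold
second = try_send(queue, b"x" * 10)   # would exceed it: sender is pushed back
queue.recv()                          # receiver drains one buffer
third = try_send(queue, b"x" * 10)    # room again
print(first, second, third)
```

Only sends aimed at the congested receiver are refused; the same sender can still send to other sockets whose queues are below the threshold.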

  21. Z-Copy Send/Recv
     • Dynamic registration of buffers above a size threshold
     • The application is not required to do explicit registration.
     • Oracle IPC buffers are in shared memory and private heap, so pre-registering them is impractical
     • The O/S must manage any caching of registrations
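
One way an O/S layer could cache registrations is a small least-recently-used table keyed by buffer address and length. The presentation does not say which caching policy is used, so LRU is an assumption here, and `RegistrationCache` plus the `register` / `deregister` callables are hypothetical stand-ins for the real memory-registration calls.

```python
from collections import OrderedDict


class RegistrationCache:
    """LRU cache of memory registrations, keyed by (address, length)."""

    def __init__(self, capacity, register, deregister):
        self._capacity = capacity
        self._register = register      # hypothetical: (addr, length) -> handle
        self._deregister = deregister  # hypothetical: handle -> None
        self._entries = OrderedDict()

    def lookup(self, addr: int, length: int):
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as most recently used
            return self._entries[key]
        handle = self._register(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            _, evicted = self._entries.popitem(last=False)  # drop the oldest entry
            self._deregister(evicted)
        return handle


registered, released = [], []

def fake_register(addr, length):
    registered.append((addr, length))
    return f"handle-{len(registered)}"

cache = RegistrationCache(capacity=2, register=fake_register,
                          deregister=released.append)
cache.lookup(0x1000, 65536)
cache.lookup(0x1000, 65536)   # hit: no second registration
cache.lookup(0x2000, 65536)
cache.lookup(0x3000, 65536)   # evicts the entry for 0x1000
print(len(registered), released)
```

Because the cache sits below the socket interface, the application keeps sending from ordinary shared-memory or heap buffers and never registers anything explicitly.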

  22. Directed Sends / Recvs (DDP)
     • A key for the target buffer is returned from the RDS interface (get memory handle)
     • The key is sent by the application to the remote side
     • The remote side initiates a directed send, passing in the key of the remote target buffer
     • Uses RDMA write to move the data
     • ES-API additions: working on the definition
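
The key exchange above can be walked through with an in-process model. The ES-API additions were still being defined at the time of the talk, so nothing below is a real interface: `RegisteredMemory`, `get_memory_handle`, and `rdma_write` are hypothetical names, and a `bytearray` stands in for the target buffer.

```python
class RegisteredMemory:
    """Stand-in for the node that owns the target buffer."""

    def __init__(self):
        self._buffers = {}
        self._next_key = 1

    def get_memory_handle(self, buf: bytearray) -> int:
        """Expose a buffer for directed sends and return its key."""
        key = self._next_key
        self._next_key += 1
        self._buffers[key] = buf
        return key

    def rdma_write(self, key: int, data: bytes, offset: int = 0) -> None:
        """Place data directly into the buffer named by key (models an RDMA write)."""
        buf = self._buffers[key]
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write falls outside the registered buffer")
        buf[offset:offset + len(data)] = data


# Client side: expose an 8-byte target buffer and obtain its key.
client_memory = RegisteredMemory()
target = bytearray(8)
key = client_memory.get_memory_handle(target)

# The key travels to the server inside an ordinary request message.
request = {"block": 42, "key": key}

# Server side: answer with a directed send into the client's buffer.
client_memory.rdma_write(request["key"], b"BLOCK42!")
print(target)
```

The reply lands in the buffer the client named, with no intermediate copy on the receiving side, which is the point of the directed (DDP) path for buffer-cache transfers.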

  23. Next Steps?
     • RDS b-copy is supported in Oracle 10.2.0.2.
     • RDS from SilverStorm has been ported to OpenIB Gen2
     • Preparing to test OpenIB + RDS at Oracle

  24. Next Steps
     • Work on the z-copy / directed send (DDP) specification now (ES-API).
     • RD IPC docs from Oracle: Richard.Frank@oracle.com
     • RDS/eth: Zach.Brown@oracle.com
