
Applications of Computing in Industry: What is Low Latency All About?


Presentation Transcript


  1. Applications of Computing in Industry: What is Low Latency All About? eFX – January 2014

  2. Divyakant Bengani • Undergrad degree in Management and IT from Manchester • Vice President at CS, responsible for eFX Core Technologies • Working in the banking industry since 2003, and at CS for ~3 years

  3. eFX - What do we do? • Cash FX Only • Spot, Forwards and Swaps • Continuous Publication of Prices • Streaming Executable Rates • Response to Request for Quotes • Acceptance and Booking of Trades

  4. Key Statistics • ~200 Currency Pairs (e.g. EURUSD / GBPJPY etc.) • 3 billion prices broadcast a day • 60,000 trades a day • >200 client connections

  5. Technologies Used • Java • C# for UIs • GWT for Web UIs • Oracle Coherence • Oracle DB • Derby DB • Azul Zing JVM • Low-Latency FIX Engine

  6. Protocols • Socket Connections • Asynchronous JMS • Java RMI • HTTP (JSON, Hessian)

  7. Payloads • Google Protobuf • Fixed-Length Byte Arrays • FIX - Industry Standard • JMS Map Messages • Java Serialization

  8. eFX - Overall Architecture

  9. Service Discovery • Zero Conf • Dynamically add and remove services • Applications do not need to know about each other - just pick up what’s advertised
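One way to picture the zero-conf approach (the deck does not name its actual implementation) is with the open-source JmDNS library; the service type, names and port below are illustrative only.

```java
import javax.jmdns.JmDNS;
import javax.jmdns.ServiceEvent;
import javax.jmdns.ServiceInfo;
import javax.jmdns.ServiceListener;
import java.net.InetAddress;

public class ZeroConfSketch {
    public static void main(String[] args) throws Exception {
        JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());

        // Advertise a (hypothetical) pricing service; nothing else needs static configuration.
        jmdns.registerService(
                ServiceInfo.create("_pricing._tcp.local.", "efx-pricer-1", 9090, "spot pricing stream"));

        // Pick up whatever is advertised; peers appear and disappear dynamically.
        jmdns.addServiceListener("_pricing._tcp.local.", new ServiceListener() {
            @Override public void serviceAdded(ServiceEvent e)    { jmdns.requestServiceInfo(e.getType(), e.getName()); }
            @Override public void serviceResolved(ServiceEvent e) { System.out.println("service up: " + e.getInfo()); }
            @Override public void serviceRemoved(ServiceEvent e)  { System.out.println("service down: " + e.getName()); }
        });
    }
}
```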

  10. Automated Testing

  11. Code Quality Analysis

  12. Continuous Integration

  13. How to Achieve Low Latency

  14. Daniel Nolan-Neylan • Graduated from UCL in 2004 • Started working at Credit Suisse in 2006 • First, networking for 4 years • Now, Application Developer in FX IT • Different projects: • Distributed caching system for static data • Simplified credit checking library • Pricing and trading gateway (now team lead)

  15. Wait a second! • Reminder: • 1 second is: • 1,000 milliseconds • 1,000,000 microseconds • 1,000,000,000 nanoseconds

  16. Latency Numbers Every Programmer Should Know • L1 cache reference 0.5 ns • Branch mispredict 5 ns • L2 cache reference 7 ns 14x L1 cache • Mutex lock/unlock 25 ns • Main memory reference 100 ns 20x L2 cache, 200x L1 cache • Compress 1K bytes with Zippy 3,000 ns • Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms • Read 4K randomly from SSD* 150,000 ns 0.15 ms • Read 1 MB sequentially from memory 250,000 ns 0.25 ms • Round trip within same datacenter 500,000 ns 0.5 ms • Read 1 MB sequentially from SSD* 1,000,000 ns 1 ms 4x memory • Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip • Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20x SSD • Send packet CA->Netherlands->CA 150,000,000 ns 150 ms By Jeff Dean: http://research.google.com/people/jeff/

  17. FX Trading – Latency Numbers • 250ms – A human responding to price update • 30ms – Bank accepting trade • 10ms – Credit checking client • 9ms – JVM Garbage Collecting • 5ms – Persisting a trade to disk • 2ms – JMS networking round-trip • 1ms – Raw socket networking round-trip • 0.5ms – Max wire-to-wire pricing latency • 0.05ms – Min pricing latency • 0.005ms – Writing price to FIX engine

  18. Optimization Quotes • Michael A. Jackson: “The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.” • Rob Pike: “Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is.”

  19. Where to Optimize? Use a Profiler

  20. Measuring Milliseconds and Nanoseconds in Java • Measure time taken for operations and log: • System.currentTimeMillis() • Good for taking a time/date that can be compared against other systems. Accuracy depends on OS, but 1ms accuracy achievable on modern Unix-based OS (Linux) • Bad if more precise measurements are required • System.nanoTime() • Good for sub-millisecond measurements • Bad if comparable time with other systems required • Realistically, need to use both
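A minimal sketch of using the two timers together, as the slide suggests; the loop is just a stand-in workload.

```java
public class TimingExample {
    public static void main(String[] args) {
        // Wall-clock stamp: comparable across systems, but only around 1ms resolution.
        long wallClockMillis = System.currentTimeMillis();

        // Monotonic timer: sub-microsecond resolution, but only meaningful as a
        // difference within this JVM, never comparable to another machine's clock.
        long start = System.nanoTime();
        double sink = 0;
        for (int i = 0; i < 1_000_000; i++) sink += Math.sqrt(i);   // stand-in workload
        long elapsedNanos = System.nanoTime() - start;

        System.out.printf("at %d ms wall clock, work took %d us (sink=%.0f)%n",
                wallClockMillis, elapsedNanos / 1_000, sink);
    }
}
```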

  21. Quote Journalling – log latency of every price

  22. Our Soak Test Harness

  23. …and the graphs it can produce

  24. Removing Millisecond Delays • Identify the longest-running tasks • Usually I/O delays • Disk • Database activity • Synchronous logging • Writing files • Network • Calling network services • Remote services far away (e.g. across the Atlantic, ~50ms)

  25. Removing Millisecond Delays (2) • Analyze whether delays can be eliminated • Disk • Database activity -> Use a cache • Synchronous logging -> Use asynchronous logging • Writing files -> Use buffers and write asynchronously • Network • Calling network services -> Cache where possible • Remote services far away -> Co-locate with the caller
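A sketch of the "buffer and write asynchronously" point: the latency-sensitive thread only enqueues, and a background thread owns all disk I/O. Class name, queue size and timeout are illustrative, not the production code.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class AsyncFileWriter implements AutoCloseable {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(65_536);
    private final Thread writer;
    private volatile boolean running = true;

    public AsyncFileWriter(String path) throws IOException {
        BufferedWriter out = new BufferedWriter(new FileWriter(path, true));
        writer = new Thread(() -> {
            try (BufferedWriter w = out) {
                while (running || !queue.isEmpty()) {
                    String line = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (line != null) { w.write(line); w.newLine(); }
                }
            } catch (IOException | InterruptedException ignored) { }
        }, "async-file-writer");
        writer.start();
    }

    // Called on the hot path: no disk access, just an enqueue.
    public void write(String line) {
        queue.offer(line);   // drop on overflow rather than block the hot path
    }

    @Override public void close() throws InterruptedException {
        running = false;
        writer.join();
    }
}
```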

  26. FX Trading – RFQ Example • E.g. Incoming request for a price, target response time is 10ms • Need to: • Validate request parameters • Internally subscribe for prices • Obtain a globally unique transaction ID • Perform a credit check • How to get all this done in just 10ms?

  27. FX Trading – RFQ Example (2) • Credit check • Old one took 30-200ms • New one takes 5-10ms • Using Caching and Co-location • Parallelize all validation • Pre-cache prices • by opening up price streams in advance of being required
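One way to sketch the parallelised validation from the two RFQ slides is with Java's CompletableFuture; the collaborator interfaces below are hypothetical stand-ins for the real credit-check, transaction-ID and price services.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RfqHandlerSketch {
    interface CreditChecker { boolean check(String client, double notional); }
    interface IdService     { String nextTransactionId(); }
    interface PriceCache    { double price(String ccyPair); }   // assumed fed by a pre-opened price stream

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final CreditChecker credit;
    private final IdService ids;
    private final PriceCache prices;

    RfqHandlerSketch(CreditChecker credit, IdService ids, PriceCache prices) {
        this.credit = credit; this.ids = ids; this.prices = prices;
    }

    // The independent steps run concurrently, so the response time is bounded by the
    // slowest step rather than the sum of all of them.
    public CompletableFuture<String> quote(String client, String ccyPair, double notional) {
        CompletableFuture<Boolean> creditOk =
                CompletableFuture.supplyAsync(() -> credit.check(client, notional), pool);
        CompletableFuture<String> txnId =
                CompletableFuture.supplyAsync(ids::nextTransactionId, pool);
        CompletableFuture<Double> price =
                CompletableFuture.supplyAsync(() -> prices.price(ccyPair), pool);

        return creditOk.thenCombine(txnId, (ok, id) -> ok ? id : null)
                       .thenCombine(price, (id, px) ->
                               id == null ? "REJECTED" : id + " " + ccyPair + " @ " + px);
    }
}
```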

  28. Don’t Optimize Too Soon • Remember: • Only optimize what you need to optimize • Remove longest delays first • No point removing micros if you still have delays of millis or worse • Always measure your operations carefully • Determine the minimum, maximum, mean, standard deviation, and high percentiles (99%, 99.9%, etc.) • Watch for jitter and solve separately
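A small sketch of the kind of summary the slide asks for, using a simple nearest-rank percentile over recorded nanosecond samples (a production system would more likely use a histogram):

```java
import java.util.Arrays;

public final class LatencySummary {
    public static void print(long[] samplesNanos) {
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        double mean = Arrays.stream(sorted).average().orElse(0);
        System.out.printf("min=%dns mean=%.0fns p99=%dns p99.9=%dns max=%dns%n",
                sorted[0], mean,
                percentile(sorted, 99.0), percentile(sorted, 99.9),
                sorted[sorted.length - 1]);
    }

    // Nearest-rank percentile on an already-sorted array.
    private static long percentile(long[] sorted, double pct) {
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }
}
```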

  29. Removing Microsecond Delays • Intra-process delays • Unbalanced / slow queues • Slow algorithms • Expensive loops repeated many times • Poor use of object creation / memory allocation • Contended memory controlled by locks • Wasted effort calculating unwanted results

  30. FX Trading – Pricing Example • Achieving wire-to-wire latencies of 50μs • Google protobuf parsers replaced with low-garbage versions • each GC stops the JVM for 9,000μs (i.e. 9ms) • LMAX Disruptors used instead of queues • Busy-spin consumer threads / single-writer principle • “PriceBigDecimal” class to replace Java BigDecimal class • BigDecimal slow to instantiate and impossible to mutate • No synchronous logging or network calls • Pre-cache static data before starting price stream

  31. Disruptor or Blocking Queues?
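As an illustration of swapping a blocking queue for an LMAX Disruptor, the sketch below uses the open-source com.lmax.disruptor API with a single producer and the busy-spin wait strategy mentioned on the previous slide; the price event and handler are invented for the example.

```java
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.util.concurrent.ThreadFactory;

public class DisruptorSketch {
    // Mutable event pre-allocated in the ring buffer, so publishing creates no garbage.
    static final class PriceEvent {
        String ccyPair;
        double bid, ask;
    }

    public static void main(String[] args) {
        ThreadFactory threads = Thread::new;

        // Single producer + busy-spin wait strategy trades CPU for the lowest handoff latency.
        Disruptor<PriceEvent> disruptor = new Disruptor<>(
                PriceEvent::new, 1024, threads,
                ProducerType.SINGLE, new BusySpinWaitStrategy());

        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println(event.ccyPair + " " + event.bid + "/" + event.ask));

        RingBuffer<PriceEvent> ring = disruptor.start();

        // Publishing mutates a pre-allocated slot instead of allocating a new message object.
        ring.publishEvent((event, seq) -> {
            event.ccyPair = "EURUSD";
            event.bid = 1.3652;
            event.ask = 1.3654;
        });

        disruptor.shutdown();   // waits for outstanding events, then stops the consumer
    }
}
```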

  32. Java BigDecimal or a Low-Latency Replacement?
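The deck does not show PriceBigDecimal's internals, but the sketch below is one plausible shape for such a replacement: a mutable fixed-point holder that is reused on every tick instead of allocating a new BigDecimal.

```java
// Illustrative stand-in for a low-latency price type; not the deck's actual PriceBigDecimal.
public final class MutablePrice {
    private static final long SCALE = 100_000;   // 5 decimal places, a tenth of a pip for most pairs
    private long unscaled;                       // price * SCALE, held as a long

    // Reuse the same instance on every update: no allocation, nothing for the GC to collect.
    public void set(double price) {
        unscaled = Math.round(price * SCALE);
    }

    // Adjust in units of 1/SCALE, e.g. to apply a spread, without any intermediate objects.
    public void add(long units) {
        unscaled += units;
    }

    public double toDouble() {
        return (double) unscaled / SCALE;
    }

    @Override public String toString() {
        return String.format("%.5f", toDouble());
    }
}
```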

  33. Removing Nanoseconds? • Use specialist hardware (such as FPGA) • Understand low-level CPU interconnectivity with memory, and how CPU caching works (including cache-lines) • http://mechanical-sympathy.blogspot.com • eFX – No need to pursue this level of performance at the moment
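The deck stops short of this level, but as a taste of the cache-line awareness it points to: two counters written by different threads will "false-share" if they sit on the same 64-byte cache line, and padding (or the JDK's @Contended annotation) keeps them apart. The classes below are illustrative only.

```java
public class FalseSharingSketch {

    // Two hot counters packed together: writes from different threads keep invalidating
    // each other's cache line even though the data is logically independent.
    static class Packed {
        volatile long counterA;
        volatile long counterB;
    }

    // 56 bytes of padding pushes counterB onto a different 64-byte line. Note the JVM may
    // reorder fields, which is one reason @Contended (Java 8: sun.misc.Contended,
    // Java 9+: jdk.internal.vm.annotation.Contended, enabled with -XX:-RestrictContended) exists.
    static class Padded {
        volatile long counterA;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long counterB;
    }
}
```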

  34. Latency vs Throughput • Latency - time taken (typically mean, percentile or worst case) to complete a task • Throughput – the number of tasks completed in a given time period (typically, per second) • Throughput is 1/latency (per pipeline)

  35. Increasing Throughput • Identify delays • Throughput constrained by latency • Blocking I/O calls delay unprocessed messages • Data bursts • What’s the peak throughput required? • What’s the gap typically between bursts?

  36. Techniques to Increase Throughput • Batching • Sometimes latent calls are unavoidable • Batching can strip the overhead of making a call per transaction • The cost of batching is the delay incurred waiting for new items to add to the batch • It is more difficult to accurately measure the delay per item when multiple items are in a batch
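A minimal sketch of the batching idea: block for the first item, then drain whatever else has accumulated into one remote call. The queue type, timeout and sendBatch call are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TradeBatcher {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    public void submit(String trade) {
        pending.add(trade);
    }

    // One remote call per batch instead of one per trade: the per-call overhead
    // (e.g. a 50ms WAN round trip) is paid once for up to maxBatch items.
    public void runLoop(int maxBatch) throws InterruptedException {
        List<String> batch = new ArrayList<>(maxBatch);
        while (!Thread.currentThread().isInterrupted()) {
            batch.clear();
            String first = pending.poll(100, TimeUnit.MILLISECONDS);   // wait for the first item
            if (first == null) continue;
            batch.add(first);
            pending.drainTo(batch, maxBatch - 1);                      // grab whatever else is waiting
            sendBatch(batch);                                          // hypothetical remote call
        }
    }

    private void sendBatch(List<String> batch) {
        System.out.println("sending " + batch.size() + " trades in one call");
    }
}
```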

  37. FX Trading – Batching Example • Legacy global server in London • Regional trade acceptance components • Latency between New York and London - 50ms • Per thread: 1/0.05 = 20 trades per second max • How to increase? • More threads • Add batching per thread • Now, with batch size of 5, 100 trades per second per thread.

  38. Techniques to Increase Throughput (2) • Use Asynchronous callbacks • Synchronous calls: • boolean doCall() • Wait for response • Can be delayed for varying time • Asynchronous calls: • void doCall(Callback callback) • Do not wait and keep processing more events • Can additionally overlay timeouts to improve resilience
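A sketch of the two call shapes from the slide; the Callback method names, the CompletableFuture plumbing and the 50ms sleep standing in for the network round trip are all illustrative.

```java
import java.util.concurrent.CompletableFuture;

public class CallbackSketch {

    interface Callback {
        void onSuccess(boolean accepted);
        void onFailure(Throwable error);
    }

    // Synchronous shape: the calling thread is parked for the whole round trip.
    static boolean doCall() throws InterruptedException {
        Thread.sleep(50);                       // stand-in for a 50ms network round trip
        return true;
    }

    // Asynchronous shape: returns immediately and the result arrives on the callback,
    // so the caller keeps processing further events while this one is in flight.
    // A timeout can additionally be overlaid here for resilience.
    static void doCall(Callback callback) {
        CompletableFuture
                .supplyAsync(() -> {
                    try { Thread.sleep(50); } catch (InterruptedException e) { throw new RuntimeException(e); }
                    return true;
                })
                .whenComplete((ok, err) -> {
                    if (err != null) callback.onFailure(err);
                    else callback.onSuccess(ok);
                });
    }

    public static void main(String[] args) throws Exception {
        doCall(new Callback() {
            @Override public void onSuccess(boolean accepted) { System.out.println("accepted=" + accepted); }
            @Override public void onFailure(Throwable error)  { System.out.println("failed: " + error); }
        });
        System.out.println("caller not blocked; free to process the next event");
        Thread.sleep(200);                      // keep the JVM alive long enough to see the callback
    }
}
```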

  39. FX Trading – Asynchronous Callbacks • Submission of trade to price service for verification – was originally synchronous • The call blocks for 50ms – max 20 trades per second per thread • After converting to asynchronous callbacks, the only delay is writing packets to the network buffer (μs), so there is effectively no blocking; the maximum trade rate per thread becomes very high!

  40. Q & A eFX – January 2014
