Understanding VoIP

Understanding VoIP Dr. Jonathan Rosenberg Chief Technology Strategist Skype

What is this course about? • Getting “under the hood” and understanding how VoIP works • An exploration of the protocols and technologies behind VoIP • Conveying an understanding of the various problems that need to be solved for VoIP to work

What this course is not about • A general introduction to telephony • A detailed cookbook or deployment guide to VoIP • A product survey of VoIP and IP telephony products • In particular, Cisco or Skype products are not discussed except in passing

Ground Rules • Ask Questions ANY TIME! • I will be bored if this is a one way conversation • No question is too stupid • Laughing or mocking anyones questions is unacceptable • Please ask off-the-wall or exploratory questions – there is a lot that is not in here!

Agenda • Breaking up the problem • Voice and Video coding • Voice and Video Transport • Quality of Service • Signaling • Security • NAT Traversal

Non-Agenda • Programming APIs • Emergency Services, Lawful Intercept • Numbering, Routing, Naming (ENUM, TRIP) • PSTN Interworking • Billing, Provisioning, OAM • Conferencing, IVR, Applications

Breaking Up the Problem Application Server IP Directories Databases LDAP, ENUM SIP Accounting Billing Signaling Servers Presence Servers Media Servers RADIUS DIAMETER OAM IP Network SIP, H.323,MGCP,H.248 SIMPLE,XMPP Endpoint Endpoint RTP

Voice Coding

Voice Endpoint Model No Speech + NonlinearProcessing Speech Encoding Packetizer Speech - DTMF/ Tone Detection Silence Detection Echo Hybrid Canceller Loss Admin Speech Decoding Unpacker DTMF/ Tone Generation Comfort Noise Generation 2-wire interface

Codecs • Waveform codecs: • Directly encode speech in an efficient way by exploiting temporal and/or spectral characteristics • Attempt to reproduce input signal’s waveform by minimizing error between input and coded signals • Source codecs / vocoders: • Estimate and efficiently encode a parametric representation of speech

CELP • Minimizes perceptually weighted error • similar to waveform coders • Short-term predictor is LP (vocal tract) filter • Excitation is obtained from codebook and long-term pitch predictor • Closed-loop search is MIPS intensive

Codec Comparison Listen at: http://www.voiceage.com/listeningroom.php

Echo Cancellation • ERL: Echo Return Loss (dB) • ERLE: Echo Return Loss Enhancement • Double-talk • Convergence time Analog ERLE Non-Linear Processor + Reflection - Echo Path Estimation ERL Packet Network 2-4-wire Hybrid Echo Canceller Digital This echo canceller cancels ‘local’ echoes from the hybrid reflection

Echo Canceller Specifics • The voice echo path is like an electrical circuit • If a ‘break’ (cancellation) is made anywhere in the ‘circuit’, you will eliminate the echo • The easiest place to make the break is with a canceller ‘looking into’ the local analog/digital telephony network, NOT the packet network (which has much longer and variable delays) • The echo canceller at the other end of the call eliminates the echoes that YOU hear, and vice versa • Echo canceller coverage (e.g. 32 ms) is the maximum length of echo impulse response that can be cancelled from the local analog/digital network (the packet network delay does not matter) • The non-linear processor is used to ‘clean-up’ any residual echo left over from the canceller

Voice Activity Detection Speech Magnitude (dB) Speech Detected SpeechDetected Hang-Over Hang-Over Typically fixed at 200 ms Sentence 1 Sentence 2 Signal-to- Noise Threshold Noise Floor time Front-end Speech Clipping Front-end Speech Clipping

Comfort Noise Generation • Silence isn’t golden…it’s annoying • When speech stops…what do you play to the listener? • Simple techniques: • Play white/pink noise • Replay last receiver packet over and over • Fancier technique: • Transmitter measures local “noise environment” • Transmitter sends special “comfort noise” packet as last packet before silence • Receiver generates noise based CN packet.

Voice Quality:Mean Opinion Scores Source Channel Simulation Impairment Codec ‘X’ 1 2 3 4 5 “Nowadays, a chicken leg is a rare dish” 1 2 3 4 5 MOS of 4.0 = Toll Quality

Clear Channel MOS’s 5 4.1 Mean Opinion Score 3.8 3.9 3.9 4 3.4 3 2 1 G.711 (64 kbit/s PCM) G.726 (32 kbit/s ADPCM) G.723.1 (6.4 kbit/s MP- MLQ) G.729 (8 kbit/s CS-ACELP) IS-54 (8 kbit/s NA Dig Cellular)

MOS Under Varying Conditions

Video Coding

Key Terms

Key Concept: Macroblocks Rectangular block inan image which isa basic unit ofcompression. Typically16x16 pixels.

Key Concept: Inter-Frame Prediction Encode Predict information in the current frame by looking at previous frames,possibly taking into account motion.

Key Concept: Discrete Cosine Transform (DCT) Increasing horizontal frequencies A technique for representing amacroblock by its component frequencies. Discarding the higherfrequencies throws away the finerdetails without losing the core image. Increasing vertical frequencies

Video Encoder Block Diagram

Key Codec Comparisons

Voice and Video Transport: RTP

Real Time Transport Protocol RFC 3550 product of avt working group 1996 proposed standard – RFC1889 2004 full standard What does it do e2e transport of real time media optimized for multicast provides sequencing, timing, framing, loss detection provides feedback on reception quality What does it do (cont) provides information on group members provides data to correlate audio and video and other media Works with any codec need payload format for each codec Flexible RTP: What is it?

Doesn’t guarantee quality of service doesn’t reserve network resources doesn’t guarantee no loss or bounded delay can work with QoS protocols (RSVP) Doesn’t provide signaling other protocols must be used to set up RTP (like SIP or H.323) Not a specific protocol type Does not run directly ontop of IP Runs ontop of UDP No fixed port number RTP: What isn’t it?

RTP Stack RTP RTCP UDP IP

Big Picture: RTP, SDP and SIP C=IN IP4 123.1.2.3 m=audio RTP/AVP 1122 0 1 m=video RTP/AVP 1130 98 a=rtpmap:98 h263 Proxy Proxy SIP w/ SDP IP Network End User End User RTP

Data aka RTP very confusing Usually on an even UDP port (NATs change this – later) Provides sequencing timing framing content labeling User identification Control = Real Time Control Protocol (RTCP) Same address as data, but one higher port usually Provides reception quality sender statistics participant information (multicast) synchronization information RTP Components: Data + Control

Originator breaks stream into packets (segmentation) application layer framing (ALF)!!! Packets sent; network may lose, delay, reorder packets Must, at receiver: reorder recover resegment rescynchronize clock synchronization! Real Time Data Transport RTP Source RTP Packets RTP Sink

Source Digitize Audio from mike Silence Suppression Echo cancellation Compress Audio G.711: 64 kbps G.729: 8 kbps G.723.1: 5.3/6.3 kbps Packetize Audio in RTP Send Sink Receive packets Un-packetize decompress comfort noise generation reorder recover loss jitter buffer A/D conversion to speakers Transport System

Packets delayed differently Must play them out periodically Packets may arrive after designated playout time -> loss Insert extra delay to compensate May need to adapt this amount Jitter Buffer pkts time

RTP Packet Header +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |V=2|P|X| CC |M| PT | sequence number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | synchronization source (SSRC) identifier | +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ | contributing source (CSRC) identifiers | | .... | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Version: 2 P: indicates padding (for encryption) X: extension bit CSRC count: for mixers (later) M: Marker Bit: indicates framing audio codecs: first packet in talkspurt video: last packet in frame Payload Type: indicates encoding in RTP packet allows changes per-packet Useful for: adaptation DTMF codec silence codecs SN: defines ordering of packets Timestamp: when packet was generated SSRC: identifier CSRC: list of mixed users RTP Header Fields

Tick units are dependent on codec For speech: 125 microseconds (standard 8 khz sampling rate) For video: 90 KhZ For audio: 44.1 KhZ (CD rate) Gaps in TS, but not in SN mean silence Initial value random for security Video Timestamp represents time at beginning of frame Many packets may have same timestamp Speech Time per packet may vary Depends on packetization: 20-100ms typical RTP Timestamp

Each codec needs a way to be encapsulated in RTP RFC3550 defines mechanisms for many common codecs G.711, G.729, G.723.1, G.722, etc. Some simple video More complex codecs have their own payload format documents MPEG H.263 and H.261 Payload format defines How to break frame into packets extra fields needed below main RTP header Payload Formats

DTMF and Tones RFC 2833 Special codecs for encoding touch tones (DTMF) and other signals Can send either the waveform (frequency, amplitude) Or the actual signal (#, 8, 0) Compressed RTP RFC 2508 For dialup links Don’t send header, just send index Far side uses index to retrieve header, and then increments certain fields Advanced Topics

Quality of Service

Quality of Service The problem we are trying to solve is to give “better” service to some at the expense of giving worse service to to others — QoS fantasies to the contrary, it’s a zero sum game - Van Jacobson In other words, QoS is Managed Unfairness

Toll Quality Satellite Zone CB Zone Fax Relay, Broadcast Private Network VoFR & VoIP Technology Early I-Phone Technologyy Improving I-Phone means: • Lower PC Delay • Lower Network Latency • Tighten Network Jitter Quality of Service • So, what’s the problem?

Delay Budget • Device sample capture • Encode delay (algorithmic delay + processing delay) • Packetization/framing • Move to output queue/queueing delay • Access (up) link transmission • Backbone network transmission • Access (down) link transmission • Input queue to application • Jitter buffer • Decode processing delay • Device playout delay “The Network”

Some Techniques to Improve “Network QoS” • RED — Random Early Drop (or “Detect”) • WFQ — Weighed Fair Queuing • Intserv/RSVP — ReSerVation Protocol • IP Precedence  DiffServ • CRTP — Compressed Realtime Transport Protocol • MCML — Multi-Class Multi-Link PPP

Objectives Keep average queue size low – good for voice Fairness – bigger streams punished more Avoid synchronization Only works with loss responsive transport protocols Algorithm – probabilistic dropping of packets Random Early Detect (RED)this is Basic Hygiene! 1 Drop Probability Min Max Queue Size

Poll: Will RED Help Voice? Yes No • Voice not loss responsive • Mixing voice and data in same queue bad • Voice queues usually not congested

Each flow “sees” a dedicated amount of bandwidth Bj A packet arriving at time t is transmitted at time t+size/Bj B3 B2 Weighted Fair Queueing B B1 B = B1 + B2 + B3

WFQ is unrealizable because Variable packet sizes Causality Example: Link speed 100Kbps Flow 1: 10Kbps Flow 2: 90Kbps Whats the Problem?? 8.8ms Theory 128ms Actual 1500 100 1500 100

Many PhDs written with approximate and implementable algorithms Algorithms differ in their delay bound How much worse than perfect WFQ is this? Delay bounds a function of bandwidth, number of queues, other params Approximations of WFQ Algorithms SCFQ: Self-Clocked Fair Queueing WF2Q: Worst-Case Fair Weighted Fair Queueing FBFQ: Frame-Based Fair Queueing PGPS: DRR:

Understanding VoIP