ECE 1747 Parallel Programming Paper Presentation Student: Qingan Andy Zhang Parallelization and Performance of Interactive Multiplayer Game Servers Ahmed Abdelkhalek and Angelos Bilas. In IPDPS 2004.
Online Game Intro • Definition • Attraction • Requirement • Category • History • Example A Famous First-Person Shooting Game
Online Game Intro Early MUD Game A Famous RTS Game: StarCraft
Massively Multiplayer Online Game • Over the Internet • Supports hundreds or thousands of players playing at the same time • Simulates the real world • Client-server model • Centralized server • Example: World of Warcraft (WoW)
Quake Intro • First-person shooter • Developed in the 1990s by id Software • Supports multiplayer gaming • Has 3D graphics • Much like Counter-Strike
Quake Client-Server Structure • Server (reached by clients over the Internet): 1. Interpreting client actions 2. Maintaining consistency 3. Communications among clients • Clients: 1. Accept user input through keyboard, mouse or joystick 2. Display output (graphics-related processing), detailed model of 3D world
Sequential Server Structure • Simulate world: time in frames • Server frame: Start frame -> Select -> (1) Update World -> (2) Receive & Process Requests (Rx) -> (3) Form & Send Replies (Tx) -> End frame • Client loop: Send/Receive, Update State, Render • Maintain consistent state • Reply ASAP • Maintain frame rate ~30 fps, i.e. frame duration ~30 ms
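The three-stage frame above can be sketched as a plain loop. This is a minimal illustration, not Quake's server code: the names `run_frames`, `demo_requests` and the dictionary-based `world` are hypothetical, and network I/O is replaced by a function that returns a list of requests.

```python
import time

FRAME_SECONDS = 1 / 30  # ~30 fps target taken from the slide


def run_frames(num_frames, receive_requests, world):
    """Run a fixed number of sequential server frames and return the replies."""
    all_replies = []
    for frame in range(num_frames):
        start = time.perf_counter()
        # Stage 1: update world state (here only a frame counter).
        world["frame"] = frame
        # Stage 2: receive and process client requests (Rx).
        for client_id, move in receive_requests(frame):
            x, y = world["players"].get(client_id, (0, 0))
            world["players"][client_id] = (x + move[0], y + move[1])
        # Stage 3: form and send replies (Tx), one per known client.
        replies = {cid: (frame, pos) for cid, pos in world["players"].items()}
        all_replies.append(replies)
        # Sleep out the rest of the frame to hold the frame rate.
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, FRAME_SECONDS - elapsed))
    return all_replies


def demo_requests(frame):
    # Hypothetical input: two clients, each moving one unit per frame.
    return [(1, (1, 0)), (2, (0, 1))]


world = {"frame": 0, "players": {}}
replies = run_frames(3, demo_requests, world)
```

When stage 2 or 3 takes longer than the frame budget, the sleep drops to zero and replies start arriving late, which is the saturation behaviour the later slides measure.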
Sequential Server Performance • Server machine: Linux, Intel P3 1.4 GHz, 100 Mbit Ethernet • Saturates at ~128 players • Scalability limited by server processing • Total bandwidth at server + clients not an issue • How can we improve on this?
Quake Sequential Server Summary • Quake: Fine grain (Instantaneous control of player actions) • High degree of interaction among players • Server network bandwidth and memory requirement are not issues • Server CPU processing is the bottleneck for scaling to large number of players
Objectives • To scale to a large number of interactive players with cost-effective servers • To parallelize the server application • To analyze performance of the parallel server application
How to Parallelize? • Analyze the dependency • What’s a correct order? • Task parallelism • Requests / Replies / Update • How many clients per thread
Methodology • Parallelize the server code using the shared-memory model (Pthreads) • Workload distribution: static task assignment • Synchronization: region-based locking in the request processing phase so that user requests are processed correctly; parent-node locking is also used • Separate server execution into phases with global synchronization; deal with each phase separately • Thread Multiplex
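Static task assignment can be illustrated with a few lines. The slides do not say how clients are mapped to threads, so the round-robin rule and the name `assign_clients` below are assumptions made only for this sketch.

```python
def assign_clients(client_ids, num_threads):
    """Statically map each client to one thread (round-robin, an assumption)."""
    buckets = [[] for _ in range(num_threads)]
    for i, cid in enumerate(sorted(client_ids)):
        buckets[i % num_threads].append(cid)
    return buckets


# Ten hypothetical clients spread over four server threads.
assignment = assign_clients(range(10), 4)
```

The mapping is fixed once and never rebalanced, so a thread whose clients issue expensive requests finishes its phase later than the others.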
Methodology • Start from the sequential server • Select • (1) Update World: small CPU time (5%), leave as sequential • (2) Receive & Process Requests (Rx) and (3) Form & Send Replies (Tx): target these only!
Parallel Server Architecture • Multiple server threads • Per frame: Select -> (1) Update World -> (2) Receive & Process Requests (Rx) -> (3) Form & Send Replies (Tx)
Parallel Server Architecture • Conservative approach: global barriers between phases • Per frame: Select -> (1) Update World -> (2) Receive & Process Requests (Rx) -> (3) Form & Send Replies (Tx)
Parallel Server Architecture: intra-phase synchronization? • Select • (1) Update World: sequential phase, not an issue • (2) Receive & Process Requests (Rx): read-write phase, big issue, locks needed • (3) Form & Send Replies (Tx): read-only phase, not an issue
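The phase structure with global barriers can be sketched as follows. The paper uses Pthreads; this illustration uses Python's `threading.Barrier` instead, and a single `world_lock` stands in for the region locks discussed on later slides. All names are hypothetical.

```python
import threading

NUM_THREADS = 4
NUM_FRAMES = 3
barrier = threading.Barrier(NUM_THREADS)
world_lock = threading.Lock()  # stand-in for region locks
world = {"tick": 0, "moves": 0}
replies = [[] for _ in range(NUM_THREADS)]


def server_thread(tid):
    for _ in range(NUM_FRAMES):
        # Phase 1: world update stays sequential (thread 0 only).
        if tid == 0:
            world["tick"] += 1
        barrier.wait()
        # Phase 2: request processing is read-write, so it takes a lock.
        with world_lock:
            world["moves"] += 1
        barrier.wait()
        # Phase 3: reply formation only reads the world, so no lock.
        replies[tid].append((world["tick"], world["moves"]))
        barrier.wait()


threads = [threading.Thread(target=server_thread, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread waits at each barrier for the slowest one, which is where the wait time reported in the results comes from.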
Shared Data Synchronization • Move execution: • Bounding box • Short-range player motion, e.g. moving around • Long-range player interaction/action, e.g. throwing a grenade or shooting • Read and write world state (regions and objects) • Atomic access to world state • Find all related objects • Simulate • Key data structure: areanode tree • Figure: top view of game world showing long-range and short-range bounding boxes; objects can be a tree, a wall, a building or your opponent
Analysis of Quake Data • Structure of game world: irregular, recursive data structure • Objects in world have associated actions + properties • Map 3D data structure: Binary Space Partition • Server has a secondary data structure: areanode tree
Region (leaf) Locking • Tree leaves represent distinct regions in the world • Figure: areanode tree and top view of world with overlapping regions (the squares are bounding-box regions; objects may exist in them) • Overlapping bounding boxes need to lock the corresponding leaves: true contention! • But does leaf locking ensure atomic access to ALL objects within a region? Answer: no
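Leaf locking can be sketched with a small 2D tree. This is an illustration under assumptions, not the Quake data structure: the class `AreaNode`, the alternating split axis, and the fixed lock-acquisition order (added here to avoid deadlock between threads) are all hypothetical choices.

```python
import threading


class AreaNode:
    def __init__(self, lo, hi, depth, axis=0):
        self.lo, self.hi = lo, hi  # (x, y) corners of the region
        self.lock = threading.Lock()
        self.children = []
        if depth > 0:
            mid = (lo[axis] + hi[axis]) / 2
            left_hi = tuple(mid if a == axis else hi[a] for a in (0, 1))
            right_lo = tuple(mid if a == axis else lo[a] for a in (0, 1))
            self.children = [
                AreaNode(lo, left_hi, depth - 1, 1 - axis),
                AreaNode(right_lo, hi, depth - 1, 1 - axis),
            ]

    def overlapping_leaves(self, box_lo, box_hi):
        """Return the leaves whose region overlaps the bounding box."""
        if any(box_hi[a] <= self.lo[a] or box_lo[a] >= self.hi[a] for a in (0, 1)):
            return []
        if not self.children:
            return [self]
        return [leaf for c in self.children
                for leaf in c.overlapping_leaves(box_lo, box_hi)]


def lock_region(root, box_lo, box_hi):
    """Lock every leaf the bounding box touches, always in the same order."""
    leaves = root.overlapping_leaves(box_lo, box_hi)
    leaves.sort(key=lambda n: (n.lo, n.hi))
    for leaf in leaves:
        leaf.lock.acquire()
    return leaves


def unlock_region(leaves):
    for leaf in reversed(leaves):
        leaf.lock.release()


root = AreaNode((0, 0), (8, 8), depth=2)   # four leaves of size 4 x 4
held = lock_region(root, (3, 1), (5, 2))   # box straddles the split at x = 4
regions = [(n.lo, n.hi) for n in held]
unlock_region(held)
```

Two players whose bounding boxes touch the same leaf serialize on that leaf's lock, which is the true contention named on the slide.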
Object (parent node) Locking • Objects are linked to tree nodes depending on position • If an object crosses a leaf boundary, it is linked to a parent node • Otherwise it is linked to a leaf node • So when an object crosses a boundary, we have to lock parent nodes • Thus threads in non-overlapping regions may contend for access to parent nodes: false sharing • Figure: top view of world with object lists, region leaves, objects crossing a boundary, and Player A's and Player B's bounding boxes in non-overlapping regions
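The linking rule and the resulting false sharing can be sketched as below. The helper names `link_object` and `nodes_to_visit` are hypothetical, and the sketch assumes that a move inspects the object list of every node on the path from the root down to the leaves its bounding box overlaps; the slides do not spell out the exact lock set.

```python
class AreaNode:
    def __init__(self, lo, hi, depth, axis=0, parent=None):
        self.lo, self.hi, self.axis, self.parent = lo, hi, axis, parent
        self.objects = []  # objects linked to this node
        self.split = None
        self.children = []
        if depth > 0:
            self.split = (lo[axis] + hi[axis]) / 2
            left_hi = tuple(self.split if a == axis else hi[a] for a in (0, 1))
            right_lo = tuple(self.split if a == axis else lo[a] for a in (0, 1))
            self.children = [
                AreaNode(lo, left_hi, depth - 1, 1 - axis, self),
                AreaNode(right_lo, hi, depth - 1, 1 - axis, self),
            ]


def link_object(root, name, box_lo, box_hi):
    """Link an object to the deepest node that fully contains its box."""
    node = root
    while node.children:
        if box_hi[node.axis] <= node.split:
            node = node.children[0]
        elif box_lo[node.axis] >= node.split:
            node = node.children[1]
        else:
            break  # the box crosses the split, so it stays at this parent
    node.objects.append(name)
    return node


def nodes_to_visit(node, box_lo, box_hi):
    """Nodes whose object lists a move with this bounding box inspects."""
    visited = [node]
    if node.children:
        if box_lo[node.axis] < node.split:
            visited += nodes_to_visit(node.children[0], box_lo, box_hi)
        if box_hi[node.axis] > node.split:
            visited += nodes_to_visit(node.children[1], box_lo, box_hi)
    return visited


root = AreaNode((0, 0), (8, 8), depth=2)
wall_node = link_object(root, "wall", (3, 1), (5, 2))   # crosses x = 4
tree_node = link_object(root, "tree", (1, 1), (2, 2))   # fits in one leaf
player_a = nodes_to_visit(root, (0.5, 0.5), (1.5, 1.5))
player_b = nodes_to_visit(root, (6, 6), (7, 7))
shared = [n for n in player_a if n in player_b]
```

Players A and B sit in opposite corners of the map, yet both visit the root, so locking the root's object list makes them contend even though their regions do not overlap.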
Lock Optimization: Reduce Contention • Game-specific knowledge needed • Lock only necessary objects • Estimate the object's moving range instead of conservative locking • Minimize the area of the bounding box • Directional bounding-box locking instead of locking the whole map • Cut down the dimensions of the bounding box
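The difference between a conservative and a directional bounding box can be shown with a small calculation. The functions below are hypothetical; they assume a 2D top view, a square player of half-width `half`, and a displacement given in map units per frame.

```python
def conservative_box(pos, max_step, half):
    """Box covering a move of up to max_step in any direction."""
    r = max_step + half
    return (pos[0] - r, pos[1] - r), (pos[0] + r, pos[1] + r)


def directional_box(pos, step, half):
    """Box covering only the path the player actually sweeps this frame."""
    end = (pos[0] + step[0], pos[1] + step[1])
    lo = (min(pos[0], end[0]) - half, min(pos[1], end[1]) - half)
    hi = (max(pos[0], end[0]) + half, max(pos[1], end[1]) + half)
    return lo, hi


def area(box):
    (x0, y0), (x1, y1) = box
    return (x1 - x0) * (y1 - y0)


# A player at (10, 10) moving 4 units along +x in one frame.
wide = conservative_box((10, 10), max_step=4, half=1)
narrow = directional_box((10, 10), step=(4, 0), half=1)
```

In this example the conservative box covers 100 square units and the directional one 12, so the directional box overlaps fewer leaves and takes fewer locks.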
Result • What should be measured? • Benchmarking [ISPASS01] • Server: Intel P3 1.4 GHz quad SMP w/ Hyperthreading, 2 GB RAM, Linux • Clients: multiple automatic players per client • Network: 100 Mbit LAN between server and clients • Measure server execution time breakdown, response rate and response time
Result: parallel server performance, unoptimized locking • At saturation: 1. Response rate decreases 2. Replies arrive late at clients • Max 160 players with 8 threads!
Result: parallel server performance, optimized locking • At saturation: 1. Response rate decreases 2. Replies arrive late at clients • Max 176 players with 8 threads!
Execution Time Breakdown • Receive, reply and request processing scale (good!) • Bottlenecks: wait time ~20-40%, lock time ~10-20%
Bottleneck Source • Lock time during request processing • Wait time at global barriers • Phases shown: Rx, Tx • Bottlenecks: wait time ~20-40%, due to imbalances; lock time ~10-20%, due to contention
Conclusion • Number of players limited by server CPU processing • Server load increases super-linearly with the number of players • Scaling this application is challenging; the difficulty comes from the irregularity of the application • Supports about 25% more players on a 4-way Intel P3 SMP with Hyperthreading • Remaining bottlenecks • Wait time (up to 40%), due to imbalances • Lock synchronization time (up to 20%), due to contention
Future Work • Self-learning ability of the server • Dynamic task assignment (game-specific)
Future Work • Why do replies consume 40% of the time? • Further parallelize this part • Game map factor? • The experimental map is for 16-32 players • Can a bigger map give better performance? • Fraction of time spent on locking • Global state buffer vs. areanode tree: which is the major one?
Discussion about Imbalance: Dealing with Waiting Time! • Does a task queue help? • Improve their static task assignment • How many threads should we create? • Can we remove barriers? • What is a correct event order? • Is it equivalent to the network latency effect? • Does pipelining help?
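The task-queue question can be made concrete with a short sketch. This is not something the paper implements; it only illustrates the alternative raised on the slide. The function `process_requests` and the `(client_id, cost)` request format are hypothetical, and `sum(range(cost))` stands in for real request processing.

```python
import queue
import threading


def process_requests(requests, num_threads):
    """Workers pull requests from a shared queue instead of a fixed share."""
    tasks = queue.Queue()
    for request in requests:
        tasks.put(request)
    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            try:
                client_id, cost = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left in this phase
            result = (client_id, sum(range(cost)))  # stand-in for real work
            with done_lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)


# Eight hypothetical clients whose requests have uneven cost.
requests = [(cid, 1000 * (cid + 1)) for cid in range(8)]
results = process_requests(requests, num_threads=4)
```

A thread that draws cheap requests simply takes more of them, which evens out finishing times within a phase at the price of contention on the shared queue.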
Discussion about Contention: Dealing with Lock Time! • The influence of the depth of the areanode tree • A leaf may represent a large region • What is a proper range for a leaf? • Change the data structure • Push all objects into leaves to reduce false sharing • More game knowledge • Lazy locking: lock only when we really use it • More accurate movement prediction
Relevant Work • Much work is being done in parallel computer architecture • Much work is being done on client side, such as improving 3D graphics realism and rendering speed
Acknowledgement • Prof. Cristiana Amza for valuable material • Ms. Jin Chen for helpful discussion
THE END Thanks for your attention! Any questions?