Fast Triangle Reordering for Vertex Locality and Reduced Overdraw

Presentation Transcript


    1. Fast Triangle Reordering for Vertex Locality and Reduced Overdraw. Pedro V. Sander, Hong Kong University of Science and Technology; Diego Nehab, Princeton University; Joshua Barczak, 3D Application Research Group, AMD.

    2. Triangle order optimization. Objective: Reorder triangles to render meshes faster. Rendering 3D triangle meshes, like this good old fandisk, is commonplace in the majority of 3D graphics applications. This paper presents a new method to reorder the triangles in a mesh so that the mesh can be rendered faster.

    3. Motivation: Rendering time dependency. To motivate the problem, we note that the time it takes to render a model is proportional to the number of vertices and the number of pixels that are processed by the GPU. So, if our algorithm can reduce these quantities, we can reduce rendering time in both vertex-bound and pixel-bound scenes. In these graphs, we render the exact same model with different triangle orders. Notice that the rendering speed heavily depends on the amount of processing, which in turn is determined by these triangle orders. One important note is that reducing both vertex and pixel processing is even more important with the advent of unified architectures. So, even if the application is so-called vertex bound or vertex heavy, the amount of pixel processing still affects rendering time, and vice versa. So it is always important to reduce both.

    4. Goal. Render faster. Two key hardware optimizations: vertex caching (vertex processing) and early-Z culling (pixel processing). Reorder triangles efficiently at run-time, with no changes in the rendering loop; improves rendering speed transparently. To be more specific, our algorithm takes advantage of two key hardware optimizations: vertex caching, to reduce vertex processing, and early-Z culling, to reduce pixel processing. They are both available on most boards on the market. We seek to reorder the triangles very efficiently at run-time and in a way that is transparent to the application. So, the rendering loop of the application is unchanged. We simply shuffle the values in the index buffer.

    5. Algorithm overview. Part I: Vertex cache optimization. Part II: Overdraw minimization. The algorithm consists of two major parts: a vertex cache optimization algorithm to reduce vertex processing, and an overdraw minimization algorithm to reduce pixel processing. We will now address each of those steps in more detail.

    6. Part I: The post-transform vertex cache. Transforming vertices can be costly. Hardware optimization: cache transformed vertices (FIFO). Software strategy: reorder triangles for vertex locality. Average Cache Miss Ratio (ACMR) = # transformed vertices / # triangles; varies within [0.5, 3]. Many of you are probably familiar with the vertex cache, so I will go through this relatively quickly. When rendering meshes, the price paid for vertex transformation may be high. This can be due to a large number of vertices, an expensive vertex shader, or both. To mitigate that, the post-transform vertex cache stores the post-transform data for a small number of vertices with a FIFO replacement policy. Therefore, if a vertex has been transformed recently and is in the cache, it doesn't need to be transformed again. So it is important to reorder the triangles for vertex locality, and I will show some examples in a bit. In order to measure the effectiveness of the vertex cache for a given triangle order, we can calculate the average cache miss ratio, or ACMR. This is the total number of transformed vertices over the total number of rendered triangles. It varies between 0.5 and 3 in manifold meshes.
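ACMR for any candidate triangle order can be measured offline by simulating the cache. Below is a minimal sketch, not the authors' code, assuming a flat index buffer (three indices per triangle) and a FIFO cache of k entries:

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Count how many vertices miss a simulated k-entry FIFO post-transform
// cache, and divide by the triangle count to get the ACMR.
double acmr(const std::vector<int>& indices, std::size_t k) {
    std::deque<int> fifo;        // simulated post-transform cache
    std::size_t transformed = 0;
    for (int v : indices) {
        if (std::find(fifo.begin(), fifo.end(), v) == fifo.end()) {
            ++transformed;       // miss: the vertex must be (re)transformed
            fifo.push_back(v);   // FIFO: hits do NOT refresh an entry
            if (fifo.size() > k) fifo.pop_front();
        }
    }
    return double(transformed) / double(indices.size() / 3);
}
```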

    7. ACMR minimization. NP-complete problem [GAREY et al. 1976]. Heuristics reach near-optimal results [0.6, 0.7]. Hardware cache sizes range within [4, 64]. Substantial impact on rendering cost: from 3 to 0.6! Everybody does it. So, how do we minimize ACMR? Determining the optimal order is an NP-complete problem, but heuristics reach near-optimal results for common cache sizes, which range from 4 to 64 vertices. This is a huge improvement over using a random order. It can go from 3 processed vertices per triangle down to around 0.6 in many cases. That's a factor of 5 in rendering speed in the particular case where the application is completely vertex bound. So everybody uses some sort of vertex cache optimization in real-world applications.
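The quoted [0.5, 3] range follows from a short counting argument; a sketch, assuming a large closed manifold mesh, for which Euler's formula gives roughly twice as many triangles as vertices:

```latex
\mathrm{ACMR} \;=\; \frac{\#\text{transformed vertices}}{\#\text{triangles}},
\qquad
0.5 \;\approx\; \frac{V}{2V} \;=\; \frac{V}{T}
\;\le\; \mathrm{ACMR} \;\le\; \frac{3T}{T} \;=\; 3 .
```

The lower bound is attained when every vertex is transformed exactly once, the upper bound when no vertex is ever reused; the factor of 5 quoted above is simply 3/0.6.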

    8. Parallel short strips. Due to the FIFO replacement policy, for a completely regular mesh, the optimal reordering strategy is to use parallel strips, as shown here. Then, in regular regions of the mesh, for each internal strip, triangles alternate between having 0 and 1 processed vertices, thus achieving an average close to the optimal of 0.5.
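To see why parallel strips approach 0.5, consider one strip row sweeping across a regular grid, w quads wide, and assume the row of vertices shared with the previous strip is still in the cache (a worked example of mine, not from the slide):

```latex
\text{misses per row} = w + 1 \ \text{(the new top-row vertices)},\qquad
\text{triangles per row} = 2w
\;\Longrightarrow\;
\mathrm{ACMR} \approx \frac{w+1}{2w} \xrightarrow{\;w\to\infty\;} 0.5 .
```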

    9. Previous work. Algorithms sensitive to cache size: MeshReorder and D3DXMesh [HOPPE 1999], K-Cache-Reorder [LIN and YU 2006], and many others. Recent independent work [CHHUGANI and KUMAR 2007]. There are numerous cache-sensitive methods for reordering triangles for vertex cache optimization. Here, I will just focus on a couple of representative ones. The method of Hoppe 99, which the D3D optimizer is based on, tries to create parallel strips like those shown in the previous slide and achieves very good ACMR results, around 0.6. Lin, on the other hand, observes that for large caches local strip order is not as critical, and suggests an algorithm that traverses vertices in the mesh, emitting all non-emitted triangles adjacent to each vertex. The added flexibility allows them to further improve results a bit. Very recently, at I3D this year, Chhugani and Kumar proposed a new method that further reduces ACMR by another few percent.

    10. Previous work. Algorithms oblivious to cache size: dfsrendseq [BOGOMJAKOV et al. 2001], OpenCCL [YOON and LINDSTROM 2006]. Based on space-filling curves; asymptotically optimal; not as good as cache-specific methods; long running time; do not help with CAD/CAM. A completely different set of techniques generates orderings that are oblivious to cache size. Methods from Bogomjakov et al. and Yoon and Lindstrom are based on space-filling curves and provide asymptotically optimal results. The results are very good considering that the order must apply to all cache sizes, but they do not fare as well as the methods that optimize for a particular cache size. Furthermore, they can also have long running times since they are preprocessing algorithms, so they do not help with interactive applications, such as CAD/CAM.

    11. Our objective. Optimize at run-time: we even have access to the exact cache size. Faster than previous methods, i.e., O(t). Must not depend on cache size. Should be easy to integrate: run directly on index buffers. Should be general: run transparently on non-manifolds. In this paper we seek a triangle ordering algorithm that can be used to optimize at run-time, when we have access to the exact cache size of the board. Ideally, such a method should run in time linear in the number of triangles, regardless of cache size. It should also be easy to integrate into pipelines and be general enough to run on any input triangulation, including non-manifolds.

    12. Triangle-triangle adjacency unnecessary. Awkward to maintain on non-manifolds; by the time this is computed, we should be done. Use vertex-triangle adjacency instead, computed with 3 trivial linear passes. When trying to devise such a method, one observation is that triangle-triangle adjacency is not necessary. It is awkward to maintain on non-manifolds, and we actually want to be all done by the time this adjacency is computed. So, we use vertex-triangle adjacency instead, which is very simple to compute, much like counting sort. For details, please refer to the paper.
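A sketch of the counting-sort-style construction alluded to here: three linear passes over a flat index buffer build, for every vertex, the list of triangles that reference it. The layout and names are my assumptions, not the paper's API:

```cpp
#include <vector>

struct VertexTriAdjacency {
    std::vector<int> offsets;   // triangles of v live in [offsets[v], offsets[v+1])
    std::vector<int> triangles; // triangle ids grouped by vertex
};

VertexTriAdjacency buildAdjacency(const std::vector<int>& indices, int nVerts) {
    VertexTriAdjacency adj;
    adj.offsets.assign(nVerts + 1, 0);
    // Pass 1: count how many triangles touch each vertex.
    for (int v : indices) ++adj.offsets[v + 1];
    // Pass 2: prefix-sum the counts into start offsets.
    for (int v = 0; v < nVerts; ++v) adj.offsets[v + 1] += adj.offsets[v];
    // Pass 3: scatter each triangle id into its vertices' slots.
    adj.triangles.resize(indices.size());
    std::vector<int> cursor(adj.offsets.begin(), adj.offsets.end() - 1);
    for (std::size_t i = 0; i < indices.size(); ++i)
        adj.triangles[cursor[indices[i]]++] = int(i / 3);
    return adj;
}
```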

    13. Simply output vertex adjacency lists. So, if we output these vertex adjacency lists, this is the order we get. We process the vertices in the order shown by the green line, and each vertex outputs its adjacent triangles in arbitrary order, shown by the red line. We get these locally random, or tipsy, fans. Hence, we call our algorithm Tipsify. Note that since usually the entire neighborhood fits in the cache, this does not hurt ACMR.

    14. Choosing a better sequence. However, instead of choosing the vertices to fan around in random order, we seek an order with improved vertex locality. Here is such an order, where the vertices are processed in a zigzag pattern and each vertex outputs all of its adjacent triangles that have not been previously output.

    15. Selecting the next fanning vertex. Must be a constant-time operation. Select next vertex from 1-ring of previous; if none available, pick latest referenced; if none available, pick next in input order. In order for the algorithm to run in linear time, selecting the next vertex to process must be a constant-time operation that is independent of cache size. We basically select the next vertex from the 1-ring of the previously processed vertex. If none are available, we pick the latest referenced vertex, which is available from a stack. If there are still no options, then we pick the next one in the mesh input order. This results in an ACMR that is closer to the optimal of 0.5.
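A hedged sketch of this three-level fallback; all names are hypothetical. `liveCount[v]` is the number of not-yet-emitted triangles adjacent to v, and `score(v)` stands for the cache-position test detailed on the next slide (vertices failing the test score negative):

```cpp
#include <functional>
#include <stack>
#include <vector>

int nextFanningVertex(const std::vector<int>& oneRing,   // 1-ring of last fan
                      std::stack<int>& deadEnd,          // recently referenced
                      std::size_t& inputCursor,          // scan of input order
                      const std::vector<int>& liveCount,
                      const std::function<int(int)>& score) {
    // 1) Best-scoring 1-ring vertex that still has live triangles.
    int best = -1, bestScore = -1;
    for (int v : oneRing)
        if (liveCount[v] > 0 && score(v) > bestScore) {
            best = v;
            bestScore = score(v);
        }
    if (best >= 0) return best;
    // 2) Dead-end: most recently referenced vertex that is still live.
    while (!deadEnd.empty()) {
        int v = deadEnd.top();
        deadEnd.pop();
        if (liveCount[v] > 0) return v;
    }
    // 3) Last resort: next vertex in input order with live triangles.
    for (; inputCursor < liveCount.size(); ++inputCursor)
        if (liveCount[inputCursor] > 0) return int(inputCursor);
    return -1;  // every triangle has been emitted
}
```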

    16. Best next fanning vertex within 1-ring. Consider vertices referenced by emitted triangles; pick the one furthest in the FIFO that would still remain in cache. Now, how do we pick the best vertex from the 1-ring? There are 4 options here, which are the 4 vertices adjacent to the 3 emitted triangles. (The black triangles have been emitted previously.) We basically want to pick the vertex that is closest to being removed from the cache, but will still be in the cache until all of its adjacent triangles are output. To do this, we keep a cache time stamp for each vertex u that enters the cache; call it C sub u. By subtracting it from the current time stamp, we get u's position in the cache. In addition, we also use another data structure that keeps the number of adjacent live triangles, i.e., those that weren't output yet. In the case of vertex u it is 3. Making a worst-case assumption that each triangle would require entering two additional vertices into the cache, we can conservatively determine whether the vertex u will still be in the cache after emitting all of its live triangles. That is given by the inequality in the bottom right. We pick the vertex that is closest to exiting the cache among those that satisfy that inequality.
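The inequality itself appears on the slide image rather than in the transcript, but it can be reconstructed from the description just given (this reconstruction matches the published paper). With s the current time stamp, C_u the stamp recorded when u entered the cache, L_u the number of live triangles adjacent to u, and k the cache size, u is guaranteed to still be cached after its fan is emitted when:

```latex
\underbrace{(s - C_u)}_{\text{u's FIFO position}}
\;+\; \underbrace{2\,L_u}_{\text{worst-case new entries}}
\;\le\; k .
```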

    17. Tipsy pattern. And when we do this, the sequence converges to that zigzag pattern I showed previously. The reason it turns around is that if it were to keep going straight, the potential fanning vertex would not be in the cache. So, the larger the cache, the longer the zigzags.

    18. Tipsy pattern. And this is the resulting locally tipsy pattern that we get.

    19. Typical running times. And by the way, this can all be done very efficiently: Tipsify is faster than the alternatives. This is a log plot with mesh complexity on the x axis and timing on the y axis, using the original authors' implementations. Note that our method is about 100 times faster than K-Cache-Reorder, which is Lin's algorithm. Note that this is for a small cache size of 12. With larger cache sizes, our algorithm performs even better relative to the best alternatives, since its running time does not depend on the cache size.

    20. Preprocessing comparison. Here you can see that as cache sizes increase, our method provides better results compared to many other methods, even when they are optimized for the correct cache size. Lin still yields slightly better results, at 100x the processing cost.

    21. Typical ACMR comparison. Cache size of 12. Lin and Hoppe have the best results when they happen to be pre-processed for the correct cache size. However, note that if we do run-time processing with Tipsify, we improve on them whenever the board has a different cache size. In the paper, we also provide additional rendering time tests on several different available boards; to see all of those graphs, please refer to the paper. But in summary, we get most if not all of the gain, in a fraction of the processing time.

    22. Motivation: Rendering time dependency. OK. So far we have only addressed reducing the amount of vertex processing, which is the left side. Remember, we are also interested in reducing pixel processing, in case the scene is highly pixel bound.

    23. Part II: Overdraw. As you know, pixel shaders can do a lot of work, and pixel processing can dominate rendering cost, particularly if these operations are very expensive and there is a lot of depth complexity in the scene. Here are some examples of overdraw on 2 meshes. These meshes were optimized for vertex locality using the method of Hoppe 99. Since this and other similar approaches tend to traverse the triangles from one side of the mesh to the other, it is natural that from some viewpoints, such as these, there will be high pixel overdraw, which is shown in dark. Note that I am not plotting the depth complexity of the model, but rather just the number of times each pixel was processed. If a pixel is obscured by an already processed pixel, then it does not get processed. This is a hardware optimization known as early-Z culling and can save significant rendering time. It is available and automatically enabled on most boards on the market.

    24. Options. Dynamic depth-sort: can be too expensive; destroys mesh locality. Z-buffer priming: can be too expensive. Sorting per object (e.g., GOVINDARAJU et al. 2005): does not eliminate intra-object overdraw; not transparent to the application; requires CPU work; orthogonal method. So, what are the alternatives for reducing this overdraw? A dynamic depth sort is too expensive and harms vertex cache performance. Z-buffer priming is another alternative, but it can be expensive if the mesh has lots of vertices. There are several excellent methods that sort per object, such as that of Govindaraju 2005. They do not eliminate intra-object overdraw unless we break the object into many small draw calls, so they are not transparent to the application and require CPU work within the rendering loop in order to perform the sort. As an aside, we still highly recommend that sorting per object be used orthogonally to our method: our method should be used to optimize for overdraw within each draw call, and different draw calls can still be sorted on the CPU to further reduce overdraw.

    25. Objective. Simple solution: single draw call; transparent to application; good in both vertex-bound and pixel-bound scenarios; fast to optimize. For our purpose, we want a solution that is simple, completely transparent, and provides one index buffer that has both good vertex cache and good overdraw performance. And, of course, we want it to be computed efficiently, just like Tipsify.

    26. Insight: View-independent ordering [Nehab et al. 06]. Back-face culling is often used. Convex objects have no overdraw, regardless of viewpoint. Might be possible even for concave objects! First, we start with a bit of background. In Nehab 06, we observed that in the presence of back-face culling, convex objects never have overdraw, and even concave objects may have an order that results in no overdraw. Both of these examples do: if we render the areas labeled 1 before 2, and 2 before 3, then no matter where you look at them from, there is never any overdraw, all thanks to back-face culling.

    27. Overdraw (before). Here is the overdraw result using the original vertex-cache-only order.

    28. Overdraw (after). And here is the result of Nehab 06. There is much less overdraw now, and this is achieved with a small penalty to vertex cache ACMR. If the scene is pixel bound, it can achieve a speedup factor of 2 or even more.

    29. Our algorithm. Can we do it at load-time or interactively? Yes! (on the order of milliseconds). Quality on par with the previous method. Can be executed immediately after vertex cache optimization (Part I). Like Tipsify, operates on vertex and index buffers. However, the approach presented in Nehab 06 is much slower and executed offline. We want a method that is executed at run-time immediately after vertex cache optimization and that can trade off vertex locality and overdraw depending on the system it is running on. So, we present an algorithm that does just that. And like Tipsify, it operates on vertex and index buffers directly.

    30. Algorithm overview. 1. Vertex cache optimization: optimize for the vertex cache first (Tipsify). 2. Linear clustering: segment the index buffer into clusters. 3. Overdraw sorting: sort clusters to minimize overdraw. Our full algorithm consists of 3 steps. First, we use Tipsify to optimize for vertex locality, as I described earlier in the talk. Then we cluster the mesh: for that, we simply segment the tipsified index buffer into clusters, which is very efficient. Finally, we sort the clusters to reduce overdraw.

    31. 2. Linear clustering. During Tipsify optimization, maintain the current ACMR. Insert a cluster boundary when a cache flush is detected, or when the ACMR reaches above a particular threshold λ. The threshold λ trades off cache efficiency vs. overdraw. If we care about both, use λ = 0.75 on all meshes: good enough vertex cache gains, and more than enough clusters to reduce overdraw. Let me describe linear clustering first. In order to cluster, during the Tipsify optimization, we can maintain the current ACMR. We then insert a cluster boundary when a cache flush occurs, or when the ACMR reaches above a particular threshold λ. This allows us to easily trade off vertex vs. pixel processing by changing this threshold at run-time depending on hardware. Note that more clusters lead to lower pixel overdraw, but a higher vertex cache miss rate. And if we care about both vertex and pixel processing, we note that using a λ of 0.75 achieves reasonable ACMR results while providing more than enough clusters for overdraw reduction in the models we tested. However, depending on the application or hardware, one may want to experiment to further fine-tune the results. A sketch of this boundary rule follows below.
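A sketch of the boundary rule as the slide states it; the paper's exact bookkeeping may differ. It assumes the Tipsify pass has already recorded, per emitted triangle, the cumulative vertex-cache miss count and whether that triangle triggered a cache flush (both inputs and names are mine):

```cpp
#include <vector>

// Cut the tipsified triangle order into clusters. A boundary is free at a
// cache flush; otherwise one is inserted once the running ACMR of the
// current cluster rises above lambda. Returns the triangle index at which
// each cluster starts.
std::vector<int> clusterBoundaries(const std::vector<bool>& flushedAt,
                                   const std::vector<int>& cumulativeMisses,
                                   float lambda) {
    std::vector<int> starts{0};
    int clusterStart = 0, missesAtStart = 0;
    for (int t = 0; t < int(flushedAt.size()); ++t) {
        int misses = cumulativeMisses[t] - missesAtStart;
        float running = float(misses) / float(t - clusterStart + 1);
        if ((flushedAt[t] || running > lambda) && t + 1 < int(flushedAt.size())) {
            starts.push_back(t + 1);   // next cluster begins after triangle t
            clusterStart = t + 1;
            missesAtStart = cumulativeMisses[t];
        }
    }
    return starts;
}
```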

    32. 3. Sorting: The DotRule. How do we sort the clusters? Intuition: clusters facing out have a higher occluder potential. Finally, let me describe how we sort the clusters to reduce overdraw. The intuition here is that clusters that are facing out have a higher occluder potential; that is, they are more likely to occlude other portions of the mesh. This concept can be captured by computing the dot product of the average cluster normal and the vector from the mesh centroid to the cluster centroid. So, if the dot product between those two vectors is high, the occluder potential is high and the cluster should be rendered early. Here, for example, cluster 1 has a very high dot product, since it is far from the centroid and facing out. So it should be rendered early.
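A sketch of the dot rule as just described; the small vector type and helper names are mine, and the paper may weight or normalize the potential differently (e.g., by cluster area):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Vec3 { float x, y, z; };
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Sort clusters by decreasing occluder potential: the dot product of the
// cluster's average normal with the vector from the mesh centroid to the
// cluster centroid. Outward-facing, peripheral clusters render first.
std::vector<int> sortClusters(const std::vector<Vec3>& centroid, // per cluster
                              const std::vector<Vec3>& normal,   // average, per cluster
                              Vec3 meshCentroid) {
    std::vector<int> order(centroid.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return dot(normal[a], sub(centroid[a], meshCentroid)) >
               dot(normal[b], sub(centroid[b], meshCentroid));
    });
    return order;  // draw clusters in this order, strongest occluders first
}
```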

    33. 3. Sorting: The DotRule. Here, cluster 2 also has a positive dot product, although not as high, so it is rendered after 1.

    34. 3. Sorting: The DotRule. And 3 has a negative dot product, since it is actually facing in toward the center of the mesh, so it is rendered last. In many cases, such as this one, the resulting sorted mesh may have no overdraw at all, no matter where you render it from. This is a heuristic, and there are some special cases where it doesn't work perfectly, but it reduces overdraw in the vast majority of cases.

    35. Sorted triangles. Here are some examples on a complex mesh. This mesh has its triangles color-coded by the occluder potential, that is, the result of the dot product. Here we see it from 3 different viewpoints. Note that no matter where we look at it from, we usually see darker portions in front of whiter ones. That is exactly what is expected, since the darker portions should occlude whiter ones in order to reduce overdraw when rendering in this sorted order.

    36. Sorted triangles. This is another example. Note that the bottoms of the ears are white and rendered last, since they don't occlude any other front-facing regions when they are visible. On the other hand, the tops of the ears are darker, since when they are visible they usually occlude other parts of the model.

    37. Sorted clusters. Here, we show an example in which we sort with small clusters of triangles in order to preserve some vertex locality. So, the λ is lower in this example.

    38. Comparison to Nehab et al. 06. We optimize for the vertex cache first, which allows for significantly more clusters. Clusters are not as planar, but we can afford more. New heuristic to sort clusters very fast. Trade off vertex vs. pixel processing at run-time. Now let me contrast this new method with our I3D 06 method. As opposed to Nehab et al., this method optimizes for the vertex cache first. This allows for significantly more clusters, since many of the cluster boundaries can be placed exactly at cache flushes, where there is no penalty to vertex caching. Clusters are not as planar, but we can afford more of them since some of these boundaries come for free. And the new heuristic can sort the clusters much faster. Finally, it allows us to trade off between vertex and pixel processing at run-time on a per-hardware basis.

    39. Timing comparisons. To give an idea, on average the overall speedup that we get is over 3000x compared to the offline method, over several meshes.

    40. Overdraw comparison. Here are some quantitative results with varying λ. At the top we have the vertex cache ACMR, while at the bottom we have the maximum overdraw ratio (MOVR), where 1 means no overdraw, or optimal front-to-back rendering. Note that as λ increases, ACMR degrades significantly. At the extreme, where λ is set to 3 (the red bar in the graph), every triangle is its own cluster, so the ACMR is really high since we are sorting all triangles based on the dot rule. However, as λ increases, MOVR improves and gets closer to 1, while vertex cache efficiency degrades. This graph is also shown in the paper, and if you observe it carefully, you will note that in some cases it outperforms the quality of the results of Nehab 06, while in some other cases it doesn't. However, remember it is often over 3000 times faster to compute on average, so it can be executed at run-time.

    41. Comparison. Here is an example in which we found a λ setting where we achieved nearly identical ACMR and overdraw compared to Nehab 06 in a fraction of the time: 76 milliseconds as opposed to 40 seconds of processing.

    42. Summary. Run-time vertex cache optimization. Run-time overdraw reduction. Operates on vertex and index buffers directly; works on non-manifolds. Orders of magnitude faster. Allows for varying cache sizes and animated models. Quality comparable with previous methods. About 500 lines of code! Extremely easy to incorporate in a rendering pipeline. We expect most game rendering pipelines will incorporate such an algorithm, and expect CAD applications to use and re-compute the ordering interactively as geometry changes. To summarize, we presented a fast run-time system that orders the triangles in a mesh to reduce both vertex cache misses and overdraw. It operates directly on vertex and index buffers, so it works on non-manifolds. It is orders of magnitude faster than previous methods and allows for varying cache sizes and animated models. The quality of the results is comparable with previous methods. It is only about 500 lines of code.

    43. In fact, here is the entire overdraw part, in case anyone is interested in copying it down. We will be making a fully commented version of this available online in two weeks. So, please check any of our websites for a link to it.

    44. Summary. Run-time triangle order optimization. Run-time overdraw reduction. Operates on vertex and index buffers directly; works on non-manifolds. Allows for varying cache sizes and animated models. Orders of magnitude faster. Quality comparable with the state of the art. About 500 lines of code! Extremely easy to incorporate in a rendering pipeline. We hope game rendering pipelines will incorporate such an algorithm, and that CAD applications will use and re-compute the ordering interactively as geometry changes. Now, returning to the summary, another important feature is that it is easy to incorporate in a rendering pipeline. And we hope that many game rendering pipelines will incorporate such an algorithm. It may improve speeds in some scenes, perhaps not in others, but it certainly doesn't hurt, because the processing is nearly free. We also hope CAD applications in the future will use and re-compute these orderings interactively as geometry changes.

    45. Thanks. Phil Rogers, AMD; 3D Application Research Group, AMD. We would like to thank Phil Rogers from AMD, as well as the members of the 3D Application Research Group, for their numerous comments, discussion, and support in the early stages of this project. This concludes my talk. Thank you for your attention; I will be happy to take any questions at this point. [Speaker note on Forsyth: a recent fast reordering algorithm described on a web page. From what I recall, no timing results are provided; it was optimized with an LRU cache in mind, although it gave some good FIFO results; and it does not handle overdraw, just vertex cache optimization.]

    46. ?
