
MFT Prototype Performance Findings


Intro

The objective going into this project was to determine whether we could render over 5,000 animated guests at well above the target frame rate of 60 FPS on mid-range hardware (GTX 1060, i7-8750) in a bare test environment. Ideally we would hit 150+ FPS, leaving frame budget for the rest of the game without immediately compromising on guest numbers.

 

To reach guest counts past ~100 during gameplay without significant framerate issues, rendering the guests and their animations seems to require vertex animation textures (VATs), though VATs are only one part of the final solution. A VAT is a texture into which all the necessary data from a skeletal mesh’s animations is baked; skeletal mesh LODs can be baked in as well. The animations are then played back on static meshes, which is significantly cheaper than rendering skeletal meshes. The downsides are that animation events and anything driven by an animation graph (such as anim blending) are not available, but at this scale that isn’t an issue for our use case. However, my measurements show that swapping skeletal meshes for VAT-driven static meshes is not enough on its own: individual static meshes with VATs still missed the target performance for a 5,000 guest count by a large margin.

 

When using VATs, the models can be rendered either as individual static meshes or as hierarchical instanced static meshes. With individual static meshes, the maximum guest count is still heavily limited by the draw calls that come with rendering that many separate meshes. Secondarily, once gameplay implementation becomes a factor, the number of individually simulated Actors with their own components becomes a performance problem at this scale. The next section outlines the measured cost of each approach in the simplest possible environment.


Measurements

First to define the test circumstances:

5184 guests were spawned for each test.

Guests had no AI, and no state to manage. The purpose of taking these measurements was to isolate the mesh rendering and animation simulation costs.

All guests shared the same material instance reference – they did not use dynamically created material instances.

Approximately 15% of the 5,184 guests are off screen, to account for possible variance in how susceptible each technique is to occlusion culling.

The guest mesh used for the testing has 5 material slots.

The guest static mesh (when using VAT) and the skeletal mesh (when not using VAT) both had aggressive LOD settings. The VAT supports LODs.

The materials used in each test were mostly default (just a color parameter) and used no textures, except that the VAT materials sampled their VAT textures in the test cases where guests were animated.


Results

  • BASELINE: Test scene with no guests: 186.64 FPS, 5.36 ms frame time, 197 draw calls, 18,003 tris
  • Approach 1: Guests using skeletal meshes, not playing any animations: 4.98 FPS, 200.62 ms frame time, 42,276 draw calls, 588,843 tris
  • Approach 2: [A pointless method, included for completeness] Individual guest actors, each with an instanced static mesh component containing just one instance, not animated: 9.08 FPS, 110.17 ms frame time, 41,877 draw calls, 827,626 tris
  • Approach 3: Actors with one static mesh component each, using a vertex animation texture, playing a walk cycle animation: 12.08 FPS, 82.80 ms frame time, 9,576 draw calls, 828,640 tris
  • Approach 4: One actor with one hierarchical instanced static mesh component containing all 5,184 instances, using a vertex animation texture, playing a walk cycle animation: 182.57 FPS, 5.48 ms frame time, 475 draw calls, 1,439,968 tris

 

At 182 FPS for 5,184 animated guests, the last method listed (approach 4) seems to be the only realistic way to reach the required guest counts, especially since movement and AI costs for each guest will come on top of this rendering cost. At a 60 FPS target, the frame budget is 16.67 ms, so approach 4 is well within budget with roughly 11 ms to spare, while the next fastest approach (#3, at 82.80 ms) is roughly 5x over budget.

 

About Approach #4

Given that approach #4 seems like our best option here, I’m going to focus on details about it.

 

Note that approach 4 rendered the 5,184 animated guests at 182.5 FPS, compared to the empty baseline scene’s 186 FPS. This suggests that most of the eventual per-guest cost will come from simulating the guests, rather than from animation or rendering.
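For reference, here is a minimal sketch of how such a setup might look, assuming a hypothetical AGuestCrowdActor with placeholder properties for the guest mesh and the shared VAT material (an illustration of the approach, not the exact test code):

    // GuestCrowdActor.h - one actor, one HISM component, all guest instances.
    #include "CoreMinimal.h"
    #include "GameFramework/Actor.h"
    #include "Components/HierarchicalInstancedStaticMeshComponent.h"
    #include "Engine/StaticMesh.h"
    #include "Materials/MaterialInterface.h"
    #include "GuestCrowdActor.generated.h"

    UCLASS()
    class AGuestCrowdActor : public AActor
    {
        GENERATED_BODY()

    public:
        AGuestCrowdActor()
        {
            // A single component renders every guest; instances batch their draw calls per LOD/material section.
            GuestInstances = CreateDefaultSubobject<UHierarchicalInstancedStaticMeshComponent>(TEXT("GuestInstances"));
            RootComponent = GuestInstances;
        }

        // Spawn Count guest instances laid out in a simple grid.
        void SpawnGuests(int32 Count, float Spacing)
        {
            GuestInstances->SetStaticMesh(GuestMesh); // the static mesh the VAT was baked against

            // Every material slot on the guest mesh gets the same shared VAT material instance.
            for (int32 Slot = 0; Slot < GuestInstances->GetNumMaterials(); ++Slot)
            {
                GuestInstances->SetMaterial(Slot, GuestVATMaterial);
            }

            const int32 PerRow = FMath::CeilToInt(FMath::Sqrt((float)Count));
            for (int32 i = 0; i < Count; ++i)
            {
                const FVector Location(Spacing * (i % PerRow), Spacing * (i / PerRow), 0.f);
                GuestInstances->AddInstance(FTransform(Location));
            }
        }

        UPROPERTY(EditAnywhere) UStaticMesh* GuestMesh = nullptr;               // guest mesh baked for the VAT
        UPROPERTY(EditAnywhere) UMaterialInterface* GuestVATMaterial = nullptr; // shared VAT material instance

        UPROPERTY(VisibleAnywhere)
        UHierarchicalInstancedStaticMeshComponent* GuestInstances = nullptr;
    };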

 

First, it may seem difficult to simulate guests when they are just indices in a hierarchical instanced static mesh component, but it is workable: most crowd simulation implementations in Unreal build their behaviour within exactly this constraint (for example, Flock AI on Git). It works because you still have position and rotation control over each individual instance, and you can feed instance-specific data into the single material instance shared by all instances. As a result, separate instances can even play separate animations.

 

One of the big advantages of approach #4 is that it keeps the draw call count very low. A potential counterargument is that we might seem to need many different material instances to render all the guest types in the actual game. For some techniques and styles, that isn’t required. For example, all characters can reuse the same material as a palette color lookup table, where each mesh’s UVs simply act as indices into the palette. Below are some examples:

 

Palette Texture:

 

UV island placed within one of the color cells, to effectively “assign” that color to those faces of the mesh:

 

This technique works well for styles that use a solid color for each part of the mesh and disregard texture detail entirely. This is the style we were leaning toward after discussing the art direction, but it is not present in the current gameplay demo.

 

Before explaining more nuances of approach #4, I want to highlight how much can be done by supplying mesh-instance-specific data to a material. All instances rendered through a hierarchical instanced static mesh (HISM) component can share the same material instance and get their visual variation from instance-specific data supplied to that material. A key example is telling the material which animation from the VAT each instance should use, so different instances within one HISM component can play different animations in the same frame from a single material.
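To make this concrete, below is a minimal sketch using the engine’s per-instance custom data on instanced mesh components (NumCustomDataFloats / SetCustomDataValue), with the material reading the values through Per Instance Custom Data nodes. The slot meanings (animation index, start time) are assumptions for illustration, continuing the hypothetical crowd actor from the earlier sketch:

    // Setup, once, before instances are added: reserve two floats of per-instance data.
    //     GuestInstances->NumCustomDataFloats = 2;

    // Write one instance's animation selection into its custom data slots. The shared material
    // reads slots 0 and 1 via "Per Instance Custom Data" nodes to sample the right part of the VAT.
    void SetGuestAnimation(UHierarchicalInstancedStaticMeshComponent* Guests,
                           int32 InstanceIndex, float AnimationIndex, float AnimStartTime)
    {
        Guests->SetCustomDataValue(InstanceIndex, /*CustomDataIndex=*/0, AnimationIndex, /*bMarkRenderStateDirty=*/false);
        Guests->SetCustomDataValue(InstanceIndex, /*CustomDataIndex=*/1, AnimStartTime,  /*bMarkRenderStateDirty=*/true);
    }

Because the variation lives in instance data, a single material instance still serves every guest; only the per-instance floats differ.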

 

A potential cost of approach 4 that would surface during gameplay implementation is that if we require different meshes for different guest types, we may need a separate HISM component for each mesh (a minimal sketch of this follows the plugin link below). This isn’t too bad, because all instances within each component still share draw calls, as shown in the measurements. That is one solution, but there are others, and some can be combined. For example, with an opacity-masked material, a single mesh can include the geometry for every variation, and mesh-instance-specific UV offsets into the palette lookup texture can control which parts of the mesh are visible for each instance: the palette cells indexed by the offset UVs would have an alpha of 0 for the parts that should be hidden. Worth noting: the more geometry a mesh contains, the more data must be packed into the VAT, which eventually hits a limit, but this is optimizable. One route is reducing the baked frame rate and relying more on interpolation; another is a variation on the VAT technique that bakes bone animations instead of vertex animations (baked bone animations can be reused between meshes sharing a skeleton, since the baked data is independent of the vertex data). An example of a plugin that supports both VAT baking and bone animation baking:

Vertex Anim Toolset in Code Plugins – UE Marketplace
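On the one-component-per-mesh option mentioned above, here is a minimal sketch under the same assumptions as the earlier crowd actor sketch; ComponentsByMesh is an assumed TMap member and GuestVATMaterial the shared VAT material:

    // Assumed member on the crowd actor:
    //     TMap<UStaticMesh*, UHierarchicalInstancedStaticMeshComponent*> ComponentsByMesh;

    // Return the HISM component that batches all guests using this mesh, creating it on demand.
    // Each distinct guest mesh costs one extra component, but its instances still share draw calls.
    UHierarchicalInstancedStaticMeshComponent* AGuestCrowdActor::GetOrCreateComponentForMesh(UStaticMesh* Mesh)
    {
        if (UHierarchicalInstancedStaticMeshComponent** Existing = ComponentsByMesh.Find(Mesh))
        {
            return *Existing;
        }

        UHierarchicalInstancedStaticMeshComponent* NewComp = NewObject<UHierarchicalInstancedStaticMeshComponent>(this);
        NewComp->SetStaticMesh(Mesh);
        NewComp->SetMaterial(0, GuestVATMaterial); // shared VAT material, as in the earlier sketches
        NewComp->SetupAttachment(RootComponent);
        NewComp->RegisterComponent();

        ComponentsByMesh.Add(Mesh, NewComp);
        return NewComp;
    }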


About Movement Costs and Navigation

With approach 4, having all 5,184 guests simultaneously use a simple pathing algorithm to walk toward a destination actor, updating their facing direction from their heading each frame, brings the FPS down to 79. Given that not all guests will be running movement calculations all the time, this seems acceptable.
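A minimal sketch of that movement loop, assuming the hypothetical crowd actor from the earlier sketches has ticking enabled plus assumed Destination (FVector) and WalkSpeed (float) members:

    void AGuestCrowdActor::Tick(float DeltaSeconds)
    {
        Super::Tick(DeltaSeconds);

        const int32 Count = GuestInstances->GetInstanceCount();
        TArray<FTransform> NewTransforms;
        NewTransforms.SetNum(Count);

        for (int32 i = 0; i < Count; ++i)
        {
            FTransform Current;
            GuestInstances->GetInstanceTransform(i, Current, /*bWorldSpace=*/true);

            // Step toward the destination and face the heading.
            const FVector ToGoal = (Destination - Current.GetLocation()).GetSafeNormal();
            Current.SetLocation(Current.GetLocation() + ToGoal * WalkSpeed * DeltaSeconds);
            if (!ToGoal.IsNearlyZero())
            {
                Current.SetRotation(ToGoal.ToOrientationQuat());
            }
            NewTransforms[i] = Current;
        }

        // Push all new transforms in one batch so the render state is dirtied only once per frame.
        GuestInstances->BatchUpdateInstancesTransforms(0, NewTransforms,
            /*bWorldSpace=*/true, /*bMarkRenderStateDirty=*/true, /*bTeleport=*/false);
    }

In the actual game, guests would follow cached splines rather than walking straight at a point, but the batched transform update is the part that keeps the per-frame cost manageable.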

 

Using splines seems to be the preferred approach for achieving efficient movement and pathing with this number of guests. Splines from roads placed in the festival could be reused for pathing. Additionally, computed navigation paths can be converted into splines and cached for reuse among all guests. These cached navigation splines can be regenerated as festival attractions and roads are placed, and the navigation queries can use the player-defined road/walkway splines to determine start and goal points.
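A minimal sketch of the path-to-spline conversion, using the engine’s synchronous navigation query; the caching policy (what to key each spline on and when to rebuild it) is left open, and the function name is just for illustration:

    #include "NavigationSystem.h"
    #include "NavigationPath.h"
    #include "Components/SplineComponent.h"

    // Compute a navigation path once, convert it into a spline, and hand the spline back so it
    // can be cached and shared by every guest walking the same route.
    USplineComponent* BuildNavSpline(AActor* Owner, const FVector& Start, const FVector& Goal)
    {
        UNavigationPath* Path = UNavigationSystemV1::FindPathToLocationSynchronously(Owner, Start, Goal);
        if (!Path || !Path->IsValid())
        {
            return nullptr;
        }

        USplineComponent* Spline = NewObject<USplineComponent>(Owner);
        Spline->RegisterComponent();
        Spline->ClearSplinePoints(/*bUpdateSpline=*/false);

        // Copy the nav path's corner points into the spline, then rebuild the spline once at the end.
        for (const FVector& Point : Path->PathPoints)
        {
            Spline->AddSplinePoint(Point, ESplineCoordinateSpace::World, /*bUpdateSpline=*/false);
        }
        Spline->UpdateSpline();

        return Spline;
    }

Guests would then sample the cached spline (e.g. GetLocationAtDistanceAlongSpline) instead of each issuing their own navigation query.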


About Rendering Bottlenecks

The gameplay thread (AI, movement, gameplay simulation) will likely be our primary performance bottleneck if we use approach #4, given these measurements from the guest performance test environment and the rendering cost measurements from the current demo. However, in the event rendering becomes the bottleneck (the rendering thread taking longer per frame than the game thread), there are a few good optimization opportunities for rendering the festival.

 

  1. The first opportunity concerns assets placed as part of the festival’s features or decorations. Since placed assets of a given type share the same mesh and material (e.g. Bush 01 always has the same mesh and material), all placed instances of an asset type could be rendered collectively through an instanced mesh component (IMC); for example, all Bush 01s would be rendered through one IMC. The assets can still exist as separate actors for gameplay purposes, but mesh rendering would be delegated to another actor that manages the instanced mesh representations (see the first sketch after this list). The benefit is that draw calls for the mesh instances get batched together, which yields significant gains, as seen in the guest static mesh measurements.

 

  2. Make sure all of the highest tri-count festival assets have 3 to 5 LODs generated for them.

 

  3. Dynamically adjust the screen percentage through dynamic resolution, combined with temporal upsampling. Without temporal upsampling, even slight reductions to screen percentage produce a blurry image. Temporal upsampling with dynamic resolution produces a crisper image in exchange for slightly worse performance than a pure screen percentage reduction, which makes the combination an acceptable middle ground (see the console-variable sketch after this list).

 

  4. Culling volume actors. These allow fine-grained control over culling the visibility of actors within the volume based on their size and distance from the camera. In our case, it may be helpful to cull the smallest actors when the camera is sufficiently zoomed out.
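On point 1, here is a minimal sketch of the delegation idea: a hypothetical AFestivalInstanceManager actor (assumed to sit at the world origin) that placed asset actors register their mesh and transform with, while hiding their own static mesh component.

    #include "Components/InstancedStaticMeshComponent.h"

    // Assumed member: TMap<UStaticMesh*, UInstancedStaticMeshComponent*> ComponentsByMesh;

    // Called by a placed asset actor after placement. The asset actor hides its own
    // StaticMeshComponent and stores the returned index so its instance can be removed or
    // updated later (e.g. when the player deletes or moves the asset).
    int32 AFestivalInstanceManager::RegisterPlacedAsset(UStaticMesh* Mesh, const FTransform& WorldTransform)
    {
        UInstancedStaticMeshComponent*& Comp = ComponentsByMesh.FindOrAdd(Mesh);
        if (!Comp)
        {
            // One instanced component per asset type (all Bush 01s share this one, and so on).
            Comp = NewObject<UInstancedStaticMeshComponent>(this);
            Comp->SetStaticMesh(Mesh);
            Comp->SetupAttachment(RootComponent);
            Comp->RegisterComponent();
        }

        // The manager sits at the world origin, so the asset's world transform doubles as the instance transform.
        return Comp->AddInstance(WorldTransform);
    }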
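On point 3, here is a sketch of the console variables involved, based on my understanding of the engine’s dynamic resolution settings; the floor and budget values are placeholders to tune, and the exact variable names should be verified against the engine version we ship on.

    #include "HAL/IConsoleManager.h"

    // Enable temporal upsampling plus dynamic resolution from code (null checks omitted for brevity).
    static void EnableDynamicResolutionWithUpsampling()
    {
        IConsoleManager& CVars = IConsoleManager::Get();

        CVars.FindConsoleVariable(TEXT("r.TemporalAA.Upsampling"))->Set(1);              // upscale with TAAU rather than a plain spatial upscale
        CVars.FindConsoleVariable(TEXT("r.DynamicRes.OperationMode"))->Set(2);           // force dynamic resolution on
        CVars.FindConsoleVariable(TEXT("r.DynamicRes.MinScreenPercentage"))->Set(70.0f); // placeholder floor so the image never gets too soft
        CVars.FindConsoleVariable(TEXT("r.DynamicRes.FrameTimeBudget"))->Set(16.6f);     // placeholder budget in ms (~60 FPS)
    }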