
Making Your Game Go Fast by Asking Windows Nicely



Jan 13, 2022 • Mason Remaley • way-of-rhea tech


Normally, to make your software go faster, it has to do less work.

This usually involves improving your algorithms, skipping work the user won’t see, factoring your target hardware into the design process, or modifying your game’s content.

We’re not talking about any of that today. This post is a list of ways to make your game run faster on Windows–without making any major changes to your game’s content, code, or algorithms.

I employ these optimizations in Way of Rhea, a puzzle adventure written in a mix of Rust and a custom Rust scripting language. While this project is written in Rust, all the optimizations listed here are language independent–where the translation isn’t straightforward I’ve provided both Rust and C++ examples.


Trick #1: Asking Windows Nicely

One day, while working on some optimizations, I discovered that completely disabling one of Way of Rhea’s subsystems made the game slower. That…didn’t add up. Doing less work should take less time, so I did some investigating.

My initial assumption was that some weird dependency on that subsystem resulted in more work being done when it was absent, but that turned out not to be the case.

When running at 60hz, and especially with a major subsystem disabled, Way of Rhea easily spends the majority of each 16.66ms frame blocking while it waits for the next vblank. Apparently this led Windows to believe that the workload was low priority and should be throttled, which increased both the frame time and the frame time variability, and caused the game to start occasionally missing vblank--which is how I noticed the problem in the first place.

You can ask Windows nicely not to do this by linking against PowrProf.lib and then calling PowerSetActiveScheme, declared in powersetting.h, as follows:

if (PowerSetActiveScheme(NULL, &GUID_MIN_POWER_SAVINGS) != 0) {
    LOG_WARN("PowerSetActiveScheme failed");
}
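In Rust the same call needs an FFI binding for both PowerSetActiveScheme and the scheme GUID, so here's a minimal sketch with a hand-written declaration. In practice you might prefer the windows or winapi crate instead; the GUID value below is GUID_MIN_POWER_SAVINGS (the "High performance" scheme) as it appears in the SDK headers, so double-check it against your own SDK.

(rust version)

#[repr(C)]
struct Guid {
    data1: u32,
    data2: u16,
    data3: u16,
    data4: [u8; 8],
}

// GUID_MIN_POWER_SAVINGS, aka the "High performance" scheme:
// {8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c}
const GUID_MIN_POWER_SAVINGS: Guid = Guid {
    data1: 0x8c5e7fda,
    data2: 0xe8bf,
    data3: 0x4a96,
    data4: [0x9a, 0x85, 0xa6, 0xe2, 0x3a, 0x8c, 0x63, 0x5c],
};

#[link(name = "PowrProf")]
extern "system" {
    // DWORD PowerSetActiveScheme(HKEY UserRootPowerKey, const GUID *SchemeGuid);
    fn PowerSetActiveScheme(
        user_root_power_key: *mut core::ffi::c_void,
        scheme_guid: *const Guid,
    ) -> u32;
}

fn request_min_power_savings() {
    // ERROR_SUCCESS is 0.
    if unsafe { PowerSetActiveScheme(std::ptr::null_mut(), &GUID_MIN_POWER_SAVINGS) } != 0 {
        eprintln!("PowerSetActiveScheme failed");
    }
}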

This solved my odd performance problem, and had a positive impact on my game’s performance when running on battery power.

Update:

It was pointed out to me that this isn't just tweaking a heuristic: power schemes can actually be configured by users through Windows Settings. With that in mind, should a game mess with them?

I'm torn on this. I want to say no, but on the other hand, as a player I had no idea this was configurable, and now that I know, I would never intentionally play a game on balanced mode--you're not gonna save much battery while running a game, regardless of the mode!

I'll check if Microsoft has any guidelines on this, and whether previous games have set a precedent. The ideal solution would involve getting the same effect as calling this API without making a lasting change to the mode. Alternatively, the user could be prompted if they're not already in high performance mode.

I'll replace this note once I've reached a conclusion.

Aside: How much of a difference should I expect?

I added a debug option to Way of Rhea that toggles between power schemes at runtime, and tested the result on two laptops, both plugged in and on battery power.

The actual measured frame times vary too much to make a readable table–they depend on how much and when Windows chooses to throttle you–but here are my key takeaways:

  • I've repeatedly seen replacing the default scheme by calling this API instantly turn 20ms frames into 7ms frames.
    • You won't see this unless you're currently throttled, though. Throttling occurs most often but not solely when you're using a laptop running on battery power.
  • GUID_MAX_POWER_SAVINGS is the slowest mode, GUID_MIN_POWER_SAVINGS is the fastest.
    • GUID_MIN_POWER_SAVINGS and GUID_TYPICAL_POWER_SAVINGS often behave similarly, but not always. You would expect typical to be the default, but this does not seem to be the case (?).
  • While this API helps, it does not completely turn off throttling.

Trick #2: Set your DPI Correctly

Your players probably play other games, right? So they probably have a bunch of fancy hardware, like a 4k monitor that they wanna play your game on?

Well, to prevent scaling issues in legacy apps, on high DPI monitors Windows renders all applications at a lower resolution and then upscales them, unless the application explicitly opts out by becoming “DPI aware.”

Not only does this mean your players won’t get to take advantage of their nice monitors, it also means you have less headroom before vblank because you have to go through the compositor for upscaling, which likely also rules out fullscreen exclusivity.

Aside: Measuring Perf

If you want to measure the perf cost of this at home, you need to measure headroom–not throughput, not frame time. The upres happens outside of your code so it won’t affect your frame time, and vsync needs to be off to measure throughput which means you may draw extra frames that never make it to the compositor and therefore don’t need to be upresed, biasing your measurement.

Your headroom is equivalent to the amount of time you can busy wait before you miss vblank. On a 60hz monitor you’d expect nearly 16.66ms of headroom, but I found that when my DPI was not set correctly, it was closer to 14ms. That’s ~2.66ms less you have to work with before your game stutters. The heuristics for fullscreen exclusivity and the details of the compositor appear to change over time, and none of this is well documented, so YMMV.
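If you want to automate that measurement, the sketch below shows the idea: with vsync on, burn an adjustable amount of extra time each frame and watch for missed vblanks; the largest padding that never misses approximates your headroom. The names and the 60hz assumption here are mine, not from any real profiler.

use std::time::{Duration, Instant};

// Burn CPU time without sleeping, since sleeping would invite the scheduler
// and power management to get involved.
fn busy_wait(padding: Duration) {
    let start = Instant::now();
    while start.elapsed() < padding {
        std::hint::spin_loop();
    }
}

// Returns true if this frame (padding + render + swap) overran one refresh
// interval, i.e. we missed vblank. Assumes a 60hz display with vsync enabled;
// `render_and_swap` is your normal frame body ending in a buffer swap.
fn frame_missed_vblank(padding: Duration, render_and_swap: impl FnOnce()) -> bool {
    let refresh_interval = Duration::from_secs_f64(1.0 / 60.0);
    let frame_start = Instant::now();
    busy_wait(padding);
    render_and_swap();
    frame_start.elapsed() > refresh_interval + Duration::from_millis(1)
}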

Please note that it may be harder than you think to check if you’re missing vblank. Vsync on Windows recently changed–in the past when you missed vblank with vsync enabled, you’d end up with a multiple of your target framerate. e.g. 60hz would become 30hz. As of a Windows update a year or two ago, your frame time doesn’t double in this scenario, but you don’t tear either–you just get stuttering instead. Don’t believe me? Try it. If you have any information why this might have changed, I’d love to hear from you.

Setting your “DPI Awareness”

Every couple versions of Windows, Microsoft introduces an entirely new way of handling DPI to fix issues that they couldn’t possibly have anticipated in previous versions, like users with multiple monitors.

We can either become DPI aware programmatically, or through an “Application Manifest”. I’ll demonstrate both methods here, and leave the choice up to you.

For those of you who support Linux via Proton, either option is fine--apps run through Proton appear to be automatically DPI aware even if you do nothing.

Setting DPI Awareness Programmatically

From Microsoft, emphasis mine:

It is recommended that you set the process-default DPI awareness via application manifest. See Setting the default DPI awareness for a process for more information. Setting the process-default DPI awareness via API call can lead to unexpected application behavior.

This is probably bullshit. I’ve never seen this API cause a problem, and moreover, if using the DPI API can cause “unexpected application behavior” then why the fuck is there a DPI API in the first place? They didn’t have to write this API, and they sure as hell didn’t have to document it.

GLFW uses the forbidden APIs, SFML uses the forbidden APIs, and SDL probably will soon too.

So it’s probably fine. If you’re feeling brave, call this function before anything that depends on the DPI to opt yourself into DPI awareness programmatically:

if (!SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2)) {
    LOG_WARN("SetProcessDpiAwarenessContext failed");
}

This function is part of User32.lib, and is defined in winuser.h which is included in Windows.h.
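The Rust translation is a little less direct since you need your own declaration (or the windows/winapi crate). Here's a sketch; DPI_AWARENESS_CONTEXT is a pseudo-handle, and the per-monitor-v2 value is -4.

(rust version)

// DPI_AWARENESS_CONTEXT is a pseudo-handle; PER_MONITOR_AWARE_V2 is the value -4.
type DpiAwarenessContext = isize;
const DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2: DpiAwarenessContext = -4;

#[link(name = "user32")]
extern "system" {
    fn SetProcessDpiAwarenessContext(value: DpiAwarenessContext) -> i32; // BOOL
}

fn enable_dpi_awareness() {
    // Call this before creating any windows or querying monitor metrics.
    if unsafe { SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2) } == 0 {
        eprintln!("SetProcessDpiAwarenessContext failed");
    }
}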

If your monitor is set to anything other than 100% scaling in Windows Settings, you should be able to see the difference visually, and any APIs that output measurements in pixels should now output actual pixels instead of scaled pixels.

Aside: Compatibility

As of April 5th, 2017, with the release of Windows 10 Version 1703, SetProcessDpiAwarenessContext (used above) is the replacement for SetProcessDpiAwareness, which in turn was a replacement for SetProcessDPIAware. Love the clear naming scheme.

You might be tempted to call one of the older versions of this function to remain compatible with older systems--I don't recommend doing that. SetProcessDpiAwareness requires additional work on top of the call itself to keep the titlebar in sync in windowed mode, and SetProcessDPIAware doesn't properly support multiple monitors. Besides, as of December 2021, only 4.16% of Windows users surveyed by Steam haven't updated to Windows 10 or Windows 11, and of the users who have, nearly 100% should be well past Windows 10 Version 1703 since Windows Update no longer asks for consent before messing with your system.

If you need to be backwards compatible you can load this API dynamically and skip it if it doesn’t exist, or just go the manifest route.
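For the dynamic-loading route, something like the sketch below works. It uses the libloading crate, but any GetProcAddress-style loader is equivalent; on systems older than Windows 10 Version 1703 the symbol lookup simply fails and we move on.

fn try_enable_dpi_awareness() {
    type SetProcessDpiAwarenessContextFn = unsafe extern "system" fn(isize) -> i32;
    const PER_MONITOR_AWARE_V2: isize = -4;

    unsafe {
        // If user32.dll or the symbol is missing, silently skip the call.
        if let Ok(user32) = libloading::Library::new("user32.dll") {
            if let Ok(set_context) =
                user32.get::<SetProcessDpiAwarenessContextFn>(b"SetProcessDpiAwarenessContext")
            {
                if set_context(PER_MONITOR_AWARE_V2) == 0 {
                    eprintln!("SetProcessDpiAwarenessContext failed");
                }
            }
        }
    }
}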

Setting DPI Awareness with an Application Manifest

“Application Manifests” are xml files that define options for your application. They are usually compiled and linked into your executable. Calling C functions to configure your app was too easy, I guess?

If you’re using Visual Studio then you probably already have one of these.

For the rest of us, it turns out that the compilation and linking step is optional, which is great, because it means we don’t need to use Microsoft’s compiler or linker to include a manifest. From Microsoft:

The build system in Visual Studio allows the manifest to be embedded in the final binary application file, or generated as an external file.

Microsoft recommends you embed your manifest file in the executable, but this is presumably so users don't move the exe without bringing the manifest along for the ride, or mess with its contents. Most games already ship external data folders with the same problem, so this isn't an issue for us.

If your game is located at foo/bar/game.exe, then you just need to create foo/bar/game.exe.manifest with the following content.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">
  <asmv3:application>
    <asmv3:windowsSettings>
      <dpiAware xmlns="http://schemas.microsoft.com/SMI/2005/WindowsSettings">true</dpiAware>
      <dpiAwareness xmlns="http://schemas.microsoft.com/SMI/2016/WindowsSettings">PerMonitorV2</dpiAwareness>
    </asmv3:windowsSettings>
  </asmv3:application>
</assembly>

This manifest will enable DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2 where supported, falling back first to PROCESS_SYSTEM_DPI_AWARE, and then to PROCESS_DPI_UNAWARE. PROCESS_PER_MONITOR_DPI_AWARE is intentionally left off since it requires additional code changes to render the titlebar correctly.

Just like with the programmatic route: if your monitor is set to anything other than 100% scaling in Windows Settings, you should be able to see the difference visually, and any APIs that output measurements in pixels should now output actual pixels instead of scaled pixels.

Getting the Scale Factor

This isn’t often relevant for games, but, if you need to check how much things would have been scaled if you weren’t DPI aware, you can call GetDpiForWindow and divide the result by 96.

Changing the Resolution

If you followed these steps but your game is running slower now, I didn't lie to you--you're just rendering more pixels than you were before. That's a good thing; you want the option to support the user's native resolution, right?

You probably want the option to render at a lower (or higher!) resolution too.

This should be done by rendering to a framebuffer that the user can resize from your game's options menu. I don't recommend messing with the user's desktop resolution: I'm willing to believe that was worthwhile in the past, but in the admittedly informal tests I've done on Windows 10 I've seen no benefit, and it's a bad experience for your players. The framebuffer method is also more flexible--you have the option to render your UI at a higher resolution than everything else.

I provide the player with an option called "resolution" that can be set anywhere from 20% to 200%. I think other games call this "render scale"; I guess I could rename it. The framebuffer is set to the given percent of the game window's current size. AFAICT there's no advantage to a preset list of supported resolutions unless you're messing with the monitor itself. Careful with aspect ratios, though.
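For what it's worth, the sizing math is just a clamped percentage of the window size. A sketch (the names here are mine, not Way of Rhea's actual code):

// Derive the framebuffer size from the window size and a render-scale
// percentage in the 20%..200% range described above.
fn render_target_size(window_size: (u32, u32), resolution_percent: u32) -> (u32, u32) {
    let scale = resolution_percent.clamp(20, 200) as f32 / 100.0;
    let width = ((window_size.0 as f32 * scale).round() as u32).max(1);
    let height = ((window_size.1 as f32 * scale).round() as u32).max(1);
    (width, height)
}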

Trick #3: Use the Faster GPU

Here’s the scene. Someone buys your game. They bought a laptop with a fancy 3080 or something (because scalpers bought all the standalone cards so how else are you gonna get one), and they wanna run your game on it. When they boot it up, it launches on their integrated Intel GPU instead of their 3080 and runs super slow. They don’t know that’s what happened though. To them, your game is just so slow that it can’t even run on their 3080.

Alright that’s no good. If we’re running on a laptop and we don’t explicitly request the faster card, we’ll often end up running on the slower integrated GPU instead. How do we make that request? Well, if you’re running a new graphics API, you may be able to enumerate the available graphics cards and pick the fastest one (or let the user pick it). I haven’t updated to the newer APIs yet, but if you’re on one of them and that’s correct, awesome–problem solved.

But what about all of us still on OpenGL?

Note: It was pointed out to me that the original version of the above story was unrealistic because GPU switching only works the way I’m describing on laptops, so I’ve updated the story and the paragraph following it to clarify this. I believe this is correct, but I don’t have a good source on it to link to, if you have one feel free to get in touch!

Hinting You Want the Faster GPU

You’re gonna have to hint to each graphics card vendor that you want the fast card. You can find the documentation for Nvidia here. At one point I also managed to track down AMD’s documentation on this, but I can’t seem to find it anymore.

In case your idea of a fun afternoon doesn't involve picking through the Nvidia docs to figure this out on your own, and then hunting down wherever AMD hid their explanation, I've written up what I learned from reading the docs and implementing their recommendations myself.

Aside: Testing on AMD GPUs

I’ve tested this code with Nvidia cards to make sure it has the desired effect. There seems to be little room for error, especially considering that I’ve been able to verify that AAA games often use the same method as I’ll show later, but unfortunately I don’t have the hardware to personally verify that this works on AMD.

If you have a computer with both an integrated GPU and a discrete AMD GPU and wanna see if this works, I'd much appreciate that. My demo is here; it should always run on the discrete GPU when available (you can see which GPU it's using by enabling basic stats in the debug menu). Keep in mind that Boot Camp doesn't support GPU switching, so unfortunately this test won't provide meaningful results on dual-booted MacBooks.

The NVIDIA docs provide two methods of preferring the NVIDIA card over integrated cards:

  • Linking with one of a long list of NVIDIA libraries
  • Exporting a special symbol

The former did not work in my tests, so I’m only demonstrating the latter (+ the AMD equivalent.) First, you need to add this code somewhere in your project:

(rust version)

// Exported markers that hint the NVIDIA and AMD drivers to prefer the discrete GPU.
#[no_mangle]
pub static NvOptimusEnablement: i32 = 1;
#[no_mangle]
pub static AmdPowerXpressRequestHighPerformance: i32 = 1;

(c++ version)

extern "C" {
    _declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
    _declspec(dllexport) DWORD AmdPowerXpressRequestHighPerformance = 0x00000001;
}

Next, when you compile, make sure your compiler exports these variables.

In some languages this will happen automatically if the proper keywords are provided. In Rust, you need to set the following linker flags to get this to happen. (I’m unfortunately not aware of any way to set this from Cargo.toml, I ended up creating a Makefile just for this purpose.)

cargo.exe rustc --release -- -C link-args="/EXPORT:NvOptimusEnablement /EXPORT:AmdPowerXpressRequestHighPerformance"

If you’ve done this, then on computers that have an integrated card and a discrete Nvidia or AMD card, the discrete card should be used by default.

Update: Linker Flags in Rust Without Makefiles

Turns out it’s possible to do this with a build.rs script!

Thanks to @asajeffrey for helping me figure out the correct syntax--a few people suggested this route, but the syntax was harder to get right than you'd think just looking at the result! Instead of creating a Makefile, you can create a build.rs script with the following content:

fn main() {
    println!("cargo:rustc-link-arg=/EXPORT:NvOptimusEnablement");
    println!("cargo:rustc-link-arg=/EXPORT:AmdPowerXpressRequestHighPerformance");
}

Checking That It Worked

How do we know it worked? Your game will, presumably, render faster, but we should also verify that we’re having the intended effect directly.

First, let's check that we successfully exported the symbols. Start a "Developer Command Prompt" (this requires installing Visual Studio), and then enter the following:

dumpbin /exports $YOUR_GAME.exe

Our two exported variables, as well as anything else you exported, should show up in the output.

You can also use this to check if AAA games use this method; in my experience they do.

Next, assuming you have hardware to test with, we can check that the exports are being respected by using glGetString with the following constants:

  • GL_VENDOR
  • GL_RENDERER
  • GL_VERSION
  • GL_SHADING_LANGUAGE_VERSION
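For example, here's a quick logging helper. This sketch uses the gl crate; call it once the context is current and the function pointers are loaded.

use std::ffi::CStr;

// Print which driver/GPU the OpenGL context actually landed on.
unsafe fn log_gl_strings() {
    for (label, name) in [
        ("vendor", gl::VENDOR),
        ("renderer", gl::RENDERER),
        ("version", gl::VERSION),
        ("shading_language_version", gl::SHADING_LANGUAGE_VERSION),
    ] {
        let ptr = gl::GetString(name);
        if !ptr.is_null() {
            let s = CStr::from_ptr(ptr as *const std::os::raw::c_char);
            println!("{}: {}", label, s.to_string_lossy());
        }
    }
}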

Without this change I get the following:

vendor: Intel
renderer: Intel(R) UHD Graphics 630
version: 4.1.0 - Build 26.20.100.7261
shading_language_version: 4.10 - Build 26.20.100.7261

With the change I get this:

vendor: NVIDIA Corporation
renderer: NVIDIA GeForce RTX 2060/PCIe/SSE2
version: 4.1.0 NVIDIA 471.68
shading_language_version: 4.10 NVIDIA via Cg compiler

Other ideas…

Windows is complicated, graphics cards are complicated. There are probably other flags and stuff I don’t know about. Feel free to email or Tweet at me if I missed anything!

Here are a couple of other things you could look into:


If you enjoyed this post and want to be notified about future posts, you can sign up for my mailing list or follow me on Twitter.

If this article helped you quickly shave a few milliseconds off your frame time, you have my permission to spend the rest of the work day playing Way of Rhea’s demo with full confidence that Windows will not unnecessarily throttle your experience.

Sharing this article on your social platform of choice is always much appreciated.

