Among Apple’s latest big announcements is Metal, a new graphics API for iOS that is not only extremely efficient but was also created specifically for the A7 chip. It lets developers take full advantage of the hardware in iOS devices and make games far more interactive, realistic, detailed and engaging.
That’s a huge deal, and our friends over at Unity did a great job of explaining exactly why…
Metal Overview
- Create and validate as much state up-front as possible. Shaders can be compiled and partially optimized offline. Everything related to rendering pipeline state (shaders, vertex layout, blending modes, render target formats, etc.) can be created and validated before rendering even starts. This means no more state checks on every draw call, which frees up a lot of CPU processing power.
- Enable much more versatile multi-threading. Resources can be created from any thread and there are several ways to prepare draw call submission from multiple threads in parallel.
- All iOS devices have shared memory for CPU & GPU. There’s no need to pretend that data from the CPU has to be “copied” into some video memory anymore. When you create a buffer, you just get a pointer to it, and that’s the same memory that the GPU sees.
- Let the user (engine) handle synchronization. OpenGL ES has to jump through lots of hoops and do lots of guesswork in order to behave in every imaginable scenario. In Metal, synchronization of data between CPU & GPU is the user’s responsibility. The engine has much better knowledge of what it is trying to do, after all!
- All GPUs in iOS devices use a Tile-Based Deferred Rendering architecture. This is explicitly reflected in the Metal API, particularly when it comes to render targets. The API does not try to guess anything anymore – all framebuffer-related actions, such as tile loads & stores and anti-aliasing resolves, are done explicitly.
- All the points above translate to much lower CPU overhead and much more predictable performance.
- A new C/C++11-based language is introduced for both graphics & compute shaders. This also means that iOS can do compute shaders, atomics, arbitrary buffer writes and similar fancy sounding tricks on the GPU now.
- No legacy baggage: the API is very simple & streamlined. Oh, and it also has a super-helpful optional “debug layer” that does extra validation and notifies you of any errors or mistakes you make.
Now let’s go into more detail!
Draw Call
If you’re making games, particularly mobile games, you’re probably aware of The Draw Call Problem. Each and every object that is rendered in the game has some CPU cost, and realistically on mobile right now you cannot afford to render more than a few hundred objects. In a real game, you also very much want to use the CPU for other things – gameplay logic, physics, AI, animations and so on. Unity has some measures to minimize the number of draw calls being made – static & dynamic batching, occlusion culling, LOD and distance-based layer culling; you can also merge nearby objects together and put textures into atlases to reduce the number of materials.
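To make the effect of batching concrete, here is a toy model in Python. It is not Unity's batching code: `SceneObject`, its `material` field and `count_draw_calls` are hypothetical names invented for this sketch, and it assumes that any objects sharing a material can be merged into one draw call, which real batching only allows under extra constraints.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical visible object; only the fields the sketch needs."""
    name: str
    material: str  # assumption: objects sharing a material can be batched


def count_draw_calls(objects, batching=True):
    """Return how many draw calls a toy renderer would issue for `objects`."""
    if not batching:
        # Naive case: one draw call per visible object.
        return len(objects)
    # Group objects by material; each group is merged and drawn once.
    batches = defaultdict(list)
    for obj in objects:
        batches[obj.material].append(obj)
    return len(batches)


# 301 objects that share only three materials (e.g. thanks to texture atlases).
scene = (
    [SceneObject(f"crate{i}", "wood_atlas") for i in range(200)]
    + [SceneObject(f"rock{i}", "stone_atlas") for i in range(100)]
    + [SceneObject("hero", "hero_skin")]
)

unbatched = count_draw_calls(scene, batching=False)
batched = count_draw_calls(scene, batching=True)
print(unbatched, batched)  # prints: 301 3
```

The point of the sketch is only the ratio: the fewer distinct materials a scene uses, the fewer times the CPU has to go through the per-draw-call path.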
A good question is – why does there have to be a CPU cost to render something? After all, it’s the GPU that is doing the actual work.
Some of the overhead is on “the engine” side – the CPU has to iterate over visible objects, figure out which shader passes need to be rendered, which lights affect which objects, which material parameters to apply and so on. Some of that is cached; some of that is multi-threaded; and generally this is platform-independent code. In each Unity release, we try to optimize this part, and Metal generally does not affect this.
However, the other part of the CPU overhead is in the “graphics API & driver” part. Depending on the game, this part can be significant. Metal is an attempt to make this part virtually go away by being a much better match for modern hardware, somewhat lower level, and doing massively less guesswork than OpenGL ES used to do. Up-front rendering state creation & validation, explicit actions related to render target loads & stores, no synchronization dances done on the API side: all these things contribute to much lower CPU overhead.
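The idea of up-front state validation can be sketched with a small Python model. Nothing here is the Metal API: `PipelineDescriptor`, `PipelineState`, `Validator` and the list of blend modes are all made up for illustration. The sketch only counts how often validation runs when state is re-checked on every draw versus checked once when a pipeline object is created.

```python
from dataclasses import dataclass

# Assumed set of legal blend modes, invented for this sketch.
VALID_BLEND_MODES = {"opaque", "alpha", "additive"}


@dataclass(frozen=True)
class PipelineDescriptor:
    """Hypothetical bundle of render state; not a real Metal type."""
    vertex_shader: str
    fragment_shader: str
    blend_mode: str
    target_format: str


class Validator:
    """Checks a descriptor and counts how many times it was asked to."""

    def __init__(self):
        self.checks = 0

    def validate(self, desc):
        self.checks += 1
        if not desc.vertex_shader or not desc.fragment_shader:
            raise ValueError("both shader stages are required")
        if desc.blend_mode not in VALID_BLEND_MODES:
            raise ValueError(f"unknown blend mode: {desc.blend_mode!r}")


def draw_revalidating(validator, desc, draw_count):
    """Older style: state is re-validated on every single draw call."""
    for _ in range(draw_count):
        validator.validate(desc)
    return draw_count


class PipelineState:
    """State validated once at creation and treated as immutable afterwards."""

    def __init__(self, validator, desc):
        validator.validate(desc)  # the only validation this state ever needs
        self.desc = desc


def draw_prevalidated(state, draw_count):
    """Up-front style: draws just reference the already validated state."""
    return draw_count  # no per-draw checks left to do


desc = PipelineDescriptor("vs_main", "fs_main", "alpha", "bgra8")

per_draw = Validator()
draw_revalidating(per_draw, desc, 500)

up_front = Validator()
state = PipelineState(up_front, desc)
draw_prevalidated(state, 500)

print(per_draw.checks, up_front.checks)  # prints: 500 1
```

A side effect of this design is that a bad descriptor fails when the pipeline object is built, not somewhere in the middle of a frame.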
Based on our testing so far, we have seen API+driver overhead shrink to just a few percent of CPU time. That is a tremendous improvement compared to the 15-40% of CPU time it used to take! That means the majority of the remaining CPU overhead is in our own code now. I guess we’ll have to continue optimizing that 🙂
We’re also looking forward to using Metal’s ability to submit rendering work from multiple threads; this opens up very interesting optimization opportunities as well.
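As a rough picture of what multi-threaded submission could look like, the Python sketch below has several workers each record draw commands into a private list, and then joins the lists in a fixed order. `encode_chunk` and `encode_frame` are hypothetical helpers, not Metal or Unity functions, and Python threads are used only to show the structure; they will not demonstrate a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_chunk(chunk_index, objects):
    """Record draw commands for one slice of the scene into a private list."""
    commands = []
    for obj in objects:
        commands.append(("draw", chunk_index, obj))
    return commands


def encode_frame(objects, worker_count=4):
    """Encode all objects using several workers, keeping submission order."""
    # Split the scene into contiguous chunks, roughly one per worker.
    chunk_size = max(1, (len(objects) + worker_count - 1) // worker_count)
    chunks = [objects[i:i + chunk_size] for i in range(0, len(objects), chunk_size)]
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # map() returns results in input order, so the final order is
        # deterministic no matter which worker finishes first.
        encoded = list(pool.map(encode_chunk, range(len(chunks)), chunks))
    frame = []
    for commands in encoded:
        frame.extend(commands)
    return frame


scene = [f"obj{i}" for i in range(10)]
frame = encode_frame(scene, worker_count=4)
print(len(frame))  # prints: 10
```

Because no worker touches another worker's list, no locking is needed while recording; the only ordering decision is made once, when the lists are joined.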
Compute Opportunity
With Metal, the GPU can be used for computation outside of the typical vertex+fragment shader area, in what are known as “compute shaders”. Basically, this is the ability to run any kind of “parallel computation” on the many little processors inside a GPU. Compute shaders also have a concept of “local storage” – a very fast piece of dedicated on-GPU memory that can be used to share data between these parallel work items. This particular piece of memory enables using the GPU for things that aren’t easily expressible in ye olde vertex and fragment shaders.
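One classic use of local storage is a parallel sum. The Python sketch below simulates it on the CPU: each "work group" copies its slice of the input into a small shared list and then combines values pairwise in a tree. `group_reduce_sum` and its `group_size` parameter are hypothetical, and the loops run sequentially here, whereas on a GPU the work items in each step would run in parallel with a barrier between steps.

```python
def group_reduce_sum(values, group_size=8):
    """Sum `values` the way a toy compute kernel might, one work group at a time."""
    if group_size <= 0 or group_size & (group_size - 1) != 0:
        raise ValueError("group_size must be a power of two")
    partial_sums = []
    for start in range(0, len(values), group_size):
        # Each work item copies one element into the group's local storage;
        # slots past the end of the input stay at 0.
        local = [0] * group_size
        for item in range(group_size):
            if start + item < len(values):
                local[item] = values[start + item]
        # Tree reduction: the number of active work items halves each step.
        stride = group_size // 2
        while stride > 0:
            # Items 0..stride-1 each read a slot that no item writes in this
            # step, so on a GPU they could all run at once.
            for item in range(stride):
                local[item] += local[item + stride]
            stride //= 2
        # Slot 0 now holds the sum of this group's slice.
        partial_sums.append(local[0])
    # Combine the per-group results (a second pass in a real kernel).
    return sum(partial_sums)


data = list(range(1, 101))
total = group_reduce_sum(data, group_size=8)
print(total)  # prints: 5050
```

Sharing intermediate results through local storage is what makes this pattern awkward to express with vertex and fragment shaders alone, since those stages give work items no comparable way to exchange data.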
There are tons of interesting areas to use compute shaders for — optimized post-processing effects, particle systems, shadow and light culling and so on.
While we aren’t using compute shaders much in Unity just yet, we’re looking forward to using them for more and more stuff. Exciting times ahead!