Brought to you by the Intel® Visual Computing Source
Download the source code and watch the demo
Introduction
This sample demonstrates how to render large-scale terrains
on Intel® microarchitecture codename Sandy Bridge in real time by efficiently
distributing the tasks between the CPU cores and the processor graphics unit.
The sample pre-processes an input height map into a hierarchical quadtree
representation which is used to render the terrain with adaptively selected
level of detail (LOD). The adaptive simplified triangulation calculated during
the pre-processing is compactly encoded to save runtime processing and memory
space. LOD construction is asynchronously performed by the CPU cores while
rendering is done by the processor graphics unit.
Application
Terrain Rendering is an application using DXUT and Microsoft
DirectX* 11 with D3D_FEATURE_LEVEL_10_0. The application handles all rendering,
user interaction and GUI. Upon initialization, the application loads all
models, allocates resources and compiles shaders. On the first run, the
application pre-calculates triangulation, which can take some time (up to one
minute), and stores it on the disk. On subsequent runs, the application loads
the data from the disk.
Overview
Terrain rendering demonstrates how to render large-scale
terrains on Intel microarchitecture codename Sandy Bridge in real time by
efficiently distributing the tasks between the CPU cores and the processor
graphics unit. The terrain rendering can be optimized by constructing
simplified triangulation, which is adaptive to the terrain surface
characteristics. Such triangulation contains more primitives in sharp regions
with high-frequency details and allocates a small number of large triangles for
flat areas. This significantly reduces the total triangle count while providing
almost the same visual quality (compare fig. 1 and 2).
Figure 1: Full-detail triangulation, 1 million triangles, 40 fps on
Intel® microarchitecture codename Sandy Bridge, 1280x1024 screen resolution
Figure 2: Adaptive simplified triangulation, 200k triangles, 112 fps on Intel® microarchitecture codename Sandy Bridge, 1280x1024 screen resolution
Full-resolution triangulation (fig. 1) is more than 5x redundant compared to
adaptive (fig. 2), which results in nearly 3x lower performance with almost the
same visual quality.
Pre-computing adaptive triangulation is a complex and computation-intensive
task. Doing this at runtime could require a significant amount of time and
would lead to perceptible stalls or delays. To solve this problem, the
application pre-computes adaptive triangulation for the whole terrain at the
pre-process stage and stores the resulting data on the disk in a compact
representation (see section 4). For example, the whole encoded triangulation
for an 8192×8192 terrain consumes approximately 6 MB (compare this to 128 MB
required to store the height map). At the runtime stage, this data is used to
efficiently construct the triangulation.
The triangulation must be adaptive not only to the terrain surface features,
but also to the camera position, because distant terrain regions can be
rendered using coarser representation without loss of visual fidelity. To support
multiple levels of detail (LODs), the input height map is pre-processed and a
patch quadtree data structure [3] is constructed (fig. 3). At runtime, the
level of detail selection is done on a per-patch level, not on the triangle
level. If the adaptive triangulation were constructed every frame, it would
require intensive data transfer between CPU and GPU memory, which is not
efficient. With patch-based LOD selection, the data is uploaded to the GPU
memory only when new patches are created, which happens once in a number of
frames.
Figure 3: Three levels of a patch quadtree
Each patch in the hierarchy is a height map of a fixed size with additional vertices required to
seamlessly connect neighboring patches. Each patch covers the same area as its
four direct children, but approximates the terrain with lower accuracy. Each
patch is assigned a unique adaptive triangulation which is pre-computed and
encoded as described in section 4.
For each patch in the hierarchy, a world-space geometric error τ is
pre-computed during the pre-processing. This error shows the maximum geometric
deviation of the patch polygonal surface to the height map samples at the
finest resolution covered by the patch. Given τ and camera position, we
can estimate the patch screen-space error, which is the maximum visible
deviation of the simplified model to the samples of the original full-detail
height map, using the following standard formula [3]:
where W and H are width and height of the view port, are horizontal and vertical fields
of view and dist(V,c) is the distance from the camera c to the nearest point on
the patch bounding box V.
At the runtime stage, an unbalanced patch quadtree is maintained with leaf
nodes satisfying the given screen-space error bound . Each patch in this tree stores
height map, normal map and adaptive triangulation indices. On each frame, a
recursive procedure is executed which updates the tree with respect to new
camera location (fig. 4). The procedure creates new patches and allocates resources
for terrain regions where additional accuracy is required (), and coarsens the representation
where LOD has become unreasonably high ().
Figure 4: Updating unbalanced patch quadtree.
Constructing a triangulation from the encoded representation is efficiently
handled by the CPU cores while rendering is performed by the graphics unit.
This technique is useful because it significantly reduces the GPU burden by
constructing a simplified adaptive triangulation. The most useful application
of this technique would be a system that has excess CPU computational power,
but is utilizing all of its GPU power.
Adaptive Triangulation
As it was mentioned above, each patch in the quadtree is
assigned its own unique triangulation which is adaptive to local terrain surface
features. This triangulation is computed at the pre-process stage and stored on
the disk. The adaptive triangulation construction exploits the method described
in [1] and [2]. To build the triangulation, all samples of the patch are
assigned to different levels as shown in figure 5. Note that to construct an
adaptive triangulation, a quadtree data structure is also used. To distinguish
this quadtree from the patch quadtree described above, we refer to it as the vertices
quadtree.
Figure 5: Assignment of height map samples to different levels of a
vertices quadtree
To guarantee that a triangulation constructed from a vertices quadtree does not
contain cracks, the quadtree is restricted with the dependency graph shown in
figure 6. Every vertex depends on two other vertices of the same or the next
finest level in the vertices quadtree hierarchy. Border vertices depend only on
one other vertex. This means that if the vertex is selected for triangulation,
then the related ones must be selected too.
Figure 6: Dependency relations on the vertices of the restricted
quadtree.
The described above vertices quadtree is tightly bound to the triangle binary
tree. The root of the triangle binary tree (level 0) is represented by a single
right triangle. Level n of the tree is obtained by bisecting each triangle in
level n-1 along its longest edge (fig. 7).
Figure 7: Triangle binary tree first four levels.
The coarsest possible patch triangulation is represented by two right
triangles, which are the roots of two triangle binary trees. The vertex in the
middle of the triangles’ longest edge is called base. If some vertex is
included into the restricted quadtree, it is called enabled, or disabled
otherwise. Now, if we have the correct set of enabled vertices (with all
dependencies properly kept), a crack-free triangulation can be constructed at
runtime using the following simple recursive procedure which starts from the
root:
1. if( triangle base vertex is enabled )
2. {
3. bisect the triangle
4. process two new triangles
5. }
6. else
7. output current triangle to the list
Algorithm 1: Constructing adaptive triangulation using set of enabled
flags
Figure 8: Enabled vertices and corresponding triangulation.
To determine a set of enabled vertices, the following bottom-up algorithm is
executed during the pre-process:
1. Clear enabled_array[] with false
2. for ( quadtree level l = finest resolution to coarsest resolution )
3. for ( each vertex v in level l )
4. {
5. if ( for all vertices d which v is dependent on: enabled_array[d] == false )
6. {
7. merge two triangles for which v is base vertex
8. calculate the coalesced triangle world space approximation error e
9. if ( e < threshold )
10. enabled_array[v] = false
11. else
12. enabled_array[v] = true
13. }
14. else
15. enabled_array[v] = true
16. }
Algorithm 2: Determining enabled vertices.
Triangle world-space approximation error is the maximum vertical distance of
all vertices covered by the triangle to the triangle plane. Adaptive
triangulation constructed with Algorithm 2 described above guarantees that the
maximum geometric deviation of the simplified triangulation from the original
height map is below the given threshold.
It is now clear that the whole patch triangulation is thoroughly described by
the set of flags indicating whether or not the vertex is enabled. These flags
can be efficiently encoded during the recursive traversal by outputting 1-bit
flags. This is done with the following algorithm, very similar to Algorithm 1:
1. if ( triangle base vertex is enabled )
2. {
3. output 1
4. bisect the triangle
5. process two new triangles
6. }
7. else
8. output 0
Algorithm 3: Encoding enabled flags.
Thus during the pre-processing, Algorithm 2 is first executed to determine the
set of enabled vertices, which is followed by Algorithm 3 encoding them. At
runtime, Algorithm 1 is executed, which uses pre-computed data to construct the
adaptive triangulation.
Implementation Details
The terrain rendering system is built up as a collection of
logically independent components. The principal architecture of the system is
shown in figure 9.
Figure 9: Terrain rendering system architecture principal scheme
The system contains an elevation data source (implemented with CElevationDataSource
class) and an encoded triangulation
data source (implemented as CTriangDataSource
class). The former provides access to the elevation data
through the GetElevData()
method that creates an instance of CPatchElevationData
class and returns pointer to it. The CPatchElevationData
class provides access to the stored
height map data through the GetDataPtr()
method.
The triangulation data source follows the same philosophy: when it is necessary
to create an adaptive triangulation for some patch, it creates an instance of CRQTTriangulation
class with the call to DecodeTriangulation()
. The triangulation indices are
generated by the call to GenerateIndices()
. CRQTTriangulation
class is also responsible for
determining enabled vertices and implements the described above Algorithm 2 by
the CTriangDataSource::CreateAdaptiveTriangulation()
method.
Algorithms 1 and 3 are implemented by the CRQTTriangulation::RecursiveGenerateIndices()
method. The method can operate in two
modes, encoding and decoding (indicated by the m_bIsEncodingMode
flag). In the first mode, the method
encodes the triangulation using enabled flags (pre-computed by Algorithm 2) as
input. Depending on the triangle level and orientation, the method determines
which vertex is its base. It then reads its enabled/disabled status from the
array and outputs a corresponding 1-bit flag into the output bit stream.
In decoding mode, the method reads the flags from the bit stream (which is
loaded from the disk) and sets appropriate enable/disable status in the enabled
flags array. At the same time, the method generates the triangulation.
All patch Microsoft Direct3D* resources are stored in an instance of CTerrainPatch
class. The resource management is
handled by the D3D resource cache. When a new texture is required, the cache
attempts to find an appropriate unused resource. If there are no spare
resources, the cache creates a new one. When resource is no longer needed, it
is not released, but placed into the cache. The cache is thread-safe so a
number of threads can access it simultaneously.
The quadtree construction is implemented by the CBlockBasedAdaptiveModel
class, while Microsoft DirectX*
11-specific methods are separated in CAdaptiveModelDX11Render
.
Asynchronous Task Execution
At the runtime stage, a quadtree-based adaptive
view-dependent terrain model is maintained. For this purpose, required patch
elevation data and adaptive triangulation are extracted from the corresponding
data sources, and terrain patches are created. To hide processing time required
to perform these tasks and eliminate stalls, the level-of-detail processing is
done asynchronously. Since LOD can be either increased or decreased, there are
two types of tasks that can be performed by the system, which are implemented
by CIncreaseLODTask
and CDecreaseLODTask
classes. These classes are derived
from the base class CTaskBase
which exposes Execute()
virtual method. To manage the tasks,
the system exploits TaskMgrTbb
component.
Normal Map Compression
A normal map is used to shade the terrain surface. There is
no need to store the normal map on the disk because it can be calculated in
runtime from the patch height map. This saves disk space but requires moderate
additional computations. Besides, if the terrain is dynamic and can be
deformed, the normal map cannot be statically pre-computed. To improve visual
quality, normal resolution can be higher than that of the height map. For
instance, if the height map has resolution (2^n+1)×(2^n+1) samples, then normal
map could have 4x higher resolution (4∙(2^n+1)×4∙(2^n+1) samples).
To reduce memory storage requirements, normal maps are kept in compressed form.
BC3 compression format is exploited that enables storing the normal map using 1
byte per normal. The compression is done asynchronously; the DXTCompressor
component is exploited for this purpose.
Performance
Figure 10 shows the performance during the recorded fly-over
captured on an Intel microarchitecture codename Sandy Bridge-based machine at
1280x1024 screen resolution.
Figure 10: Performance during the recorded fly-over for screen space
thresholds
With a three-pixel error threshold, the LOD changes are not annoying, while the
average performance is more than 100 fps on the Intel microarchitecture
codename Sandy Bridge-based machine. With a five- pixel threshold, geometry
changes become much more apparent, but the performance increases by a factor of
1.5x.
The time that is required to increase LOD primarily depends on the selected
normal map level-of-detail bias. With 4x normal map up-sample factor, it takes
approximately 44 ms to process one patch on one core. Thus, for a model
consisting of 150 patches, it takes approximately 6.6 seconds to construct the
whole model on one core. Since all patches are processed independently, the
workload could be evenly distributed across available cores, and as a result
the time scales well. Note also that during a typical fly-over, only a number
of patches are updated during one second (usually less than 10). If the camera
position drastically changes, the model will asynchronously be updated during a
number of frames, which will cause no stalls. Note that until the model is
updated, it will be rendered in coarse resolution.
Figure 11: Sample screenshot. Note that here a simple method is used
to colorize differences in elevation but a much better looking terrain can be
obtained by improving texture quality.
References
- P. Lindstrom, D. Koller, W. Ribarsky, L. F. Hodges, N.
Faust, and G. A. Turner. Real-time, continuous level of detail rendering of
height fields. In Proc. SIGGRAPH 96, pages 109-118. ACM SIGGRAPH, 1996.
-
Renato Pajarola. Large scale terrain visualization using the restricted
quadtree triangulation. In Proceedings Visualization 98, pages 19-26 and 515.
IEEE Computer Society Press, 1998.
-
Thatcher Ulrich. Rendering massive terrains using chunked level of detail
control. SIGGRAPH Course Notes (2002). Volume 3, Issue 5.