2018/02/02

The BIN-CSR Sparse Matrix Format

BIN-CSR is a sparse, symmetric matrix format, originally described by Daniel Weber et. al in their 2012 paper on GPU data structures. BIN-CSR is suitable for applications which use massively parallelized matrix operations, like some GPU-based finite element method solvers.

The diagonal elements of the matrix are stored separately, to enable fast preconditioning of the matrix. Preconditioning is a step which allows for faster convergence when performing a system solving method like the conjugate gradient method.

The WebGL demos on this page aim to demonstrate the construction of a BIN-CSR matrix in an interactive way.

The following is a dramatically over-engineered representation of our original matrix. Drag the slider to change the dimensions of the matrix.

The elements of this matrix are packed into padded bins. Each bin contains a set number of rows worth of data. The number of rows contained in a bin is called the bin width. The following is an illustration of bin data packing. The slider can be used to adjust the bin width.

The above visual shows the theoretical grouping of the data. The actual layout of the data in memory is optimized for coallesced memory access on the GPU. In actuality, the data is stored in four separate arrays, visualized here:

The values in the pointer array (gray) are indices into the value and column arrays (black), indicating the starting index for each row. The pointer values are represented by gray connecting lines.

2017/10/15

FEM Jello Cube

The second installment in my progress on a ductile fracture physics engine. So far, it supports neither ductile nor fracturable materials, but elastodynamic deformation is a step in the right direction.

So far, it’s pretty much a jello cube, created mostly following the FEM derivations by Erleben in Physics Based Animation. More technical updates to follow.

2017/10/10

Detecting Islands in Tetrahedral Meshes

In the following video a teapot made of 149210 tetrahedra is generated from a regular surface-bounded mesh. Fifty tetrahedra are removed each frame. After each removal the mesh is traversed in search of “islands” - groups of adjacent tetrahedra which share no faces with other groups - which are indicated by color.

I am beginning work on an FEM based model for deforming and fracturing physics objects. Such a model requires that objects to be composed of assemblies of tetrahedrons (theoretically FEM could work on more complex element shapes than the tetrahedron, but I will make no such attempt :P).

Data Structure

In an effort to keep things real-time and prepare for GPU parallelization, I have required that my tetrahedral mesh implementation keep node and tetrahedron memory in mostly contiguous arrays. This will make GPU data binding more feasible later on.

For mesh traversal, some adjacency information is needed between tetrahedrons. I have seen two methods used for this. One is to link nodes to tetrahedra, but this would require either that each node to contain dynamic data (a list of pointers to the elements which contain them), or that the number of tetrahedra per node is limited.

The other method is to track tetrahedron face adjacencies. Since each tetrahedron has at most four face-adjacencies, tetrahedron and node data can still be packed tightly in memory with a predictable stride.

Algorithms

Removal of a tetrahedron is simple, and can be done in constant time. The removal of tetrahedron i is performed by moving the last tetrahedron (indexed by n-1 where n is the number of tetrahedra) into slot i, and correcting it’s neighboring tetrahedra’ adjacencies. This can be done in constant time.

Traversing the structure for the detection and removal of an island is fairly straight forward. I use a stack rather than recursion, and record the indexes of traversed tetraheda and nodes using two large bitmasks.

When a node is copied into a new mesh, it’s index will have changed, and so the indexes stored in each copied tetrahedron which references that node must also change. A simple array of integers, constructed during node-copy, can be used to map the old indexes to the new ones.

Tetrahedron indexes will also change. The same index-mapping method is used to update tetrahedron neighbors in the new mesh.

Each of the traversal/island removal phases has a complexity of O(N) where N is the number of nodes. If M elements are deleted, then in the worst case N*M*4 islands might be generated, so complete island separation has a complexity of O(N*M).

Summary

The main purpose of this approach is to keep connected tetrahedral mesh data contiguous in memory.

There are faster ways to split tetrahedral meshes (for example, keeping a node pool instead of allowing a single mesh to “own” its nodes’ memory, and adding an inverted node-tetrahedron relationship).

However, the method I described here will result in a memory layout which is close to ideal for doing per-mesh calculations such as those which are optimal in FEM techniques. Furthermore, mesh breakage will not happen at every frame, but stress calculations will.

2017/06/12

Simple Deferred Renderer

A super basic deferred renderer with shadows, for a graphics course I’m auditing.

2017/05/30

Inline Tensor Convolution Using Template Expansion

Last year I faced a problem which required me to compute products (convolutions) between vectors, matrixes, vectors of matrixes (rank-3 tensors).

I wanted to do this in C++ and the best option I could find seemed to be FTensor. Although it probably would have worked for my particular case, FTensor seems to have been written in an older style which failed to make use of variadics, and I wanted something a bit more generic.

In a quest for further flexibility I began to work on my own tensor library, Weyl.

Weyl is header only and contains just one core class, tensor, whose entire source code is very short (at the moment, 367 lines including comments). Thanks to fancy-pants template expansion, large tensor convolutions are (usually!) compiled into a list of simple add and multiply operations.

Here’s the sort of thing I aim to do.

weyl::tensor<float, 3, 3, 5> a({ ... });
weyl::tensor<float, 2, 5, 4> b({ ... });
weyl::tensor<float, 3, 3, 2, 4> c = weyl::sum<2, 1>(a, b);

This ought to be equivalent to the tensor convolution through a Riemann sum over one index.

\[\begin{align*} C_{ijlm} = \sum_{k=1}^5 A_{ijk}B_{lkm} \end{align*}\]

In order to produce this result, we need the template function sum to expand to something equivalent to the following set of embedded loops (assuming that the values in the resulting tensor were initialized to zero).

for (i = 0; i < 3; ++i) // i and j are a's contribution to c's rank indexes
for (j = 0; j < 3; ++j)
for (l = 0; l < 2; ++l) // l and m are b's contribution to c's rank indexes
for (m = 0; m < 4; ++m)
for (k = 0; k < 5; ++k) // k is the index of the dimension to sum over
    c[i][j][l][m] += a[i][j][k] * b[l][k][m];

This is essentially what weyl::tensor produces, and since the bounds of each loop are known, they are unrolled at compile time. You can see the sort of assembly it produces here on Godbolt. For brevity, the full source is not included in this example.

2017/04/16

Rigidbody Physics Engine - Log 7 - Cached Constraint Results

The Jenga tower on the left uses a warm starting technique which begins the MLCP solver with the results from the previous frame. The tower on the right begins from zero each frame.

It’s not shown here, but after a full five minutes of running, the left tower still hasn’t collapsed. The right tower collapses in around 50 seconds.

Warm starting helps in frame-coherent situations like this because the MLCP solver is built to iteratively converge on a solution, but has a finite number of allowed iterations. With lots of stacking, the iterations get truncated before a full solution is reached. If the solver is warm-started, each frame the final solution will be closer to the correct one in a highly coherent scene.

2017/04/09

Rigidbody Physics Engine - Log 6 - Jenga Tower

For friction, I ended up applying simple Newtonian impulses for before starting the constraint solver. This saves some computation inside the PGS iterations (which are the biggest bottleneck), and not much is really lost since friction doesn’t need to be precise to maintain realism. I got this idea from reading the Box2D source code from Erin Catto.

Next step will be to use the cached normal constraint impulse values to scale the friction (the Coulomb model). Once that’s done, it should be far more difficult to pull out the lowest jenga block.

2017/04/05

Rigidbody Physics Engine - Log 5 - Box Stacking

Feels good to finally have some stacked boxes!

The biggest, most obvious missing factor is friction. That is next on the agenda - should come almost for free when I refactor my PGS block solver (yet again :

There are many optimizations needed still in order to get more interesting interactions going. My PGS (constraint resolution) implementation is still fairly slow and currently the tightest bottleneck. Close second place is EPA (contact generation), which is due to get spruced up soon anyway.