Playing with Gravity to Learn CUDA: Simulation Display

We're building a gravity simulation on a GPU using CUDA, and last time we actually got a working simulator running for a 2-body system. That was exciting, but there were some drawbacks, namely the printout-copy-paste-to-spreadsheet-and-graph way of visualizing the simulation. In this post we're going to remedy that issue with a simulation display that shows the position of the bodies graphically while the simulation is running and that also runs on the graphics card alongside the simulation. We'll even make the position buffer multipurpose so that we can calculate the positions directly into this buffer, and then turn around and draw those positions into a window from the same buffer. No copying required, not even automated copying.

Making this display for the gravity simulator turned out to be more difficult than I thought because I haven't programmed in OpenGL since the graphics course I took in college, and I've certainly never done any OpenGL-CUDA interop before. I managed to pull something together by leaning heavily on the N-Body sample project that's part of the nVidia CUDA sample projects. This sample project is also a gravity simulator, but the underlying simulation engine is substantially different than the one we've built so far. Even so, I was able to use the renderer without any modifications. Let's see how it works.

Initialization

When discussing the simulation kernel, we took a bottom-up approach, but now that the kernel already exists and won't change much while we add the display, we'll take a top-down approach for the changes that go into building the simulation display. The first thing we'll do is move main() to a new file called multi_body_gravity.cpp, since it will no longer only set up the simulation engine. This file will contain both OpenGL and CUDA code, so it's going to need a few more headers and some global objects and constants:

#include <helper_gl.h>
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
#include <GL/wglew.h>
#endif
#include <GL/freeglut.h>

#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <helper_cuda.h>

#include "render_particles.h"
#include "kernel.h"

ParticleRenderer* _renderer;
unsigned int _pbo[2];
cudaGraphicsResource* _pGRes[2];

const float _scale = 3.0e11;
const float _point_size = _scale / 50.0;
const int _num_bodies = 2;
const float _dt = 900;

Some of these includes are found directly in the CUDA samples project directory, so I had to add references to those subdirectories in the project properties for the C/C++ and CUDA compiler include directories and linker library directories. The ParticleRenderer class is actually the unmodified renderer from the N-Body sample project, so I copied the render_particles.h and render_particles.cpp files over to my project. The _pbo[2] array is the pair of pixel buffer objects that will hold the positions of each body in the simulation. The _pGRes[2] array is the corresponding CUDA graphics resources that enable the connection between CUDA and OpenGL through those pixel buffer objects. We'll see the initialization code that sets these structures up in a minute. Finally, we have a few global simulation constants. The _scale and _point_size make it easy to set the area of the simulation that the camera will see and display so that the bodies are actually visible in the display window. The _num_bodies and _dt were promoted to global constants from the initialization code of the simulator because they'll be used more extensively throughout the initialization code.

Now let's take a look at main() because it's changed quite a bit:

int main(int argc, char** argv) {
    initGL(&argc, argv);

    const float m[_num_bodies] = { 1.989e30, 5.972e24 };
    const vec3d d0[_num_bodies] = { {0, 0, 0, 1}, {1.496e11, 0, 0, 1} };
    const vec3d v0[_num_bodies] = { {0, -8.9415e-2, 0, 1}, {0, 2.978e4, 0, 1} };

    init(m, v0, d0);

    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutIdleFunc(idle);

    glutMainLoop();

    finalize();

    return 0;
}

The calls to the simulation engine are gone, and a bunch of other initialization calls are here instead. We start by calling initGL() to initialize the OpenGL system, and we pass in the command line arguments because we can pass OpenGL command line configuration values into the OpenGL subsystem this way. Then we initialize the mass, position, and velocity of the bodies like we did before (I will likely make this its own function in the future, which is why it's still sitting in main() for now), but notice that we're using vec3d instead of vec2d structs. I had to switch to using a 3D vector with alpha, since that's the point format that the renderer expects. This struct is now in a new header kernel.h:

typedef struct vec3d {
    float x;
    float y;
    float z;
    float a;
} vec3d;

cudaError_t initCuda(const float* m, const vec3d* v0, unsigned int size);
cudaError_t gravitySim(vec3d** d, const float dt, unsigned int size);
void finalizeCuda();

As you can see, there are a couple of new functions in the kernel, too, but we'll get to those later. For now, back in main() we call init() to initialize the rest of the simulator and renderer. Then, we define the OpenGL display, reshape, and idle functions before calling the main loop that starts drawing frames to the screen. This loop is where display() gets called over and over again. Finally, we call finalize() to close up shop when the program closes. We'll take a look at each of these functions in turn, starting with initGL():

void initGL(int* argc, char** argv) {
    // First initialize OpenGL context, so we can properly set the GL for CUDA.
    // This is necessary in order to achieve optimal performance with OpenGL/CUDA interop.
    glutInit(argc, argv);
    glutInitDisplayMode(GLUT_RGB | GLUT_DEPTH | GLUT_DOUBLE);
    glutInitWindowSize(720, 480);
    glutCreateWindow("CUDA n-body gravity simulator");

    if (!isGLVersionSupported(2, 0) ||
        !areGLExtensionsSupported(
            "GL_ARB_multitexture "
            "GL_ARB_vertex_buffer_object")) {
        fprintf(stderr, "Required OpenGL extensions missing.");
        exit(EXIT_FAILURE);
    } else {
#if   defined(WIN32)
        wglSwapIntervalEXT(0);
#elif defined(LINUX)
        glxSwapIntervalSGI(0);
#endif
    }

    glEnable(GL_DEPTH_TEST);
    glClearColor(0.0, 0.0, 0.0, 1.0);

    checkGLErrors("initGL");
}

The first few API calls here are fairly self-explanatory. Then, there's the if-else statement that is all required for the vertex and pixel shaders that are used in the renderer. (I tried calling wglSwapIntervalEXT() immediately since I know I'm not on a Linux system, but that did not work.) Before returning from the function, we enable the GL_DEPTH_TEST to enable z-clipping, clear the window color to black, and check for errors. Next up in main() is the init() function:

void init(const float* m, const vec3d* v0, const vec3d* d0) {
    // create the position pixel buffer objects for rendering
    // we will actually compute directly from this memory in CUDA too
    glGenBuffers(2, (GLuint*)_pbo);

    unsigned int memSize = sizeof(float) * 4 * _num_bodies;
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_ARRAY_BUFFER, _pbo[i]);
        glBufferData(GL_ARRAY_BUFFER, memSize, d0, GL_DYNAMIC_DRAW);

        int size = 0;
        glGetBufferParameteriv(GL_ARRAY_BUFFER, GL_BUFFER_SIZE, (GLint*)&size);

        if ((unsigned)size != memSize) {
            fprintf(stderr, "WARNING: Pixel Buffer Object allocation failed!n");
        }

        glBindBuffer(GL_ARRAY_BUFFER, 0);
        checkCudaErrors(cudaGraphicsGLRegisterBuffer(&_pGRes[i], _pbo[i], cudaGraphicsMapFlagsNone));
    }

    initCuda(m, v0, _num_bodies);

    _renderer = new ParticleRenderer;
    
    float color[4] = { 1.0f, 0.6f, 0.3f, 1.0f };
    _renderer->setBaseColor(color);

    float* hColor = new float[_num_bodies * 4];
    for (int i = 0; i < sizeof(hColor); i += 4) {
        hColor[i] = color[0];
        hColor[i + 1] = color[1];
        hColor[i + 2] = color[2];
        hColor[i + 3] = color[3];
    }
    _renderer->setColors(hColor, _num_bodies);
    _renderer->setSpriteSize(_point_size);
    _renderer->setPBO(_pbo[1], _num_bodies, false);
}

This part of the initialization starts off by generating the two pixel buffer objects, and loading them with the initial position data using glBufferData(). Next, we need to allocate the CUDA device buffers, but this no longer includes the position buffers, so that initialization code is somewhat reduced from before:

cudaError_t initCuda(const float* m, const vec3d* v0, unsigned int size)
{
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);

    // Allocate GPU buffers for 3 vectors (two input, one output)
    cudaStatus = cudaMalloc((void**)&dev_v, size * sizeof(vec3d));
    cudaStatus = cudaMalloc((void**)&dev_m, size * sizeof(float));
    cudaStatus = cudaMalloc((void**)&dev_v0, size * sizeof(vec3d));

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_m, m, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaStatus = cudaMemcpy(dev_v0, v0, size * sizeof(vec3d), cudaMemcpyHostToDevice);

    return cudaStatus;
}

Back in init() we then configure the renderer with its base color, colors for each body, the sprite size of the bodies, and the pixel buffer object. Notice that we only set the second of the two PBOs because we're going to take a shortcut and calculate two timesteps with each simulation pass. This trick allows us to make the simulation twice as accurate because we can reduce the timestep, interchange the buffers automatically with two function calls, and always use the same PBO so we can set it in initialization and forget it. Technically, this means we only need one PBO, but this setup works just fine. That's it for initialization, so let's look at how we display the points in the window.

Display

This is where the magic happens. Each time the window is redrawn, OpenGL calls display() to do the drawing to the screen. We need to step the simulation during this time to know where to draw the bodies next, and this is what it looks like:

void display() {
    vec3d* dev_d[2];

    checkCudaErrors(cudaGraphicsMapResources(2, _pGRes, 0));
    size_t bytes;
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void**)&(dev_d[0]), &bytes, _pGRes[0]));
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void**)&(dev_d[1]), &bytes, _pGRes[1]));

    gravitySim(dev_d, _dt, _num_bodies);

    checkCudaErrors(cudaGraphicsUnmapResources(2, _pGRes, 0));

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glTranslatef(0, 0, -_scale);

    _renderer->display(ParticleRenderer::PARTICLE_SPRITES_COLOR);

    glutSwapBuffers();

    glutReportErrors();
}

In the first half of this display() function, we are mapping the graphics resource that we had set up in the initialization to a pair of vec3d pointers. These pointers to buffers are what the dev_d0 and dev_d buffers used to be, and we can pass them into the gravity simulator to use in the kernel as if they are just like any other GPU memory buffer that was allocated with CUDA. On the first call to display(), these buffers will contain the initial positions that were copied in for the bodies, and after that they'll contain the calculated positions after each time step. The rest of the display() function clears the window, sets up the camera view to point at the origin from -_scale distance in the z-direction, tells the renderer to draw the bodies using the pixel buffer objects, and finally, swap the ping-pong screen buffers and report any OpenGL errors.

Because we're still calculating positions and velocities with 2D points, even though they're inside 3D points, all of the bodies will appear on the x-y plane, and the camera will be looking directly at the X-Y plane from far away on the z-axis. We won't go into how the renderer actually draws the bodies since that _renderer->display() function was used as-is, but we will take a deeper look at that gravitySim() function call:

cudaError_t gravitySim(vec3d** dev_d, const float dt, unsigned int size) {
    cudaError_t cudaStatus = cudaSuccess;
    vec3d *d = new vec3d[size];

    cudaStatus = gravityIter(d, dev_v, dev_d[1], dev_m, dev_v0, dev_d[0], dt, size);
    cudaStatus = gravityIter(d, dev_v0, dev_d[0], dev_m, dev_v, dev_d[1], dt, size);
    return cudaStatus;
}

The call to run the gravity simulation has become extremely simple. It's just two calls to gravityIter() to step the simulation twice with dev_d[0] as the input positions and dev_d[1] as the output positions for one call, and then they're swapped for the second call. The dev_v0 and dev_v pointers to the velocity buffers are swapped for each call as well. The gravityIter() function is unmodified, so it still copies the positions from the GPU memory back to host memory, requiring us to pass in a vec3d array to hold the values. This feature is nice for debugging, but once things are working that extra GPU memory transfer can be disabled to speed up the simulation.

That's about it for drawing body positions to the screen, but there are two other functions that we set up with OpenGL that get called periodically. The reshape() function will get called whenever the simulation window gets resized, and it looks like this:

void reshape(int w, int h) {
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluPerspective(60.0, (float)w / (float)h, _scale * 2 / 1.0e6, _scale * 2);

    glMatrixMode(GL_MODELVIEW);
    glViewport(0, 0, w, h);
}

All this function does is adjust the camera perspective to match the new window dimensions. The last two arguments in gluPerspective() are for the near and far draw distance, and only the objects within this range will actually be drawn to the window. The _scale constant is used to set the draw area, and this is important because we're working with astronomical distances here. The default draw range isn't enough. When I was first testing this out, I could get bodies drawn to the screen for the small debug system, but nothing would get drawn for the Earth-Sun simulation. It actually took a while to realize that these settings in gluPerspective() were why I wasn't getting anything on the screen for the Earth-Sun simulation. The problem was that everything was getting clipped out because the draw distance was so short! So, remember to set the draw distance correctly if you can't see anything in your view.

The last drawing function is also the easiest:

void idle(void) {
    glutPostRedisplay();
}

The idle() function is called whenever no other processing is happening, and all it does is call glutPostRedisplay() to tell the system that it can start drawing the next frame.

Cleanup

We now have initialization and display complete, so all that's left is the cleanup when the program terminates. The cleanup is pretty simple, and it all happens in finalize():

void finalize() {
    checkCudaErrors(cudaGraphicsUnregisterResource(_pGRes[0]));
    checkCudaErrors(cudaGraphicsUnregisterResource(_pGRes[1]));
    glDeleteBuffers(2, (const GLuint*)_pbo);

    finalizeCuda();
}

First, we unwind the graphics resources and pixel buffer objects that we had created. Then, we call finalizeCuda():

void finalizeCuda() {
    cudaFree(dev_v);
    cudaFree(dev_m);
    cudaFree(dev_v0);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    checkCudaErrors(cudaDeviceReset());
}

This function simply frees the remaining CUDA buffers from the GPU and does a device reset to make sure the debugging tools have accurate data. That completes the modifications to our simulator to enable a graphical display, and I'm pretty happy with the organization. From a high level it consists of a simulation engine using CUDA, a rendering engine using OpenGL, and a controller that sits between them and runs the initialization, interop, and cleanup. It's all nice and tidy.

Running a Simulation

Now that we have a visual display of our simulation, we can take a look at the Earth-Sun orbit in motion:

In order to get this simulation, I lowered the time step to be every half hour. The old value from the last post of every 6 hours resulted in a simulation that went way too fast. The other benefit of the lower time step is increased simulation accuracy, but even with the increased accuracy I saw that the Earth was slowly spiraling towards the top of the screen. To debug this issue, I zoomed in on the Sun and found that it, too, was slowly moving up as the simulation ran. The reason behind this drift was that the initial velocity of the Sun was set to zero, but the Earth had an initial velocity in the positive y-direction. This setup gave the whole system a small amount of momentum in the positive y-direction, causing the Earth and Sun to slowly drift up. To correct for this momentum, I just needed to give the Sun a small velocity in the negative y-direction, and the value needed to be inversely proportional to the Sun's mass and proportional to the Earth's mass and velocity. The exact value of -0.089415 m/s is already shown in the new initial velocity for the Sun in the initialization code section above. With that adjustment, we can zoom in on the Sun by changing the _scale to 2.0e6 and see what the Sun's motion really looks like:

I added a zero-mass reference point at the origin so that it's clear that the Sun returns to its starting point after completing a full circle. This is all very cool, and it now looks like we have a fully functional gravity simulator with a real time display. Of course, I can imagine a number of improvements to the display that would make things even better: dynamic zoom and pan controls, different colors for the different bodies, a grid with distance values, etc. That's all doable stuff, but I want to make more progress on the simulation engine itself, so next time we'll see about scaling up the simulation with more bodies and see how far we can go with this CUDA kernel.

Lucid Mesh

Search This Blog