
Playing with Gravity to Learn CUDA: Optimization

The CUDA gravity simulator has reached the point where it has a simulation engine that can display a 1024-body simulation in real-time. (Check it out from the beginning.) In getting to that point, we hit the limit on the number of threads that can be started in a single thread block in CUDA, but of course, that is not the end of the story. We can still increase the number of bodies in the simulation further, and after we've explored how to do that, we'll experiment with the parameters of the simulation to see if we can get anything interesting going that looks like a star cluster. Spoiler alert: we can, and we will.

Model of the Solar System

Pushing Up the Thread Count

In the last post we stopped at 1024 threads not because that's all the GeForce GTX 1050 card I'm using can handle, but because that's a hard limit in CUDA on the number of threads that can be started within a single block, regardless of which GPU you're using. However, a kernel can be launched with more than one block, so we can raise the block count to run more threads simultaneously, and thus simulate more bodies, without needing to change the kernel. The number of blocks is specified as the first parameter in the special <<< blocks, threads >>> syntax when calling a kernel, and so far we've always set blocks equal to 1.

Instead of leaving the number of threads per block at 1024 and only increasing the number of blocks, we can squeeze more efficiency out of the simulation by knowing a bit more about the CUDA execution model. When threads execute on the GPU cores, they are organized into groups referred to as "warps." Each warp is a fixed group of 32 threads that execute on the hardware in lock-step. Since the warp size is fixed at 32, the number of cores on a GPU will generally be a multiple of 32 for everything to work out. For example, my GTX 1050 has 640 cores, which is 32 × 20. If we launch a number of threads that isn't a multiple of 32, the last warp is padded out with idle threads, so we might as well simulate a multiple of 32 bodies to use the hardware most efficiently.
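If we ever wanted a body count that isn't already a multiple of 32, we could pad it up (along with the data arrays) with a small helper like this. This is my own sketch, not code from the simulator:

const unsigned int WARP_SIZE = 32;

// Round a body count up to the next multiple of the warp size so the last
// warp has no idle lanes, e.g. 1000 -> 1024.
unsigned int roundUpToWarpSize(unsigned int n) {
    return ((n + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
}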

We can also take advantage of the fact that thread execution is managed at the block level, and keeping each block down to a small number of warps gives the GPU the most flexibility in assigning resources and hiding memory latency. If a kernel does only a small amount of computation relative to its memory accesses (loads of the variables passed into the kernel), then a somewhat higher number of warps per block helps keep the cores busy running some threads while others are waiting for memory. Multiple warps can be assigned to the same set of 32 cores, and one warp executes while another waits for memory; this warp switching is handled in hardware and is extremely efficient. In our case, there's enough computation in the kernel that one warp per block is enough to hide the memory latency. This wasn't determined by code inspection and reasoning about the execution model alone, but by taking time measurements while the simulator was running, as shown in this code:
cudaError_t gravityIter(vec3d* d, vec3d* dev_v, vec3d* dev_d, const float* dev_m, const vec3d* dev_v0, const vec3d* dev_d0, float dt, unsigned int size) {
    cudaError_t cudaStatus = cudaSuccess;
    clock_t t = clock();

    // Launch a kernel on the GPU with one thread for each body.
    unsigned int blocks = size / 32;
    gravityKernel <<< blocks, 32 >>> (dev_v, dev_d, dev_m, dev_v0, dev_d0, dt);

    // cudaDeviceSynchronize waits for the kernel to finish and reports any errors
    cudaStatus = cudaDeviceSynchronize();

    printf("t_iter = %d\n", (int)(clock() - t));

    return cudaStatus;
}
This is the same code that was shown in previous posts to call gravityKernel(), but with the error checking removed for clarity. Here, the number of blocks is determined by dividing the simulation size by 32, and the blocks and threads are passed to the kernel launch as <<< blocks, 32 >>>. The performance measurement simply reads the clock before the kernel starts and again after all of the threads have synchronized, and prints the difference. Lower is better, and there was no benefit to increasing the number of threads per block above 32. At the original value of 1024 threads per block, kernel execution was actually less efficient because those few large blocks couldn't be spread well across the 640 cores in the GPU, so there is a tradeoff between memory latency hiding and thread management for any given kernel.
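As an aside, clock() measures elapsed time on the host; for a finer-grained measurement of just the kernel, CUDA events could be used instead. Here's a minimal sketch, assuming the same gravityKernel launch as above (this isn't code from the simulator):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
gravityKernel <<< blocks, 32 >>> (dev_v, dev_d, dev_m, dev_v0, dev_d0, dt);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait for the kernel to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds
printf("t_iter = %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);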

In order to use this blocks/threads setup, we also need to make one change to the kernel code, because the index into the various data arrays is no longer determined by the thread index alone. The block index now comes into play as well, so we calculate the overall index using the built-in blockIdx and blockDim variables as follows:
    int i = blockIdx.x * blockDim.x + threadIdx.x;
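To see where that line sits, here's a rough skeleton of the kernel with the new indexing; the parameter names are my own guesses based on the call above, and the force and integration code from the earlier posts is elided:

__global__ void gravityKernel(vec3d* v, vec3d* d, const float* m,
                              const vec3d* v0, const vec3d* d0, float dt) {
    // Global body index: this block's offset plus the thread's position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // ... accumulate the gravitational acceleration on body i from every other
    //     body, then integrate v[i] and d[i] forward by dt (see previous posts) ...
}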
With this thread organization settled, we can start scaling up the number of bodies again, using the time measurement to figure out when enough is enough. We want the simulation to still look fairly smooth on the display, so targeting about 30 fps is desirable, which means each kernel execution should take no more than about 33 ms. Kernel time grows roughly quadratically with the number of bodies, because each body calculates the force from every other body in the system and we'll be well past one body per core. I found that for my GPU, 8 bodies per core, or 5,120 bodies total, still resulted in a smooth simulation and is pretty respectable for a star cluster. Any more than that, and things visibly slowed down.

The number of bodies should also be a multiple of the number of cores: if the last batch of warps occupied only a fraction of the available cores, the rest would sit idle for that part of the simulation cycle. Remember, all threads in a warp execute in lock-step, every thread runs exactly the same code so they finish at nearly the same time, and all threads are synchronized at the end of the kernel execution. That means all cores should finish at nearly the same time, so choosing the number of bodies as a multiple of the number of cores makes perfect sense.
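For the record, here's the launch arithmetic that configuration works out to (my own numbers, matching the text above):

const unsigned int threadsPerBlock = 32;                 // one warp per block
const unsigned int bodies = 8 * 640;                     // 8 bodies per core = 5,120 bodies
const unsigned int blocks = bodies / threadsPerBlock;    // 160 blocks
// All-pairs force calculation: 5,120 * 5,119 ~ 26 million interactions per step,
// which, per the experiment above, still finishes within the ~33 ms frame budget on this GPU.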

Experimenting with a Star Cluster

Now that we have a reasonable number of bodies for a star cluster simulation, let's see what that looks like. We had left the simulation configuration at the scale of the solar system, but now that we're simulating a star cluster, we need to expand that out substantially. Let's try a space extending about 100 light-years from the center. We need that in meters, and a light-year is 9.46e15 meters, so we'll use 9.46e17 for our scale. We also can't continue to simulate in half-hour increments at this scale, or we'll be waiting forever for anything to happen; increments of 1,000 years seem more appropriate. That gives us the following constants for our simulation:
const float _scale = 9.46e17;             // 100 light-years, in meters
const float _point_size = _scale / 50.0;  // rendered size of each body
const int _num_bodies = 8 * 640;          // 8 bodies per core on a 640-core GPU
const float _dt = 3600*24*365.25*1000;    // 1,000 years, in seconds (~3.16e10)
Then we need to set up the stars in the system. Realistically, the distribution of stars in a cluster (or anywhere in the universe) is biased towards red dwarf or M-class stars, which make up 75% of all stars, but are only 0.08-0.45 solar masses. Blue giant or O-class stars make up only a tiny fraction of all stars at 0.00003%, but are greater than 16 solar masses. Instead of trying to reproduce this distribution, we'll take the easy way out and model a uniform distribution of 0.5-1.5 solar masses for our star cluster. It should be sufficient for this exploration, and it's simple to understand and code. We can set up this star configuration in initBodies() as follows:
void initBodies(float* m, vec3d* v, vec3d* d) {
    // set origin
    m[0] = 0.0;
    d[0] = { 0, 0, 0, 1 };
    v[0] = { 0, 0, 0, 1 };
    
    float nominal_mass = 2.0e30;   // one solar mass, in kg
    for (int i = 1; i < _num_bodies; i++) {
        // uniform mass between 0.5 and 1.5 solar masses
        m[i] = (rand() / (float)RAND_MAX) * nominal_mass + nominal_mass / 2;

        // uniform position between -_scale and +_scale in x and y (z stays 0)
        d[i].x = (rand() / (float)RAND_MAX - 0.5) * _scale * 2;
        d[i].y = (rand() / (float)RAND_MAX - 0.5) * _scale * 2;
        d[i].z = 0.0;
        d[i].a = 1.0;

        v[i].x = 0.0;
        v[i].y = 0.0;
        v[i].z = 0.0;
        v[i].a = 1.0;
    }
}
A solar mass is about 2.0e30 kg, and the mass calculation gives a uniform distribution between 0.5 and 1.5 solar masses. The position calculation similarly gives a uniform distribution between -_scale and +_scale in both the x- and y-dimensions. Let's see what this simulation configuration looks like:


It's not terribly interesting because nothing starts out in motion. It takes a very long time for gravity to have enough of an effect to really get things moving, except for the few stars that were randomly placed very close together to begin with. Those stars are quickly attracted to each other and accelerate rapidly. Then, because there's no collision handling, they pass each other and go flying off in opposite directions. Over time more of the system will be set in motion, but it looks like that will mostly take the form of ejections, and the result is not going to be realistic at all. Thinking about the initial conditions, a system at this point in star formation would definitely already be in motion, so we can start our refinements by giving each star a random initial velocity. Something like a maximum of 5 km/sec in either direction in each dimension sounds reasonable:
void initBodies(float* m, vec3d* v, vec3d* d) {
    // set origin
    m[0] = 0.0;
    d[0] = { 0, 0, 0, 1 };
    v[0] = { 0, 0, 0, 1 };
    
    float nominal_mass = 2.0e30;
    for (int i = 1; i < _num_bodies; i++) {
        m[i] = (rand() / (float)RAND_MAX) * nominal_mass + nominal_mass / 2;

        d[i].x = (rand() / (float)RAND_MAX - 0.5) * _scale * 2;
        d[i].y = (rand() / (float)RAND_MAX - 0.5) * _scale * 2;
        d[i].z = 0.0;
        d[i].a = 1.0;

        // random initial velocity, uniform within +/-5 km/s in x and y
        v[i].x = (rand() / (float)RAND_MAX - 0.5) * 10000.0;
        v[i].y = (rand() / (float)RAND_MAX - 0.5) * 10000.0;
        v[i].z = 0.0;
        v[i].a = 1.0;
    }
}
A simulation of this configuration looks like this:


While this simulation looks more interesting, with some motion to the system and ejections happening only very rarely, it's also not realistic. It looks more like a slow-motion particle simulation of a gas than a gravity simulation of a star cluster. It also doesn't have a discernible pattern to it, and it doesn't look like it will ever reach a state like the star cluster pictured at the top of this post. We need to force our simulation in the direction we want it to go with more intent.

Setting up a (More) Realistic Star Cluster

If we stop to think about what a real star cluster looks like, it does not have a uniform distribution of stars. The stars are concentrated in the center of the cluster, and the number of stars falls off the further from the center they are. We can easily model this kind of distribution by switching from standard rectangular coordinates to polar coordinates. In polar coordinates, instead of an x and y value, we have a radius r that represents the distance from the origin and an angle θ measured from the x-axis. If we choose a random r and θ for each body, each with a uniform distribution, we naturally end up with points concentrated toward the center: each concentric ring around the center gets about the same number of points, so the density falls off as the radius grows. Then we simply convert back to rectangular coordinates using the following equations:

x = r * cos(θ)
y = r * sin(θ)
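As an aside (not from the original post), the central concentration comes specifically from sampling the radius uniformly; if we ever wanted a disk of constant density instead, the usual trick is to take the square root of the uniform sample:

double u = rand() / (double)RAND_MAX;
double r_concentrated = u * _scale;        // density falls off toward the edge (what we want here)
double r_uniform      = sqrt(u) * _scale;  // constant density across the disk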

We can implement the polar coordinate generation of stars in initBodies() like so:
#define M_PI 3.14159265358979323846
void initBodies(float* m, vec3d* v, vec3d* d) {
    // set origin
    m[0] = 0.0;
    d[0] = { 0, 0, 0, 1 };
    v[0] = { 0, 0, 0, 1 };
    
    float nominal_mass = 2.0e30;
    for (int i = 1; i < _num_bodies; i++) {
        m[i] = (rand() / (float)RAND_MAX) * nominal_mass + nominal_mass / 2;

        // uniform radius in [0, _scale] concentrates stars toward the center;
        // uniform angle in [0, 2π) spreads them evenly around it
        double r = (rand() / (double)RAND_MAX) * _scale;
        double theta = (rand() / (double)RAND_MAX) * 2 * M_PI;

        d[i].x = cos(theta) * r;
        d[i].y = sin(theta) * r;
        d[i].z = 0.0;
        d[i].a = 1.0;

        v[i].x = 0.0;
        v[i].y = 0.0;
        v[i].z = 0.0;
        v[i].a = 1.0;
    }
}
Notice that r varies between 0 and _scale for the radius, and θ varies between 0 and 2π for the angle, which fills the entire viewable area and then some with stars. Now, let's take another look at the simulation:


While the initialization now creates a star cluster that looks much more like the real thing, since nothing is moving again, we end up with similar behavior to the first simulation where the stars that start out very close together gravitationally interact first, and they tend to get violently ejected from the system. The difference here is that because the stars in the center are packed much closer together, these interactions happen much more frequently in the center. Most of the outer stars remain fairly stationary for a long time, though, and overall the simulation is only mildly more interesting than the last one. Let's keep going by giving every star a random initial velocity, too:
#define M_PI 3.14159265358979323846
void initBodies(float* m, vec3d* v, vec3d* d) {
    // set origin
    m[0] = 0.0;
    d[0] = { 0, 0, 0, 1 };
    v[0] = { 0, 0, 0, 1 };
    
    float nominal_mass = 2.0e30;
    for (int i = 1; i < _num_bodies; i++) {
        m[i] = (rand() / (float)RAND_MAX) * nominal_mass + nominal_mass / 2;

        double r = (rand() / (double)RAND_MAX) * _scale;
        double theta = (rand() / (double)RAND_MAX) * 2 * M_PI;

        d[i].x = cos(theta) * r;
        d[i].y = sin(theta) * r;
        d[i].z = 0.0;
        d[i].a = 1.0;

        v[i].x = (rand() / (float)RAND_MAX - 0.5) * 10000.0;
        v[i].y = (rand() / (float)RAND_MAX - 0.5) * 10000.0;
        v[i].z = 0.0;
        v[i].a = 1.0;
    }
}

Now we get a simulation that looks a lot more like the second one, but with a higher concentration of stars in the center. It looks like the whole system will eventually disperse, punctuated by the occasional violent ejection. This may be a somewhat realistic simulation of an open star cluster, which does gradually dissolve like this over time. However, open star clusters normally have hundreds to a couple thousand stars; a cluster of many thousands of stars is more likely a globular cluster, which holds together for billions of years and doesn't readily disperse.

We can try another modification to give the stars an initial common motion and see what happens. Here we'll want to give them a roughly circular motion that reflects how much gravity they experience based on their distance from the center of the cluster; stars closer to the center should be orbiting faster than stars on the outskirts. We can create this kind of motion by setting each star's initial speed to roughly the circular orbital speed, sqrt(G * total_mass / r), which is proportional to the square root of the total mass and inversely proportional to the square root of the star's distance r from the center, and pointing the velocity perpendicular to the star's position angle from the x-axis. Polar coordinates make this calculation very simple:
#define M_PI 3.14159265358979323846
void initBodies(float* m, vec3d* v, vec3d* d) {
    // set origin
    m[0] = 0.0;
    d[0] = { 0, 0, 0, 1 };
    v[0] = { 0, 0, 0, 1 };
    
    float nominal_mass = 2.0e30;
    double total_mass = m[0];
    // assign all of the masses first so the total mass is known before computing velocities
    for (int i = 1; i < _num_bodies; i++) {
        m[i] = (rand() / (float)RAND_MAX) * nominal_mass + nominal_mass / 2;
        total_mass += m[i];
    }

    for (int i = 1; i < _num_bodies; i++) {
        double r = (rand() / (double)RAND_MAX) * _scale;
        if (r == 0.0) r = 1.0e15;   // avoid dividing by zero in the orbital speed calculation below
        double theta = (rand() / (double)RAND_MAX) * 2 * M_PI;

        d[i].x = cos(theta) * r;
        d[i].y = sin(theta) * r;
        d[i].z = 0.0;
        d[i].a = 1.0;

        // reuse r and theta for the velocity: roughly the circular orbital speed sqrt(G*M/r),
        // scaled by a random factor in [0.75, 1.0], pointed 90 degrees from the position
        // angle with up to ±π/4 of random jitter
        r = sqrt(G * total_mass / r) * (rand() / (double)RAND_MAX + 3.0) / 4.0;
        theta += M_PI / 2 + (rand() / (double)RAND_MAX - 0.5) * M_PI / 2;

        v[i].x = cos(theta) * r;
        v[i].y = sin(theta) * r;
        v[i].z = 0.0;
        v[i].a = 1.0;
    }
}
We needed to break the for loop in two so that all of the masses are assigned first, letting us accumulate the total mass of the system for use in the velocity magnitude calculation. We also need to take a precaution with each star's initial radius from the origin: it cannot be zero, otherwise we'll divide by zero and the star cluster will implode! (Or the simulation just crashes.) I also added a random component to the velocity: the magnitude is scaled to between 75% and 100% of the circular orbital speed, and the angle is perturbed by up to ±π/4.
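As a rough sanity check on the speeds this produces (my own back-of-the-envelope numbers, not from the post): with about 5,120 stars of roughly one solar mass each, total_mass is around 1.0e34 kg, so a star halfway out at r ≈ 4.7e17 m gets an initial speed of

v = sqrt(G * total_mass / r) ≈ sqrt(6.67e-11 * 1.0e34 / 4.7e17) ≈ 1.2 km/s

which is the same order of magnitude as the ±5 km/s random velocities used earlier. Our simulation now has a rotational component: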


This is starting to look very cool! It still looks like the cluster will disperse over time, starting from the center, but it has much more dynamic behavior, and it's starting to look like a reasonable simulation of a star cluster. What we probably need to hold it together is a huge central mass to act as an anchor for the other stars' orbits. We can do that by simply adding a million-solar-mass object at the origin (this would obviously be a black hole):
    m[0] = 2.0e36;   // one million solar masses
For this simulation, I also removed the random components to the initial velocities, just to see what it would look like:


This is another interesting simulation that seems to model what a very small galaxy would look like, or possibly a large star cluster. I'm not sure, since I'm not an expert in the subject, but it's fascinating to watch, nonetheless. At times it even looks like spiral arms are forming in rotational waves of stars, although they aren't maintained. It would probably require many more stars to create stable spiral arms. We also still see ejections in this simulation, which would be considered violent relaxations of the system, as one body is ejected and the remaining body in the interaction probably becomes more tightly coupled to the rest of the system. It also looks like many of these ejections happen because of interactions with the central mass that cause a star to whip around the center, accelerate rapidly, and fly out of the system. Let's take one final look at another simulation with the random component added back into the initial velocities:


Wow! Now this is an impressive simulation. It looks like there are multiple layers of orbital motion from the added randomness of the initial velocities, and the system has a pleasingly complex structure to it. There are violent relaxations going on, stars that seem to be falling into the center, and other stars migrating to the outskirts of the cluster. Over time it seems to be fairly stable as well, and the simulation can run for a very long time without losing its general form. Overall, I'm very happy with the results.

Like all programming projects, there are plenty of things that could be improved in this simulator. We could, of course, implement a model of collisions that combines stars that come too close together. We could more accurately model acceleration and velocity, as some of those ejections seem to happen at much higher speeds than they should; adding a Lorentz transformation to the velocity calculation would make that more accurate. And we could improve the simulation engine to handle a much larger number of bodies: keeping a spatial ordering of the bodies and preferentially calculating forces only from nearby or extremely massive bodies (the idea behind tree methods like Barnes-Hut) could potentially let the simulator handle orders of magnitude more bodies, at the expense of a much more complex and somewhat less accurate kernel.
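To give a flavor of the first idea, merging two bodies while conserving mass and momentum could look roughly like this. The helper mergeBodies and the indices i and j are hypothetical names of my own, and doing this inside the CUDA kernel would also need care to keep two threads from merging the same body at once:

// Hypothetical sketch: fold body j into body i, conserving mass and momentum.
void mergeBodies(float* m, vec3d* v, int i, int j) {
    float m_new = m[i] + m[j];
    v[i].x = (m[i] * v[i].x + m[j] * v[j].x) / m_new;
    v[i].y = (m[i] * v[i].y + m[j] * v[j].y) / m_new;
    v[i].z = (m[i] * v[i].z + m[j] * v[j].z) / m_new;
    m[i] = m_new;
    m[j] = 0.0f;   // a zero-mass body no longer attracts the others
}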

All of these options would be interesting paths to explore, but for now, this CUDA gravity simulator is looking pretty good and produces some fascinating simulations. It's worth taking a step back and thinking about the amazingly complex calculations that are going on here. Each and every body in the simulation is interacting with every other body, the motion is only set with the initial conditions and then is fully determined by the single force of gravity, and this is all calculated and displayed in real time by a graphics card with 640 execution cores. It's incredible! I'm quite pleased with where the simulator ended up, and I hope you learned some things about CUDA (and gravity) along the way.
