diff --git a/README.md b/README.md
index 98dd9a8..f5e13fa 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,76 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Daniel McCann
+* Tested on: Windows 10, i7-5700HQ CPU @ 2.70GHz, GeForce GTX 970M, 16 GB RAM
 
-### (TODO: Your README)
+### Overview
+
+![](./images/broid.gif)
+
+Above: 100,000 boids simulated on a GTX 970M. Gif is 7FPS, real is 33.
+
+As the first project for GPU programming and architecture, this simulation serves as an introduction to CUDA and memory coherence.
+
+Each 'boid' (particle) in this simulation reacts to neighbors with three rules: it is attracted toward them, repelled within a certain distance, and matches their velocity. Each rule has adjustable radii and weights.
+
+There are three stages to this implementation. The first, naive, has every particle check every other particle. 
+The second, called 'scattered,' implements a uniform grid. Each boid belongs to a grid cell in 3D space, and only needs to check neighbors within 8 cells instead of the entire space. The 8 cells are determined by the octant of the cell enclosing the boid; this extra computation is marginal and saves 19 cells of boids (2x2x2 instead of 3x3x3).
+The third, coherent, is a small modification to scattered that sorts boid positions and velocities to be coherent in memory, with the ordering determined by the position of its inclosing cells. Thus all of the boids within a cell are coherent in memory, and cells' memory is coherent down the x-axis
+
+Each stage brings a massive improvement on the last, and the coherent approach has trivial implementation differences from the scattered approach.
+
+### Performance analysis
+
+## Algorithm Differences
+
+To compare the solutions without a uniform grid, with a grid, and with memory-coherent boid data, I used the FPS measurement with the visualizer turned off.
+FPS measurements fluctuate often in the grid since the runtime is dependent on the worst case of population within grid cells. So with low numbers of particles, the measurements for the scattered and coherent algorithms are ballpark estimates.
+Once the particle count reaches 100,000 it stays within a 2 fps range.
+
+For these tests, each rule was as follows:
+
+|Rule # | Radius | Strength |
+| --- | --- | --- |
+| 1 | 8 | 0.01 |
+| 2 | 4 | 0.1 |
+| 3 | 5 | 0.1 |
+
+![](./images/FPSMethod.png)
+
+![](./images/FPSMethodText.PNG)
+
+A few things to notice: the minimum number of boids/particles in these measurements is 1250.
+At this stage, the naive method is roughly the same as the coherent grid, and better than the scattered grid. With very low numbers of particles, it should be the best method.
+
+Memory coherence was a massive benefit to FPS. My measurements show an 18% increase at the minimum, and this is for the least precise measurements. At 100,000 particles, it gave a 50% increase in FPS, from 22 to 33. This is further aided by picking grid cells in a memory-coherent order, making four pairs of cells with completely coherent reads for particles.
+
+## Cell Size
+
+I tested FPS with 100,000 particles with a memory-coherent uniform grid.
+
+| Radius | FPS |
+| --- | --- |
+| 16.0 | 6 - 8 |
+| 12.0 | 13 - 14 |
+| 8.0 | 32 - 33 |
+| 4.0 | 122 - 123 |
+
+As the radius increases, the particles are divided into fewer cells, thus each particle must check more neighbors. This, of course, converges to a worse version of the naive algorithm.
+
+## Block Size
+
+Using the same rules outlined in Algorithm Differences, 100,000 particles and coherence:
+
+| Size | FPS |
+| --- | --- |
+| 16 | 17 - 18 |
+| 32 | 30 - 32 |
+| 64 | 31 - 32 |
+| 128 | 32 - 33 |
+| 256 | 30 - 31 |
+| 512 | 30 - 31 |
+
+Varying block sizes have only marginal differences for this algorithm at this number of particles. 128 is ideal (barely) and 16 is a significant downgrade from 32. This makes sense because each block will not have a full working warp.
+256 and 512 are marginally different. My GPU has a max block dimension of (1024, 1024, 64).
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/images/FPSMethod.png b/images/FPSMethod.png
new file mode 100644
index 0000000..cfa1f9b
Binary files /dev/null and b/images/FPSMethod.png differ
diff --git a/images/FPSMethodText.PNG b/images/FPSMethodText.PNG
new file mode 100644
index 0000000..218121e
Binary files /dev/null and b/images/FPSMethodText.PNG differ
diff --git a/images/broid.gif b/images/broid.gif
new file mode 100644
index 0000000..a1c5ba1
Binary files /dev/null and b/images/broid.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index aaf0fbf..375d737 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -37,12 +37,12 @@ void checkCUDAError(const char *msg, int line = -1) {
 *****************/
 
 /*! Block size used for CUDA kernel launch. */
-#define blockSize 128
+#define blockSize 512
 
 // LOOK-1.2 Parameters for the boids algorithm.
 // These worked well in our reference implementation.
-#define rule1Distance 5.0f
-#define rule2Distance 3.0f
+#define rule1Distance 8.0f
+#define rule2Distance 4.0f
 #define rule3Distance 5.0f
 
 #define rule1Scale 0.01f
@@ -85,6 +85,8 @@ int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_pos_coherent;
+glm::vec3 *dev_vel_coherent;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -169,6 +171,22 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount* sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+  
+  cudaMalloc((void**)&dev_pos_coherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos_coherent failed!");
+  cudaMalloc((void**)&dev_vel_coherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel_coherent failed!");
+
+
+
   cudaThreadSynchronize();
 }
 
@@ -210,8 +228,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
   dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO <<<fullBlocksPerGrid, blockSize >>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO <<<fullBlocksPerGrid, blockSize >>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
   checkCUDAErrorWithLine("copyBoidsToVBO failed!");
 
@@ -230,10 +248,46 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+    // Rule 1, Cohesion: boids fly towards their local perceived center of mass, which excludes themselves
+    // Rule 2, Avoidance: boids try to stay a distance d away from each other
+    // Rule 3, Matching: boids try to match the speed of surrounding boids
+	// sum each contribution separately
+	glm::vec3 cohesCenter = glm::vec3(0);
+	glm::vec3 avoidVel = glm::vec3(0);
+	glm::vec3 matchVel = glm::vec3(0);
+	// different number of neighbors depending on rule settings
+	float numCohes = 0.0f;
+	float numAvoid = 0.0f;
+	float numMatch = 0.0f;
+
+	glm::vec3 selfPos = pos[iSelf];
+
+	for (int i = 0; i < N; i++) {
+		if (i == iSelf) continue;
+		glm::vec3 otherPos = pos[i];
+		float dist = glm::distance(otherPos, selfPos);
+
+		if (dist < rule1Distance) {
+			numCohes++;
+			cohesCenter += otherPos;
+		}
+
+		if (dist < rule2Distance) {
+			numAvoid++;
+			avoidVel -= otherPos - selfPos;
+		}
+
+		if (dist < rule3Distance) {
+			numMatch++;
+			matchVel += vel[i];
+		}
+	}
+
+	cohesCenter = numCohes > 0.0f ? (cohesCenter / numCohes) : selfPos;
+	avoidVel = numAvoid > 0.0f ? (avoidVel / numAvoid) : glm::vec3(0);
+	matchVel = numMatch > 0.0f ? (matchVel / numMatch) : glm::vec3(0);
+
+	return vel[iSelf] + (cohesCenter - selfPos) * rule1Scale + avoidVel * rule2Scale + matchVel * rule3Scale;
 }
 
 /**
@@ -242,9 +296,16 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+	int idx = threadIdx.x + blockDim.x * blockIdx.x;
+
+	glm::vec3 vel = computeVelocityChange(N, idx, pos, vel1);
+
+	float speed = glm::length(vel);
+	if (speed > maxSpeed) {
+		vel = glm::normalize(vel) * maxSpeed;
+	}
+
+	vel2[idx] = vel;
 }
 
 /**
@@ -278,6 +339,7 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
 //          for(x)
 //            for(y)
 //             for(z)? Or some other order?
+// z y x
 __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
   return x + y * gridResolution + z * gridResolution * gridResolution;
 }
@@ -289,10 +351,27 @@ __global__ void kernComputeIndices(int N, int gridResolution,
     // - Label each boid with the index of its grid cell.
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+	int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+	if (idx >= N) {
+		return;
+	}
+
+	int dataIdx = idx;// indices[idx];
+	indices[idx] = idx;
+	glm::vec3 curPos = pos[dataIdx];
+	curPos -= gridMin;
+	curPos *= inverseCellWidth;
+
+	curPos = floor(curPos); // now in grid-index-space
+
+	int selfGridX = int(curPos.x);
+	int selfGridY = int(curPos.x);
+	int selfGridZ = int(curPos.z);
+
+	int gridIdx = gridIndex3Dto1D(selfGridX, selfGridY, selfGridZ, gridResolution);
+	gridIndices[idx] = gridIdx;
 }
 
-// LOOK-2.1 Consider how this could be useful for indicating that a cell
-//          does not enclose any boids
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if (index < N) {
@@ -306,6 +385,25 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+	int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+	if (idx > N - 1) {
+		return;
+	}
+	else if (idx == N - 1) {
+		int c = particleGridIndices[idx];
+		gridCellEndIndices[c] = idx;
+	}
+	else {
+		int c1 = particleGridIndices[idx];
+		int c2 = particleGridIndices[idx + 1];
+		if (c1 != c2) {
+			gridCellEndIndices[c1] = idx;
+			gridCellStartIndices[c2] = idx + 1;
+		}
+		else if (idx == 0) {
+			gridCellStartIndices[c1] = 0;
+		}
+	}
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -314,14 +412,91 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+	int idx = threadIdx.x + blockDim.x * blockIdx.x;
+	idx = particleArrayIndices[idx];
+
+	glm::vec3 cohesCenter = glm::vec3(0);
+	glm::vec3 avoidVel = glm::vec3(0);
+	glm::vec3 matchVel = glm::vec3(0);
+	// different number of neighbors depending on rule settings
+	float numCohes = 0.0f;
+	float numAvoid = 0.0f;
+	float numMatch = 0.0f;
+
+	glm::vec3 selfPos = pos[idx];
+
+	glm::vec3 normPos = selfPos - gridMin;
+	normPos *= inverseCellWidth;
+	normPos = floor(normPos); // now in grid-index-space
+
+	int selfGridX = int(normPos.x);
+	int selfGridY = int(normPos.x);
+	int selfGridZ = int(normPos.z);
+
+	normPos += glm::vec3(0.5f);
+	normPos *= cellWidth; // world space center of the grid, shifted to positive
+
+	glm::vec3 shiftPos = selfPos - gridMin;
+	int octx = ((shiftPos.x - normPos.x) > 0 ? 1 : 0);
+	int octy = ((shiftPos.y - normPos.y) > 0 ? 1 : 0);
+	int octz = ((shiftPos.z - normPos.z) > 0 ? 1 : 0);
+	
+	// outer 3 loops: for each cell in each axis
+	for (int cz = octz - 1 + selfGridZ; cz <= selfGridZ + octz; cz++) {
+		if (cz < 0 || cz > gridResolution) continue;
+
+		for (int cy = octy - 1 + selfGridY; cy <= selfGridY + octy; cy++) {
+			if (cy < 0 || cy > gridResolution) continue;
+
+			for (int cx = octx - 1 + selfGridX; cx <= selfGridX + octx; cx++) {
+				if (cx < 0 || cx > gridResolution) continue;
+				
+				int currGridIdx = gridIndex3Dto1D(cx, cy, cz, gridResolution);
+				int gridStart = gridCellStartIndices[currGridIdx];
+				if (gridStart < 0) continue; // -1 indicates nothing in this cell
+				int gridEnd = gridCellEndIndices[currGridIdx];
+
+				// iterate through all boids in this cell
+				for (int gridCurr = gridStart; gridCurr <= gridEnd; gridCurr++) {
+					int currBoidIdx = particleArrayIndices[gridCurr];
+					if (currBoidIdx == idx) continue; // same boid
+
+					glm::vec3 otherPos = pos[currBoidIdx];
+					float dist = glm::distance(otherPos, selfPos);
+
+					if (dist < rule1Distance) {
+						numCohes++;
+						cohesCenter += otherPos;
+					}
+
+					if (dist < rule2Distance) {
+						numAvoid++;
+						avoidVel -= otherPos - selfPos;
+					}
+
+					if (dist < rule3Distance) {
+						numMatch++;
+						matchVel += vel1[currBoidIdx];
+					}
+
+				}
+
+			}
+		}
+	}
+
+	cohesCenter = numCohes > 0.0f ? (cohesCenter / numCohes) : selfPos;
+	avoidVel = numAvoid > 0.0f ? (avoidVel / numAvoid) : glm::vec3(0);
+	matchVel = numMatch > 0.0f ? (matchVel / numMatch) : glm::vec3(0);
+
+	glm::vec3 finalVel = vel1[idx] + (cohesCenter - selfPos) * rule1Scale + avoidVel * rule2Scale + matchVel * rule3Scale;
+
+	float speed = glm::length(finalVel);
+	if (speed > maxSpeed) {
+		finalVel = glm::normalize(finalVel) * maxSpeed;
+	}
+
+	vel2[idx] = finalVel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,26 +504,114 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+	int idx = threadIdx.x + blockDim.x * blockIdx.x;
+
+	glm::vec3 cohesCenter = glm::vec3(0);
+	glm::vec3 avoidVel = glm::vec3(0);
+	glm::vec3 matchVel = glm::vec3(0);
+	// different number of neighbors depending on rule settings
+	float numCohes = 0.0f;
+	float numAvoid = 0.0f;
+	float numMatch = 0.0f;
+
+	glm::vec3 selfPos = pos[idx];
+
+	glm::vec3 normPos = selfPos - gridMin;
+	normPos *= inverseCellWidth;
+	normPos = floor(normPos); // now in grid-index-space
+
+	int selfGridX = int(normPos.x);
+	int selfGridY = int(normPos.x);
+	int selfGridZ = int(normPos.z);
+
+	normPos += glm::vec3(0.5f);
+	normPos *= cellWidth; // world space center of the grid, shifted to positive
+
+	glm::vec3 shiftPos = selfPos - gridMin;
+	int octx = ((shiftPos.x - normPos.x) > 0 ? 1 : 0);
+	int octy = ((shiftPos.y - normPos.y) > 0 ? 1 : 0);
+	int octz = ((shiftPos.z - normPos.z) > 0 ? 1 : 0);
+
+	// outer 3 loops: for each cell in each axis
+	for (int cz = octz - 1 + selfGridZ; cz <= selfGridZ + octz; cz++) {
+		if (cz < 0 || cz > gridResolution) continue;
+
+		for (int cy = octy - 1 + selfGridY; cy <= selfGridY + octy; cy++) {
+			if (cy < 0 || cy > gridResolution) continue;
+
+			for (int cx = octx - 1 + selfGridX; cx <= selfGridX + octx; cx++) {
+				if (cx < 0 || cx > gridResolution) continue;
+
+				int currGridIdx = gridIndex3Dto1D(cx, cy, cz, gridResolution);
+				int gridStart = gridCellStartIndices[currGridIdx];
+				if (gridStart < 0) continue; // -1 indicates nothing in this cell
+				int gridEnd = gridCellEndIndices[currGridIdx];
+
+				// iterate through all boids in this cell
+				for (int gridCurr = gridStart; gridCurr <= gridEnd; gridCurr++) {
+					int currBoidIdx = gridCurr;
+					if (currBoidIdx == idx) continue; // same boid
+
+					glm::vec3 otherPos = pos[currBoidIdx];
+					float dist = glm::distance(otherPos, selfPos);
+
+					if (dist < rule1Distance) {
+						numCohes++;
+						cohesCenter += otherPos;
+					}
+
+					if (dist < rule2Distance) {
+						numAvoid++;
+						avoidVel -= otherPos - selfPos;
+					}
+
+					if (dist < rule3Distance) {
+						numMatch++;
+						matchVel += vel1[currBoidIdx];
+					}
+
+				}
+
+			}
+		}
+	}
+
+	cohesCenter = numCohes > 0.0f ? (cohesCenter / numCohes) : selfPos;
+	avoidVel = numAvoid > 0.0f ? (avoidVel / numAvoid) : glm::vec3(0);
+	matchVel = numMatch > 0.0f ? (matchVel / numMatch) : glm::vec3(0);
+
+	glm::vec3 finalVel = vel1[idx] + (cohesCenter - selfPos) * rule1Scale + avoidVel * rule2Scale + matchVel * rule3Scale;
+
+	float speed = glm::length(finalVel);
+	if (speed > maxSpeed) {
+		finalVel = glm::normalize(finalVel) * maxSpeed;
+	}
+
+	vel2[idx] = finalVel;
+}
+
+__global__ void kernSortBoidData(int N, int *sortedIndices, 
+	glm::vec3 *pos, glm::vec3 *vel, 
+	glm::vec3 *pos1, glm::vec3 *vel1) {
+	int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (idx >= N) return;
+
+	int boidIdx = sortedIndices[idx];
+	pos1[idx] = pos[boidIdx];
+	vel1[idx] = vel[boidIdx];
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+	dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+	kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, dev_vel1, dev_vel2);
+	kernUpdatePos <<<fullBlocksPerGrid, blockSize >> >(numObjects, dt, dev_pos, dev_vel2);
+	// ping-pong the velocity buffers
+	glm::vec3 *temp = dev_vel1;
+	dev_vel1 = dev_vel2;
+	dev_vel2 = temp;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +627,38 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+	dim3 cellCountSize((gridCellCount + blockSize - 1) / blockSize);
+	dim3 boidCountSize((numObjects + blockSize - 1) / blockSize);
+
+	// reset grid structure pointers
+	kernResetIntBuffer <<< cellCountSize, blockSize >>>(gridCellCount, dev_gridCellStartIndices, -1);
+	kernResetIntBuffer <<< cellCountSize, blockSize >>>(gridCellCount, dev_gridCellEndIndices, -1);
+	
+	// compute grid indices based on current boid positions
+	kernComputeIndices <<< boidCountSize, blockSize >>>(numObjects, gridSideCount,
+		gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+	// sort the boids based on grid indices
+	thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+	thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+	thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+
+	// initialize grid to boid pointers
+	kernIdentifyCellStartEnd <<< boidCountSize, blockSize >>>(numObjects, dev_particleGridIndices, 
+		dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+	// run the simulation
+	kernUpdateVelNeighborSearchScattered <<< boidCountSize, blockSize >>> (numObjects, gridSideCount,
+		gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+		dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+	kernUpdatePos <<< boidCountSize, blockSize >>>(numObjects, dt, dev_pos, dev_vel2);
+
+	glm::vec3 *temp = dev_vel1;
+	dev_vel1 = dev_vel2;
+	dev_vel2 = temp;
+
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +677,47 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+
+	dim3 cellCountSize((gridCellCount + blockSize - 1) / blockSize);
+	dim3 boidCountSize((numObjects + blockSize - 1) / blockSize);
+
+	// reset grid structure pointers
+	kernResetIntBuffer << < cellCountSize, blockSize >> >(gridCellCount, dev_gridCellStartIndices, -1);
+	kernResetIntBuffer << < cellCountSize, blockSize >> >(gridCellCount, dev_gridCellEndIndices, -1);
+
+	// compute grid indices based on current boid positions
+	kernComputeIndices << < boidCountSize, blockSize >> >(numObjects, gridSideCount,
+		gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+	// sort the boids based on grid indices
+	thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+	thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+	thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+
+	// rearrange boids to be memory coherent, swap buffers
+	kernSortBoidData << < boidCountSize, blockSize >> > (numObjects, dev_particleArrayIndices, 
+		dev_pos, dev_vel1, dev_pos_coherent, dev_vel_coherent);
+	glm::vec3 *temp = dev_pos;
+	dev_pos = dev_pos_coherent;
+	dev_pos_coherent = temp;
+	temp = dev_vel1;
+	dev_vel1 = dev_vel_coherent;
+	dev_vel_coherent = temp;
+
+	// initialize grid to boid pointers
+	kernIdentifyCellStartEnd << < boidCountSize, blockSize >> >(numObjects, dev_particleGridIndices,
+		dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+	// run the simulation
+	kernUpdateVelNeighborSearchCoherent << < boidCountSize, blockSize >> > (numObjects, gridSideCount,
+		gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+		dev_pos, dev_vel1, dev_vel2);
+
+	kernUpdatePos << < boidCountSize, blockSize >> >(numObjects, dt, dev_pos, dev_vel2);
+
+	temp = dev_vel1;
+	dev_vel1 = dev_vel2;
+	dev_vel2 = temp;
 }
 
 void Boids::endSimulation() {
@@ -390,6 +726,13 @@ void Boids::endSimulation() {
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+
+  cudaFree(dev_pos_coherent);
+  cudaFree(dev_vel_coherent);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index a29471d..4d96dd8 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 100000;
 const float DT = 0.2f;
 
 /**