diff --git a/README.md b/README.md
index 98dd9a8..e14d588 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,67 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* RISHABH SHAH
+* Tested on: Windows 10, i7-6700HQ @ 2.6GHz 16GB, GTX 960M 4096MB (Laptop)
 
-### (TODO: Your README)
+### Boids Flocking Simulation on GPU
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](./images/coherent_5000.gif)
+
+*Simulation of 5000 boids using the coherent grid*
+
+In this simulation, boids (particles) move through space, interacting with nearby boids according to three rules: cohesion, separation, and alignment. At every timestep, these rules determine the new velocity of every boid.
+
+The simulation was implemented using three methods:
+- Naive: every boid is checked against every other boid, regardless of distance.
+- Scattered Grid: the space is divided into a uniform grid, and each boid is checked only against the boids in nearby grid cells.
+- Coherent Grid: the scattered grid still reaches boid data through an index array, so position and velocity reads are scattered across memory. On a GPU, reading neighboring array elements is much faster than reading far-apart ones, so the position and velocity arrays are rearranged to be contiguous within each cell. This is considerably faster at high boid counts.
+
+Skipping cells based on neighborhood distance: instead of looping over the boids in every nearby cell, I check whether any part of the boid's neighborhood lies inside a cell, and consider that cell's boids only if it does.
+
+#### Performance Analysis
+*NOTE: All the timings were calculated as an average over 15 seconds.*
+-   Number of Boids:
+    -   As expected, the frame rate drops as the number of boids increases.
+    -   When the number of boids is low (about 1000), naive is the fastest. This is probably because of the overhead of the additional calculations that the grid-based implementations need. With so few boids, it is cheaper to compare every boid with every other boid than to do the extra work.
+    -   As the number of boids increases, the benefits of the grid-based implementations become visible.
+
+        *Analysis with visualization off*
+
+        ![](./images/boidsvsfps.png)
+
+        ![](./images/chart_boids.png)
+
+        *Analysis with visualization on*
+
+        ![](./images/boidsvsfps_viz.png)
+
+        ![](./images/chart_viz.png)
+
+-   Block Size:
+    -   A change in frame rate can be seen as the block size increases from 16 to 32, probably because half of the threads in each warp were idle with a block size of 16. A small performance increase can be seen up to 64 or 128.
+    -   After 128, there is not much difference.
+
+        *Analysis with visualization off*
+
+        ![](./images/blocksizevsfps.png)
+
+        ![](./images/chart_blocksize.png)
+
+-   Cell Width:
+    -   Grid cell width directly affects performance: decreasing it increases the number of grid cells to be processed, but also decreases the number of boids in each cell.
+    -   Performance with half the original cell width was noticeably better for all the methods.
+    -   As can be seen in the data below, performance actually increased when doubling the boid count from 5000 to 10000. The same thing happens with the original cell width. I do not know why.
+
+        *Analysis with visualization off*
+
+        ![](./images/halfcellwidth.png)
+
+        ![](./images/chart_halfcellwidth.png)
+
+-   Comparison:
+    -   This compares performance with visualization on, with visualization off, and with the smaller cell size (visualization off).
+
+    ![](./images/chart_comparison.png)
+
+#### src/CMakeLists.txt changes: OPTIONS -arch=sm_50
diff --git a/images/blocksizevsfps.png b/images/blocksizevsfps.png
new file mode 100644
index 0000000..f3b59a1
Binary files /dev/null and b/images/blocksizevsfps.png differ
diff --git a/images/boidsvsfps.png b/images/boidsvsfps.png
new file mode 100644
index 0000000..03f071c
Binary files /dev/null and b/images/boidsvsfps.png differ
diff --git a/images/boidsvsfps_viz.png b/images/boidsvsfps_viz.png
new file mode 100644
index 0000000..52f728f
Binary files /dev/null and b/images/boidsvsfps_viz.png differ
diff --git a/images/chart_blocksize.png b/images/chart_blocksize.png
new file mode 100644
index 0000000..183567d
Binary files /dev/null and b/images/chart_blocksize.png differ
diff --git a/images/chart_boids.png b/images/chart_boids.png
new file mode 100644
index 0000000..853413a
Binary files /dev/null and b/images/chart_boids.png differ
diff --git a/images/chart_comparison.png b/images/chart_comparison.png
new file mode 100644
index 0000000..9c89829
Binary files /dev/null and b/images/chart_comparison.png differ
diff --git a/images/chart_halfcellwidth.png b/images/chart_halfcellwidth.png
new file mode 100644
index 0000000..11732c9
Binary files /dev/null and b/images/chart_halfcellwidth.png differ
diff --git a/images/chart_viz.png b/images/chart_viz.png
new file mode 100644
index 0000000..6f3a5c6
Binary files /dev/null and b/images/chart_viz.png differ
diff --git a/images/coherent_5000.gif b/images/coherent_5000.gif
new file mode 100644
index 0000000..ed45ba2
Binary files /dev/null and b/images/coherent_5000.gif differ
diff --git a/images/coherent_50000.gif b/images/coherent_50000.gif
new file mode 100644
index 0000000..f49315d
Binary files /dev/null and b/images/coherent_50000.gif differ
diff --git a/images/halfcellwidth.png b/images/halfcellwidth.png
new file mode 100644
index 0000000..d338960
Binary files /dev/null and b/images/halfcellwidth.png differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..dff0113 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index 1a5123d..0788fbf 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -31,7 +31,6 @@ void checkCUDAError(const char *msg, int line = -1) {
   }
 }
 
-
 /*****************
 * Configuration *
 *****************/
@@ -76,15 +75,18 @@ glm::vec3 *dev_vel2;
 // For efficient sorting and the uniform grid. These should always be parallel.
 int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle?
 int *dev_particleGridIndices; // What grid cell is this particle in?
+
 // needed for use with thrust
-thrust::device_ptr<int> dev_thrust_particleArrayIndices;
-thrust::device_ptr<int> dev_thrust_particleGridIndices;
+//thrust::device_ptr<int> dev_thrust_particleArrayIndices;
+//thrust::device_ptr<int> dev_thrust_particleGridIndices;
 
 int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
 int *dev_gridCellEndIndices;   // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_posCoh;
+glm::vec3 *dev_velCoh;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -169,10 +171,27 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_posCoh, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posCoh failed!");
+
+  cudaMalloc((void**)&dev_velCoh, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_velCoh failed!");
+
   cudaDeviceSynchronize();
 }
 
-
 /******************
 * copyBoidsToVBO *
 ******************/
@@ -229,11 +248,50 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * Compute the new velocity on the body with index `iSelf` due to the `N` boids
 * in the `pos` and `vel` arrays.
 */
+
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  
+  glm::vec3 newVel(0.0f, 0.0f, 0.0f);
+  glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+  glm::vec3 sep(0.0f, 0.0f, 0.0f);
+  glm::vec3 ali(0.0f, 0.0f, 0.0f);
+  int nCountCoh = 0;
+  int nCountAli = 0;
+  
+  for (int i = 0; i < N; i++) {
+    if (i == iSelf) {
+      continue;
+    }
+    float distance = glm::distance(pos[iSelf], pos[i]);
+    // 1: cohesion
+    if (distance < rule1Distance) {
+      perceivedCenter += pos[i];
+      nCountCoh++;
+    }
+    // 2: separation
+    if (distance < rule2Distance) {
+      sep -= (pos[i] - pos[iSelf]);
+      //nCountSep++;
+    }
+    // 3: alignment
+    if (distance < rule3Distance) {
+      ali += vel[i];
+      nCountAli++;
+    }
+  }
+  if (nCountCoh > 0) {
+    perceivedCenter /= nCountCoh;
+    newVel += (perceivedCenter - pos[iSelf]) * rule1Scale;
+  }
+  if (nCountAli > 0) {
+    //ali /= nCountAli;
+    newVel += ali * rule3Scale;
+  }
+  newVel += sep * rule2Scale;
+  return newVel;
 }
 
 /**
@@ -245,6 +303,15 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) return;
+
+  vel2[index] = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+
+  float speed = glm::length(vel2[index]); // keep as float so the clamp is not truncated
+  if (speed > maxSpeed) {
+    vel2[index] *= (maxSpeed / speed);
+  }
 }
 
 /**
@@ -254,9 +321,8 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
 __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
   // Update position by velocity
   int index = threadIdx.x + (blockIdx.x * blockDim.x);
-  if (index >= N) {
-    return;
-  }
+  if (index >= N) return;
+
   glm::vec3 thisPos = pos[index];
   thisPos += vel[index] * dt;
 
@@ -269,6 +335,15 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
   thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y;
   thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z;
 
+  // Bounce the boids off the walls - converge to corner with time
+  /*vel[index].x = thisPos.x < -scene_scale ? -vel[index].x : vel[index].x;
+  vel[index].y = thisPos.y < -scene_scale ? -vel[index].y : vel[index].y;
+  vel[index].z = thisPos.z < -scene_scale ? -vel[index].z : vel[index].z;
+
+  vel[index].x = thisPos.x > scene_scale ? -vel[index].x : vel[index].x;
+  vel[index].y = thisPos.y > scene_scale ? -vel[index].y : vel[index].y;
+  vel[index].z = thisPos.z > scene_scale ? -vel[index].z : vel[index].z;*/
+
   pos[index] = thisPos;
 }
 
@@ -285,10 +360,16 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
-    // - Label each boid with the index of its grid cell.
-    // - Set up a parallel array of integer indices as pointers to the actual
-    //   boid data in pos and vel1/vel2
+  // TODO-2.1
+  // - Label each boid with the index of its grid cell.
+  // - Set up a parallel array of integer indices as pointers to the actual
+  //   boid data in pos and vel1/vel2
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) return;
+
+  glm::ivec3 p = (pos[index] - gridMin) * inverseCellWidth;
+  gridIndices[index] = gridIndex3Dto1D(p.x, p.y, p.z, gridResolution);
+  indices[index] = index;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +387,23 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) return;
+
+  int indThis = particleGridIndices[index];
+  if (index == 0) {
+    gridCellStartIndices[indThis] = 0;
+  }
+  else {
+    int indPrev = particleGridIndices[index - 1];
+    if (indThis != indPrev) {
+      gridCellStartIndices[indThis] = index;
+      gridCellEndIndices[indPrev] = index -1;
+    }
+  }
+  if (index == N-1) {
+    gridCellEndIndices[indThis] = N-1;
+  }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +420,104 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) return;
+
+  float nDist = imax(imax(rule1Distance, rule2Distance), rule3Distance);
+
+  glm::vec3 newVel(0.0f, 0.0f, 0.0f);
+  glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+  glm::vec3 sep(0.0f, 0.0f, 0.0f);
+  glm::vec3 ali(0.0f, 0.0f, 0.0f);
+  int nCountCoh = 0;
+  int nCountAli = 0;
+
+  glm::vec3 origPos = pos[index];
+  glm::ivec3 origCell = (origPos - gridMin) * inverseCellWidth;
+  int origCellIdx = gridIndex3Dto1D(origCell.x, origCell.y, origCell.z, gridResolution);
+
+  for (int x = -1; x < 2; x++) {
+    for (int y = -1; y < 2; y++) {
+      for (int z = -1; z < 2; z++) {
+        
+        // OPTIMIZATION:
+        // Process the cell only if a part of the boid's neighborhood lies inside it.
+        // Thus, processing a minimum of 1 and a maximum of 27 cells.
+        int newCellIdx;
+        if (x == 0 && y == 0 && z == 0) {
+          newCellIdx = origCellIdx;
+        }
+        else {
+          glm::vec3 newPos = (origPos + glm::vec3(x, y, z) * nDist);
+          glm::ivec3 newCell = (newPos - gridMin) * inverseCellWidth;
+          if (newCell.x < 0 || newCell.x >= gridResolution ||
+            newCell.y < 0 || newCell.y >= gridResolution ||
+            newCell.z < 0 || newCell.z >= gridResolution) {
+            continue;
+          }
+          newCellIdx = gridIndex3Dto1D(newCell.x, newCell.y, newCell.z, gridResolution);
+
+          // This checks if the point at the extreme end of the neighbourhood is in another cell:
+          if (newCellIdx == origCellIdx) {
+            continue;
+          }
+        }
+
+        if (gridCellStartIndices[newCellIdx] != -1) {
+          for (int i = gridCellStartIndices[newCellIdx]; i <= gridCellEndIndices[newCellIdx]; i++) {
+            int currentIdx = particleArrayIndices[i];
+            if (currentIdx == index) {
+              continue;
+            }
+            glm::vec3 posCurrent = pos[currentIdx];
+            float distance = glm::distance(origPos, posCurrent);
+            // 1: cohesion
+            if (distance < rule1Distance) {
+              perceivedCenter += posCurrent;
+              nCountCoh++;
+            }
+            // 2: separation
+            if (distance < rule2Distance) {
+              sep -= (posCurrent - origPos);
+            }
+            // 3: alignment
+            if (distance < rule3Distance) {
+              ali += vel1[currentIdx];
+              nCountAli++;
+            }
+          }
+        }
+
+      }
+    }
+  }
+
+  if (nCountCoh > 0) {
+    perceivedCenter /= nCountCoh;
+    newVel += (perceivedCenter - origPos) * rule1Scale;
+  }
+  if (nCountAli > 0) {
+    newVel += ali * rule3Scale;
+  }
+  newVel += sep * rule2Scale;
+  vel2[index] = vel1[index] + newVel;
+
+  float speed = glm::length(vel2[index]); // keep as float so the clamp is not truncated
+  if (speed > maxSpeed) {
+    vel2[index] *= (maxSpeed / speed);
+  }
+}
+
+__global__ void kernComputePosVelCoherent(
+  int N, int *particleArrayIndices, glm::vec3 *pos, glm::vec3 *vel,
+  glm::vec3 *posCoh, glm::vec3 *velCoh) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) return;
+
+  int cohIdx = particleArrayIndices[index];
+  posCoh[index] = pos[cohIdx];
+  velCoh[index] = vel[cohIdx];
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +537,94 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) return;
+
+  float nDist = imax(imax(rule1Distance, rule2Distance), rule3Distance);
+
+  glm::vec3 newVel(0.0f, 0.0f, 0.0f);
+  glm::vec3 perceivedCenter(0.0f, 0.0f, 0.0f);
+  glm::vec3 sep(0.0f, 0.0f, 0.0f);
+  glm::vec3 ali(0.0f, 0.0f, 0.0f);
+  int nCountCoh = 0;
+  int nCountAli = 0;
+
+  glm::vec3 origPos = pos[index];
+  glm::ivec3 origCell = (origPos - gridMin) * inverseCellWidth;
+  int origCellIdx = gridIndex3Dto1D(origCell.x, origCell.y, origCell.z, gridResolution);
+
+  bool process;
+
+  for (int z = -1; z < 2; z++) {
+    for (int y = -1; y < 2; y++) {
+      for (int x = -1; x < 2; x++) {
+
+        // OPTIMIZATION:
+        // Process the cell only if a part of the boid's neighborhood lies inside it.
+        // Thus, processing a minimum of 1 and a maximum of 27 cells.
+        int newCellIdx;
+        if (x == 0 && y == 0 && z == 0) {
+          newCellIdx = origCellIdx;
+        }
+        else {
+          glm::vec3 newPos = (origPos + glm::vec3(x, y, z) * nDist);
+          glm::ivec3 newCell = (newPos - gridMin) * inverseCellWidth;
+          if (newCell.x < 0 || newCell.x >= gridResolution ||
+            newCell.y < 0 || newCell.y >= gridResolution ||
+            newCell.z < 0 || newCell.z >= gridResolution) {
+            continue;
+          }
+          newCellIdx = gridIndex3Dto1D(newCell.x, newCell.y, newCell.z, gridResolution);
+
+          // This checks if the point at the extreme end of the neighbourhood is in another cell:
+          if (newCellIdx == origCellIdx) {
+            continue;
+          }
+        }
+
+        if (gridCellStartIndices[newCellIdx] != -1) {
+          for (int currentIdx = gridCellStartIndices[newCellIdx]; currentIdx <= gridCellEndIndices[newCellIdx]; currentIdx++) {
+            if (currentIdx == index) {
+              continue;
+            }
+            glm::vec3 posCurrent = pos[currentIdx];
+            float distance = glm::distance(origPos, posCurrent);
+            // 1: cohesion
+            if (distance < rule1Distance) {
+              perceivedCenter += posCurrent;
+              nCountCoh++;
+            }
+            // 2: separation
+            if (distance < rule2Distance) {
+              sep -= (posCurrent - origPos);
+            }
+            // 3: alignment
+            if (distance < rule3Distance) {
+              ali += vel1[currentIdx];
+              nCountAli++;
+            }
+          }
+        }
+
+      }
+    }
+  }
+
+  if (nCountCoh > 0) {
+    perceivedCenter /= nCountCoh;
+    newVel += (perceivedCenter - origPos) * rule1Scale;
+  }
+  if (nCountAli > 0) {
+    newVel += ali * rule3Scale;
+  }
+  newVel += sep * rule2Scale;
+  vel2[index] = vel1[index] + newVel;
+
+  float speed = glm::length(vel2[index]); // keep as float so the clamp is not truncated
+  if (speed > maxSpeed) {
+    vel2[index] *= (maxSpeed / speed);
+  }
 }
 
 /**
@@ -349,6 +633,12 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!");
+  kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +654,50 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid_cells((gridCellCount + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  //reset start and end buffers to -1..
+  kernResetIntBuffer <<<fullBlocksPerGrid_cells, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer for start indices failed!");
+
+  kernResetIntBuffer <<<fullBlocksPerGrid_cells, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer for end indices failed!");
+  
+  // compute array and grid indices..
+  kernComputeIndices <<<fullBlocksPerGrid, blockSize>>>
+      (numObjects, gridSideCount,
+      gridMinimum, gridInverseCellWidth,
+      dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  // sort
+  thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+  thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+  checkCUDAErrorWithLine("thrust::sort_by_key failed!");
+
+  // compute start and end indices..
+  kernIdentifyCellStartEnd <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  // compute new velocity using grid..
+  kernUpdateVelNeighborSearchScattered <<<fullBlocksPerGrid, blockSize>>> 
+      (numObjects, gridSideCount, gridMinimum,
+      gridInverseCellWidth, gridCellWidth, 
+      dev_gridCellStartIndices, dev_gridCellEndIndices, 
+      dev_particleArrayIndices,
+      dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+  
+  // update pos..
+  kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // ping pong vel buffers..
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,14 +716,69 @@ void Boids::stepSimulationCoherentGrid(float dt) {
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  dim3 fullBlocksPerGrid_cells((gridCellCount + blockSize - 1) / blockSize);
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  //reset start and end buffers to -1..
+  kernResetIntBuffer <<<fullBlocksPerGrid_cells, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer for start indices failed!");
+
+  kernResetIntBuffer <<<fullBlocksPerGrid_cells, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer for end indices failed!");
+
+  // compute array and grid indices..
+  kernComputeIndices <<<fullBlocksPerGrid, blockSize>>>
+      (numObjects, gridSideCount,
+      gridMinimum, gridInverseCellWidth,
+      dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  // sort
+  thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
+  thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
+
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
+  checkCUDAErrorWithLine("thrust::sort_by_key failed!");
+
+  // compute start and end indices..
+  kernIdentifyCellStartEnd <<<fullBlocksPerGrid, blockSize>>> 
+      (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  // compute rearranged pos and vel buffers..
+  kernComputePosVelCoherent <<<fullBlocksPerGrid, blockSize>>> 
+      (numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_posCoh, dev_velCoh);
+  checkCUDAErrorWithLine("kernComputePosVelCoherent failed!");
+
+  // compute new velocity using grid (coherent kernel; argument order assumed to match the scattered one minus dev_particleArrayIndices)..
+  kernUpdateVelNeighborSearchCoherent <<<fullBlocksPerGrid, blockSize>>>
+      (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+      dev_gridCellStartIndices, dev_gridCellEndIndices,
+      dev_posCoh, dev_velCoh, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+  // update pos..
+  kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_posCoh, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // ping pong buffers..
+  std::swap(dev_pos, dev_posCoh);
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::endSimulation() {
   cudaFree(dev_vel1);
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
-
+  
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_posCoh);
+  cudaFree(dev_velCoh);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index fc1e870..cfbf89a 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,8 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
@@ -219,6 +219,8 @@ void initShaders(GLuint * program) {
     double fps = 0;
     double timebase = 0;
     int frame = 0;
+    double avgfps = 0;
+    int ctr = 0;
 
     Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                        // your CUDA development setup is ready to go.
@@ -235,6 +237,9 @@ void initShaders(GLuint * program) {
         frame = 0;
       }
 
+      avgfps += fps;
+      ctr++;
+
       runCUDA();
 
       std::ostringstream ss;
@@ -261,6 +266,7 @@ void initShaders(GLuint * program) {
     }
     glfwDestroyWindow(window);
     glfwTerminate();
+    std::cout << "Average FPS: " << avgfps/double(ctr) << std::endl;
   }