The test cases say all values should be 11, but the problem statements say:
Puzzle 7: "Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions."
(I'm assuming this means the same as Puzzle 6, which is to add 10 to each position.)
Puzzle 8: "Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a."
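
To make my reading of the Puzzle 8 statement concrete, here is a rough pure-Python sketch. It only simulates the block/thread grid with loops; it is not the puzzle's actual API and it does not run on a GPU. The name `add_ten_blocked` and the `threads_per_block` parameter are made up for illustration, and I'm assuming `a` is filled with ones, which would explain an expected value of 11 everywhere.

```python
import math


def add_ten_blocked(a, threads_per_block):
    """Simulate a 1D blocked kernel computing out[i] = a[i] + 10.

    Hypothetical helper: loops stand in for GPU blocks and threads.
    """
    size = len(a)
    out = [0] * size
    # One block is too small to cover a, so launch enough blocks to cover it.
    num_blocks = math.ceil(size / threads_per_block)
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            # Global position = block offset + position within the block.
            i = block_idx * threads_per_block + thread_idx
            # Guard: threads in the last block may fall past the end of a.
            if i < size:
                out[i] = a[i] + 10
    return out


a = [1] * 9  # assumption: the test input is all ones
print(add_ten_blocked(a, threads_per_block=4))
```

Under that assumption every output element comes out as 11, which would match the test cases.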
BTW I'm having fun and learning, thanks for this :)