Description
I think it would be useful to give users a way to check what the GPUs in their Spark cluster are doing without turning on logging. A good place to start would be simply adding a new metrics source that records the metrics we want (e.g. number of GPU calls, how much data was transferred, the duration of a kernel, etc).
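To make the idea concrete, here is a minimal sketch of the kind of counters such a source could hold. It is plain Java with no Spark dependency; the class name `GpuMetrics` and its methods are hypothetical and are not taken from the diff. Wiring these values into Spark's metrics system is not shown.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory holder for the GPU metrics mentioned above. */
final class GpuMetrics {
    // LongAdder keeps updates cheap when many task threads record at once.
    private final LongAdder gpuCalls = new LongAdder();
    private final LongAdder bytesTransferred = new LongAdder();
    private final LongAdder kernelNanos = new LongAdder();

    /** Record one GPU call: bytes copied to/from the device and kernel time. */
    void recordCall(long bytes, long nanos) {
        gpuCalls.increment();
        bytesTransferred.add(bytes);
        kernelNanos.add(nanos);
    }

    long gpuCalls() { return gpuCalls.sum(); }

    long bytesTransferred() { return bytesTransferred.sum(); }

    /** Mean kernel duration in milliseconds, or 0.0 if nothing was recorded. */
    double meanKernelMillis() {
        long n = gpuCalls.sum();
        return n == 0 ? 0.0 : kernelNanos.sum() / 1e6 / n;
    }
}
```

A source registered with Spark could then expose these three values as gauges, assuming the metrics registry it uses supports that.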
I have a diff for this that builds and runs OK, and the metrics are printed to the console. I haven't done any work to get them showing in our Spark UI, and the diff is old: it is based on Spark 1.6.x with LBFGS.
Does anybody know of a better way of doing it? Ideally, instead of adding this to each of our GPU algorithms individually, we'd always write metrics whenever we use one of our GPU-enabled algorithms from Spark.
Perhaps we could override cudaMalloc and have a runOnGPU function that we always call, which records the metrics and then calls the correct method?
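As a rough illustration of the runOnGPU idea, the sketch below wraps an arbitrary GPU invocation, times it, and counts it before returning the result. Everything here is an assumption for discussion: the `GpuRunner` class, its method names, and the use of a `Supplier` to stand in for "the correct method". It works on the JVM side only and does not attempt to override cudaMalloc.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical single entry point that every GPU-enabled algorithm would call. */
final class GpuRunner {
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** Run the given GPU work, recording one call and its wall-clock duration. */
    <T> T runOnGpu(Supplier<T> gpuWork) {
        long start = System.nanoTime();
        try {
            return gpuWork.get();
        } finally {
            // Recorded even if the GPU work throws, so failed calls are counted too.
            calls.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    long calls() { return calls.get(); }

    long totalNanos() { return totalNanos.get(); }
}
```

With something like this in place, an algorithm such as LBFGS would pass its GPU call as a lambda, e.g. `runner.runOnGpu(() -> computeGradientOnGpu(data))`, where `computeGradientOnGpu` is a placeholder name. Note that the timing covers the whole wrapped call, not just the kernel itself.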
I'm happy to share the diff. It includes a GPUSource class, a modified SparkContext and SparkEnv, and changes to the LBFGS code to record our metrics.