Skip to content

feat: expose scan stats to spark sql metrics#556

Merged
yanghua merged 3 commits into
lance-format:mainfrom
fangbo:add-scan-metrics
Jun 23, 2026
Merged

feat: expose scan stats to spark sql metrics#556
yanghua merged 3 commits into
lance-format:mainfrom
fangbo:add-scan-metrics

Conversation

@fangbo

@fangbo fangbo commented May 25, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions github-actions Bot added the enhancement New feature or request label May 25, 2026
@fangbo fangbo force-pushed the add-scan-metrics branch from 8f5367a to 696d50a Compare May 25, 2026 02:21
@fangbo

fangbo commented May 28, 2026

Copy link
Copy Markdown
Collaborator Author

@yanghua Could you please take a look at it ?

@fangbo fangbo requested a review from yanghua June 2, 2026 03:29
@yanghua

yanghua commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Got it! Will review it later.

@yanghua yanghua left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

}
}
if (fragmentReader != null) {
metricsTracker.addScanStats(fragmentReader.getScanStats());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases, this logic may cause duplicated accumulation. Because the next() method may be re-called in some scenarios. So, it would be better to accumulate it when fragmentReader#close(), or do something like this:

private boolean currentReaderStatsAdded = false;

while (...) {
    if (fragmentReader != null) {
        if (!currentReaderStatsAdded) {
            metricsTracker.addScanStats(fragmentReader.getScanStats());
            currentReaderStatsAdded = true;
        }
        fragmentReader.close();
    }
    fragmentReader = LanceFragmentColumnarBatchScanner.create(...);
    currentReaderStatsAdded = false;
    ...
}
if (fragmentReader != null && !currentReaderStatsAdded) {
    metricsTracker.addScanStats(fragmentReader.getScanStats());
    currentReaderStatsAdded = true;
}
return false;

}

@Override
public void close() throws IOException {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we also need to accumulate the metrics here to avoid the missing? We can combine with the currentReaderStatsAdded (in the front comment) flag to avoid duplicate accumulation.


// Test new metrics
assertTrue(
metrics.getOrDefault(LanceCustomMetrics.NUM_IOPS, 0L) >= 0, "numIops should be >= 0");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May these assertions not work? Because the default value is 0, right?

@@ -31,7 +35,7 @@ public abstract class BaseLanceCustomMetricsTest {
@Test
void testAllMetricsReturnsSixMetrics() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Six is not correct now, right?

Comment thread docs/src/performance.md Outdated
| `numFragmentsScanned` | counter | Lance fragments actually opened by this task. Compare against the table fragment count to verify pruning is working. |
| `numBatchesLoaded` | counter | Arrow batches returned from the JNI scanner. |
| `numRowsScanned` | counter | Rows read from storage before filter evaluation. Pair with Spark's built-in `numOutputRows` to compute filter selectivity (`numOutputRows / numRowsScanned`). |
| `numIops` | counter | Number of I/O operations performed during scanning. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of I/O operations performed by the scanner.

Comment thread docs/src/performance.md Outdated
| `numRowsScanned` | counter | Rows read from storage before filter evaluation. Pair with Spark's built-in `numOutputRows` to compute filter selectivity (`numOutputRows / numRowsScanned`). |
| `numIops` | counter | Number of I/O operations performed during scanning. |
| `numRequests` | counter | Number of requests made to the storage layer. |
| `numBytesRead` | counter | Total bytes read from storage. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total bytes read from storage by the scanner.

@fangbo fangbo force-pushed the add-scan-metrics branch from b3c7a1f to 99742a5 Compare June 22, 2026 08:25
@fangbo

fangbo commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

@yanghua I have fixed the CR comments. Could you please look at it again? Thanks a lot.

@yanghua yanghua left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @fangbo thanks

@yanghua yanghua merged commit 804dfc4 into lance-format:main Jun 23, 2026
17 checks passed
@fangbo fangbo deleted the add-scan-metrics branch June 23, 2026 11:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants