docs: add RFC for Query Results Cache #60
Conversation
c73508e to be988ea
be988ea to bf5ec2e
> 1. Compute canonical plan hash.
> 2. Check `TempStorage.exists()` on the metadata handle for the key.
> 3. On exists: read and deserialize `QueryResultsCacheEntry` from metadata file. Validate: schema match (column names + types vs current `OutputNode`), expiration check, whole-result `resultHash` check.
Does output column order participate in cache validation? The RFC mentions validating column names + types, but it may help to explicitly mention column ordering as well.
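A schema check that makes ordering part of the cache-hit criteria could be as simple as a positional comparison. This is a minimal sketch only; the class and record names are hypothetical, not the RFC's API:

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: schema validation that is sensitive to column order.
// Comparing the two column lists positionally (rather than as sets) means a
// reordered projection is treated as a cache miss.
public final class SchemaValidator
{
    public record Column(String name, String type) {}

    public static boolean schemasMatch(List<Column> cached, List<Column> current)
    {
        if (cached.size() != current.size()) {
            return false;
        }
        for (int i = 0; i < cached.size(); i++) {
            if (!Objects.equals(cached.get(i), current.get(i))) {
                return false; // name or type differs at position i
            }
        }
        return true;
    }
}
```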
> - A running byte counter tracks total result size. If it exceeds `max-result-size`, the tee-write is abandoned and partial files are cleaned up.
> - On successful `SELECT` completion, the `QueryResultsCacheWriter` collects the accumulated `TempStorageHandle` references, builds a `QueryResultsCacheEntry`, and stores metadata in the cache provider.
> - If the query fails or is cancelled, partial cache writes are discarded.
How are orphaned/partial cache objects cleaned up? Or cache write starts only after query is completely FINISHED?
For example, if page uploads succeed but metadata.json write fails, or the coordinator crashes during cache population. Is cleanup purely TTL-based?
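The size-capped tee-write described in the quoted hunk can be sketched as a stream wrapper that abandons the side copy once the running byte count exceeds the limit, leaving the client-bound path untouched. Class and method names here are illustrative, assuming a byte-stream view of the output:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the size-capped tee-write: bytes are copied to the
// cache as a side effect of normal output, and the copy is abandoned once
// the running byte count exceeds the configured max-result-size.
public final class CappedTeeOutputStream extends OutputStream
{
    private final OutputStream primary; // normal client-bound output
    private OutputStream cacheCopy;     // side copy; null once abandoned
    private final long maxResultSize;
    private long bytesWritten;

    public CappedTeeOutputStream(OutputStream primary, OutputStream cacheCopy, long maxResultSize)
    {
        this.primary = primary;
        this.cacheCopy = cacheCopy;
        this.maxResultSize = maxResultSize;
    }

    @Override
    public void write(int b) throws IOException
    {
        primary.write(b); // the client path is never affected by the cache
        bytesWritten++;
        if (cacheCopy != null) {
            if (bytesWritten > maxResultSize) {
                cacheCopy.close(); // a real impl would also delete partial files
                cacheCopy = null;  // abandon the tee
            }
            else {
                cacheCopy.write(b);
            }
        }
    }

    public boolean cacheAbandoned()
    {
        return cacheCopy == null;
    }
}
```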
> 3. On exists: read and deserialize `QueryResultsCacheEntry` from metadata file. Validate: schema match (column names + types vs current `OutputNode`), expiration check, whole-result `resultHash` check.
> 4. On hit: `serveCachedResult()` — read `SerializedPage` objects from `TempStorage`, verify per-page CRC, enqueue into `OutputBuffer`, transition to finishing. Client sees no difference from normal execution.
> 5. On miss: register query ID + cache key for population on completion, proceed with normal scheduling.
How is resource usage accounted for on cache hits? Since cached queries bypass tasks/operators/exchanges:
- Are they still counted against Resource Groups/concurrency limits?
- How are coordinator CPU/network/memory costs tracked?
On similar lines, what are the operator-level metrics that will show up for such a query?
To me, a cached query result is akin to rewriting the query to an MV or a (materialized) `CTEProducerNode`, so I would expect to see the plan + metrics related to reading data from such a proxy source node. Maybe we can call it a `CachedQueryResults` plan node?
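The lookup and validation steps quoted above can be sketched end to end. This uses an in-memory map as a stand-in for `TempStorage`, and `Arrays.hashCode` as a stand-in for the real whole-result hash; all names are illustrative, not the RFC's API:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;

// Minimal sketch of steps 1-5: existence check, validation, hit/miss decision.
public final class CacheLookup
{
    public record Entry(String schemaSignature, long expiresAtMillis, int resultHash, byte[] pages) {}

    public static Optional<byte[]> lookup(
            Map<String, Entry> cache,
            String planHash,               // step 1: canonical plan hash (computed upstream)
            String currentSchemaSignature,
            long nowMillis)
    {
        Entry entry = cache.get(planHash); // step 2: existence check
        if (entry == null) {
            return Optional.empty();       // step 5: miss -> register for population, schedule normally
        }
        // step 3: validate schema, expiration, and whole-result hash
        if (!entry.schemaSignature().equals(currentSchemaSignature)
                || nowMillis >= entry.expiresAtMillis()
                || entry.resultHash() != Arrays.hashCode(entry.pages())) {
            return Optional.empty();       // any validation failure is treated as a miss
        }
        return Optional.of(entry.pages()); // step 4: hit -> serve cached pages
    }
}
```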
> CRCs are computed over the **encrypted** page bytes, not the plaintext. Integrity and confidentiality are therefore independent: corruption is detectable without decrypting the page first, and the CRC does not leak plaintext structure.
> Any integrity check failure (per-page or whole-result) is treated identically to a cache miss. The query proceeds to normal execution and the corrupted entry is overwritten on completion via the normal write path. Failures are emitted as a metric (`cache.integrity_failure_count`) for visibility.
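The CRC-over-ciphertext property quoted above can be illustrated with a small sketch. The AES-GCM choice and all names here are illustrative assumptions, not mandated by the RFC:

```java
import java.security.SecureRandom;
import java.util.zip.CRC32;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Sketch: per-page CRC computed over the *encrypted* bytes. Corruption is
// detectable without decrypting, and the CRC reveals nothing about plaintext.
public final class PageChecksum
{
    public record StoredPage(byte[] ciphertext, long crc) {}

    public static StoredPage encryptAndChecksum(byte[] plaintextPage, SecretKey key) throws Exception
    {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintextPage);

        CRC32 crc = new CRC32();
        crc.update(ciphertext); // checksum the ciphertext, not the plaintext
        return new StoredPage(ciphertext, crc.getValue());
    }

    public static boolean verify(StoredPage page)
    {
        CRC32 crc = new CRC32();
        crc.update(page.ciphertext()); // no key needed to detect corruption
        return crc.getValue() == page.crc();
    }
}
```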
Do we plan to expose cache-hit observability/metrics?
For example:
- servedFromResultCache
- EventListener visibility
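One hypothetical shape for the observability asked about here: a per-query flag plus counters that an `EventListener` could surface. None of these names exist in Presto today; only `cache.integrity_failure_count` is mentioned by the RFC:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative counters for cache-hit observability (names are assumptions).
public final class ResultCacheStats
{
    private final AtomicLong hitCount = new AtomicLong();
    private final AtomicLong missCount = new AtomicLong();
    private final AtomicLong integrityFailureCount = new AtomicLong();

    public void recordHit() { hitCount.incrementAndGet(); }

    public void recordMiss() { missCount.incrementAndGet(); }

    public void recordIntegrityFailure()
    {
        integrityFailureCount.incrementAndGet(); // cache.integrity_failure_count
        missCount.incrementAndGet();             // integrity failure is treated as a miss
    }

    public double hitRate()
    {
        long total = hitCount.get() + missCount.get();
        return total == 0 ? 0.0 : (double) hitCount.get() / total;
    }
}
```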
> 1. Compute canonical plan hash.
> 2. Check `TempStorage.exists()` on the metadata handle for the key.
> 3. On exists: read and deserialize `QueryResultsCacheEntry` from metadata file. Validate: schema match (column names + types vs current `OutputNode`), expiration check, whole-result `resultHash` check.
> 4. On hit: `serveCachedResult()` — read `SerializedPage` objects from `TempStorage`, verify per-page CRC, enqueue into `OutputBuffer`, transition to finishing. Client sees no difference from normal execution.
What query states will cache-hit queries go through? Since execution is bypassed, will queries still transition through RUNNING as expected by the UI/client protocol?
> ## Future Work
> - **Predicate stitching**: Partial cache reuse for partition-decomposable queries when only some partitions change. Requires per-partition stats.
+1. I think we could extend the idea to do sub-plan matching, similar to how we're exploring MV matching
For purging the cache, I'm wondering whether we have an approach in mind or will just delete the cache directory on restart. Also wondering what happens if local temp storage runs out of capacity, since many query results will be stored there.
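One possible answer to the purge question is a periodic TTL sweep; a size-based eviction pass could follow the same shape when local temp storage nears capacity. This is a sketch over an in-memory map, with all names illustrative:

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical TTL-based purge: drop entries past their expiration.
public final class CachePurger
{
    public record Entry(long expiresAtMillis, long sizeBytes) {}

    // Removes expired entries; returns the number of bytes reclaimed.
    public static long purgeExpired(Map<String, Entry> cache, long nowMillis)
    {
        long reclaimed = 0;
        Iterator<Map.Entry<String, Entry>> it = cache.entrySet().iterator();
        while (it.hasNext()) {
            Entry e = it.next().getValue();
            if (nowMillis >= e.expiresAtMillis()) {
                reclaimed += e.sizeBytes();
                it.remove(); // a real impl would also delete the TempStorage files
            }
        }
        return reclaimed;
    }
}
```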
Abstract
This RFC proposes a Query Results Cache for Presto that caches completed `SELECT` query results so semantically equivalent future queries skip execution entirely. The cache intercepts at `SqlQueryExecution.start()` after optimization, before scheduling. On a cache hit, pages are read from `TempStorage` and fed directly to the client via the existing `OutputBuffer → ExchangeClient → HTTP` response pipeline, bypassing scheduling and execution.