fix: use all changelog timestamps to estimate package age #140
Code Review
This pull request updates the calculate_stability function in src/utils.rs to determine the oldest timestamp from changelog_times (falling back to buildtime) and adjusts the calculation of span_days depending on whether the oldest timestamp falls within the lookback period. The reviewer suggested a performance optimization to avoid unnecessary heap allocations by using iterator adapters to count relevant changes instead of allocating a vector.
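The two points in this summary can be sketched roughly as follows. This is not the code from `src/utils.rs`: the helper names `oldest_timestamp` and `count_recent_changes`, and the use of plain `u64` Unix seconds, are assumptions made for illustration only.

```rust
/// Oldest known timestamp for a package: the minimum over all changelog
/// entries, falling back to the build time when the changelog is empty.
/// (Hypothetical helper; timestamps are assumed to be Unix seconds.)
fn oldest_timestamp(changelog_times: &[u64], buildtime: u64) -> u64 {
    changelog_times.iter().copied().min().unwrap_or(buildtime)
}

/// Count changelog entries at or after `lookback_start` without building an
/// intermediate Vec: `filter` + `count` walks the slice once and allocates
/// nothing on the heap.
fn count_recent_changes(changelog_times: &[u64], lookback_start: u64) -> usize {
    changelog_times
        .iter()
        .filter(|&&t| t >= lookback_start)
        .count()
}
```

Only the number of recent entries is needed downstream, so collecting them into a vector first would be wasted work.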
Hmm, I'm not sure about this. Or at least it's not obvious to me. We purposely ignore changelog items older than a year because those old entries are no longer indicative of how frequently the package changes if we have newer information on hand. If, e.g., a package wasn't touched for two years and then there are two recent changes in the last month, that should weigh differently than those three events being evenly spread across the lookback period. I could imagine special-casing the scenario you describe, where there's only n=1 recent event in the lookback period, to still bias towards stability.
One thing we could do, which I think would probably be more appropriate, is to switch the lambda calculation from being an average to actually weighting the events based on their age, e.g. via exponential decay. But as always with all this, it needs to be data-driven so that we actually measure a noticeable improvement in packing performance and aren't just shooting in the dark.
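A minimal sketch of what such age-based weighting could look like, next to the plain average. Everything here is an assumption for illustration: the function names, the decay constant `tau_days`, and the choice to normalize by the integral of the decay kernel so that a constant update rate yields the same estimate as the plain average. It is not something the project implements.

```rust
/// Plain average rate: events inside the span divided by the span (per day).
fn average_rate(ages_days: &[f64], span_days: f64) -> f64 {
    let n = ages_days.iter().filter(|&&a| a <= span_days).count();
    n as f64 / span_days
}

/// Exponentially weighted rate (events per day).
/// `ages_days`: age of each changelog entry in days (0 = now).
/// `tau_days`: decay time constant, an assumed tuning parameter.
fn decayed_rate(ages_days: &[f64], span_days: f64, tau_days: f64) -> f64 {
    // Each event counts for exp(-age / tau) instead of 1.
    let weighted_events: f64 = ages_days
        .iter()
        .filter(|&&a| a <= span_days)
        .map(|&a| (-a / tau_days).exp())
        .sum();
    // Weighted exposure: the integral of exp(-a / tau) over [0, span].
    let weighted_span = tau_days * (1.0 - (-span_days / tau_days).exp());
    if weighted_span > 0.0 {
        weighted_events / weighted_span
    } else {
        0.0
    }
}
```

With a 365-day span, two changes at ages 5 and 20 days and two changes at ages 100 and 300 days get the same average rate, while the decayed rate is higher for the recent pair. A smaller `tau_days` reacts faster to recent activity but also makes the estimate noisier.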
Should I split this off into two separate PRs? I see your point about disregarding older changelog entries when calculating the lookback period, but I think the change to bin changelog entries by day is a less ambiguous improvement. In any case, I'll run some tests to see what difference these changes make.
Yes, please open a separate PR!
Okay, I opened a separate PR for the changelog binning: #143. Marking this PR as draft for now since it needs further testing.
So I just tested out this change (not including the changelog binning, which I moved to a separate PR) with the same dataset I looked at before, using the secureblue The difference in update sizes compared to the results with the current version of chunkah wasn't all that big, but average update sizes did decrease by about 2% for both daily updates and every-3-days updates. It's hard to tell whether this is statistically significant or just noise, but it at least suggests that this is probably a positive-or-neutral change.
In the stability calculation, `span_days` should cover the full lookback period if the oldest timestamp is earlier than the beginning of the lookback period. For example, if a package was only updated two years ago and again one month ago, the package is over a year old, so we're looking at one update in the past *year*, not just one update in the past *month* (as we would for a package with no changelog entries earlier than a month ago). The previous calculation underestimated stability of packages that had no updates for a long time, followed by a recent update.
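The arithmetic in this description can be sketched as below. This is a hedged illustration, not the actual `calculate_stability` implementation: the function names, the Unix-second `u64` timestamps, the `lookback_days` parameter, and the one-day floor on the span are all assumptions.

```rust
const SECS_PER_DAY: f64 = 86_400.0;

/// Days of history over which changes are counted.
fn span_days(oldest: u64, now: u64, lookback_days: f64) -> f64 {
    let age_days = now.saturating_sub(oldest) as f64 / SECS_PER_DAY;
    // A package older than the lookback window was observable for the whole
    // window; a younger one only since its oldest timestamp.
    age_days.min(lookback_days)
}

/// Average change rate (changes per day) within the lookback window.
fn change_rate(changelog_times: &[u64], buildtime: u64, now: u64, lookback_days: f64) -> f64 {
    // Oldest timestamp over *all* changelog entries, falling back to buildtime.
    let oldest = changelog_times.iter().copied().min().unwrap_or(buildtime);
    let lookback_start = now.saturating_sub((lookback_days * SECS_PER_DAY) as u64);
    let recent = changelog_times
        .iter()
        .filter(|&&t| t >= lookback_start)
        .count();
    // Assumed floor of one day to avoid dividing by zero for brand-new packages.
    recent as f64 / span_days(oldest, now, lookback_days).max(1.0)
}
```

With a 365-day lookback, a package updated two years ago and again one month ago gets a rate of 1/365 changes per day, whereas a package whose only entry is one month old gets 1/30.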
Okay, after the changelog binning commit was merged, I redid the above test to compare the effect of this change on top of changelog binning, and this time update sizes increased by 1-2%... so it's probably just noise. 😔 I'm not sure whether it would be better to use some sort of exponential decay weighting, so that more recent data is more influential with a smooth drop-off as the data ages, rather than the current approach of uniformly weighting over a range with a sharp cutoff. It might give more accurate estimates sooner for packages that genuinely change in how actively they are updated, but on the other hand, weighting recent changelog entries too heavily could make the stability score itself less stable over time, increasing the likelihood of packages jumping between stability tiers.
Nice, thanks for testing.
Yeah, that's a valid point. I'm open to changing the approach here more radically (even e.g. swap out Poisson for something else), assuming it yields better results. Two concerns there are (1) complexity and (2) overfitting to whatever distribution we're benchmarking against, e.g. Fedora (so ideally we cross-check against other distributions/ecosystems). Were you dissatisfied with some of the packing results BTW or just interested in optimizing things?
I think the current packing results are satisfactory in the sense that they're comparable to (for daily updates) or moderately better than (for less frequent updates) the results from build-chunked-oci. However, intuitively I feel like there's probably a fair bit of room for improvement. One idea that I've been thinking about is that, when I look at package data, there seem to be several natural groupings of packages that usually update together; for example, on Kinoite, there's the If there was some way to either automatically detect such groupings (some sort of correlation/clustering analysis on the changelog timestamps, perhaps?) or allow them to be specified via configuration, I suspect this could result in substantial improvements in layer reuse. However, any method of automatically detecting groupings would need to either be quite stable (to minimize how often the groupings change over time) or be accompanied by something along the lines of #39, which would prevent drift in the layer plan by reading a previous manifest.
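One very simple form such a correlation analysis could take is sketched below: bin each package's changelog timestamps by day and compare the resulting day sets with Jaccard similarity. This is only an illustration of the idea, not a proposal for the actual algorithm; the function names, the package names in the usage note, and the `threshold` parameter are hypothetical, and the sketch does nothing to address the stability concern raised above.

```rust
use std::collections::BTreeSet;

/// Bin Unix-second timestamps into distinct day numbers.
fn day_bins(times: &[u64]) -> BTreeSet<u64> {
    times.iter().map(|t| t / 86_400).collect()
}

/// Jaccard similarity of two packages' update days: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &BTreeSet<u64>, b: &BTreeSet<u64>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Pairs of package names whose update days overlap by at least `threshold`.
fn co_updating_pairs<'a>(
    pkgs: &[(&'a str, Vec<u64>)],
    threshold: f64,
) -> Vec<(&'a str, &'a str)> {
    let bins: Vec<BTreeSet<u64>> = pkgs.iter().map(|(_, t)| day_bins(t)).collect();
    let mut pairs = Vec::new();
    for i in 0..pkgs.len() {
        for j in (i + 1)..pkgs.len() {
            if jaccard(&bins[i], &bins[j]) >= threshold {
                pairs.push((pkgs[i].0, pkgs[j].0));
            }
        }
    }
    pairs
}
```

Given three made-up packages where `pkg-a` and `pkg-b` were both updated on days 1, 5 and 9 and `pkg-c` on days 2 and 7, a threshold of 0.8 pairs only `pkg-a` with `pkg-b`. Turning pairs into groups (for example by merging connected pairs) and keeping those groups stable between builds are left open here.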
Also optimize stability computation by avoiding unnecessary allocations.