Performance Optimizations and Scalability Fixes #465

RohanExploit wants to merge 2 commits into main
Conversation
- Refactor `ThreadSafeCache` in `backend/cache.py` to use `OrderedDict` for O(1) LRU eviction.
- Add throttling (5s) to `AdaptiveWeights` in `backend/adaptive_weights.py` to reduce file system I/O.
- Limit spatial queries in `backend/routers/issues.py` to 100 results to improve scalability in dense areas.
- Add a database index for `integrity_hash` in `backend/models.py` and `backend/init_db.py`.
- Verified changes with unit tests.
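As a rough illustration of the `OrderedDict`-based LRU described in the first bullet, here is a minimal standalone sketch (simplified, hypothetical names; not the exact `backend/cache.py` code, which also holds an `RLock` around each operation):

```python
import time
from collections import OrderedDict

class TinyLRUCache:
    """Minimal LRU cache with TTL; entries are (value, timestamp) pairs."""

    def __init__(self, ttl: float = 300.0, max_size: int = 100):
        self._cache = OrderedDict()
        self._ttl = ttl
        self._max_size = max_size

    def get(self, key):
        if key not in self._cache:
            return None
        value, ts = self._cache[key]
        if time.time() - ts > self._ttl:
            del self._cache[key]  # expired entry, drop it
            return None
        self._cache.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = (value, time.time())
        while len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict least recently used
```

Both `get` and `set` are O(1) amortized, which is the point of replacing the old multi-dictionary bookkeeping.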
✅ Deploy Preview for fixmybharat canceled.
🙏 Thank you for your contribution, @RohanExploit!

PR Details:
Quality Checklist:
Review Process:
Note: The maintainers will monitor code quality and ensure the overall project flow isn't broken.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
📝 Walkthrough

Adds throttling to adaptive weights reload checks, refactors the cache to an OrderedDict LRU with embedded timestamps and a SimpleCache wrapper, creates an index on Issue.integrity_hash, and applies a 100-result limit to the spatial bounding-box queries.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
1 issue found across 5 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/routers/issues.py">
<violation number="1" location="backend/routers/issues.py:117">
P2: Limiting the unordered bounding-box query to 100 candidates can exclude closer issues, so the deduplication check may miss true nearby duplicates in dense clusters.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```diff
     Issue.longitude >= min_lon,
     Issue.longitude <= max_lon
-    ).all()
+    ).limit(100).all()
```
P2: Limiting the unordered bounding-box query to 100 candidates can exclude closer issues, so the deduplication check may miss true nearby duplicates in dense clusters.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/routers/issues.py, line 117:
<comment>Limiting the unordered bounding-box query to 100 candidates can exclude closer issues, so the deduplication check may miss true nearby duplicates in dense clusters.</comment>
<file context>
@@ -114,7 +114,7 @@ async def create_issue(
     Issue.longitude >= min_lon,
     Issue.longitude <= max_lon
-    ).all()
+    ).limit(100).all()
)
</file context>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/routers/issues.py (1)
101-118: ⚠️ Potential issue | 🟡 Minor

Unbounded: `LIMIT 100` without `ORDER BY` may exclude the closest issues in dense clusters.

The bounding-box query now caps at 100 rows, but without an `ORDER BY`, the database returns rows in an arbitrary order. In a dense area with >100 open issues within the bounding box, the 100 rows selected may not include the actual closest ones, defeating the purpose of the subsequent `find_nearby_issues` distance ranking.

Consider adding a lightweight proximity sort so the LIMIT retains the most relevant candidates:
Proposed fix

```diff
     ).filter(
         Issue.status == "open",
         Issue.latitude >= min_lat,
         Issue.latitude <= max_lat,
         Issue.longitude >= min_lon,
         Issue.longitude <= max_lon
-    ).limit(100).all()
+    ).order_by(
+        func.abs(Issue.latitude - latitude) + func.abs(Issue.longitude - longitude)
+    ).limit(100).all()
```

The same concern applies to the identical query on Line 325 in `get_nearby_issues` (where the user-specified radius can be up to 500m, making this more likely to matter).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/routers/issues.py` around lines 101 - 118, The bounding-box DB query uses LIMIT 100 without ORDER BY so arbitrary rows may be returned; modify the query in the block that builds the bounding-box filter (the lambda passed to run_in_threadpool that queries Issue.id,...Issue.status) to add a lightweight proximity ORDER BY before .limit(100) using a squared-distance expression computed from the requested latitude/longitude (e.g., (Issue.latitude - target_lat)**2 + (Issue.longitude - target_lon)**2) so the 100 candidates are the nearest ones; apply the same change to the identical query used in get_nearby_issues so both bounding-box queries consistently order by proximity before limiting.
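The reviewers' concern can be demonstrated without a database: when candidates are truncated before any ordering, the nearest point can be cut, while sorting by a cheap Manhattan-distance proxy first keeps it. A self-contained toy example (hypothetical coordinates, not app data):

```python
def top_candidates(rows, lat, lon, limit=100):
    """Sort by a cheap Manhattan-distance proxy, then cap the candidate set."""
    return sorted(rows, key=lambda r: abs(r[0] - lat) + abs(r[1] - lon))[:limit]

# 200 far-ish points that the database happens to return first,
# plus one near-duplicate that comes back last
rows = [(0.005 + i * 1e-6, 0.005) for i in range(200)]
rows.append((0.00001, 0.00001))  # the closest issue to the target (0, 0)

unordered_first_100 = rows[:100]             # LIMIT 100 with no ORDER BY: misses it
ordered_first_100 = top_candidates(rows, 0.0, 0.0)

assert (0.00001, 0.00001) not in unordered_first_100
assert ordered_first_100[0] == (0.00001, 0.00001)
```

The same idea expressed in SQL is the `ORDER BY ABS(latitude - :lat) + ABS(longitude - :lon)` the reviewers propose before `LIMIT 100`.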
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/cache.py`:
- Around line 41-63: The logger.debug call inside the Cache.set method uses an
f-string with no placeholders; change it to a regular string literal to remove
the unnecessary f-string. Locate the set(self, data: Any, key: str = "default")
method, and in the block that evicts LRU items after calling _cleanup_expired()
(where you call self._cache.popitem(last=False)), replace logger.debug(f"Evicted
LRU cache entry due to size limit") with a non-f-string logger.debug("Evicted
LRU cache entry due to size limit") so the log is correct and static analysis
warnings are resolved.
---
Outside diff comments:
In `@backend/routers/issues.py`:
- Around line 101-118: The bounding-box DB query uses LIMIT 100 without ORDER BY
so arbitrary rows may be returned; modify the query in the block that builds the
bounding-box filter (the lambda passed to run_in_threadpool that queries
Issue.id,...Issue.status) to add a lightweight proximity ORDER BY before
.limit(100) using a squared-distance expression computed from the requested
latitude/longitude (e.g., (Issue.latitude - target_lat)**2 + (Issue.longitude -
target_lon)**2) so the 100 candidates are the nearest ones; apply the same
change to the identical query used in get_nearby_issues so both bounding-box
queries consistently order by proximity before limiting.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/adaptive_weights.py
backend/cache.py
backend/init_db.py
backend/models.py
backend/routers/issues.py
```diff
     def set(self, data: Any, key: str = "default") -> None:
         """
-        Thread-safe set operation with memory management.
+        Thread-safe set operation.
+        Evicts least recently used item if cache is full.
         """
         with self._lock:
             current_time = time.time()
 
-            # Clean up expired entries before adding new one
-            self._cleanup_expired()
-
-            # If cache is full, evict least recently used entry
-            if len(self._data) >= self._max_size and key not in self._data:
-                self._evict_lru()
-
-            # Set new data atomically
-            self._data[key] = data
-            self._timestamps[key] = current_time
-            self._access_count[key] = 1
-
-            logger.debug(f"Cache set: key={key}, size={len(self._data)}")
+            # If updating existing key, move to end
+            if key in self._cache:
+                self._cache.move_to_end(key)
+
+            self._cache[key] = (data, current_time)
+
+            # Check size and evict if needed
+            # We first try to remove expired items if we are over limit
+            if len(self._cache) > self._max_size:
+                self._cleanup_expired()
+
+            # If still over limit, evict LRU (first item)
+            while len(self._cache) > self._max_size:
+                self._cache.popitem(last=False)
+                logger.debug(f"Evicted LRU cache entry due to size limit")
```
f-string without placeholders on Line 63.
Per the static analysis hint, the f-string on Line 63 has no interpolation — it should be a plain string.
Proposed fix

```diff
-                logger.debug(f"Evicted LRU cache entry due to size limit")
+                logger.debug("Evicted LRU cache entry due to size limit")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def set(self, data: Any, key: str = "default") -> None:
        """
        Thread-safe set operation.
        Evicts least recently used item if cache is full.
        """
        with self._lock:
            current_time = time.time()

            # If updating existing key, move to end
            if key in self._cache:
                self._cache.move_to_end(key)

            self._cache[key] = (data, current_time)

            # Check size and evict if needed
            # We first try to remove expired items if we are over limit
            if len(self._cache) > self._max_size:
                self._cleanup_expired()

            # If still over limit, evict LRU (first item)
            while len(self._cache) > self._max_size:
                self._cache.popitem(last=False)
                logger.debug("Evicted LRU cache entry due to size limit")
```
🧰 Tools
🪛 Ruff (0.15.2)
[error] 63-63: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/cache.py` around lines 41 - 63, The logger.debug call inside the
Cache.set method uses an f-string with no placeholders; change it to a regular
string literal to remove the unnecessary f-string. Locate the set(self, data:
Any, key: str = "default") method, and in the block that evicts LRU items after
calling _cleanup_expired() (where you call self._cache.popitem(last=False)),
replace logger.debug(f"Evicted LRU cache entry due to size limit") with a
non-f-string logger.debug("Evicted LRU cache entry due to size limit") so the
log is correct and static analysis warnings are resolved.
Pull request overview
This PR aims to optimize application performance and scalability through cache refactoring, file I/O throttling, query limits, and database indexing. The changes focus on reducing overhead in high-traffic scenarios and handling dense spatial clusters more gracefully.
Changes:
- Replaced ThreadSafeCache implementation from multi-dictionary approach to OrderedDict-based LRU cache
- Added 5-second throttling to AdaptiveWeights file reload checks to reduce I/O
- Limited spatial bounding box queries to 100 results to prevent performance degradation in dense clusters
- Added database index on integrity_hash column
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/cache.py | Refactored ThreadSafeCache to use OrderedDict for O(1) LRU operations, simplified internal tracking |
| backend/adaptive_weights.py | Added throttling mechanism to prevent excessive file system checks on high-traffic endpoints |
| backend/routers/issues.py | Added .limit(100) to spatial bounding box queries to cap candidates in dense clusters |
| backend/models.py | Added index=True to integrity_hash Column definition |
| backend/init_db.py | Added migration to create index on integrity_hash column |
Comments suppressed due to low confidence (2)
backend/cache.py:33
- The expiration check uses `>` (greater than) instead of `>=` (greater than or equal to). This means an item that has been cached for exactly TTL seconds is still considered valid, but should be considered expired. For consistent behavior with the TTL semantics, use `>=` to ensure items are expired when they reach or exceed the TTL.

  ```python
  if time.time() - timestamp > self._ttl:
  ```
backend/adaptive_weights.py:21
- The initial load in `__new__` (line 21) doesn't update `_last_check_time`, which means the first call to any getter method (e.g., `get_severity_keywords`) will immediately trigger another file system check via `_check_reload()`, bypassing the throttle. This causes two consecutive file reads on startup. To fix this, set `self._last_check_time = time.time()` at the end of `_load_weights()` or after line 21 in `__new__`.
```python
def __new__(cls):
    if cls._instance is None:
        cls._instance = super(AdaptiveWeights, cls).__new__(cls)
        cls._instance._load_weights()
```
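The suggested fix can be sketched as follows. The names mirror the snippet above, but the bodies are hypothetical, since the full `AdaptiveWeights` source is not shown here:

```python
import time

class AdaptiveWeightsSketch:
    """Hypothetical sketch of the suggested fix: record the check time during
    the initial load so the very first getter call is throttled too."""
    _instance = None
    CHECK_INTERVAL = 5.0  # seconds between file system checks

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._load_weights()
        return cls._instance

    def _load_weights(self):
        self.reads = getattr(self, "reads", 0) + 1  # stands in for the file read
        self._last_check_time = time.time()         # the suggested fix

    def _check_reload(self):
        now = time.time()
        if now - self._last_check_time < self.CHECK_INTERVAL:
            return  # throttled: skip the file system check
        self._last_check_time = now
        self._load_weights()

    def get_severity_keywords(self):
        self._check_reload()
        return {}  # placeholder; the real class returns loaded weights
```

With `_last_check_time` set inside `_load_weights()`, a getter called immediately after startup does not trigger a second read.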
```diff
             # If updating existing key, move to end
             if key in self._cache:
                 self._cache.move_to_end(key)
 
-            # Set new data atomically
-            self._data[key] = data
-            self._timestamps[key] = current_time
-            self._access_count[key] = 1
+            self._cache[key] = (data, current_time)
```
The move_to_end operation on line 51 is ineffective because the subsequent assignment on line 53 doesn't preserve the position where the key was moved to. In OrderedDict, when you assign to an existing key, it maintains its original position, not where you moved it to.
To correctly implement LRU behavior, you should either:
- Move the key to end AFTER setting the new value, or
- Delete the key first, then set it (which will add it at the end)
Recommended fix: Remove lines 49-51 and add self._cache.move_to_end(key) after line 53 to ensure the updated key is marked as most recently used.
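Whether the reorder is actually lost can be checked directly in CPython: assigning to an existing `OrderedDict` key preserves the key's current position, including a position set by an immediately preceding `move_to_end`:

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
od.move_to_end("a")   # order is now b, c, a
od["a"] = 99          # assignment to an existing key keeps its current position
assert list(od) == ["b", "c", "a"]  # "a" is still in the most-recently-used slot
```

So `move_to_end(key)` followed by `self._cache[key] = ...` does leave the key at the end; the recommendation above is worth verifying against this behavior before applying.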
```diff
@@ -114,7 +114,7 @@ async def create_issue(
     Issue.latitude <= max_lat,
     Issue.longitude >= min_lon,
     Issue.longitude <= max_lon
```
The query applies limit(100) without an ORDER BY clause, making the result set undefined when more than 100 issues exist in the bounding box. In dense clusters, this could miss the closest issues if they're not in the arbitrary 100 returned by the database. Consider ordering by distance approximation (e.g., ORDER BY ABS(latitude - target_lat) + ABS(longitude - target_lon)) or by created_at/id before applying the limit to ensure deterministic and more reasonable results.
```suggestion
    Issue.longitude <= max_lon
).order_by(
    func.abs(Issue.latitude - latitude) + func.abs(Issue.longitude - longitude)
```
```diff
@@ -322,7 +322,7 @@ def get_nearby_issues(
     Issue.latitude <= max_lat,
     Issue.longitude >= min_lon,
     Issue.longitude <= max_lon
```
The query applies limit(100) without an ORDER BY clause, making the result set undefined when more than 100 issues exist in the bounding box. In dense clusters, this could miss the closest issues if they're not in the arbitrary 100 returned by the database. Consider ordering by distance approximation (e.g., ORDER BY ABS(latitude - target_lat) + ABS(longitude - target_lon)) or by created_at/id before applying the limit to ensure deterministic and more reasonable results.
```suggestion
    Issue.longitude <= max_lon
).order_by(
    func.abs(Issue.latitude - latitude) + func.abs(Issue.longitude - longitude),
    Issue.created_at.desc()
```
```diff
     location = Column(String, nullable=True)
     action_plan = Column(JSONEncodedDict, nullable=True)
-    integrity_hash = Column(String, nullable=True)  # Blockchain integrity seal
+    integrity_hash = Column(String, nullable=True, index=True)  # Blockchain integrity seal
```
The index on integrity_hash appears unnecessary. All queries that access integrity_hash either:
- Select it in the column projection (doesn't benefit from an index)
- Order by Issue.id.desc() (uses the id index, not integrity_hash)
- Filter by Issue.id (uses the id index, not integrity_hash)
No queries filter or order by integrity_hash itself, so this index adds overhead to INSERT/UPDATE operations without providing any performance benefit. Consider removing this index unless there are planned queries that will filter or order by integrity_hash.
```suggestion
    integrity_hash = Column(String, nullable=True)  # Blockchain integrity seal
```
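The reviewer's claim, that an index only pays off for queries that filter (or order) on the column, can be checked with SQLite's query planner. This is a standalone demo with a throwaway schema, not the app's real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, integrity_hash TEXT)")
conn.execute("CREATE INDEX ix_issues_integrity_hash ON issues (integrity_hash)")

# Filtering by the indexed column uses the index ...
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM issues WHERE integrity_hash = ?", ("abc",)
).fetchall()
assert any("ix_issues_integrity_hash" in row[-1] for row in plan)

# ... but merely selecting the column while filtering on the primary key does not
plan2 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT integrity_hash FROM issues WHERE id = 1"
).fetchall()
assert not any("ix_issues_integrity_hash" in row[-1] for row in plan2)
```

If the app never filters or orders by `integrity_hash`, the index only adds write overhead, which is the reviewer's point.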
```python
    if not index_exists("issues", "ix_issues_integrity_hash"):
        conn.execute(text("CREATE INDEX IF NOT EXISTS ix_issues_integrity_hash ON issues (integrity_hash)"))
```
This migration adds an index on integrity_hash, but analysis of the codebase shows no queries that filter or order by integrity_hash. All queries use Issue.id for filtering and ordering (lines 174, 632 in routers/issues.py). This index adds INSERT/UPDATE overhead without providing query performance benefits. Consider removing this migration unless there are planned future queries that will leverage this index.
```diff
-    if not index_exists("issues", "ix_issues_integrity_hash"):
-        conn.execute(text("CREATE INDEX IF NOT EXISTS ix_issues_integrity_hash ON issues (integrity_hash)"))
```
```diff
 class ThreadSafeCache:
     """
-    Thread-safe cache implementation with TTL and memory management.
-    Fixes race conditions and implements proper cache expiration.
+    Thread-safe cache implementation with TTL and LRU eviction.
+    Uses collections.OrderedDict for O(1) LRU operations.
     """
 
     def __init__(self, ttl: int = 300, max_size: int = 100):
-        self._data = {}
-        self._timestamps = {}
-        self._ttl = ttl  # Time to live in seconds
-        self._max_size = max_size  # Maximum number of cache entries
-        self._lock = threading.RLock()  # Reentrant lock for thread safety
-        self._access_count = {}  # Track access frequency for LRU eviction
+        self._cache = collections.OrderedDict()  # Maps key -> (value, timestamp)
+        self._ttl = ttl
+        self._max_size = max_size
+        self._lock = threading.RLock()
 
     def get(self, key: str = "default") -> Optional[Any]:
         """
-        Thread-safe get operation with automatic cleanup.
+        Thread-safe get operation.
+        Removes item if expired. Updates LRU position if valid.
         """
         with self._lock:
-            current_time = time.time()
-
-            # Check if key exists and is not expired
-            if key in self._data and key in self._timestamps:
-                if current_time - self._timestamps[key] < self._ttl:
-                    # Update access count for LRU
-                    self._access_count[key] = self._access_count.get(key, 0) + 1
-                    return self._data[key]
-                else:
-                    # Expired entry - remove it
-                    self._remove_key(key)
-                    return None
-
-            return None
+            if key not in self._cache:
+                return None
+
+            value, timestamp = self._cache[key]
+
+            # Check expiration
+            if time.time() - timestamp > self._ttl:
+                del self._cache[key]
+                return None
+
+            # Move to end (most recently used)
+            self._cache.move_to_end(key)
+            return value
 
     def set(self, data: Any, key: str = "default") -> None:
         """
-        Thread-safe set operation with memory management.
+        Thread-safe set operation.
+        Evicts least recently used item if cache is full.
         """
         with self._lock:
             current_time = time.time()
 
-            # Clean up expired entries before adding new one
-            self._cleanup_expired()
-
-            # If cache is full, evict least recently used entry
-            if len(self._data) >= self._max_size and key not in self._data:
-                self._evict_lru()
-
-            # Set new data atomically
-            self._data[key] = data
-            self._timestamps[key] = current_time
-            self._access_count[key] = 1
-
-            logger.debug(f"Cache set: key={key}, size={len(self._data)}")
+            # If updating existing key, move to end
+            if key in self._cache:
+                self._cache.move_to_end(key)
+
+            self._cache[key] = (data, current_time)
+
+            # Check size and evict if needed
+            # We first try to remove expired items if we are over limit
+            if len(self._cache) > self._max_size:
+                self._cleanup_expired()
+
+            # If still over limit, evict LRU (first item)
+            while len(self._cache) > self._max_size:
+                self._cache.popitem(last=False)
+                logger.debug(f"Evicted LRU cache entry due to size limit")
```
The significant refactoring of ThreadSafeCache from a multi-dictionary approach to an OrderedDict-based LRU implementation lacks direct unit tests. The existing test (test_cache_update.py) only mocks cache methods and doesn't validate the new LRU eviction logic, TTL expiration, or the OrderedDict-based implementation. Consider adding tests that verify:
- LRU eviction occurs correctly when max_size is exceeded
- TTL expiration removes stale entries
- get() operations update the LRU order
- Thread safety under concurrent access
- Edge cases like setting when at capacity
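A sketch of such tests. `CacheUnderTest` below is a stand-in with the same `get`/`set` interface so the snippet runs standalone; real tests would instead import `ThreadSafeCache` from `backend.cache`:

```python
import time
from collections import OrderedDict

class CacheUnderTest:
    """Stand-in mirroring the new cache's interface (get/set, ttl, max_size)."""
    def __init__(self, ttl=300, max_size=100):
        self._cache, self._ttl, self._max_size = OrderedDict(), ttl, max_size

    def get(self, key="default"):
        if key not in self._cache:
            return None
        value, ts = self._cache[key]
        if time.time() - ts > self._ttl:
            del self._cache[key]
            return None
        self._cache.move_to_end(key)
        return value

    def set(self, data, key="default"):
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = (data, time.time())
        while len(self._cache) > self._max_size:
            self._cache.popitem(last=False)

def test_lru_eviction_respects_get_order():
    cache = CacheUnderTest(max_size=2)
    cache.set(1, key="a")
    cache.set(2, key="b")
    cache.get("a")         # "a" is now most recently used
    cache.set(3, key="c")  # must evict "b", not "a"
    assert cache.get("b") is None
    assert cache.get("a") == 1

def test_ttl_expiration():
    cache = CacheUnderTest(ttl=0.01)
    cache.set("value", key="k")
    time.sleep(0.05)
    assert cache.get("k") is None
```

Thread-safety tests would additionally hammer `set`/`get` from several threads and assert no exceptions and a size bound; those need the real lock-holding class.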
```diff
     Issue.longitude >= min_lon,
     Issue.longitude <= max_lon
-    ).all()
+    ).limit(100).all()
```
The PR description claims "Verifying all changes with targeted unit tests," but the spatial query limit of 100 lacks test coverage. There are no tests verifying behavior when more than 100 issues exist in a bounding box, which is the exact scenario this limit is meant to handle. Consider adding tests that:
- Create >100 issues in a dense cluster
- Verify the limit is applied
- Verify that deduplication still works reasonably well
- Document any limitations of the 100-issue cap
- Update `render.yaml` to set `PYTHONPATH` to `.` instead of `backend`.
- This ensures that `uvicorn` can correctly import `backend.main` when running from the repository root, resolving the `ModuleNotFoundError`.
Optimized application performance and scalability by:

- Replacing the `ThreadSafeCache` implementation with an `OrderedDict`-based true LRU cache.
- Throttling `AdaptiveWeights` file reloads to prevent excessive I/O on high-traffic endpoints.
- Indexing `integrity_hash` for faster lookups and verification.

PR created automatically by Jules for task 17956532901284540244 started by @RohanExploit
Summary by cubic
Improved backend performance and scalability with a true LRU cache, throttled weight reloads, capped spatial queries, and an index on integrity_hash. Fixed Render deployment by setting PYTHONPATH to '.' to resolve import errors.
Written for commit 32e9eab. Summary will update on new commits.
Summary by CodeRabbit