1.2.1 rc #88
base: main
Conversation
…cation in WebSecurityConfig
Bugfix/analysis issues
📝 Walkthrough

Adds pre-acquired analysis locking to webhook handlers and processors, atomic upserts for comment rate limits, token-limit estimation, expanded previous-issue/resolution metadata across AI flows, a deterministic RAG context endpoint and Qdrant index manager, a Tree-sitter AST-based splitter, various RAG/indexing managers, and assorted logging, DTO, and API adjustments.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Webhook as WebhookHandler
participant LockSvc as AnalysisLockService
participant Vcs as VcsService
participant Processor as PullRequestAnalysisProcessor
Webhook->>LockSvc: acquireLock(PR_ANALYSIS, project, sourceBranch, commitHash, prId)
alt lock NOT acquired
LockSvc-->>Webhook: empty
Webhook-->>Webhook: log duplicate
Webhook-->>Webhook: return IGNORED
else lock acquired
LockSvc-->>Webhook: lockKey
Webhook->>Vcs: post placeholder comment (protected by lock)
Webhook->>Processor: submit PrProcessRequest (includes preAcquiredLockKey)
Processor->>LockSvc: (processor may use existing lockKey or acquire/release as needed)
Processor-->>Vcs: update placeholder / post results
Processor-->>Webhook: completion
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
/codecrow analyze
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java`:
- Around line 125-133: Replace the non-atomic isLocked() check in
GitHubPullRequestWebhookHandler with an actual lock acquisition via
analysisLockService.acquireLockWithWait(...) before calling
postPlaceholderComment so only one webhook can post the placeholder; if
acquireLockWithWait returns empty, log and return WebhookResult.ignored as
before, otherwise enter a try/finally where you call postPlaceholderComment
(store placeholderCommentId) and proceed with analysis, and always call
analysisLockService.releaseLock(lockKey) in finally to ensure the lock is
released.
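To make the suggested flow concrete, here is a minimal Java sketch of acquire-before-post with a guaranteed release. The AnalysisLockService and VcsService interfaces below are assumptions for illustration; the real signatures in this repository may differ.

import java.util.Optional;

// Hedged sketch only: acquire the lock atomically instead of checking isLocked() first,
// then release in finally so a failed placeholder post cannot leak the lock.
public class LockedPlaceholderSketch {

    interface AnalysisLockService {
        Optional<String> acquireLockWithWait(Long projectId, String branch, String lockType);
        void releaseLock(String lockKey);
    }

    interface VcsService {
        long postPlaceholderComment(Long projectId, long prId);
    }

    private final AnalysisLockService analysisLockService;
    private final VcsService vcsService;

    LockedPlaceholderSketch(AnalysisLockService locks, VcsService vcs) {
        this.analysisLockService = locks;
        this.vcsService = vcs;
    }

    String handle(Long projectId, String sourceBranch, long prId) {
        // Atomic acquisition closes the TOCTOU window between check and post.
        Optional<String> lockKey = analysisLockService.acquireLockWithWait(projectId, sourceBranch, "PR_ANALYSIS");
        if (lockKey.isEmpty()) {
            return "IGNORED"; // another webhook already holds the lock
        }
        try {
            long placeholderCommentId = vcsService.postPlaceholderComment(projectId, prId);
            // ... submit the analysis request, passing placeholderCommentId ...
            return "ACCEPTED";
        } finally {
            analysisLockService.releaseLock(lockKey.get()); // always released, even on failure
        }
    }
}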
♻️ Duplicate comments (1)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java (1)
112-119: Same TOCTOU lock-check gap as in the GitHub handler.

See the earlier note: use an atomic lock acquisition rather than `isLocked()` alone.
...g/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
Outdated
/codecrow analyze
fix: Implement atomic upsert for command rate limiting to prevent rac…
/codecrow analyze
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/CommentCommandRateLimitService.java (1)
68-70: Fix race condition: `OffsetDateTime.now()` called twice.

The `windowStart` calculation calls `OffsetDateTime.now()` twice on different lines. Between calls, time may advance across a minute or hour boundary, resulting in an inconsistent window calculation. For example, if the first `now()` returns 10:59:59.999 and the second returns 11:00:00.001:
- `truncatedTo(HOURS)` yields 10:00:00
- `getMinute()` yields 0
- Result is 10:00:00 when it should be 11:00:00

🐛 Proposed fix: capture now() once

- OffsetDateTime windowStart = OffsetDateTime.now()
-         .truncatedTo(ChronoUnit.HOURS)
-         .plus((OffsetDateTime.now().getMinute() / windowMinutes) * windowMinutes, ChronoUnit.MINUTES);
+ OffsetDateTime now = OffsetDateTime.now();
+ OffsetDateTime windowStart = now
+         .truncatedTo(ChronoUnit.HOURS)
+         .plus((now.getMinute() / windowMinutes) * windowMinutes, ChronoUnit.MINUTES);
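For reference, a tiny standalone Java example of the corrected window arithmetic with a single now() capture. The windowMinutes value of 15 is illustrative, not taken from the project configuration.

import java.time.OffsetDateTime;
import java.time.temporal.ChronoUnit;

// Demo only: capturing now() once keeps the truncation and the minute-bucket math consistent.
public class RateLimitWindowDemo {
    public static void main(String[] args) {
        int windowMinutes = 15; // assumed window size
        OffsetDateTime now = OffsetDateTime.now();
        OffsetDateTime windowStart = now
                .truncatedTo(ChronoUnit.HOURS)
                .plus((long) (now.getMinute() / windowMinutes) * windowMinutes, ChronoUnit.MINUTES);
        System.out.println(now + " falls into window starting at " + windowStart);
    }
}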
🤖 Fix all issues with AI agents
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/analysis/CommentCommandRateLimitRepository.java`:
- Around line 39-51: The upsertCommandCount method in
CommentCommandRateLimitRepository issues an INSERT ... ON CONFLICT against table
comment_command_rate_limit using columns (project_id, window_start) and columns
window_start/command_count which do not exist in the current V15 migration; fix
by either updating the entity/repository to match the V15 schema (use pr_id,
vcs_username, command_name, executed_at/execution_date and change the ON
CONFLICT target and inserted/updated columns accordingly) or by adding a new
migration that creates/alters the comment_command_rate_limit table with
project_id, window_start, command_count and a unique constraint on (project_id,
window_start) so the upsertCommandCount method will succeed.
@Modifying
@Query(value = """
        INSERT INTO comment_command_rate_limit (project_id, window_start, command_count, last_command_at)
        VALUES (:projectId, :windowStart, 1, NOW())
        ON CONFLICT (project_id, window_start)
        DO UPDATE SET
            command_count = comment_command_rate_limit.command_count + 1,
            last_command_at = NOW()
        """, nativeQuery = true)
void upsertCommandCount(
        @Param("projectId") Long projectId,
        @Param("windowStart") OffsetDateTime windowStart
);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the entity class definition
find . -type f -name "*.java" | xargs grep -l "class CommentCommandRateLimit" | head -5
Repository: rostilos/CodeCrow
Length of output: 431
🏁 Script executed:
#!/bin/bash
# Search for migration files related to comment_command_rate_limit table
find . -type f \( -name "*.sql" -o -name "*.xml" -o -name "*.yml" -o -name "*.yaml" \) | xargs grep -l "comment_command_rate_limit" 2>/dev/null | head -10
Repository: rostilos/CodeCrow
Length of output: 273
🏁 Script executed:
#!/bin/bash
# Search for unique constraint definitions related to project_id and window_start
rg -n "unique.*project.*window|@UniqueConstraint.*project|ON CONFLICT.*project_id.*window_start" --type java --type sql 2>/dev/nullRepository: rostilos/CodeCrow
Length of output: 1469
🏁 Script executed:
#!/bin/bash
# Find the directory structure of the repository to understand schema/migration setup
ls -la | head -20
find . -type d -name "*migration*" -o -type d -name "*schema*" -o -type d -name "*db*" | head -10
Repository: rostilos/CodeCrow
Length of output: 973
🏁 Script executed:
#!/bin/bash
# Read the entity class to see the constraint definition
cat ./java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/analysis/CommentCommandRateLimit.java
Repository: rostilos/CodeCrow
Length of output: 2432
🏁 Script executed:
#!/bin/bash
# Read the migration file that creates the table
cat ./java-ecosystem/libs/core/src/main/resources/db/migration/old/V15__create_comment_command_rate_limit_table.sql
Repository: rostilos/CodeCrow
Length of output: 4633
🏁 Script executed:
#!/bin/bash
# Read the migration file that adds indexes
cat ./java-ecosystem/libs/core/src/main/resources/db/migration/old/V17__add_comment_commands_indexes.sql
Repository: rostilos/CodeCrow
Length of output: 3548
🏁 Script executed:
#!/bin/bash
# List all migration files to see the complete migration history
find ./java-ecosystem/libs/core/src/main/resources/db/migration -type f -name "*.sql" | sort
Repository: rostilos/CodeCrow
Length of output: 4729
🏁 Script executed:
#!/bin/bash
# Search for any migration that mentions window_start or the correct schema
rg -l "window_start" ./java-ecosystem/libs/core/src/main/resources/db/migration/ 2>/dev/nullRepository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Search for any migration that creates or alters comment_command_rate_limit table
rg -A 30 "CREATE TABLE.*comment_command_rate_limit|ALTER TABLE.*comment_command_rate_limit" ./java-ecosystem/libs/core/src/main/resources/db/migration/ 2>/dev/nullRepository: rostilos/CodeCrow
Length of output: 4502
🏁 Script executed:
#!/bin/bash
# Search for any ALTER TABLE that modifies comment_command_rate_limit after V15
rg -B 2 -A 15 "ALTER TABLE.*comment_command_rate_limit" ./java-ecosystem/libs/core/src/main/resources/db/migration/ 2>/dev/null
Repository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
#!/bin/bash
# Check all migrations V16 and onward for any schema changes to comment_command_rate_limit
for file in ./java-ecosystem/libs/core/src/main/resources/db/migration/old/V{16,17,18,19,20,21,22}*.sql; do
if [ -f "$file" ]; then
echo "=== $file ==="
grep -i "comment_command_rate_limit\|window_start\|command_count" "$file" || echo "(no match)"
fi
done
Repository: rostilos/CodeCrow
Length of output: 1042
🏁 Script executed:
#!/bin/bash
# Search entire codebase for references to the entity to understand if it's actually used
rg "CommentCommandRateLimit" --type java | head -20Repository: rostilos/CodeCrow
Length of output: 4465
The repository method upsertCommandCount targets a schema that doesn't exist in the database and will fail at runtime.
The entity class defines window_start and command_count columns with a unique constraint on (project_id, window_start), but the actual database migration (V15) creates the table with entirely different columns: pr_id, vcs_username, command_name, executed_at, execution_date. The unique constraint referenced in the ON CONFLICT clause does not exist in the database schema. This mismatch between the entity definition and the database migration will cause the upsert operation to fail when executed.
Either the entity schema must be aligned with the V15 migration, or a new migration must be created to restructure the table to match the entity definition.
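To visualize the second option, here is a hedged JPA sketch of an entity whose unique constraint matches the upsert's ON CONFLICT (project_id, window_start) target. Apart from the table and column names quoted above, every name here is illustrative and not the project's actual class or migration.

import jakarta.persistence.*;
import java.time.OffsetDateTime;

// Illustrative only: shows the unique constraint the native upsert relies on.
@Entity
@Table(
        name = "comment_command_rate_limit",
        uniqueConstraints = @UniqueConstraint(
                name = "uq_rate_limit_project_window",
                columnNames = {"project_id", "window_start"} // required by ON CONFLICT (project_id, window_start)
        )
)
public class CommentCommandRateLimitSketch {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "project_id", nullable = false)
    private Long projectId;

    @Column(name = "window_start", nullable = false)
    private OffsetDateTime windowStart;

    @Column(name = "command_count", nullable = false)
    private int commandCount;

    @Column(name = "last_command_at")
    private OffsetDateTime lastCommandAt;
}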
🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/analysis/CommentCommandRateLimitRepository.java`
around lines 39 - 51, The upsertCommandCount method in
CommentCommandRateLimitRepository issues an INSERT ... ON CONFLICT against table
comment_command_rate_limit using columns (project_id, window_start) and columns
window_start/command_count which do not exist in the current V15 migration; fix
by either updating the entity/repository to match the V15 schema (use pr_id,
vcs_username, command_name, executed_at/execution_date and change the ON
CONFLICT target and inserted/updated columns accordingly) or by adding a new
migration that creates/alters the comment_command_rate_limit table with
project_id, window_start, command_count and a unique constraint on (project_id,
window_start) so the upsertCommandCount method will succeed.
…ted before alias creation
Bugfix/analysis issues
/codecrow analyze
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java`:
- Around line 113-131: After acquireLock(...) returns earlyLock in
BitbucketCloudPullRequestWebhookHandler, wrap the post-lock work (placeholder
posting and invoking PullRequestAnalysisProcessor.process(...)) in a try/finally
so that if an exception prevents calling processor.process(...) you call
analysisLockService.releaseLock(earlyLock.get()) in the finally; ensure you do
NOT release the lock when processor.process(...) was successfully invoked (since
PullRequestAnalysisProcessor.process() will release it), so track a boolean like
processorInvoked to decide whether to release the earlyLock in the finally.
♻️ Duplicate comments (1)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java (1)
184-195: Same lock leak concern as the Bitbucket handler.

The catch block doesn't release the acquired lock if an exception occurs after `acquireLock()` succeeds but before `process()` completes. See the review comment on `BitbucketCloudPullRequestWebhookHandler` for the detailed analysis and suggested fix.
🧹 Nitpick comments (1)
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py (1)
538-546: Improve exception handling for better debugging.

The migration logic is correct, but the exception handling can be improved:
- Use `logger.exception` instead of `logger.error` to include the traceback automatically.
- Chain the exception with `raise ... from del_err` to preserve the original cause.

♻️ Suggested improvement
 if is_direct_collection:
     logger.info(f"Migrating from direct collection to alias-based indexing. Deleting old collection: {alias_name}")
     try:
         self.qdrant_client.delete_collection(alias_name)
     except Exception as del_err:
-        logger.error(f"Failed to delete old direct collection before alias swap: {del_err}")
-        raise Exception(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted: {del_err}")
+        logger.exception(f"Failed to delete old direct collection before alias swap: {del_err}")
+        raise RuntimeError(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted") from del_err
// Try to acquire lock atomically BEFORE posting placeholder
// This prevents TOCTOU race where multiple webhooks could pass isLocked() check simultaneously
// Note: PullRequestAnalysisProcessor.process() uses acquireLockWithWait() which will
// reuse this lock since it's for the same project/branch/type
String sourceBranch = payload.sourceBranch();
Optional<String> earlyLock = analysisLockService.acquireLock(
        project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
        payload.commitHash(), Long.parseLong(payload.pullRequestId()));

if (earlyLock.isEmpty()) {
    log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook",
            project.getId(), sourceBranch, payload.pullRequestId());
    return WebhookResult.ignored("PR analysis already in progress for this branch");
}

// Lock acquired - placeholder posting is now protected from race conditions
// Note: We don't release this lock here - PullRequestAnalysisProcessor will manage it
// since acquireLockWithWait() will detect the existing lock and use it
🧩 Analysis chain
🏁 Script executed:
# Find the PullRequestAnalysisProcessor file
find . -name "PullRequestAnalysisProcessor.java" -type fRepository: rostilos/CodeCrow
Length of output: 202
🏁 Script executed:
# Find the AnalysisLockService file
find . -name "AnalysisLockService.java" -type fRepository: rostilos/CodeCrow
Length of output: 182
🏁 Script executed:
# Check if there are multiple implementations of AnalysisLockService
find . -name "*LockService*.java" -type fRepository: rostilos/CodeCrow
Length of output: 308
🏁 Script executed:
# Check size and read PullRequestAnalysisProcessor
wc -l ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
Repository: rostilos/CodeCrow
Length of output: 206
🏁 Script executed:
# Check size and read AnalysisLockService
wc -l ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AnalysisLockService.java
Repository: rostilos/CodeCrow
Length of output: 186
🏁 Script executed:
cat -n ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
Repository: rostilos/CodeCrow
Length of output: 17492
🏁 Script executed:
cat -n ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AnalysisLockService.java
Repository: rostilos/CodeCrow
Length of output: 17352
Lock leak potential exists between early lock acquisition and processor handoff, but TTL mitigates indefinite lock leaks.
The concern about exception handling is partially valid. If an exception occurs after acquireLock() succeeds at line 120 but before processor.process() is called, the early lock would not be released by the webhook handler's catch block. However, the processor's process() method has a finally block (line 210-211 in PullRequestAnalysisProcessor) that reliably releases the lock when called, and the AnalysisLockService automatically expires locks after 30 minutes (configurable) with periodic cleanup every 5 minutes.
To eliminate the temporary lock leak window entirely, consider wrapping the post-lock operations in a try-finally block that releases the early lock only if the processor is never invoked:
Proposed mitigation
if (earlyLock.isEmpty()) {
log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook",
project.getId(), sourceBranch, payload.pullRequestId());
return WebhookResult.ignored("PR analysis already in progress for this branch");
}
+try {
// Lock acquired - placeholder posting is now protected from race conditions
// Processor will manage release via its finally block
// ... rest of method ...
return WebhookResult.success(...);
+} catch (Exception e) {
+ analysisLockService.releaseLock(earlyLock.get());
+ // ... existing error handling ...
+ return WebhookResult.error(...);
+}

🤖 Prompt for AI Agents
In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java`
around lines 113 - 131, After acquireLock(...) returns earlyLock in
BitbucketCloudPullRequestWebhookHandler, wrap the post-lock work (placeholder
posting and invoking PullRequestAnalysisProcessor.process(...)) in a try/finally
so that if an exception prevents calling processor.process(...) you call
analysisLockService.releaseLock(earlyLock.get()) in the finally; ensure you do
NOT release the lock when processor.process(...) was successfully invoked (since
PullRequestAnalysisProcessor.process() will release it), so track a boolean like
processorInvoked to decide whether to release the earlyLock in the finally.
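A compact sketch of the processorInvoked hand-off described above. The service and processor interfaces are assumed signatures, not the repository's real API.

import java.util.Optional;

// Hedged sketch: release the early lock only if the processor was never invoked;
// once process() is called, its own finally block owns the release.
public class EarlyLockHandoffSketch {

    interface AnalysisLockService {
        Optional<String> acquireLock(Long projectId, String branch, String commitHash, long prId);
        void releaseLock(String lockKey);
    }

    interface PullRequestAnalysisProcessor {
        void process(Long projectId, long prId, String preAcquiredLockKey);
    }

    private final AnalysisLockService locks;
    private final PullRequestAnalysisProcessor processor;

    EarlyLockHandoffSketch(AnalysisLockService locks, PullRequestAnalysisProcessor processor) {
        this.locks = locks;
        this.processor = processor;
    }

    void handle(Long projectId, String branch, String commitHash, long prId) {
        Optional<String> earlyLock = locks.acquireLock(projectId, branch, commitHash, prId);
        if (earlyLock.isEmpty()) {
            return; // duplicate webhook, analysis already in progress
        }
        boolean processorInvoked = false;
        try {
            // ... post placeholder comment, build the process request ...
            processorInvoked = true;
            processor.process(projectId, prId, earlyLock.get());
        } finally {
            if (!processorInvoked) {
                // Only release here when the hand-off never happened.
                locks.releaseLock(earlyLock.get());
            }
        }
    }
}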
…rategy for direct collections
fix: Enhance alias management by implementing backup and migration strategy for direct collections
/codecrow analyze
…esolution tracking
fix: Enhance AI analysis by incorporating full PR issue history and resolution tracking
/codecrow analyze
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/service/GitHubAiClientService.java (1)
81-92: Backward-compatibility regression for callers using the 3‑arg overload.
The `previousAnalysis` context is dropped because you pass `Collections.emptyList()` and the PR builder no longer uses `withPreviousAnalysisData(...)`. If any caller still uses the 3‑arg overload, they lose historical issues.

✅ Suggested fix to preserve prior behavior
- default:
-     return buildPrAnalysisRequest(project, (PrProcessRequest) request, previousAnalysis, Collections.emptyList());
+ default:
+     List<CodeAnalysis> history = previousAnalysis
+             .map(List::of)
+             .orElseGet(Collections::emptyList);
+     return buildPrAnalysisRequest(project, (PrProcessRequest) request, previousAnalysis, history);

java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiRequestPreviousIssueDTO.java (1)
18-52: resolvedInPrVersion is populated with an analysis ID, which is likely wrong.

The field is documented as "PR version where this was resolved," but it's assigned `issue.getResolvedAnalysisId()`, which returns an analysis ID (a Long), not a PR version. The CodeAnalysisIssue entity has no direct accessor for the resolved analysis' PR version. Either rename the field/comment to `resolvedInAnalysisId` to match the assigned value, or fetch the resolved analysis by ID to extract the actual `prVersion`.
🤖 Fix all issues with AI agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java`:
- Around line 281-302: The Javadoc for withAllPrAnalysesData promises
deduplication but the method currently just flattens all issues; update
withAllPrAnalysesData in AiAnalysisRequestImpl to deduplicate by a composite
identity (file path + line number + reason + severity + category) while
preserving newest-first order: when streaming allPrAnalyses -> flatMap ->
map(AiRequestPreviousIssueDTO::fromEntity), collect into a LinkedHashMap keyed
by the composite key and keep the first occurrence (newest) for each key, then
set previousCodeAnalysisIssues to the map.values() list; ensure you reference
the existing AiRequestPreviousIssueDTO.fromEntity conversion and the
previousCodeAnalysisIssues field and return self() as before.
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java`:
- Around line 382-387: Remove the dead 4-argument overload
createIssueFromData(Map<String, Object> issueData, String issueKey, String
vcsAuthorId, String vcsAuthorUsername) since no callers use it; update
CodeAnalysisService by deleting that private method entirely (all references
should remain pointing to the existing 7-argument createIssueFromData(...,
ResolutionContext, String, String) method) and run a build to ensure no usages
remain.
🧹 Nitpick comments (6)
python-ecosystem/mcp-client/utils/response_parser.py (1)
84-91: Mutation of input dict may cause side effects.

The normalization at line 86 mutates the original `issue` dictionary by adding the `id` key. While this works for the current use case (since `_clean_issue` is called on items that are subsequently discarded), it could lead to unexpected behavior if callers retain references to the original dict. Consider operating on a copy to avoid side effects:

♻️ Suggested defensive fix
     cleaned = {}

     # Normalize issueId to id (AI may use either)
-    if 'issueId' in issue and 'id' not in issue:
-        issue['id'] = issue['issueId']
+    id_value = issue.get('id') or issue.get('issueId')

     for key, value in issue.items():
         # Skip issueId since we normalized it to id above
         if key == 'issueId':
             continue
-
+
+        # Use normalized id value
+        if key == 'id':
+            value = id_value
+
         # Skip fields not in valid set

java-ecosystem/libs/analysis-engine/src/test/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiRequestPreviousIssueDTOTest.java (1)
207-224: Missing mock for `getPrNumber()` may cause test fragility.

The test `shouldHandleNullSeverity` mocks `analysis.getBranchName()` and `analysis.getPrVersion()` but does not mock `analysis.getPrNumber()`. If `fromEntity()` accesses `getPrNumber()`, Mockito will return `0L` (default for Long primitives via unboxing) rather than `null`, which may not reflect the intended test scenario or could mask potential NPE issues in production code. Consider adding the mock for consistency with other tests:
when(analysis.getPrNumber()).thenReturn(null);

java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java (2)
123-124: Remove duplicate variable declaration.
`aiConnection` and `projectAiConnection` both reference the same object from `project.getAiBinding().getAiConnection()`. This redundancy can cause confusion and maintenance burden.

♻️ Proposed fix
 VcsInfo vcsInfo = getVcsInfo(project);
 VcsConnection vcsConnection = vcsInfo.vcsConnection();
 AIConnection aiConnection = project.getAiBinding().getAiConnection();
-AIConnection projectAiConnection = project.getAiBinding().getAiConnection();

Then update line 232 to use `aiConnection`:

-.withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(projectAiConnection.getApiKeyEncrypted()))
+.withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))
295-296: Same duplicate variable declaration issue.

Apply the same fix here: remove `projectAiConnection` and use `aiConnection` consistently.

♻️ Proposed fix
 VcsInfo vcsInfo = getVcsInfo(project);
 VcsConnection vcsConnection = vcsInfo.vcsConnection();
 AIConnection aiConnection = project.getAiBinding().getAiConnection();
-AIConnection projectAiConnection = project.getAiBinding().getAiConnection();

Then update line 303:
-.withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(projectAiConnection.getApiKeyEncrypted()))
+.withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))

java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java (2)
268-288: N+1 query pattern in issue reconciliation.

For each issue in the AI response, `issueRepository.findById(originalId)` executes a separate database query. For large analyses with many issues referencing previous issues, this could cause performance degradation. Consider batch-fetching all original issue IDs upfront:
♻️ Suggested approach
Before the issue processing loop, collect all original IDs and fetch them in one query:
// Extract all original issue IDs from the issues list/map
Set<Long> originalIds = extractOriginalIssueIds(issuesObj);
Map<Long, CodeAnalysisIssue> originalIssuesMap = originalIds.isEmpty()
        ? Collections.emptyMap()
        : issueRepository.findAllById(originalIds).stream()
                .collect(Collectors.toMap(CodeAnalysisIssue::getId, Function.identity()));
// Then pass originalIssuesMap to createIssueFromData and look up by ID

This would reduce N database calls to 1 batch query.
353-362: Clarify resolution tracking intent in comment.

The resolution tracking logic sets `resolvedDescription` from the current issue's `reason` field. This is semantically correct (the AI provides the resolution explanation), but the variable naming and comment could be clearer. Consider enhancing the comment:
// AI provides resolution explanation in 'reason' field for resolved issues
issue.setResolvedDescription(reason);
/**
 * Set previous issues from ALL PR analysis versions.
 * This provides the LLM with complete issue history including resolved issues,
 * helping it understand what was already found and fixed.
 *
 * Issues are deduplicated by keeping only the most recent version of each issue.
 * Resolved issues are included so the LLM knows what was already addressed.
 *
 * @param allPrAnalyses List of all analyses for this PR, ordered by version DESC (newest first)
 */
public T withAllPrAnalysesData(List<CodeAnalysis> allPrAnalyses) {
    if (allPrAnalyses == null || allPrAnalyses.isEmpty()) {
        return self();
    }

    this.previousCodeAnalysisIssues = allPrAnalyses.stream()
            .flatMap(analysis -> analysis.getIssues().stream())
            .map(AiRequestPreviousIssueDTO::fromEntity)
            .toList();

    return self();
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find CodeAnalysisIssue class definition
find . -name "*.java" -type f | xargs grep -l "class CodeAnalysisIssue" | head -5Repository: rostilos/CodeCrow
Length of output: 284
🏁 Script executed:
#!/bin/bash
# Search for CodeAnalysis class to understand the structure
find . -name "*.java" -type f | xargs grep -l "class CodeAnalysis\b" | head -5Repository: rostilos/CodeCrow
Length of output: 165
🏁 Script executed:
#!/bin/bash
# Look for any deduplication logic in the codebase
rg "LinkedHashMap|toMap.*Function.identity" --type java -C2 | head -30Repository: rostilos/CodeCrow
Length of output: 4362
🏁 Script executed:
#!/bin/bash
# Find usages of withAllPrAnalysesData to understand how it's called
rg "withAllPrAnalysesData" --type java -C3Repository: rostilos/CodeCrow
Length of output: 5696
🏁 Script executed:
#!/bin/bash
# Read CodeAnalysisIssue class to find stable identifier fields
cat java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/codeanalysis/CodeAnalysisIssue.java
Repository: rostilos/CodeCrow
Length of output: 5060
🏁 Script executed:
#!/bin/bash
# Read CodeAnalysis class to understand the structure
cat java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/codeanalysis/CodeAnalysis.java
Repository: rostilos/CodeCrow
Length of output: 5970
🏁 Script executed:
#!/bin/bash
# Find AiRequestPreviousIssueDTO class
find . -name "*.java" -type f | xargs grep -l "class AiRequestPreviousIssueDTO"Repository: rostilos/CodeCrow
Length of output: 199
🏁 Script executed:
#!/bin/bash
# Check how the Javadoc says "deduplicated by keeping only the most recent version"
# Let's verify the exact context in the file
sed -n '281,302p' java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
Repository: rostilos/CodeCrow
Length of output: 1080
🏁 Script executed:
#!/bin/bash
# Find and read AiRequestPreviousIssueDTO class
find . -name "AiRequestPreviousIssueDTO.java" -type f -exec cat {} \;Repository: rostilos/CodeCrow
Length of output: 2395
🏁 Script executed:
#!/bin/bash
# Check if there are any other dedup patterns or discussions in related code
rg "deduplicate|dedup|unique.*issue|fingerprint" --type java -C2 | head -50Repository: rostilos/CodeCrow
Length of output: 971
🏁 Script executed:
#!/bin/bash
# Check test file to understand expected behavior
cat java-ecosystem/libs/analysis-engine/src/test/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiRequestPreviousIssueDTOTest.java
Repository: rostilos/CodeCrow
Length of output: 12371
Deduplication promised in Javadoc is not implemented.
The method flattens all issues from all PR analysis versions without deduplicating them. The Javadoc states "Issues are deduplicated by keeping only the most recent version of each issue," but the implementation contains no dedup logic. This creates duplicate entries in the stream when the same logical issue (identified by file, line, reason, severity, category) appears across multiple PR versions, inflating the data sent to the LLM and contradicting the documented behavior.
Use a composite key based on issue identity fields (file path + line number + reason + severity + category) to deduplicate while preserving the newest-first order:
Suggested implementation pattern
- this.previousCodeAnalysisIssues = allPrAnalyses.stream()
- .flatMap(analysis -> analysis.getIssues().stream())
- .map(AiRequestPreviousIssueDTO::fromEntity)
- .toList();
+ this.previousCodeAnalysisIssues = allPrAnalyses.stream()
+ .flatMap(analysis -> analysis.getIssues().stream())
+ .map(AiRequestPreviousIssueDTO::fromEntity)
+ // Keep first occurrence (newest) based on issue identity
+ .collect(java.util.stream.Collectors.toMap(
+ dto -> dto.file() + "|" + dto.line() + "|" + dto.reason() + "|" + dto.severity() + "|" + dto.category(),
+ java.util.function.Function.identity(),
+ (first, ignored) -> first, // Keep first (newest) when key collision
+ java.util.LinkedHashMap::new // Preserve order
+ ))
+ .values()
+ .stream()
+ .toList();

🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java`
around lines 281 - 302, The Javadoc for withAllPrAnalysesData promises
deduplication but the method currently just flattens all issues; update
withAllPrAnalysesData in AiAnalysisRequestImpl to deduplicate by a composite
identity (file path + line number + reason + severity + category) while
preserving newest-first order: when streaming allPrAnalyses -> flatMap ->
map(AiRequestPreviousIssueDTO::fromEntity), collect into a LinkedHashMap keyed
by the composite key and keep the first occurrence (newest) for each key, then
set previousCodeAnalysisIssues to the map.values() list; ensure you reference
the existing AiRequestPreviousIssueDTO.fromEntity conversion and the
previousCodeAnalysisIssues field and return self() as before.
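The following standalone example demonstrates the keep-first LinkedHashMap collection pattern the prompt describes, using a simple record in place of AiRequestPreviousIssueDTO. It is an illustration, not the project's code.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Demo of "dedupe by composite key, keep the first (newest) occurrence, preserve order".
public class DedupKeepNewestDemo {

    record Issue(String file, int line, String reason, String severity, String category, int prVersion) {}

    public static void main(String[] args) {
        // Newest-first input: the same logical issue appears in PR versions 3 and 2.
        List<Issue> newestFirst = List.of(
                new Issue("Foo.java", 42, "NPE risk", "HIGH", "bug", 3),
                new Issue("Bar.java", 10, "Unused var", "LOW", "style", 3),
                new Issue("Foo.java", 42, "NPE risk", "HIGH", "bug", 2)
        );

        List<Issue> deduped = newestFirst.stream()
                .collect(Collectors.toMap(
                        i -> i.file() + "|" + i.line() + "|" + i.reason() + "|" + i.severity() + "|" + i.category(),
                        Function.identity(),
                        (first, ignored) -> first,   // keep first occurrence = newest version
                        LinkedHashMap::new))         // preserve newest-first encounter order
                .values()
                .stream()
                .toList();

        deduped.forEach(System.out::println); // prints the v3 Foo.java issue once, plus Bar.java
    }
}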
/**
 * Overload for backward compatibility with callers that don't have resolution context
 */
private CodeAnalysisIssue createIssueFromData(Map<String, Object> issueData, String issueKey, String vcsAuthorId, String vcsAuthorUsername) {
    return createIssueFromData(issueData, issueKey, vcsAuthorId, vcsAuthorUsername, null, null, null);
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for all calls to createIssueFromData method
# Look for both 4-argument (backward compatible) and 7-argument versions
rg "createIssueFromData\s*\(" -A 3 java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java | head -100Repository: rostilos/CodeCrow
Length of output: 1123
🏁 Script executed:
#!/bin/bash
# Also search for calls to this method across the entire repository
rg "createIssueFromData\(" --type javaRepository: rostilos/CodeCrow
Length of output: 1001
Remove the unused backward-compatible overload.
This 4-argument overload (lines 382-387) has no callers in the codebase. All calls to createIssueFromData use the 7-argument version. Remove this dead code.
🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java`
around lines 382 - 387, Remove the dead 4-argument overload
createIssueFromData(Map<String, Object> issueData, String issueKey, String
vcsAuthorId, String vcsAuthorUsername) since no callers use it; update
CodeAnalysisService by deleting that private method entirely (all references
should remain pointing to the existing 7-argument createIssueFromData(...,
ResolutionContext, String, String) method) and run a build to ensure no usages
remain.
…oth incremental and full modes
fix: Update issue reconciliation logic to handle previous issues in both incremental and full modes
/codecrow analyze
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python-ecosystem/mcp-client/service/multi_stage_orchestrator.py (1)
198-205: Prevent resolved issues from being re-added when reconciliation runs in FULL mode.

Because Stage 1.5 now always runs when `previousCodeAnalysisIssues` exist, `_reconcile_previous_issues` will append any prior issue not re-reported, even ones already resolved, effectively reopening them. Skip previously resolved issues before adding "persisting" entries.

🛠️ Proposed fix
@@
 for prev_issue in request.previousCodeAnalysisIssues:
     # Convert to dict if needed
     if hasattr(prev_issue, 'model_dump'):
         prev_data = prev_issue.model_dump()
     else:
         prev_data = prev_issue if isinstance(prev_issue, dict) else vars(prev_issue)
+
+    # Skip issues that were already resolved in history
+    status = prev_data.get('status')
+    if status is None and prev_data.get('isResolved') is not None:
+        status = 'resolved' if prev_data.get('isResolved') else 'open'
+    if str(status).lower() == 'resolved':
+        continue
fix: Improve handling of issue resolution status and logging for bett…
- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed token limitation from AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of token limitation.
- Enhanced Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded.
- Updated command processors to utilize project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
fix: Enhance logging and implement deterministic context retrieval in RAG pipeline
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python-ecosystem/mcp-client/service/multi_stage_orchestrator.py`:
- Around line 920-921: The logger.debug call that reads logger.debug(f"Reached
chunk limit of 15, stopping") uses an unnecessary f-string; replace it with a
normal string literal instead (e.g., logger.debug("Reached chunk limit of 15,
stopping")) in the same scope where included_count is checked (the if
included_count >= 15 block) so static analysis warnings are resolved while
preserving the log message.
In `@python-ecosystem/mcp-client/service/rag_client.py`:
- Around line 249-286: The fallback return shapes for get_deterministic_context
are inconsistent and must match the expected schema (keys: "chunks",
"changed_files", "related_definitions", "class_context", "namespace_context");
update the disabled branch and both except handlers in the method that calls
self._get_client and posts to f"{self.base_url}/query/deterministic" so they all
return {"context": {"chunks": [], "changed_files": {}, "related_definitions":
{}, "class_context": {}, "namespace_context": {}}}; keep the existing logging
and error handling but normalize the returned context shape to avoid KeyErrors
in callers of query_service.get_deterministic_context.
🧹 Nitpick comments (7)
python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py (2)
70-85: Minor docstring inconsistency: says "TWO-STEP" but describes three steps.

The docstring states "TWO-STEP process" but then lists steps 1, 2, and 3. Consider updating to "THREE-STEP" for accuracy.
 class DeterministicContextRequest(BaseModel):
     """Request for deterministic metadata-based context retrieval.

-    TWO-STEP process leveraging tree-sitter metadata:
+    THREE-STEP process leveraging tree-sitter metadata:
     1. Get chunks for the changed file_paths
     2. Extract semantic_names/imports/extends from those chunks
     3. Find related definitions using extracted identifiers
455-467: Improve exception handling for better debugging.

Per static analysis hints, consider:
- Using `logging.exception` to include the traceback automatically
- Using `raise ... from e` for proper exception chaining

♻️ Suggested improvement
     try:
         context = query_service.get_deterministic_context(
             workspace=request.workspace,
             project=request.project,
             branches=request.branches,
             file_paths=request.file_paths,
             limit_per_file=request.limit_per_file or 10
         )
-        return {"context": context}
     except Exception as e:
-        logger.error(f"Error getting deterministic context: {e}")
-        raise HTTPException(status_code=500, detail=str(e))
+        logger.exception(f"Error getting deterministic context: {e}")
+        raise HTTPException(status_code=500, detail=str(e)) from e

python-ecosystem/mcp-client/service/rag_client.py (1)
228-237: Docstring mentions "Two-step" but describes three steps.

Same inconsistency as in the API model; consider updating for consistency.
python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py (2)
693-696: Unused loop variable `q_type`.

The static analysis correctly identifies that `q_type` is unpacked but never used. Rename it to `_q_type` to indicate it's intentionally unused.

 logger.info(f"Generated {len(queries)} queries for PR context")
-for i, (q_text, q_weight, q_top_k, q_type) in enumerate(queries):
+for i, (q_text, q_weight, q_top_k, _q_type) in enumerate(queries):
     logger.info(f"  Query {i+1}: weight={q_weight}, top_k={q_top_k}, text='{q_text[:80]}...'")
387-400: Potential performance concern: fallback fetches up to 1000 points per file.

When exact path matching fails, the code scrolls up to 1000 points and filters client-side. For large collections with many changed files, this could cause significant latency and memory pressure.
Consider:
- Adding a warning log when falling back to partial match
- Reducing the limit or adding pagination
- Using a more targeted filter if Qdrant supports partial string matching
 # Fallback: partial match if exact fails
 if not results:
+    logger.debug(f"Exact path match failed for '{normalized_path}', trying partial match")
     all_results, _ = self.qdrant_client.scroll(
         collection_name=collection_name,
         scroll_filter=Filter(must=[branch_filter]),
-        limit=1000,
+        limit=500,  # Reduced limit to mitigate performance impact
         with_payload=True,
         with_vectors=False
     )

java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java (2)
427-431: Unused variable `branchIndexExists`.

The variable is computed and logged but never used in the control flow. Based on the summary, this appears intentional (the early-return was removed to always check for updates), but the unused variable is confusing. Either remove it entirely or add a comment explaining why it's logged but not used for control flow.
♻️ Suggested fix: Remove unused variable or use inline logging
-// Check if branch already has indexed data (RagBranchIndex exists)
-// Note: We still proceed with diff check to ensure any new changes are indexed
-boolean branchIndexExists = isBranchIndexReady(project, targetBranch);
-log.info("Branch index status for project={}, branch={}: exists={}",
-        project.getId(), targetBranch, branchIndexExists);
+// Note: We always proceed with diff check to ensure any new changes are indexed,
+// regardless of whether a branch index already exists
+log.info("Branch index status for project={}, branch={}: exists={}",
+        project.getId(), targetBranch, isBranchIndexReady(project, targetBranch));
810-812: Inconsistent substring bounds checking on commit hashes.

Line 811 uses `Math.min(7, indexedCommit.length())` defensively, but line 812 uses `substring(0, 7)` directly. While git commit hashes are always 40 characters, the inconsistency could cause issues if this code is ever used with abbreviated hashes.

♻️ Suggested fix: Use consistent bounds checking
- log.info("Incremental diff for branch '{}' ({}..{}): bytes={}", - targetBranch, indexedCommit.substring(0, Math.min(7, indexedCommit.length())), - currentCommit.substring(0, 7), rawDiff != null ? rawDiff.length() : 0); + log.info("Incremental diff for branch '{}' ({}..{}): bytes={}", + targetBranch, indexedCommit.substring(0, Math.min(7, indexedCommit.length())), + currentCommit.substring(0, Math.min(7, currentCommit.length())), rawDiff != null ? rawDiff.length() : 0);
if included_count >= 15:  # Increased from 10 for more context
    logger.debug(f"Reached chunk limit of 15, stopping")
Remove extraneous f prefix from string without placeholders.
Static analysis correctly identifies this f-string has no placeholders.
🔧 Suggested fix
- logger.debug(f"Reached chunk limit of 15, stopping")
+ logger.debug("Reached chunk limit of 15, stopping")📝 Committable suggestion
🧰 Tools
🪛 Ruff (0.14.13)
921-921: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In `@python-ecosystem/mcp-client/service/multi_stage_orchestrator.py` around lines
920 - 921, The logger.debug call that reads logger.debug(f"Reached chunk limit
of 15, stopping") uses an unnecessary f-string; replace it with a normal string
literal instead (e.g., logger.debug("Reached chunk limit of 15, stopping")) in
the same scope where included_count is checked (the if included_count >= 15
block) so static analysis warnings are resolved while preserving the log
message.
if not self.enabled:
    logger.debug("RAG disabled, returning empty deterministic context")
    return {"context": {"chunks": [], "changed_files": {}, "related_definitions": {}}}

start_time = datetime.now()

try:
    payload = {
        "workspace": workspace,
        "project": project,
        "branches": branches,
        "file_paths": file_paths,
        "limit_per_file": limit_per_file
    }

    client = await self._get_client()
    response = await client.post(
        f"{self.base_url}/query/deterministic",
        json=payload
    )
    response.raise_for_status()
    result = response.json()

    # Log timing and stats
    elapsed_ms = (datetime.now() - start_time).total_seconds() * 1000
    context = result.get("context", {})
    chunk_count = len(context.get("chunks", []))
    logger.info(f"Deterministic RAG query completed in {elapsed_ms:.2f}ms, "
                f"retrieved {chunk_count} chunks for {len(file_paths)} files")

    return result

except httpx.HTTPError as e:
    logger.warning(f"Failed to retrieve deterministic context: {e}")
    return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
except Exception as e:
    logger.error(f"Unexpected error in deterministic RAG query: {e}")
    return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
Inconsistent fallback return structures will cause key errors in callers.
The method returns different structures in different code paths:
- Line 251 (disabled): {"chunks": [], "changed_files": {}, "related_definitions": {}}
- Lines 283/286 (errors): {"chunks": [], "by_identifier": {}, "by_file": {}}
Based on the query_service.get_deterministic_context implementation, the correct keys are changed_files, related_definitions, class_context, and namespace_context. Callers expecting these keys will fail when receiving the error fallback.
🐛 Proposed fix
except httpx.HTTPError as e:
logger.warning(f"Failed to retrieve deterministic context: {e}")
- return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
+ return {"context": {"chunks": [], "changed_files": {}, "related_definitions": {}}}
except Exception as e:
logger.error(f"Unexpected error in deterministic RAG query: {e}")
- return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
+ return {"context": {"chunks": [], "changed_files": {}, "related_definitions": {}}}📝 Committable suggestion
🧰 Tools
🪛 Ruff (0.14.13)
279-279: Consider moving this statement to an else block
(TRY300)
284-284: Do not catch blind exception: Exception
(BLE001)
285-285: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@python-ecosystem/mcp-client/service/rag_client.py` around lines 249 - 286,
The fallback return shapes for get_deterministic_context are inconsistent and
must match the expected schema (keys: "chunks", "changed_files",
"related_definitions", "class_context", "namespace_context"); update the
disabled branch and both except handlers in the method that calls
self._get_client and posts to f"{self.base_url}/query/deterministic" so they all
return {"context": {"chunks": [], "changed_files": {}, "related_definitions":
{}, "class_context": {}, "namespace_context": {}}}; keep the existing logging
and error handling but normalize the returned context shape to avoid KeyErrors
in callers of query_service.get_deterministic_context.
…Exception in webhook processors
… entities from async contexts
…ies without re-fetching in async contexts
… lazy loading of associations
…cy across transaction contexts
…oading of associations
…ansaction management in async context
…mproved job management
…ervice for direct deletion
…nt in webhook processing
…n for RAG context
- Added AST-based code splitter using Tree-sitter for accurate code parsing.
- Introduced TreeSitterParser for dynamic language loading and caching.
- Created scoring configuration for RAG query result reranking with configurable boost factors and priority patterns.
- Refactored RAGQueryService to utilize the new scoring configuration for enhanced result ranking.
- Improved metadata extraction and handling for better context in scoring.
…rove code referencing in prompts
… target branch results
…emental updates in RAG operations
Feature/pr analysis rate limiting
| Status | PASS WITH WARNINGS |
| Risk Level | HIGH |
| Review Coverage | 97 files analyzed in depth |
| Confidence | HIGH |
Executive Summary
This release candidate introduces significant updates to AI connection management, RAG pipeline orchestration, and VCS webhook handling across the Java and Python ecosystems. The changes focus on improving code analysis depth and indexing efficiency, but they introduce complexities in distributed locking and cross-service error handling. While the architectural direction is sound, there are notable risks regarding data integrity in the vector database and potential "silent failures" during large diff processing.
Recommendation
Decision: PASS WITH WARNINGS
The PR is architecturally solid but requires immediate attention to inconsistent exception handling for large diffs and potential lock leakage in the webhook processing flow. These should be addressed prior to final production deployment to ensure system stability and user feedback transparency.
Strategic Review Areas
1. Large Diff & Token Management
The introduction of TokenEstimator is a critical improvement; however, its implementation is inconsistent across VCS providers (GitHub/GitLab/Bitbucket). Some paths lack error reporting, which may lead to PRs hanging in a pending state if limits are exceeded.
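As an illustration only, here is a hedged sketch of the kind of pre-analysis guard the report describes. The estimator heuristic and the locally declared DiffTooLargeException are stand-ins, not the repository's actual API.

// Hedged sketch of a pre-analysis token guard; names mirror the report's terminology.
public class TokenLimitGuardSketch {

    static class DiffTooLargeException extends RuntimeException {
        DiffTooLargeException(String message) { super(message); }
    }

    // Very rough heuristic: ~4 characters per token for typical diffs (assumption).
    static int estimateTokens(String diff) {
        return diff == null ? 0 : diff.length() / 4;
    }

    static void checkDiffWithinLimit(String diff, int projectMaxAnalysisTokens) {
        int estimated = estimateTokens(diff);
        if (estimated > projectMaxAnalysisTokens) {
            // Fail fast with an explicit exception so webhook handlers can post an
            // informative "analysis skipped" message instead of leaving the PR pending.
            throw new DiffTooLargeException(
                    "Estimated " + estimated + " tokens exceeds project limit of " + projectMaxAnalysisTokens);
        }
    }

    public static void main(String[] args) {
        checkDiffWithinLimit("small diff", 100_000); // passes
    }
}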
2. Distributed Locking & Concurrency
The shift to a "pre-acquired" lock pattern in the PullRequestAnalysisProcessor introduces a risk of orphaned locks. If the calling webhook handlers encounter unhandled exceptions after acquiring a lock, subsequent analyses for those branches will be blocked until manual database intervention occurs.
3. RAG Pipeline Integrity
The Python-based RAG pipeline has been enhanced with language-specific Tree-Sitter queries. Reviewers should note that deterministic ID generation for vector points currently relies on metadata that, if missing, defaults to random values. This could lead to duplicated data in the vector store over time.
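For illustration (the pipeline itself is Python), a short Java sketch of deriving a deterministic point ID from stable chunk metadata so re-indexing the same chunk overwrites instead of duplicating. The field names are assumptions, not the pipeline's actual payload schema.

import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustration only: the same inputs always yield the same UUID.
public class DeterministicPointIdDemo {
    static UUID pointId(String repo, String branch, String filePath, String chunkSignature) {
        String key = repo + "|" + branch + "|" + filePath + "|" + chunkSignature;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(pointId("CodeCrow", "main", "src/Foo.java", "class Foo"));
    }
}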
4. Database & Persistence
The release includes schema migrations to remove token limitations. Note that some new repositories utilize PostgreSQL-specific ON CONFLICT syntax, which hardens the dependency on the current database provider. Additionally, the CodeAnalysisService involves a double-save pattern that requires strict transactional wrapping to avoid "ghost" records on failure.
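A hedged sketch of the transactional wrapping mentioned above; the repository and entity types are placeholders, not the actual CodeAnalysisService types.

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Sketch: both saves share one transaction, so a failure while attaching issues
// rolls back the first save as well, avoiding a "ghost" analysis record.
@Service
public class AnalysisPersistenceSketch {

    interface AnalysisRepository {
        AnalysisEntity save(AnalysisEntity entity);
    }

    static class AnalysisEntity {
        Long id;
        void addIssue(String issue) { /* placeholder */ }
    }

    private final AnalysisRepository repository;

    AnalysisPersistenceSketch(AnalysisRepository repository) {
        this.repository = repository;
    }

    @Transactional
    public AnalysisEntity persistWithIssues(AnalysisEntity analysis, Iterable<String> issues) {
        AnalysisEntity saved = repository.save(analysis); // first save to obtain an ID
        issues.forEach(saved::addIssue);                  // may throw while processing issues
        return repository.save(saved);                    // second save; rollback covers both on failure
    }
}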
Issues Overview
| Severity | Count | Description |
|---|---|---|
| 🔴 High | 2 | Critical issues requiring immediate attention |
| 🟡 Medium | 12 | Issues that should be addressed |
| 🔵 Low | 9 | Minor issues and improvements |
| ℹ️ Info | 2 | Informational notes and suggestions |
Analysis completed on 2026-01-29 22:01:09 | View Full Report | Pull Request
📋 Detailed Issues (25)
🔴 High Severity Issues
Id on Platform: 1615
Category: 🐛 Bug Risk
File: .../core/index_manager.py
Issue: The code deletes the original collection (self.qdrant_client.delete_collection(alias_name)) BEFORE creating the new alias. If update_collection_aliases fails immediately after deletion (e.g., network timeout, service restart, or validation error), the system is left in a state where the primary alias is gone. While data exists in the temp collection, the production alias is broken.
💡 Suggested Fix
In Qdrant, the most reliable way to handle this without downtime or risk is to use a 'Rename' sequence if possible, or at minimum, wrap the deletion and alias creation in a more robust way. Since Qdrant doesn't support atomic rename of a collection to an alias, you should ensure the temp collection is fully ready and perhaps perform the deletion within the same logical block as the alias creation, or use a 'try-finally' with a more aggressive recovery.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
@@ -569,4 +569,7 @@
- logger.warning(f"Deleting old direct collection: {alias_name} (temp collection {temp_collection_name} verified with {temp_collection_info.points_count} points)")
-
- try:
- self.qdrant_client.delete_collection(alias_name)
+ # Use a safer sequence: attempt to create alias with a different name first to verify permissions
+ # then perform the delete and immediate alias creation.
+ logger.warning(f"Deleting old direct collection: {alias_name}")
+ self.qdrant_client.delete_collection(alias_name)
Id on Platform: 1618
Category: 🐛 Bug Risk
File: .../splitter/metadata.py
Issue: The multi-line signature detection logic for Python uses next((i for i, l in enumerate(lines) if l.strip() == line), -1) to find the index of the current line. If multiple lines in the first 15 lines of the chunk have identical content (e.g., two identical decorators or comments), this logic might find the wrong index, leading to incorrect signature reconstruction or processing the same lines repeatedly.
💡 Suggested Fix
Use the loop index directly instead of recalculating it with a search to avoid ambiguity with duplicate lines.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -118,15 +118,13 @@
- for line in lines[:15]:
+ for idx, line in enumerate(lines[:15]):
line = line.strip()
if language == 'python':
if line.startswith(('def ', 'async def ', 'class ')):
sig = line
if line.startswith('class ') and ':' in line:
return line.split(':')[0] + ':'
if ')' not in sig and ':' not in sig:
- idx = next((i for i, l in enumerate(lines) if l.strip() == line), -1)
- if idx >= 0:
- for next_line in lines[idx+1:idx+5]:
- sig += ' ' + next_line.strip()
- if ')' in next_line:
- break
+ for next_line in lines[idx+1:min(idx+5, len(lines))]:
+ sig += ' ' + next_line.strip()
+ if ')' in next_line:
+ break
🟡 Medium Severity Issues
Id on Platform: 1602
Category: ✨ Best Practices
File: .../service/CodeAnalysisService.java:114
Issue: Saving the entity twice (at line 114 and line 178) within the same transactional flow to obtain an ID is a common pattern, but it can lead to unnecessary database roundtrips. More importantly, if the subsequent processing fails, a 'ghost' analysis record with no issues might remain in the database depending on transaction management outside this method.
💡 Suggested Fix
Ensure this method is wrapped in a transaction so that if processing issues fails after the first save, the entire operation rolls back. Since this is likely inside a service, consider if the ID can be used without the intermediate save or if a flush is sufficient.
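The platform attaches no fix here; as a rough sketch only, assuming Spring's @Transactional and hypothetical repository and entity names, the double-save can be kept inside a single transaction:

```java
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical outline: both saves live inside one transaction, so a failure while
// persisting issues also rolls back the initial analysis row (no "ghost" records).
@Service
public class CodeAnalysisPersistence {
    private final CodeAnalysisRepository analysisRepository;      // assumed repository
    private final CodeAnalysisIssueRepository issueRepository;    // assumed repository

    public CodeAnalysisPersistence(CodeAnalysisRepository analysisRepository,
                                   CodeAnalysisIssueRepository issueRepository) {
        this.analysisRepository = analysisRepository;
        this.issueRepository = issueRepository;
    }

    @Transactional
    public CodeAnalysis persist(CodeAnalysis analysis, List<CodeAnalysisIssue> issues) {
        CodeAnalysis saved = analysisRepository.save(analysis); // first save to obtain the ID
        issues.forEach(issue -> issue.setAnalysis(saved));
        issueRepository.saveAll(issues);
        return analysisRepository.save(saved); // commits or rolls back with the issues
    }
}
```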
No suggested fix provided
Id on Platform: 1603
Category: 🐛 Bug Risk
File: .../service/ProjectService.java:378
Issue: The method findByIdWithConnections is used to fetch the project. If this method is implemented as a Join Fetch in the repository, it may return null or an empty Optional if the 'aiBinding' (or relevant connection) is missing, depending on whether an INNER JOIN or LEFT JOIN is used. If it's an INNER JOIN, you won't be able to bind a connection to a project that doesn't have one yet.
💡 Suggested Fix
Ensure the repository method uses a LEFT JOIN FETCH for the connection binding to allow fetching projects that do not yet have an AI connection bound.
--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/project/ProjectRepository.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/project/ProjectRepository.java
@@ -1,1 +1,3 @@
-@Query("SELECT p FROM Project p JOIN FETCH p.aiBinding WHERE p.id = :id")
+@Query("SELECT p FROM Project p LEFT JOIN FETCH p.aiBinding WHERE p.id = :id")
Optional<Project> findByIdWithConnections(@Param("id") Long id);
Id on Platform: 1605
Category: 🐛 Bug Risk
File: .../index_manager/indexer.py:84
Issue: The 'random' module is imported inside the 'else' block of the size estimation method. This can lead to repeated import overhead in a loop or failure if the environment restricts local imports.
💡 Suggested Fix
Move the random import to the top of the file.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -8,2 +8,3 @@
import logging
+import random
import time
Id on Platform: 1606
Category: ✨ Best Practices
File: .../analysis/PullRequestAnalysisProcessor.java:231
Issue: The logic only releases the lock if '!isPreAcquired'. While this prevents the processor from releasing a lock it didn't create, it assumes the caller will always handle cleanup. If the caller (e.g., a webhook handler) expects the processor to take ownership of the lifecycle once passed, the lock will leak. Typically, pre-acquired locks used in a 'try-finally' block should have a clearly defined owner to prevent orphaned locks in the database.
💡 Suggested Fix
Ensure there is a clear contract on lock ownership. If the processor should not release it, add a comment explaining where the release happens, or ensure the 'isPreAcquired' flag is correctly managed by the caller's transaction scope. Given the existing code, ensure the caller has a finally block to release the lock since this processor now skips it.
--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
@@ -228,6 +228,7 @@
return Map.of("status", "error", "message", e.getMessage());
} finally {
+ // If pre-acquired, the caller (e.g. WebhookHandler) is responsible for release
if (!isPreAcquired) {
analysisLockService.releaseLock(lockKey);
}
Id on Platform: 1607
Category: 🛡️ Error Handling
File: .../index_manager/branch_manager.py:136
Issue: In preserve_other_branch_points, if an exception occurs during the scroll loop after some points have already been collected, the function catches the exception and returns an incomplete list. This can lead to silent data loss during a reindex because copy_points_to_collection will only write the partially collected points to the new collection.
💡 Suggested Fix
Reraise the exception or return None/raise a custom error to signal that the preservation failed, rather than returning a partial list of points.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
@@ -134,2 +134,2 @@
except Exception as e:
- logger.warning(f"Could not read existing points: {e}")
- return []
+ logger.error(f"Critical failure during point preservation: {e}")
+ raise
Id on Platform: 1609
Category: ✨ Best Practices
File: .../processor/WebhookAsyncProcessor.java
Issue: While the self-injection pattern (using @Lazy) works to solve the proxy bypass, it is generally considered a smell or a workaround in Spring. A cleaner architectural approach is to move the @Transactional logic into a separate service (e.g., WebhookProcessingService). This avoids circular dependencies and adheres better to the Single Responsibility Principle.
💡 Suggested Fix
Extract the contents of 'processWebhookInTransaction' and its associated private helper methods into a new service class, then inject that service into WebhookAsyncProcessor.
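No patch is attached below; a hedged sketch of that extraction, where only WebhookProcessingService, processWebhookInTransaction and WebhookAsyncProcessor come from the note above and everything else (including the payload type) is assumed:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Sketch of the suggested extraction; payload type and method bodies are assumptions.
@Service
class WebhookProcessingService {          // new bean owning the transactional work
    @Transactional
    public void processWebhookInTransaction(Object payload) {
        // persistence and state changes that must commit or roll back together
    }
}

@Service
class WebhookAsyncProcessor {             // existing processor, now free of self-injection
    private final WebhookProcessingService processingService;

    WebhookAsyncProcessor(WebhookProcessingService processingService) {
        this.processingService = processingService;
    }

    void handle(Object payload) {
        processingService.processWebhookInTransaction(payload); // goes through the Spring proxy
    }
}
```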
No suggested fix provided
Id on Platform: 1614
Category: ⚡ Performance
File: .../ai/AiAnalysisRequestImpl.java
Issue: The method implementation does not perform deduplication as claimed in its Javadoc ('Issues are deduplicated by keeping only the most recent version of each issue'). If multiple analyses for a PR are passed, the same issues (which often persist across PR versions) will be added multiple times to the list, potentially exceeding token limits or confusing the AI model.
💡 Suggested Fix
Implement deduplication by grouping issues by a unique key (like filePath, line, and reason) or simply by issue ID if IDs are stable across versions, keeping only the most recent version.
--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -293,3 +293,4 @@
- this.previousCodeAnalysisIssues = allPrAnalyses.stream()
- .flatMap(analysis -> analysis.getIssues().stream())
- .map(AiRequestPreviousIssueDTO::fromEntity)
- .toList();
+ this.previousCodeAnalysisIssues = allPrAnalyses.stream() // assumes list is already sorted newest-first
+ .flatMap(analysis -> analysis.getIssues().stream())
+ .map(AiRequestPreviousIssueDTO::fromEntity)
+ .collect(java.util.stream.Collectors.toMap(AiRequestPreviousIssueDTO::id, i -> i, (existing, replacement) -> existing))
+ .values().stream().toList();
Id on Platform: 1616
Category: 🐛 Bug Risk
File: .../splitter/query_runner.py
Issue: The logic to identify the pattern name for custom queries checks for the absence of a dot: 'elif "." not in capture_name: pattern_name = capture_name'. However, if a custom query only uses dotted captures (e.g., @function.definition) without a base capture (e.g., @function), the 'pattern_name' will remain None and the match will be skipped at line 262.
💡 Suggested Fix
Ensure that if a dotted capture is found and pattern_name is still None, the base part of the capture is used as the pattern name.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -257,2 +257,5 @@
elif "." not in capture_name:
pattern_name = capture_name
+ elif pattern_name is None:
+ pattern_name = capture_name.split(".")[0]
+ main_node = node
Id on Platform: 1617
Category: 🐛 Bug Risk
File: .../splitter/tree_parser.py
Issue: The code assumes that the language loading function in the external module is always named 'language' or the value provided in the mapping. However, for some tree-sitter modules (like tree-sitter-typescript), the module might contain multiple languages (e.g., 'language_typescript' and 'language_tsx'). While the mapping in languages.py handles this, the generic getattr check might fail silently or return None if the function name provided in the mapping is incorrect or the module structure changes, which is common in tree-sitter transitions.
💡 Suggested Fix
Add more robust validation and error logging for the language function retrieval to ensure it is actually callable before execution.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
@@ -73,7 +73,7 @@
lang_module = importlib.import_module(module_name)
lang_func = getattr(lang_module, func_name, None)
- if not lang_func:
+ if not lang_func or not callable(lang_func):
logger.debug(f"Module {module_name} has no callable {func_name} function")
return None
Id on Platform: 1621
Category: 🐛 Bug Risk
File: .../utils/utils.py
Issue: The heuristic for detecting archive roots ('-' in first_part and len(first_part) > 20) might incorrectly flag and strip legitimate top-level project directories that follow naming conventions like 'my-very-long-project-name'. This would cause the first directory of the actual project structure to be discarded.
💡 Suggested Fix
Tighten the archive detection logic. Instead of a general hyphen check, consider checking if the part matches a hexadecimal pattern (common for commit hashes) or allow the user to provide a known root prefix. Alternatively, use a more specific regex for known VCS archive patterns (e.g., Bitbucket's {project}-{repo}-{commit}).
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py
@@ -111,5 +111,5 @@
- looks_like_archive = (
- '-' in first_part and len(first_part) > 20 or # owner-repo-commit
- len(first_part) >= 40 or # Just commit hash
- (first_part.count('-') >= 2 and any(c.isdigit() for c in first_part)) # Has digits and multiple hyphens
- )
+ # Heuristic for archive roots like 'owner-repo-7a8b9c0'
+ import re
+ is_hex_suffix = bool(re.search(r'-[0-9a-f]{7,40}$', first_part))
+ is_hash = bool(re.fullmatch(r'[0-9a-f]{40}', first_part))
+ looks_like_archive = is_hash or is_hex_suffix
Id on Platform: 1624
Category: 🛡️ Error Handling
File: .../service/GitHubAiClientService.java
Issue: When a DiffTooLargeException is thrown, there is no visible mechanism in the diff to ensure the user receives feedback on the PR (e.g., a comment explaining why the analysis failed). If not caught and handled by posting a comment, the PR might stay in a 'pending' state indefinitely.
💡 Suggested Fix
Ensure that the calling code or a global exception handler catches DiffTooLargeException and uses VcsReportingService to post an informative comment to the PR.
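No fix diff follows; a minimal sketch of the idea, assuming accessor and reporting-service names (getEstimatedTokens, getMaxAllowedTokens, postComment) that may differ from the actual code:

```java
// Sketch only: surface the size-limit failure on the PR instead of leaving it pending.
try {
    AiAnalysisRequest request = buildPrAnalysisRequest(project, pullRequestId);
    aiClient.analyze(request);
} catch (DiffTooLargeException e) {
    String notice = "⚠️ Analysis skipped: the PR diff is estimated at " + e.getEstimatedTokens()
            + " tokens, above the configured limit of " + e.getMaxAllowedTokens() + " tokens.";
    vcsReportingService.postComment(project, pullRequestId, notice,
            CommentPlaceholders.CODECROW_COMMAND_MARKER);
}
```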
No suggested fix provided
Id on Platform: 1625
Category: ✨ Best Practices
File: .../service/GitLabAiClientService.java:315
Issue: Token estimation and limit enforcement were added to 'buildMrAnalysisRequest' but are missing from 'buildBranchAnalysisRequest'. Large branch diffs may still attempt to be sent to the AI, potentially causing timeouts or API errors depending on the provider's limits.
💡 Suggested Fix
Implement the same TokenEstimator check in buildBranchAnalysisRequest before building the AiAnalysisRequest.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java
@@ -312,6 +312,12 @@
- .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))
- .withUseLocalMcp(true)
- .withPreviousAnalysisData(previousAnalysis)
- .withMaxAllowedTokens(project.getEffectiveConfig().maxAnalysisTokenLimit())
+ int maxTokenLimit = project.getEffectiveConfig().maxAnalysisTokenLimit();
+ TokenEstimator.TokenEstimationResult tokenEstimate = TokenEstimator.estimateAndCheck(rawDiff, maxTokenLimit);
+ if (tokenEstimate.exceedsLimit()) {
+ throw new DiffTooLargeException(tokenEstimate.estimatedTokens(), maxTokenLimit, project.getId(), "branch:" + request.getTargetBranchName());
+ }
+
+ return AiAnalysisRequest.builder()
+ .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))
+ .withUseLocalMcp(true)
+ .withPreviousAnalysisData(previousAnalysis)
+ .withMaxAllowedTokens(maxTokenLimit)
🔵 Low Severity Issues
Id on Platform: 1604
Category: ✨ Best Practices
File: .../service/multi_stage_orchestrator.py:458
Issue: The method logger.info is called but the 'logger' variable is not defined or imported within the visible diff context, while other parts of the file use 'self.logger' or 'logging'. Based on line 458, 'logger' is used directly.
💡 Suggested Fix
Ensure 'logger' is correctly imported or use 'logging.getLogger(__name__)' at the module level.
--- a/python-ecosystem/mcp-client/service/multi_stage_orchestrator.py
+++ b/python-ecosystem/mcp-client/service/multi_stage_orchestrator.py
@@ -24,2 +24,3 @@
import asyncio
+import logging
+logger = logging.getLogger(__name__)
+
Id on Platform: 1608
Category: ✨ Best Practices
File: .../index_manager/point_operations.py
Issue: If 'path' is missing from metadata, it defaults to a random UUID. This breaks the 'deterministic ID' guarantee promised by generate_point_id, as every run will generate different IDs for the same chunk if the path isn't explicitly provided, leading to duplicate points in the vector database.
💡 Suggested Fix
Require path to be present in chunk metadata or use a more stable fallback than a random UUID to ensure idempotency.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py
@@ -52,5 +52,6 @@
for chunk in chunks:
- path = chunk.metadata.get("path", str(uuid.uuid4()))
+ path = chunk.metadata.get("path")
+ if not path:
+ logger.warning("Chunk missing path metadata; using 'unknown'")
+ path = "unknown"
if path not in chunks_by_file:
Id on Platform: 1610
Category: 🛡️ Error Handling
File: .../webhookhandler/GitHubPullRequestWebhookHandler.java:133
Issue: The use of Long.parseLong(payload.pullRequestId()) occurs multiple times. If pullRequestId is not a valid numeric string, this will throw an unhandled NumberFormatException, bypassing the specialized catch blocks and potentially leaving a lock dangling if it occurs after acquisition (though here it's inside the try block).
💡 Suggested Fix
Parse the PR ID once into a variable before the try block or ensure the payload is validated before this method is called.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
@@ -125,12 +125,13 @@
String placeholderCommentId = null;
try {
+ long prNum = Long.parseLong(payload.pullRequestId());
// Try to acquire lock atomically BEFORE posting placeholder
// This prevents race condition where multiple webhooks could post duplicate placeholders
String sourceBranch = payload.sourceBranch();
Optional<String> earlyLock = analysisLockService.acquireLock(
project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
- payload.commitHash(), Long.parseLong(payload.pullRequestId()));
+ payload.commitHash(), prNum);
if (earlyLock.isEmpty()) {
log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook",
@@ -139,7 +140,7 @@
// Lock acquired - placeholder posting is now protected from race conditions
// Post placeholder comment immediately to show analysis has started
- placeholderCommentId = postPlaceholderComment(project, Long.parseLong(payload.pullRequestId()));
+ placeholderCommentId = postPlaceholderComment(project, prNum);
Id on Platform: 1611
Category: ✨ Best Practices
File: .../gitlab/GitLabClient.java:519
Issue: Reading the entire response body into a string using response.body().string() can lead to OutOfMemoryErrors if the diff is extremely large. While diffs are usually manageable, very large PRs or binary changes could cause issues.
💡 Suggested Fix
Use response.body().byteStream() with the ObjectMapper to parse the response incrementally if possible, or check Content-Length before reading.
--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
@@ -516,2 +516,2 @@
- JsonNode root = objectMapper.readTree(response.body().string());
+ JsonNode root = objectMapper.readTree(response.body().byteStream());
Id on Platform: 1613
Category: ⚡ Performance
File: .../service/WebhookDeduplicationService.java:72
Issue: The 'cleanupOldEntries' method is called on every single webhook check and performs a full scan of the map. As the number of unique project/commit combinations increases, this O(N) operation inside the critical path of 'isDuplicateCommitAnalysis' will cause latency spikes.
💡 Suggested Fix
Perform cleanup periodically (e.g., via @Scheduled or every N requests) or use a cache with native expiration like Caffeine or Guava.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
@@ -34,4 +34,5 @@
private final Map<String, Instant> recentCommitAnalyses = new ConcurrentHashMap<>();
+ private final java.util.concurrent.atomic.AtomicInteger requestCounter = new java.util.concurrent.atomic.AtomicInteger(0);
@@ -69,3 +70,5 @@
- // Cleanup old entries
- cleanupOldEntries(now);
+ // Cleanup old entries every 50 requests to avoid O(N) scan on every call
+ if (requestCounter.incrementAndGet() % 50 == 0) {
+ cleanupOldEntries(now);
+ }
Id on Platform: 1619
Category: ✨ Best Practices
File: .../utils/response_parser.py:130
Issue: The import logging statement is placed inside a method call which is inefficient as it executes every time a resolved issue is processed.
💡 Suggested Fix
Move the import to the top of the file to follow PEP 8 and improve performance.
--- a/python-ecosystem/mcp-client/utils/response_parser.py
+++ b/python-ecosystem/mcp-client/utils/response_parser.py
@@ -1,3 +1,4 @@
+import logging
from typing import Any, Dict, List, Optional
@@ -126,5 +127,4 @@
if value:
issue_id = issue.get('id') or issue.get('issueId', 'unknown')
- import logging
logging.getLogger(__name__).info(f"Issue {issue_id} marked as isResolved=True (original: {original_value})")
Id on Platform: 1620
Category: ✨ Best Practices
File: .../models/scoring_config.py:100
Issue: The 'get_priority' method returns ('MEDIUM', 1.0) as a fallback if no patterns match. This is inconsistent with the explicit HIGH/MEDIUM/LOW categories and might lead to confusing results if a file is 'MEDIUM' because it matched a pattern vs 'MEDIUM' because it was a fallback.
💡 Suggested Fix
Introduce a 'DEFAULT' priority level or ensure the fallback boost factor is distinct to allow for clearer debugging and fine-tuning.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
@@ -98,1 +98,1 @@
- return ('MEDIUM', 1.0)
+ return ('DEFAULT', 1.0)
Id on Platform: 1622
Category: ✨ Best Practices
File: .../analysis/CommentCommandRateLimitRepository.java
Issue: The native query uses PostgreSQL-specific syntax ('ON CONFLICT'). While efficient, it reduces the portability of the persistence layer if the application ever moves away from PostgreSQL. Given the project currently uses PostgreSQL specific features in the entity indexes (as seen in the CommentCommandRateLimit.java context), this is acceptable but should be noted.
💡 Suggested Fix
No immediate change required if PostgreSQL is the hard requirement, otherwise consider using a standard JPA approach with locking, though that would be less performant than this atomic upsert.
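No fix is attached; for reference, the general shape of such an atomic upsert behind Spring Data, with table and column names assumed rather than taken from the actual schema:

```java
import java.time.Instant;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

// Illustrative only: a PostgreSQL ON CONFLICT upsert exposed as a repository method.
public interface CommentCommandRateLimitRepository extends JpaRepository<CommentCommandRateLimit, Long> {

    @Modifying
    @Query(value = """
            INSERT INTO comment_command_rate_limit (project_id, pr_id, window_start, request_count)
            VALUES (:projectId, :prId, :windowStart, 1)
            ON CONFLICT (project_id, pr_id, window_start)
            DO UPDATE SET request_count = comment_command_rate_limit.request_count + 1
            """, nativeQuery = true)
    int upsertAndIncrement(@Param("projectId") Long projectId,
                           @Param("prId") Long prId,
                           @Param("windowStart") Instant windowStart);
}
```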
No suggested fix provided
Id on Platform: 1623
Category: ✨ Best Practices
File: .../service/BitbucketAiClientService.java:126
Issue: The variable 'aiConnection' is assigned twice from the same source: 'project.getAiBinding().getAiConnection()'. While harmless, it is redundant.
💡 Suggested Fix
Remove the redundant variable assignment for projectAiConnection.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java
@@ -127,5 +127,4 @@
VcsConnection vcsConnection = vcsInfo.vcsConnection();
AIConnection aiConnection = project.getAiBinding().getAiConnection();
- AIConnection projectAiConnection = project.getAiBinding().getAiConnection();
// CRITICAL: Log the AI connection being used for debugging
ℹ️ Informational Notes
Id on Platform: 1612
Category: ✨ Best Practices
File: .../util/TokenEstimator.java:35
Issue: The fallback estimation uses a fixed ratio of 4 characters per token. While standard for English, this can be significantly inaccurate for other languages or code-heavy content.
💡 Suggested Fix
Consider using a slightly more conservative fallback or documenting the language assumption, though for a fallback this is generally acceptable.
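No fix is attached; a minimal sketch of a documented, slightly more conservative fallback (the 3.5 ratio is an arbitrary example, not a value from the report):

```java
// Sketch: a characters-per-token fallback with the language assumption documented.
// The real TokenEstimator may differ; the report only states the 4-chars-per-token default.
final class FallbackTokenEstimate {
    private static final double CHARS_PER_TOKEN = 3.5; // conservative for code-heavy, non-English text

    static int estimate(String text) {
        if (text == null || text.isEmpty()) {
            return 0;
        }
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }
}
```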
No suggested fix provided
Id on Platform: 1626
Category: ✨ Best Practices
File: .../tests/test_rag.py:63
Issue: The chunk_size has been increased 10x (from 800 to 8000). While the comment suggests this is to fit semantic units, such a large jump significantly exceeds the default 'SentenceSplitter' and 'RecursiveCharacterTextSplitter' configurations seen in 'src/rag_pipeline/core/chunking.py' and 'src/rag_pipeline/core/semantic_splitter.py' which still reference 800-1000. This might lead to context window issues or inefficient retrieval if the underlying models have smaller limits.
💡 Suggested Fix
Verify if 8000 is intended or a typo (perhaps 800 was intended as the base), and ensure the corresponding implementation classes in 'chunking.py' have their defaults updated to match this test expectation to avoid test failures in other environments.
No suggested fix provided
Files Affected
- .../service/GitHubAiClientService.java: 1 issue
- .../splitter/metadata.py: 1 issue
- .../analysis/PullRequestAnalysisProcessor.java: 1 issue
- .../core/index_manager.py: 1 issue
- .../splitter/tree_parser.py: 1 issue
- .../service/WebhookDeduplicationService.java: 1 issue
- .../gitlab/GitLabClient.java: 1 issue
- .../webhookhandler/GitHubPullRequestWebhookHandler.java: 1 issue
- .../analysis/CommentCommandRateLimitRepository.java: 1 issue
- .../service/ProjectService.java: 1 issue
Actionable comments posted: 7
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/webhookhandler/GitLabMergeRequestWebhookHandler.java (1)
128-160: Pass the pre-acquired lock key downstream to prevent lock leaks.
acquireLock() returns a key that is currently ignored. Unlike the GitHub and Bitbucket handlers, which set request.preAcquiredLockKey = earlyLock.get(), the GitLab handler does not propagate the lock. This causes PullRequestAnalysisProcessor to call acquireLockWithWait() again instead of reusing the webhook's lock, leaving the original lock unreleased.
🔧 Proposed fix
PrProcessRequest request = new PrProcessRequest();
+request.preAcquiredLockKey = earlyLock.get();
request.projectId = project.getId();
request.pullRequestId = Long.parseLong(payload.pullRequestId());
request.sourceBranchName = payload.sourceBranch();
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/WebhookAsyncProcessor.java (1)
432-511: Preserve per-command markers when posting error/info.
Using CODECROW_COMMAND_MARKER overwrites summarize/review markers, so later runs won’t clean up prior error/info comments.
✅ Suggested fix
VcsReportingService reportingService = vcsServiceFactory.getReportingService(provider); + String commandType = payload.getCodecrowCommand() != null + ? payload.getCodecrowCommand().type().name().toLowerCase() + : "default"; + String marker = CommentPlaceholders.getMarkerForCommandType(commandType); + // Sanitize error message for user display - hide sensitive technical details String sanitizedMessage = sanitizeErrorForVcs(errorMessage); @@ reportingService.updateComment( project, Long.parseLong(payload.pullRequestId()), placeholderCommentId, content, - CommentPlaceholders.CODECROW_COMMAND_MARKER + marker ); @@ reportingService.postComment( project, Long.parseLong(payload.pullRequestId()), content, - CommentPlaceholders.CODECROW_COMMAND_MARKER + marker ); @@ VcsReportingService reportingService = vcsServiceFactory.getReportingService(provider); + String commandType = payload.getCodecrowCommand() != null + ? payload.getCodecrowCommand().type().name().toLowerCase() + : "default"; + String marker = CommentPlaceholders.getMarkerForCommandType(commandType); + // If we have a placeholder comment, update it with the info if (placeholderCommentId != null) { reportingService.updateComment( project, Long.parseLong(payload.pullRequestId()), placeholderCommentId, infoMessage, - CommentPlaceholders.CODECROW_COMMAND_MARKER + marker ); @@ reportingService.postComment( project, Long.parseLong(payload.pullRequestId()), infoMessage, - CommentPlaceholders.CODECROW_COMMAND_MARKER + marker );
🤖 Fix all issues with AI agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/exception/DiffTooLargeException.java`:
- Around line 14-18: The DiffTooLargeException constructor currently uses
String.format with %d for projectId and pullRequestId which will NPE on null
Longs due to auto-unboxing; update the format string in the
DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId,
Long pullRequestId) constructor to use %s for the nullable Long parameters and
pass them unchanged so nulls are rendered safely (leave the integer token
placeholders as %d).
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/util/TokenEstimator.java`:
- Around line 46-48: The exceedsLimit method should apply the same defensive
check as estimateAndCheck: ensure maxTokens > 0 before comparing; update
exceedsLimit(String text, int maxTokens) to return false when maxTokens <= 0 (or
use the same boolean expression pattern used in estimateAndCheck) and then
compare estimateTokens(text) > maxTokens, referencing the exceedsLimit and
estimateAndCheck methods and the estimateTokens call to keep behavior consistent
and avoid the incorrect true result for non-empty text when maxTokens <= 0.
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py`:
- Around line 102-145: The preserve_other_branch_points function currently
swallows exceptions and returns an empty list (via the try/except around
self.client.scroll) which can cause silent data loss during reindex; also
accumulating all points in preserved_points is memory-unsafe for large
collections. Change the function to fail fast by removing the broad except or
re-raising after logging so errors surface (keep logger.warning but raise), and
refactor to stream results instead of collecting them in RAM—e.g., convert
preserve_other_branch_points into a generator or accept a callback/writer to
send each page (results from self.client.scroll) directly to the target
collection as you iterate, using self.client.scroll and next_offset to page
through until completion.
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py`:
- Around line 53-65: The version suffix in create_versioned_collection uses
int(time.time()) which has only second resolution and can collide; change the
suffix to a higher-resolution or random identifier (e.g., use time.time_ns() or
uuid.uuid4()) when building versioned_name so concurrent calls produce unique
names; update the versioned_name construction and any related logging that
references versioned_name in
collection_manager.CollectionManager.create_versioned_collection to use the new
suffix.
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py`:
- Around line 50-56: The code in the grouping loop uses uuid.uuid4() as a
fallback for missing chunk.metadata["path"], making the fallback
non-deterministic and breaking upsert semantics; update the logic in the loop
that builds chunks_by_file (the for chunk in chunks block that uses
chunk.metadata.get("path", ...)) to use a deterministic fallback (e.g., a
constant like "unknown" or a stable computed key) or explicitly raise an error
when "path" is missing so that TextNode objects without a stable path cannot
produce random IDs and duplicate points on re-indexing.
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/python.scm`:
- Around line 1-28: The type_alias_statement query is using an incorrect field
syntax; update the (type_alias_statement) query so it matches the grammar
sequence ("type" (type ...) "=" (type)) and capture the alias identifier by
nesting (identifier) `@name` inside the first (type) node instead of using a
non-existent name: field; also add an alternative to handle generic aliases by
allowing the first (type) to be either (identifier) `@name` or (generic_type
(identifier) `@name`), and keep the whole node annotated as
`@definition.type_alias`.
In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py`:
- Line 263: source_bytes is computed but never used, causing a Ruff F841; either
remove this unused local and the other unused captures around lines 340-342, or
wire them into the ASTChunk so they become meaningful fields. To fix: if you
need the data, add parameters/attributes to the ASTChunk constructor/definition
(class ASTChunk) and pass source_bytes (and the other captures) into the
ASTChunk creation sites; otherwise delete the unused local assignments (e.g.,
remove source_bytes = text.encode('utf-8') and the analogous unused variables at
340-342). Ensure any tests or downstream code expect the new ASTChunk signature
if you choose to propagate the values.
🟡 Minor comments (16)
java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java-812-814 (1)
812-814: PotentialStringIndexOutOfBoundsExceptionif commit hash is shorter than 7 characters.While commit hashes are typically 40 characters (SHA-1), malformed or truncated data could cause
substring(0, 7)to throw an exception. Consider using a safe substring utility.🛡️ Proposed fix for safe substring
- log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}", - targetBranch, indexedCommit.substring(0, 7), currentCommit.substring(0, 7), - rawDiff != null ? rawDiff.length() : 0); + log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}", + targetBranch, + indexedCommit.substring(0, Math.min(7, indexedCommit.length())), + currentCommit.substring(0, Math.min(7, currentCommit.length())), + rawDiff != null ? rawDiff.length() : 0);Note: The same pattern exists at lines 752-753 in
ensureMainIndexUpToDate, which was not changed in this PR but has the same potential issue.python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py-388-394 (1)
388-394: Add validation for emptybrancheslist.If
branchesis an empty list,target_branchwill beNone(line 389), but the branch filter at line 393 will pass an empty list toMatchAny(any=[]). This could cause unexpected behavior or Qdrant errors depending on how the client handles empty match conditions.🛡️ Proposed fix to add early validation
logger.info(f"Deterministic context: files={file_paths[:5]}, branches={branches}") + if not branches: + logger.warning("No branches provided for deterministic context") + return {"chunks": [], "changed_files": {}, "related_definitions": {}, + "class_context": {}, "namespace_context": {}, + "_metadata": {"error": "no_branches_provided"}} + def _apply_branch_priority(points: list, target: str, existing_target_paths: set) -> list:python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/c_sharp.scm-7-11 (1)
7-11: Namespace queries may miss qualified, aliased, and generic names.In tree-sitter-c-sharp, namespace names can be
identifier,qualified_name,alias_qualified_name, orgeneric_name. The current patternname: (identifier)will not match namespaces with dotted names (e.g.,namespace Foo.Bar.Baz), global aliases (e.g.,namespace global::Foo), or generic names, causing those to have no@namecapture.Proposed fix to handle all name types
; Namespace declarations (namespace_declaration - name: (identifier) `@name`) `@definition.namespace` + name: (_) `@name`) `@definition.namespace` (file_scoped_namespace_declaration - name: (identifier) `@name`) `@definition.namespace` + name: (_) `@name`) `@definition.namespace`java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/utils/CommentPlaceholders.java-46-65 (1)
46-65: Guard against nullcommandTypeto avoid NPEs.
These helpers calltoLowerCase()directly; a null input would crash placeholder handling.✅ Suggested fix
public static String getPlaceholderMessage(String commandType) { - return switch (commandType.toLowerCase()) { + String normalized = commandType == null ? "" : commandType.toLowerCase(); + return switch (normalized) { case "analyze" -> PLACEHOLDER_ANALYZE; case "summarize" -> PLACEHOLDER_SUMMARIZE; case "review" -> PLACEHOLDER_REVIEW; case "ask" -> PLACEHOLDER_ASK; default -> PLACEHOLDER_DEFAULT; }; } @@ public static String getMarkerForCommandType(String commandType) { - return switch (commandType.toLowerCase()) { + String normalized = commandType == null ? "" : commandType.toLowerCase(); + return switch (normalized) { case "summarize" -> CODECROW_SUMMARY_MARKER; case "review" -> CODECROW_REVIEW_MARKER; default -> CODECROW_COMMAND_MARKER; }; }java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/JobService.java-226-229 (1)
226-229: Guard against null job IDs before re-fetch.When
jobRepository.findById()is called with a null ID, it throwsIllegalArgumentExceptionrather than returningOptional.empty(). While all job creation methods save immediately before returning and call sites appear to only pass persisted jobs, adding a null-ID guard would be defensive and prevent potential issues if a transient job ever slips through.🛈 Suggested guard (apply to startJob, completeJob, cancelJob, skipJob, and failJob)
- job = jobRepository.findById(job.getId()).orElse(job); + Long jobId = job.getId(); + if (jobId != null) { + job = jobRepository.findById(jobId).orElse(job); + }Also applies to: lines 248-251, 292-295, 306-309, and failJob method
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java-47-75 (1)
47-75: Race condition between duplicate check and recording.There's a TOCTOU (time-of-check to time-of-use) race between
recentCommitAnalyses.get(key)(Line 55) andrecentCommitAnalyses.put(key, now)(Line 69). Two concurrent webhook events for the same commit could both pass the duplicate check if they interleave between get and put.Use
putIfAbsentorcomputefor atomicity:🔒 Proposed fix using compute for atomic check-and-record
public boolean isDuplicateCommitAnalysis(Long projectId, String commitHash, String eventType) { if (commitHash == null || commitHash.isBlank()) { return false; } String key = projectId + ":" + commitHash; Instant now = Instant.now(); - Instant lastAnalysis = recentCommitAnalyses.get(key); - - if (lastAnalysis != null) { - long secondsSinceLastAnalysis = now.getEpochSecond() - lastAnalysis.getEpochSecond(); - - if (secondsSinceLastAnalysis < DEDUP_WINDOW_SECONDS) { - log.info("Skipping duplicate commit analysis: project={}, commit={}, event={}, " + - "lastAnalysis={}s ago (within {}s window)", - projectId, commitHash, eventType, secondsSinceLastAnalysis, DEDUP_WINDOW_SECONDS); - return true; - } - } - - // Record this analysis - recentCommitAnalyses.put(key, now); + // Atomically check and record to prevent race conditions + Instant[] previousValue = new Instant[1]; + recentCommitAnalyses.compute(key, (k, lastAnalysis) -> { + if (lastAnalysis != null) { + long secondsSinceLastAnalysis = now.getEpochSecond() - lastAnalysis.getEpochSecond(); + if (secondsSinceLastAnalysis < DEDUP_WINDOW_SECONDS) { + previousValue[0] = lastAnalysis; + return lastAnalysis; // Keep existing, signal duplicate + } + } + return now; // Record new analysis + }); + + if (previousValue[0] != null) { + long secondsSinceLastAnalysis = now.getEpochSecond() - previousValue[0].getEpochSecond(); + log.info("Skipping duplicate commit analysis: project={}, commit={}, event={}, " + + "lastAnalysis={}s ago (within {}s window)", + projectId, commitHash, eventType, secondsSinceLastAnalysis, DEDUP_WINDOW_SECONDS); + return true; + } // Cleanup old entries cleanupOldEntries(now); return false; }python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/javascript.scm-1-42 (1)
1-42: Remove TypeScript-specific patterns fromjavascript.scm.While language routing properly separates JavaScript and TypeScript query files, the
javascript.scmfile should not include TypeScript-specific patterns (interface_declaration,type_alias_declaration) on lines 35-42. These constructs don't exist in the JavaScript grammar and are already correctly defined in the separatetypescript.scmfile. This creates unnecessary duplication and obscures the intent of each query file.The good coverage of standard JavaScript patterns and proper language routing via
languages.pyare solid, but clean up this file to focus only on JavaScript constructs.python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py-103-117 (1)
103-117: Consider adding Kotlin module mapping.
LANGUAGE_TO_TREESITTERmapsLanguage.KOTLINto'kotlin', butTREESITTER_MODULESdoesn't include a'kotlin'entry. This would cause Kotlin files to fail tree-sitter parsing despite being mapped.python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py-83-99 (1)
83-99: Addstrict=Trueto zip() to catch embedding count mismatches.If
get_text_embedding_batchreturns fewer embeddings than expected (e.g., due to a partial failure), the zip will silently drop chunks. Usingstrict=Trueensures the lengths match and raisesValueErroron mismatch, preventing data loss.🔧 Proposed fix
# Build points with embeddings points = [] - for (point_id, chunk), embedding in zip(chunk_data, embeddings): + for (point_id, chunk), embedding in zip(chunk_data, embeddings, strict=True): points.append(PointStruct(python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py-117-126 (1)
117-126: Rename ambiguous variableltoline.The variable name
lis flagged as ambiguous (easily confused with1orI). Rename it for clarity:🔧 Proposed fix
if ')' not in sig and ':' not in sig: - idx = next((i for i, l in enumerate(lines) if l.strip() == line), -1) + idx = next((i for i, ln in enumerate(lines) if ln.strip() == line), -1) if idx >= 0:python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py-545-604 (1)
545-604: Exception handling improvements needed in migration path.The static analysis correctly flags several exception handling issues:
- Line 554: f-string has no placeholders - remove the
fprefix- Lines 572-574: Use
logging.exceptioninstead oflogger.errorto capture stack trace, and chain exceptions withraise ... from err- Lines 598-601: Chain the exception with
raise ... from retry_err- Line 604: Use bare
raiseinstead ofraise alias_errto preserve the original traceback🔧 Proposed fixes for exception handling
- if is_direct_collection and "already exists" in str(alias_err).lower(): - logger.info(f"Alias creation failed due to existing collection. Migrating from direct collection to alias-based indexing...") + if is_direct_collection and "already exists" in str(alias_err).lower(): + logger.info("Alias creation failed due to existing collection. Migrating from direct collection to alias-based indexing...")try: self.qdrant_client.delete_collection(alias_name) except Exception as del_err: - logger.error(f"Failed to delete old direct collection: {del_err}") - raise Exception(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted: {del_err}") + logger.exception(f"Failed to delete old direct collection: {del_err}") + raise RuntimeError(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted") from del_errexcept Exception as retry_err: # Critical failure - we've deleted the old collection but can't create alias # The temp collection still exists with all data, log clear instructions logger.critical( f"CRITICAL: Alias creation failed after deleting old collection! " f"Data is safe in '{temp_collection_name}'. " f"Manual fix required: create alias '{alias_name}' -> '{temp_collection_name}'. " f"Error: {retry_err}" ) - raise Exception( + raise RuntimeError( f"Alias creation failed after migration. Data preserved in '{temp_collection_name}'. " f"Run: qdrant alias create {alias_name} -> {temp_collection_name}" - ) + ) from retry_err else: # Some other alias error, re-raise - raise alias_err + raisepython-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py-75-117 (1)
75-117: Inconsistency between AST_SUPPORTED_LANGUAGES and TREESITTER_MODULES.
AST_SUPPORTED_LANGUAGESincludes languages that don't have corresponding entries inTREESITTER_MODULES:
SCALA,LUA,PERL,SWIFT,HASKELL,COBOLThis means
is_ast_supported(path)could returnTruefor files where tree-sitter parsing will actually fail (sinceget_languageinTreeSitterParserwon't find a module mapping).Either add the missing module mappings to
TREESITTER_MODULES, or remove unsupported languages fromAST_SUPPORTED_LANGUAGESto maintain consistency.🔧 Option 1: Remove unsupported languages from AST_SUPPORTED_LANGUAGES
# Languages that support full AST parsing via tree-sitter AST_SUPPORTED_LANGUAGES: Set[Language] = { - Language.PYTHON, Language.JAVA, Language.KOTLIN, Language.JS, Language.TS, - Language.GO, Language.RUST, Language.C, Language.CPP, Language.CSHARP, - Language.PHP, Language.RUBY, Language.SCALA, Language.LUA, Language.PERL, - Language.SWIFT, Language.HASKELL, Language.COBOL + Language.PYTHON, Language.JAVA, Language.JS, Language.TS, + Language.GO, Language.RUST, Language.C, Language.CPP, Language.CSHARP, + Language.PHP, Language.RUBY, # Only languages with TREESITTER_MODULES entries }python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm-23-26 (1)
23-26: Arrow function declarations will match both function and variable definition patterns, causing duplicates in the index.The pattern at lines 23-26 specifically captures arrow functions assigned to variables as
@definition.function. However, line 49 captures alllexical_declarationnodes as@definition.variable, which will also match the same arrow function declarations.This causes the same AST node to be extracted twice with different semantic types. Consider making the variable declaration pattern more specific to exclude arrow functions, or refactor to prevent duplicate extraction.
This issue also exists in the JavaScript file at lines 18-21 and 32-34.
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py-874-900 (1)
874-900: Fix fragment counts when tiny sub-chunks are skipped.
total_subis computed before the min-size filter, so headers/metadata can claim more fragments than are emitted. Prefilter first, then computetotal_sub.Proposed diff
- sub_chunks = splitter.split_text(chunk.content) - - nodes = [] - parent_id = generate_deterministic_id(path, chunk.content, 0) - total_sub = len([s for s in sub_chunks if s and s.strip()]) + raw_chunks = splitter.split_text(chunk.content) + sub_chunks = [s for s in raw_chunks if s and s.strip()] + if len(sub_chunks) > 1: + sub_chunks = [s for s in sub_chunks if len(s.strip()) >= self.min_chunk_size] + nodes = [] + parent_id = generate_deterministic_id(path, chunk.content, 0) + total_sub = len(sub_chunks) @@ - for i, sub_chunk in enumerate(sub_chunks): - if not sub_chunk or not sub_chunk.strip(): - continue - if len(sub_chunk.strip()) < self.min_chunk_size and total_sub > 1: - continue + for sub_chunk in sub_chunks:python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py-133-156 (1)
133-156: Remove the unusedmetadatadict.
metadatais assigned but never used; this will trigger lint errors and adds noise.🧹 Minimal fix
- metadata = { - "namespace": namespace, - "workspace": workspace, - "project": project, - "branch": branch, - "commit": commit, - "document_count": document_count, - "chunk_count": chunk_count, - "last_updated": datetime.now(timezone.utc).isoformat(), - } - logger.info(f"Indexed {namespace}: {document_count} docs, {chunk_count} chunks")python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py-319-399 (1)
319-399: Return branch stats for branch-scoped operations.
update_filesanddelete_filesare branch-specific, but they currently return project stats (branch="*"). This can mislead callers expecting branch-level results.✅ Suggested fix
- return self.stats_manager.get_project_stats(workspace, project, collection_name) + return self.stats_manager.get_branch_stats(workspace, project, branch, collection_name) ... - return self.stats_manager.get_project_stats(workspace, project, collection_name) + return self.stats_manager.get_branch_stats(workspace, project, branch, collection_name) ... - return self.stats_manager.get_project_stats(workspace, project, collection_name) + return self.stats_manager.get_branch_stats(workspace, project, branch, collection_name)
🧹 Nitpick comments (24)
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/rust.scm (2)
22-23: Consider capturing type/trait info fromimpl_itemfor richer metadata.The current pattern captures the entire
implblock but doesn't extract the type being implemented or the trait (for trait implementations). This metadata could be useful for code-splitting context.💡 Optional enhancement for richer impl metadata
; Implementation blocks -(impl_item) `@definition.impl` +; Inherent impl +(impl_item + type: (_) `@impl_type`) `@definition.impl` + +; Trait impl +(impl_item + trait: (type_identifier) `@trait_name` + type: (_) `@impl_type`) `@definition.impl.trait`
41-46: Consider addingunion_itemandexternblock patterns for completeness.The file covers most Rust constructs, but a few less common ones are missing. These are optional additions depending on your codebase requirements.
💡 Optional patterns for additional Rust constructs
; Macro definitions (macro_definition name: (identifier) `@name`) `@definition.macro` +; Union definitions +(union_item + name: (type_identifier) `@name`) `@definition.union` + +; Extern blocks (FFI) +(foreign_mod_item) `@definition.extern` + ; Attributes (attribute_item) `@attribute`java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/command/ReviewCommandProcessor.java (1)
157-160: Uselog.debuginstead oflog.infofor diagnostic logging.The comment indicates this logging is for debugging purposes, but
log.infowill emit in production under normal log levels. Consider usinglog.debugto avoid log noise and reduce exposure of internal identifiers (connection IDs, model names) in production logs.📝 Suggested fix
- // CRITICAL: Log the AI connection being used for debugging - log.info("Building review command request for project={}, AI model={}, provider={}, aiConnectionId={}", - project.getId(), aiConnection.getAiModel(), aiConnection.getProviderKey(), aiConnection.getId()); + log.debug("Building review command request for project={}, AI model={}, provider={}, aiConnectionId={}", + project.getId(), aiConnection.getAiModel(), aiConnection.getProviderKey(), aiConnection.getId());python-ecosystem/mcp-client/service/multi_stage_orchestrator.py (1)
402-438: Minor: Redundant empty string check.Line 425 (
not clean_line == "") is redundant since line 418 already requiresclean_lineto be truthy, which excludes empty strings.♻️ Suggested simplification
if (clean_line and len(clean_line) > 10 and # Minimum meaningful length not clean_line.startswith("//") and # Skip comments not clean_line.startswith("#") and not clean_line.startswith("*") and not clean_line == "{" and - not clean_line == "}" and - not clean_line == ""): + not clean_line == "}"): current_snippet_lines.append(clean_line)python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py (4)
420-432: Consider adding a configurable limit for the fallback query.The fallback query at line 424 fetches up to 1000 points to search for partial path matches. For large collections, this could impact performance. Consider making this limit configurable or applying early filtering criteria.
500-501: Consider logging exception type for better debugging.The broad
Exceptioncatches are reasonable for a resilient context retrieval service, but logging the exception type alongside the message would help with debugging specific failure modes (e.g., network errors vs. validation errors).💡 Example improvement
except Exception as e: - logger.warning(f"Error querying file '{file_path}': {e}") + logger.warning(f"Error querying file '{file_path}': {type(e).__name__}: {e}")Also applies to: 556-557, 606-607, 656-657
754-755: Rename unused loop variableq_typeto_q_type.The loop variable
q_typeis unpacked but not used within the loop body. Per Python convention, prefix unused variables with an underscore.♻️ Proposed fix
- for i, (q_text, q_weight, q_top_k, q_type) in enumerate(queries): + for i, (q_text, q_weight, q_top_k, _q_type) in enumerate(queries): logger.info(f" Query {i+1}: weight={q_weight}, top_k={q_top_k}, text='{q_text[:80]}...'")
849-849: Remove unusedimport re.The
remodule is imported but not used anywhere in this method. This appears to be a leftover from refactoring.♻️ Proposed fix
from collections import defaultdict import os - import repython-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/c_sharp.scm (1)
50-53: Field and event declarations lack@namecapture.Unlike other declarations,
field_declarationandevent_declarationdon't capture the@name. This is likely intentional since fields can declare multiple variables (e.g.,int a, b, c;), but the inconsistency may affect downstream processing that expects a@namecapture on all definitions.If individual field/event names are needed, consider capturing the nested variable declarators:
Optional: capture individual field variable names
; Field declarations -(field_declaration) `@definition.field` +(field_declaration + (variable_declaration + (variable_declarator + (identifier) `@name`))) `@definition.field` ; Event declarations -(event_declaration) `@definition.event` +(event_declaration + (variable_declaration + (variable_declarator + (identifier) `@name`))) `@definition.event`java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/utils/CommentPlaceholders.java (1)
3-41: Make the utility class non-instantiable.
This class only exposes static members, so preventing instantiation is a small safety win.♻️ Proposed refactor
-public class CommentPlaceholders { +public final class CommentPlaceholders { + private CommentPlaceholders() { + // utility class + }java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/processor/PrProcessRequest.java (1)
39-45: Normalize blankpreAcquiredLockKeyto avoid accidental lock bypass.
If downstream checks only for non-null, an empty string could skip lock acquisition without a valid key.♻️ Suggested guard
- public String getPreAcquiredLockKey() { return preAcquiredLockKey; } + public String getPreAcquiredLockKey() { + return (preAcquiredLockKey == null || preAcquiredLockKey.isBlank()) ? null : preAcquiredLockKey; + }Also applies to: 75-75
java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/project/ProjectDTOTest.java (1)
53-78: Consider adding assertion formaxAnalysisTokenLimit.The test adds the new field value (
200000) to the constructor but doesn't assert ondto.maxAnalysisTokenLimit(). For completeness, add an assertion to verify the value.💡 Suggested assertion
assertThat(dto.webhooksConfigured()).isTrue(); assertThat(dto.qualityGateId()).isEqualTo(50L); + assertThat(dto.maxAnalysisTokenLimit()).isEqualTo(200000); }java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java (1)
80-85: Cleanup runs synchronously on every call - consider async or periodic cleanup.The
cleanupOldEntriesiterates through all map entries on every non-duplicate webhook. Under high traffic with many unique commits, this could add latency. For a more scalable approach, consider:
- Using a scheduled task for periodic cleanup
- Only cleaning up every N calls
- Using a size-bounded cache like Caffeine with TTL
java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java (1)
593-598: Consider adding validation formaxAnalysisTokenLimit.The
maxAnalysisTokenLimitfield accepts anyIntegervalue, including negative numbers or zero, which would be invalid for a token limit. Consider adding validation constraints.💡 Suggested validation
public record UpdateAnalysisSettingsRequest( Boolean prAnalysisEnabled, Boolean branchAnalysisEnabled, String installationMethod, + `@jakarta.validation.constraints.Positive` Integer maxAnalysisTokenLimit ) {}Alternatively, validate in the service layer if project-level defaults should be applied when null/invalid.
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java (1)
112-268: Consider extracting shared logic to reduce duplication across VCS providers.
GitLabAiClientService and GitHubAiClientService share substantial duplicate code:
- Token estimation and limit checking
- Incremental analysis mode determination
- Delta diff fetching and ratio calculations
- AiAnalysisRequest builder construction
Consider extracting common logic to a shared base class or utility to improve maintainability.
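One possible shape for that shared base class, sketched with a hypothetical class name and assuming the existing TokenEstimator and DiffTooLargeException helpers are accessible from it; the real services may need a different seam:

import org.rostilos.codecrow.analysisengine.dto.request.processor.PrProcessRequest;
import org.rostilos.codecrow.analysisengine.exception.DiffTooLargeException;
import org.rostilos.codecrow.analysisengine.util.TokenEstimator;

public abstract class AbstractVcsAiClientService {

    // Shared guard for the token-limit check that is currently duplicated per provider.
    protected void enforceTokenLimit(String diff, int maxTokens, Long projectId, Long pullRequestId) {
        if (TokenEstimator.exceedsLimit(diff, maxTokens)) {
            throw new DiffTooLargeException(TokenEstimator.estimateTokens(diff), maxTokens, projectId, pullRequestId);
        }
    }

    // Provider-specific pieces (delta diff fetching, comment APIs) stay in the subclasses.
    protected abstract String fetchDeltaDiff(PrProcessRequest request);
}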
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java (2)
125-126: Remove duplicate variable assignment.
aiConnection and projectAiConnection are assigned the same value from project.getAiBinding().getAiConnection(). Only one variable is needed.
♻️ Proposed fix
-        AIConnection aiConnection = project.getAiBinding().getAiConnection();
-        AIConnection projectAiConnection = project.getAiBinding().getAiConnection();
+        AIConnection aiConnection = project.getAiBinding().getAiConnection();
Then replace references to
projectAiConnection(line 255) withaiConnection.
127-130: Consider reducing log verbosity for production.
Logging the aiConnectionId at INFO level on every request may be excessive in high-traffic scenarios. Consider using DEBUG level or making this conditional.
♻️ Proposed change
-        log.info("Building PR analysis request for project={}, AI model={}, provider={}, aiConnectionId={}",
+        log.debug("Building PR analysis request for project={}, AI model={}, provider={}, aiConnectionId={}",
                 project.getId(), aiConnection.getAiModel(), aiConnection.getProviderKey(), aiConnection.getId());
python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py (1)
107-111: Add parentheses to clarify operator precedence.
The and/or chaining on line 108 has ambiguous precedence. While Python evaluates and before or, explicit parentheses improve readability and prevent subtle bugs.
♻️ Proposed fix
     looks_like_archive = (
-        '-' in first_part and len(first_part) > 20 or  # owner-repo-commit
+        ('-' in first_part and len(first_part) > 20) or  # owner-repo-commit
         len(first_part) >= 40 or  # Just commit hash
         (first_part.count('-') >= 2 and any(c.isdigit() for c in first_part))  # Has digits and multiple hyphens
     )
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py (1)
120-129: Thread safety consideration for global singleton.
The get_parser() function isn't thread-safe. If multiple threads call it concurrently during initialization, multiple instances could be created. This is likely fine since TreeSitterParser instances are effectively stateless except for caching, but worth noting.
🔒 Thread-safe singleton (optional)
+import threading
+
 # Global singleton instance
 _parser_instance: Optional[TreeSitterParser] = None
+_parser_lock = threading.Lock()

 def get_parser() -> TreeSitterParser:
     """Get the global TreeSitterParser instance."""
     global _parser_instance
     if _parser_instance is None:
-        _parser_instance = TreeSitterParser()
+        with _parser_lock:
+            if _parser_instance is None:
+                _parser_instance = TreeSitterParser()
     return _parser_instance
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py (1)
123-125: Use logger.exception to capture the stack trace on upsert failure.
logger.error loses the stack trace. logger.exception automatically includes it, which is valuable for debugging batch failures.
🔧 Proposed fix
         except Exception as e:
-            logger.error(f"Failed to upsert batch starting at {i}: {e}")
+            logger.exception(f"Failed to upsert batch starting at {i}: {e}")
             failed += len(batch)
python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py (1)
167-211: Consider adding a minimum score floor.
The boosted score can become very low when multiple penalties compound (e.g., 0.8 * 0.95 * 0.7 = 0.532). While there's a max_score_cap, there's no floor. If a near-zero score could cause issues downstream, consider adding a min_score_floor:
         # Cap the score
         score = min(score, self.max_score_cap)
+        # Optional: enforce minimum floor
+        score = max(score, 0.01)  # or use a configurable floor
         return (score, priority)
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py (1)
54-71: Add a ClassVar annotation to the mutable class attribute.
COMMENT_PREFIX is a mutable dict at class level. While it's only read (not modified), annotating it with ClassVar makes the intent clear and satisfies the static analyzer:
🔧 Proposed fix
+from typing import Dict, Any, List, Optional, ClassVar
 ...
 class MetadataExtractor:
     ...
     # Comment prefixes by language
-    COMMENT_PREFIX: Dict[str, str] = {
+    COMMENT_PREFIX: ClassVar[Dict[str, str]] = {
         'python': '#',
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py (1)
219-219: Silence the unused pattern_id loop variable.
This avoids lint noise without changing behavior.
Proposed diff
-        for pattern_id, captures_dict in raw_matches:
+        for _pattern_id, captures_dict in raw_matches:
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py (1)
102-131: Remove or use the unused alias_checker parameter.
The argument is never referenced; either drop it to simplify the API or use it to skip alias names if that was the intent.
♻️ Minimal cleanup
-    def list_all_indices(self, alias_checker) -> List[IndexStats]:
+    def list_all_indices(self) -> List[IndexStats]:
public DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId, Long pullRequestId) {
    super(String.format(
            "PR diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%d, PR=%d)",
            estimatedTokens, maxAllowedTokens, projectId, pullRequestId
    ));
🧩 Analysis chain
🏁 Script executed:
cat -n "java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/exception/DiffTooLargeException.java"Repository: rostilos/CodeCrow
Length of output: 1889
🏁 Script executed:
rg -n "new DiffTooLargeException" -C 3Repository: rostilos/CodeCrow
Length of output: 4371
🏁 Script executed:
# Test String.format behavior with null values in Java
python3 << 'EOF'
# String.format behavior in Java:
# %d expects int, so if you pass null to %d, it will cause NPE
# %s with null is safe - it converts null to "null" string
print("Java String.format behavior:")
print("- %d with null: NPE (because %d expects primitive int)")
print("- %s with null: works fine, outputs 'null'")
print("\nHowever, Long is an Object that can be null.")
print("If you pass Long value to %d, Java auto-unboxes it.")
print("If the Long is null, auto-unboxing throws NPE before format is called.")
EOFRepository: rostilos/CodeCrow
Length of output: 356
Replace %d with %s for nullable Long IDs to avoid NPE during auto-unboxing.
When projectId or pullRequestId are null, String.format("%d", null) throws NullPointerException during auto-unboxing, masking the intended "diff exceeds token limit" error. Using %s safely handles null values.
Safer formatting
- super(String.format(
- "PR diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%d, PR=%d)",
- estimatedTokens, maxAllowedTokens, projectId, pullRequestId
- ));
+ super(String.format(
+ "Diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%s, PR=%s)",
+ estimatedTokens, maxAllowedTokens, projectId, pullRequestId
+ ));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
public DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId, Long pullRequestId) {
    super(String.format(
            "Diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%s, PR=%s)",
            estimatedTokens, maxAllowedTokens, projectId, pullRequestId
    ));
}
🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/exception/DiffTooLargeException.java`
around lines 14 - 18, The DiffTooLargeException constructor currently uses
String.format with %d for projectId and pullRequestId which will NPE on null
Longs due to auto-unboxing; update the format string in the
DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId,
Long pullRequestId) constructor to use %s for the nullable Long parameters and
pass them unchanged so nulls are rendered safely (leave the integer token
placeholders as %d).
public static boolean exceedsLimit(String text, int maxTokens) {
    return estimateTokens(text) > maxTokens;
}
🧩 Analysis chain
🏁 Script executed:
rg -n "maxAnalysisTokenLimit" -C 2Repository: rostilos/CodeCrow
Length of output: 28269
🏁 Script executed:
cat -n java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/util/TokenEstimator.javaRepository: rostilos/CodeCrow
Length of output: 3548
Inconsistent handling of maxTokens <= 0 between exceedsLimit and estimateAndCheck.
The exceedsLimit method lacks the defensive check present in estimateAndCheck. While estimateAndCheck guards against division by zero (line 75), it doesn't apply the same guard to the exceedsLimit boolean logic (line 79). If maxTokens <= 0, both methods will incorrectly mark any non-empty text as exceeding the limit. Add a guard to exceedsLimit for consistency, or apply it to the exceedsLimit field in estimateAndCheck:
Suggested fix
public static boolean exceedsLimit(String text, int maxTokens) {
- return estimateTokens(text) > maxTokens;
+ return maxTokens > 0 && estimateTokens(text) > maxTokens;
}
@@
public static TokenEstimationResult estimateAndCheck(String text, int maxTokens) {
int estimated = estimateTokens(text);
double utilization = maxTokens > 0 ? (estimated * 100.0 / maxTokens) : 0;
return new TokenEstimationResult(
estimated,
maxTokens,
- estimated > maxTokens,
+ maxTokens > 0 && estimated > maxTokens,
utilization
);
}🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/util/TokenEstimator.java`
around lines 46 - 48, The exceedsLimit method should apply the same defensive
check as estimateAndCheck: ensure maxTokens > 0 before comparing; update
exceedsLimit(String text, int maxTokens) to return false when maxTokens <= 0 (or
use the same boolean expression pattern used in estimateAndCheck) and then
compare estimateTokens(text) > maxTokens, referencing the exceedsLimit and
estimateAndCheck methods and the estimateTokens call to keep behavior consistent
and avoid the incorrect true result for non-empty text when maxTokens <= 0.
def preserve_other_branch_points(
    self,
    collection_name: str,
    exclude_branch: str
) -> List[PointStruct]:
    """Preserve points from branches other than the one being reindexed.
    Used during full reindex to keep data from other branches.
    """
    logger.info(f"Preserving points from branches other than '{exclude_branch}'...")

    preserved_points = []
    offset = None

    try:
        while True:
            results = self.client.scroll(
                collection_name=collection_name,
                limit=100,
                offset=offset,
                scroll_filter=Filter(
                    must_not=[
                        FieldCondition(
                            key="branch",
                            match=MatchValue(value=exclude_branch)
                        )
                    ]
                ),
                with_payload=True,
                with_vectors=True
            )
            points, next_offset = results
            preserved_points.extend(points)

            if next_offset is None or len(points) < 100:
                break
            offset = next_offset

        logger.info(f"Found {len(preserved_points)} points from other branches to preserve")
        return preserved_points
    except Exception as e:
        logger.warning(f"Could not read existing points: {e}")
        return []
Fail fast when preserving other branches to avoid data loss.
If reading existing points fails, returning an empty list lets the reindex proceed and can drop other branches during the swap. Also, loading all preserved points into memory scales poorly on large collections—consider streaming copy to the target collection instead of accumulating everything in RAM.
🔧 Suggested fix to avoid silent data loss
- except Exception as e:
- logger.warning(f"Could not read existing points: {e}")
- return []
+ except Exception as e:
+ logger.warning(f"Could not read existing points: {e}")
+ raise🧰 Tools
🪛 Ruff (0.14.14)
[warning] 141-141: Consider moving this statement to an else block
(TRY300)
[warning] 142-142: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py`
around lines 102 - 145, The preserve_other_branch_points function currently
swallows exceptions and returns an empty list (via the try/except around
self.client.scroll) which can cause silent data loss during reindex; also
accumulating all points in preserved_points is memory-unsafe for large
collections. Change the function to fail fast by removing the broad except or
re-raising after logging so errors surface (keep logger.warning but raise), and
refactor to stream results instead of collecting them in RAM—e.g., convert
preserve_other_branch_points into a generator or accept a callback/writer to
send each page (results from self.client.scroll) directly to the target
collection as you iterate, using self.client.scroll and next_offset to page
through until completion.
def create_versioned_collection(self, base_name: str) -> str:
    """Create a new versioned collection for atomic swap indexing."""
    versioned_name = f"{base_name}_v{int(time.time())}"
    logger.info(f"Creating versioned collection: {versioned_name}")

    self.client.create_collection(
        collection_name=versioned_name,
        vectors_config=VectorParams(
            size=self.embedding_dim,
            distance=Distance.COSINE
        )
    )
    return versioned_name
Versioned collection names can collide within the same second.
int(time.time()) has second resolution; concurrent indexing can generate identical names and fail collection creation. Prefer a higher‑resolution or random suffix.
Proposed diff (nanosecond timestamp)
- versioned_name = f"{base_name}_v{int(time.time())}"
+ versioned_name = f"{base_name}_v{time.time_ns()}"🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py`
around lines 53 - 65, The version suffix in create_versioned_collection uses
int(time.time()) which has only second resolution and can collide; change the
suffix to a higher-resolution or random identifier (e.g., use time.time_ns() or
uuid.uuid4()) when building versioned_name so concurrent calls produce unique
names; update the versioned_name construction and any related logging that
references versioned_name in
collection_manager.CollectionManager.create_versioned_collection to use the new
suffix.
# Group chunks by file path
chunks_by_file: Dict[str, List[TextNode]] = {}
for chunk in chunks:
    path = chunk.metadata.get("path", str(uuid.uuid4()))
    if path not in chunks_by_file:
        chunks_by_file[path] = []
    chunks_by_file[path].append(chunk)
Non-deterministic fallback path breaks upsert semantics.
When a chunk's metadata lacks a path, using uuid.uuid4() as a fallback generates a random value on each call. This defeats the purpose of deterministic IDs, causing duplicate points instead of replacements on re-indexing.
Consider using a consistent fallback like "unknown" or raising an error:
🐛 Proposed fix
# Group chunks by file path
chunks_by_file: Dict[str, List[TextNode]] = {}
for chunk in chunks:
- path = chunk.metadata.get("path", str(uuid.uuid4()))
+ path = chunk.metadata.get("path", "unknown")
if path not in chunks_by_file:
chunks_by_file[path] = []
chunks_by_file[path].append(chunk)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Group chunks by file path
chunks_by_file: Dict[str, List[TextNode]] = {}
for chunk in chunks:
    path = chunk.metadata.get("path", "unknown")
    if path not in chunks_by_file:
        chunks_by_file[path] = []
    chunks_by_file[path].append(chunk)
🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py`
around lines 50 - 56, The code in the grouping loop uses uuid.uuid4() as a
fallback for missing chunk.metadata["path"], making the fallback
non-deterministic and breaking upsert semantics; update the logic in the loop
that builds chunks_by_file (the for chunk in chunks block that uses
chunk.metadata.get("path", ...)) to use a deterministic fallback (e.g., a
constant like "unknown" or a stable computed key) or explicitly raise an error
when "path" is missing so that TextNode objects without a stable path cannot
produce random IDs and duplicate points on re-indexing.
; Python tree-sitter queries for AST-based code splitting

; Import statements
(import_statement) @import

(import_from_statement) @import

; Class definitions
(class_definition
  name: (identifier) @name) @definition.class

; Function definitions
(function_definition
  name: (identifier) @name) @definition.function

; Decorated definitions (class or function with decorators)
(decorated_definition) @definition.decorated

; Decorators
(decorator) @decorator

; Assignment statements (module-level constants)
(assignment
  left: (identifier) @name) @definition.assignment

; Type alias (Python 3.12+)
(type_alias_statement
  name: (type) @name) @definition.type_alias
🧩 Analysis chain
🏁 Script executed:
# First, let's explore the repository structure to understand context
find . -type f -name "*.scm" | head -20Repository: rostilos/CodeCrow
Length of output: 714
🏁 Script executed:
# Search for type_alias references in the codebase
rg -i "type_alias" --type-list | head -5Repository: rostilos/CodeCrow
Length of output: 450
🏁 Script executed:
# Look for any documentation or tests related to tree-sitter queries
find . -type f \( -name "*.md" -o -name "*test*" -o -name "*example*" \) | grep -E "(test|example|doc)" | head -20Repository: rostilos/CodeCrow
Length of output: 944
🏁 Script executed:
# Check if there are other Python query files or configuration
find . -type f -name "*.scm" -o -name "*tree-sitter*" -o -name "*grammar*"Repository: rostilos/CodeCrow
Length of output: 714
🏁 Script executed:
# Search for type_alias_statement usage anywhere in the repo
rg "type_alias_statement" -A 3 -B 1Repository: rostilos/CodeCrow
Length of output: 395
🏁 Script executed:
# Search for any documentation about AST chunking or tree-sitter usage
cat ./python-ecosystem/rag-pipeline/docs/AST_CHUNKING.mdRepository: rostilos/CodeCrow
Length of output: 5100
🏁 Script executed:
# Check if there are any test files that validate these queries
find . -type f -name "*.py" | xargs rg -l "type_alias" | head -5Repository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Look at similar patterns in other language query files for reference
cat ./python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scmRepository: rostilos/CodeCrow
Length of output: 1380
🏁 Script executed:
# Search for tree-sitter configuration or dependencies
find . -type f \( -name "pyproject.toml" -o -name "requirements*.txt" -o -name "setup.py" \) | xargs grep -l "tree-sitter" 2>/dev/nullRepository: rostilos/CodeCrow
Length of output: 109
🏁 Script executed:
# Search for any usage or import of the python query file
rg "python.scm" -A 3 -B 1Repository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Check the requirements.txt to see tree-sitter versions
cat ./python-ecosystem/rag-pipeline/requirements.txt | grep -i treeRepository: rostilos/CodeCrow
Length of output: 476
🏁 Script executed:
# Search for any tests that use or validate these queries
find . -path "*/test*" -name "*.py" | xargs rg -l "query\|parser\|splitter" 2>/dev/null | head -5Repository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Look at how the queries are used in the actual splitter code
find . -type f -name "*.py" -path "*/splitter/*" | head -10Repository: rostilos/CodeCrow
Length of output: 518
🏁 Script executed:
# Check the actual splitter implementation to understand how queries are used
find . -type f -name "*splitter*.py" | xargs ls -lh 2>/dev/null | head -10Repository: rostilos/CodeCrow
Length of output: 183
🏁 Script executed:
# Check the query_runner to understand how queries are executed
cat ./python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.pyRepository: rostilos/CodeCrow
Length of output: 14628
🏁 Script executed:
# Check tree_parser to see how it loads and uses queries
cat ./python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.pyRepository: rostilos/CodeCrow
Length of output: 4243
🏁 Script executed:
# Search for any test files that specifically test Python queries
find . -type f -name "*.py" -path "*test*" | xargs rg -l "python.scm\|type_alias" 2>/dev/nullRepository: rostilos/CodeCrow
Length of output: 45
🏁 Script executed:
# Check if there's any documentation or examples about type alias handling
rg "type.*alias\|type_alias_statement" --type-add 'scm:*.scm' -iRepository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Look for tree-sitter-python documentation or examples in the repo
find . -type f -name "*.md" -o -name "*.txt" | xargs rg -i "python.*3.12\|type.*alias" 2>/dev/null | head -20Repository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Search for any Python 3.12+ type alias examples or tests
rg "type\s+\w+\s*=" --type py -B 2 -A 2 | head -30Repository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Check if there are any comments or documentation about the type_alias_statement
rg -i "type.*alias.*grammar\|type_alias.*node\|type_alias.*structure" --type md --type txtRepository: rostilos/CodeCrow
Length of output: 43
🏁 Script executed:
# Let me search for tree-sitter grammar info onlineRepository: rostilos/CodeCrow
Length of output: 43
🌐 Web query:
tree-sitter-python type_alias_statement grammar node structure
💡 Result:
In tree-sitter-python, a PEP 695 alias like:
type Alias[T] = dict[str, T]is parsed as a type_alias_statement node whose important named children are exposed via fields (Tree-sitter “field names” become addressable children in the produced syntax tree). [1]
As of tree-sitter-python 0.23.6, type_alias_statement has these key fields: [2]
left: the alias “header” (the alias name, and—if present—its type parameters, e.g.Alias[T])right: the aliased type expression on the RHS (e.g.dict[str, T])
So structurally (schematically) it looks like:
(type_alias_statement
left: <alias-name-or-generic-alias>
right: <type-expression>)
where the literal tokens type and = are also present as (typically anonymous) children in the node’s full child list.
Sources:
[1] https://tree-sitter.github.io/tree-sitter/creating-parsers/2-the-grammar-dsl.html
[2] https://www.rpmfind.info/linux/RPM/opensuse/16.0/x86_64/tree-sitter-python-0.23.6-bp160.1.12.x86_64.html
🏁 Script executed:
# Search for tree-sitter-python grammar documentation in the repo
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "*.py" \) | xargs grep -l "tree-sitter" 2>/dev/null | head -5Repository: rostilos/CodeCrow
Length of output: 345
🌐 Web query:
tree-sitter-python type_alias_statement left field identifier example query
💡 Result:
In tree-sitter-python, type_alias_statement does not define a left: field — it’s just:
type_alias_statement: $ => prec.dynamic(1, seq(
'type',
$.type,
'=',
$.type,
))[1]
So you match the first type child (the alias “name” side) structurally in your query (not via left:). Field syntax in queries only works when the grammar assigned a field name. [2]
Example source
type Point = tuple[int, int]
type Box[T] = tuple[T, T]Query: capture the alias identifier (simple type Name = ...)
(type_alias_statement
"type"
(type (identifier) `@type_alias.name`)
"="
(type) `@type_alias.value`)(Works when the LHS type is just an identifier.)
Query: capture the alias identifier for generics too (type Name[T] = ...)
In this grammar, a generic LHS is parsed as generic_type: $.identifier $.type_parameter. [1]
(type_alias_statement
"type"
(type
(generic_type
(identifier) `@type_alias.name`
(type_parameter) `@type_alias.type_params`))
"="
(type) `@type_alias.value`)Sources: [1] [2]
🏁 Script executed:
# Check if there are any other Python query files or examples we can reference
find . -type f -name "*.py" -path "*splitter*" -o -path "*query*" | xargs grep -l "scm\|query" 2>/dev/null | head -5Repository: rostilos/CodeCrow
Length of output: 364
Fix type_alias_statement query—incorrect field syntax and missing identifier nesting.
The type_alias_statement query on lines 27–28 uses incorrect syntax. Tree-sitter-python's type_alias_statement does not expose a name: field; the grammar structures it as a sequence: 'type', $.type, '=', $.type. To capture the alias name, nest the identifier within the first (type) node:
Suggested fix
; Type alias (Python 3.12+)
(type_alias_statement
"type"
(type (identifier) `@name`)
"="
  (type)) `@definition.type_alias`
For generic aliases (e.g., type Point[T] = ...), also handle generic_type:
(type_alias_statement
"type"
(type
[(identifier) `@name`
(generic_type (identifier) `@name`)])
"="
  (type)) `@definition.type_alias`
🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/python.scm`
around lines 1 - 28, The type_alias_statement query is using an incorrect field
syntax; update the (type_alias_statement) query so it matches the grammar
sequence ("type" (type ...) "=" (type)) and capture the alias identifier by
nesting (identifier) `@name` inside the first (type) node instead of using a
non-existent name: field; also add an alternative to handle generic aliases by
allowing the first (type) to be either (identifier) `@name` or (generic_type
(identifier) `@name`), and keep the whole node annotated as
`@definition.type_alias`.
if not matches:
    return []

source_bytes = text.encode('utf-8')
Remove or wire unused captures to avoid F841.
These locals are never used and will trip Ruff. Either propagate them into ASTChunk (if intended) or drop them.
Proposed diff (remove unused locals)
- source_bytes = text.encode('utf-8')
@@
- visibility = match.get(f'{match.pattern_name}.visibility')
- return_type = match.get(f'{match.pattern_name}.return_type')
- params = match.get(f'{match.pattern_name}.params')
          modifiers = []
Also applies to: 340-342
🧰 Tools
🪛 Ruff (0.14.14)
[error] 263-263: Local variable source_bytes is assigned to but never used
Remove assignment to unused variable source_bytes
(F841)
🤖 Prompt for AI Agents
In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py` at
line 263, source_bytes is computed but never used, causing a Ruff F841; either
remove this unused local and the other unused captures around lines 340-342, or
wire them into the ASTChunk so they become meaningful fields. To fix: if you
need the data, add parameters/attributes to the ASTChunk constructor/definition
(class ASTChunk) and pass source_bytes (and the other captures) into the
ASTChunk creation sites; otherwise delete the unused local assignments (e.g.,
remove source_bytes = text.encode('utf-8') and the analogous unused variables at
340-342). Ensure any tests or downstream code expect the new ASTChunk signature
if you choose to propagate the values.
This pull request introduces enhanced tracking and reporting of code analysis issues across all pull request (PR) analysis versions, providing the AI with a complete issue history—including resolution tracking—to improve context for code review. It also adds an atomic upsert method for command rate limiting and updates related tests.
AI Analysis Issue History & Resolution Tracking
Extended AiRequestPreviousIssueDTO to track PR version, resolution description, resolving commit, and resolved PR version, and updated the fromEntity method and tests accordingly. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
Added withAllPrAnalysesData in AiAnalysisRequestImpl to provide the LLM with deduplicated, full PR issue history, including resolved issues.
Updated PullRequestAnalysisProcessor to gather all PR analyses and pass them to the AI request builder for richer context. [1] [2]
Extended the VcsAiClientService interface with a new overload of buildAiAnalysisRequest that accepts the full list of PR analyses for backward compatibility and richer AI input. [1] [2]
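For orientation, a minimal sketch of how a deduplicated issue history across all PR analyses can be assembled; the record shapes and the path/line/rule key are illustrative assumptions, not the actual AiAnalysisRequestImpl logic:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IssueHistorySketch {
    // Placeholder shapes; the real DTO is AiRequestPreviousIssueDTO and carries more fields.
    record PreviousIssue(String filePath, Integer line, String ruleId, Integer prVersion, boolean resolved) {}
    record PrAnalysis(List<PreviousIssue> issues) {}

    static List<PreviousIssue> dedupe(List<PrAnalysis> allPrAnalyses) {
        Map<String, PreviousIssue> byKey = new LinkedHashMap<>();
        for (PrAnalysis analysis : allPrAnalyses) {            // iterate oldest to newest
            for (PreviousIssue issue : analysis.issues()) {
                String key = issue.filePath() + ":" + issue.line() + ":" + issue.ruleId();
                byKey.put(key, issue);                          // the newest occurrence wins, resolution state included
            }
        }
        return new ArrayList<>(byKey.values());
    }
}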
Infrastructure
Added an atomic upsert method (upsertCommandCount) to CommentCommandRateLimitRepository to increment or create command rate limit records without race conditions, using PostgreSQL's ON CONFLICT DO UPDATE.
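A sketch of what such an upsert typically looks like with Spring Data JPA; the table and column names, and the unique constraint on (project_id, command), are assumptions rather than the project's actual schema:

import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

interface CommandRateLimitUpsertSketch {

    @Modifying
    @Query(value = """
            INSERT INTO comment_command_rate_limit (project_id, command, command_count, window_start)
            VALUES (:projectId, :command, 1, now())
            ON CONFLICT (project_id, command)
            DO UPDATE SET command_count = comment_command_rate_limit.command_count + 1
            """, nativeQuery = true)
    int upsertCommandCount(@Param("projectId") Long projectId, @Param("command") String command);
}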
Other
Updated the frontend submodule to a new commit.

This pull request introduces important improvements to both the Java and Python ecosystems, focusing on preventing duplicate pull request analyses, enhancing web security configuration, and fixing byte/character slicing issues in Python AST processing. The changes also include a submodule update in the frontend.
Java Ecosystem
Webhook Handler Improvements:
Added an AnalysisLockService check in both GitHubPullRequestWebhookHandler and BitbucketCloudPullRequestWebhookHandler to prevent duplicate analyses for the same branch by skipping webhook processing if an analysis is already in progress. This helps avoid redundant placeholder comments and race conditions. [1] [2] [3] [4] [5] [6]
Web Security Configuration:
Updated WebSecurityConfig.java to permit all requests for async and error dispatcher types, allow unauthenticated access to the /error endpoint, and clarify the handling of CORS preflight requests. This improves compatibility with SSE/streaming and error handling. [1] [2]
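In Spring Security 6 terms, the described behaviour corresponds roughly to the sketch below; apart from the /error path, the matchers and their order are assumptions rather than the project's exact configuration:

import jakarta.servlet.DispatcherType;
import org.springframework.http.HttpMethod;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;

class WebSecuritySketch {
    void configure(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                .dispatcherTypeMatchers(DispatcherType.ASYNC, DispatcherType.ERROR).permitAll() // SSE/streaming and error dispatches
                .requestMatchers("/error").permitAll()                                          // unauthenticated error endpoint
                .requestMatchers(HttpMethod.OPTIONS, "/**").permitAll()                         // CORS preflight
                .anyRequest().authenticated());
    }
}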
Python Ecosystem
AST Chunk Extraction Fixes:
Fixed _extract_ast_chunks_with_context by ensuring that all tree-sitter node slicing is done at the byte level and then decoded. This prevents Unicode-related bugs and ensures accurate extraction of code chunks and line numbers. [1] [2] [3]
Formatting Improvements:
Reworked _format_rag_context to use a variable for metadata text, improving readability and maintainability.
Submodule Update:
frontendsubmodule to a new commit, pulling in the latest changes from the frontend repository.Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.