
Conversation

@rostilos (Owner) commented Jan 23, 2026

This pull request introduces enhanced tracking and reporting of code analysis issues across all pull request (PR) analysis versions, providing the AI with a complete issue history—including resolution tracking—to improve context for code review. It also adds an atomic upsert method for command rate limiting and updates related tests.

AI Analysis Issue History & Resolution Tracking

  • Added new fields to AiRequestPreviousIssueDTO to track the PR version, resolution description, resolving commit, and resolved PR version, and updated the fromEntity method and tests accordingly.
  • Implemented withAllPrAnalysesData in AiAnalysisRequestImpl to provide the LLM with the deduplicated, full PR issue history, including resolved issues.
  • Updated PullRequestAnalysisProcessor to gather all PR analyses and pass them to the AI request builder for richer context.
  • Extended the VcsAiClientService interface with a new overload of buildAiAnalysisRequest that accepts the full list of PR analyses for richer AI input, while keeping the existing signature for backward compatibility (sketched below).
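
For orientation, a rough sketch of what the extended interface could look like follows. The return and parameter types (AiAnalysisRequest, ProcessRequest, Project, CodeAnalysis) are inferred from this description and the review comments below, so treat the exact signatures as assumptions rather than the actual code.

import java.util.List;
import java.util.Optional;

// Hypothetical shape of the VcsAiClientService overloads described above.
public interface VcsAiClientService {

    // Existing entry point, kept so current callers continue to compile.
    AiAnalysisRequest buildAiAnalysisRequest(Project project,
                                             ProcessRequest request,
                                             Optional<CodeAnalysis> previousAnalysis);

    // New overload: also receives every analysis version recorded for the PR,
    // so the request builder can hand the LLM the full issue history,
    // including issues that were already resolved.
    AiAnalysisRequest buildAiAnalysisRequest(Project project,
                                             ProcessRequest request,
                                             Optional<CodeAnalysis> previousAnalysis,
                                             List<CodeAnalysis> allPrAnalyses);
}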

Infrastructure

  • Added an atomic upsert method (upsertCommandCount) to CommentCommandRateLimitRepository to increment or create command rate limit records without race conditions using PostgreSQL's ON CONFLICT DO UPDATE.

Other

  • Updated the frontend submodule to a new commit.

This pull request also introduces improvements to both the Java and Python ecosystems, focusing on preventing duplicate pull request analyses, enhancing the web security configuration, and fixing byte/character slicing issues in Python AST processing.

Java Ecosystem

Webhook Handler Improvements:

  • Added an AnalysisLockService check in both GitHubPullRequestWebhookHandler and BitbucketCloudPullRequestWebhookHandler to prevent duplicate analyses for the same branch by skipping webhook processing if an analysis is already in progress. This helps avoid redundant placeholder comments and race conditions.

Web Security Configuration:

  • Updated WebSecurityConfig.java to permit all requests for async and error dispatcher types, allow unauthenticated access to the /error endpoint, and clarify the handling of CORS preflight requests. This improves compatibility with SSE/streaming and error handling.
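
For context, here is a generic Spring Security 6 sketch of the behavior described above (dispatcher-type bypass plus a public /error endpoint). It is an illustrative configuration only, not the project's actual WebSecurityConfig, and the CORS/preflight wiring is omitted.

import jakarta.servlet.DispatcherType;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class WebSecurityConfigSketch {

    @Bean
    SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                // ASYNC dispatches (used by SSE/streaming responses) and ERROR dispatches
                // are internal re-dispatches and should not be re-authenticated.
                .dispatcherTypeMatchers(DispatcherType.ASYNC, DispatcherType.ERROR).permitAll()
                // The error page must be reachable anonymously so failures surface as
                // proper error responses instead of authentication redirects.
                .requestMatchers("/error").permitAll()
                .anyRequest().authenticated());
        return http.build();
    }
}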

Python Ecosystem

AST Chunk Extraction Fixes:

  • Fixed byte vs. character slicing issues in _extract_ast_chunks_with_context by ensuring that all tree-sitter node slicing is done at the byte level and then decoded. This prevents Unicode-related bugs and ensures accurate extraction of code chunks and line numbers.

Formatting Improvements:

  • Refactored _format_rag_context to use a variable for metadata text, improving readability and maintainability.

Frontend

Submodule Update:

  • Updated the frontend submodule to a new commit, pulling in the latest changes from the frontend repository.

Summary by CodeRabbit

  • New Features
    • Pre-emptive locking to avoid duplicate PR analyses; deterministic RAG context API and branch-diff retrieval; per-project analysis token limit and token-estimation guard with clear error on oversized diffs.
  • Enhancements
    • RAG indexing & splitter overhaul for richer AST-based chunking, larger default chunk sizes, deterministic indexing and improved index management.
  • Bug Fixes
    • Atomic upsert for comment command rate-limiting and safer alias-swap/migration flow.
  • Data / API
    • Added resolution-tracking fields to issue payloads and included full PR history in AI analysis requests; removed connection-level tokenLimitation setting.
  • Security
    • Async/error dispatches allowed to bypass auth; error page made public.
  • Chores
    • Frontend submodule pointer updated (no behavior change).


@coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

Adds pre-acquired analysis locking to webhook handlers and processors, atomic upserts for comment rate limits, token-limit estimation, expanded previous-issue/resolution metadata across AI flows, deterministic RAG context endpoint and Qdrant index manager, a Tree-sitter AST-based splitter, various RAG/indexing managers, and assorted logging, DTO, and API adjustments.

Changes

Cohort / File(s) Summary
Frontend Submodule
frontend
Subproject pointer updated (commit changed). No code changes.
Web Security
java-ecosystem/libs/security/.../WebSecurityConfig.java
Allow DispatcherType.ASYNC and DispatcherType.ERROR through security; permit /error publicly.
Webhook Locking & Handlers
java-ecosystem/services/pipeline-agent/src/main/java/.../github/.../GitHubPullRequestWebhookHandler.java, .../bitbucket/.../BitbucketCloudPullRequestWebhookHandler.java, .../gitlab/.../GitLabMergeRequestWebhookHandler.java, .../bitbucket/.../BitbucketCloudBranchWebhookHandler.java
Inject AnalysisLockService; pre-acquire PR analysis locks before posting placeholders; skip/ignore duplicate events when lock not obtained; update constructors and propagate preAcquiredLockKey.
Async Processor, Dedup, Placeholders
.../pipeline-agent/.../generic/processor/WebhookAsyncProcessor.java, .../generic/service/WebhookDeduplicationService.java, .../generic/utils/CommentPlaceholders.java
Add transactional proxy method for webhook processing, handle AnalysisLocked/DiffTooLarge exceptions, centralize placeholders/messages, add webhook deduplication service and caller checks.
Rate Limiting (DB + Service)
java-ecosystem/libs/core/.../analysis/CommentCommandRateLimitRepository.java, .../pipeline-agent/.../generic/service/CommentCommandRateLimitService.java
Add native upsertCommandCount using ON CONFLICT to atomically increment command_count; replace read-modify-write with repository upsert.
Analysis Locking in Processor & Services
java-ecosystem/libs/analysis-engine/.../processor/analysis/PullRequestAnalysisProcessor.java, java-ecosystem/libs/analysis-engine/.../dto/request/processor/PrProcessRequest.java
Support preAcquiredLockKey in PR requests and processing; allow bypass of lock acquisition when provided.
AI DTOs, Builders & Client Wiring
java-ecosystem/libs/analysis-engine/.../AiRequestPreviousIssueDTO.java, .../AiAnalysisRequestImpl.java, .../service/vcs/VcsAiClientService.java, services/pipeline-agent/.../*AiClientService.java
Add resolution-tracking fields to previous-issue DTO, add builder withAllPrAnalysesData, add overloaded buildAiAnalysisRequest to accept allPrAnalyses and propagate through Bitbucket/GitHub/GitLab services; wire project-level maxAnalysisTokenLimit.
CodeAnalysis Persistence & API
java-ecosystem/libs/core/.../service/CodeAnalysisService.java, .../persistence/repository/codeanalysis/CodeAnalysisRepository.java
Persist analyses earlier, reconcile issues with persisted originals, add getAllPrAnalyses(Long, Long) repository/service method to retrieve all versions for a PR.
Diff Token Estimation & Exceptions
java-ecosystem/libs/analysis-engine/.../util/TokenEstimator.java, java-ecosystem/libs/analysis-engine/.../exception/DiffTooLargeException.java, java-ecosystem/libs/analysis-engine/pom.xml, module-info.java
Add TokenEstimator (jtokkit dependency) and DiffTooLargeException; estimate tokens for diffs and throw when exceeding the project-configured max tokens; add jtokkit to the POM and module-info requires (see the sketch after this table).
Project Config / DTOs / AIConnection
java-ecosystem/libs/core/.../model/project/config/ProjectConfig.java, .../model/project/Project.java, .../dto/project/ProjectDTO.java, .../dto/ai/AIConnectionDTO.java, migrations, tests
Add project-level maxAnalysisTokenLimit (default 200000); Project.getEffectiveConfig helper; remove tokenLimitation from AIConnection and DTOs; add DB migration to drop column; update tests and controller DTOs to remove tokenLimitation.
RAG Query & Deterministic Context API
python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py, python-ecosystem/mcp-client/service/rag_client.py, python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py
Add /query/deterministic endpoint and DeterministicContextRequest; RagClient.get_deterministic_context; RAGQueryService.get_deterministic_context returning metadata-driven deterministic context (semantic names, related definitions, class/namespace context) and branch-prioritized deduplication.
RAG Index Manager & Qdrant Integration
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/... (manager.py, indexer.py, collection_manager.py, branch_manager.py, point_operations.py, stats_manager.py, __init__.py)
New RAG index manager subsystem for Qdrant: collection/alias management, branch operations, repository indexing (atomic alias swap + migration path), point upsert/embed flow, stats manager, file-level update/delete operations.
AST-based Splitter & Tree-sitter Integration
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/... (splitter.py, tree_parser.py, query_runner.py, languages.py, metadata.py, queries/*.scm), rag_pipeline/core/__init__.py
Add ASTCodeSplitter with ASTChunk and deterministic IDs; tree-sitter parser, query runner, language mappings, metadata extractor, and per-language .scm queries for Java/TS/Python/Go/Rust/C#/PHP/JS. Expose new splitter exports.
Removed / Replaced Splitters
python-ecosystem/rag-pipeline/src/rag_pipeline/core/ast_splitter.py, .../semantic_splitter.py, .../core/chunking.py
Removed legacy AST-based ast_splitter module and the semantic_splitter and code-aware/function-aware splitters; replaced by new splitter package.
Splitter/Indexing Config & Utilities
python-ecosystem/rag-pipeline/src/rag_pipeline/models/config.py, .../models/scoring_config.py, .../utils/utils.py, core/loader.py
Increase default chunk sizes; add scoring config for reranking; introduce clean_archive_path and apply to loader metadata; add path normalization.
Multi-Stage Orchestrator & MCP client changes
python-ecosystem/mcp-client/service/multi_stage_orchestrator.py, python-ecosystem/mcp-client/model/models.py, python-ecosystem/mcp-client/utils/response_parser.py, .../prompts/prompt_constants.py
Always run reconciliation when previousCodeAnalysisIssues exist; add _extract_diff_snippets and _fetch_batch_rag_context; accept and propagate previous-issue metadata (prVersion, resolved fields); response parser accepts issueId and normalizes to id; update prompt templates to instruct preserving original ids and boolean isResolved.
RAG Client deterministic integration & tests
python-ecosystem/mcp-client/service/rag_client.py, tests
Add RagClient.get_deterministic_context with payload/response handling and logging; tests updated where applicable.
VCS: GitLab diff endpoint
java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
Add public getBranchDiff(...) method to call GitLab compare API and return unified diff-like string.
RAG Engine & Operations Logging
java-ecosystem/libs/rag-engine/src/main/java/.../RagOperationsServiceImpl.java, IncrementalRagUpdateService.java, VcsRagIndexingService.java
Add extensive logging, pre-indexing canStartIndexing check, acquire RAG_INDEXING lock before starting job, release lock in finally; streamline job creation after lock.
Project Controller & Service
java-ecosystem/services/web-server/.../ProjectController.java, project/service/ProjectService.java
Expose UpdateAnalysisSettingsRequest.maxAnalysisTokenLimit and persist it via projectService.updateAnalysisSettings(project, maxAnalysisTokenLimit).
Tests & DTO updates
various tests under java-ecosystem/tests/integration-tests/ and libs/*/src/test
Update tests to reflect removed tokenLimitation, ProjectDTO new field, AiRequestPreviousIssueDTO additions, and RAGConfig default chunk sizes.
Misc small changes
java-ecosystem/services/pipeline-agent/src/main/java/.../config/AsyncConfig.java, ProviderWebhookController.java, Ask/Review/SummarizeCommandProcessor.*, AbstractWebhookHandler.java
Add rejection handling policy for webhookExecutor; additional logging around webhook dispatch; source of token limit switched to project.getEffectiveConfig().maxAnalysisTokenLimit(); small import cleanup.
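
To make the "Diff Token Estimation & Exceptions" entry above concrete, here is a minimal sketch of a jtokkit-based guard. Only the jtokkit calls (Encodings.newDefaultEncodingRegistry, getEncoding, countTokens) are real library API; the class shape, encoding choice, and exception type are assumptions, and the actual TokenEstimator / DiffTooLargeException in this PR may look different.

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

// Hypothetical guard: estimate the token count of a diff and refuse analysis
// when it exceeds the project-level maxAnalysisTokenLimit.
public final class TokenEstimatorSketch {

    private static final EncodingRegistry REGISTRY = Encodings.newDefaultEncodingRegistry();
    // Encoding choice is an assumption; the real TokenEstimator may use another one.
    private static final Encoding ENCODING = REGISTRY.getEncoding(EncodingType.CL100K_BASE);

    public static int estimateTokens(String diff) {
        return diff == null ? 0 : ENCODING.countTokens(diff);
    }

    public static void assertWithinLimit(String diff, int maxAnalysisTokenLimit) {
        int estimated = estimateTokens(diff);
        if (estimated > maxAnalysisTokenLimit) {
            // The PR throws DiffTooLargeException here; a standard exception stands in for it.
            throw new IllegalStateException(
                    "Diff too large: ~" + estimated + " tokens exceeds limit " + maxAnalysisTokenLimit);
        }
    }
}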

Sequence Diagram(s)

sequenceDiagram
  participant Webhook as WebhookHandler
  participant LockSvc as AnalysisLockService
  participant Vcs as VcsService
  participant Processor as PullRequestAnalysisProcessor

  Webhook->>LockSvc: acquireLock(PR_ANALYSIS, project, sourceBranch, commitHash, prId)
  alt lock NOT acquired
    LockSvc-->>Webhook: empty
    Webhook-->>Webhook: log duplicate
    Webhook-->>Webhook: return IGNORED
  else lock acquired
    LockSvc-->>Webhook: lockKey
    Webhook->>Vcs: post placeholder comment (protected by lock)
    Webhook->>Processor: submit PrProcessRequest (includes preAcquiredLockKey)
    Processor->>LockSvc: (processor may use existing lockKey or acquire/release as needed)
    Processor-->>Vcs: update placeholder / post results
    Processor-->>Webhook: completion
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • 0.1.2 rc #20 — Strong relation: overlaps placeholder/comment posting, webhook handlers, and diff handling changes.
  • 1.2.0 rc #65 — Strong relation: modifies RAG/delta-indexing and index management surfaces used by the new index manager.
  • 1.2.1 rc #85 — Strong relation: addresses AST-based splitting and splitter/query changes similar to the new Tree-sitter splitter work.

Poem

🐰 I hopped through locks and counted hops in a row,
Comments stayed atomic, no races to show,
I parsed trees with whiskers, found functions and names,
Built deterministic chunks and tidy index frames,
Now place-holders rest easy while indices grow.

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 48.72%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check (❓ Inconclusive): The PR title '1.2.1 rc' is vague and generic; it uses only a version label without describing the primary feature or improvement. Resolution: consider a more descriptive title that captures the main objective, such as 'Add PR issue history tracking and resolution detection' or 'Enhance AI analysis with issue history and atomic rate limiting'.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.



@rostilos (Owner, Author) commented:

/codecrow analyze

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java`:
- Around line 125-133: Replace the non-atomic isLocked() check in
GitHubPullRequestWebhookHandler with an actual lock acquisition via
analysisLockService.acquireLockWithWait(...) before calling
postPlaceholderComment so only one webhook can post the placeholder; if
acquireLockWithWait returns empty, log and return WebhookResult.ignored as
before, otherwise enter a try/finally where you call postPlaceholderComment
(store placeholderCommentId) and proceed with analysis, and always call
analysisLockService.releaseLock(lockKey) in finally to ensure the lock is
released.
♻️ Duplicate comments (1)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java (1)

112-119: Same TOCTOU lock-check gap as in GitHub handler.

See the earlier note: use an atomic lock acquisition rather than isLocked alone.

Repository owner deleted a comment from codecrow-ai bot Jan 23, 2026
@rostilos (Owner, Author) commented:

/codecrow analyze

fix: Implement atomic upsert for command rate limiting to prevent rac…
@rostilos (Owner, Author) commented:

/codecrow analyze

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/CommentCommandRateLimitService.java (1)

68-70: Fix race condition: OffsetDateTime.now() called twice.

The windowStart calculation calls OffsetDateTime.now() twice on different lines. Between calls, time may advance across a minute or hour boundary, resulting in an inconsistent window calculation.

For example, if the first now() returns 10:59:59.999 and the second returns 11:00:00.001:

  • truncatedTo(HOURS) yields 10:00:00
  • getMinute() yields 0
  • Result is 10:00:00 when it should be 11:00:00
🐛 Proposed fix: capture now() once
-        OffsetDateTime windowStart = OffsetDateTime.now()
-            .truncatedTo(ChronoUnit.HOURS)
-            .plus((OffsetDateTime.now().getMinute() / windowMinutes) * windowMinutes, ChronoUnit.MINUTES);
+        OffsetDateTime now = OffsetDateTime.now();
+        OffsetDateTime windowStart = now
+            .truncatedTo(ChronoUnit.HOURS)
+            .plus((now.getMinute() / windowMinutes) * windowMinutes, ChronoUnit.MINUTES);
🤖 Fix all issues with AI agents
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/analysis/CommentCommandRateLimitRepository.java`:
- Around line 39-51: The upsertCommandCount method in
CommentCommandRateLimitRepository issues an INSERT ... ON CONFLICT against table
comment_command_rate_limit using columns (project_id, window_start) and columns
window_start/command_count which do not exist in the current V15 migration; fix
by either updating the entity/repository to match the V15 schema (use pr_id,
vcs_username, command_name, executed_at/execution_date and change the ON
CONFLICT target and inserted/updated columns accordingly) or by adding a new
migration that creates/alter the comment_command_rate_limit table with
project_id, window_start, command_count and a unique constraint on (project_id,
window_start) so the upsertCommandCount method will succeed.

Comment on lines +39 to +51
@Modifying
@Query(value = """
        INSERT INTO comment_command_rate_limit (project_id, window_start, command_count, last_command_at)
        VALUES (:projectId, :windowStart, 1, NOW())
        ON CONFLICT (project_id, window_start)
        DO UPDATE SET
            command_count = comment_command_rate_limit.command_count + 1,
            last_command_at = NOW()
        """, nativeQuery = true)
void upsertCommandCount(
        @Param("projectId") Long projectId,
        @Param("windowStart") OffsetDateTime windowStart
);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the entity class definition
find . -type f -name "*.java" | xargs grep -l "class CommentCommandRateLimit" | head -5

Repository: rostilos/CodeCrow

Length of output: 431


🏁 Script executed:

#!/bin/bash
# Search for migration files related to comment_command_rate_limit table
find . -type f \( -name "*.sql" -o -name "*.xml" -o -name "*.yml" -o -name "*.yaml" \) | xargs grep -l "comment_command_rate_limit" 2>/dev/null | head -10

Repository: rostilos/CodeCrow

Length of output: 273


🏁 Script executed:

#!/bin/bash
# Search for unique constraint definitions related to project_id and window_start
rg -n "unique.*project.*window|@UniqueConstraint.*project|ON CONFLICT.*project_id.*window_start" --type java --type sql 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 1469


🏁 Script executed:

#!/bin/bash
# Find the directory structure of the repository to understand schema/migration setup
ls -la | head -20
find . -type d -name "*migration*" -o -type d -name "*schema*" -o -type d -name "*db*" | head -10

Repository: rostilos/CodeCrow

Length of output: 973


🏁 Script executed:

#!/bin/bash
# Read the entity class to see the constraint definition
cat ./java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/analysis/CommentCommandRateLimit.java

Repository: rostilos/CodeCrow

Length of output: 2432


🏁 Script executed:

#!/bin/bash
# Read the migration file that creates the table
cat ./java-ecosystem/libs/core/src/main/resources/db/migration/old/V15__create_comment_command_rate_limit_table.sql

Repository: rostilos/CodeCrow

Length of output: 4633


🏁 Script executed:

#!/bin/bash
# Read the migration file that adds indexes
cat ./java-ecosystem/libs/core/src/main/resources/db/migration/old/V17__add_comment_commands_indexes.sql

Repository: rostilos/CodeCrow

Length of output: 3548


🏁 Script executed:

#!/bin/bash
# List all migration files to see the complete migration history
find ./java-ecosystem/libs/core/src/main/resources/db/migration -type f -name "*.sql" | sort

Repository: rostilos/CodeCrow

Length of output: 4729


🏁 Script executed:

#!/bin/bash
# Search for any migration that mentions window_start or the correct schema
rg -l "window_start" ./java-ecosystem/libs/core/src/main/resources/db/migration/ 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Search for any migration that creates or alters comment_command_rate_limit table
rg -A 30 "CREATE TABLE.*comment_command_rate_limit|ALTER TABLE.*comment_command_rate_limit" ./java-ecosystem/libs/core/src/main/resources/db/migration/ 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 4502


🏁 Script executed:

#!/bin/bash
# Search for any ALTER TABLE that modifies comment_command_rate_limit after V15
rg -B 2 -A 15 "ALTER TABLE.*comment_command_rate_limit" ./java-ecosystem/libs/core/src/main/resources/db/migration/ 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check all migrations V16 and onward for any schema changes to comment_command_rate_limit
for file in ./java-ecosystem/libs/core/src/main/resources/db/migration/old/V{16,17,18,19,20,21,22}*.sql; do
  if [ -f "$file" ]; then
    echo "=== $file ==="
    grep -i "comment_command_rate_limit\|window_start\|command_count" "$file" || echo "(no match)"
  fi
done

Repository: rostilos/CodeCrow

Length of output: 1042


🏁 Script executed:

#!/bin/bash
# Search entire codebase for references to the entity to understand if it's actually used
rg "CommentCommandRateLimit" --type java | head -20

Repository: rostilos/CodeCrow

Length of output: 4465


The repository method upsertCommandCount targets a schema that doesn't exist in the database and will fail at runtime.

The entity class defines window_start and command_count columns with a unique constraint on (project_id, window_start), but the actual database migration (V15) creates the table with entirely different columns: pr_id, vcs_username, command_name, executed_at, execution_date. The unique constraint referenced in the ON CONFLICT clause does not exist in the database schema. This mismatch between the entity definition and the database migration will cause the upsert operation to fail when executed.

Either the entity schema must be aligned with the V15 migration, or a new migration must be created to restructure the table to match the entity definition.

🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/analysis/CommentCommandRateLimitRepository.java`
around lines 39 - 51, The upsertCommandCount method in
CommentCommandRateLimitRepository issues an INSERT ... ON CONFLICT against table
comment_command_rate_limit using columns (project_id, window_start) and columns
window_start/command_count which do not exist in the current V15 migration; fix
by either updating the entity/repository to match the V15 schema (use pr_id,
vcs_username, command_name, executed_at/execution_date and change the ON
CONFLICT target and inserted/updated columns accordingly) or by adding a new
migration that creates/alter the comment_command_rate_limit table with
project_id, window_start, command_count and a unique constraint on (project_id,
window_start) so the upsertCommandCount method will succeed.

@rostilos (Owner, Author) commented:

/codecrow analyze

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java`:
- Around line 113-131: After acquireLock(...) returns earlyLock in
BitbucketCloudPullRequestWebhookHandler, wrap the post-lock work (placeholder
posting and invoking PullRequestAnalysisProcessor.process(...)) in a try/finally
so that if an exception prevents calling processor.process(...) you call
analysisLockService.releaseLock(earlyLock.get()) in the finally; ensure you do
NOT release the lock when processor.process(...) was successfully invoked (since
PullRequestAnalysisProcessor.process() will release it), so track a boolean like
processorInvoked to decide whether to release the earlyLock in the finally.
♻️ Duplicate comments (1)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java (1)

184-195: Same lock leak concern as Bitbucket handler.

The catch block doesn't release the acquired lock if an exception occurs after acquireLock() succeeds but before process() completes. See the review comment on BitbucketCloudPullRequestWebhookHandler for the detailed analysis and suggested fix.

🧹 Nitpick comments (1)
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py (1)

538-546: Improve exception handling for better debugging.

The migration logic is correct, but the exception handling can be improved:

  1. Use logger.exception instead of logger.error to include the traceback automatically.
  2. Chain the exception with raise ... from del_err to preserve the original cause.
♻️ Suggested improvement
             if is_direct_collection:
                 logger.info(f"Migrating from direct collection to alias-based indexing. Deleting old collection: {alias_name}")
                 try:
                     self.qdrant_client.delete_collection(alias_name)
                 except Exception as del_err:
-                    logger.error(f"Failed to delete old direct collection before alias swap: {del_err}")
-                    raise Exception(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted: {del_err}")
+                    logger.exception(f"Failed to delete old direct collection before alias swap: {del_err}")
+                    raise RuntimeError(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted") from del_err

Comment on lines 113 to 131
// Try to acquire lock atomically BEFORE posting placeholder
// This prevents TOCTOU race where multiple webhooks could pass isLocked() check simultaneously
// Note: PullRequestAnalysisProcessor.process() uses acquireLockWithWait() which will
// reuse this lock since it's for the same project/branch/type
String sourceBranch = payload.sourceBranch();
Optional<String> earlyLock = analysisLockService.acquireLock(
        project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
        payload.commitHash(), Long.parseLong(payload.pullRequestId()));

if (earlyLock.isEmpty()) {
    log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook",
            project.getId(), sourceBranch, payload.pullRequestId());
    return WebhookResult.ignored("PR analysis already in progress for this branch");
}

// Lock acquired - placeholder posting is now protected from race conditions
// Note: We don't release this lock here - PullRequestAnalysisProcessor will manage it
// since acquireLockWithWait() will detect the existing lock and use it


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the PullRequestAnalysisProcessor file
find . -name "PullRequestAnalysisProcessor.java" -type f

Repository: rostilos/CodeCrow

Length of output: 202


🏁 Script executed:

# Find the AnalysisLockService file
find . -name "AnalysisLockService.java" -type f

Repository: rostilos/CodeCrow

Length of output: 182


🏁 Script executed:

# Check if there are multiple implementations of AnalysisLockService
find . -name "*LockService*.java" -type f

Repository: rostilos/CodeCrow

Length of output: 308


🏁 Script executed:

# Check size and read PullRequestAnalysisProcessor
wc -l ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java

Repository: rostilos/CodeCrow

Length of output: 206


🏁 Script executed:

# Check size and read AnalysisLockService
wc -l ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AnalysisLockService.java

Repository: rostilos/CodeCrow

Length of output: 186


🏁 Script executed:

cat -n ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java

Repository: rostilos/CodeCrow

Length of output: 17492


🏁 Script executed:

cat -n ./java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AnalysisLockService.java

Repository: rostilos/CodeCrow

Length of output: 17352


Lock leak potential exists between early lock acquisition and processor handoff, but TTL mitigates indefinite lock leaks.

The concern about exception handling is partially valid. If an exception occurs after acquireLock() succeeds at line 120 but before processor.process() is called, the early lock would not be released by the webhook handler's catch block. However, the processor's process() method has a finally block (line 210-211 in PullRequestAnalysisProcessor) that reliably releases the lock when called, and the AnalysisLockService automatically expires locks after 30 minutes (configurable) with periodic cleanup every 5 minutes.

To eliminate the temporary lock leak window entirely, consider wrapping the post-lock operations in a try-finally block that releases the early lock only if the processor is never invoked:

Proposed mitigation
if (earlyLock.isEmpty()) {
    log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook", 
            project.getId(), sourceBranch, payload.pullRequestId());
    return WebhookResult.ignored("PR analysis already in progress for this branch");
}

+try {
    // Lock acquired - placeholder posting is now protected from race conditions
    // Processor will manage release via its finally block
    // ... rest of method ...
    return WebhookResult.success(...);
+} catch (Exception e) {
+    analysisLockService.releaseLock(earlyLock.get());
+    // ... existing error handling ...
+    return WebhookResult.error(...);
+}
🤖 Prompt for AI Agents
In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/webhookhandler/BitbucketCloudPullRequestWebhookHandler.java`
around lines 113 - 131, After acquireLock(...) returns earlyLock in
BitbucketCloudPullRequestWebhookHandler, wrap the post-lock work (placeholder
posting and invoking PullRequestAnalysisProcessor.process(...)) in a try/finally
so that if an exception prevents calling processor.process(...) you call
analysisLockService.releaseLock(earlyLock.get()) in the finally; ensure you do
NOT release the lock when processor.process(...) was successfully invoked (since
PullRequestAnalysisProcessor.process() will release it), so track a boolean like
processorInvoked to decide whether to release the earlyLock in the finally.

rostilos and others added 2 commits January 23, 2026 14:35
fix: Enhance alias management by implementing backup and migration strategy for direct collections
@rostilos (Owner, Author) commented:

/codecrow analyze

rostilos and others added 2 commits January 23, 2026 14:59
fix: Enhance AI analysis by incorporating full PR issue history and resolution tracking
@rostilos (Owner, Author) commented:

/codecrow analyze

@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/service/GitHubAiClientService.java (1)

81-92: Backward-compatibility regression for callers using the 3‑arg overload.
The previousAnalysis context is dropped because you pass Collections.emptyList() and the PR builder no longer uses withPreviousAnalysisData(...). If any caller still uses the 3‑arg overload, they lose historical issues.

✅ Suggested fix to preserve prior behavior
-            default:
-                return buildPrAnalysisRequest(project, (PrProcessRequest) request, previousAnalysis, Collections.emptyList());
+            default:
+                List<CodeAnalysis> history = previousAnalysis
+                        .map(List::of)
+                        .orElseGet(Collections::emptyList);
+                return buildPrAnalysisRequest(project, (PrProcessRequest) request, previousAnalysis, history);
java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiRequestPreviousIssueDTO.java (1)

18-52: resolvedInPrVersion is populated with analysis ID — likely wrong.

The field is documented as "PR version where this was resolved," but it's assigned issue.getResolvedAnalysisId(), which returns an analysis ID (a Long), not a PR version. The CodeAnalysisIssue entity has no direct accessor for the resolved analysis' PR version. Either rename the field/comment to resolvedInAnalysisId to match the assigned value, or fetch the resolved analysis by ID to extract the actual prVersion.
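
If the second option is taken, the lookup could be sketched as below; the repository type, the getPrVersion() accessor, and the Integer return type are assumptions, and the lookup would need to live somewhere with repository access rather than in the DTO's static fromEntity.

// Hypothetical helper (not part of this PR): derive the actual PR version from the
// resolved analysis ID, assuming a Spring Data CodeAnalysisRepository is available
// and CodeAnalysis exposes getPrVersion().
private Integer resolveResolvedInPrVersion(CodeAnalysisIssue issue,
                                           CodeAnalysisRepository codeAnalysisRepository) {
    Long resolvedAnalysisId = issue.getResolvedAnalysisId();
    if (resolvedAnalysisId == null) {
        return null;
    }
    return codeAnalysisRepository.findById(resolvedAnalysisId)
            .map(CodeAnalysis::getPrVersion)
            .orElse(null);
}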

🤖 Fix all issues with AI agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java`:
- Around line 281-302: The Javadoc for withAllPrAnalysesData promises
deduplication but the method currently just flattens all issues; update
withAllPrAnalysesData in AiAnalysisRequestImpl to deduplicate by a composite
identity (file path + line number + reason + severity + category) while
preserving newest-first order: when streaming allPrAnalyses -> flatMap ->
map(AiRequestPreviousIssueDTO::fromEntity), collect into a LinkedHashMap keyed
by the composite key and keep the first occurrence (newest) for each key, then
set previousCodeAnalysisIssues to the map.values() list; ensure you reference
the existing AiRequestPreviousIssueDTO.fromEntity conversion and the
previousCodeAnalysisIssues field and return self() as before.

In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java`:
- Around line 382-387: Remove the dead 4-argument overload
createIssueFromData(Map<String, Object> issueData, String issueKey, String
vcsAuthorId, String vcsAuthorUsername) since no callers use it; update
CodeAnalysisService by deleting that private method entirely (all references
should remain pointing to the existing 7-argument createIssueFromData(...,
ResolutionContext, String, String) method) and run a build to ensure no usages
remain.
🧹 Nitpick comments (6)
python-ecosystem/mcp-client/utils/response_parser.py (1)

84-91: Mutation of input dict may cause side effects.

The normalization at line 86 mutates the original issue dictionary by adding the id key. While this works for the current use case (since _clean_issue is called on items that are subsequently discarded), this could lead to unexpected behavior if callers retain references to the original dict.

Consider operating on a copy to avoid side effects:

♻️ Suggested defensive fix
         cleaned = {}
         
         # Normalize issueId to id (AI may use either)
-        if 'issueId' in issue and 'id' not in issue:
-            issue['id'] = issue['issueId']
+        id_value = issue.get('id') or issue.get('issueId')
         
         for key, value in issue.items():
             # Skip issueId since we normalized it to id above
             if key == 'issueId':
                 continue
-                
+            
+            # Use normalized id value
+            if key == 'id':
+                value = id_value
+            
             # Skip fields not in valid set
java-ecosystem/libs/analysis-engine/src/test/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiRequestPreviousIssueDTOTest.java (1)

207-224: Missing mock for getPrNumber() may cause test fragility.

The test shouldHandleNullSeverity mocks analysis.getBranchName() and analysis.getPrVersion() but does not mock analysis.getPrNumber(). If fromEntity() accesses getPrNumber(), Mockito will return 0L (default for Long primitives via unboxing) rather than null, which may not reflect the intended test scenario or could mask potential NPE issues in production code.

Consider adding the mock for consistency with other tests:

when(analysis.getPrNumber()).thenReturn(null);
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java (2)

123-124: Remove duplicate variable declaration.

aiConnection and projectAiConnection both reference the same object from project.getAiBinding().getAiConnection(). This redundancy can cause confusion and maintenance burden.

♻️ Proposed fix
         VcsInfo vcsInfo = getVcsInfo(project);
         VcsConnection vcsConnection = vcsInfo.vcsConnection();
         AIConnection aiConnection = project.getAiBinding().getAiConnection();
-        AIConnection projectAiConnection = project.getAiBinding().getAiConnection();

Then update line 232 to use aiConnection:

-                .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(projectAiConnection.getApiKeyEncrypted()))
+                .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))

295-296: Same duplicate variable declaration issue.

Apply the same fix here — remove projectAiConnection and use aiConnection consistently.

♻️ Proposed fix
         VcsInfo vcsInfo = getVcsInfo(project);
         VcsConnection vcsConnection = vcsInfo.vcsConnection();
         AIConnection aiConnection = project.getAiBinding().getAiConnection();
-        AIConnection projectAiConnection = project.getAiBinding().getAiConnection();

Then update line 303:

-                .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(projectAiConnection.getApiKeyEncrypted()))
+                .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))
java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java (2)

268-288: N+1 query pattern in issue reconciliation.

For each issue in the AI response, issueRepository.findById(originalId) executes a separate database query. For large analyses with many issues referencing previous issues, this could cause performance degradation.

Consider batch-fetching all original issue IDs upfront:

♻️ Suggested approach

Before the issue processing loop, collect all original IDs and fetch them in one query:

// Extract all original issue IDs from the issues list/map
Set<Long> originalIds = extractOriginalIssueIds(issuesObj);
Map<Long, CodeAnalysisIssue> originalIssuesMap = originalIds.isEmpty() 
    ? Collections.emptyMap()
    : issueRepository.findAllById(originalIds).stream()
        .collect(Collectors.toMap(CodeAnalysisIssue::getId, Function.identity()));

// Then pass originalIssuesMap to createIssueFromData and look up by ID

This would reduce N database calls to 1 batch query.


353-362: Clarify resolution tracking intent in comment.

The resolution tracking logic sets resolvedDescription from the current issue's reason field. This is semantically correct (AI provides the resolution explanation), but the variable naming and comment could be clearer.

Consider enhancing the comment:

// AI provides resolution explanation in 'reason' field for resolved issues
issue.setResolvedDescription(reason);

Comment on lines +281 to +302
/**
 * Set previous issues from ALL PR analysis versions.
 * This provides the LLM with complete issue history including resolved issues,
 * helping it understand what was already found and fixed.
 *
 * Issues are deduplicated by keeping only the most recent version of each issue.
 * Resolved issues are included so the LLM knows what was already addressed.
 *
 * @param allPrAnalyses List of all analyses for this PR, ordered by version DESC (newest first)
 */
public T withAllPrAnalysesData(List<CodeAnalysis> allPrAnalyses) {
    if (allPrAnalyses == null || allPrAnalyses.isEmpty()) {
        return self();
    }

    this.previousCodeAnalysisIssues = allPrAnalyses.stream()
            .flatMap(analysis -> analysis.getIssues().stream())
            .map(AiRequestPreviousIssueDTO::fromEntity)
            .toList();

    return self();
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find CodeAnalysisIssue class definition
find . -name "*.java" -type f | xargs grep -l "class CodeAnalysisIssue" | head -5

Repository: rostilos/CodeCrow

Length of output: 284


🏁 Script executed:

#!/bin/bash
# Search for CodeAnalysis class to understand the structure
find . -name "*.java" -type f | xargs grep -l "class CodeAnalysis\b" | head -5

Repository: rostilos/CodeCrow

Length of output: 165


🏁 Script executed:

#!/bin/bash
# Look for any deduplication logic in the codebase
rg "LinkedHashMap|toMap.*Function.identity" --type java -C2 | head -30

Repository: rostilos/CodeCrow

Length of output: 4362


🏁 Script executed:

#!/bin/bash
# Find usages of withAllPrAnalysesData to understand how it's called
rg "withAllPrAnalysesData" --type java -C3

Repository: rostilos/CodeCrow

Length of output: 5696


🏁 Script executed:

#!/bin/bash
# Read CodeAnalysisIssue class to find stable identifier fields
cat java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/codeanalysis/CodeAnalysisIssue.java

Repository: rostilos/CodeCrow

Length of output: 5060


🏁 Script executed:

#!/bin/bash
# Read CodeAnalysis class to understand the structure
cat java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/codeanalysis/CodeAnalysis.java

Repository: rostilos/CodeCrow

Length of output: 5970


🏁 Script executed:

#!/bin/bash
# Find AiRequestPreviousIssueDTO class
find . -name "*.java" -type f | xargs grep -l "class AiRequestPreviousIssueDTO"

Repository: rostilos/CodeCrow

Length of output: 199


🏁 Script executed:

#!/bin/bash
# Check how the Javadoc says "deduplicated by keeping only the most recent version"
# Let's verify the exact context in the file
sed -n '281,302p' java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java

Repository: rostilos/CodeCrow

Length of output: 1080


🏁 Script executed:

#!/bin/bash
# Find and read AiRequestPreviousIssueDTO class
find . -name "AiRequestPreviousIssueDTO.java" -type f -exec cat {} \;

Repository: rostilos/CodeCrow

Length of output: 2395


🏁 Script executed:

#!/bin/bash
# Check if there are any other dedup patterns or discussions in related code
rg "deduplicate|dedup|unique.*issue|fingerprint" --type java -C2 | head -50

Repository: rostilos/CodeCrow

Length of output: 971


🏁 Script executed:

#!/bin/bash
# Check test file to understand expected behavior
cat java-ecosystem/libs/analysis-engine/src/test/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiRequestPreviousIssueDTOTest.java

Repository: rostilos/CodeCrow

Length of output: 12371


Deduplication promised in Javadoc is not implemented.

The method flattens all issues from all PR analysis versions without deduplicating them. The Javadoc states "Issues are deduplicated by keeping only the most recent version of each issue," but the implementation contains no dedup logic. This creates duplicate entries in the stream when the same logical issue (identified by file, line, reason, severity, category) appears across multiple PR versions, inflating the data sent to the LLM and contradicting the documented behavior.

Use a composite key based on issue identity fields (file path + line number + reason + severity + category) to deduplicate while preserving the newest-first order:

Suggested implementation pattern
-            this.previousCodeAnalysisIssues = allPrAnalyses.stream()
-                    .flatMap(analysis -> analysis.getIssues().stream())
-                    .map(AiRequestPreviousIssueDTO::fromEntity)
-                    .toList();
+            this.previousCodeAnalysisIssues = allPrAnalyses.stream()
+                    .flatMap(analysis -> analysis.getIssues().stream())
+                    .map(AiRequestPreviousIssueDTO::fromEntity)
+                    // Keep first occurrence (newest) based on issue identity
+                    .collect(java.util.stream.Collectors.toMap(
+                            dto -> dto.file() + "|" + dto.line() + "|" + dto.reason() + "|" + dto.severity() + "|" + dto.category(),
+                            java.util.function.Function.identity(),
+                            (first, ignored) -> first,  // Keep first (newest) when key collision
+                            java.util.LinkedHashMap::new  // Preserve order
+                    ))
+                    .values()
+                    .stream()
+                    .toList();
🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java`
around lines 281 - 302, The Javadoc for withAllPrAnalysesData promises
deduplication but the method currently just flattens all issues; update
withAllPrAnalysesData in AiAnalysisRequestImpl to deduplicate by a composite
identity (file path + line number + reason + severity + category) while
preserving newest-first order: when streaming allPrAnalyses -> flatMap ->
map(AiRequestPreviousIssueDTO::fromEntity), collect into a LinkedHashMap keyed
by the composite key and keep the first occurrence (newest) for each key, then
set previousCodeAnalysisIssues to the map.values() list; ensure you reference
the existing AiRequestPreviousIssueDTO.fromEntity conversion and the
previousCodeAnalysisIssues field and return self() as before.

Comment on lines +382 to +387
/**
 * Overload for backward compatibility with callers that don't have resolution context
 */
private CodeAnalysisIssue createIssueFromData(Map<String, Object> issueData, String issueKey, String vcsAuthorId, String vcsAuthorUsername) {
    return createIssueFromData(issueData, issueKey, vcsAuthorId, vcsAuthorUsername, null, null, null);
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for all calls to createIssueFromData method
# Look for both 4-argument (backward compatible) and 7-argument versions
rg "createIssueFromData\s*\(" -A 3 java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java | head -100

Repository: rostilos/CodeCrow

Length of output: 1123


🏁 Script executed:

#!/bin/bash
# Also search for calls to this method across the entire repository
rg "createIssueFromData\(" --type java

Repository: rostilos/CodeCrow

Length of output: 1001


Remove the unused backward-compatible overload.

This 4-argument overload (lines 382-387) has no callers in the codebase. All calls to createIssueFromData use the 7-argument version. Remove this dead code.

🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java`
around lines 382 - 387, Remove the dead 4-argument overload
createIssueFromData(Map<String, Object> issueData, String issueKey, String
vcsAuthorId, String vcsAuthorUsername) since no callers use it; update
CodeAnalysisService by deleting that private method entirely (all references
should remain pointing to the existing 7-argument createIssueFromData(...,
ResolutionContext, String, String) method) and run a build to ensure no usages
remain.

rostilos and others added 2 commits January 23, 2026 15:27
fix: Update issue reconciliation logic to handle previous issues in both incremental and full modes
@rostilos (Owner, Author) commented:

/codecrow analyze

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python-ecosystem/mcp-client/service/multi_stage_orchestrator.py (1)

198-205: Prevent resolved issues from being re-added when reconciliation runs in FULL mode.

Because Stage 1.5 now always runs when previousCodeAnalysisIssues exist, _reconcile_previous_issues will append any prior issue not re-reported—even ones already resolved—effectively reopening them. Skip previously resolved issues before adding “persisting” entries.

🛠️ Proposed fix
@@
         for prev_issue in request.previousCodeAnalysisIssues:
             # Convert to dict if needed
             if hasattr(prev_issue, 'model_dump'):
                 prev_data = prev_issue.model_dump()
             else:
                 prev_data = prev_issue if isinstance(prev_issue, dict) else vars(prev_issue)
+
+            # Skip issues that were already resolved in history
+            status = prev_data.get('status')
+            if status is None and prev_data.get('isResolved') is not None:
+                status = 'resolved' if prev_data.get('isResolved') else 'open'
+            if str(status).lower() == 'resolved':
+                continue

rostilos and others added 2 commits January 23, 2026 17:59
fix: Improve handling of issue resolution status and logging for bett…
rostilos and others added 4 commits January 23, 2026 19:33
- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed token limitation from AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of token limitation.
- Enhanced Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded.
- Updated command processors to utilize project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
fix: Enhance logging and implement deterministic context retrieval in RAG pipeline
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python-ecosystem/mcp-client/service/multi_stage_orchestrator.py`:
- Around line 920-921: The logger.debug call that reads logger.debug(f"Reached
chunk limit of 15, stopping") uses an unnecessary f-string; replace it with a
normal string literal instead (e.g., logger.debug("Reached chunk limit of 15,
stopping")) in the same scope where included_count is checked (the if
included_count >= 15 block) so static analysis warnings are resolved while
preserving the log message.

In `@python-ecosystem/mcp-client/service/rag_client.py`:
- Around line 249-286: The fallback return shapes for get_deterministic_context
are inconsistent and must match the expected schema (keys: "chunks",
"changed_files", "related_definitions", "class_context", "namespace_context");
update the disabled branch and both except handlers in the method that calls
self._get_client and posts to f"{self.base_url}/query/deterministic" so they all
return {"context": {"chunks": [], "changed_files": {}, "related_definitions":
{}, "class_context": {}, "namespace_context": {}}}; keep the existing logging
and error handling but normalize the returned context shape to avoid KeyErrors
in callers of query_service.get_deterministic_context.
🧹 Nitpick comments (7)
python-ecosystem/rag-pipeline/src/rag_pipeline/api/api.py (2)

70-85: Minor docstring inconsistency: says "TWO-STEP" but describes three steps.

The docstring states "TWO-STEP process" but then lists steps 1, 2, and 3. Consider updating to "THREE-STEP" for accuracy.

 class DeterministicContextRequest(BaseModel):
     """Request for deterministic metadata-based context retrieval.
     
-    TWO-STEP process leveraging tree-sitter metadata:
+    THREE-STEP process leveraging tree-sitter metadata:
     1. Get chunks for the changed file_paths
     2. Extract semantic_names/imports/extends from those chunks
     3. Find related definitions using extracted identifiers

455-467: Improve exception handling for better debugging.

Per static analysis hints, consider:

  1. Using logging.exception to include the traceback automatically
  2. Using raise ... from e for proper exception chaining
♻️ Suggested improvement
     try:
         context = query_service.get_deterministic_context(
             workspace=request.workspace,
             project=request.project,
             branches=request.branches,
             file_paths=request.file_paths,
             limit_per_file=request.limit_per_file or 10
         )
-        
         return {"context": context}
     except Exception as e:
-        logger.error(f"Error getting deterministic context: {e}")
-        raise HTTPException(status_code=500, detail=str(e))
+        logger.exception(f"Error getting deterministic context: {e}")
+        raise HTTPException(status_code=500, detail=str(e)) from e
python-ecosystem/mcp-client/service/rag_client.py (1)

228-237: Docstring mentions "Two-step" but describes three steps.

Same inconsistency as in the API model - consider updating for consistency.

python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py (2)

693-696: Unused loop variable q_type.

The static analysis correctly identifies that q_type is unpacked but never used. Rename to _q_type to indicate it's intentionally unused.

         logger.info(f"Generated {len(queries)} queries for PR context")
-        for i, (q_text, q_weight, q_top_k, q_type) in enumerate(queries):
+        for i, (q_text, q_weight, q_top_k, _q_type) in enumerate(queries):
             logger.info(f"  Query {i+1}: weight={q_weight}, top_k={q_top_k}, text='{q_text[:80]}...'")

387-400: Potential performance concern: fallback fetches up to 1000 points per file.

When exact path matching fails, the code scrolls up to 1000 points and filters client-side. For large collections with many changed files, this could cause significant latency and memory pressure.

Consider:

  1. Adding a warning log when falling back to partial match
  2. Reducing the limit or adding pagination
  3. Using a more targeted filter if Qdrant supports partial string matching
                 # Fallback: partial match if exact fails
                 if not results:
+                    logger.debug(f"Exact path match failed for '{normalized_path}', trying partial match")
                     all_results, _ = self.qdrant_client.scroll(
                         collection_name=collection_name,
                         scroll_filter=Filter(must=[branch_filter]),
-                        limit=1000,
+                        limit=500,  # Reduced limit to mitigate performance impact
                         with_payload=True,
                         with_vectors=False
                     )
java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java (2)

427-431: Unused variable branchIndexExists.

The variable is computed and logged but never used in the control flow. Based on the summary, this appears intentional (the early-return was removed to always check for updates), but the unused variable is confusing. Either remove it entirely or add a comment explaining why it's logged but not used for control flow.

♻️ Suggested fix: Remove unused variable or use inline logging
-        // Check if branch already has indexed data (RagBranchIndex exists)
-        // Note: We still proceed with diff check to ensure any new changes are indexed
-        boolean branchIndexExists = isBranchIndexReady(project, targetBranch);
-        log.info("Branch index status for project={}, branch={}: exists={}", 
-                project.getId(), targetBranch, branchIndexExists);
+        // Note: We always proceed with diff check to ensure any new changes are indexed,
+        // regardless of whether a branch index already exists
+        log.info("Branch index status for project={}, branch={}: exists={}", 
+                project.getId(), targetBranch, isBranchIndexReady(project, targetBranch));

810-812: Inconsistent substring bounds checking on commit hashes.

Line 811 uses Math.min(7, indexedCommit.length()) defensively, but line 812 uses substring(0, 7) directly. While git commit hashes are always 40 characters, the inconsistency could cause issues if this code is ever used with abbreviated hashes.

♻️ Suggested fix: Use consistent bounds checking
-        log.info("Incremental diff for branch '{}' ({}..{}): bytes={}", 
-                targetBranch, indexedCommit.substring(0, Math.min(7, indexedCommit.length())), 
-                currentCommit.substring(0, 7), rawDiff != null ? rawDiff.length() : 0);
+        log.info("Incremental diff for branch '{}' ({}..{}): bytes={}", 
+                targetBranch, indexedCommit.substring(0, Math.min(7, indexedCommit.length())), 
+                currentCommit.substring(0, Math.min(7, currentCommit.length())), rawDiff != null ? rawDiff.length() : 0);

Comment on lines 920 to 921
if included_count >= 15: # Increased from 10 for more context
logger.debug(f"Reached chunk limit of 15, stopping")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string without placeholders.

Static analysis correctly identifies this f-string has no placeholders.

🔧 Suggested fix
-                logger.debug(f"Reached chunk limit of 15, stopping")
+                logger.debug("Reached chunk limit of 15, stopping")
🧰 Tools
🪛 Ruff (0.14.13)

921-921: f-string without any placeholders

Remove extraneous f prefix

(F541)


Comment on lines +249 to +286
if not self.enabled:
logger.debug("RAG disabled, returning empty deterministic context")
return {"context": {"chunks": [], "changed_files": {}, "related_definitions": {}}}

start_time = datetime.now()

try:
payload = {
"workspace": workspace,
"project": project,
"branches": branches,
"file_paths": file_paths,
"limit_per_file": limit_per_file
}

client = await self._get_client()
response = await client.post(
f"{self.base_url}/query/deterministic",
json=payload
)
response.raise_for_status()
result = response.json()

# Log timing and stats
elapsed_ms = (datetime.now() - start_time).total_seconds() * 1000
context = result.get("context", {})
chunk_count = len(context.get("chunks", []))
logger.info(f"Deterministic RAG query completed in {elapsed_ms:.2f}ms, "
f"retrieved {chunk_count} chunks for {len(file_paths)} files")

return result

except httpx.HTTPError as e:
logger.warning(f"Failed to retrieve deterministic context: {e}")
return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
except Exception as e:
logger.error(f"Unexpected error in deterministic RAG query: {e}")
return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}

⚠️ Potential issue | 🟠 Major

Inconsistent fallback return structures will cause key errors in callers.

The method returns different structures in different code paths:

  • Line 251 (disabled): {"chunks": [], "changed_files": {}, "related_definitions": {}}
  • Lines 283/286 (errors): {"chunks": [], "by_identifier": {}, "by_file": {}}

Based on the query_service.get_deterministic_context implementation, the correct keys are changed_files, related_definitions, class_context, and namespace_context. Callers expecting these keys will fail when receiving the error fallback.

🐛 Proposed fix
         except httpx.HTTPError as e:
             logger.warning(f"Failed to retrieve deterministic context: {e}")
-            return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
+            return {"context": {"chunks": [], "changed_files": {}, "related_definitions": {}}}
         except Exception as e:
             logger.error(f"Unexpected error in deterministic RAG query: {e}")
-            return {"context": {"chunks": [], "by_identifier": {}, "by_file": {}}}
+            return {"context": {"chunks": [], "changed_files": {}, "related_definitions": {}}}
🧰 Tools
🪛 Ruff (0.14.13)

279-279: Consider moving this statement to an else block

(TRY300)


284-284: Do not catch blind exception: Exception

(BLE001)


285-285: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


rostilos and others added 22 commits January 26, 2026 20:59
- Added AST-based code splitter using Tree-sitter for accurate code parsing.
- Introduced TreeSitterParser for dynamic language loading and caching.
- Created scoring configuration for RAG query result reranking with configurable boost factors and priority patterns.
- Refactored RAGQueryService to utilize the new scoring configuration for enhanced result ranking.
- Improved metadata extraction and handling for better context in scoring.
@codecrow-local codecrow-local bot deleted a comment from codecrow-ai bot Jan 29, 2026
@codecrow-local

codecrow-local bot commented Jan 29, 2026

⚠️ Code Analysis Results

Summary

Pull Request Review: 1.2.1 rc

Status PASS WITH WARNINGS
Risk Level HIGH
Review Coverage 97 files analyzed in depth
Confidence HIGH

Executive Summary

This release candidate introduces significant updates to AI connection management, RAG pipeline orchestration, and VCS webhook handling across the Java and Python ecosystems. The changes focus on improving code analysis depth and indexing efficiency, but they introduce complexities in distributed locking and cross-service error handling. While the architectural direction is sound, there are notable risks regarding data integrity in the vector database and potential "silent failures" during large diff processing.

Recommendation

Decision: PASS WITH WARNINGS

The PR is architecturally solid but requires immediate attention to inconsistent exception handling for large diffs and potential lock leakage in the webhook processing flow. These should be addressed prior to final production deployment to ensure system stability and user feedback transparency.


Strategic Review Areas

1. Large Diff & Token Management

The introduction of TokenEstimator is a critical improvement; however, its implementation is inconsistent across VCS providers (GitHub/GitLab/Bitbucket). Some paths lack error reporting, which may lead to PRs hanging in a pending state if limits are exceeded.

2. Distributed Locking & Concurrency

The shift to a "pre-acquired" lock pattern in the PullRequestAnalysisProcessor introduces a risk of orphaned locks. If the calling webhook handlers encounter unhandled exceptions after acquiring a lock, subsequent analyses for those branches will be blocked until manual database intervention occurs.
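For illustration, a minimal sketch of the ownership contract this implies, reusing identifiers that appear elsewhere in this review; the surrounding control flow is an assumption, not the project's actual code:

// Sketch: the webhook handler that pre-acquires the lock also owns its release
// on failure paths, because the processor skips release when isPreAcquired is true.
Optional<String> earlyLock = analysisLockService.acquireLock(
        project, sourceBranch, AnalysisLockType.PR_ANALYSIS, commitHash, pullRequestId);
if (earlyLock.isEmpty()) {
    return; // another analysis is already running for this branch
}
try {
    request.preAcquiredLockKey = earlyLock.get();
    processor.process(request); // hypothetical call; the processor does not release a pre-acquired lock
} catch (RuntimeException e) {
    analysisLockService.releaseLock(earlyLock.get()); // avoid an orphaned lock on unexpected failures
    throw e;
}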

3. RAG Pipeline Integrity

The Python-based RAG pipeline has been enhanced with language-specific Tree-Sitter queries. Reviewers should note that deterministic ID generation for vector points currently relies on metadata that, if missing, defaults to random values. This could lead to duplicated data in the vector store over time.

4. Database & Persistence

The release includes schema migrations to remove token limitations. Note that some new repositories use PostgreSQL-specific ON CONFLICT syntax, which ties the persistence layer more tightly to the current database provider. Additionally, the CodeAnalysisService involves a double-save pattern that requires strict transactional wrapping to avoid "ghost" records on failure.
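As a rough illustration of the atomic upsert shape mentioned above (table and column names are assumptions, not taken from the actual repository):

// Sketch of a PostgreSQL-specific atomic upsert in a Spring Data repository;
// table/column names are illustrative only.
@Modifying
@Query(value = """
        INSERT INTO comment_command_rate_limit (project_id, command_type, window_start, command_count)
        VALUES (:projectId, :commandType, :windowStart, 1)
        ON CONFLICT (project_id, command_type, window_start)
        DO UPDATE SET command_count = comment_command_rate_limit.command_count + 1
        """, nativeQuery = true)
int upsertCommandCount(@Param("projectId") Long projectId,
                       @Param("commandType") String commandType,
                       @Param("windowStart") Instant windowStart);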

Issues Overview

Severity Count
🔴 High 2 Critical issues requiring immediate attention
🟡 Medium 12 Issues that should be addressed
🔵 Low 9 Minor issues and improvements
ℹ️ Info 2 Informational notes and suggestions

Analysis completed on 2026-01-29 22:01:09 | View Full Report | Pull Request


📋 Detailed Issues (25)

🔴 High Severity Issues

Id on Platform: 1615

Category: 🐛 Bug Risk

File: .../core/index_manager.py

Issue: The code deletes the original collection (self.qdrant_client.delete_collection(alias_name)) BEFORE creating the new alias. If update_collection_aliases fails immediately after deletion (e.g., network timeout, service restart, or validation error), the system is left in a state where the primary alias is gone. While data exists in the temp collection, the production alias is broken.

💡 Suggested Fix

In Qdrant, the most reliable way to handle this without downtime or risk is to use a 'Rename' sequence if possible, or at minimum, wrap the deletion and alias creation in a more robust way. Since Qdrant doesn't support atomic rename of a collection to an alias, you should ensure the temp collection is fully ready and perhaps perform the deletion within the same logical block as the alias creation, or use a 'try-finally' with a more aggressive recovery.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
@@ -569,4 +569,7 @@
-                    logger.warning(f"Deleting old direct collection: {alias_name} (temp collection {temp_collection_name} verified with {temp_collection_info.points_count} points)")
-                    
-                    try:
-                        self.qdrant_client.delete_collection(alias_name)
+                    # Use a safer sequence: attempt to create alias with a different name first to verify permissions
+                    # then perform the delete and immediate alias creation.
+                    logger.warning(f"Deleting old direct collection: {alias_name}")
+                    self.qdrant_client.delete_collection(alias_name)

View Issue Details


Id on Platform: 1618

Category: 🐛 Bug Risk

File: .../splitter/metadata.py

Issue: The multi-line signature detection logic for Python uses next((i for i, l in enumerate(lines) if l.strip() == line), -1) to find the index of the current line. If multiple lines in the first 15 lines of the chunk have identical content (e.g., two identical decorators or comments), this logic might find the wrong index, leading to incorrect signature reconstruction or processing the same lines repeatedly.

💡 Suggested Fix

Use the loop index directly instead of recalculating it with a search to avoid ambiguity with duplicate lines.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -118,15 +118,13 @@
-        for line in lines[:15]:
+        for idx, line in enumerate(lines[:15]):
             line = line.strip()
             
             if language == 'python':
                 if line.startswith(('def ', 'async def ', 'class ')):
                     sig = line
                     if line.startswith('class ') and ':' in line:
                         return line.split(':')[0] + ':'
                     if ')' not in sig and ':' not in sig:
-                        idx = next((i for i, l in enumerate(lines) if l.strip() == line), -1)
-                        if idx >= 0:
-                            for next_line in lines[idx+1:idx+5]:
-                                sig += ' ' + next_line.strip()
-                                if ')' in next_line:
-                                    break
+                        for next_line in lines[idx+1:min(idx+5, len(lines))]:
+                            sig += ' ' + next_line.strip()
+                            if ')' in next_line:
+                                break

View Issue Details


🟡 Medium Severity Issues

Id on Platform: 1602

Category: ✨ Best Practices

File: .../service/CodeAnalysisService.java:114

Issue: Saving the entity twice (at line 114 and line 178) within the same transactional flow to obtain an ID is a common pattern, but it can lead to unnecessary database roundtrips. More importantly, if the subsequent processing fails, a 'ghost' analysis record with no issues might remain in the database depending on transaction management outside this method.

💡 Suggested Fix

Ensure this method is wrapped in a transaction so that if processing issues fails after the first save, the entire operation rolls back. Since this is likely inside a service, consider if the ID can be used without the intermediate save or if a flush is sufficient.

No suggested fix provided
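For illustration only, a minimal sketch of the transactional wrapping described above (method, repository, and entity names are hypothetical):

// Sketch: both saves participate in one transaction, so a failure while
// processing issues rolls back the analysis record as well (no "ghost" rows).
@Transactional
public CodeAnalysis persistAnalysisWithIssues(CodeAnalysis analysis, List<CodeAnalysisIssue> issues) {
    CodeAnalysis saved = codeAnalysisRepository.save(analysis); // first save only to obtain the ID
    issues.forEach(issue -> issue.setAnalysis(saved));
    issueRepository.saveAll(issues);                            // second write happens in the same transaction
    return saved;
}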

View Issue Details


Id on Platform: 1603

Category: 🐛 Bug Risk

File: .../service/ProjectService.java:378

Issue: The method findByIdWithConnections is used to fetch the project. If this method is implemented as a Join Fetch in the repository, it may return null or an empty Optional if the 'aiBinding' (or relevant connection) is missing, depending on whether an INNER JOIN or LEFT JOIN is used. If it's an INNER JOIN, you won't be able to bind a connection to a project that doesn't have one yet.

💡 Suggested Fix

Ensure the repository method uses a LEFT JOIN FETCH for the connection binding to allow fetching projects that do not yet have an AI connection bound.

--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/project/ProjectRepository.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/persistence/repository/project/ProjectRepository.java
@@ -1,1 +1,3 @@
-@Query("SELECT p FROM Project p JOIN FETCH p.aiBinding WHERE p.id = :id")
+@Query("SELECT p FROM Project p LEFT JOIN FETCH p.aiBinding WHERE p.id = :id")
 Optional<Project> findByIdWithConnections(@Param("id") Long id);

View Issue Details


Id on Platform: 1605

Category: 🐛 Bug Risk

File: .../index_manager/indexer.py:84

Issue: The 'random' module is imported inside the 'else' block of the size estimation method. This can lead to repeated import overhead in a loop or failure if the environment restricts local imports.

💡 Suggested Fix

Move the random import to the top of the file.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -8,2 +8,3 @@
 import logging
+import random
 import time

View Issue Details


Id on Platform: 1606

Category: ✨ Best Practices

File: .../analysis/PullRequestAnalysisProcessor.java:231

Issue: The logic only releases the lock if '!isPreAcquired'. While this prevents the processor from releasing a lock it didn't create, it assumes the caller will always handle cleanup. If the caller (e.g., a webhook handler) expects the processor to take ownership of the lifecycle once passed, the lock will leak. Typically, pre-acquired locks used in a 'try-finally' block should have a clearly defined owner to prevent orphaned locks in the database.

💡 Suggested Fix

Ensure there is a clear contract on lock ownership. If the processor should not release it, add a comment explaining where the release happens, or ensure the 'isPreAcquired' flag is correctly managed by the caller's transaction scope. Given the existing code, ensure the caller has a finally block to release the lock since this processor now skips it.

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
@@ -228,6 +228,7 @@
             
             return Map.of("status", "error", "message", e.getMessage());
         } finally {
+            // If pre-acquired, the caller (e.g. WebhookHandler) is responsible for release
             if (!isPreAcquired) {
                 analysisLockService.releaseLock(lockKey);
             }

View Issue Details


Id on Platform: 1607

Category: 🛡️ Error Handling

File: .../index_manager/branch_manager.py:136

Issue: In preserve_other_branch_points, if an exception occurs during the scroll loop after some points have already been collected, the function catches the exception and returns an incomplete list. This can lead to silent data loss during a reindex because copy_points_to_collection will only write the partially collected points to the new collection.

💡 Suggested Fix

Reraise the exception or return None/raise a custom error to signal that the preservation failed, rather than returning a partial list of points.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py
@@ -134,2 +134,2 @@
         except Exception as e:
-            logger.warning(f"Could not read existing points: {e}")
-            return []
+            logger.error(f"Critical failure during point preservation: {e}")
+            raise

View Issue Details


Id on Platform: 1609

Category: ✨ Best Practices

File: .../processor/WebhookAsyncProcessor.java

Issue: While the self-injection pattern (using @Lazy) works to solve the proxy bypass, it is generally considered a smell or a workaround in Spring. A cleaner architectural approach is to move the @Transactional logic into a separate service (e.g., WebhookProcessingService). This avoids circular dependencies and adheres better to the Single Responsibility Principle.

💡 Suggested Fix

Extract the contents of 'processWebhookInTransaction' and its associated private helper methods into a new service class, then inject that service into WebhookAsyncProcessor.

No suggested fix provided
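A rough sketch of the suggested extraction (class, payload, and method names are hypothetical except processWebhookInTransaction, which is named above):

// Sketch: the transactional work lives in its own Spring bean, so the proxy
// boundary is crossed naturally and no @Lazy self-injection is needed.
@Service
public class WebhookProcessingService {

    @Transactional
    public void processWebhookInTransaction(WebhookPayload payload) {
        // ...former private helpers from WebhookAsyncProcessor move here...
    }
}

@Component
public class WebhookAsyncProcessor {

    private final WebhookProcessingService processingService; // plain constructor injection

    public WebhookAsyncProcessor(WebhookProcessingService processingService) {
        this.processingService = processingService;
    }

    @Async
    public void handleWebhook(WebhookPayload payload) {
        processingService.processWebhookInTransaction(payload);
    }
}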

View Issue Details


Id on Platform: 1614

Category: ⚡ Performance

File: .../ai/AiAnalysisRequestImpl.java

Issue: The method implementation does not perform deduplication as claimed in its Javadoc ('Issues are deduplicated by keeping only the most recent version of each issue'). If multiple analyses for a PR are passed, the same issues (which often persist across PR versions) will be added multiple times to the list, potentially exceeding token limits or confusing the AI model.

💡 Suggested Fix

Implement deduplication by grouping issues by a unique key (like filePath, line, and reason) or simply by issue ID if IDs are stable across versions, keeping only the most recent version.

--- a/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
+++ b/java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/ai/AiAnalysisRequestImpl.java
@@ -293,3 +293,4 @@
-            this.previousCodeAnalysisIssues = allPrAnalyses.stream()
-                    .flatMap(analysis -> analysis.getIssues().stream())
-                    .map(AiRequestPreviousIssueDTO::fromEntity)
-                    .toList();
+            this.previousCodeAnalysisIssues = allPrAnalyses.stream() // assumes list is already sorted newest-first
+                    .flatMap(analysis -> analysis.getIssues().stream())
+                    .map(AiRequestPreviousIssueDTO::fromEntity)
+                    .collect(java.util.stream.Collectors.toMap(AiRequestPreviousIssueDTO::id, i -> i, (existing, replacement) -> existing))
+                    .values().stream().toList();

View Issue Details


Id on Platform: 1616

Category: 🐛 Bug Risk

File: .../splitter/query_runner.py

Issue: The logic to identify the pattern name for custom queries checks for the absence of a dot: 'elif "." not in capture_name: pattern_name = capture_name'. However, if a custom query only uses dotted captures (e.g., @function.definition) without a base capture (e.g., @function), the 'pattern_name' will remain None and the match will be skipped at line 262.

💡 Suggested Fix

Ensure that if a dotted capture is found and pattern_name is still None, the base part of the capture is used as the pattern name.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -257,2 +257,5 @@
                 elif "." not in capture_name:
                     pattern_name = capture_name
+                elif pattern_name is None:
+                    pattern_name = capture_name.split(".")[0]
+                    main_node = node

View Issue Details


Id on Platform: 1617

Category: 🐛 Bug Risk

File: .../splitter/tree_parser.py

Issue: The code assumes that the language loading function in the external module is always named 'language' or the value provided in the mapping. However, for some tree-sitter modules (like tree-sitter-typescript), the module might contain multiple languages (e.g., 'language_typescript' and 'language_tsx'). While the mapping in languages.py handles this, the generic getattr check might fail silently or return None if the function name provided in the mapping is incorrect or the module structure changes, which is common in tree-sitter transitions.

💡 Suggested Fix

Add more robust validation and error logging for the language function retrieval to ensure it is actually callable before execution.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
@@ -73,7 +73,7 @@
             lang_module = importlib.import_module(module_name)
             
             lang_func = getattr(lang_module, func_name, None)
-            if not lang_func:
+            if not lang_func or not callable(lang_func):
                 logger.debug(f"Module {module_name} has no callable {func_name} function")
                 return None

View Issue Details


Id on Platform: 1621

Category: 🐛 Bug Risk

File: .../utils/utils.py

Issue: The heuristic for detecting archive roots ('-' in first_part and len(first_part) > 20) might incorrectly flag and strip legitimate top-level project directories that follow naming conventions like 'my-very-long-project-name'. This would cause the first directory of the actual project structure to be discarded.

💡 Suggested Fix

Tighten the archive detection logic. Instead of a general hyphen check, consider checking if the part matches a hexadecimal pattern (common for commit hashes) or allow the user to provide a known root prefix. Alternatively, use a more specific regex for known VCS archive patterns (e.g., Bitbucket's {project}-{repo}-{commit}).

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py
@@ -111,5 +111,5 @@
-    looks_like_archive = (
-        '-' in first_part and len(first_part) > 20 or  # owner-repo-commit
-        len(first_part) >= 40 or  # Just commit hash
-        (first_part.count('-') >= 2 and any(c.isdigit() for c in first_part))  # Has digits and multiple hyphens
-    )
+    # Heuristic for archive roots like 'owner-repo-7a8b9c0'
+    import re
+    is_hex_suffix = bool(re.search(r'-[0-9a-f]{7,40}$', first_part))
+    is_hash = bool(re.fullmatch(r'[0-9a-f]{40}', first_part))
+    looks_like_archive = is_hash or is_hex_suffix

View Issue Details


Id on Platform: 1624

Category: 🛡️ Error Handling

File: .../service/GitHubAiClientService.java

Issue: When a DiffTooLargeException is thrown, there is no visible mechanism in the diff to ensure the user receives feedback on the PR (e.g., a comment explaining why the analysis failed). If not caught and handled by posting a comment, the PR might stay in a 'pending' state indefinitely.

💡 Suggested Fix

Ensure that the calling code or a global exception handler catches DiffTooLargeException and uses VcsReportingService to post an informative comment to the PR.

No suggested fix provided
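For illustration, one possible shape of that handling; the reporting call mirrors the VcsReportingService usage shown later in this conversation, while the surrounding method is an assumption:

// Sketch: surface the oversized-diff condition on the PR instead of leaving it pending.
try {
    return buildAiAnalysisRequest(project, pullRequestId, rawDiff);
} catch (DiffTooLargeException e) {
    vcsReportingService.postComment(project, pullRequestId,
            "⚠️ Analysis skipped: the diff exceeds the configured token limit for this project.",
            CommentPlaceholders.CODECROW_COMMAND_MARKER);
    throw e; // or convert to a handled "skipped" result, depending on the caller's contract
}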

View Issue Details


Id on Platform: 1625

Category: ✨ Best Practices

File: .../service/GitLabAiClientService.java:315

Issue: Token estimation and limit enforcement were added to 'buildMrAnalysisRequest' but are missing from 'buildBranchAnalysisRequest'. Large branch diffs may still be sent to the AI, potentially causing timeouts or API errors depending on the provider's limits.

💡 Suggested Fix

Implement the same TokenEstimator check in buildBranchAnalysisRequest before building the AiAnalysisRequest.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java
@@ -312,6 +312,12 @@
-                .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))
-                .withUseLocalMcp(true)
-                .withPreviousAnalysisData(previousAnalysis)
-                .withMaxAllowedTokens(project.getEffectiveConfig().maxAnalysisTokenLimit())
+            int maxTokenLimit = project.getEffectiveConfig().maxAnalysisTokenLimit();
+            TokenEstimator.TokenEstimationResult tokenEstimate = TokenEstimator.estimateAndCheck(rawDiff, maxTokenLimit);
+            if (tokenEstimate.exceedsLimit()) {
+                throw new DiffTooLargeException(tokenEstimate.estimatedTokens(), maxTokenLimit, project.getId(), "branch:" + request.getTargetBranchName());
+            }
+
+            return AiAnalysisRequest.builder()
+                .withProjectAiConnectionTokenDecrypted(tokenEncryptionService.decrypt(aiConnection.getApiKeyEncrypted()))
+                .withUseLocalMcp(true)
+                .withPreviousAnalysisData(previousAnalysis)
+                .withMaxAllowedTokens(maxTokenLimit)

View Issue Details


🔵 Low Severity Issues

Id on Platform: 1604

Category: ✨ Best Practices

File: .../service/multi_stage_orchestrator.py:458

Issue: logger.info is called, but the 'logger' variable is not defined or imported within the visible diff context; other parts of the file use 'self.logger' or 'logging'. At line 458, 'logger' is used directly.

💡 Suggested Fix

Ensure 'logger' is correctly imported or use 'logging.getLogger(__name__)' at the module level.

--- a/python-ecosystem/mcp-client/service/multi_stage_orchestrator.py
+++ b/python-ecosystem/mcp-client/service/multi_stage_orchestrator.py
@@ -24,2 +24,3 @@
 import asyncio
+import logging
 
+logger = logging.getLogger(__name__)
+

View Issue Details


Id on Platform: 1608

Category: ✨ Best Practices

File: .../index_manager/point_operations.py

Issue: If 'path' is missing from metadata, it defaults to a random UUID. This breaks the 'deterministic ID' guarantee promised by generate_point_id, as every run will generate different IDs for the same chunk if the path isn't explicitly provided, leading to duplicate points in the vector database.

💡 Suggested Fix

Require path to be present in chunk metadata or use a more stable fallback than a random UUID to ensure idempotency.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py
@@ -52,5 +52,6 @@
         for chunk in chunks:
-            path = chunk.metadata.get("path", str(uuid.uuid4()))
+            path = chunk.metadata.get("path")
+            if not path:
+                logger.warning("Chunk missing path metadata; using 'unknown'")
+                path = "unknown"
             if path not in chunks_by_file:

View Issue Details


Id on Platform: 1610

Category: 🛡️ Error Handling

File: .../webhookhandler/GitHubPullRequestWebhookHandler.java:133

Issue: The use of Long.parseLong(payload.pullRequestId()) occurs multiple times. If pullRequestId is not a valid numeric string, this will throw an unhandled NumberFormatException, bypassing the specialized catch blocks and potentially leaving a lock dangling if it occurs after acquisition (though here it's inside the try block).

💡 Suggested Fix

Parse the PR ID once into a variable before the try block or ensure the payload is validated before this method is called.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/webhookhandler/GitHubPullRequestWebhookHandler.java
@@ -125,12 +125,13 @@
         String placeholderCommentId = null;
         
         try {
+            long prNum = Long.parseLong(payload.pullRequestId());
             // Try to acquire lock atomically BEFORE posting placeholder
             // This prevents race condition where multiple webhooks could post duplicate placeholders
             String sourceBranch = payload.sourceBranch();
             Optional<String> earlyLock = analysisLockService.acquireLock(
                     project, sourceBranch, AnalysisLockType.PR_ANALYSIS,
-                    payload.commitHash(), Long.parseLong(payload.pullRequestId()));
+                    payload.commitHash(), prNum);
             
             if (earlyLock.isEmpty()) {
                 log.info("PR analysis already in progress for project={}, branch={}, PR={} - skipping duplicate webhook", 
@@ -139,7 +140,7 @@
             // Lock acquired - placeholder posting is now protected from race conditions
             
             // Post placeholder comment immediately to show analysis has started
-            placeholderCommentId = postPlaceholderComment(project, Long.parseLong(payload.pullRequestId()));
+            placeholderCommentId = postPlaceholderComment(project, prNum);

View Issue Details


Id on Platform: 1611

Category: ✨ Best Practices

File: .../gitlab/GitLabClient.java:519

Issue: Reading the entire response body into a string using response.body().string() can lead to OutOfMemoryErrors if the diff is extremely large. While diffs are usually manageable, very large PRs or binary changes could cause issues.

💡 Suggested Fix

Use response.body().byteStream() with the ObjectMapper to parse the response incrementally if possible, or check Content-Length before reading.

--- a/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
+++ b/java-ecosystem/libs/vcs-client/src/main/java/org/rostilos/codecrow/vcsclient/gitlab/GitLabClient.java
@@ -516,2 +516,2 @@
-            JsonNode root = objectMapper.readTree(response.body().string());
+            JsonNode root = objectMapper.readTree(response.body().byteStream());

View Issue Details


Id on Platform: 1613

Category: ⚡ Performance

File: .../service/WebhookDeduplicationService.java:72

Issue: The 'cleanupOldEntries' method is called on every single webhook check and performs a full scan of the map. As the number of unique project/commit combinations increases, this O(N) operation inside the critical path of 'isDuplicateCommitAnalysis' will cause latency spikes.

💡 Suggested Fix

Perform cleanup periodically (e.g., via @Scheduled or every N requests) or use a cache with native expiration like Caffeine or Guava.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
@@ -34,4 +34,5 @@
     private final Map<String, Instant> recentCommitAnalyses = new ConcurrentHashMap<>();
+    private final java.util.concurrent.atomic.AtomicInteger requestCounter = new java.util.concurrent.atomic.AtomicInteger(0);
 
@@ -69,3 +70,5 @@
-        // Cleanup old entries
-        cleanupOldEntries(now);
+        // Cleanup old entries every 50 requests to avoid O(N) scan on every call
+        if (requestCounter.incrementAndGet() % 50 == 0) {
+            cleanupOldEntries(now);
+        }

View Issue Details


Id on Platform: 1619

Category: ✨ Best Practices

File: .../utils/response_parser.py:130

Issue: The 'import logging' statement is placed inside the method body, which is inefficient because it executes every time a resolved issue is processed.

💡 Suggested Fix

Move the import to the top of the file to follow PEP 8 and improve performance.

--- a/python-ecosystem/mcp-client/utils/response_parser.py
+++ b/python-ecosystem/mcp-client/utils/response_parser.py
@@ -1,3 +1,4 @@
+import logging
 from typing import Any, Dict, List, Optional
 
@@ -126,5 +127,4 @@
                 if value:
                     issue_id = issue.get('id') or issue.get('issueId', 'unknown')
-                    import logging
                     logging.getLogger(__name__).info(f"Issue {issue_id} marked as isResolved=True (original: {original_value})")

View Issue Details


Id on Platform: 1620

Category: ✨ Best Practices

File: .../models/scoring_config.py:100

Issue: The 'get_priority' method returns ('MEDIUM', 1.0) as a fallback if no patterns match. This is inconsistent with the explicit HIGH/MEDIUM/LOW categories and might lead to confusing results if a file is 'MEDIUM' because it matched a pattern vs 'MEDIUM' because it was a fallback.

💡 Suggested Fix

Introduce a 'DEFAULT' priority level or ensure the fallback boost factor is distinct to allow for clearer debugging and fine-tuning.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
@@ -98,1 +98,1 @@
-            return ('MEDIUM', 1.0)
+            return ('DEFAULT', 1.0)

View Issue Details


Id on Platform: 1622

Category: ✨ Best Practices

File: .../analysis/CommentCommandRateLimitRepository.java

Issue: The native query uses PostgreSQL-specific syntax ('ON CONFLICT'). While efficient, it reduces the portability of the persistence layer if the application ever moves away from PostgreSQL. Given the project currently uses PostgreSQL specific features in the entity indexes (as seen in the CommentCommandRateLimit.java context), this is acceptable but should be noted.

💡 Suggested Fix

No immediate change required if PostgreSQL is the hard requirement, otherwise consider using a standard JPA approach with locking, though that would be less performant than this atomic upsert.

No suggested fix provided

View Issue Details


Id on Platform: 1623

Category: ✨ Best Practices

File: .../service/BitbucketAiClientService.java:126

Issue: The value 'project.getAiBinding().getAiConnection()' is assigned to two variables ('aiConnection' and 'projectAiConnection'). While harmless, the second assignment is redundant.

💡 Suggested Fix

Remove the redundant variable assignment for projectAiConnection.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java
@@ -127,5 +127,4 @@
         VcsConnection vcsConnection = vcsInfo.vcsConnection();
         AIConnection aiConnection = project.getAiBinding().getAiConnection();
-        AIConnection projectAiConnection = project.getAiBinding().getAiConnection();
         
         // CRITICAL: Log the AI connection being used for debugging

View Issue Details


ℹ️ Informational Notes

Id on Platform: 1612

Category: ✨ Best Practices

File: .../util/TokenEstimator.java:35

Issue: The fallback estimation uses a fixed ratio of 4 characters per token. While standard for English, this can be significantly inaccurate for other languages or code-heavy content.

💡 Suggested Fix

Consider using a slightly more conservative fallback or documenting the language assumption, though for a fallback this is generally acceptable.

No suggested fix provided

View Issue Details


Id on Platform: 1626

Category: ✨ Best Practices

File: .../tests/test_rag.py:63

Issue: The chunk_size has been increased 10x (from 800 to 8000). While the comment suggests this is to fit semantic units, such a large jump significantly exceeds the default 'SentenceSplitter' and 'RecursiveCharacterTextSplitter' configurations seen in 'src/rag_pipeline/core/chunking.py' and 'src/rag_pipeline/core/semantic_splitter.py' which still reference 800-1000. This might lead to context window issues or inefficient retrieval if the underlying models have smaller limits.

💡 Suggested Fix

Verify if 8000 is intended or a typo (perhaps 800 was intended as the base), and ensure the corresponding implementation classes in 'chunking.py' have their defaults updated to match this test expectation to avoid test failures in other environments.

No suggested fix provided

View Issue Details


Files Affected

  • .../service/GitHubAiClientService.java: 1 issue
  • .../splitter/metadata.py: 1 issue
  • .../analysis/PullRequestAnalysisProcessor.java: 1 issue
  • .../core/index_manager.py: 1 issue
  • .../splitter/tree_parser.py: 1 issue
  • .../service/WebhookDeduplicationService.java: 1 issue
  • .../gitlab/GitLabClient.java: 1 issue
  • .../webhookhandler/GitHubPullRequestWebhookHandler.java: 1 issue
  • .../analysis/CommentCommandRateLimitRepository.java: 1 issue
  • .../service/ProjectService.java: 1 issue

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/webhookhandler/GitLabMergeRequestWebhookHandler.java (1)

128-160: Pass the pre-acquired lock key downstream to prevent lock leaks.

acquireLock() returns a key that is currently ignored. Unlike the GitHub and Bitbucket handlers which set request.preAcquiredLockKey = earlyLock.get(), the GitLab handler does not propagate the lock. This causes PullRequestAnalysisProcessor to call acquireLockWithWait() again instead of reusing the webhook's lock, leaving the original lock unreleased.

🔧 Proposed fix
 PrProcessRequest request = new PrProcessRequest();
+request.preAcquiredLockKey = earlyLock.get();
 request.projectId = project.getId();
 request.pullRequestId = Long.parseLong(payload.pullRequestId());
 request.sourceBranchName = payload.sourceBranch();
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/WebhookAsyncProcessor.java (1)

432-511: Preserve per-command markers when posting error/info.
Using CODECROW_COMMAND_MARKER overwrites summarize/review markers, so later runs won’t clean up prior error/info comments.

✅ Suggested fix
         VcsReportingService reportingService = vcsServiceFactory.getReportingService(provider);
 
+        String commandType = payload.getCodecrowCommand() != null
+            ? payload.getCodecrowCommand().type().name().toLowerCase()
+            : "default";
+        String marker = CommentPlaceholders.getMarkerForCommandType(commandType);
+
         // Sanitize error message for user display - hide sensitive technical details
         String sanitizedMessage = sanitizeErrorForVcs(errorMessage);
@@
                 reportingService.updateComment(
                     project,
                     Long.parseLong(payload.pullRequestId()),
                     placeholderCommentId,
                     content,
-                    CommentPlaceholders.CODECROW_COMMAND_MARKER
+                    marker
                 );
@@
                 reportingService.postComment(
                     project, 
                     Long.parseLong(payload.pullRequestId()), 
                     content,
-                    CommentPlaceholders.CODECROW_COMMAND_MARKER
+                    marker
                 );
@@
         VcsReportingService reportingService = vcsServiceFactory.getReportingService(provider);
 
+        String commandType = payload.getCodecrowCommand() != null
+            ? payload.getCodecrowCommand().type().name().toLowerCase()
+            : "default";
+        String marker = CommentPlaceholders.getMarkerForCommandType(commandType);
+
         // If we have a placeholder comment, update it with the info
         if (placeholderCommentId != null) {
             reportingService.updateComment(
                 project,
                 Long.parseLong(payload.pullRequestId()),
                 placeholderCommentId,
                 infoMessage,
-                CommentPlaceholders.CODECROW_COMMAND_MARKER
+                marker
             );
@@
             reportingService.postComment(
                 project, 
                 Long.parseLong(payload.pullRequestId()), 
                 infoMessage,
-                CommentPlaceholders.CODECROW_COMMAND_MARKER
+                marker
             );
🤖 Fix all issues with AI agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/exception/DiffTooLargeException.java`:
- Around line 14-18: The DiffTooLargeException constructor currently uses
String.format with %d for projectId and pullRequestId which will NPE on null
Longs due to auto-unboxing; update the format string in the
DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId,
Long pullRequestId) constructor to use %s for the nullable Long parameters and
pass them unchanged so nulls are rendered safely (leave the integer token
placeholders as %d).

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/util/TokenEstimator.java`:
- Around line 46-48: The exceedsLimit method should apply the same defensive
check as estimateAndCheck: ensure maxTokens > 0 before comparing; update
exceedsLimit(String text, int maxTokens) to return false when maxTokens <= 0 (or
use the same boolean expression pattern used in estimateAndCheck) and then
compare estimateTokens(text) > maxTokens, referencing the exceedsLimit and
estimateAndCheck methods and the estimateTokens call to keep behavior consistent
and avoid the incorrect true result for non-empty text when maxTokens <= 0.
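For example, the guarded version could look like this (a sketch of the described change, not the file's actual contents):

// Sketch: treat a non-positive maxTokens as "no limit", mirroring estimateAndCheck.
public static boolean exceedsLimit(String text, int maxTokens) {
    if (maxTokens <= 0) {
        return false;
    }
    return estimateTokens(text) > maxTokens;
}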

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py`:
- Around line 102-145: The preserve_other_branch_points function currently
swallows exceptions and returns an empty list (via the try/except around
self.client.scroll) which can cause silent data loss during reindex; also
accumulating all points in preserved_points is memory-unsafe for large
collections. Change the function to fail fast by removing the broad except or
re-raising after logging so errors surface (keep logger.warning but raise), and
refactor to stream results instead of collecting them in RAM—e.g., convert
preserve_other_branch_points into a generator or accept a callback/writer to
send each page (results from self.client.scroll) directly to the target
collection as you iterate, using self.client.scroll and next_offset to page
through until completion.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py`:
- Around line 53-65: The version suffix in create_versioned_collection uses
int(time.time()) which has only second resolution and can collide; change the
suffix to a higher-resolution or random identifier (e.g., use time.time_ns() or
uuid.uuid4()) when building versioned_name so concurrent calls produce unique
names; update the versioned_name construction and any related logging that
references versioned_name in
collection_manager.CollectionManager.create_versioned_collection to use the new
suffix.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py`:
- Around line 50-56: The code in the grouping loop uses uuid.uuid4() as a
fallback for missing chunk.metadata["path"], making the fallback
non-deterministic and breaking upsert semantics; update the logic in the loop
that builds chunks_by_file (the for chunk in chunks block that uses
chunk.metadata.get("path", ...)) to use a deterministic fallback (e.g., a
constant like "unknown" or a stable computed key) or explicitly raise an error
when "path" is missing so that TextNode objects without a stable path cannot
produce random IDs and duplicate points on re-indexing.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/python.scm`:
- Around line 1-28: The type_alias_statement query is using an incorrect field
syntax; update the (type_alias_statement) query so it matches the grammar
sequence ("type" (type ...) "=" (type)) and capture the alias identifier by
nesting (identifier) `@name` inside the first (type) node instead of using a
non-existent name: field; also add an alternative to handle generic aliases by
allowing the first (type) to be either (identifier) `@name` or (generic_type
(identifier) `@name`), and keep the whole node annotated as
`@definition.type_alias`.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py`:
- Line 263: source_bytes is computed but never used, causing a Ruff F841; either
remove this unused local and the other unused captures around lines 340-342, or
wire them into the ASTChunk so they become meaningful fields. To fix: if you
need the data, add parameters/attributes to the ASTChunk constructor/definition
(class ASTChunk) and pass source_bytes (and the other captures) into the
ASTChunk creation sites; otherwise delete the unused local assignments (e.g.,
remove source_bytes = text.encode('utf-8') and the analogous unused variables at
340-342). Ensure any tests or downstream code expect the new ASTChunk signature
if you choose to propagate the values.
🟡 Minor comments (16)
java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java-812-814 (1)

812-814: Potential StringIndexOutOfBoundsException if commit hash is shorter than 7 characters.

While commit hashes are typically 40 characters (SHA-1), malformed or truncated data could cause substring(0, 7) to throw an exception. Consider using a safe substring utility.

🛡️ Proposed fix for safe substring
-        log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}", 
-                targetBranch, indexedCommit.substring(0, 7), currentCommit.substring(0, 7), 
-                rawDiff != null ? rawDiff.length() : 0);
+        log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}", 
+                targetBranch, 
+                indexedCommit.substring(0, Math.min(7, indexedCommit.length())), 
+                currentCommit.substring(0, Math.min(7, currentCommit.length())), 
+                rawDiff != null ? rawDiff.length() : 0);

Note: The same pattern exists at lines 752-753 in ensureMainIndexUpToDate, which was not changed in this PR but has the same potential issue.

python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py-388-394 (1)

388-394: Add validation for empty branches list.

If branches is an empty list, target_branch will be None (line 389), but the branch filter at line 393 will pass an empty list to MatchAny(any=[]). This could cause unexpected behavior or Qdrant errors depending on how the client handles empty match conditions.

🛡️ Proposed fix to add early validation
         logger.info(f"Deterministic context: files={file_paths[:5]}, branches={branches}")
         
+        if not branches:
+            logger.warning("No branches provided for deterministic context")
+            return {"chunks": [], "changed_files": {}, "related_definitions": {}, 
+                    "class_context": {}, "namespace_context": {},
+                    "_metadata": {"error": "no_branches_provided"}}
+        
         def _apply_branch_priority(points: list, target: str, existing_target_paths: set) -> list:
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/c_sharp.scm-7-11 (1)

7-11: Namespace queries may miss qualified, aliased, and generic names.

In tree-sitter-c-sharp, namespace names can be identifier, qualified_name, alias_qualified_name, or generic_name. The current pattern name: (identifier) will not match namespaces with dotted names (e.g., namespace Foo.Bar.Baz), global aliases (e.g., namespace global::Foo), or generic names, causing those to have no @name capture.

Proposed fix to handle all name types
 ; Namespace declarations
 (namespace_declaration
-  name: (identifier) `@name`) `@definition.namespace`
+  name: (_) `@name`) `@definition.namespace`
 
 (file_scoped_namespace_declaration
-  name: (identifier) `@name`) `@definition.namespace`
+  name: (_) `@name`) `@definition.namespace`
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/utils/CommentPlaceholders.java-46-65 (1)

46-65: Guard against null commandType to avoid NPEs.
These helpers call toLowerCase() directly; a null input would crash placeholder handling.

✅ Suggested fix
 public static String getPlaceholderMessage(String commandType) {
-    return switch (commandType.toLowerCase()) {
+    String normalized = commandType == null ? "" : commandType.toLowerCase();
+    return switch (normalized) {
         case "analyze" -> PLACEHOLDER_ANALYZE;
         case "summarize" -> PLACEHOLDER_SUMMARIZE;
         case "review" -> PLACEHOLDER_REVIEW;
         case "ask" -> PLACEHOLDER_ASK;
         default -> PLACEHOLDER_DEFAULT;
     };
 }
@@
 public static String getMarkerForCommandType(String commandType) {
-    return switch (commandType.toLowerCase()) {
+    String normalized = commandType == null ? "" : commandType.toLowerCase();
+    return switch (normalized) {
         case "summarize" -> CODECROW_SUMMARY_MARKER;
         case "review" -> CODECROW_REVIEW_MARKER;
         default -> CODECROW_COMMAND_MARKER;
     };
 }
java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/JobService.java-226-229 (1)

226-229: Guard against null job IDs before re-fetch.

When jobRepository.findById() is called with a null ID, it throws IllegalArgumentException rather than returning Optional.empty(). While all job creation methods save immediately before returning and call sites appear to only pass persisted jobs, adding a null-ID guard would be defensive and prevent potential issues if a transient job ever slips through.

🛈 Suggested guard (apply to startJob, completeJob, cancelJob, skipJob, and failJob)
-        job = jobRepository.findById(job.getId()).orElse(job);
+        Long jobId = job.getId();
+        if (jobId != null) {
+            job = jobRepository.findById(jobId).orElse(job);
+        }

Also applies to: lines 248-251, 292-295, 306-309, and the failJob method

java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java-47-75 (1)

47-75: Race condition between duplicate check and recording.

There's a TOCTOU (time-of-check to time-of-use) race between recentCommitAnalyses.get(key) (Line 55) and recentCommitAnalyses.put(key, now) (Line 69). Two concurrent webhook events for the same commit could both pass the duplicate check if they interleave between get and put.

Use putIfAbsent or compute for atomicity:

🔒 Proposed fix using compute for atomic check-and-record
     public boolean isDuplicateCommitAnalysis(Long projectId, String commitHash, String eventType) {
         if (commitHash == null || commitHash.isBlank()) {
             return false;
         }
         
         String key = projectId + ":" + commitHash;
         Instant now = Instant.now();
         
-        Instant lastAnalysis = recentCommitAnalyses.get(key);
-        
-        if (lastAnalysis != null) {
-            long secondsSinceLastAnalysis = now.getEpochSecond() - lastAnalysis.getEpochSecond();
-            
-            if (secondsSinceLastAnalysis < DEDUP_WINDOW_SECONDS) {
-                log.info("Skipping duplicate commit analysis: project={}, commit={}, event={}, " +
-                        "lastAnalysis={}s ago (within {}s window)", 
-                        projectId, commitHash, eventType, secondsSinceLastAnalysis, DEDUP_WINDOW_SECONDS);
-                return true;
-            }
-        }
-        
-        // Record this analysis
-        recentCommitAnalyses.put(key, now);
+        // Atomically check and record to prevent race conditions
+        Instant[] previousValue = new Instant[1];
+        recentCommitAnalyses.compute(key, (k, lastAnalysis) -> {
+            if (lastAnalysis != null) {
+                long secondsSinceLastAnalysis = now.getEpochSecond() - lastAnalysis.getEpochSecond();
+                if (secondsSinceLastAnalysis < DEDUP_WINDOW_SECONDS) {
+                    previousValue[0] = lastAnalysis;
+                    return lastAnalysis; // Keep existing, signal duplicate
+                }
+            }
+            return now; // Record new analysis
+        });
+        
+        if (previousValue[0] != null) {
+            long secondsSinceLastAnalysis = now.getEpochSecond() - previousValue[0].getEpochSecond();
+            log.info("Skipping duplicate commit analysis: project={}, commit={}, event={}, " +
+                    "lastAnalysis={}s ago (within {}s window)", 
+                    projectId, commitHash, eventType, secondsSinceLastAnalysis, DEDUP_WINDOW_SECONDS);
+            return true;
+        }
         
         // Cleanup old entries
         cleanupOldEntries(now);
         
         return false;
     }
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/javascript.scm-1-42 (1)

1-42: Remove TypeScript-specific patterns from javascript.scm.

While language routing properly separates JavaScript and TypeScript query files, the javascript.scm file should not include TypeScript-specific patterns (interface_declaration, type_alias_declaration) on lines 35-42. These constructs don't exist in the JavaScript grammar and are already correctly defined in the separate typescript.scm file. This creates unnecessary duplication and obscures the intent of each query file.

Coverage of standard JavaScript patterns is good and the language routing via languages.py is sound, but clean up this file so it contains only JavaScript constructs.

python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py-103-117 (1)

103-117: Consider adding Kotlin module mapping.

LANGUAGE_TO_TREESITTER maps Language.KOTLIN to 'kotlin', but TREESITTER_MODULES doesn't include a 'kotlin' entry. This would cause Kotlin files to fail tree-sitter parsing despite being mapped.
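
A minimal sketch of the missing entry, assuming the grammar is installed as a tree_sitter_kotlin package (the module name and the exact shape of TREESITTER_MODULES are assumptions here, mirroring how the other languages appear to be wired):

import importlib

# Hypothetical entry; 'kotlin' must match what LANGUAGE_TO_TREESITTER yields for
# Language.KOTLIN, and 'tree_sitter_kotlin' is the assumed grammar package name.
TREESITTER_MODULES = {
    'python': 'tree_sitter_python',   # existing-style entry, shown for context
    'kotlin': 'tree_sitter_kotlin',   # new entry so mapped Kotlin files can actually parse
}

def load_grammar_module(lang_key: str):
    """Resolve a tree-sitter grammar module by language key, or None if unmapped."""
    module_name = TREESITTER_MODULES.get(lang_key)
    return importlib.import_module(module_name) if module_name else None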

python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py-83-99 (1)

83-99: Add strict=True to zip() to catch embedding count mismatches.

If get_text_embedding_batch returns fewer embeddings than expected (e.g., due to a partial failure), the zip will silently drop chunks. Using strict=True ensures the lengths match and raises ValueError on mismatch, preventing data loss.

🔧 Proposed fix
         # Build points with embeddings
         points = []
-        for (point_id, chunk), embedding in zip(chunk_data, embeddings):
+        for (point_id, chunk), embedding in zip(chunk_data, embeddings, strict=True):
             points.append(PointStruct(
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py-117-126 (1)

117-126: Rename the ambiguous loop variable l (e.g., to ln, since line is already used in the comparison).

The variable name l is flagged as ambiguous (easily confused with 1 or I). Rename it for clarity:

🔧 Proposed fix
                     if ')' not in sig and ':' not in sig:
-                        idx = next((i for i, l in enumerate(lines) if l.strip() == line), -1)
+                        idx = next((i for i, ln in enumerate(lines) if ln.strip() == line), -1)
                         if idx >= 0:
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py-545-604 (1)

545-604: Exception handling improvements needed in migration path.

The static analysis correctly flags several exception handling issues:

  1. Line 554: f-string has no placeholders - remove the f prefix
  2. Lines 572-574: Use logging.exception instead of logger.error to capture stack trace, and chain exceptions with raise ... from err
  3. Lines 598-601: Chain the exception with raise ... from retry_err
  4. Line 604: Use bare raise instead of raise alias_err to preserve the original traceback
🔧 Proposed fixes for exception handling
-            if is_direct_collection and "already exists" in str(alias_err).lower():
-                    logger.info(f"Alias creation failed due to existing collection. Migrating from direct collection to alias-based indexing...")
+            if is_direct_collection and "already exists" in str(alias_err).lower():
+                    logger.info("Alias creation failed due to existing collection. Migrating from direct collection to alias-based indexing...")
                     try:
                         self.qdrant_client.delete_collection(alias_name)
                     except Exception as del_err:
-                        logger.error(f"Failed to delete old direct collection: {del_err}")
-                        raise Exception(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted: {del_err}")
+                        logger.exception(f"Failed to delete old direct collection: {del_err}")
+                        raise RuntimeError(f"Cannot create alias - collection '{alias_name}' exists and cannot be deleted") from del_err
                     except Exception as retry_err:
                         # Critical failure - we've deleted the old collection but can't create alias
                         # The temp collection still exists with all data, log clear instructions
                         logger.critical(
                             f"CRITICAL: Alias creation failed after deleting old collection! "
                             f"Data is safe in '{temp_collection_name}'. "
                             f"Manual fix required: create alias '{alias_name}' -> '{temp_collection_name}'. "
                             f"Error: {retry_err}"
                         )
-                        raise Exception(
+                        raise RuntimeError(
                             f"Alias creation failed after migration. Data preserved in '{temp_collection_name}'. "
                             f"Run: qdrant alias create {alias_name} -> {temp_collection_name}"
-                        )
+                        ) from retry_err
                 else:
                     # Some other alias error, re-raise
-                    raise alias_err
+                    raise
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/languages.py-75-117 (1)

75-117: Inconsistency between AST_SUPPORTED_LANGUAGES and TREESITTER_MODULES.

AST_SUPPORTED_LANGUAGES includes languages that don't have corresponding entries in TREESITTER_MODULES:

  • SCALA, LUA, PERL, SWIFT, HASKELL, COBOL

This means is_ast_supported(path) could return True for files where tree-sitter parsing will actually fail (since get_language in TreeSitterParser won't find a module mapping).

Either add the missing module mappings to TREESITTER_MODULES, or remove unsupported languages from AST_SUPPORTED_LANGUAGES to maintain consistency.

🔧 Option 1: Remove unsupported languages from AST_SUPPORTED_LANGUAGES
 # Languages that support full AST parsing via tree-sitter
 AST_SUPPORTED_LANGUAGES: Set[Language] = {
-    Language.PYTHON, Language.JAVA, Language.KOTLIN, Language.JS, Language.TS,
-    Language.GO, Language.RUST, Language.C, Language.CPP, Language.CSHARP,
-    Language.PHP, Language.RUBY, Language.SCALA, Language.LUA, Language.PERL,
-    Language.SWIFT, Language.HASKELL, Language.COBOL
+    Language.PYTHON, Language.JAVA, Language.JS, Language.TS,
+    Language.GO, Language.RUST, Language.C, Language.CPP, Language.CSHARP,
+    Language.PHP, Language.RUBY,  # Only languages with TREESITTER_MODULES entries
 }
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm-23-26 (1)

23-26: Arrow function declarations will match both function and variable definition patterns, causing duplicates in the index.

The pattern at lines 23-26 specifically captures arrow functions assigned to variables as @definition.function. However, line 49 captures all lexical_declaration nodes as @definition.variable, which will also match the same arrow function declarations.

This causes the same AST node to be extracted twice with different semantic types. Consider making the variable declaration pattern more specific to exclude arrow functions, or refactor to prevent duplicate extraction.

This issue also exists in the JavaScript file at lines 18-21 and 32-34.
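
If adjusting the queries proves awkward, the duplication could also be handled after extraction by keeping only the most specific capture per byte span; a rough sketch, where ExtractedChunk and the specificity ranking are illustrative stand-ins rather than the project's actual types:

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ExtractedChunk:
    # Illustrative stand-in for whatever the splitter emits per tree-sitter match.
    start_byte: int
    end_byte: int
    semantic_type: str  # e.g. "definition.function" or "definition.variable"
    text: str

# Lower rank = more specific; unknown types rank last.
_SPECIFICITY = {"definition.function": 0, "definition.class": 0, "definition.variable": 1}

def dedupe_by_span(chunks: List[ExtractedChunk]) -> List[ExtractedChunk]:
    """Collapse captures covering the same byte range, preferring the more specific type."""
    best: Dict[Tuple[int, int], ExtractedChunk] = {}
    for chunk in chunks:
        key = (chunk.start_byte, chunk.end_byte)
        current = best.get(key)
        if current is None or _SPECIFICITY.get(chunk.semantic_type, 99) < _SPECIFICITY.get(current.semantic_type, 99):
            best[key] = chunk
    return list(best.values())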

python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py-874-900 (1)

874-900: Fix fragment counts when tiny sub-chunks are skipped.

total_sub is computed before the min-size filter, so headers/metadata can claim more fragments than are emitted. Prefilter first, then compute total_sub.

Proposed diff
-        sub_chunks = splitter.split_text(chunk.content)
-        
-        nodes = []
-        parent_id = generate_deterministic_id(path, chunk.content, 0)
-        total_sub = len([s for s in sub_chunks if s and s.strip()])
+        raw_chunks = splitter.split_text(chunk.content)
+        sub_chunks = [s for s in raw_chunks if s and s.strip()]
+        if len(sub_chunks) > 1:
+            sub_chunks = [s for s in sub_chunks if len(s.strip()) >= self.min_chunk_size]
+        nodes = []
+        parent_id = generate_deterministic_id(path, chunk.content, 0)
+        total_sub = len(sub_chunks)
@@
-        for i, sub_chunk in enumerate(sub_chunks):
-            if not sub_chunk or not sub_chunk.strip():
-                continue
-            if len(sub_chunk.strip()) < self.min_chunk_size and total_sub > 1:
-                continue
+        for sub_chunk in sub_chunks:
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py-133-156 (1)

133-156: Remove the unused metadata dict.
metadata is assigned but never used; it will trigger lint errors and add noise.

🧹 Minimal fix
-        metadata = {
-            "namespace": namespace,
-            "workspace": workspace,
-            "project": project,
-            "branch": branch,
-            "commit": commit,
-            "document_count": document_count,
-            "chunk_count": chunk_count,
-            "last_updated": datetime.now(timezone.utc).isoformat(),
-        }
-        
         logger.info(f"Indexed {namespace}: {document_count} docs, {chunk_count} chunks")
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py-319-399 (1)

319-399: Return branch stats for branch-scoped operations.
update_files and delete_files are branch-specific, but they currently return project stats (branch="*"). This can mislead callers expecting branch-level results.

✅ Suggested fix
-            return self.stats_manager.get_project_stats(workspace, project, collection_name)
+            return self.stats_manager.get_branch_stats(workspace, project, branch, collection_name)
...
-        return self.stats_manager.get_project_stats(workspace, project, collection_name)
+        return self.stats_manager.get_branch_stats(workspace, project, branch, collection_name)
...
-        return self.stats_manager.get_project_stats(workspace, project, collection_name)
+        return self.stats_manager.get_branch_stats(workspace, project, branch, collection_name)
🧹 Nitpick comments (24)
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/rust.scm (2)

22-23: Consider capturing type/trait info from impl_item for richer metadata.

The current pattern captures the entire impl block but doesn't extract the type being implemented or the trait (for trait implementations). This metadata could be useful for code-splitting context.

💡 Optional enhancement for richer impl metadata
 ; Implementation blocks
-(impl_item) `@definition.impl`
+; Inherent impl
+(impl_item
+  type: (_) `@impl_type`) `@definition.impl`
+
+; Trait impl
+(impl_item
+  trait: (type_identifier) `@trait_name`
+  type: (_) `@impl_type`) `@definition.impl.trait`

41-46: Consider adding union_item and extern block patterns for completeness.

The file covers most Rust constructs, but a few less common ones are missing. These are optional additions depending on your codebase requirements.

💡 Optional patterns for additional Rust constructs
 ; Macro definitions
 (macro_definition
   name: (identifier) `@name`) `@definition.macro`
 
+; Union definitions
+(union_item
+  name: (type_identifier) `@name`) `@definition.union`
+
+; Extern blocks (FFI)
+(foreign_mod_item) `@definition.extern`
+
 ; Attributes
 (attribute_item) `@attribute`
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/command/ReviewCommandProcessor.java (1)

157-160: Use log.debug instead of log.info for diagnostic logging.

The comment indicates this logging is for debugging purposes, but log.info will emit in production under normal log levels. Consider using log.debug to avoid log noise and reduce exposure of internal identifiers (connection IDs, model names) in production logs.

📝 Suggested fix
-            // CRITICAL: Log the AI connection being used for debugging
-            log.info("Building review command request for project={}, AI model={}, provider={}, aiConnectionId={}", 
-                    project.getId(), aiConnection.getAiModel(), aiConnection.getProviderKey(), aiConnection.getId());
+            log.debug("Building review command request for project={}, AI model={}, provider={}, aiConnectionId={}", 
+                    project.getId(), aiConnection.getAiModel(), aiConnection.getProviderKey(), aiConnection.getId());
python-ecosystem/mcp-client/service/multi_stage_orchestrator.py (1)

402-438: Minor: Redundant empty string check.

Line 425 (not clean_line == "") is redundant since line 418 already requires clean_line to be truthy, which excludes empty strings.

♻️ Suggested simplification
                 if (clean_line and 
                     len(clean_line) > 10 and  # Minimum meaningful length
                     not clean_line.startswith("//") and  # Skip comments
                     not clean_line.startswith("#") and
                     not clean_line.startswith("*") and
                     not clean_line == "{" and
-                    not clean_line == "}" and
-                    not clean_line == ""):
+                    not clean_line == "}"):
                     current_snippet_lines.append(clean_line)
python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py (4)

420-432: Consider adding a configurable limit for the fallback query.

The fallback query at line 424 fetches up to 1000 points to search for partial path matches. For large collections, this could impact performance. Consider making this limit configurable or applying early filtering criteria.
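
One way to keep that bound under control is a configurable cap; a small sketch assuming an environment-variable override (the variable name and the 1000-point default are assumptions, not existing settings):

import os
from typing import Optional

# Hypothetical knob; the env var name and the default are assumptions.
FALLBACK_SCROLL_LIMIT = int(os.getenv("RAG_FALLBACK_SCROLL_LIMIT", "1000"))

def fallback_scroll_limit(collection_size: Optional[int] = None) -> int:
    """Cap the fallback partial-path scan so large collections are never fully scrolled."""
    limit = FALLBACK_SCROLL_LIMIT
    if collection_size is not None:
        limit = min(limit, collection_size)
    return max(limit, 1)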


500-501: Consider logging exception type for better debugging.

The broad Exception catches are reasonable for a resilient context retrieval service, but logging the exception type alongside the message would help with debugging specific failure modes (e.g., network errors vs. validation errors).

💡 Example improvement
             except Exception as e:
-                logger.warning(f"Error querying file '{file_path}': {e}")
+                logger.warning(f"Error querying file '{file_path}': {type(e).__name__}: {e}")

Also applies to: 556-557, 606-607, 656-657


754-755: Rename unused loop variable q_type to _q_type.

The loop variable q_type is unpacked but not used within the loop body. Per Python convention, prefix unused variables with an underscore.

♻️ Proposed fix
-        for i, (q_text, q_weight, q_top_k, q_type) in enumerate(queries):
+        for i, (q_text, q_weight, q_top_k, _q_type) in enumerate(queries):
             logger.info(f"  Query {i+1}: weight={q_weight}, top_k={q_top_k}, text='{q_text[:80]}...'")

849-849: Remove unused import re.

The re module is imported but not used anywhere in this method. This appears to be a leftover from refactoring.

♻️ Proposed fix
         from collections import defaultdict
         import os
-        import re
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/c_sharp.scm (1)

50-53: Field and event declarations lack @name capture.

Unlike other declarations, field_declaration and event_declaration don't capture the @name. This is likely intentional since fields can declare multiple variables (e.g., int a, b, c;), but the inconsistency may affect downstream processing that expects a @name capture on all definitions.

If individual field/event names are needed, consider capturing the nested variable declarators:

Optional: capture individual field variable names
 ; Field declarations
-(field_declaration) `@definition.field`
+(field_declaration
+  (variable_declaration
+    (variable_declarator
+      (identifier) `@name`))) `@definition.field`
 
 ; Event declarations
-(event_declaration) `@definition.event`
+(event_declaration
+  (variable_declaration
+    (variable_declarator
+      (identifier) `@name`))) `@definition.event`
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/utils/CommentPlaceholders.java (1)

3-41: Make the utility class non-instantiable.
This class only exposes static members, so preventing instantiation is a small safety win.

♻️ Proposed refactor
-public class CommentPlaceholders {
+public final class CommentPlaceholders {
+    private CommentPlaceholders() {
+        // utility class
+    }
java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/dto/request/processor/PrProcessRequest.java (1)

39-45: Normalize blank preAcquiredLockKey to avoid accidental lock bypass.
If downstream code checks only for non-null, a blank string would look like a pre-acquired lock and skip lock acquisition without a valid key.

♻️ Suggested guard
-    public String getPreAcquiredLockKey() { return preAcquiredLockKey; }
+    public String getPreAcquiredLockKey() {
+        return (preAcquiredLockKey == null || preAcquiredLockKey.isBlank()) ? null : preAcquiredLockKey;
+    }

Also applies to: 75-75

java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/project/ProjectDTOTest.java (1)

53-78: Consider adding assertion for maxAnalysisTokenLimit.

The test adds the new field value (200000) to the constructor but doesn't assert on dto.maxAnalysisTokenLimit(). For completeness, add an assertion to verify the value.

💡 Suggested assertion
             assertThat(dto.webhooksConfigured()).isTrue();
             assertThat(dto.qualityGateId()).isEqualTo(50L);
+            assertThat(dto.maxAnalysisTokenLimit()).isEqualTo(200000);
         }
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java (1)

80-85: Cleanup runs synchronously on every call - consider async or periodic cleanup.

The cleanupOldEntries method iterates over every map entry on each non-duplicate webhook. Under high traffic with many unique commits, this could add latency. For a more scalable approach, consider:

  • Using a scheduled task for periodic cleanup
  • Only cleaning up every N calls
  • Using a size-bounded cache like Caffeine with TTL
java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java (1)

593-598: Consider adding validation for maxAnalysisTokenLimit.

The maxAnalysisTokenLimit field accepts any Integer value, including negative numbers or zero, which would be invalid for a token limit. Consider adding validation constraints.

💡 Suggested validation
 public record UpdateAnalysisSettingsRequest(
         Boolean prAnalysisEnabled,
         Boolean branchAnalysisEnabled,
         String installationMethod,
+        `@jakarta.validation.constraints.Positive`
         Integer maxAnalysisTokenLimit
 ) {}

Alternatively, validate in the service layer if project-level defaults should be applied when null/invalid.

java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java (1)

112-268: Consider extracting shared logic to reduce duplication across VCS providers.

GitLabAiClientService and GitHubAiClientService share substantial duplicate code:

  • Token estimation and limit checking
  • Incremental analysis mode determination
  • Delta diff fetching and ratio calculations
  • AiAnalysisRequest builder construction

Consider extracting common logic to a shared base class or utility to improve maintainability.

java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java (2)

125-126: Remove duplicate variable assignment.

aiConnection and projectAiConnection are assigned the same value from project.getAiBinding().getAiConnection(). Only one variable is needed.

♻️ Proposed fix
-        AIConnection aiConnection = project.getAiBinding().getAiConnection();
-        AIConnection projectAiConnection = project.getAiBinding().getAiConnection();
+        AIConnection aiConnection = project.getAiBinding().getAiConnection();

Then replace references to projectAiConnection (line 255) with aiConnection.


127-130: Consider reducing log verbosity for production.

Logging the aiConnectionId at INFO level on every request may be excessive in high-traffic scenarios. Consider using DEBUG level or making this conditional.

♻️ Proposed change
-        log.info("Building PR analysis request for project={}, AI model={}, provider={}, aiConnectionId={}", 
+        log.debug("Building PR analysis request for project={}, AI model={}, provider={}, aiConnectionId={}", 
                 project.getId(), aiConnection.getAiModel(), aiConnection.getProviderKey(), aiConnection.getId());
python-ecosystem/rag-pipeline/src/rag_pipeline/utils/utils.py (1)

107-111: Add parentheses to clarify operator precedence.

The and/or chaining on line 108 has ambiguous precedence. While Python evaluates and before or, explicit parentheses improve readability and prevent subtle bugs.

♻️ Proposed fix
     looks_like_archive = (
-        '-' in first_part and len(first_part) > 20 or  # owner-repo-commit
+        ('-' in first_part and len(first_part) > 20) or  # owner-repo-commit
         len(first_part) >= 40 or  # Just commit hash
         (first_part.count('-') >= 2 and any(c.isdigit() for c in first_part))  # Has digits and multiple hyphens
     )
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py (1)

120-129: Thread safety consideration for global singleton.

The get_parser() function isn't thread-safe. If multiple threads call it concurrently during initialization, multiple instances could be created. This is likely fine since TreeSitterParser instances are effectively stateless except for caching, but worth noting.

🔒 Thread-safe singleton (optional)
+import threading
+
 # Global singleton instance
 _parser_instance: Optional[TreeSitterParser] = None
+_parser_lock = threading.Lock()


 def get_parser() -> TreeSitterParser:
     """Get the global TreeSitterParser instance."""
     global _parser_instance
     if _parser_instance is None:
-        _parser_instance = TreeSitterParser()
+        with _parser_lock:
+            if _parser_instance is None:
+                _parser_instance = TreeSitterParser()
     return _parser_instance
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py (1)

123-125: Use logger.exception to capture stack trace on upsert failure.

logger.error loses the stack trace. logger.exception automatically includes it, which is valuable for debugging batch failures.

🔧 Proposed fix
             except Exception as e:
-                logger.error(f"Failed to upsert batch starting at {i}: {e}")
+                logger.exception(f"Failed to upsert batch starting at {i}: {e}")
                 failed += len(batch)
python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py (1)

167-211: Consider adding a minimum score floor.

The boosted score can become very low when multiple penalties compound (e.g., 0.8 * 0.95 * 0.7 = 0.532). While there's a max_score_cap, there's no floor. If a near-zero score could cause issues downstream, consider adding a min_score_floor:

         # Cap the score
         score = min(score, self.max_score_cap)
+        # Optional: enforce minimum floor
+        score = max(score, 0.01)  # or use a configurable floor
         
         return (score, priority)
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py (1)

54-71: Add ClassVar annotation to mutable class attribute.

COMMENT_PREFIX is a mutable dict at class level. While it's only read (not modified), annotating it with ClassVar makes the intent clear and satisfies the static analyzer:

🔧 Proposed fix
+from typing import Dict, Any, List, Optional, ClassVar
 ...

 class MetadataExtractor:
     ...
     
     # Comment prefixes by language
-    COMMENT_PREFIX: Dict[str, str] = {
+    COMMENT_PREFIX: ClassVar[Dict[str, str]] = {
         'python': '#',
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py (1)

219-219: Silence the unused pattern_id loop variable.

This avoids lint noise without changing behavior.

Proposed diff
-        for pattern_id, captures_dict in raw_matches:
+        for _pattern_id, captures_dict in raw_matches:
python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py (1)

102-131: Remove or use the unused alias_checker parameter.
The argument is never referenced; either drop it to simplify the API or use it to skip alias names if that was the intent.

♻️ Minimal cleanup
-    def list_all_indices(self, alias_checker) -> List[IndexStats]:
+    def list_all_indices(self) -> List[IndexStats]:

Comment on lines +14 to +18
public DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId, Long pullRequestId) {
    super(String.format(
        "PR diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%d, PR=%d)",
        estimatedTokens, maxAllowedTokens, projectId, pullRequestId
    ));

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n "java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/exception/DiffTooLargeException.java"

Repository: rostilos/CodeCrow

Length of output: 1889


🏁 Script executed:

rg -n "new DiffTooLargeException" -C 3

Repository: rostilos/CodeCrow

Length of output: 4371


🏁 Script executed:

# Test String.format behavior with null values in Java
python3 << 'EOF'
# String.format behavior in Java:
# %d expects int, so if you pass null to %d, it will cause NPE
# %s with null is safe - it converts null to "null" string
print("Java String.format behavior:")
print("- %d with null: NPE (because %d expects primitive int)")
print("- %s with null: works fine, outputs 'null'")
print("\nHowever, Long is an Object that can be null.")
print("If you pass Long value to %d, Java auto-unboxes it.")
print("If the Long is null, auto-unboxing throws NPE before format is called.")
EOF

Repository: rostilos/CodeCrow

Length of output: 356


Replace %d with %s for nullable Long IDs to avoid NPE during auto-unboxing.

When projectId or pullRequestId are null, String.format("%d", null) throws NullPointerException during auto-unboxing, masking the intended "diff exceeds token limit" error. Using %s safely handles null values.

Safer formatting
-        super(String.format(
-            "PR diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%d, PR=%d)",
-            estimatedTokens, maxAllowedTokens, projectId, pullRequestId
-        ));
+        super(String.format(
+            "Diff exceeds token limit: estimated %d tokens, max allowed %d tokens (project=%s, PR=%s)",
+            estimatedTokens, maxAllowedTokens, projectId, pullRequestId
+        ));
🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/exception/DiffTooLargeException.java`
around lines 14 - 18, The DiffTooLargeException constructor currently uses
String.format with %d for projectId and pullRequestId which will NPE on null
Longs due to auto-unboxing; update the format string in the
DiffTooLargeException(int estimatedTokens, int maxAllowedTokens, Long projectId,
Long pullRequestId) constructor to use %s for the nullable Long parameters and
pass them unchanged so nulls are rendered safely (leave the integer token
placeholders as %d).

Comment on lines +46 to +48
public static boolean exceedsLimit(String text, int maxTokens) {
    return estimateTokens(text) > maxTokens;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "maxAnalysisTokenLimit" -C 2

Repository: rostilos/CodeCrow

Length of output: 28269


🏁 Script executed:

cat -n java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/util/TokenEstimator.java

Repository: rostilos/CodeCrow

Length of output: 3548


Inconsistent handling of maxTokens <= 0 between exceedsLimit and estimateAndCheck.

The exceedsLimit method lacks the defensive check present in estimateAndCheck. While estimateAndCheck guards against division by zero (line 75), it doesn't apply the same guard to the exceedsLimit boolean logic (line 79). If maxTokens <= 0, both methods will incorrectly mark any non-empty text as exceeding the limit. Add a guard to exceedsLimit for consistency, or apply it to the exceedsLimit field in estimateAndCheck:

Suggested fix
 public static boolean exceedsLimit(String text, int maxTokens) {
-    return estimateTokens(text) > maxTokens;
+    return maxTokens > 0 && estimateTokens(text) > maxTokens;
 }
@@
 public static TokenEstimationResult estimateAndCheck(String text, int maxTokens) {
     int estimated = estimateTokens(text);
     double utilization = maxTokens > 0 ? (estimated * 100.0 / maxTokens) : 0;
     return new TokenEstimationResult(
         estimated,
         maxTokens,
-        estimated > maxTokens,
+        maxTokens > 0 && estimated > maxTokens,
         utilization
     );
 }
🤖 Prompt for AI Agents
In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/util/TokenEstimator.java`
around lines 46 - 48, The exceedsLimit method should apply the same defensive
check as estimateAndCheck: ensure maxTokens > 0 before comparing; update
exceedsLimit(String text, int maxTokens) to return false when maxTokens <= 0 (or
use the same boolean expression pattern used in estimateAndCheck) and then
compare estimateTokens(text) > maxTokens, referencing the exceedsLimit and
estimateAndCheck methods and the estimateTokens call to keep behavior consistent
and avoid the incorrect true result for non-empty text when maxTokens <= 0.

Comment on lines +102 to +145
def preserve_other_branch_points(
    self,
    collection_name: str,
    exclude_branch: str
) -> List[PointStruct]:
    """Preserve points from branches other than the one being reindexed.
    Used during full reindex to keep data from other branches.
    """
    logger.info(f"Preserving points from branches other than '{exclude_branch}'...")

    preserved_points = []
    offset = None

    try:
        while True:
            results = self.client.scroll(
                collection_name=collection_name,
                limit=100,
                offset=offset,
                scroll_filter=Filter(
                    must_not=[
                        FieldCondition(
                            key="branch",
                            match=MatchValue(value=exclude_branch)
                        )
                    ]
                ),
                with_payload=True,
                with_vectors=True
            )
            points, next_offset = results
            preserved_points.extend(points)

            if next_offset is None or len(points) < 100:
                break
            offset = next_offset

        logger.info(f"Found {len(preserved_points)} points from other branches to preserve")
        return preserved_points
    except Exception as e:
        logger.warning(f"Could not read existing points: {e}")
        return []


⚠️ Potential issue | 🟠 Major

Fail fast when preserving other branches to avoid data loss.
If reading existing points fails, returning an empty list lets the reindex proceed and can drop other branches during the swap. Also, loading all preserved points into memory scales poorly on large collections—consider streaming copy to the target collection instead of accumulating everything in RAM.

🔧 Suggested fix to avoid silent data loss
-        except Exception as e:
-            logger.warning(f"Could not read existing points: {e}")
-            return []
+        except Exception as e:
+            logger.warning(f"Could not read existing points: {e}")
+            raise
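
For the memory concern, a streaming variant could forward each scroll page straight to the target collection instead of accumulating everything in RAM; a sketch against the qdrant-client scroll/upsert API, with the function name, page size, and collection names as placeholders:

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct

def copy_other_branch_points(client: QdrantClient, source: str, target: str,
                             exclude_branch: str, page_size: int = 100) -> int:
    """Stream points from `source` (every branch except `exclude_branch`) into `target`."""
    copied = 0
    offset = None
    scroll_filter = Filter(must_not=[
        FieldCondition(key="branch", match=MatchValue(value=exclude_branch))
    ])
    while True:
        points, offset = client.scroll(
            collection_name=source,
            limit=page_size,
            offset=offset,
            scroll_filter=scroll_filter,
            with_payload=True,
            with_vectors=True,
        )
        if points:
            client.upsert(
                collection_name=target,
                points=[PointStruct(id=p.id, vector=p.vector, payload=p.payload) for p in points],
            )
            copied += len(points)
        if offset is None:
            break
    return copied
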
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 141-141: Consider moving this statement to an else block

(TRY300)


[warning] 142-142: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/branch_manager.py`
around lines 102 - 145, The preserve_other_branch_points function currently
swallows exceptions and returns an empty list (via the try/except around
self.client.scroll) which can cause silent data loss during reindex; also
accumulating all points in preserved_points is memory-unsafe for large
collections. Change the function to fail fast by removing the broad except or
re-raising after logging so errors surface (keep logger.warning but raise), and
refactor to stream results instead of collecting them in RAM—e.g., convert
preserve_other_branch_points into a generator or accept a callback/writer to
send each page (results from self.client.scroll) directly to the target
collection as you iterate, using self.client.scroll and next_offset to page
through until completion.

Comment on lines +53 to +65
def create_versioned_collection(self, base_name: str) -> str:
    """Create a new versioned collection for atomic swap indexing."""
    versioned_name = f"{base_name}_v{int(time.time())}"
    logger.info(f"Creating versioned collection: {versioned_name}")

    self.client.create_collection(
        collection_name=versioned_name,
        vectors_config=VectorParams(
            size=self.embedding_dim,
            distance=Distance.COSINE
        )
    )
    return versioned_name

⚠️ Potential issue | 🟠 Major

Versioned collection names can collide within the same second.

int(time.time()) has second resolution; concurrent indexing can generate identical names and fail collection creation. Prefer a higher‑resolution or random suffix.

Proposed diff (nanosecond timestamp)
-        versioned_name = f"{base_name}_v{int(time.time())}"
+        versioned_name = f"{base_name}_v{time.time_ns()}"
🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/collection_manager.py`
around lines 53 - 65, The version suffix in create_versioned_collection uses
int(time.time()) which has only second resolution and can collide; change the
suffix to a higher-resolution or random identifier (e.g., use time.time_ns() or
uuid.uuid4()) when building versioned_name so concurrent calls produce unique
names; update the versioned_name construction and any related logging that
references versioned_name in
collection_manager.CollectionManager.create_versioned_collection to use the new
suffix.

Comment on lines +50 to +56
# Group chunks by file path
chunks_by_file: Dict[str, List[TextNode]] = {}
for chunk in chunks:
    path = chunk.metadata.get("path", str(uuid.uuid4()))
    if path not in chunks_by_file:
        chunks_by_file[path] = []
    chunks_by_file[path].append(chunk)

⚠️ Potential issue | 🟠 Major

Non-deterministic fallback path breaks upsert semantics.

When a chunk's metadata lacks a path, using uuid.uuid4() as a fallback generates a random value on each call. This defeats the purpose of deterministic IDs, causing duplicate points instead of replacements on re-indexing.

Consider using a consistent fallback like "unknown" or raising an error:

🐛 Proposed fix
         # Group chunks by file path
         chunks_by_file: Dict[str, List[TextNode]] = {}
         for chunk in chunks:
-            path = chunk.metadata.get("path", str(uuid.uuid4()))
+            path = chunk.metadata.get("path", "unknown")
             if path not in chunks_by_file:
                 chunks_by_file[path] = []
             chunks_by_file[path].append(chunk)
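
If a constant fallback would lump unrelated chunks under one key, a content-derived key keeps grouping deterministic while still separating chunks; a sketch assuming the chunk text is reachable via TextNode.get_content() (the helper name is illustrative):

import hashlib

def stable_path_fallback(chunk) -> str:
    """Deterministic grouping key for chunks whose metadata lacks a path."""
    path = chunk.metadata.get("path")
    if path:
        return path
    # Same content -> same key on every run, so re-indexing replaces points instead of duplicating them.
    digest = hashlib.sha1(chunk.get_content().encode("utf-8")).hexdigest()[:16]
    return f"unknown/{digest}"
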
🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/point_operations.py`
around lines 50 - 56, The code in the grouping loop uses uuid.uuid4() as a
fallback for missing chunk.metadata["path"], making the fallback
non-deterministic and breaking upsert semantics; update the logic in the loop
that builds chunks_by_file (the for chunk in chunks block that uses
chunk.metadata.get("path", ...)) to use a deterministic fallback (e.g., a
constant like "unknown" or a stable computed key) or explicitly raise an error
when "path" is missing so that TextNode objects without a stable path cannot
produce random IDs and duplicate points on re-indexing.

Comment on lines +1 to +28
; Python tree-sitter queries for AST-based code splitting

; Import statements
(import_statement) @import

(import_from_statement) @import

; Class definitions
(class_definition
  name: (identifier) @name) @definition.class

; Function definitions
(function_definition
  name: (identifier) @name) @definition.function

; Decorated definitions (class or function with decorators)
(decorated_definition) @definition.decorated

; Decorators
(decorator) @decorator

; Assignment statements (module-level constants)
(assignment
  left: (identifier) @name) @definition.assignment

; Type alias (Python 3.12+)
(type_alias_statement
  name: (type) @name) @definition.type_alias

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's explore the repository structure to understand context
find . -type f -name "*.scm" | head -20

Repository: rostilos/CodeCrow

Length of output: 714


🏁 Script executed:

# Search for type_alias references in the codebase
rg -i "type_alias" --type-list | head -5

Repository: rostilos/CodeCrow

Length of output: 450


🏁 Script executed:

# Look for any documentation or tests related to tree-sitter queries
find . -type f \( -name "*.md" -o -name "*test*" -o -name "*example*" \) | grep -E "(test|example|doc)" | head -20

Repository: rostilos/CodeCrow

Length of output: 944


🏁 Script executed:

# Check if there are other Python query files or configuration
find . -type f -name "*.scm" -o -name "*tree-sitter*" -o -name "*grammar*"

Repository: rostilos/CodeCrow

Length of output: 714


🏁 Script executed:

# Search for type_alias_statement usage anywhere in the repo
rg "type_alias_statement" -A 3 -B 1

Repository: rostilos/CodeCrow

Length of output: 395


🏁 Script executed:

# Search for any documentation about AST chunking or tree-sitter usage
cat ./python-ecosystem/rag-pipeline/docs/AST_CHUNKING.md

Repository: rostilos/CodeCrow

Length of output: 5100


🏁 Script executed:

# Check if there are any test files that validate these queries
find . -type f -name "*.py" | xargs rg -l "type_alias" | head -5

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look at similar patterns in other language query files for reference
cat ./python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/typescript.scm

Repository: rostilos/CodeCrow

Length of output: 1380


🏁 Script executed:

# Search for tree-sitter configuration or dependencies
find . -type f \( -name "pyproject.toml" -o -name "requirements*.txt" -o -name "setup.py" \) | xargs grep -l "tree-sitter" 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 109


🏁 Script executed:

# Search for any usage or import of the python query file
rg "python.scm" -A 3 -B 1

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check the requirements.txt to see tree-sitter versions
cat ./python-ecosystem/rag-pipeline/requirements.txt | grep -i tree

Repository: rostilos/CodeCrow

Length of output: 476


🏁 Script executed:

# Search for any tests that use or validate these queries
find . -path "*/test*" -name "*.py" | xargs rg -l "query\|parser\|splitter" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look at how the queries are used in the actual splitter code
find . -type f -name "*.py" -path "*/splitter/*" | head -10

Repository: rostilos/CodeCrow

Length of output: 518


🏁 Script executed:

# Check the actual splitter implementation to understand how queries are used
find . -type f -name "*splitter*.py" | xargs ls -lh 2>/dev/null | head -10

Repository: rostilos/CodeCrow

Length of output: 183


🏁 Script executed:

# Check the query_runner to understand how queries are executed
cat ./python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py

Repository: rostilos/CodeCrow

Length of output: 14628


🏁 Script executed:

# Check tree_parser to see how it loads and uses queries
cat ./python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py

Repository: rostilos/CodeCrow

Length of output: 4243


🏁 Script executed:

# Search for any test files that specifically test Python queries
find . -type f -name "*.py" -path "*test*" | xargs rg -l "python.scm\|type_alias" 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 45


🏁 Script executed:

# Check if there's any documentation or examples about type alias handling
rg "type.*alias\|type_alias_statement" --type-add 'scm:*.scm' -i

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look for tree-sitter-python documentation or examples in the repo
find . -type f -name "*.md" -o -name "*.txt" | xargs rg -i "python.*3.12\|type.*alias" 2>/dev/null | head -20

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Search for any Python 3.12+ type alias examples or tests
rg "type\s+\w+\s*=" --type py -B 2 -A 2 | head -30

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check if there are any comments or documentation about the type_alias_statement
rg -i "type.*alias.*grammar\|type_alias.*node\|type_alias.*structure" --type md --type txt

Repository: rostilos/CodeCrow

Length of output: 43




🌐 Web query:

tree-sitter-python type_alias_statement grammar node structure

💡 Result:

In tree-sitter-python, a PEP 695 alias like:

type Alias[T] = dict[str, T]

is parsed as a type_alias_statement node whose important named children are exposed via fields (Tree-sitter “field names” become addressable children in the produced syntax tree). [1]

As of tree-sitter-python 0.23.6, type_alias_statement has these key fields: [2]

  • left: the alias “header” (the alias name, and—if present—its type parameters, e.g. Alias[T])
  • right: the aliased type expression on the RHS (e.g. dict[str, T])

So structurally (schematically) it looks like:

(type_alias_statement
  left:  <alias-name-or-generic-alias>
  right: <type-expression>)

where the literal tokens type and = are also present as (typically anonymous) children in the node’s full child list.

Sources:
[1] https://tree-sitter.github.io/tree-sitter/creating-parsers/2-the-grammar-dsl.html
[2] https://www.rpmfind.info/linux/RPM/opensuse/16.0/x86_64/tree-sitter-python-0.23.6-bp160.1.12.x86_64.html


🏁 Script executed:

# Search for tree-sitter-python grammar documentation in the repo
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "*.py" \) | xargs grep -l "tree-sitter" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 345


🌐 Web query:

tree-sitter-python type_alias_statement left field identifier example query

💡 Result:

In tree-sitter-python, type_alias_statement does not define a left: field — it’s just:

type_alias_statement: $ => prec.dynamic(1, seq(
  'type',
  $.type,
  '=',
  $.type,
))

[1]

So you match the first type child (the alias “name” side) structurally in your query (not via left:). Field syntax in queries only works when the grammar assigned a field name. [2]

Example source

type Point = tuple[int, int]
type Box[T] = tuple[T, T]

Query: capture the alias identifier (simple type Name = ...)

(type_alias_statement
  "type"
  (type (identifier) `@type_alias.name`)
  "="
  (type) `@type_alias.value`)

(Works when the LHS type is just an identifier.)

Query: capture the alias identifier for generics too (type Name[T] = ...)

In this grammar, a generic LHS is parsed as generic_type: $.identifier $.type_parameter. [1]

(type_alias_statement
  "type"
  (type
    (generic_type
      (identifier) `@type_alias.name`
      (type_parameter) `@type_alias.type_params`))
  "="
  (type) `@type_alias.value`)

Sources: [1] [2]


🏁 Script executed:

# Check if there are any other Python query files or examples we can reference
find . -type f -name "*.py" -path "*splitter*" -o -path "*query*" | xargs grep -l "scm\|query" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 364


Fix type_alias_statement query—incorrect field syntax and missing identifier nesting.

The type_alias_statement query on lines 27–28 uses incorrect syntax. Tree-sitter-python's type_alias_statement does not expose a name: field; the grammar structures it as a sequence: 'type', $.type, '=', $.type. To capture the alias name, nest the identifier within the first (type) node:

Suggested fix
; Type alias (Python 3.12+)
(type_alias_statement
  "type"
  (type (identifier) `@name`)
  "="
  (type)) `@definition.type_alias`

For generic aliases (e.g., type Point[T] = ...), also handle generic_type:

(type_alias_statement
  "type"
  (type
    [(identifier) `@name`
     (generic_type (identifier) `@name`)])
  "="
  (type)) `@definition.type_alias`
🤖 Prompt for AI Agents
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/queries/python.scm`
around lines 1 - 28, The type_alias_statement query is using an incorrect field
syntax; update the (type_alias_statement) query so it matches the grammar
sequence ("type" (type ...) "=" (type)) and capture the alias identifier by
nesting (identifier) `@name` inside the first (type) node instead of using a
non-existent name: field; also add an alternative to handle generic aliases by
allowing the first (type) to be either (identifier) `@name` or (generic_type
(identifier) `@name`), and keep the whole node annotated as
`@definition.type_alias`.

if not matches:
    return []

source_bytes = text.encode('utf-8')

⚠️ Potential issue | 🟠 Major

Remove or wire unused captures to avoid F841.

These locals are never used and will trip Ruff. Either propagate them into ASTChunk (if intended) or drop them.

Proposed diff (remove unused locals)
-        source_bytes = text.encode('utf-8')
@@
-                visibility = match.get(f'{match.pattern_name}.visibility')
-                return_type = match.get(f'{match.pattern_name}.return_type')
-                params = match.get(f'{match.pattern_name}.params')
                 modifiers = []

Also applies to: 340-342

🧰 Tools
🪛 Ruff (0.14.14)

[error] 263-263: Local variable source_bytes is assigned to but never used

Remove assignment to unused variable source_bytes

(F841)

🤖 Prompt for AI Agents
In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py` at
line 263, source_bytes is computed but never used, causing a Ruff F841; either
remove this unused local and the other unused captures around lines 340-342, or
wire them into the ASTChunk so they become meaningful fields. To fix: if you
need the data, add parameters/attributes to the ASTChunk constructor/definition
(class ASTChunk) and pass source_bytes (and the other captures) into the
ASTChunk creation sites; otherwise delete the unused local assignments (e.g.,
remove source_bytes = text.encode('utf-8') and the analogous unused variables at
340-342). Ensure any tests or downstream code expect the new ASTChunk signature
if you choose to propagate the values.
