| 1 | +# Smart Diff Parser |
| 2 | + |
| 3 | +Multi-language parser engine for the Smart Code Diff tool, built on top of tree-sitter for robust and accurate parsing of source code. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Multi-language Support**: Java, Python, JavaScript, C++, C |
| 8 | +- **Tree-sitter Integration**: Leverages tree-sitter parsers for accurate syntax analysis |
| 9 | +- **Normalized AST**: Converts language-specific parse trees to a unified AST representation |
| 10 | +- **Language Detection**: Automatic detection of programming language from file extensions and content |
| 11 | +- **Error Handling**: Graceful handling of syntax errors with detailed error reporting |
| 12 | +- **Extensible Architecture**: Easy to add support for new programming languages |
| 13 | + |
| 14 | +## Supported Languages |
| 15 | + |
| 16 | +| Language | File Extensions | Status | |
| 17 | +|------------|----------------|--------| |
| 18 | +| Java | .java | ✅ | |
| 19 | +| Python | .py, .pyw | ✅ | |
| 20 | +| JavaScript | .js, .jsx | ✅ | |
| 21 | +| C++ | .cpp, .cc, .cxx, .c++, .hpp, .hxx, .h++ | ✅ | |
| 22 | +| C | .c, .h | ✅ | |
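The extension table above can be read as a simple lookup. The following is a minimal, standalone sketch of that lookup and not the crate's actual code: the `Language` enum here is redefined locally, `language_from_extension` is a hypothetical name, and matching is assumed to be case-sensitive.

```rust
use std::path::Path;

// Local stand-in for the crate's `Language` enum (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Java,
    Python,
    JavaScript,
    Cpp,
    C,
    Unknown,
}

/// Map a file path to a language using only its extension.
/// Hypothetical helper; extensions are taken from the table above.
pub fn language_from_extension(path: &str) -> Language {
    // `extension()` returns the text after the last dot, without the dot.
    let ext = Path::new(path).extension().and_then(|e| e.to_str());

    match ext {
        Some("java") => Language::Java,
        Some("py" | "pyw") => Language::Python,
        Some("js" | "jsx") => Language::JavaScript,
        Some("cpp" | "cc" | "cxx" | "c++" | "hpp" | "hxx" | "h++") => Language::Cpp,
        Some("c" | "h") => Language::C,
        // No extension, or one that is not in the table.
        _ => Language::Unknown,
    }
}
```

Because `.h` maps to C in the table, a C++ header named `widget.h` would be classified as C by extension alone; content-based detection is what can correct such cases.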
| 23 | + |
| 24 | +## Usage |
| 25 | + |
| 26 | +### Basic Parsing |
| 27 | + |
| 28 | +```rust |
| 29 | +use smart_diff_parser::{ |
| 30 | + tree_sitter::TreeSitterParser, |
| 31 | + ast::NodeType, language::Language, |
| 32 | + parser::Parser, |
| 33 | +}; |
| 34 | + |
| 35 | +// Create a parser |
| 36 | +let parser = TreeSitterParser::new()?; |
| 37 | + |
| 38 | +// Parse Java code |
| 39 | +let java_code = r#" |
| 40 | +public class HelloWorld { |
| 41 | + public static void main(String[] args) { |
| 42 | + System.out.println("Hello, World!"); |
| 43 | + } |
| 44 | +} |
| 45 | +"#; |
| 46 | + |
| 47 | +let result = parser.parse(java_code, Language::Java)?; |
| 48 | +println!("Parsed {} with {} functions", |
| 49 | + result.language, |
| 50 | + result.ast.find_by_type(&NodeType::Function).len()); |
| 51 | +``` |
| 52 | + |
| 53 | +### File Parsing with Auto-detection |
| 54 | + |
| 55 | +```rust |
| 56 | +use smart_diff_parser::{ |
| 57 | + tree_sitter::TreeSitterParser, |
| 58 | + parser::Parser, |
| 59 | +}; |
| 60 | + |
| 61 | +let parser = TreeSitterParser::new()?; |
| 62 | + |
| 63 | +// Language is automatically detected from file extension |
| 64 | +let result = parser.parse_file("src/main.java")?; |
| 65 | +println!("Detected language: {:?}", result.language); |
| 66 | +``` |
| 67 | + |
| 68 | +### Working with AST |
| 69 | + |
| 70 | +```rust |
| 71 | +use smart_diff_parser::ast::NodeType; |
| 72 | + |
| 73 | +// Find all functions in the AST |
| 74 | +let functions = result.ast.find_by_type(&NodeType::Function); |
| 75 | +for func in functions { |
| 76 | + if let Some(name) = func.metadata.attributes.get("name") { |
| 77 | + println!("Found function: {} at line {}", name, func.metadata.line); |
| 78 | + } |
| 79 | +} |
| 80 | + |
| 81 | +// Find all classes |
| 82 | +let classes = result.ast.find_by_type(&NodeType::Class); |
| 83 | +for class in classes { |
| 84 | + if let Some(name) = class.metadata.attributes.get("name") { |
| 85 | + println!("Found class: {} at line {}", name, class.metadata.line); |
| 86 | + } |
| 87 | +} |
| 88 | +``` |
| 89 | + |
| 90 | +### Language Detection |
| 91 | + |
| 92 | +The language detector combines file-extension lookup with weighted pattern matching over file content, scoring each candidate language and selecting the best match. |
| 93 | + |
| 94 | +#### Basic Detection |
| 95 | + |
| 96 | +```rust |
| 97 | +use smart_diff_parser::language::{Language, LanguageDetector}; |
| 98 | + |
| 99 | +// Detect from file path |
| 100 | +let lang = LanguageDetector::detect_from_path("Calculator.java"); |
| 101 | +assert_eq!(lang, Language::Java); |
| 102 | + |
| 103 | +// Detect from content using pattern analysis |
| 104 | +let java_content = r#" |
| 105 | +public class Calculator { |
| 106 | + public static void main(String[] args) { |
| 107 | + System.out.println("Hello, World!"); |
| 108 | + } |
| 109 | +} |
| 110 | +"#; |
| 111 | +let lang = LanguageDetector::detect_from_content(java_content); |
| 112 | +assert_eq!(lang, Language::Java); |
| 113 | + |
| 114 | +// Combined detection (path + content) - most accurate |
| 115 | +let lang = LanguageDetector::detect("Calculator.java", java_content); |
| 116 | +assert_eq!(lang, Language::Java); |
| 117 | +``` |
| 118 | + |
| 119 | +#### Advanced Content Detection |
| 120 | + |
| 121 | +The content-based detection uses weighted pattern matching with language-specific indicators: |
| 122 | + |
| 123 | +**Java Detection Patterns:** |
| 124 | +- Strong indicators: `public class`, `System.out.println`, `public static void main` |
| 125 | +- Medium indicators: `import java.*`, `@Override`, generics syntax |
| 126 | +- Weak indicators: `final`, `static`, `.length` |
| 127 | + |
| 128 | +**Python Detection Patterns:** |
| 129 | +- Strong indicators: `def function():`, `class Name:`, `if __name__ == "__main__"` |
| 130 | +- Medium indicators: `import`, `self.`, `True/False/None` |
| 131 | +- Indentation analysis for Python-style code blocks |
| 132 | + |
| 133 | +**JavaScript Detection Patterns:** |
| 134 | +- Strong indicators: `function`, `const/let/var`, `console.log`, arrow functions `=>` |
| 135 | +- Medium indicators: `require()`, `module.exports`, `async/await` |
| 136 | +- Template literals `${}` |
| 137 | + |
| 138 | +**C++ Detection Patterns:** |
| 139 | +- Strong indicators: `#include <iostream>`, `std::cout`, `class` with access specifiers |
| 140 | +- Medium indicators: `template<>`, `namespace`, `virtual/override` |
| 141 | +- C++ specific syntax: `::`, `nullptr`, `auto` |
| 142 | + |
| 143 | +**C Detection Patterns:** |
| 144 | +- Strong indicators: `#include <stdio.h>`, `printf()`, `malloc/free` |
| 145 | +- Medium indicators: `struct`, `typedef`, pointer syntax |
| 146 | +- C-specific headers and functions |
| 147 | + |
| 148 | +#### Confidence Scoring |
| 149 | + |
| 150 | +Each pattern has a weight. The weights of all matching patterns are summed per language, and the language with the highest total is selected, provided that total exceeds the 0.3 threshold: |
| 151 | + |
| 152 | +```rust |
| 153 | +// Example: Mixed content detection |
| 154 | +let mixed_content = r#" |
| 155 | +#include <iostream> // C++ indicator (0.9) |
| 156 | +int main() { // Weak indicator (0.2) |
| 157 | + std::cout << "Hello" << std::endl; // Strong C++ (0.8) |
| 158 | + return 0; |
| 159 | +} |
| 160 | +"#; |
| 161 | + |
| 162 | +// Total C++ score: 0.9 + 0.2 + 0.8 = 1.9 |
| 163 | +// Result: Language::Cpp |
| 164 | +let detected = LanguageDetector::detect_from_content(mixed_content); |
| 165 | +``` |
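To make the scoring rule concrete, here is a minimal sketch of weighted pattern scoring using plain substring checks. It is an illustration under stated assumptions, not the detector's implementation: the pattern table, the `score` helper, and the C weights are hypothetical, and only the C++ weights (0.9, 0.8, 0.2) and the 0.3 threshold come from the example above.

```rust
// Local stand-in for the crate's `Language` enum (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Language {
    Cpp,
    C,
    Unknown,
}

/// Minimum total score a language must exceed to be selected.
const THRESHOLD: f64 = 0.3;

/// Hypothetical pattern table: (language, substring to look for, weight).
const PATTERNS: &[(Language, &str, f64)] = &[
    (Language::Cpp, "#include <iostream>", 0.9),
    (Language::Cpp, "std::cout", 0.8),
    (Language::Cpp, "int main", 0.2),
    (Language::C, "#include <stdio.h>", 0.9),
    (Language::C, "printf(", 0.8),
    (Language::C, "int main", 0.2),
];

/// Sum the weights of every pattern for `lang` that occurs in `content`.
pub fn score(content: &str, lang: Language) -> f64 {
    let mut total = 0.0;
    for &(pattern_lang, needle, weight) in PATTERNS {
        if pattern_lang == lang && content.contains(needle) {
            total += weight;
        }
    }
    total
}

/// Return the highest-scoring language, or `Unknown` if no score exceeds the threshold.
pub fn detect_from_content(content: &str) -> Language {
    let mut best = (Language::Unknown, THRESHOLD);
    for lang in [Language::Cpp, Language::C] {
        let s = score(content, lang);
        if s > best.1 {
            best = (lang, s);
        }
    }
    best.0
}
```

With this table, a snippet containing only `int main` scores 0.2 for both languages and falls through to `Language::Unknown`, which is the behavior the threshold is meant to produce.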
| 166 | + |
| 167 | +#### Handling Edge Cases |
| 168 | + |
| 169 | +- **File extension priority**: When content is ambiguous, the file extension serves as the deciding hint |
| 170 | +- **Minimum confidence**: Requires a score > 0.3 to avoid false positives |
| 171 | +- **Penalty system**: Reduces score when conflicting language patterns are found |
| 172 | +- **Fallback**: Returns `Language::Unknown` when confidence is too low |
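The threshold and fallback rules above could be combined roughly as follows. This is a sketch under assumptions: `resolve` is a hypothetical function, the ordering (a confident content score wins, otherwise the extension hint is used) is one plausible reading of the rules, and the penalty system is not modeled.

```rust
// Local stand-in for the crate's `Language` enum (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Language {
    Java,
    Python,
    Unknown,
}

/// Scores at or below this value are treated as too weak to trust.
const MIN_CONFIDENCE: f64 = 0.3;

/// Pick a language from per-language content scores and an optional extension hint.
/// Hypothetical helper illustrating the fallback order only.
pub fn resolve(extension_hint: Option<Language>, content_scores: &[(Language, f64)]) -> Language {
    // Find the highest content score that clears the confidence threshold.
    let mut best: Option<(Language, f64)> = None;
    for &(lang, s) in content_scores {
        if s > MIN_CONFIDENCE && best.map_or(true, |(_, b)| s > b) {
            best = Some((lang, s));
        }
    }

    match (best, extension_hint) {
        // Content analysis was confident: use its result.
        (Some((lang, _)), _) => lang,
        // Content was ambiguous: fall back to the extension hint.
        (None, Some(hint)) => hint,
        // Nothing to go on.
        (None, None) => Language::Unknown,
    }
}
```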
| 173 | + |
| 174 | +## AST Structure |
| 175 | + |
| 176 | +The parser converts language-specific parse trees into a normalized AST with the following node types: |
| 177 | + |
| 178 | +### Program Structure |
| 179 | +- `Program`: Root node of the AST |
| 180 | +- `Module`: Package/import declarations |
| 181 | +- `Class`: Class definitions |
| 182 | +- `Interface`: Interface definitions |
| 183 | + |
| 184 | +### Functions and Methods |
| 185 | +- `Function`: Function declarations/definitions |
| 186 | +- `Method`: Class method declarations |
| 187 | +- `Constructor`: Constructor methods |
| 188 | + |
| 189 | +### Statements |
| 190 | +- `Block`: Code blocks |
| 191 | +- `IfStatement`: Conditional statements |
| 192 | +- `WhileLoop`: While loops |
| 193 | +- `ForLoop`: For loops |
| 194 | +- `ReturnStatement`: Return statements |
| 195 | +- `ExpressionStatement`: Expression statements |
| 196 | + |
| 197 | +### Expressions |
| 198 | +- `BinaryExpression`: Binary operations (a + b) |
| 199 | +- `UnaryExpression`: Unary operations (-a) |
| 200 | +- `CallExpression`: Function/method calls |
| 201 | +- `AssignmentExpression`: Variable assignments |
| 202 | +- `Identifier`: Variable/function names |
| 203 | +- `Literal`: String, number, boolean literals |
| 204 | + |
| 205 | +### Declarations |
| 206 | +- `VariableDeclaration`: Variable declarations |
| 207 | +- `ParameterDeclaration`: Function parameters |
| 208 | +- `FieldDeclaration`: Class fields |
| 209 | + |
| 210 | +## Node Metadata |
| 211 | + |
| 212 | +Each AST node includes metadata: |
| 213 | + |
| 214 | +```rust |
| 215 | +pub struct NodeMetadata { |
| 216 | + pub line: usize, // Line number (1-based) |
| 217 | + pub column: usize, // Column number (1-based) |
| 218 | + pub original_text: String, // Original source text |
| 219 | + pub attributes: HashMap<String, String>, // Node-specific attributes |
| 220 | +} |
| 221 | +``` |
| 222 | + |
| 223 | +Common attributes: |
| 224 | +- `name`: Identifier name for functions, classes, variables |
| 225 | +- `function_name`: Function name for call expressions |
| 226 | +- `param_count`: Number of parameters for functions |
| 227 | +- `return_type`: Return type for functions (when available) |
| 228 | +- `type`: Variable type for declarations |
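As an illustration of how the metadata and a `find_by_type` lookup fit together, the sketch below builds a tiny tree by hand and searches it. The `AstNode` struct, its `children` field, the `new` constructor, and the reduced `NodeType` enum are assumptions made for this example; only the `NodeMetadata` fields and the `name` attribute come from the text above.

```rust
use std::collections::HashMap;

// Reduced node-type set for the sketch (the real enum has more variants).
#[derive(Debug, Clone, PartialEq)]
pub enum NodeType {
    Program,
    Class,
    Function,
    Method,
}

#[derive(Debug, Clone, Default)]
pub struct NodeMetadata {
    pub line: usize,                         // Line number (1-based)
    pub column: usize,                       // Column number (1-based)
    pub original_text: String,               // Original source text
    pub attributes: HashMap<String, String>, // Node-specific attributes
}

// Hypothetical node shape: a type, its metadata, and owned children.
#[derive(Debug, Clone)]
pub struct AstNode {
    pub node_type: NodeType,
    pub metadata: NodeMetadata,
    pub children: Vec<AstNode>,
}

impl AstNode {
    /// Build a childless node with a `name` attribute at the given line.
    pub fn new(node_type: NodeType, name: &str, line: usize) -> Self {
        let mut metadata = NodeMetadata { line, column: 1, ..Default::default() };
        metadata.attributes.insert("name".to_string(), name.to_string());
        AstNode { node_type, metadata, children: Vec::new() }
    }

    /// Collect every node of the wanted type, depth-first in pre-order.
    pub fn find_by_type(&self, wanted: &NodeType) -> Vec<&AstNode> {
        let mut found = Vec::new();
        if &self.node_type == wanted {
            found.push(self);
        }
        for child in &self.children {
            found.extend(child.find_by_type(wanted));
        }
        found
    }
}
```

Returning borrowed nodes keeps the lookup cheap and lets callers read `metadata.attributes` directly, as the usage examples earlier in this README do.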
| 229 | + |
| 230 | +## Error Handling |
| 231 | + |
| 232 | +Syntax errors do not abort parsing: the parser still returns a result and collects the errors it encountered in `result.errors`: |
| 233 | + |
| 234 | +```rust |
| 235 | +let result = parser.parse(invalid_code, Language::Java)?; |
| 236 | + |
| 237 | +if !result.errors.is_empty() { |
| 238 | + println!("Parse errors found:"); |
| 239 | + for error in &result.errors { |
| 240 | + println!(" - {}", error); |
| 241 | + } |
| 242 | +} |
| 243 | +``` |
| 244 | + |
| 245 | +## Examples |
| 246 | + |
| 247 | +### Parsing Demo |
| 248 | + |
| 249 | +Run the comprehensive parsing demo: |
| 250 | + |
| 251 | +```bash |
| 252 | +cargo run --example parse_demo |
| 253 | +``` |
| 254 | + |
| 255 | +This demonstrates parsing of all supported languages with sample code, showing: |
| 256 | +- AST generation and node extraction |
| 257 | +- Function and class detection |
| 258 | +- Error handling for invalid syntax |
| 259 | +- Node attribute extraction |
| 260 | + |
| 261 | +### Language Detection Demo |
| 262 | + |
| 263 | +Run the language detection demo: |
| 264 | + |
| 265 | +```bash |
| 266 | +cargo run --example language_detection_demo |
| 267 | +``` |
| 268 | + |
| 269 | +This demonstrates the language detection capabilities: |
| 270 | +- File extension-based detection |
| 271 | +- Content-based pattern matching with confidence scoring |
| 272 | +- Combined detection strategies |
| 273 | +- Edge case handling (mixed content, ambiguous code, etc.) |
| 274 | + |
| 275 | +## Testing |
| 276 | + |
| 277 | +Run the test suite: |
| 278 | + |
| 279 | +```bash |
| 280 | +cargo test |
| 281 | +``` |
| 282 | + |
| 283 | +The tests cover: |
| 284 | +- Language detection |
| 285 | +- Parser creation and initialization |
| 286 | +- Parsing of sample code in all supported languages |
| 287 | +- AST node extraction and attribute checking |
| 288 | +- Error handling for invalid syntax |
| 289 | + |
| 290 | +## Architecture |
| 291 | + |
| 292 | +The parser is built with a modular architecture: |
| 293 | + |
| 294 | +- `language.rs`: Language detection and configuration |
| 295 | +- `language_config.rs`: Language-specific parsing configurations |
| 296 | +- `tree_sitter.rs`: Tree-sitter integration and AST conversion |
| 297 | +- `ast.rs`: AST node definitions and utilities |
| 298 | +- `parser.rs`: Main parser interface and error types |
| 299 | + |
| 300 | +## Adding New Languages |
| 301 | + |
| 302 | +To add support for a new language: |
| 303 | + |
| 304 | +1. Add the language to the `Language` enum in `language.rs` |
| 305 | +2. Add the language's tree-sitter grammar crate as a dependency in `Cargo.toml` |
| 306 | +3. Update `LANGUAGE_CONFIGS` in `language_config.rs` |
| 307 | +4. Add language-specific node mappings |
| 308 | +5. Add tests for the new language |
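Steps 1 and 4 above might look roughly like the sketch below, which adds Go as a hypothetical new language. The `map_node_kind` function and the reduced enums are assumptions for illustration and do not reflect the real layout of `LANGUAGE_CONFIGS`; the node kind strings are believed to match the respective tree-sitter grammars but should be checked against each grammar's `node-types.json`.

```rust
// Step 1: the new language gets a variant in the enum (local stand-in here).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Language {
    Java,
    Python,
    Go, // newly added language in this sketch
}

// Reduced set of normalized node types for the example.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NodeType {
    Class,
    Function,
    Method,
}

/// Step 4: map a grammar-specific node kind to a normalized node type.
/// Hypothetical helper; returns `None` for kinds without a mapping.
pub fn map_node_kind(lang: Language, kind: &str) -> Option<NodeType> {
    match (lang, kind) {
        (Language::Java, "class_declaration") => Some(NodeType::Class),
        (Language::Java, "method_declaration") => Some(NodeType::Method),
        (Language::Python, "class_definition") => Some(NodeType::Class),
        (Language::Python, "function_definition") => Some(NodeType::Function),
        // Arms for the newly added language.
        (Language::Go, "function_declaration") => Some(NodeType::Function),
        (Language::Go, "method_declaration") => Some(NodeType::Method),
        _ => None,
    }
}
```

Keeping the mapping keyed on both language and node kind means that the same kind string can map differently per language, and unmapped kinds are easy to spot in tests for step 5.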
| 309 | + |
| 310 | +## Dependencies |
| 311 | + |
| 312 | +- `tree-sitter`: Core parsing engine |
| 313 | +- `tree-sitter-java`: Java grammar |
| 314 | +- `tree-sitter-python`: Python grammar |
| 315 | +- `tree-sitter-javascript`: JavaScript grammar |
| 316 | +- `tree-sitter-cpp`: C++ grammar |
| 317 | +- `tree-sitter-c`: C grammar |
| 318 | +- `serde`: Serialization support |
| 319 | +- `uuid`: Unique node identifiers |
| 320 | +- `regex`: Pattern matching for language detection |