Commit 3b5bf68

Check in initial diff engine and semantic analysis
1 parent 70d301c commit 3b5bf68

7 files changed, +2505 -0 lines changed

crates/parser/README.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
# Smart Diff Parser

Multi-language parser engine for the Smart Code Diff tool, built on top of tree-sitter for robust and accurate parsing of source code.

## Features

- **Multi-language Support**: Java, Python, JavaScript, C++, C
- **Tree-sitter Integration**: Leverages tree-sitter parsers for accurate syntax analysis
- **Normalized AST**: Converts language-specific parse trees to a unified AST representation
- **Language Detection**: Automatic detection of programming language from file extensions and content
- **Error Handling**: Graceful handling of syntax errors with detailed error reporting
- **Extensible Architecture**: Easy to add support for new programming languages

## Supported Languages

| Language   | File Extensions                           | Status |
|------------|-------------------------------------------|--------|
| Java       | .java                                     |        |
| Python     | .py, .pyw                                 |        |
| JavaScript | .js, .jsx                                 |        |
| C++        | .cpp, .cc, .cxx, .c++, .hpp, .hxx, .h++   |        |
| C          | .c, .h                                    |        |

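
The extension table above can be read as a simple lookup. The following is a minimal sketch of that mapping, not the crate's implementation: the `Language` enum here is a stand-in, `language_from_extension` is a hypothetical helper name, and matching is assumed to be case-sensitive.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's `Language` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Java,
    Python,
    JavaScript,
    Cpp,
    C,
    Unknown,
}

/// Map a file path to a language using the extension table
/// (hypothetical helper; case-sensitive by assumption).
pub fn language_from_extension(path: &str) -> Language {
    // `Path::extension` returns the text after the final dot, if any.
    let ext = Path::new(path).extension().and_then(|e| e.to_str());
    match ext {
        Some("java") => Language::Java,
        Some("py") | Some("pyw") => Language::Python,
        Some("js") | Some("jsx") => Language::JavaScript,
        Some("cpp") | Some("cc") | Some("cxx") | Some("c++")
        | Some("hpp") | Some("hxx") | Some("h++") => Language::Cpp,
        Some("c") | Some("h") => Language::C,
        // No extension, or one that is not in the table.
        _ => Language::Unknown,
    }
}
```

Note that `.h` maps to C in the table, so a C++ header named `api.h` would only be recognized as C++ through content analysis.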
## Usage

### Basic Parsing

```rust
use smart_diff_parser::{
    ast::NodeType,
    tree_sitter::TreeSitterParser,
    language::Language,
    parser::Parser,
};

// Create a parser
let parser = TreeSitterParser::new()?;

// Parse Java code
let java_code = r#"
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
"#;

let result = parser.parse(java_code, Language::Java)?;
// Count the nodes the normalized AST classifies as functions
println!("Parsed {} with {} functions",
    result.language,
    result.ast.find_by_type(&NodeType::Function).len());
```

### File Parsing with Auto-detection

```rust
use smart_diff_parser::{
    tree_sitter::TreeSitterParser,
    parser::Parser,
};

let parser = TreeSitterParser::new()?;

// Language is automatically detected from file extension
let result = parser.parse_file("src/main.java")?;
println!("Detected language: {:?}", result.language);
```

### Working with AST

```rust
use smart_diff_parser::ast::NodeType;

// Find all functions in the AST
let functions = result.ast.find_by_type(&NodeType::Function);
for func in functions {
    if let Some(name) = func.metadata.attributes.get("name") {
        println!("Found function: {} at line {}", name, func.metadata.line);
    }
}

// Find all classes
let classes = result.ast.find_by_type(&NodeType::Class);
for class in classes {
    if let Some(name) = class.metadata.attributes.get("name") {
        println!("Found class: {} at line {}", name, class.metadata.line);
    }
}
```

### Language Detection

The language detector identifies the programming language from the file extension, from the file content, or from both, using weighted pattern matching with confidence scoring.

#### Basic Detection

```rust
use smart_diff_parser::language::{Language, LanguageDetector};

// Detect from file path
let lang = LanguageDetector::detect_from_path("Calculator.java");
assert_eq!(lang, Language::Java);

// Detect from content using pattern analysis
let java_content = r#"
public class Calculator {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
"#;
let lang = LanguageDetector::detect_from_content(java_content);
assert_eq!(lang, Language::Java);

// Combined detection (path + content) - most accurate
let lang = LanguageDetector::detect("Calculator.java", java_content);
assert_eq!(lang, Language::Java);
```

#### Advanced Content Detection

The content-based detection uses weighted pattern matching with language-specific indicators:

**Java Detection Patterns:**
- Strong indicators: `public class`, `System.out.println`, `public static void main`
- Medium indicators: `import java.*`, `@Override`, generics syntax
- Weak indicators: `final`, `static`, `.length`

**Python Detection Patterns:**
- Strong indicators: `def function():`, `class Name:`, `if __name__ == "__main__"`
- Medium indicators: `import`, `self.`, `True/False/None`
- Indentation analysis for Python-style code blocks

**JavaScript Detection Patterns:**
- Strong indicators: `function`, `const/let/var`, `console.log`, arrow functions `=>`
- Medium indicators: `require()`, `module.exports`, `async/await`
- Template literals `${}`

**C++ Detection Patterns:**
- Strong indicators: `#include <iostream>`, `std::cout`, `class` with access specifiers
- Medium indicators: `template<>`, `namespace`, `virtual/override`
- C++ specific syntax: `::`, `nullptr`, `auto`

**C Detection Patterns:**
- Strong indicators: `#include <stdio.h>`, `printf()`, `malloc/free`
- Medium indicators: `struct`, `typedef`, pointer syntax
- C-specific headers and functions

#### Confidence Scoring

Each pattern has a weight. The language with the highest total score is selected, provided that score exceeds the 0.3 threshold:

```rust
// Example: Mixed content detection
let mixed_content = r#"
#include <iostream>  // C++ indicator (0.9)
int main() {         // Weak indicator (0.2)
    std::cout << "Hello" << std::endl;  // Strong C++ (0.8)
    return 0;
}
"#;

// Total C++ score: 0.9 + 0.2 + 0.8 = 1.9
// Result: Language::Cpp
let detected = LanguageDetector::detect_from_content(mixed_content);
```

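
To make the scoring rule concrete, here is a minimal sketch of weighted scoring with a threshold. It is an illustration under stated assumptions rather than the detector's actual code: it uses plain substring matching instead of regular expressions, the pattern table is a small hypothetical subset covering only C and C++, and the weights are the ones quoted in the example above.

```rust
/// Hypothetical stand-in for the crate's `Language` enum (C family only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Cpp,
    C,
    Unknown,
}

/// Minimum total score a language must exceed to be selected.
const MIN_CONFIDENCE: f64 = 0.3;

/// (language, substring to look for, weight). Illustrative subset only.
const PATTERNS: &[(Language, &str, f64)] = &[
    (Language::Cpp, "#include <iostream>", 0.9),
    (Language::Cpp, "std::cout", 0.8),
    (Language::Cpp, "int main(", 0.2),
    (Language::C, "#include <stdio.h>", 0.9),
    (Language::C, "printf(", 0.8),
    (Language::C, "int main(", 0.2),
];

/// Sum the weights of every pattern for `lang` that occurs in `content`.
pub fn score(content: &str, lang: Language) -> f64 {
    let mut total = 0.0;
    for &(pattern_lang, needle, weight) in PATTERNS {
        if pattern_lang == lang && content.contains(needle) {
            total += weight;
        }
    }
    total
}

/// Pick the highest-scoring language, or `Unknown` if no score
/// exceeds the threshold.
pub fn detect_from_content(content: &str) -> Language {
    let mut best = (Language::Unknown, 0.0_f64);
    for &lang in &[Language::Cpp, Language::C] {
        let s = score(content, lang);
        if s > best.1 {
            best = (lang, s);
        }
    }
    if best.1 > MIN_CONFIDENCE {
        best.0
    } else {
        Language::Unknown
    }
}
```

With this table, a snippet containing only `int main()` scores 0.2 for both languages and falls below the threshold, so the sketch reports `Language::Unknown`.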
#### Handling Edge Cases

- **File extension priority**: When the content is ambiguous, the file extension provides the hint
- **Minimum confidence**: Requires a score > 0.3 to avoid false positives
- **Penalty system**: Reduces a language's score when conflicting language patterns are found
- **Fallback**: Returns `Language::Unknown` when confidence is too low

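
One way these rules could fit together is sketched below. The README does not specify how the penalty is computed, so the formula here (subtracting half of the competing languages' combined score) and the names `resolve` and `CONFLICT_PENALTY` are assumptions made purely for illustration.

```rust
/// Hypothetical stand-in for the crate's `Language` enum (C family only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Cpp,
    C,
    Unknown,
}

/// Threshold from the README.
const MIN_CONFIDENCE: f64 = 0.3;
/// Assumed penalty factor; the real value is not documented.
const CONFLICT_PENALTY: f64 = 0.5;

/// Combine raw content scores with an extension hint.
/// `path_hint` is `Language::Unknown` when the extension is unrecognized.
pub fn resolve(path_hint: Language, raw_scores: &[(Language, f64)]) -> Language {
    let total: f64 = raw_scores.iter().map(|&(_, s)| s).sum();
    let mut best = (Language::Unknown, 0.0_f64);
    for &(lang, score) in raw_scores {
        // Penalty: evidence for other languages lowers this score.
        let conflicting = total - score;
        let adjusted = (score - CONFLICT_PENALTY * conflicting).max(0.0);
        if adjusted > best.1 {
            best = (lang, adjusted);
        }
    }
    if best.1 > MIN_CONFIDENCE {
        // Content is decisive.
        best.0
    } else {
        // Ambiguous content: use the extension hint, which is
        // `Unknown` when there is none (the fallback case).
        path_hint
    }
}
```

For example, equal raw scores of 0.4 for C and C++ drop to 0.2 each after the assumed penalty, so the result is whatever the extension suggests.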
## AST Structure

The parser converts language-specific parse trees into a normalized AST with the following node types:

### Program Structure
- `Program`: Root node of the AST
- `Module`: Package/import declarations
- `Class`: Class definitions
- `Interface`: Interface definitions

### Functions and Methods
- `Function`: Function declarations/definitions
- `Method`: Class method declarations
- `Constructor`: Constructor methods

### Statements
- `Block`: Code blocks
- `IfStatement`: Conditional statements
- `WhileLoop`: While loops
- `ForLoop`: For loops
- `ReturnStatement`: Return statements
- `ExpressionStatement`: Expression statements

### Expressions
- `BinaryExpression`: Binary operations (a + b)
- `UnaryExpression`: Unary operations (-a)
- `CallExpression`: Function/method calls
- `AssignmentExpression`: Variable assignments
- `Identifier`: Variable/function names
- `Literal`: String, number, boolean literals

### Declarations
- `VariableDeclaration`: Variable declarations
- `ParameterDeclaration`: Function parameters
- `FieldDeclaration`: Class fields

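
As a rough picture of how a normalized tree with these node types can be searched, the sketch below defines a reduced `NodeType` enum and a hypothetical `AstNode` struct with a recursive `find_by_type`. The struct layout and the traversal order (depth-first, parent before children) are assumptions; only the method name and the variant names come from this README.

```rust
/// Reduced, illustrative subset of the node types listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeType {
    Program,
    Class,
    Method,
    Function,
    Block,
    ReturnStatement,
}

/// Hypothetical node layout: a type tag plus owned children.
#[derive(Debug)]
pub struct AstNode {
    pub node_type: NodeType,
    pub children: Vec<AstNode>,
}

impl AstNode {
    pub fn new(node_type: NodeType, children: Vec<AstNode>) -> Self {
        AstNode { node_type, children }
    }

    /// Collect every node of the wanted type, depth-first,
    /// visiting a parent before its children.
    pub fn find_by_type(&self, wanted: &NodeType) -> Vec<&AstNode> {
        let mut found = Vec::new();
        self.collect(wanted, &mut found);
        found
    }

    fn collect<'a>(&'a self, wanted: &NodeType, found: &mut Vec<&'a AstNode>) {
        if &self.node_type == wanted {
            found.push(self);
        }
        for child in &self.children {
            child.collect(wanted, found);
        }
    }
}
```

Because the same node types are used for every language, a query such as `find_by_type(&NodeType::Method)` does not need to know which grammar produced the tree.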
## Node Metadata

Each AST node includes metadata:

```rust
use std::collections::HashMap;

pub struct NodeMetadata {
    pub line: usize,                          // Line number (1-based)
    pub column: usize,                        // Column number (1-based)
    pub original_text: String,                // Original source text
    pub attributes: HashMap<String, String>,  // Node-specific attributes
}
```

Common attributes:
- `name`: Identifier name for functions, classes, variables
- `function_name`: Function name for call expressions
- `param_count`: Number of parameters for functions
- `return_type`: Return type for functions (when available)
- `type`: Variable type for declarations

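
Since every attribute is stored as a string, numeric values such as `param_count` have to be parsed by the caller. The sketch below shows one way to do that; `describe_function` is a hypothetical helper, and the struct is repeated from above only to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Same shape as the struct shown above, repeated so the sketch compiles alone.
pub struct NodeMetadata {
    pub line: usize,
    pub column: usize,
    pub original_text: String,
    pub attributes: HashMap<String, String>,
}

/// Build a one-line summary of a function node (hypothetical helper).
/// Returns `None` when the node has no `name` attribute.
pub fn describe_function(meta: &NodeMetadata) -> Option<String> {
    let name = meta.attributes.get("name")?;
    // `param_count` is a string; treat a missing or malformed value as 0.
    let params = meta
        .attributes
        .get("param_count")
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(0);
    // `return_type` is only present when the language provides one.
    let ret = meta
        .attributes
        .get("return_type")
        .map(String::as_str)
        .unwrap_or("?");
    Some(format!(
        "{}({} params) -> {} at {}:{}",
        name, params, ret, meta.line, meta.column
    ))
}
```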
## Error Handling

As the example shows, syntax errors found while parsing are reported through `result.errors`:

```rust
// Illustrative input with a missing closing brace
let invalid_code = "public class Broken {";
let result = parser.parse(invalid_code, Language::Java)?;

if !result.errors.is_empty() {
    println!("Parse errors found:");
    for error in &result.errors {
        println!("  - {}", error);
    }
}
```

## Examples

### Parsing Demo

Run the parsing demo:

```bash
cargo run --example parse_demo
```

This demonstrates parsing of all supported languages with sample code, showing:
- AST generation and node extraction
- Function and class detection
- Error handling for invalid syntax
- Node attribute extraction

### Language Detection Demo

Run the language detection demo:

```bash
cargo run --example language_detection_demo
```

This demonstrates the language detection capabilities:
- File extension-based detection
- Content-based pattern matching with confidence scoring
- Combined detection strategies
- Edge case handling (mixed content, ambiguous code, etc.)

## Testing

Run the test suite:

```bash
cargo test
```

The tests cover:
- Language detection
- Parser creation and initialization
- Parsing of sample code in all supported languages
- AST node extraction and attribute checking
- Error handling for invalid syntax

## Architecture

The parser is built with a modular architecture:

- `language.rs`: Language detection and configuration
- `language_config.rs`: Language-specific parsing configurations
- `tree_sitter.rs`: Tree-sitter integration and AST conversion
- `ast.rs`: AST node definitions and utilities
- `parser.rs`: Main parser interface and error types

## Adding New Languages

To add support for a new language:

1. Add the language to the `Language` enum in `language.rs`
2. Add the language's tree-sitter grammar crate as a dependency in `Cargo.toml`
3. Update `LANGUAGE_CONFIGS` in `language_config.rs`
4. Add language-specific node mappings
5. Add tests for the new language

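
The steps above can be sketched roughly as follows, using Go as the new language. The README does not show the shape of `LANGUAGE_CONFIGS`, so the `LanguageConfig` struct, its fields, and `config_for_extension` are hypothetical; step 2 (adding a grammar crate such as `tree-sitter-go` to `Cargo.toml`) is not shown.

```rust
/// Step 1: add a variant for the new language (reduced, hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Java,
    Go, // newly added
}

/// Hypothetical per-language configuration entry.
pub struct LanguageConfig {
    pub language: Language,
    /// File extensions without the leading dot.
    pub extensions: &'static [&'static str],
    /// Step 4: grammar node kinds that map to the normalized `Function`/`Method` types.
    pub function_node_kinds: &'static [&'static str],
}

/// Step 3: register the new language alongside the existing ones.
pub const LANGUAGE_CONFIGS: &[LanguageConfig] = &[
    LanguageConfig {
        language: Language::Java,
        extensions: &["java"],
        function_node_kinds: &["method_declaration"],
    },
    LanguageConfig {
        language: Language::Go,
        extensions: &["go"],
        function_node_kinds: &["function_declaration", "method_declaration"],
    },
];

/// Look up the configuration for a file extension (hypothetical helper).
pub fn config_for_extension(ext: &str) -> Option<&'static LanguageConfig> {
    LANGUAGE_CONFIGS
        .iter()
        .find(|cfg| cfg.extensions.iter().any(|e| *e == ext))
}
```

Step 5 would then add tests that check both the lookup and the parsing of a small sample file in the new language.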
## Dependencies

- `tree-sitter`: Core parsing engine
- `tree-sitter-java`: Java grammar
- `tree-sitter-python`: Python grammar
- `tree-sitter-javascript`: JavaScript grammar
- `tree-sitter-cpp`: C++ grammar
- `tree-sitter-c`: C grammar
- `serde`: Serialization support
- `uuid`: Unique node identifiers
- `regex`: Pattern matching for language detection
