From 750157f28d1ef99cfb3bf1ce200f5d4cfd97a705 Mon Sep 17 00:00:00 2001
From: "google-labs-jules[bot]" <161369871+google-labs-jules[bot]@users.noreply.github.com>
Date: Fri, 25 Jul 2025 07:29:29 +0000
Subject: [PATCH] I am adding architecture.md and code_review.md.

---
 architecture.md | 51 +++++++++++++++++++++++++++++++++++++++
 code_review.md  | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+)
 create mode 100644 architecture.md
 create mode 100644 code_review.md

diff --git a/architecture.md b/architecture.md
new file mode 100644
index 0000000..06ffb42
--- /dev/null
+++ b/architecture.md
@@ -0,0 +1,51 @@
+# MapReduce Framework Architecture
+
+This document provides a detailed overview of the custom C++ MapReduce framework's architecture. The framework is designed to process large datasets in a distributed and parallel manner.
+
+## Core Components
+
+The framework consists of four main components:
+
+1. **Master Server:** The central coordinator of the MapReduce jobs.
+2. **Mapper Node:** Responsible for processing input data and generating intermediate key-value pairs.
+3. **Reducer Node:** Responsible for aggregating the intermediate data from the Mappers to produce the final output.
+4. **File Server:** A simple distributed file system for storing input, intermediate, and output files.
+
+### 1. Master Server (`master_server.cpp`, `master_server.h`)
+
+The Master Server is the brain of the MapReduce framework. It is responsible for:
+
+* **Job Management:** It receives job requests from clients, assigns tasks to Mappers and Reducers, and monitors their progress. The `Job` struct (`master_server.h`) encapsulates all the information about a MapReduce job, including its ID, the client's socket, the number of Mappers and Reducers, and the input files.
+* **Task Scheduling:** The Master divides the input files into smaller chunks (represented by the `Chunk` struct) and assigns them to available Mapper nodes. Once the Mappers complete their tasks, the Master assigns the intermediate files to the Reducer nodes.
+* **Fault Tolerance:** Fault tolerance is only partially implemented. The structure needed to re-assign the task of a failed Mapper or Reducer node to another available node is in place, but the re-assignment logic itself is not complete.
+* **Communication:** The Master communicates with the Mappers, Reducers, and clients using TCP sockets. It has a main loop that listens for incoming connections and handles requests accordingly.
+
+### 2. Mapper Node (`MapperNode.cpp`, `MapperNode.h`, `Mapper.cpp`, `Mapper.h`)
+
+The Mapper Node is responsible for the "Map" phase of the MapReduce process. Its primary tasks are:
+
+* **Data Processing:** It receives a chunk of the input file from the Master, processes it, and generates intermediate key-value pairs. The specific mapping logic is defined in the `WordCount` and `InvertedIndex` classes.
+* **Partitioning:** The intermediate key-value pairs are partitioned based on a hash function (or a custom partitioning function) and written to intermediate files. Each partition corresponds to a Reducer.
+* **Communication:** The Mapper Node communicates with the Master to receive tasks and report its status. It also interacts with the File Server to download its assigned input chunk and upload the intermediate files.
+
+### 3. Reducer Node (`ReducerNode.cpp`, `ReducerNode.h`, `Reducer.cpp`, `Reducer.h`)
+
+The Reducer Node is responsible for the "Reduce" phase of the MapReduce process. Its primary tasks are:
+
+* **Data Aggregation:** It downloads the intermediate files from the File Server for its assigned partition. It then sorts and aggregates the intermediate key-value pairs to produce the final output.
+* **Output Generation:** The Reducer writes the final output to a file on the File Server.
+* **Communication:** The Reducer Node communicates with the Master to receive tasks and report its status. It also interacts with the File Server to download the intermediate files and upload the final output file.
+
+### 4. File Server (`fs_server.cpp`, `fs_server.h`, `fs_client.cpp`, `fs_client.h`)
+
+The File Server is a simple distributed file system that provides storage for the MapReduce framework. Its responsibilities include:
+
+* **File Storage:** It stores the input files, intermediate files generated by the Mappers, and the final output files generated by the Reducers.
+* **File Access:** It provides an interface for the Mappers and Reducers to download and upload files. It also allows clients to upload input files and download the final output.
+* **Concurrency Control:** The File Server uses mutexes to handle concurrent file access from multiple Mappers and Reducers.
+
+## Communication Protocol
+
+All communication between the nodes is done via TCP sockets. The messages are simple strings with a specific format. The `Opcode` enum in `utilities.h` defines the different types of messages that can be exchanged.
+
+For example, a client can initiate a job by sending a message to the Master with the `NEW_JOB` opcode, followed by the job details. The Master then responds with a confirmation or an error message. A similar protocol is used for communication between the Master and the Mapper/Reducer nodes.
diff --git a/code_review.md b/code_review.md
new file mode 100644
index 0000000..2d96f0a
--- /dev/null
+++ b/code_review.md
@@ -0,0 +1,63 @@
+# Code Review
+
+This document contains a professional code review of the C++ MapReduce framework implementation.
+
+## General Observations
+
+The codebase implements a basic MapReduce framework with a Master, Mappers, Reducers, and a File Server. The implementation uses raw POSIX sockets for communication and pthreads for concurrency. While the framework is functional, there are several areas where the code can be improved for robustness, maintainability, and performance.
+
+## Specific Issues and Recommendations
+
+### 1. Error Handling
+
+The error handling in the codebase is inconsistent and incomplete. In many places, the code checks for errors but doesn't handle them properly. For example, in `master_server.cpp`, if a `read()` or `write()` call fails, the error is printed to the console, but the program continues to execute as if the call had succeeded, so later code may work on a stale or partially filled buffer.
+
+**Recommendation:**
+
+* Implement a consistent error-handling strategy. For example, use exceptions or return error codes to propagate errors up the call stack.
+* Ensure that all system calls and library functions are checked for errors, and that appropriate action is taken when an error occurs.
+* Use a logging framework to log errors and other important events. This will make it easier to debug the system.
+
+### 2. Concurrency and Threading
+
+The codebase uses pthreads for concurrency, but the use of threads is not always safe. For example, in `MapperNode.cpp`, a new thread is created for each incoming request, but the thread is detached, which means that the main thread cannot join it and wait for it to finish. As a result, nothing bounds the number of running threads, and the process cannot wait for in-flight work to complete before shutting down.
+
+**Recommendation:**
+
+* Use a thread pool to manage the worker threads. This will limit the number of concurrent threads and prevent the system from being overloaded.
+* Avoid detaching threads unless absolutely necessary. Instead, use `pthread_join()` to wait for the threads to finish.
+* Use mutexes and other synchronization primitives to protect shared data from race conditions.
+
+### 3. Network Communication
+
+The network communication protocol is simple but not very robust. The messages are simple strings with a custom format, which can be difficult to parse and error-prone.
+
+**Recommendation:**
+
+* Use a more structured data format for the messages, such as JSON or Protocol Buffers. This will make the messages easier to parse and less error-prone.
+* Implement a more robust message framing mechanism to ensure that the entire message is received before it is processed.
+* Consider using a higher-level networking library, such as Boost.Asio or ZeroMQ, to simplify the networking code.
+
+### 4. Code Style and Readability
+
+The code style is inconsistent, and the code is not always easy to read. For example, there are many magic numbers and hard-coded strings in the code, which makes it difficult to understand and maintain.
+
+**Recommendation:**
+
+* Adopt a consistent code style and use a linter to enforce it.
+* Use meaningful variable and function names.
+* Add comments to explain complex or non-obvious parts of the code.
+* Replace magic numbers and hard-coded strings with named constants or configuration variables.
+
+### 5. Memory Management
+
+The code uses raw pointers for memory management, which can be error-prone. For example, there are several places where `new` is used to allocate memory, but there is no corresponding `delete` to free it. This can lead to memory leaks.
+
+**Recommendation:**
+
+* Use smart pointers, such as `std::unique_ptr` and `std::shared_ptr`, to manage memory automatically. This will help to prevent memory leaks and other memory-related errors.
+* Avoid using `new` and `delete` directly unless absolutely necessary.
+
+## Conclusion
+
+The C++ MapReduce framework is a good starting point, but it needs to be improved in several areas to be considered a professional implementation. By addressing the issues raised in this code review, the developers can create a more robust, maintainable, and performant framework.
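
Note on recommendation 3 of `code_review.md` (message framing): one common way to make sure a whole message has arrived before it is processed is to prefix each message with its length. The sketch below is a minimal illustration of that idea and is not taken from the codebase. `encode_frame` and `FrameDecoder` are hypothetical names, and the 4-byte big-endian header is an assumed wire format. It requires C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Assumed wire format: 4-byte big-endian payload length, then the payload.
constexpr std::size_t kHeaderSize = 4;

// Hypothetical helper: wrap a payload in a length-prefixed frame.
std::string encode_frame(const std::string& payload) {
    assert(payload.size() <= UINT32_MAX);  // length must fit in the header
    const auto n = static_cast<std::uint32_t>(payload.size());
    std::string frame;
    frame.reserve(kHeaderSize + payload.size());
    frame.push_back(static_cast<char>((n >> 24) & 0xFF));
    frame.push_back(static_cast<char>((n >> 16) & 0xFF));
    frame.push_back(static_cast<char>((n >> 8) & 0xFF));
    frame.push_back(static_cast<char>(n & 0xFF));
    frame += payload;
    return frame;
}

// Hypothetical helper: buffers bytes as they arrive and only hands back
// messages that have been received in full.
class FrameDecoder {
public:
    // Append whatever a single read() or recv() call returned.
    void feed(const char* data, std::size_t len) { buffer_.append(data, len); }

    // Returns the next complete payload, or std::nullopt if more bytes are needed.
    std::optional<std::string> next() {
        if (buffer_.size() < kHeaderSize) return std::nullopt;
        std::uint32_t n = 0;
        for (std::size_t i = 0; i < kHeaderSize; ++i) {
            n = (n << 8) | static_cast<unsigned char>(buffer_[i]);
        }
        if (buffer_.size() - kHeaderSize < n) return std::nullopt;
        std::string payload = buffer_.substr(kHeaderSize, n);
        buffer_.erase(0, kHeaderSize + static_cast<std::size_t>(n));
        return payload;
    }

private:
    std::string buffer_;  // bytes received but not yet consumed
};
```

A receiver would call `feed()` with the bytes from each `read()` and then loop on `next()` until it returns nothing, which handles both partial reads and several messages arriving back to back. A production version would also cap the accepted length so that a corrupt header cannot force an unbounded buffer.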