Multithreaded HTTP Web Proxy with LRU Cache

This project is a multithreaded HTTP web proxy implemented in C with a custom LRU cache for responses and an HTTP request parsing library. The proxy accepts HTTP requests from clients (e.g., browsers or curl), forwards them to the remote server, relays the response back to the client, and optionally stores the response in an in‑memory cache so that repeated requests are served faster.

The core logic lives in:

proxy_server_with_cache.c – proxy server, threading, networking, caching, error handling
proxy_parse.h – interface for the HTTP request parsing library used by the proxy

High‑Level Features

Acts as an HTTP proxy for HTTP/1.0 and HTTP/1.1 requests.
Supports only the GET method (other methods are rejected / not processed).
Forwards requests to remote servers (e.g., google.com) and streams back responses.
Response caching with LRU eviction:
- Cache entries are keyed by the full request string.
- Cache elements are bounded by MAX_ELEMENT_SIZE.
- Global cache size is bounded by MAX_SIZE.
- LRU (Least Recently Used) eviction when the cache is full.
Multi‑threaded concurrency:
- Each client connection is handled in a separate thread.
- MAX_CLIENTS controls the maximum concurrent active clients.
- A semaphore limits concurrent workers; a mutex protects shared cache state.
Basic HTTP error responses generated by the proxy itself for invalid/unsupported cases.
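
The concurrency bullets above can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual code: `SKETCH_MAX_CLIENTS`, `worker`, `run_workers`, and the `active`/`peak` counters are hypothetical names introduced here, and the real value of MAX_CLIENTS is not shown in this README.

```c
#include <pthread.h>
#include <semaphore.h>

#define SKETCH_MAX_CLIENTS 2   /* assumption: stands in for MAX_CLIENTS */

static sem_t slots;                                   /* counts free worker slots */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int active = 0;                                /* workers currently past the gate */
static int peak = 0;                                  /* highest value active has reached */

/* Worker body: take a slot, touch shared state under the mutex, release the slot. */
static void *worker(void *arg)
{
    (void)arg;
    sem_wait(&slots);              /* blocks while SKETCH_MAX_CLIENTS workers are active */

    pthread_mutex_lock(&state_lock);
    active++;
    if (active > peak) peak = active;
    pthread_mutex_unlock(&state_lock);

    /* ... a real worker would serve its client here ... */

    pthread_mutex_lock(&state_lock);
    active--;
    pthread_mutex_unlock(&state_lock);

    sem_post(&slots);              /* free the slot for the next client */
    return NULL;
}

/* Starts up to n workers, waits for them, and returns the peak concurrency seen. */
int run_workers(int n)
{
    pthread_t tids[64];
    int started = 0;

    if (n > 64) n = 64;
    sem_init(&slots, 0, SKETCH_MAX_CLIENTS);
    active = 0;
    peak = 0;

    for (int i = 0; i < n; i++) {
        if (pthread_create(&tids[started], NULL, worker, NULL) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    sem_destroy(&slots);
    return peak;
}
```

Calling `run_workers(8)` starts eight threads, but the semaphore lets at most `SKETCH_MAX_CLIENTS` of them past the gate at once. The mutex here only protects the two counters; in the proxy it plays the same role for the cache.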

Project Structure

proxy_parse.h
- Declares the struct ParsedRequest and struct ParsedHeader types.
- Declares functions to:
  - Create / destroy a parsed request: ParsedRequest_create, ParsedRequest_destroy.
  - Parse a raw HTTP request string: ParsedRequest_parse.
  - Reconstruct ("unparse") a request or just its headers: ParsedRequest_unparse, ParsedRequest_unparse_headers.
  - Manage headers (set/get/remove): ParsedHeader_set, ParsedHeader_get, ParsedHeader_remove.
- Provides a documented example of how to parse and manipulate headers programmatically.
proxy_server_with_cache.c
- Includes system headers for POSIX networking (socket, bind, listen, accept, connect, recv, send), threading (pthread), semaphores, and time utilities.
- Implements:
  - Socket server setup on a configurable port.
  - Connection handling via worker threads (pthread_create).
  - Client request parsing using the proxy_parse library.
  - Request normalization (ensuring Host and Connection: close headers exist).
  - Connection to the remote web server and streaming of the HTTP response.
  - Caching of responses using an LRU policy.
  - Basic HTTP error response generation.

Detailed Design

1. Networking and Server Setup

Key constants in proxy_server_with_cache.c:

MAX_CLIENTS – upper bound on simultaneous client connections (also initial semaphore value).
MAX_BYTES – buffer size used for reading/writing data across sockets.
MAX_ELEMENT_SIZE – maximum allowed size for a single cache element.
MAX_SIZE – maximum total size of all cached elements combined.

Startup flow in main:

Initialize a semaphore named semaphore with value MAX_CLIENTS.
Initialize a mutex named lock for synchronizing access to the global cache.
Read the proxy listening port from the command line (the program expects one argument, the port to listen on).
Create a TCP socket (proxy_socketId).
Set SO_REUSEADDR on the socket so that it can be rebound quickly.
Bind the socket to INADDR_ANY on the chosen port.
Call listen(proxy_socketId, MAX_CLIENTS) to start listening.
Enter an infinite loop where:
- accept waits for new client connections.
- For each accepted client socket, a new thread is created using pthread_create, running thread_fn.
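
The socket setup steps above (create, SO_REUSEADDR, bind, listen) might look like the sketch below. `create_listener` and its `backlog` parameter are hypothetical names used for illustration; the proxy itself does this work inline in main with proxy_socketId.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns a listening TCP socket bound to INADDR_ANY:port, or -1 on failure. */
int create_listener(unsigned short port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Allow the port to be rebound quickly after a restart. */
    int reuse = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse)) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* listen on every local interface */
    addr.sin_port = htons(port);                /* network byte order */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would then loop on `accept(fd, ...)` and hand each returned client socket to a new thread. Passing port 0 asks the kernel for any free port, which is handy for quick experiments.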

2. Per‑Connection Thread Function

Each client connection is handled by thread_fn:

Concurrency control using semaphore:
- sem_wait(&semaphore) decrements the semaphore and blocks if the maximum number of active clients has been reached.
Read the full HTTP request:
- Allocate a buffer of size MAX_BYTES and read from the client using recv.
- Continue reading until the end of the HTTP headers ("\r\n\r\n") is seen or the buffer is full.
Clone the raw request string:
- A copy of the entire incoming request (tempReq) is created.
- This copy is later used as the cache key.
Check cache first:
- Call find(tempReq) to look up a matching cached response.
- If found, stream the cached data back to the client in chunks of MAX_BYTES until the full response is sent.
- Update the element’s LRU timestamp (inside find).
If not cached, parse and forward:
- Use ParsedRequest_create and ParsedRequest_parse to parse the raw HTTP request.
- Only GET requests are supported:
  - If request->method is not "GET", the proxy prints a message and does not forward.
- Validate:
  - request->host exists.
  - request->path exists.
  - HTTP version is HTTP/1.0 or HTTP/1.1 via the checkHTTPVersion helper.
- Call handle_request to forward the request to the remote server and relay the response.
- If handle_request fails, sendErrorMessage is used to return an HTTP error to the client.
Cleanup:
- Destroy the parsed request (ParsedRequest_destroy).
- Shutdown and close the client socket (shutdown + close).
- Free request buffers and tempReq.
- sem_post(&semaphore) increments the semaphore, allowing another client to be handled.
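
The "read the full HTTP request" step can be sketched as a loop that keeps calling recv until the header terminator shows up. `read_request_head` is a hypothetical helper name, and the return convention is an assumption made for this example rather than something taken from thread_fn.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Reads from fd into buf until "\r\n\r\n" is seen, the peer closes the
 * connection, or buf is full. buf is always NUL-terminated.
 * Returns the number of bytes read, or -1 on a recv error. */
ssize_t read_request_head(int fd, char *buf, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    buf[0] = '\0';

    while (used < cap - 1) {                      /* keep one byte for the NUL */
        ssize_t n = recv(fd, buf + used, cap - 1 - used, 0);
        if (n < 0)
            return -1;                            /* socket error */
        if (n == 0)
            break;                                /* peer closed the connection */
        used += (size_t)n;
        buf[used] = '\0';
        if (strstr(buf, "\r\n\r\n") != NULL)
            break;                                /* end of headers reached */
    }
    return (ssize_t)used;
}
```

A single recv call may return only part of a request, which is why the loop is needed. Since only GET is supported, stopping at the blank line is enough, because no request body is expected.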

3. HTTP Request Construction and Forwarding

The handle_request function is responsible for transforming the client’s request and talking to the real server:

Build a normalized request line + headers:
- Start with "GET", the request->path, and request->version, followed by "\r\n".
- Ensure that the Connection header is set to "close" using ParsedHeader_set.
- Ensure that the Host header exists; if not, set it to request->host.
- Use ParsedRequest_unparse_headers to serialize only the headers into the same buffer, appending them to the request line.
- Final buffer structure is:
```
GET /path HTTP/1.1\r\n
Host: example.com\r\n
Connection: close\r\n
...other headers...\r\n
\r\n
```
Determine upstream server port:
- Default is 80.
- If request->port is non‑NULL, convert it using atoi and use that port instead.
Connect to the remote server:
- Use connectRemoteServer(request->host, server_port) to open a TCP connection.
- This helper:
  - Resolves the host name via gethostbyname.
  - Fills a sockaddr_in structure.
  - Calls connect and returns the socket descriptor on success.
Send request and stream response:
- send the fully constructed HTTP request to the remote server.
- Repeatedly recv response chunks from the remote server into buff.
- For each chunk:
  - Immediately send it to the client.
  - Append it into a dynamically growing buffer (temp_buffer) used to accumulate the full response for caching.
- When recv returns 0 or negative, stop reading.
Cache the response:
- Null‑terminate temp_buffer.
- Call add_cache_element(temp_buffer, strlen(temp_buffer), tempReq):
  - tempReq is the original raw request string, used as the cache key.
  - data points to the full HTTP response as received from the remote server.
- Free temporary buffers and close the remote server socket.
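
As a rough illustration of the first two steps above, the request line and the upstream port could be derived as shown here. `build_request_line` and `upstream_port` are hypothetical helpers written for this README; in the proxy the equivalent logic sits inside handle_request and reads its inputs from the ParsedRequest fields.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes "GET <path> <version>\r\n" into buf.
 * Returns the length written, or -1 if it does not fit in cap bytes. */
int build_request_line(char *buf, size_t cap, const char *path, const char *version)
{
    int n = snprintf(buf, cap, "GET %s %s\r\n", path, version);
    if (n < 0 || (size_t)n >= cap)
        return -1;                 /* output was truncated or formatting failed */
    return n;
}

/* Picks the upstream port: 80 unless the request carried an explicit port. */
int upstream_port(const char *port)
{
    if (port == NULL)
        return 80;
    /* atoi mirrors the description above; it yields 0 for non-numeric input,
     * so strtol would be the choice if bad ports need to be detected. */
    return atoi(port);
}
```

With a path of `/index.html` and a version of `HTTP/1.1` the first helper produces `GET /index.html HTTP/1.1\r\n`, after which the serialized headers would be appended to the same buffer.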

4. HTTP Error Handling

sendErrorMessage(int socket, int status_code) builds and sends HTML error responses generated entirely by the proxy. It supports:

400 Bad Request
403 Forbidden
404 Not Found
500 Internal Server Error
501 Not Implemented
505 HTTP Version Not Supported

Each response includes:

An appropriate HTTP/1.1 status line.
Content-Length, Content-Type: text/html, and Connection: keep-alive headers.
A Date header formatted with gmtime and strftime.
A simple HTML body describing the error.

This function is called when parsing fails, when unsupported methods are used, or when forwarding fails, depending on the logic in thread_fn and handle_request.
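
A minimal sketch of how such a response could be assembled is shown below. `build_error_response` and `reason_phrase` are hypothetical names, and the HTML body is an assumption, since this README does not reproduce the exact markup that sendErrorMessage emits. The sketch formats into a caller-supplied buffer instead of writing to a socket so that it can be tried in isolation.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Maps a status code to its reason phrase; NULL for codes not handled here. */
static const char *reason_phrase(int status)
{
    switch (status) {
    case 400: return "Bad Request";
    case 403: return "Forbidden";
    case 404: return "Not Found";
    case 500: return "Internal Server Error";
    case 501: return "Not Implemented";
    case 505: return "HTTP Version Not Supported";
    default:  return NULL;
    }
}

/* Formats a complete error response into out.
 * Returns its length, or -1 for an unknown status or a too-small buffer. */
int build_error_response(char *out, size_t cap, int status)
{
    const char *reason = reason_phrase(status);
    if (reason == NULL)
        return -1;

    /* Illustrative body; the real markup may differ. */
    char body[256];
    int body_len = snprintf(body, sizeof(body),
                            "<HTML><HEAD><TITLE>%d %s</TITLE></HEAD>\n"
                            "<BODY><H1>%d %s</H1></BODY></HTML>",
                            status, reason, status, reason);
    if (body_len < 0 || (size_t)body_len >= sizeof(body))
        return -1;

    /* Date header in GMT. gmtime is not thread-safe; gmtime_r is the
     * POSIX alternative for multithreaded callers. */
    char date[64];
    time_t now = time(NULL);
    struct tm *utc = gmtime(&now);
    if (utc == NULL ||
        strftime(date, sizeof(date), "%a, %d %b %Y %H:%M:%S GMT", utc) == 0)
        return -1;

    int n = snprintf(out, cap,
                     "HTTP/1.1 %d %s\r\n"
                     "Content-Length: %d\r\n"
                     "Content-Type: text/html\r\n"
                     "Connection: keep-alive\r\n"
                     "Date: %s\r\n"
                     "\r\n"
                     "%s",
                     status, reason, body_len, date, body);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

The resulting buffer could then be passed to send on the client socket.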

5. Caching Subsystem (LRU Cache)

The cache is built around the cache_element struct:

```c
typedef struct cache_element {
    char *data;               // Full HTTP response
    int len;                  // Length in bytes of data
    char *url;                // Full request string used as the key
    time_t lru_time_track;    // Last access time, used for LRU
    struct cache_element *next;
} cache_element;
```

Global state in [proxy_server_with_cache.c](proxy_server_with_cache.c):
- `cache_element *head;` – head of a singly linked list of cache entries.
- `int cache_size;` – total size (in bytes) of all cache elements.
- `pthread_mutex_t lock;` – protects access to `head` and `cache_size`.

#### 5.1 Cache Lookup: `find`

- Locks the mutex with `pthread_mutex_lock(&lock)`.
- Traverses the linked list starting from `head`.
- Compares each element’s `url` with the requested `url` using `strcmp`.
- If a match is found:
  - Prints debug information.
  - Updates `lru_time_track` to the current time (`time(NULL)`) to mark it as recently used.
  - Returns the `cache_element *`.
- Unlocks the mutex before returning.
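
The lookup steps above can be sketched as follows. This is an illustration under assumptions, not the project's `find`: the function is named `cache_lookup` here to make that clear, the debug printing is omitted, and the struct is repeated only so the block compiles on its own.

```c
#include <pthread.h>
#include <string.h>
#include <time.h>

typedef struct cache_element {
    char *data;
    int len;
    char *url;
    time_t lru_time_track;
    struct cache_element *next;
} cache_element;

static cache_element *head = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the entry whose key equals url, or NULL if there is none.
 * A hit refreshes the entry's LRU timestamp. */
cache_element *cache_lookup(const char *url)
{
    cache_element *hit = NULL;

    pthread_mutex_lock(&lock);
    for (cache_element *e = head; e != NULL; e = e->next) {
        if (strcmp(e->url, url) == 0) {
            e->lru_time_track = time(NULL);   /* mark as recently used */
            hit = e;
            break;
        }
    }
    pthread_mutex_unlock(&lock);              /* single unlock covers hit and miss */

    return hit;
}
```

One thing to keep in mind with a design like this: the caller uses the returned pointer after the mutex has been released, so a concurrent eviction could free the entry while it is still being streamed. Copying the data out under the lock, or holding the lock while sending, are two possible ways to avoid that.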

#### 5.2 Cache Eviction: `remove_cache_element`

- Locks the mutex.
- If the cache is non‑empty:
  - Iterates over the list to find the element with the **smallest** `lru_time_track` (oldest use).
  - Maintains pointers:
    - `temp` – current best candidate for eviction.
    - `p` – node just before `temp`.
  - Removes `temp` from the list:
    - If `temp` is the `head`, move `head` to `head->next`.
    - Otherwise, set `p->next = temp->next`.
  - Decrements `cache_size` by the size of the evicted element:
    - Subtract `temp->len` (response size).
    - Subtract `sizeof(cache_element)` and `strlen(temp->url) + 1` for metadata and key.
  - Frees `temp->data`, `temp->url`, and `temp` itself.
- Unlocks the mutex.
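
Here is a compact sketch of the eviction scan. It deviates from the description in two labeled ways: it tracks a pointer to the link that references the candidate (instead of the `temp`/`p` pair), which removes the head special case, and it leaves locking to the caller. `evict_oldest` is a hypothetical name, and it returns the freed byte count so the caller can adjust its own size counter.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct cache_element {
    char *data;
    int len;
    char *url;
    time_t lru_time_track;
    struct cache_element *next;
} cache_element;

/* Unlinks and frees the entry with the smallest lru_time_track.
 * Returns the bytes to subtract from the cache size, or 0 if the list is empty.
 * The caller is expected to hold the cache mutex. */
size_t evict_oldest(cache_element **head)
{
    if (*head == NULL)
        return 0;

    /* oldest points at the link (head or some node's next) that references
     * the current eviction candidate. */
    cache_element **oldest = head;
    for (cache_element **link = &(*head)->next; *link != NULL; link = &(*link)->next) {
        /* <= breaks timestamp ties in favour of the node further down the
         * list, which was inserted earlier when new entries go to the head. */
        if ((*link)->lru_time_track <= (*oldest)->lru_time_track)
            oldest = link;
    }

    cache_element *victim = *oldest;
    *oldest = victim->next;                  /* same code path for head and interior */

    size_t freed = (size_t)victim->len + sizeof(cache_element)
                 + strlen(victim->url) + 1;
    free(victim->data);
    free(victim->url);
    free(victim);
    return freed;
}
```

Because `time(NULL)` has one-second resolution, several entries can share a timestamp, which is why the tie-breaking rule matters in practice.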

#### 5.3 Cache Insert: `add_cache_element`

- Locks the mutex.
- Computes `element_size = size + 1 + strlen(url) + sizeof(cache_element)`.
- If `element_size > MAX_ELEMENT_SIZE`:
  - Unlocks and returns without caching (element too big).
- Otherwise, while `cache_size + element_size > MAX_SIZE`:
  - Call `remove_cache_element()` until there is enough space.
- Allocate a new `cache_element` and its `data` and `url` buffers.
- Copy the response into `data` and the key into `url`.
- Set `lru_time_track = time(NULL)`.
- Insert the new element at the head of the list: `element->next = head; head = element;`.
- Increment `cache_size` by `element_size`.
- Unlock the mutex.

This design ensures:
- Cache entries are **bounded per element** and **bounded globally**.
- Recently requested resources stay in the cache (each hit refreshes the entry's timestamp).
- Oldest, least recently used responses are evicted first.
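
The insert path, including the "evict until it fits" loop, might be sketched like this. The names `cache_add`, `element_cost`, and `evict_oldest_locked` are hypothetical, and the two `SKETCH_` limits are made-up values standing in for `MAX_ELEMENT_SIZE` and `MAX_SIZE`. The sketch evicts through a helper that assumes the lock is already held, because re-locking a default (non-recursive) pthread mutex from the thread that owns it is not safe.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SKETCH_MAX_ELEMENT_SIZE 4096   /* assumption: stands in for MAX_ELEMENT_SIZE */
#define SKETCH_MAX_SIZE         8192   /* assumption: stands in for MAX_SIZE */

typedef struct cache_element {
    char *data;
    int len;
    char *url;
    time_t lru_time_track;
    struct cache_element *next;
} cache_element;

static cache_element *head = NULL;
static size_t cache_size = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Bytes one entry contributes to cache_size. */
static size_t element_cost(const cache_element *e)
{
    return (size_t)e->len + 1 + strlen(e->url) + sizeof(cache_element);
}

/* Frees the least recently used entry. Caller must hold lock. */
static void evict_oldest_locked(void)
{
    cache_element **oldest = &head;
    for (cache_element **link = &head; *link != NULL; link = &(*link)->next) {
        if ((*link)->lru_time_track <= (*oldest)->lru_time_track)
            oldest = link;             /* ties go to the older insert */
    }
    cache_element *victim = *oldest;
    if (victim == NULL)
        return;                        /* empty list */
    *oldest = victim->next;
    cache_size -= element_cost(victim);
    free(victim->data);
    free(victim->url);
    free(victim);
}

/* Returns 1 if the response was cached, 0 if it was rejected. */
int cache_add(const char *data, int size, const char *url)
{
    if (size < 0)
        return 0;
    size_t cost = (size_t)size + 1 + strlen(url) + sizeof(cache_element);
    if (cost > SKETCH_MAX_ELEMENT_SIZE)
        return 0;                      /* single element too large */

    /* Allocate and copy before taking the lock to keep the critical section short. */
    cache_element *e = malloc(sizeof(*e));
    char *data_copy = malloc((size_t)size + 1);
    char *url_copy = malloc(strlen(url) + 1);
    if (e == NULL || data_copy == NULL || url_copy == NULL) {
        free(e); free(data_copy); free(url_copy);
        return 0;
    }
    memcpy(data_copy, data, (size_t)size);
    data_copy[size] = '\0';
    strcpy(url_copy, url);
    e->data = data_copy;
    e->len = size;
    e->url = url_copy;

    pthread_mutex_lock(&lock);
    while (head != NULL && cache_size + cost > SKETCH_MAX_SIZE)
        evict_oldest_locked();         /* make room, oldest first */
    e->lru_time_track = time(NULL);
    e->next = head;                    /* new entries go to the head */
    head = e;
    cache_size += cost;
    pthread_mutex_unlock(&lock);
    return 1;
}
```

With these sketch limits, adding three 3000-byte responses in a row leaves two entries in the list, since the third insert has to evict one of the earlier ones to stay under the global bound.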

### 6. HTTP Request Parsing Library (`proxy_parse.h`)

`proxy_parse.h` defines the abstraction used for parsing and manipulating HTTP requests:

- `struct ParsedRequest` holds:
  - `method`, `protocol`, `host`, `port`, `path`, `version`.
  - A buffer and length for the raw request line.
  - A dynamic array/list of `ParsedHeader` entries.

- `struct ParsedHeader` represents one HTTP header as a `key: value` pair.

Key functions used by the proxy:
- `ParsedRequest_create` – allocate and initialize a `ParsedRequest`.
- `ParsedRequest_parse` – parse a raw request buffer into fields and headers.
- `ParsedHeader_set` – ensure headers like `Host` and `Connection` have desired values.
- `ParsedHeader_get` – check if a particular header (e.g., `Host`) exists.
- `ParsedRequest_unparse_headers` – convert headers back into wire format, appended to the request line built in `handle_request`.

The example in the header shows how these functions work together; the proxy uses them in a similar pattern but tailored to forwarding requests.

---

## Building the Proxy

This code is written for a **POSIX environment** (Linux/Unix/macOS). On Windows you are expected to use something like **WSL** or a POSIX‑compatible toolchain (e.g., MinGW with appropriate adjustments) because it depends on headers like `<unistd.h>`, `<netinet/in.h>`, `<arpa/inet.h>`, and `<pthread.h>`.

Assuming you have a `proxy_parse.c` implementation available, a typical build command with `gcc` would look like:

```bash
gcc -Wall -O2 -pthread -o webproxy \
    proxy_server_with_cache.c proxy_parse.c
```

If the parsing library is provided as a precompiled object file or static library, adjust the command accordingly (e.g., link against -lproxyparse).

Running the Proxy

Run the compiled proxy with a port number:

./webproxy 8080

The proxy will start and listen on port 8080 (or the port you pass as argument).
It prints messages about binding and each connected client, including the client’s IP address and port.

To test with curl using the proxy:

curl -x http://localhost:8080 http://example.com/

First request to a URL: fetched from the remote server, response cached.
Second identical request: should be served from cache (you’ll see debug messages indicating cache hits).

You can also configure your browser’s HTTP proxy settings to point to localhost:8080 and browse regular HTTP sites through it (HTTPS via CONNECT is not implemented).

Known Limitations and Assumptions

Method support: Only GET is supported.
Protocol support: Designed for HTTP/1.0 and HTTP/1.1.
HTTPS / CONNECT not supported: The proxy does not implement tunneling for HTTPS.
No persistent connections to upstream: Requests are sent with Connection: close and each remote server connection is closed after the response.
Parsing library dependency: Requires a proxy_parse implementation matching proxy_parse.h.
No full header/body parsing on responses: Responses are treated as opaque byte streams and cached as‑is.
Basic error messages: Error handling is straightforward and mainly used when parsing or network operations fail.

Possible Extensions

Some natural next steps if you want to grow this project further:

Add support for additional HTTP methods such as HEAD and POST.
Implement HTTPS proxying using the CONNECT method.
Add more robust parsing and validation of both requests and responses.
Implement configurable cache policies (e.g., using Cache-Control or Expires headers).
Implement logging to files with timestamps and request/response metadata.
Add command‑line flags for cache size, element size, and maximum clients.

This README reflects the full design and behavior implied by the current code: a multi‑threaded HTTP/1.x web proxy with an in‑memory LRU cache, implemented with raw sockets, POSIX threads, and a custom HTTP parsing library.
