diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/lunar-lander.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/lunar-lander.png
new file mode 100644
index 0000000..a0d8f8e
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/lunar-lander.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/mars rover.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/mars rover.png
new file mode 100644
index 0000000..79643b8
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/mars rover.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/mars rover2.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/mars rover2.png
new file mode 100644
index 0000000..4038f0b
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/mars rover2.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/neuralnet.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/neuralnet.png
new file mode 100644
index 0000000..0dae5bc
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/neuralnet.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/policy.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/policy.png
new file mode 100644
index 0000000..6bf2499
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/policy.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/q function.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/q function.png
new file mode 100644
index 0000000..9f60426
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/q function.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/return.png b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/return.png
new file mode 100644
index 0000000..f071f99
Binary files /dev/null and b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/images/return.png differ
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch1-reinforcement-learning-introduction.md b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch1-reinforcement-learning-introduction.md
new file mode 100644
index 0000000..a02e3df
--- /dev/null
+++ b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch1-reinforcement-learning-introduction.md
@@ -0,0 +1,298 @@
+# Week 3: Reinforcement Learning
+**Learning objectives**
+- Understand key terms such as **return**, **state**, **action**, and **policy** as they apply to reinforcement learning.
+- Learn the **Bellman equations** and their significance in reinforcement learning.
+- Explore the **state-action value function** and its role in decision-making.
+- Understand how to handle **continuous state spaces** in reinforcement learning tasks.
+- Build a **deep Q-learning network** to solve problems.
+
+---
+
+## Ch 1: Reinforcement learning introduction
+**What is reinforcement learning?**
+- Reinforcement learning (RL) focuses on training systems to make decisions based on rewards received for actions taken in various states.
+
+- Unlike supervised learning, RL doesn't require labeled datasets (state-action pairs). Instead, it uses a reward system to learn optimal behavior.
+
+- Although reinforcement learning is not yet widely used in commercial applications, it is one of the pillars of machine learning.
+
+**Example**
+Let's take an autonomous helicopter as an example.
+Check out the Stanford autonomous helicopter flying videos [here](http://heli.stanford.edu).
+
+#### Problem Setup:
+The helicopter is equipped with:
+- An onboard computer
+- GPS, accelerometers, gyroscopes, and a magnetic compass for precise location and orientation tracking.
+
+The **goal**: Given the helicopter’s position and state, determine how to move the control sticks to keep the helicopter balanced and flying without crashing.
+
+In reinforcement learning terms:
+1. The **state** (\(s\)) includes the helicopter’s position, orientation, speed, etc.
+2. The **action** (\(a\)) determines how far to push the control sticks.
+3. The **reward function** measures the helicopter’s performance:
+ - Rewards smooth flying.
+ - Penalizes crashing or poor performance.
+
+The task is to find a function that maps from the **state** of the helicopter (\(s\)) to an **action** (\(a\)), guided by the **reward function**, which encourages good behavior (e.g., stable flight) and discourages bad outcomes (e.g., crashes).
+
+---
+**Why Not Supervised Learning for RL?**
+- Suppose we collected a bunch of observations of states and had an expert human pilot tell us the best action `y` to take in each one. We could then train a neural network with supervised learning to directly learn the mapping from the state `s` (the input `x`) to an action `a` (the label `y`). The problem is that when the helicopter is moving through the air, it is often ambiguous what the one exact right action is, which makes it hard to obtain a good labeled dataset.
+
+- One way to think about why reinforcement learning is so powerful is that you only have to tell it *what* to do rather than *how* to do it. Specifying the reward function rather than the optimal action gives you far more flexibility in how you design the system. Concretely, for flying the helicopter, you might give a reward of +1 for every second it is flying well, a negative reward whenever it is flying poorly, and a very large negative reward, such as -1,000, if it ever crashes.
+
+**Other applications of RL:**
+- **Factory Optimization**: Maximizing efficiency by rearranging workflows.
+- **Financial Trading**: Optimizing stock trades to minimize market impact.
+- **Game Playing**: From checkers and chess to Go and video games, RL has achieved remarkable success.
+
+
+
+
+# Mars Rover Example: Reinforcement Learning Formalism
+
+To complete our understanding of the reinforcement learning formalism, let’s explore a simplified example inspired by the Mars rover. This example, adapted from Stanford professor Emma Brunskill and collaborator Jagriti Agrawal (who has worked on controlling the actual Mars rover), will help us understand key reinforcement learning concepts.
+
+---
+
+## **The Problem Setup**
+The Mars rover can occupy one of **six positions**, represented by six boxes:
+
+
+
+
+
+- The rover starts in **state 4** (the fourth box).
+- The position of the rover is called the **state** in reinforcement learning.
+- We label these positions as **state 1**, **state 2**, ..., **state 6**.
+- **State 1** and **state 6** are called **terminal states**: once the rover reaches one of these states, it receives the reward at that state, and nothing happens after that.
+
+The **goal** of the rover:
+- Carry out science missions like:
+ - Analyzing rock surfaces using sensors such as a drill, radar, or spectrometer.
+ - Taking interesting pictures for scientists on Earth.
+
+### **Rewards**
+Each state has an associated **reward**:
+- **State 1**: Reward = **100** (most valuable state for science).
+- **State 6**: Reward = **40** (less valuable but still interesting).
+- **States 2, 3, 4, and 5**: Reward = **0** (not much interesting science here).
+
+---
+
+## **Actions**
+At every time step, the rover can take one of two actions:
+1. **Go Left**
+2. **Go Right**
+
+The objective is to determine the best sequence of actions for the rover to maximize its reward.
+
+---
+
+
+## **Core Elements of Reinforcement Learning**
+At every time step in reinforcement learning:
+1. **State (S)**: The current state of the rover (e.g., state 4).
+2. **Action (A)**: The action chosen by the rover (e.g., go left or go right).
+3. **Reward (R(S))**: The reward associated with the current state (e.g., reward = 0 in state 4).
+4. **Next State (S')**: The new state the rover transitions to as a result of the action (e.g., state 3 after going left from state 4).
+
+These four components — **state**, **action**, **reward**, and **next state** — form the foundation of reinforcement learning algorithms. They guide the decision-making process for taking actions.
+
+### **Reward Formalism**
+- The reward **R(S)** is tied to the **current state** (not the state the rover transitions to).
+- For example:
+ - At **state 4**, reward = 0.
+ - When moving to **state 3**, the reward associated with **state 4** remains 0.
+
+---
+
+
+### **Example: Moving Left**
+If the rover starts in **state 4** and chooses to go **left**:
+1. At **state 4**: Reward = 0
+2. At **state 3**: Reward = 0
+3. At **state 2**: Reward = 0
+4. At **state 1**: Reward = 100
+
+Upon reaching **state 1**, the day ends. In reinforcement learning, **state 1** is called a **terminal state**, meaning:
+- The rover receives the reward associated with that state (100 in this case).
+- The rover cannot take further actions after reaching this state (e.g., due to fuel or time constraints).
+
+---
+
+### **Example: Moving Right**
+If the rover starts in **state 4** and chooses to go **right**:
+1. At **state 4**: Reward = 0
+2. At **state 5**: Reward = 0
+3. At **state 6**: Reward = 40
+
+Upon reaching **state 6**, the day ends (another terminal state).
+
+---
+
+### **Wasting Time Example**
+The rover could also follow an inefficient sequence of actions:
+1. Start in **state 4**.
+2. Go **right** to **state 5**: Reward = 0.
+3. Then go **left** back to **state 4**, **state 3**, and **state 2**, eventually reaching **state 1**: Reward = 100.
+
+In this case, the rover wastes time and fuel by going back and forth. While this sequence is valid, it is not optimal.
+
+---
+
+## **How do you prevent wasting time?**
+We capture this trade-off using what we call the **return** in RL.
+The **return** is the total reward accumulated over time, weighted by a discount factor, which accounts for how much future rewards are valued compared to immediate rewards.
+One analogy that might help you understand the **return**: imagine there is a five-dollar bill at your feet that you can reach down and pick up, and a ten-dollar bill half an hour's walk across town. Which one would you rather go after? Ten dollars is better than five dollars, but if you need to walk half an hour to get the ten-dollar bill, it might be more convenient to just pick up the five-dollar bill instead.
+
+### Let's take the Mars rover example:
+
+
+
+
+If you start from state 4 and go left, we saw that the rewards you get are zero on the first step at state 4, zero at state 3, zero at state 2, and then 100 at state 1, the terminal state.
+The **return** is defined as the sum of these rewards, weighted by a **discount factor** (denoted by `γ` or "gamma"). The discount factor is a number slightly less than `1`. For example, let's set `γ = 0.9`.
+We start in state 4 with a reward of $0$, plus the reward of state 3 ($0$) times $γ$, plus the reward of state 2 ($0$) times $γ^2$, plus the reward of state 1 ($100$) times $γ^3$, which comes to $72.9$.
+
+The return is calculated as:
+
+```math
+G = R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_4 + \dots
+```
+```math
+G = 0 + 0.9 \cdot 0 + 0.9^2 \cdot 0 + 0.9^3 \cdot 100
+```
+```math
+G = 0 + 0 + 0 + 0.729 \cdot 100 = 72.9
+```
+
+## General Return Formula
+
+For any sequence of rewards:
+```math
+G = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots + \gamma^{n-1} R_n
+```
+
+The discount factor $\gamma$ reflects impatience in reinforcement learning, where rewards obtained sooner contribute more to the return. In many reinforcement learning algorithms, a common choice for the discount factor will be a number pretty close to 1, like 0.9, or 0.99, or even 0.999.
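+
+To make this concrete, here is a small Python sketch (an illustration, not part of the course labs) that computes the return for a given reward sequence and discount factor, reproducing the numbers above and in the tables that follow.
+
+```python
+# Discounted return: G = R_1 + gamma*R_2 + gamma^2*R_3 + ...
+def discounted_return(rewards, gamma):
+    """Sum of rewards weighted by increasing powers of the discount factor."""
+    return sum((gamma ** t) * r for t, r in enumerate(rewards))
+
+# Going left from state 4 with gamma = 0.9 (the example above):
+print(discounted_return([0, 0, 0, 100], gamma=0.9))   # ≈ 72.9
+
+# The same trajectory with gamma = 0.5 (used in the tables below):
+print(discounted_return([0, 0, 0, 100], gamma=0.5))   # 12.5
+```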
+
+---
+
+
+## Evaluating Returns from Different States
+
+Let’s consider starting from different states and always moving **left**:
+
+| Starting State | Rewards Sequence | Discounted Return |
+|----------------|------------------|-------------------|
+| State 1 | 100 | 100 |
+| State 2 | 0, 0, 100 | 50 |
+| State 3 | 0, 0, 0, 100 | 25 |
+| State 4 | 0, 0, 0, 0, 100 | 12.5 |
+
+---
+
+## Comparing Strategies: Moving Left vs. Moving Right
+
+If the Mars Rover always moves **right** instead:
+- Starting from **state 4**, the rewards are `0, 0, 40`.
+- Using $\gamma = 0.5$:
+ $$G = 0 + 0.5 \cdot 0 + 0.5^2 \cdot 40 = 10$$
+
+| Starting State | Rewards Sequence | Discounted Return |
+|----------------|------------------|-------------------|
+| State 1        | 100 (terminal)   | 100               |
+| State 2        | 0, 0, 0, 0, 40   | 2.5               |
+| State 3        | 0, 0, 0, 40      | 5                 |
+| State 4        | 0, 0, 40         | 10                |
+| State 5        | 0, 40            | 20                |
+| State 6        | 40 (terminal)    | 40                |
+
+In this case, always moving left gives higher returns than always moving right from states 2, 3, and 4; only from state 5 is moving right better (return 20), which motivates the mixed strategy below.
+
+---
+
+## Mixed Strategy Example
+
+Suppose we use a mixed strategy:
+- Move **left** from states 2, 3, and 4.
+- Move **right** from state 5.
+
+The resulting returns (again with $\gamma = 0.5$) are:
+| State | Rewards Sequence | Discounted Return |
+|---------------|------------------|-------------------|
+| State 1 | 100 | 100 |
+| State 2       | 0, 100           | 50                |
+| State 3       | 0, 0, 100        | 25                |
+| State 4       | 0, 0, 0, 100     | 12.5              |
+| State 5 | 0, 40 | 20 |
+| State 6 | 40 | 40 |
+
+---
+
+
+
+
+## Key Insights About Returns
+
+1. **Discount Factor $\gamma$**:
+ - Rewards closer to the current state contribute more to the return.
+ - Common choices for $\gamma$: `0.9`, `0.99`, or `0.999`.
+
+2. **Effect of Negative Rewards**:
+ - Negative rewards (e.g., penalties) are discounted more if they occur later, encouraging actions that delay penalties.
+
+3. **Real-World Implications**:
+ - In finance, $\gamma$ models interest rates or the time value of money.
+ - In reinforcement learning, $\gamma$ encourages faster reward accumulation.
+
+---
+
+## **Policy function $\pi$**
+
+As we've seen, there are many different ways that you can take actions in the reinforcement learning problem. For example:
+
+- We could decide to always go for the nearer reward: go left if the leftmost reward is nearer, or go right if the rightmost reward is nearer.
+- Another way to choose actions is to always go for the larger reward.
+- Alternatively, we could always go for the smaller reward (though this doesn’t seem like a good idea).
+- Or, you could choose to go left unless you're just one step away from the lesser reward, in which case you go for that one.
+
+In reinforcement learning, our goal is to come up with a function called a **policy** $\pi$, whose job is to take as input any state `s` and map it to some action `a` that it wants us to take.
+
+For example, for the policy below:
+
+- If you're in **state 2**, the policy maps to the **left action**.
+- If you're in **state 3**, the policy says **go left**.
+- If you're in **state 4**, the policy also says **go left**.
+- If you're in **state 5**, the policy says **go right**.
+
+
+
+
+Formally, $\pi(s)$ tells us what action to take in a given state `s`.
+
+The goal of reinforcement learning is to find a policy $\pi$ or $\pi(s)$ that tells you what action to take in every state in order to **maximize the return**.
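+
+For instance, the mixed strategy from the tables above can be written as a small lookup table from states to actions. The following Python sketch (an illustration, not part of the course materials) encodes that policy and recomputes its returns by following it from each state with $\gamma = 0.5$.
+
+```python
+# A policy for the 6-state Mars rover, represented as a dictionary,
+# and the return obtained by following it from a given start state.
+REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
+TERMINAL = {1, 6}
+policy = {2: "left", 3: "left", 4: "left", 5: "right"}   # the mixed strategy above
+
+def follow_policy(state, policy, gamma=0.5):
+    """Discounted return from starting in `state` and acting according to `policy`."""
+    g, discount = 0.0, 1.0
+    while True:
+        g += discount * REWARDS[state]
+        if state in TERMINAL:
+            return g
+        state = state - 1 if policy[state] == "left" else state + 1
+        discount *= gamma
+
+print([follow_policy(s, policy) for s in range(1, 7)])
+# [100.0, 50.0, 25.0, 12.5, 20.0, 40.0] -- matches the mixed-strategy table
+```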
+
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch2-state-action-value-function.md b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch2-state-action-value-function.md
new file mode 100644
index 0000000..54d27b5
--- /dev/null
+++ b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch2-state-action-value-function.md
@@ -0,0 +1,314 @@
+## Ch 2: State-action value function (Q Function)
+
+## Introduction
+When we start developing reinforcement learning algorithms, a key quantity will be the **state-action value function**, typically denoted as **Q**. Let’s break down what this function is and why it is crucial.
+
+---
+
+## What is the State-Action Value Function?
+
+The **state-action value function** (or Q function) is a function that takes:
+- A **state** `s`
+- An **action** `a`
+
+It outputs a number `Q(s, a)`, which represents the **expected return** if:
+1. You start in state `s`.
+2. Take the action `a` just once.
+3. Behave optimally thereafter (take actions that yield the highest possible return).
+
+---
+
+### Circular Definition and Resolution
+At first, this definition may seem circular:
+- How can we compute `Q(s, a)` if we don’t know the optimal behavior yet?
+- Why compute `Q(s, a)` if we already know the optimal policy?
+
+This circularity will be resolved later when we explore specific reinforcement learning algorithms.
+
+---
+
+## Example: Mars Rover Problem
+
+### Problem Setup
+Consider a policy where:
+- Go **left** from states `2, 3, 4`.
+- Go **right** from state `5`.
+
+This policy is optimal when the discount factor $\gamma = 0.5$.
+
+### Calculating `Q(s, a)`
+Let’s calculate `Q(s, a)` for a few states and actions.
+
+#### Example 1: `Q(2, right)`
+- Start in state `2` and take the action **right** once:
+  - You reach state `3`.
+  - Then follow the optimal policy: $3 \to 2 \to 1$, collecting the reward of 100 at state 1.
+- Return:
+  - $0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$
+
+#### Example 2: `Q(2, left)`
+- Start in state `2` and take the action **left**:
+  - You reach the terminal state `1` and receive the reward of `100`.
+- Return:
+  - $0 + 0.5 \cdot 100 = 50$
+
+#### Example 3: `Q(4, left)`
+- Start in state `4` and take the action **left** (reaching state `3`):
+  - Then follow the optimal policy: $4 \to 3 \to 2 \to 1$, collecting the reward of 100 at state 1.
+- Return:
+  - $0 + 0.5 \cdot 0 + 0.5^2 \cdot 0 + 0.5^3 \cdot 100 = 12.5$
+
+### Summary Table
+| State | Action | `Q(s, a)` |
+|-------|--------|--------------|
+| 2 | Left | 50 |
+| 2 | Right | 12.5 |
+| 4 | Left | 12.5 |
+| 4 | Right | 10 |
+
+If you were to carry out this exercise for all of the other states and all of the other actions, you would end up with:
+
+
+
+---
+
+## Policy from `Q(s, a)`
+
+Once we compute `Q(s, a)` for all states and actions:
+1. In each state `s`, choose the action `a` that maximizes `Q(s, a)`.
+2. This defines the optimal policy $\pi(s)$.
+
+For example:
+- In state `4`, compare:
+ - `Q(4, left) = 12.5`
+ - `Q(4, right) = 10`
+- Optimal action: **left**.
+
+---
+
+## Key Insights
+1. **Optimal Return**: The best possible return from a state `s` is:
+ $$
+ \max_a Q(s, a)
+ $$
+2. **Optimal Action**: The optimal action $\pi(s)$ is:
+ $$
+ \arg\max_a Q(s, a)
+ $$
+
+---
+
+## Terminology
+- The **state-action value function** is often denoted as:
+ - `Q(s, a)` or $Q^*(s, a)$.
+ - $Q^*(s, a)$: Refers to the optimal Q function.
+- These terms are used interchangeably in the reinforcement learning literature.
+
+---
+
+## The Bellman Equation
+
+### Notation
+To describe the Bellman Equation, the following notations are used:
+- ` S `: The current state.
+- ` R(S) `: The reward of the current state.
+- ` A `: The current action taken in state ` S `.
+- ` S' `: The next state after taking action ` A ` from ` S `.
+ - Example:
+ - Starting in ` State 4 ` and taking action `left` leads to ` S' = State 3 `.
+- ` A' `: The action taken in ` S' ` (the next state).
+
+---
+
+### The Bellman Equation
+The Bellman Equation is as follows:
+
+$$
+Q(S, A) = R(S) + \gamma \cdot \max_{A'} Q(S', A')
+$$
+
+Where:
+- $R(S)$: Reward of the current state $S$.
+- $\gamma$: Discount factor (e.g., $\gamma = 0.5$).
+- $\max_{A'} Q(S', A')$: The maximum value of $Q$ over all possible actions $A'$ in the next state $S'$.
+
+---
+
+### Example Calculations
+
+#### Example 1: ` Q(2, right) `
+1. Current state: ` S = State 2 `.
+2. Current action: ` A = right `.
+3. Next state: ` S' = State 3 `.
+4. Values:
+   - $R(\text{State 2}) = 0$.
+   - $\max_{A'} Q(\text{State 3}, A') = \max(25, 6.25) = 25$.
+
+Using the Bellman Equation:
+
+$$
+Q(2, \text{right}) = R(2) + \gamma \cdot \max_{A'} Q(3, A')
+$$
+
+Substitute values:
+$$
+Q(2, \text{right}) = 0 + 0.5 \cdot 25 = 12.5
+$$
+
+#### Example 2: ` Q(4, left) `
+1. Current state: ` S = State 4 `.
+2. Current action: ` A = left `.
+3. Next state: ` S' = State 3 `.
+4. Values:
+   - $R(\text{State 4}) = 0$.
+   - $\max_{A'} Q(\text{State 3}, A') = \max(25, 6.25) = 25$.
+
+Using the Bellman Equation:
+
+$$
+Q(4, \text{left}) = R(4) + \gamma \cdot \max_{A'} Q(3, A')
+$$
+
+Substitute values:
+$$
+Q(4, \text{left}) = 0 + 0.5 \cdot 25 = 12.5
+$$
+
+#### Terminal States
+In a **terminal state**, the Bellman Equation simplifies to:
+$$
+Q(S, A) = R(S)
+$$
+This is because there's no ` S' `, so the second term disappears.
+
+---
+
+### Key Takeaways
+1. **Definition Recap**:
+ $$
+ Q(S, A) = R(S) + \gamma \cdot \max_{A'} Q(S', A')
+ $$
+ The total return consists of two parts:
+ - Immediate reward: ` R(S) `.
+ - Discounted future return: $\gamma \cdot \max_{A'} Q(S', A')$.
+
+2. **High-Level Intuition**:
+ - The **total return** in a reinforcement learning problem can be decomposed into:
+ - **Immediate reward**: $R(S)$.
+ - **Future return**: $\gamma \cdot \max_{A'} Q(S', A')$.
+ - The Bellman Equation captures this decomposition.
+
+3. **Practical Note**:
+ Even if the Bellman Equation feels complex, you can still apply it systematically to compute values.
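+
+To see how the "circular" definition gets resolved in this small example, you can start with $Q = 0$ everywhere and repeatedly overwrite each entry using the Bellman Equation until the values stop changing. The Python sketch below (an illustration, not the course lab code) does this for the six-state rover with $\gamma = 0.5$ and reproduces the values computed above.
+
+```python
+# Iteratively applying the Bellman equation to the 6-state Mars rover.
+REWARDS = [100, 0, 0, 0, 0, 40]          # rewards for states 1..6
+TERMINAL = {0, 5}                         # indices of states 1 and 6
+GAMMA = 0.5
+ACTIONS = {"left": -1, "right": +1}
+
+Q = {(s, a): 0.0 for s in range(6) for a in ACTIONS}
+for _ in range(20):                       # enough sweeps to converge for gamma = 0.5
+    for s in range(6):
+        for a, step in ACTIONS.items():
+            if s in TERMINAL:
+                Q[(s, a)] = REWARDS[s]    # Q(S, A) = R(S) at terminal states
+            else:
+                s_next = min(max(s + step, 0), 5)
+                Q[(s, a)] = REWARDS[s] + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
+
+print(Q[(1, "right")], Q[(3, "left")])    # 12.5 12.5, i.e. Q(2, right) and Q(4, left)
+```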
+
+---
+
+# Stochastic Environments in Reinforcement Learning
+
+In many applications, the actions you take may not produce reliable or deterministic outcomes. For instance, when commanding a Mars rover to move left, environmental factors like slippery terrain or obstacles might cause it to slip or move in an unintended direction. Many robots face this challenge due to external influences like wind, uneven surfaces, or mechanical limitations.
+
+This situation can be modeled using a **stochastic environment**, which is a generalization of the reinforcement learning (RL) framework. Let’s explore this using a simplified Mars rover example:
+
+---
+
+## Stochastic Behavior of Actions
+
+Suppose your rover is in a grid with six states. If you command it to go **left**:
+- There is a **90% probability (0.9)** that it will move to the intended state.
+- There is a **10% probability (0.1)** that it will slip and move in the opposite direction.
+
+For example:
+- **In state 3**, commanding "left" has:
+ - A **90% chance** of moving to state 2.
+ - A **10% chance** of moving to state 4 instead.
+
+Similarly:
+- **Commanding "right"** in state 3:
+ - Has a **90% chance** of moving to state 4.
+ - Has a **10% chance** of moving to state 2.
+
+This randomness makes the environment **stochastic**.
+
+---
+
+## Policies and Outcomes
+
+Let’s consider a policy that specifies actions:
+- Go **left** in states 2, 3, and 4.
+- Go **right** in state 5.
+
+If the rover starts in **state 4**, the sequence of states it visits will depend on the outcomes of its actions:
+1. **First attempt**:
+ - Command "left": Success! The rover moves to state 3.
+ - Command "left" again: Success! It moves to state 2.
+ - Command "left" again: Success! It moves to state 1 and collects the reward.
+
+ Sequence: `4 → 3 → 2 → 1`, with rewards `0, 0, 0, 100`.
+
+2. **Second attempt**:
+ - Command "left": Success! The rover moves to state 3.
+ - Command "left" again: Failure! It slips and moves back to state 4.
+ - Command "left" again: Success! It moves to state 3, and so on.
+
+   Sequence: `4 → 3 → 4 → 3 → 2 → 1`, with rewards `0, 0, 0, 0, 0, 100`.
+
+3. **Third attempt**:
+ - Command "left": Failure! It slips and moves to state 5.
+ - Command "right": Success! It moves to state 6.
+
+ Sequence: `4 → 5 → 6`, with rewards `0, 0, 40`.
+
+---
+
+## Expected Return in Stochastic Environments
+
+In a stochastic reinforcement learning problem:
+- The **sequence of rewards** is random because the outcome of each action is uncertain.
+- Instead of maximizing a single return, we focus on **maximizing the expected return**:
+ - The average of the **sum of discounted rewards** over many trials.
+
+Mathematically, the expected return is:
+$$
+\mathbb{E}[R_1 + \gamma R_2 + \gamma^2 R_3 + \dots]
+$$
+where:
+- $\mathbb{E}$ denotes the expected value (average over all possible outcomes).
+- $R_t$ is the reward at time $t$.
+- $\gamma$ is the discount factor.
+
+---
+
+## The Bellman Equation in Stochastic Environments
+
+In a deterministic environment, the Bellman equation is:
+$$
+Q(s, a) = R(s) + \gamma \cdot \max_{a'} Q(s', a')
+$$
+where `s'` is the next state reached after taking action `a` in state `s`.
+
+In stochastic environments:
+- The next state `s'` is **random**, so we take the **expected value** over all possible next states:
+$$
+Q(s, a) = R(s) + \gamma \cdot \mathbb{E}_{s'}\left[\max_{a'} Q(s', a')\right]
+$$
+This accounts for the uncertainty in the transition from `s` to `s'`.
+
+---
+
+## Practical Example: Mars Rover Misstep Probability
+
+Let’s define a **misstep probability**:
+- `p = 0.1`: The rover slips 10% of the time.
+
+If you follow the optimal policy:
+- The **optimal return** will decrease as `p` increases because the rover’s control becomes less reliable.
+- For example:
+ - At `p = 0.1`, the optimal return is slightly reduced.
+ - At `p = 0.4`, the optimal return drops significantly because the rover follows commands correctly only 60% of the time.
+
+### Experiment:
+You can simulate this by adjusting the misstep probability in a reinforcement learning lab or notebook. Observe how:
+- The **expected return** changes.
+- The **Q-values** (state-action values) decrease as control reliability diminishes.
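+
+As a rough sanity check of this behaviour, the following Monte Carlo sketch (an illustration, not the course lab) rolls out the "left from 2, 3, 4; right from 5" policy many times under a misstep probability `p` and averages the discounted returns.
+
+```python
+# Estimate the expected return of the fixed policy when each command succeeds
+# with probability 1 - p and slips in the opposite direction with probability p.
+import random
+
+REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
+TERMINAL = {1, 6}
+policy = {2: "left", 3: "left", 4: "left", 5: "right"}
+
+def rollout(start, p, gamma=0.5):
+    """One noisy episode: returns the discounted sum of rewards collected."""
+    state, g, discount = start, 0.0, 1.0
+    while True:
+        g += discount * REWARDS[state]
+        if state in TERMINAL:
+            return g
+        intended = -1 if policy[state] == "left" else +1
+        state += intended if random.random() < 1 - p else -intended
+        discount *= gamma
+
+def expected_return(start, p, episodes=100_000):
+    """Monte Carlo average of the discounted return over many episodes."""
+    return sum(rollout(start, p) for _ in range(episodes)) / episodes
+
+print(expected_return(4, p=0.0))   # 12.5: the deterministic case
+print(expected_return(4, p=0.1))   # about 10.5: slightly reduced
+print(expected_return(4, p=0.4))   # about 5.9: drops significantly
+```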
+
+---
\ No newline at end of file
diff --git a/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch3-continuous-state-spaces.md b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch3-continuous-state-spaces.md
new file mode 100644
index 0000000..b500bc3
--- /dev/null
+++ b/Course 3 - Unsupervised Learning and Recommender Systems/week 3- reinforcement learning/w3-ch3-continuous-state-spaces.md
@@ -0,0 +1,278 @@
+## Ch 3: Continuous state spaces
+
+# Continuous State Spaces in Reinforcement Learning
+
+Many robotic control applications involve **continuous state spaces**. Let’s explore what this means and how to generalize concepts from discrete states to continuous ones.
+
+---
+
+## Discrete vs. Continuous States
+
+In the simplified Mars rover example, we used a **discrete set of states**, meaning the rover could only be in one of six possible positions. For example:
+- The rover could be in **state 1, 2, 3, 4, 5, or 6**.
+
+However, most robots can occupy a vast number of positions rather than just a small discrete set. Their positions are better represented as **continuous values**. For example:
+- If the Mars rover can be anywhere along a line, its position might range from **0 to 6 kilometers**, and it could occupy **any value** within this range:
+ - E.g., 2.7 km, 4.8 km, or any other decimal.
+
+This is an example of a **continuous state space**, where the state is represented by a real-valued number.
+
+---
+
+## Example 1: Controlling a Car or Truck
+
+Let’s consider a toy car or truck. If you are building a **self-driving car** and want it to drive smoothly, the state of the vehicle might include the following:
+
+1. **Position**:
+ - `x`: Its position along the horizontal axis.
+ - `y`: Its position along the vertical axis.
+
+2. **Orientation**:
+   - $\theta$: The angle the car is facing (e.g., between $0^\circ$ and $360^\circ$).
+
+3. **Velocities**:
+ - $\dot{x}$: Speed in the `x`-direction.
+ - $\dot{y}$: Speed in the `y`-direction.
+ - $\dot{\theta}$: Angular velocity (rate at which the car is turning).
+
+### Summary of Car State
+For a car, the state is represented as a **vector** of six continuous numbers:
+$$
+[x, y, \theta, \dot{x}, \dot{y}, \dot{\theta}]
+$$
+Each of these values can take on a range of real numbers, such as:
+- $\theta$: Ranges between $0^\circ$ and $360^\circ$.
+- $\dot{\theta}$: Indicates whether the car is turning at 1°/s, 30°/s, or even 90°/s.
+
+---
+
+## Example 2: Controlling an Autonomous Helicopter
+
+Now, let’s extend the concept to an **autonomous helicopter**. To control a helicopter, we need to capture both its position and orientation:
+
+1. **Position**:
+ - `x`: North-south position.
+ - `y`: East-west position.
+ - `z`: Height above the ground.
+
+2. **Orientation**:
+ - $\phi$: Roll (tilt left or right).
+ - $\theta$: Pitch (tilt forward or backward).
+ - $\psi$: Yaw (compass direction: north, east, south, or west).
+
+3. **Velocities**:
+ - $\dot{x}, \dot{y}, \dot{z}$: Speeds in the x-, y-, and z-directions.
+ - $\dot{\phi}, \dot{\theta}, \dot{\psi}$: Angular velocities (how quickly the roll, pitch, and yaw are changing).
+
+### Summary of Helicopter State
+For a helicopter, the state is represented as a **vector** of 12 continuous numbers:
+$$
+[x, y, z, \phi, \theta, \psi, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\psi}]
+$$
+Each of these values can take a range of real numbers.
+
+---
+
+## Continuous State Markov Decision Processes (MDPs)
+
+In a **continuous state reinforcement learning problem**:
+- The state isn’t just one of a small set of discrete values (e.g., 1-6 for the Mars rover).
+- Instead, the state is a **vector** of continuous values, any of which can take on a large range of possible values.
+
+This is referred to as a **continuous state Markov Decision Process (MDP)**.
+
+---
+
+# Lunar Lander Application in Reinforcement Learning
+
+The **lunar lander** task is a classic example of a reinforcement learning problem, often used by researchers to test algorithms. In this task, the goal is to control a simulated lunar lander so that it can land safely on the moon's surface.
+
+## Objective
+In the lunar lander simulation, you are in charge of landing a spacecraft that is rapidly approaching the surface of the moon. Your task is to fire thrusters at the right moments to slow down and guide the lander to land between two flags on the landing pad.
+
+### Successful Landing
+A successful landing involves the lander firing thrusters downward, left, and right to position itself perfectly between two flags on the landing pad.
+
+
+
+### Crash Scenario
+On the other hand, if the reinforcement learning algorithm fails, the lander might crash, as shown here:
+
+
+## Actions
+There are four possible actions at each time step:
+- **Nothing**: No action taken, and inertia and gravity pull the lander toward the surface.
+- **Left**: Fire the left thruster to move the lander to the right.
+- **Main**: Fire the main engine to slow the descent.
+- **Right**: Fire the right thruster to move the lander to the left.
+
+These actions are abbreviated as: **Nothing**, **Left**, **Main**, and **Right**.
+
+## State Space
+The state space of the lunar lander consists of several variables:
+- **Position (X, Y)**: The lander's position on the horizontal (X) and vertical (Y) axes.
+- **Velocity ($\dot{X}$, $\dot{Y}$)**: The speed of the lander along both axes.
+- **Angle ($\theta$)**: The lander's tilt or orientation.
+- **Angular Velocity ($\dot{\theta}$)**: The rate of change of the angle.
+- **Leg Grounding ($l$, $r$)**: Binary values (0 or 1) indicating whether the left leg or the right leg is touching the ground.
+
+## Reward Function
+The lunar lander has a complex reward function designed to encourage safe landing and minimize fuel waste:
+- **Landing on the pad**: Between +100 and +140 points depending on the precision of landing.
+- **Moving toward the pad**: Positive reward for moving closer, negative reward for drifting away.
+- **Crash**: Large penalty of -100 points for crashing.
+- **Soft landing**: +100 points for landing softly (no crash).
+- **Leg grounding**: +10 points for each leg grounded.
+- **Fuel consumption**: Small penalties for firing thrusters (-0.3 for the main engine and -0.03 for the side thrusters).
+
+This reward function is crucial because it guides the learning process of the agent.
+
+## Reinforcement Learning Goal
+The goal is to learn a policy **π** that, given a state **S**, picks an action **a** to maximize the return, which is the sum of discounted rewards. The **discount factor ($\gamma$)** is usually set to a high value (around **0.985**) to emphasize long-term rewards.
+
+## Learning Algorithm
+To solve this problem, we will use **deep reinforcement learning** and neural networks to develop a policy that can successfully land the lunar lander.
+
+Let's see how we can use reinforcement learning to control the Lunar Lander or for other reinforcement learning problems. The key idea is that we're going to train a neural network to compute or to approximate the state-action value function $Q(s, a)$, which in turn will help us pick good actions.
+
+## Neural Network Architecture
+
+The heart of the learning algorithm is to train a neural network that inputs the current state and action and computes or approximates $Q(s, a)$.
+
+In particular, for the Lunar Lander, we will take the state ` s ` and any action ` a ` and put them together.
+
+### State Representation
+
+The state ` s ` consists of 8 values:
+
+- ` x `, ` y ` (coordinates)
+- $\dot{x}$, $\dot{y}$ (velocity)
+- $\theta$, $\dot{\theta}$ (angle and angular velocity)
+- ` l `, ` r ` (binary values indicating whether the left and right landing legs are touching the ground)
+
+### Action Representation
+
+There are four possible actions:
+1. Do nothing
+2. Left thruster
+3. Main engine
+4. Right thruster
+
+Each action is encoded using a one-hot vector. For example:
+- Action 1 (nothing) is encoded as `[1, 0, 0, 0]`
+- Action 2 (left) is encoded as `[0, 1, 0, 0]`, and so on.
+
+Thus, the input to the neural network is a 12-dimensional vector: 8 for the state and 4 for the action encoding.
+
+### Neural Network Structure
+
+
+The neural network consists of:
+- **Input Layer**: 12 features (state and action encoding)
+- **Hidden Layers**: Two hidden layers with 64 units each
+- **Output Layer**: A single output representing $Q(s, a)$
+
+The job of the neural network is to output $Q(s, a)$, which is the state-action value for the Lunar Lander given the input state and action.
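+
+As a concrete illustration (assuming TensorFlow/Keras is available; this is a sketch, not the exact course lab code), the architecture described above could be written as:
+
+```python
+import tensorflow as tf
+from tensorflow.keras import Sequential
+from tensorflow.keras.layers import Dense
+
+# Q-network that takes the 12-dimensional (state, one-hot action) vector
+# and outputs a single number approximating Q(s, a).
+q_network = Sequential([
+    tf.keras.Input(shape=(12,)),     # 8 state values + 4 one-hot action values
+    Dense(64, activation="relu"),
+    Dense(64, activation="relu"),
+    Dense(1, activation="linear")    # Q(s, a)
+])
+q_network.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
+q_network.summary()
+```
+
+With a mean squared error loss, fitting this network to the `(x, y)` pairs described next is ordinary supervised learning.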
+
+### Training the Neural Network
+
+We'll train the network to approximate $Q(s, a)$ using Bellman’s equation:
+
+$$
+Q(s, a) = R(s) + \gamma \cdot \max_{a'} Q(s', a')
+$$
+
+Where:
+- $R(s)$ is the reward for the state ` s `
+- $\gamma$ is the discount factor
+- $\max_{a'} Q(s', a')$ is the maximum future reward achievable from the next state ` s' `
+
+This equation is used to create training examples from the Lunar Lander simulator.
+
+### Generating the Training Set
+
+We will perform the following steps:
+1. **Interact with the environment**: Take random actions in the Lunar Lander, which gives us state-action-reward-next-state tuples ` (s, a, R(s), s') `.
+2. **Generate Training Examples**:
+ - For each tuple, the input ` x ` is ` (s, a) `, and the target ` y ` is calculated as:
+
+   $$
+   y = R(s) + \gamma \cdot \max_{a'} Q(s', a')
+   $$
+
+3. **Initial Random Q-Function**: Initially, the ` Q `-values are unknown, so the network guesses randomly. Over time, as we collect more experiences, the network will improve its estimates.
+
+### Training Process
+
+- After collecting a large number of experiences (e.g., 10,000), we create a dataset of ` (x, y) ` pairs.
+- We use supervised learning to train the network by minimizing the mean squared error loss between the predicted ` Q(s, a) ` and the target value ` y `.
+
+### Replay Buffer
+
+We store the most recent 10,000 experiences in a **replay buffer**. This allows the network to learn from a diverse set of experiences, not just the most recent ones.
+
+### Full Learning Algorithm
+
+1. **Initialize the neural network**: Randomly initialize the weights and biases.
+2. **Collect experiences**: Randomly interact with the environment to gather state-action-reward-next-state tuples.
+3. **Store experiences**: Save the most recent 10,000 experiences.
+4. **Train the network**: Create training sets from the replay buffer and train the neural network using supervised learning.
+
+Through this iterative process, the neural network will improve its approximation of the state-action value function ` Q(s, a) `, leading to better decision-making in the Lunar Lander environment.
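+
+Below is a rough sketch of the training-set construction in step 4, assuming the 12-input `q_network` from the earlier sketch and a replay `buffer` of `(s, a, reward, s_next, done)` tuples, where `s` and `s_next` are NumPy arrays of the 8 state values and `done` flags terminal transitions; these names and the helper function are illustrative, not the lab's.
+
+```python
+import numpy as np
+
+GAMMA = 0.985
+ACTION_ONE_HOT = np.eye(4)               # one-hot encodings of the 4 actions
+
+def build_training_set(buffer, q_network):
+    """Turn replay-buffer tuples into (x, y) pairs using the Bellman equation."""
+    xs, ys = [], []
+    for s, a, reward, s_next, done in buffer:
+        x = np.concatenate([s, ACTION_ONE_HOT[a]])     # input: state + one-hot action
+        if done:                                       # terminal: y = R(s)
+            y = reward
+        else:                                          # y = R(s) + gamma * max_a' Q(s', a')
+            q_next = [q_network.predict(np.concatenate([s_next, ACTION_ONE_HOT[a2]])[None, :],
+                                        verbose=0)[0, 0]
+                      for a2 in range(4)]              # one call per action (slow but clear;
+            y = reward + GAMMA * max(q_next)           #  the improved architecture below avoids this)
+        xs.append(x)
+        ys.append(y)
+    return np.array(xs), np.array(ys)
+
+# x, y = build_training_set(replay_buffer, q_network)
+# q_network.fit(x, y, epochs=1, verbose=0)             # supervised step on the targets
+```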
+
+# Improved Neural Network Architecture for DQN
+
+## Modified Neural Network Architecture
+
+Here’s the modified neural network architecture:
+
+- **Input**: 8 numbers corresponding to the state of the lunar lander.
+- **Hidden Layers**: 64 units in the first hidden layer and 64 units in the second hidden layer.
+- **Output Layer**: 4 output units.
+
+The job of the neural network is to compute simultaneously the ` Q `-values for all four possible actions in state ` s `:
+- $Q(s, \text{nothing})$
+- $Q(s, \text{left})$
+- $Q(s, \text{main})$
+- $Q(s, \text{right})$
+
+This is more efficient because, given the state ` s`, we can run inference just once and get all four of these values. We can then quickly pick the action ` a` that maximizes ` Q(s, a)`.
+
+### Bellman’s Equation Efficiency
+
+Notice also that in Bellman’s equation, there's a step in which we need to compute:
+
+$$
+R(s) + \gamma \cdot \max_{a'} Q(s', a')
+$$
+
+This neural network architecture makes it much more efficient to compute this because we get all ` Q(s', a')` values for all actions ` a'` at the same time. We can then simply pick the maximum to compute the value for the right-hand side of Bellman’s equation.
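+
+A sketch of this modified architecture (again illustrative Keras code, not the lab's exact implementation; the helper names `greedy_action` and `bellman_target` are made up for this example):
+
+```python
+import numpy as np
+import tensorflow as tf
+from tensorflow.keras import Sequential
+from tensorflow.keras.layers import Dense
+
+# Improved Q-network: input is the 8-dimensional state only; output is Q(s, a)
+# for all 4 actions (replacing the 12-input network sketched earlier).
+q_network = Sequential([
+    tf.keras.Input(shape=(8,)),
+    Dense(64, activation="relu"),
+    Dense(64, activation="relu"),
+    Dense(4, activation="linear")   # [Q(s,nothing), Q(s,left), Q(s,main), Q(s,right)]
+])
+
+def greedy_action(state):
+    """Pick the action that maximizes Q(s, a) with a single forward pass."""
+    q_values = q_network.predict(state[None, :], verbose=0)[0]
+    return int(np.argmax(q_values))
+
+def bellman_target(reward, next_state, done, gamma=0.985):
+    """y = R(s) + gamma * max_a' Q(s', a'), using one forward pass for the max."""
+    if done:
+        return reward
+    q_next = q_network.predict(next_state[None, :], verbose=0)[0]
+    return reward + gamma * float(np.max(q_next))
+```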
+
+# $\epsilon$-greedy policy
+
+In the learning algorithm we developed, you need to take actions in the lunar lander even while you're still learning how to approximate `Q(s,a)`. How do you pick those actions while you're still learning? The most common way is to use something called an epsilon-greedy policy. Let's take a look at how that works.
+
+### The Algorithm
+One of the steps in the algorithm is to take actions in the lunar lander. While the learning algorithm is still running, we don't really know what the best action to take in every state is. So how do we take actions in this step of the learning algorithm? Let's look at some options.
+
+- #### Option 1
+When you're in some state `s`, we might not want to take actions totally at random, because a random action will often be a bad one. One natural option is to always pick the action `a` that maximizes `Q(s,a)`. Even if `Q(s,a)` is not yet a great estimate of the Q-function, we just do our best with our current guess. It turns out this may work okay, but it isn't the best option.
+
+- #### Option 2
+Instead, here's what is commonly done: most of the time, say with probability $0.95$, pick the action that maximizes `Q(s,a)` using our current guess of the Q-function; but a small fraction of the time, say $5\%$ of the time, pick an action `a` completely at random.
+
+#### Why Do We Want to Occasionally Pick an Action Randomly?
+- Well, here's why. Suppose that, because `Q(s,a)` was initialized randomly, the learning algorithm thinks that firing the main thruster is never a good idea; maybe the neural network parameters were initialized so that `Q(s, main)` is always very low. Then, because the algorithm always picks the action `a` that maximizes `Q(s,a)`, it will never try firing the main thruster, and so it will never learn that firing the main thruster is actually sometimes a good idea.
+
+- Under option 2, on every step, we have some small probability of trying out different actions so that the neural network can learn to overcome its own possible preconceptions about what might be a bad idea that turns out not to be the case.
+
+- This idea of picking actions randomly is sometimes called an exploration step. Because we're going to try out something that may not be the best idea, but we're going to just try out some action in some circumstances, explore and learn more about an action in the circumstance where we may not have had as much experience before.
+
+- Taking the action that maximizes `Q(s,a)` is sometimes called a greedy action, because we're trying to maximize our return by picking it. In the reinforcement learning literature you'll also hear this called an exploitation step. (Exploitation doesn't sound like a good thing, nobody should ever exploit anyone else, but historically this is the term used in reinforcement learning: exploit everything we've learned so far to do the best we can.)
+
+- In the reinforcement learning literature, you'll sometimes hear people talk about the exploration versus exploitation trade-off, which refers to how often you take actions randomly (or actions that may not be the best) in order to learn more, versus trying to maximize your return by taking the action that maximizes `Q(s,a)`.
+
+- This approach (option 2) is called an epsilon-greedy policy, where here epsilon = $0.05$ is the probability of picking an action randomly. This is the most common way to make your reinforcement learning algorithm explore a little bit, even while mostly taking greedy actions.
+
+### Start with High Epsilon
+Lastly, one trick that's sometimes used in reinforcement learning is to start with epsilon high, so that initially you take random actions a lot of the time, and then gradually decrease it. Over time you become less likely to act randomly and more likely to use your improving estimate of the `Q`-function to pick good actions. For example, in the lunar lander exercise, you might start with epsilon very high, maybe even $\epsilon = 1.0$ (picking actions completely at random), and then gradually decrease it all the way down to, say, $0.01$, so that eventually you take greedy actions $99\%$ of the time and act randomly only $1\%$ of the time.
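+
+A small sketch of this epsilon-greedy action choice with a decaying epsilon, reusing the 4-output `q_network` from the earlier sketch (the decay schedule and constants here are illustrative assumptions):
+
+```python
+import numpy as np
+
+epsilon = 1.0           # start fully random
+EPSILON_MIN = 0.01
+EPSILON_DECAY = 0.995   # applied after every episode (an assumed schedule)
+
+def epsilon_greedy_action(state, q_network, epsilon):
+    """With probability epsilon act randomly (explore), otherwise act greedily (exploit)."""
+    if np.random.rand() < epsilon:
+        return np.random.randint(4)                       # one of the 4 lander actions
+    q_values = q_network.predict(state[None, :], verbose=0)[0]
+    return int(np.argmax(q_values))                       # greedy action
+
+# After each episode, decay epsilon toward its minimum:
+epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
+```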
\ No newline at end of file