Solving FrozenLake with Markov Decision Processes and Value Iteration

August 10, 2025 · Jonesh Shrestha

📌TL;DR

Modeled the FrozenLake 8x8 environment as a Markov Decision Process (MDP) and solved it with value iteration, achieving a 100% success rate versus 0.88% for the best random policy, a 113x improvement. The best random policy reached the goal in only 88 of 10,000 episodes with high variance, while the optimal policy was perfectly consistent (0 std dev) across all episodes, strategically navigating along the grid edges to avoid holes. Value iteration converged in roughly 30-50 iterations, demonstrating the power of dynamic programming in reinforcement learning.

Reinforcement learning can seem abstract when you first learn about Bellman equations and value functions. In this tutorial, I'll show you how to implement a complete solution to the FrozenLake environment using Markov Decision Processes (MDP), comparing random policy evaluation with optimal value iteration.

Understanding the Problem

FrozenLake is a classic reinforcement learning environment where an agent must navigate across a frozen lake from a starting position to a goal while avoiding holes in the ice. The catch? The ice is slippery, making the environment stochastic. When you choose an action, you might slide in an unintended direction.

The 8x8 grid gives us:

  • 64 states (each grid cell)
  • 4 actions (Left=0, Down=1, Right=2, Up=3)
  • Stochastic transitions (slippery=True means actions are unreliable)
  • Sparse rewards (only receive reward when reaching the goal)

This environment perfectly demonstrates the challenges of sequential decision-making under uncertainty.

Project Overview

The project implements two approaches:

  1. Random Policy Evaluation: Generate random policies and find the best performers through statistical analysis
  2. Value Iteration: Use the Bellman optimality equation to compute the optimal policy mathematically

Comparing these approaches shows the power of MDP algorithms over trial and error.

Setting Up the Environment

First, install the required libraries:

pip install gymnasium numpy matplotlib

The project uses Gymnasium (formerly OpenAI Gym) for the FrozenLake environment, numpy for numerical computations, and matplotlib for visualizations.
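
The snippets throughout this post assume the standard imports at the top of the script. The import lines aren't shown in the original code, so treat the following as the conventional setup:

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt  # only needed for the plots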

Creating the Environment

def create_frozen_lake_env(render_mode="ansi"):
    env = gym.make(
        "FrozenLake-v1",
        desc=None,
        map_name="8x8",
        is_slippery=True,
        render_mode=render_mode,
    )
    nS = env.observation_space.n  # 64 states
    nA = env.action_space.n  # 4 actions
    return env, nS, nA

The render mode controls visualization. Use "ansi" for text output, "human" for visual simulation, or "rgb_array" for recording videos.
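
As a quick sanity check, here is a minimal usage sketch (not part of the project code) that creates the environment and prints the grid with the text renderer:

env, nS, nA = create_frozen_lake_env(render_mode="ansi")
obs, info = env.reset(seed=42)
print(env.render())  # prints the 8x8 grid; S = start, F = frozen, H = hole, G = goal
print(nS, nA)        # 64 4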

Part 1: Random Policy Evaluation

The first approach generates random policies and evaluates them through extensive experimentation.

Generating Random Policies

A policy is simply a mapping from states to actions. For 64 states and 4 possible actions:

def generate_random_policy(nA, nS, seed=None):
    if seed is not None:
        np.random.seed(seed)

    # For each state, randomly choose an action
    policy = np.random.randint(0, nA, size=nS)
    return policy

Using different seeds ensures we explore a diverse set of policies.
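
For example, a small sketch that builds a batch of candidate policies, one per seed:

# Ten candidate policies; each maps all 64 states to an action in {0, 1, 2, 3}
candidate_policies = [generate_random_policy(nA, nS, seed=s) for s in range(10)]
print(candidate_policies[0][:8])  # actions chosen for the first row of the grid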

Evaluating Policies

def evaluate_policy(env, policy, num_experiments=100,
                    num_episodes=10000, progress_frequency=20):
    goals_list = []
    steps_list = []

    for experiment in range(num_experiments):
        if experiment % progress_frequency == 0:
            print(f"Running experiment {experiment+1}/{num_experiments}...")

        goals, holes, total_rewards, total_goal_steps = run_one_experiment(
            env, policy, num_episodes
        )
        goals_list.append(goals)

        avg_steps_to_goal = (total_goal_steps / goals) if (goals > 0) else 0
        steps_list.append(avg_steps_to_goal)

    mean_goals = np.mean(goals_list)
    std_goals = np.std(goals_list)
    mean_steps = np.mean(steps_list)

    return goals_list, steps_list, mean_goals, std_goals, mean_steps

This function runs 100 experiments with 10,000 episodes each, collecting statistics on:

  • How many times the goal is reached
  • Average steps needed to reach the goal
  • Variability across experiments

The large number of episodes (10,000 per experiment) yields statistically reliable estimates despite the environment's stochasticity.
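
The helper run_one_experiment isn't shown in the post. Based on the values that evaluate_policy unpacks from it, a minimal reconstruction might look like the following (a sketch under those assumptions, not the author's exact code):

def run_one_experiment(env, policy, num_episodes):
    goals = holes = 0
    total_rewards = 0.0
    total_goal_steps = 0

    for _ in range(num_episodes):
        state, _ = env.reset()
        steps = 0
        while True:
            state, reward, terminated, truncated, _ = env.step(policy[state])
            steps += 1
            total_rewards += reward
            if terminated or truncated:
                if reward > 0:        # reaching the goal yields reward 1
                    goals += 1
                    total_goal_steps += steps
                elif terminated:      # terminated without reward = fell into a hole
                    holes += 1
                break

    return goals, holes, total_rewards, total_goal_steps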

Random Policy Results

Random policies typically reach the goal in less than 1% of episodes, with high variance. The best random policy in my experiments succeeded in only 88 out of 10,000 episodes (a 0.88% success rate), with a standard deviation of 2.46 successes across experiments and an average of 81.1 steps to reach the goal on successful episodes.

This demonstrates a key limitation: random exploration is extremely inefficient for complex environments.

Part 2: Value Iteration for Optimal Policy

Value iteration uses dynamic programming to compute the optimal policy directly.

The Bellman Optimality Equation

The core idea is that the value of a state equals the maximum expected return achievable from that state:

V(s) = max_a [ sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s')) ]

Where:

  • V(s) is the value of state s
  • a is an action
  • s' is a successor state
  • P(s'|s,a) is the transition probability
  • R(s,a,s') is the reward
  • gamma is the discount factor
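
Gymnasium's FrozenLake exposes this model directly: env.unwrapped.P[s][a] is a list of (probability, next_state, reward, terminated) tuples, which is exactly what the implementation below iterates over. A quick inspection sketch:

env, nS, nA = create_frozen_lake_env()
# Transitions from the start state (s=0) when choosing Right (a=2).
# With is_slippery=True, the intended move and the two perpendicular moves
# each occur with probability 1/3.
for prob, next_state, reward, terminated in env.unwrapped.P[0][2]:
    print(prob, next_state, reward, terminated)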

Implementing Value Iteration

def part_two():
    env, nS, nA = create_frozen_lake_env()
    discount_rate_gamma = 1.0
    convergence_threshold_theta = 1e-4

    # Initialize value functions
    cur_value_function = np.zeros(nS)
    prev_value_function = np.zeros(nS)

    iteration_count = 0
    while True:
        iteration_count += 1
        value_function_diff_delta = 0

        # For each state
        for s in range(nS):
            Q_sa = []

            # For each action
            for a in range(nA):
                q = 0
                for prob, next_state, reward, done in env.unwrapped.P[s][a]:
                    # Bellman optimality equation
                    q += prob * (
                        reward + discount_rate_gamma * prev_value_function[next_state]
                    )
                Q_sa.append(q)

            # Update value with best action
            cur_value_function[s] = max(Q_sa)

            # Track maximum change
            value_function_diff_delta = max(
                value_function_diff_delta,
                abs(cur_value_function[s] - prev_value_function[s])
            )

        # Check convergence
        if value_function_diff_delta < convergence_threshold_theta:
            print(
                f"Value iteration converged after {iteration_count} iterations "
                f"with delta = {value_function_diff_delta:.6f}"
            )
            break

        # Update previous value function
        prev_value_function = cur_value_function.copy()

This iterative process continues until the value function stabilizes (changes less than the threshold).
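
The loop above computes the optimal value function but stops short of extracting the policy itself. A standard follow-up step (shown here as a sketch; the project's part_two() presumably does something equivalent) is a greedy extraction with respect to the converged values:

def extract_policy(env, value_function, nS, nA, gamma=1.0):
    """Greedy policy: pick the action with the highest expected value in each state."""
    policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        Q_sa = np.zeros(nA)
        for a in range(nA):
            for prob, next_state, reward, terminated in env.unwrapped.P[s][a]:
                Q_sa[a] += prob * (reward + gamma * value_function[next_state])
        policy[s] = int(np.argmax(Q_sa))
    return policy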

Optimal Policy Results

The optimal policy dramatically outperforms random policies:

  • Success Rate: 100% of episodes reach the goal (10,000 out of 10,000 episodes) compared to 0.88% for the best random policy
  • Path Length: Average of 117 steps to goal (versus 81.1 for the rare successful random episodes)
  • Consistency: Zero standard deviation (perfect consistency across all episodes)

This represents over a 113x improvement in success rate, demonstrating the remarkable power of systematic optimization over random search. Interestingly, the optimal policy takes more steps on average (117 vs 81.1) because it prioritizes safety over speed, navigating along the edges of the grid to completely avoid holes. This trade-off is worthwhile given the dramatic improvement in reliability from 0.88% to 100% success.
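
One way to see this edge-hugging behavior is to print the extracted policy as arrows on the 8x8 grid. The sketch below assumes the converged value function from part_two() is available as cur_value_function and reuses the extract_policy helper sketched earlier:

policy = extract_policy(env, cur_value_function, nS, nA)
arrows = {0: "<", 1: "v", 2: ">", 3: "^"}  # Left, Down, Right, Up
for row in np.array([arrows[a] for a in policy]).reshape(8, 8):
    print(" ".join(row))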

Key Insights

Why Value Iteration Works

Value iteration succeeds because it:

  1. Considers all possible futures: Evaluates expected outcomes across all possible transitions
  2. Propagates information: Good states increase the value of nearby states
  3. Finds global optima: Considers long-term consequences, not just immediate rewards
  4. Handles stochasticity: Properly accounts for transition probabilities

The Exploration-Exploitation Trade-off

Random policies explore indiscriminately. The optimal policy balances:

  • Exploration: Sometimes moving away from the goal to avoid holes
  • Exploitation: Taking the most direct path when safe

This balance emerges naturally from the value function, which encodes both safety and efficiency.

Impact of Slipperiness

The stochastic transitions make the problem much harder for random policies. However, the optimal policy can achieve 100% success rate by finding paths that avoid holes completely, even with the uncertainty in movement. The key insights are:

  • Actions don't always execute as intended due to slipperiness
  • The optimal policy strategically chooses paths along the edges of the grid
  • By avoiding proximity to holes, the policy ensures safety even when slipping occurs
  • The specific layout of holes in the 8x8 FrozenLake map allows for completely safe routes

This demonstrates that careful planning can overcome stochastic challenges in environments with favorable configurations.

Comparing Approaches

| Aspect | Best Random Policy | Optimal Policy (Value Iteration) |
| --- | --- | --- |
| Success Rate | 0.88% (88/10,000) | 100% (10,000/10,000) |
| Average Steps to Goal | 81.1 | 117 |
| Consistency (Std Dev) | 2.46 (high variance) | 0.00 (perfect consistency) |
| Computation Time | None (random) | ~0.5 seconds |
| Episodes to Evaluate | 1,000,000 (100 x 10K) | 1,000,000 (100 x 10K) |

Value iteration finds a dramatically better policy (113x improvement) in under a second, compared to the nearly useless random approach.

Practical Applications

While FrozenLake is a toy problem, MDP techniques apply to real-world scenarios:

  • Robotics: Path planning with uncertain motion
  • Resource Management: Optimizing inventory under demand uncertainty
  • Healthcare: Treatment planning with uncertain patient responses
  • Finance: Portfolio optimization with market uncertainty
  • Game AI: Strategic decision-making with incomplete information

The key insight is that any sequential decision problem under uncertainty can be formulated as an MDP.

Conclusion

This project demonstrates the fundamental concepts of Markov Decision Processes and dynamic programming in reinforcement learning. By comparing random policy evaluation with value iteration, we see the dramatic advantage of systematic optimization.

Key takeaways:

  1. MDPs model sequential decision-making under uncertainty effectively
  2. Value iteration finds optimal policies through iterative improvement
  3. The Bellman equation provides mathematical foundations for optimization
  4. Stochastic environments require statistical evaluation across many episodes
  5. Optimal policies vastly outperform random exploration

The optimal policy achieved a perfect 100% success rate with zero standard deviation across all 10,000 episodes, compared to just 0.88% for the best random policy, a 113x improvement. It does so by strategically choosing paths that stay clear of holes even in a stochastic environment, demonstrating the power of reinforcement learning algorithms for solving complex sequential decision problems.

Value iteration and related algorithms form the foundation for modern reinforcement learning, including deep Q-learning and policy gradient methods that have achieved superhuman performance in games and robotics.


📓 Interactive Notebook

Explore the complete implementation in the accompanying Jupyter notebook. Note: the underlying project is organized as Python scripts; check the repository for the complete code and requirements.