
Deep RL for Traffic Signal Control

Python · PyTorch · Reinforcement Learning

We thought traffic lights in America weren't smart, so we tried to make them think for once.

The Problem

The Texas A&M Transportation Institute reports that traffic light delays account for 12-55% of total commute time. Despite affecting millions of people daily, most cities still rely on static, timer-based systems that cannot respond to real-time traffic conditions. Intelligent Transportation Systems (ITS) do exist, but they rely on predefined logic that adapts poorly to special events and typically optimises each intersection in isolation rather than the network as a whole.

Our team recognised both the societal impact and the gap in current solutions, setting out to build a traffic simulation powered by reinforcement learning to optimise grid-wide operations dynamically.

Technical Approach

Environment Setup

We configured a realistic traffic environment using CityFlow, a microscopic traffic simulator, in which we defined intersection layouts, traffic routes, and vehicle behaviour. The simulation was packaged in Docker to ensure reproducibility and clean integration with CityFlow's control API.
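For illustration, a minimal Gym-style wrapper around CityFlow's Python API might look like the sketch below. The config path, intersection ID, phase count, action duration, and horizon are placeholder assumptions rather than the exact values from our setup:

```python
import numpy as np
import cityflow


class TrafficEnv:
    """Thin wrapper that exposes CityFlow as a step/reset environment."""

    def __init__(self, config_path="config.json",
                 intersection_id="intersection_1_1",
                 num_phases=8, steps_per_action=10, horizon=3600):
        self.eng = cityflow.Engine(config_path, thread_num=1)
        self.intersection_id = intersection_id
        self.num_phases = num_phases
        self.steps_per_action = steps_per_action
        self.horizon = horizon

    def reset(self):
        self.eng.reset()
        return self._observe()

    def step(self, phase):
        # Hold the chosen signal phase for a fixed number of simulation steps.
        self.eng.set_tl_phase(self.intersection_id, phase)
        for _ in range(self.steps_per_action):
            self.eng.next_step()
        obs = self._observe()
        reward = -float(obs.sum())  # count-based reward; see "Reward Function Design"
        done = self.eng.get_current_time() >= self.horizon
        return obs, reward, done

    def _observe(self):
        # Use per-lane queue lengths as the state vector.
        waiting = self.eng.get_lane_waiting_vehicle_count()
        return np.array([waiting[lane] for lane in sorted(waiting)],
                        dtype=np.float32)
```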

The environment features:

Deep Q-Network Architecture

We implemented a DQN agent that approximates the Q-function using a neural network:

Q-Value Computation:

Q(s, a) = r + γ · max_{a'} Q(s', a')

where r is the immediate reward for taking action a in state s, γ is the discount factor, s' is the resulting next state, and the maximum is taken over the actions a' available in s'.

Network Components:

Key Hyperparameters:
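To make the architecture concrete, here is a minimal PyTorch sketch of the Q-network and the Bellman target it is trained against. The layer widths and discount factor shown are illustrative defaults, not necessarily the exact hyperparameters we used:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """MLP mapping a state vector to one Q-value per signal phase."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)


def bellman_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # y = r + γ · max_{a'} Q_target(s', a'), with bootstrapping disabled on terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q
```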

Reward Function Design

We experimented with three reward strategies:

1. Pressure-Based Reward. This strategy minimises traffic congestion by penalising queue imbalances:

P = Σ_l ( q_l - Σ_{l'} q_{l'} )

where q_l is the queue length on incoming lane l and the inner sum runs over the successor (outgoing) lanes l' of l.

Advantages: Encourages realistic, responsive control and minimises incoming/outgoing traffic differences
Limitations: May cause long queues in anticipation of future rewards
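As a sketch, the pressure term can be computed directly from CityFlow's per-lane queue counts, assuming a successors mapping (incoming lane to its outgoing lanes) built from the road-network file; the function and mapping names here are hypothetical:

```python
def pressure(queue, successors):
    """Pressure of a single intersection.

    queue:      dict lane_id -> waiting-vehicle count, e.g. from
                eng.get_lane_waiting_vehicle_count()
    successors: dict mapping each incoming lane to the list of its
                successor (outgoing) lanes, derived from the roadnet file
    """
    return sum(
        queue[lane] - sum(queue[nxt] for nxt in successors[lane])
        for lane in successors
    )
```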

2. Count-Based Reward. This strategy directly penalises the total number of waiting vehicles:

R_count = -Σ_l q_l

Advantages: Simple and effective at reducing current queue lengths
Limitations: May lead to bottlenecking in other network regions

3. Combined Count + Pressure (Our Final Approach). This strategy balances immediate queue reduction with long-term flow management:

R = α · R_count + (1 - α) · R_pressure, with α = 0.5

This hybrid approach reduces long waiting lines while preventing downstream bottlenecks.
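A sketch of the combined reward, reusing the pressure function above, could look as follows. The formula above does not fix a sign convention for the pressure term, so using -|P| here is an assumption:

```python
def count_reward(queue, incoming_lanes):
    # R_count = -Σ_l q_l over the intersection's incoming lanes.
    return -sum(queue[lane] for lane in incoming_lanes)


def combined_reward(queue, successors, alpha=0.5):
    # R = α · R_count + (1 - α) · R_pressure, with α = 0.5.
    r_count = count_reward(queue, successors.keys())
    r_pressure = -abs(pressure(queue, successors))  # sign convention assumed
    return alpha * r_count + (1 - alpha) * r_pressure
```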

Training Process

The DQN training loop follows these steps:

  1. Reset environment and initialise state s_0
  2. For each timestep t:
    • Select action a_t using an ε-greedy policy (10% random exploration, 90% greedy with respect to the Q-network)
    • Apply action and observe next state s_{t+1}, reward r_t, and done flag d_t
    • Store transition (s_t, a_t, r_t, s_{t+1}, d_t) in replay buffer
    • Sample random batch from buffer
    • Update Q-network using Bellman equation
    • Periodically sync target network weights
  3. Record episode metrics (total reward, duration)
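A condensed version of this loop, reusing the environment, network, target-computation, and replay-buffer sketches from this post, might look like the following. The episode count, batch size, γ, and target-sync interval are illustrative; only ε = 0.1 comes from the description above:

```python
import random
import torch
import torch.nn.functional as F


def train(env, q_net, target_net, buffer, optimizer, num_actions,
          episodes=200, batch_size=64, gamma=0.99, eps=0.1, sync_every=500):
    step = 0
    for episode in range(episodes):
        state, done, total_reward = env.reset(), False, 0.0
        while not done:
            # ε-greedy action selection: explore with probability eps.
            if random.random() < eps:
                action = random.randrange(num_actions)
            else:
                with torch.no_grad():
                    q = q_net(torch.as_tensor(state).unsqueeze(0))
                    action = int(q.argmax())

            next_state, reward, done = env.step(action)
            buffer.push(state, action, reward, next_state, float(done))
            state, total_reward = next_state, total_reward + reward

            # Learn from a random minibatch of past transitions.
            if len(buffer) >= batch_size:
                s, a, r, s2, d = buffer.sample(batch_size)
                targets = bellman_targets(target_net, r, s2, d, gamma)
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = F.smooth_l1_loss(q_sa, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # Periodically copy the online weights into the target network.
            step += 1
            if step % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())

        print(f"episode {episode}: total reward {total_reward:.1f}")
```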

Experience Replay Benefits:
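The buffer itself can be a simple fixed-capacity deque with uniform sampling, as in the sketch below (the capacity value is an assumption):

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions with uniform sampling."""

    def __init__(self, capacity=50_000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.as_tensor(np.stack(states), dtype=torch.float32),
                torch.as_tensor(actions, dtype=torch.int64),
                torch.as_tensor(rewards, dtype=torch.float32),
                torch.as_tensor(np.stack(next_states), dtype=torch.float32),
                torch.as_tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.memory)
```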

Results

Our DQN agent successfully learned effective policies for traffic optimisation:

Performance Metrics:

Both the Count-Based and Combined Pressure-Count reward functions performed strongly, reducing congestion and achieving the target clearance time.

Technical Implementation

Technology Stack:

State Representation:

Action Space:

GitHub Repository: Traffic_RL2