We thought traffic lights in America weren't smart, so we tried to make them think for once.
The Texas A&M Transportation Institute reports that traffic light delays account for 12-55% of total commute time. Despite affecting millions of commuters daily, most cities still rely on static, timer-based systems that cannot respond to real-time traffic conditions. While some Intelligent Transportation Systems (ITS) exist, they rely on predefined logic that fails to adapt during special events and often optimise each intersection locally rather than coordinating across multiple intersections.
Our team recognised both the societal impact and the gap in current solutions, setting out to build a traffic simulation powered by reinforcement learning to optimise grid-wide operations dynamically.
We configured a realistic traffic environment using CityFlow, a microscopic traffic simulator, in which we defined intersection layouts, traffic routes, and vehicle behaviour. The simulation was packaged in Docker to ensure reproducibility and proper integration with CityFlow's control API.
The environment features multiple connected intersections, so the agent can coordinate signals across the grid rather than optimising a single junction.
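As a rough illustration of how such an environment is driven from Python, here is a minimal sketch using CityFlow's engine API; the config file name, intersection id, and phase index are placeholders rather than values from our actual setup.

```python
# Minimal sketch of driving a CityFlow simulation from Python.
# File names, the intersection id, and the phase index are placeholders.
import cityflow

# The config file points CityFlow at the road network and flow definitions
# and (via "rlTrafficLight") hands signal control over to the agent.
engine = cityflow.Engine("config.json", thread_num=1)

for step in range(10):
    # Queue length per lane, e.g. used later for the reward signal.
    waiting = engine.get_lane_waiting_vehicle_count()  # {lane_id: queued vehicles}

    # Pick a signal phase for one intersection (id and phase are illustrative).
    engine.set_tl_phase("intersection_1_1", step % 4)

    engine.next_step()  # advance the simulation by one tick
```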
We implemented a DQN agent that approximates the Q-function using a neural network:
Q-Value Computation:
Q(s, a) = r + γ max Q(s', a')
Where:
- r is the immediate reward for the current state and action
- max Q(s', a') is the estimated maximum future reward from the next state
- γ = 0.99 is the discount factor for future rewards

Network Components:
- Experience replay buffer storing transitions (s, a, r, s', d), where d indicates episode termination

Key Hyperparameters:
- Discount factor γ = 0.99
- Exploration rate ε = 0.1 (10% of actions taken at random)
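Putting the Q-value target and the replay tuple together, a minimal sketch of the value network and Bellman target might look like the following; PyTorch and the layer sizes are assumptions, since only the discount factor γ = 0.99 is fixed above.

```python
# Sketch of the Q-network and the Bellman target above. PyTorch and the layer
# sizes are assumptions; only γ = 0.99 comes from the write-up.
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor for future rewards


class QNetwork(nn.Module):
    """Approximates Q(s, ·) with a small fully connected network."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        # Two hidden layers are illustrative; the original architecture isn't specified.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # one Q-value per action


def td_target(q_net: QNetwork, r: torch.Tensor, s_next: torch.Tensor,
              d: torch.Tensor) -> torch.Tensor:
    """Target Q(s, a) = r + γ · max_a' Q(s', a'), cut off when the episode has ended."""
    with torch.no_grad():
        max_next_q = q_net(s_next).max(dim=1).values
    return r + GAMMA * max_next_q * (1.0 - d)
```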
We experimented with three reward strategies:
1. Pressure-Based Reward

Minimises traffic congestion by penalising queue imbalances:
P = Σ (q_l - Σ q_l')
Where q_l is the queue length on incoming lane l, and q_l' is the queue length on each successor lane l'.
Advantages: Encourages realistic, responsive control and minimises incoming/outgoing traffic differences
Limitations: May cause long queues in anticipation of future rewards
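A small sketch of how this pressure term can be computed from per-lane queue lengths is shown below; the two-dict interface, the lane ids, and the sign convention (reward = negative pressure, so lower imbalance means higher reward) are our own assumptions.

```python
# Sketch of the pressure term from per-lane queue lengths. The interface,
# the lane ids, and the sign convention are assumptions.
def pressure_reward(incoming_queues: dict[str, int],
                    downstream_queues: dict[str, int],
                    successors: dict[str, list[str]]) -> float:
    """Negative of P = Σ_l (q_l - Σ q_l'), so that smaller imbalance = higher reward."""
    pressure = sum(
        q - sum(downstream_queues.get(s, 0) for s in successors.get(lane, []))
        for lane, q in incoming_queues.items()
    )
    return -float(pressure)


# Example: two incoming lanes feeding one downstream lane.
incoming = {"in_north": 5, "in_south": 3}
downstream = {"out_east": 2}
succ = {"in_north": ["out_east"], "in_south": ["out_east"]}
print(pressure_reward(incoming, downstream, succ))  # -((5 - 2) + (3 - 2)) = -4.0
```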
2. Count-Based Reward

Directly penalises the total number of waiting vehicles:
R_count = -Σ q_l
Advantages: Simple and effective at reducing current queue lengths
Limitations: May lead to bottlenecking in other network regions
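The count-based reward is a one-liner over the same queue data; the dict mirrors what CityFlow's get_lane_waiting_vehicle_count() returns, but treat the interface as a sketch.

```python
# The count-based reward is just the negative sum of queued vehicles on incoming lanes.
def count_reward(incoming_queues: dict[str, int]) -> float:
    """R_count = -Σ q_l over the incoming lanes."""
    return -float(sum(incoming_queues.values()))


print(count_reward({"in_north": 5, "in_south": 3}))  # -8.0
```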
3. Combined Count + Pressure (Our Final Approach)

Balances immediate queue reduction with long-term flow management:
R = α · R_count + (1 - α) · R_pressure, with α = 0.5
This hybrid approach reduces long waiting lines while preventing downstream bottlenecks.
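Combining the two terms with α = 0.5 gives the final reward; the sketch below keeps the same illustrative interface as the count and pressure sketches above.

```python
# Sketch of the combined reward; α = 0.5 comes from the write-up, the interface
# matches the count and pressure sketches above and is otherwise an assumption.
ALPHA = 0.5


def combined_reward(incoming_queues: dict[str, int],
                    downstream_queues: dict[str, int],
                    successors: dict[str, list[str]]) -> float:
    """R = α · R_count + (1 - α) · R_pressure."""
    r_count = -float(sum(incoming_queues.values()))          # penalise waiting vehicles
    r_pressure = -float(sum(                                  # penalise queue imbalance
        q - sum(downstream_queues.get(s, 0) for s in successors.get(lane, []))
        for lane, q in incoming_queues.items()
    ))
    return ALPHA * r_count + (1 - ALPHA) * r_pressure
```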
The DQN training loop follows these steps:
Initialise the state s_0. Then, for each timestep t:

1. Select an action a_t using the ε-greedy policy (10% random, 90% DQN-based)
2. Observe the next state s_{t+1}, the reward r_t, and the done flag d_t
3. Store the transition (s_t, a_t, r_t, s_{t+1}, d_t) in the replay buffer
4. Sample a mini-batch from the buffer and update the Q-network (see the sketch after this list)

Experience Replay Benefits:
- Breaks the correlation between consecutive transitions
- Reuses past experience, improving sample efficiency
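A condensed sketch of this loop is shown below; only ε = 0.1 and γ = 0.99 come from the write-up, while the environment interface (reset/step), network shape, batch size, and learning rate are placeholders.

```python
# Condensed sketch of the training loop: ε-greedy action selection, replay storage,
# and mini-batch Q-learning updates. Only ε = 0.1 and γ = 0.99 are from the write-up.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

EPSILON, GAMMA = 0.1, 0.99
BATCH_SIZE, BUFFER_SIZE, STATE_DIM, N_ACTIONS = 64, 50_000, 16, 8  # illustrative

q_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                      nn.Linear(128, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer: deque = deque(maxlen=BUFFER_SIZE)


def select_action(state: torch.Tensor) -> int:
    """ε-greedy: 10% random exploration, 90% greedy w.r.t. the current Q-network."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())


def train_episode(env) -> None:
    """One episode; env is assumed to expose reset() -> s and step(a) -> (s', r, done)."""
    state = env.reset()                               # s_0
    done = False
    while not done:
        action = select_action(torch.as_tensor(state, dtype=torch.float32))
        next_state, reward, done = env.step(action)   # observe s_{t+1}, r_t, d_t
        replay_buffer.append((state, action, reward, next_state, float(done)))
        state = next_state

        if len(replay_buffer) >= BATCH_SIZE:          # learn from a random mini-batch
            batch = random.sample(replay_buffer, BATCH_SIZE)
            s, a, r, s2, d = (torch.as_tensor(x, dtype=torch.float32)
                              for x in map(list, zip(*batch)))
            with torch.no_grad():                     # target: r + γ · max_a' Q(s', a')
                target = r + GAMMA * q_net(s2).max(dim=1).values * (1.0 - d)
            q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_sa, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```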
Our DQN agent successfully learned effective policies for traffic optimisation:
Performance Metrics:
Both the Count-Based and the Combined Pressure-Count reward functions demonstrated strong performance, reducing congestion and achieving the target clearance time.
Technology Stack:
- CityFlow (microscopic traffic simulation)
- Docker (reproducible simulation environment)
- Deep Q-Network (DQN) agent with a neural-network Q-function approximator
State Representation:
Action Space:
GitHub Repository: Traffic_RL2