Overcooked Project

Project Overview

  • Implemented PPO and applied it as IPPO (Independent PPO) for multi-agent reinforcement learning
    • Demonstrated competitive performance
  • Focused on generalization to unseen layouts:
    • Trained agents on 5 training layouts.
    • Evaluated generalization on 2 unseen layouts.
  • Built with JAX/Flax to achieve high throughput for acting and learning steps
    • Enabled feasible runtimes for hyperparameter optimization
  • Conducted an Optuna search to identify the best-performing PPO hyperparameters, state encoding, and architecture (sketched below)
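A minimal sketch of what such a search can look like with Optuna; the search space, value ranges, and the train_and_eval entry point are illustrative assumptions, not the project's actual code:

```python
import optuna

def train_and_eval(lr, clip_eps, ent_coef, encoding):
    # Hypothetical stand-in for the real IPPO training run: train on the
    # 5 training layouts and return a scalar evaluation score.
    raise NotImplementedError

def objective(trial):
    # Assumed search space; parameter names and ranges are illustrative.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    clip_eps = trial.suggest_float("clip_eps", 0.1, 0.3)
    ent_coef = trial.suggest_float("ent_coef", 1e-4, 1e-1, log=True)
    encoding = trial.suggest_categorical("encoding", ["featurized", "lossless"])
    return train_and_eval(lr, clip_eps, ent_coef, encoding)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```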

Sparse reward is not enough

  • Reward is given only when a soup is delivered
  • Soup delivery requires an orchestration of intermediate steps
    • Problem: it is difficult for agents to learn which intermediate actions contribute to the final delivery (a credit-assignment problem)
  • Solution: Implemented a Gymnasium wrapper that uses the info dictionary for reward shaping (see the sketch after this list)
    • “Babysits” the agents to incentivize actions that increase the probability of delivery
  • Early reward hacking is transient
    • The agent pair eventually learns that delivering soups maximizes total reward.
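A minimal sketch of such a wrapper, assuming the underlying environment reports intermediate events through the info dict; the event keys and weights here are illustrative assumptions, not the project's exact values:

```python
import gymnasium as gym

class ShapedRewardWrapper(gym.Wrapper):
    # Assumed event keys and weights; the real info-dict keys and the
    # tuned weights in the project may differ.
    SHAPING = {
        "onion_in_pot": 1.0,   # progress toward cooking a soup
        "soup_in_dish": 3.0,   # progress toward delivering it
        "dish_pickup": 0.5,    # enables plating
    }

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Add a bonus for each intermediate event reported this step,
        # nudging the pair toward action chains that end in a delivery.
        shaped = sum(w for key, w in self.SHAPING.items() if info.get(key))
        return obs, reward + shaped, terminated, truncated, info
```

Because the bonuses only reward steps that raise the probability of a delivery, the sparse delivery reward still dominates once the pair delivers reliably.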

State Representation & Architectures

Featurized Encoding

  • Hand-crafted features
  • Produces a 96-dimensional vector (assumes 2 pots)
  • Encodes:
    • Agent-centric features (position, direction)
    • Pot states
    • Relative distances to key entities
  • Problems:
    • Spatial geometry and any information not captured by the hand-crafted features are discarded
    • Increasing the number of pots increases the feature dimension
  • Processed via an MLP
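For concreteness, a minimal Flax sketch of this torso; the hidden widths and the two-headed actor-critic layout are assumptions (n_actions=6 matches Overcooked's action set: four moves, stay, interact):

```python
import flax.linen as nn
import jax.numpy as jnp

class FeaturizedMLP(nn.Module):
    """Actor-critic MLP over the 96-dim featurized observation."""
    hidden: int = 64      # assumed width
    n_actions: int = 6

    @nn.compact
    def __call__(self, x):                     # x: (96,)
        x = nn.relu(nn.Dense(self.hidden)(x))
        x = nn.relu(nn.Dense(self.hidden)(x))
        logits = nn.Dense(self.n_actions)(x)   # policy head
        value = nn.Dense(1)(x)                 # value head on shared torso
        return logits, jnp.squeeze(value, -1)
```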

Lossless Encoding

  • Raw state
  • \(H\times W\times C\) tensor:
    • \(H, W\): spatial dimensions of the layout
    • \(C\): each channel is a binary grid indicating presence/absence of an entity in a cell
  • Advantages:
    • Preserves full spatial structure and state information
    • Layout-agnostic, supports a variable number of entities
  • Problem: layouts may have different \(H \times W\)
    • Solution: zero-pad each layout to the maximum \(H \times W\) across all training layouts
  • Processed via a Nature-CNN-inspired architecture (padding and torso sketched below)
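A minimal sketch of the padding step and a CNN torso, assuming stride-1 'SAME' 3x3 convolutions sized for these small grids (the original Nature CNN's large strided kernels would shrink them away); all layer widths are assumptions:

```python
import flax.linen as nn
import jax.numpy as jnp

def pad_to_max(grid, max_h, max_w):
    """Zero-pad an (H, W, C) layout tensor to the largest training size."""
    h, w, _ = grid.shape
    return jnp.pad(grid, ((0, max_h - h), (0, max_w - w), (0, 0)))

class LosslessCNN(nn.Module):
    """Nature-CNN-inspired actor-critic torso for zero-padded grids."""
    n_actions: int = 6

    @nn.compact
    def __call__(self, x):                    # x: (max_H, max_W, C)
        x = nn.relu(nn.Conv(32, (3, 3))(x))   # stride 1, 'SAME' padding
        x = nn.relu(nn.Conv(64, (3, 3))(x))
        x = x.reshape(-1)                     # flatten the spatial map
        x = nn.relu(nn.Dense(256)(x))
        logits = nn.Dense(self.n_actions)(x)  # policy head
        value = nn.Dense(1)(x)                # value head
        return logits, jnp.squeeze(value, -1)
```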

Experiments & Results

Layout                    Avg. Score
------------------------------------
Cramped Room                     220
Asymmetric Advantage             460
Coordination Ring                260
Forced Coordination              260
Counter Circuit                    -
Cramped Tomato (Unseen)          315
5x5 (Unseen)                       -

Conclusion

  • Generalization emerges when training on a diverse set of layouts
  • Featurized encoding with shared parameters converges faster and obtains a higher total reward¹
  • Fragility remains:
    • Policies often “overfit” to coordination patterns specific to the training layouts
    • Small layout changes can still break synchronization
  • Future work:
    • Increase both the variety and number of training layouts through procedural generation

¹ Total Reward: Sum of cumulative rewards across all training layouts

Demo Time 🚀