Overcooked Project

Project Overview

  • Implemented PPO and applied it as IPPO (Independent PPO) for multi-agent reinforcement learning
    • Demonstrated competitive performance
  • Focused on generalization to unseen layouts:
    • Trained agents on 5 training layouts.
    • Evaluated generalization on 2 unseen layouts.
  • Built with JAX/Flax to achieve high throughput for acting and learning steps
    • Enabled feasible runtimes for hyperparameter optimization
  • Conducted an Optuna search to identify the best-performing PPO hyperparameters, state encoding, and architecture (sketched below)
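A minimal sketch of what such a search can look like with Optuna; the search space, value ranges, and the train_and_eval entry point are illustrative assumptions, not the project's actual code:

```python
import optuna

def train_and_eval(lr, clip_eps, ent_coef, encoding):
    # Hypothetical stand-in for the real IPPO training run: train on the
    # 5 training layouts and return a scalar evaluation score.
    raise NotImplementedError

def objective(trial):
    # Assumed search space; parameter names and ranges are illustrative.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    clip_eps = trial.suggest_float("clip_eps", 0.1, 0.3)
    ent_coef = trial.suggest_float("ent_coef", 1e-4, 1e-1, log=True)
    encoding = trial.suggest_categorical("encoding", ["featurized", "lossless"])
    return train_and_eval(lr, clip_eps, ent_coef, encoding)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```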

Sparse reward is not enough

  • Reward is given only when a soup is delivered
  • Soup delivery requires an orchestration of intermediate steps
    • Problem: it is difficult for agents to learn which intermediate actions contribute to the final delivery (a credit-assignment problem)
  • Solution: Implemented a Gymnasium wrapper that uses the info dictionary for reward shaping (see the sketch after this list)
    • “Babysits” the agents to incentivize actions that increase the probability of delivery
  • Early reward hacking is transient
    • The agent pair eventually learns that delivering soups maximizes total reward.
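A minimal sketch of such a wrapper, assuming the underlying environment reports intermediate events through the info dict; the event keys and weights here are illustrative assumptions, not the project's exact values:

```python
import gymnasium as gym

class ShapedRewardWrapper(gym.Wrapper):
    # Assumed event keys and weights; the real info-dict keys and the
    # tuned weights in the project may differ.
    SHAPING = {
        "onion_in_pot": 1.0,   # progress toward cooking a soup
        "soup_in_dish": 3.0,   # progress toward delivering it
        "dish_pickup": 0.5,    # enables plating
    }

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Add a bonus for each intermediate event reported this step,
        # nudging the pair toward action chains that end in a delivery.
        shaped = sum(w for key, w in self.SHAPING.items() if info.get(key))
        return obs, reward + shaped, terminated, truncated, info
```

Because the bonuses only reward steps that raise the probability of a delivery, the sparse delivery reward still dominates once the pair delivers reliably.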

State Representation & Architectures

Featurized Encoding

  • Hand-crafted features
  • Produces a 96-dimensional vector (assumes 2 pots)
  • Encodes:
    • Agent-centric features (position, direction)
    • Pot states
    • Relative distances to key entities
  • Problems:
    • Spatial geometry and any information not captured by the hand-crafted features are discarded
    • Increasing the number of pots increases the feature dimension
  • Processed via an MLP
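For concreteness, a minimal Flax sketch of this torso; the hidden widths and the two-headed actor-critic layout are assumptions (n_actions=6 matches Overcooked's action set: four moves, stay, interact):

```python
import flax.linen as nn
import jax.numpy as jnp

class FeaturizedMLP(nn.Module):
    """Actor-critic MLP over the 96-dim featurized observation."""
    hidden: int = 64      # assumed width
    n_actions: int = 6

    @nn.compact
    def __call__(self, x):                     # x: (96,)
        x = nn.relu(nn.Dense(self.hidden)(x))
        x = nn.relu(nn.Dense(self.hidden)(x))
        logits = nn.Dense(self.n_actions)(x)   # policy head
        value = nn.Dense(1)(x)                 # value head on shared torso
        return logits, jnp.squeeze(value, -1)
```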

Lossless Encoding

  • Raw state
  • \(H\times W\times C\) tensor:
    • \(H, W\): spatial dimensions of the layout
    • \(C\): each channel is a binary grid indicating presence/absence of an entity in a cell
  • Advantages:
    • Preserves full spatial structure and state information
    • Layout-agnostic, supports a variable number of entities
  • Problem: layouts may have different \(H \times W\)
    • Solution: zero-pad each layout to the maximum \(H \times W\) across all training layouts
  • Processed via a Nature-CNN-inspired architecture (padding and torso sketched below)
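A minimal sketch of the padding step and a CNN torso, assuming stride-1 'SAME' 3x3 convolutions sized for these small grids (the original Nature CNN's large strided kernels would shrink them away); all layer widths are assumptions:

```python
import flax.linen as nn
import jax.numpy as jnp

def pad_to_max(grid, max_h, max_w):
    """Zero-pad an (H, W, C) layout tensor to the largest training size."""
    h, w, _ = grid.shape
    return jnp.pad(grid, ((0, max_h - h), (0, max_w - w), (0, 0)))

class LosslessCNN(nn.Module):
    """Nature-CNN-inspired actor-critic torso for zero-padded grids."""
    n_actions: int = 6

    @nn.compact
    def __call__(self, x):                    # x: (max_H, max_W, C)
        x = nn.relu(nn.Conv(32, (3, 3))(x))   # stride 1, 'SAME' padding
        x = nn.relu(nn.Conv(64, (3, 3))(x))
        x = x.reshape(-1)                     # flatten the spatial map
        x = nn.relu(nn.Dense(256)(x))
        logits = nn.Dense(self.n_actions)(x)  # policy head
        value = nn.Dense(1)(x)                # value head
        return logits, jnp.squeeze(value, -1)
```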

Experiments & Results

Layout                    Avg. Score
------------------------------------
Cramped Room                     220
Asymmetric Advantage             460
Coordination Ring                260
Forced Coordination              260
Counter Circuit                    -
Cramped Tomato (Unseen)          315
5x5 (Unseen)                       -

Conclusion

  • Generalization emerges when training on a diverse set of layouts
  • Featurized encoding with shared parameters converges faster and obtains a higher total reward¹
  • Fragility remains:
    • Policies often “overfit” to coordination patterns specific to the training layouts
    • Small layout changes can still break synchronization
  • Future work:
    • Increase both the variety and number of training layouts through procedural generation

¹ Total Reward: Sum of cumulative rewards across all training layouts

Demo Time 🚀