Agentic Development at the Frontier
Agenda
- Environments are foundational for agentic AI development
- Showcase tools, systems, and simulators needed for RL training
- Demonstrate how environments integrate with RL training and LM post-training
- Workshop jointly organized with Reflection AI, Hugging Face, Unsloth, and PyTorch Foundation
PyTorch RL
- Building agents requires a new infrastructure stack
- Infrastructure: getting infrastructure out of the way so researchers can focus on algorithms
- Data: exposing models to as many environments and skills as possible
- Key challenges in distributed RL systems
  - Orchestration: scheduling heterogeneous compute across different resource types
  - Programming model: torch.distributed was built for data-parallel training, not RL workflows
  - Performance: two coupled producer-consumer problems, bottlenecked at the replay buffer and the parameter server
- Monarch: PyTorch-native distributed programming framework
  - Actor-based system with an imperative Python API
  - RDMA transfers as first-class citizens
  - Enables fault tolerance and heterogeneous scaling
- Torch Forge, built on Monarch
  - Control plane: service abstraction for routing and load balancing
  - Data plane: in-memory storage with automatic resharding
  - Reduces RL code from thousands of lines to something that reads like simple pseudocode
  - Supports both synchronous and asynchronous RL training
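The replay-buffer side of the producer-consumer problem can be sketched in plain Python, along with the difference between synchronous and asynchronous training. This is a minimal single-process illustration, not Monarch or Torch Forge code. The `Trajectory` and `ParameterServer` classes, the thread-based workers, and the `max_staleness` knob are all hypothetical names chosen for the example. Rollout workers produce trajectories tagged with the policy version that generated them. The trainer consumes batches, discards samples that are too far off-policy, and publishes a new version after each step.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """A rollout tagged with the policy version that generated it (hypothetical)."""
    worker_id: int
    policy_version: int


class ParameterServer:
    """Tracks the latest policy version; a stand-in for real weight storage."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0

    def publish(self) -> int:
        with self._lock:
            self._version += 1
            return self._version

    def latest(self) -> int:
        with self._lock:
            return self._version


def rollout_worker(worker_id: int, ps: ParameterServer, buffer: queue.Queue, n: int) -> None:
    # Producer: each rollout uses whatever policy version is current right now.
    for _ in range(n):
        buffer.put(Trajectory(worker_id, ps.latest()))  # blocks while the buffer is full


def trainer(ps: ParameterServer, buffer: queue.Queue, total: int,
            batch_size: int, max_staleness: int) -> list[int]:
    # Consumer: pull a batch, drop samples that are too off-policy, then "update".
    consumed, kept_per_step = 0, []
    while consumed < total:
        batch = [buffer.get() for _ in range(min(batch_size, total - consumed))]
        consumed += len(batch)
        current = ps.latest()
        fresh = [t for t in batch if current - t.policy_version <= max_staleness]
        kept_per_step.append(len(fresh))
        ps.publish()  # a real trainer would take a gradient step and push weights here
    return kept_per_step


def run(total: int = 64, batch_size: int = 8, max_staleness: int = 2,
        workers: int = 2) -> tuple[ParameterServer, list[int]]:
    ps = ParameterServer()
    buffer: queue.Queue = queue.Queue(maxsize=16)  # bounded, so producers feel back-pressure
    per_worker = total // workers
    threads = [threading.Thread(target=rollout_worker, args=(i, ps, buffer, per_worker))
               for i in range(workers)]
    for t in threads:
        t.start()
    kept = trainer(ps, buffer, per_worker * workers, batch_size, max_staleness)
    for t in threads:
        t.join()
    return ps, kept
```

Setting `max_staleness=0` makes the trainer accept only on-policy samples, as in synchronous training. Larger values trade off-policy drift for throughput. The bounded queue supplies back-pressure: producers block whenever the trainer falls behind.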
OpenEnv - Hugging Face
- Problem: training on static datasets plateaus in reward, and models degrade when deployed in the real world
- Solution: a unified interface for RL environments, modeled as partially observable MDPs
- Gymnasium-style API with reset() and step() functions
- Environments can be hosted locally or scaled out via the Hugging Face Spaces API
- DeepSeek-V3.2 was trained on 1,800 distinct environments with 85,000+ complex prompts
- Available environments
  - Coding environments for software engineering agents
  - Browser team environments
  - Wordle puzzles, games, and more
- CLI tool with a workflow modeled on the uv package manager
  - openenv init creates a project skeleton
  - openenv push deploys the environment to the Hugging Face Hub
- Type-safe by design using Python dataclasses
  - Action, observation, and state classes prevent tensor mismatch errors
  - No external dependencies required
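Standard-library dataclasses are enough to sketch a Gymnasium-style environment with typed action, observation, and state classes. This is not OpenEnv's actual API. The class names (`GuessAction`, `GuessObservation`, `GuessState`, `NumberGuessEnv`) and the toy number-guessing task are hypothetical. The hidden target lives only in the state, and the agent receives hints rather than the target itself. That partial view is what makes this a partially observable MDP.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GuessAction:
    guess: int

    def __post_init__(self) -> None:
        # Reject malformed actions at construction time instead of deep inside step().
        if not isinstance(self.guess, int) or isinstance(self.guess, bool):
            raise TypeError(f"guess must be int, got {type(self.guess).__name__}")


@dataclass(frozen=True)
class GuessObservation:
    hint: Optional[str]  # "higher", "lower", "correct", or None right after reset
    done: bool
    reward: float


@dataclass
class GuessState:
    target: int  # hidden from the agent: step() only returns observations
    steps: int = 0


class NumberGuessEnv:
    def __init__(self, low: int = 1, high: int = 100, max_steps: int = 10,
                 seed: Optional[int] = None) -> None:
        self.low, self.high, self.max_steps = low, high, max_steps
        self._rng = random.Random(seed)
        self.state: Optional[GuessState] = None

    def reset(self) -> GuessObservation:
        self.state = GuessState(target=self._rng.randint(self.low, self.high))
        return GuessObservation(hint=None, done=False, reward=0.0)

    def step(self, action: GuessAction) -> GuessObservation:
        if self.state is None:
            raise RuntimeError("call reset() before step()")
        if not isinstance(action, GuessAction):
            raise TypeError("step() expects a GuessAction")
        self.state.steps += 1
        if action.guess == self.state.target:
            return GuessObservation(hint="correct", done=True, reward=1.0)
        hint = "higher" if action.guess < self.state.target else "lower"
        return GuessObservation(hint=hint, done=self.state.steps >= self.max_steps, reward=0.0)


def play_binary_search(env: NumberGuessEnv) -> GuessObservation:
    """Example agent: narrows the range using only the hints it observes."""
    obs = env.reset()
    lo, hi = env.low, env.high
    while not obs.done:
        guess = (lo + hi) // 2
        obs = env.step(GuessAction(guess))
        if obs.hint == "higher":
            lo = guess + 1
        elif obs.hint == "lower":
            hi = guess - 1
    return obs
```

A call such as `GuessAction("5")` fails immediately with a `TypeError`. Without the typed action class, the bad value would only surface later inside the environment.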
Unsloth PyTorch OpenEnv
- Colab notebook demo: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game.ipynb
- 2048 game reinforcement learning example
  - Goal: train a language model to generate winning strategies
  - Reward good actions and penalize bad actions over many iterations
  - Requires only one prompt and one environment: no additional training data needed
  - Key RL principle: “more good, less bad”, repeated over thousands of steps
- Integration with Hugging Face TRL framework
- Memory optimization: 50% lower memory usage with faster processing
- Access to 2,000+ RL environments through the Unsloth integration
- Anti-cheating checks prevent the model from exploiting loopholes instead of learning a genuine strategy
- GitHub: https://github.com/unslothai/unsloth
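The "more good, less bad" loop can be illustrated with two pieces: a reward function for model-written 2048 strategies that includes an anti-cheating check, and group-relative advantages. This is a sketch under stated assumptions, not the notebook's code. It assumes the model emits a Python function named `strategy(board)` that returns one of "W", "A", "S", "D". The helpers `check_no_cheating`, `score_strategy`, and `group_advantages`, the deny-list, and the reward values are all hypothetical. Normalizing rewards within a group of samples for the same prompt follows the GRPO idea available in TRL, simplified here.

```python
import ast
import statistics

VALID_MOVES = {"W", "A", "S", "D"}
# Hypothetical deny-list: names that would let a strategy escape the game rules.
BANNED_NAMES = {"eval", "exec", "open", "compile", "__import__", "globals", "locals", "setattr"}
SAFE_BUILTINS = {"len": len, "range": range, "max": max, "min": min, "sum": sum, "enumerate": enumerate}
SAMPLE_BOARDS = [
    [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 4, 0], [0, 0, 0, 0]],
    [[2, 4, 8, 16], [0, 0, 2, 4], [0, 0, 0, 2], [0, 0, 0, 0]],
]


def check_no_cheating(code: str) -> bool:
    """Reject code that fails to parse, imports modules, or touches banned names/dunders."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
    return True


def score_strategy(code: str) -> float:
    """-1.0 cheating/unparseable, -0.5 crash, 1.0 valid moves on every board, else 0.0."""
    if not check_no_cheating(code):
        return -1.0
    namespace = {"__builtins__": SAFE_BUILTINS}
    try:
        exec(code, namespace)  # NOT a sandbox: a real setup would isolate this process
        strategy = namespace["strategy"]
        moves = [strategy([row[:] for row in board]) for board in SAMPLE_BOARDS]
    except Exception:
        return -0.5
    return 1.0 if all(m in VALID_MOVES for m in moves) else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Above-average samples get positive advantage ("more good"),
    below-average samples get negative advantage ("less bad")."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every sample in a group earns the same reward, all advantages are zero and the group contributes no update. This is one reason the reward needs to separate good strategies from bad ones.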
Reflection
- Shift from static benchmarks to agentic benchmarks that require dynamic problem-solving
- Task horizon trends
  - AI models now complete hour-long tasks 50% of the time
  - The length of tasks models can complete is doubling roughly every 7 months
  - The trend is even faster for software engineering tasks
- Required LLM capabilities for autonomous agents
  - Multi-step reasoning and planning
  - Tool use (already working well)
  - Self-correction and recovery from failed trajectories
  - Coherence over long-horizon tasks
- Coding agents as the primary success case
  - Deterministic, text-based, and verifiable
  - Easy verification through unit tests and tool-assisted feedback
  - Evolution from autocomplete to autonomous coding across multiple files
- Pre-training vs. post-training focus
  - Pre-training: reasoning ability and long-horizon task solving
  - Post-training: environments provide both training data and evaluation
  - Need diverse, challenging tasks at the edge of current model capabilities
- Reflection AI is hiring to develop frontier open agentic models
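Because coding tasks are verifiable, an environment can turn unit tests into a reward. The same signal can then select tasks near the edge of a model's ability. The sketch below is illustrative only, not Reflection AI's method. The `Task` structure, the fraction-of-tests-passed reward, and the 0.1 to 0.9 pass-rate band that defines a "frontier" task are all assumptions. Candidate solutions are plain Python callables standing in for generated code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    name: str
    tests: Sequence[tuple[tuple, object]]  # (args, expected output) pairs


def unit_test_reward(candidate: Callable, task: Task) -> float:
    """Fraction of the task's unit tests the candidate passes (0.0 to 1.0)."""
    passed = 0
    for args, expected in task.tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(task.tests)


def pass_rate(samples: Sequence[Callable], task: Task) -> float:
    """Share of sampled solutions that pass every test."""
    return sum(unit_test_reward(s, task) == 1.0 for s in samples) / len(samples)


def frontier_tasks(tasks_and_samples: Sequence[tuple[Task, Sequence[Callable]]],
                   low: float = 0.1, high: float = 0.9) -> list[str]:
    """Keep tasks the model solves only sometimes: always-solved or never-solved
    tasks give little learning signal."""
    return [task.name for task, samples in tasks_and_samples
            if low <= pass_rate(samples, task) <= high]
```

With `add = Task("add", [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)])`, a correct `lambda a, b: a + b` earns reward 1.0. A buggy `lambda a, b: a - b` earns 1/3, since it passes only the `(0, 0)` case.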