An Introduction to Unity ML-Agents with Hugging Face 🤗

Let’s train a curious agent to destroy Pyramids

Thomas Simonini
13 min read · Jun 22, 2022

⚠️ A new updated version of this article is available here 👉

In the last ten years, we have witnessed massive breakthroughs in reinforcement learning (RL). From the first successful use of RL by a deep learning model for learning a policy from pixel input in 2013 to Decision Transformers, we live in an exciting moment, and if you want to learn about RL, this is the perfect time to start.

This moment is also exciting because we have access to so many unique environments, and we can build our own using the Unity game engine. Indeed, the Unity ML-Agents Toolkit is a plugin for the Unity game engine that allows us to use Unity as an environment builder to train agents.

Source: Unity ML-Agents Toolkit

From playing football, learning to walk, and jumping over big walls, to training a cute doggy to catch sticks, the Unity ML-Agents Toolkit provides a ton of exceptional pre-made environments.

But there is more. Normally, when you want to use ML-Agents, you need to install Unity and know how to use it. At Hugging Face, we worked on an experimental ML-Agents update where you don’t need to install Unity or know how to use the Unity Editor to use ML-Agents environments. Plus, you can publish your models to the Hugging Face Hub for free and visualize your agent playing directly online 🎉.

What you’ll get at the end of this tutorial

And it’s what we’re going to do today. We’ll learn about ML-Agents and use one of the pre-made environments: Pyramids. In this environment, we’ll train an agent that needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.

To do that, it will need to explore its environment, and we will use a technique called curiosity.

Then, after training, we’ll push the trained agent to the Hugging Face Hub, and you’ll be able to visualize it playing directly in your browser without having to use the Unity Editor. You’ll also be able to visualize and download other trained agents from the community.

You can replay your agents online

Sounds exciting? Let’s start.

This tutorial is part of Deep Reinforcement Learning Class 🤗, a free class where you’ll learn the theory and practice using famous Deep RL libraries such as Stable Baselines3, RL Baselines3 Zoo, ML-Agents, and RLlib.

It can be followed as a standalone tutorial if you understand the basics of reinforcement learning, but if you want to go deeper into the class, you should sign up (it’s free) here 👉

You can follow this tutorial and train your agent directly on a Google Colab notebook made by one of our classmates, Abid Ali Awan, alias kingabzpro.

👇 👇 👇

What’s the Hugging Face Hub? 🤗

The Hugging Face Hub is a platform with over 50K models, 5K datasets, and 5K demos in which people can easily collaborate in their ML workflows.

The Hub is a central place where anyone can share, explore, discover, and experiment with open-source Machine Learning: think of it as the GitHub for Machine Learning models.

For Reinforcement Learning, we have already integrated Stable-Baselines3 and RL-Zoo. Indeed, with one line of code, you can evaluate your agent, record a replay, generate a model card, and push it to the Hub.

If you want to see how 👉 check Unit 1 of the course.

You can find the list of Reinforcement Learning models (already 1500) here:

Our integration generates a video replay of your agent and evaluates it

To be able to upload your models to the Hub, you’ll need an account (it’s free). You can sign up here

How do Unity ML-Agents work?

Before training our agent, we need to grab the big picture of ML-Agents.

What is Unity ML-Agents?

Unity ML-Agents is a toolkit for the game engine Unity that allows us to create environments using Unity or use pre-made environments to train our agents.

It’s developed by Unity Technologies, the developers of Unity, one of the most famous Game Engines used by the creators of Firewatch, Cuphead, and Cities: Skylines.

Firewatch was made with Unity

The four components

With Unity ML-Agents, you have four essential components.

Source: Unity ML-Agents Documentation

The first is the Learning Environment, which contains the Unity scene (the environment) and the environment elements (game characters).

The second is the Python API, which contains the low-level Python interface for interacting with and manipulating the environment. It’s the API we use to launch the training.

Then, we have the Communicator that connects the environment (C#) with the Python API (Python).

Finally, we have the Python trainers: the RL algorithms made with PyTorch (PPO, SAC…).

Inside the Learning Environment

Inside the Learning Environment, we have three important elements:

The first is the Agent, the actor of the scene. We’ll train the agent by optimizing its policy (which tells us what action to take in each state). The policy is our second element: it’s called the Brain.

Finally, there is the Academy. This element orchestrates agents and their decision-making process. Think of the Academy as a maestro that handles the requests from the Python API.

To better understand its role, let’s remember the RL process. This can be modeled as a loop that works like this:

Source: Sutton’s Book

Now, let’s imagine an agent learning to play a platform game. The RL process looks like this:

  • Our agent receives state S0 from the environment — we receive the first frame of our game (environment).
  • Based on the state S0, the agent takes an action A0 — our agent will move to the right.
  • The environment transitions to a new state S1.
  • The environment gives a reward R1 to the agent — we’re not dead (positive reward +1).

This RL loop outputs a sequence of state, action, and reward. The goal of the agent is to maximize the expected cumulative reward.
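To make this loop concrete, here is a minimal sketch in Python using a toy one-dimensional environment. The `ToyEnv` class and its `reset`/`step` methods are made up for this illustration; they are not part of the ML-Agents API:

```python
import random

class ToyEnv:
    """A tiny 1-D corridor: the agent starts at position 0, the goal is at position 5."""
    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state S0

    def step(self, action):
        # Move right or left, without going below 0.
        self.position = max(0, self.position + (1 if action == "right" else -1))
        done = self.position == 5
        reward = 1.0 if done else 0.0  # sparse reward: only the goal pays out
        return self.position, reward, done

env = ToyEnv()
state = env.reset()                # receive state S0
cumulative_reward, done = 0.0, False
while not done:
    action = random.choice(["left", "right"])  # a (terrible) random policy
    state, reward, done = env.step(action)     # environment returns S_{t+1} and R_{t+1}
    cumulative_reward += reward

print(cumulative_reward)  # 1.0 once the goal is finally reached
```

Notice how a random policy eventually stumbles onto the goal, but most steps return a reward of zero: this is exactly the sparse-reward situation we’ll discuss below.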

The Academy is the element that sends the orders to our agents and ensures that they stay in sync:

  • Collect Observations
  • Select your action using your policy
  • Take the Action
  • Reset if you reached the max step or if you’re done.

Now that we understand how ML-Agents works and the Hub, we’re ready to understand the Pyramid environment and train our agent.

The Pyramid Environment

The goal in this environment is to train our agent to get the gold brick on the top of the Pyramid. In order to do that, it needs to press a button to spawn a pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top.

The reward system is simple: the agent receives a big positive reward when it reaches the gold brick, and a small negative reward at every step, which pushes it to reach the brick as fast as possible.

In terms of observations, we don’t use normal vision (frames); instead, we use 148 raycasts, each of which can detect objects (switch, bricks, golden brick, and walls).

Think of raycasts as lasers that detect whether they pass through an object.

Source: Unity ML-Agents Documentation

We also use a boolean variable indicating the switch state (whether or not we turned on the switch to spawn the Pyramid) and a vector that contains the agent’s speed.

The action space is discrete, with four possible actions (moving and rotating the agent).

Our goal is to hit the benchmark with a mean reward of 1.75.

To train this new agent that seeks the button and then the Pyramid to destroy, we’ll use a combination of two types of rewards:

  • The extrinsic one given by the environment.
  • But also an intrinsic one, called curiosity. The latter will push our agent to be curious, or in other terms, to better explore its environment.

(optional) What is Curiosity in Deep RL?

I’ve already covered curiosity in detail in two other articles, here and here, if you want to dive into the mathematical and implementation details.

Two Major Problems in Modern RL

To understand what curiosity is, we first need to understand the two major problems with RL:

First, the sparse rewards problem: most rewards do not contain information, and hence are set to zero.

Remember that RL is based on the reward hypothesis: the idea that every goal can be described as the maximization of reward. Rewards act as feedback for RL agents; if they don’t receive any, their knowledge of which actions are appropriate (or not) cannot change.

Thanks to the reward, our agent knows that this action at that state was good

For instance, in ViZDoom’s “DoomMyWayHome,” your agent is only rewarded if it finds the vest. However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and it can spend its time going in circles without finding the goal.

A big thanks to Felix Steger for this illustration

The second big problem is that the extrinsic reward function is handmade: in each environment, a human has to implement a reward function. But how can we scale that in big and complex environments?

So what is curiosity?

A solution to these problems is to develop a reward function that is intrinsic to the agent, i.e., generated by the agent itself. The agent acts as a self-learner: it is the student, but also its own feedback master.

This intrinsic reward mechanism is known as curiosity because it pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent receives a high reward when exploring new trajectories.

This reward is in fact modeled on how humans act: we naturally have an intrinsic desire to explore environments and discover new things.

There are different ways to calculate this intrinsic reward. The classical one (curiosity through next-state prediction) calculates curiosity as the error our agent makes when predicting the next state, given the current state and the action taken.

The idea of curiosity is to encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its own actions (uncertainty will be higher in areas where the agent has spent less time, or in areas with complex dynamics).

If the agent spends a lot of time in these states, it will be good at predicting the next state (low curiosity). On the other hand, if the state is new and unexplored, it will be bad at predicting the next state (high curiosity).

Using curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and consequently better explore our environment.
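As a toy illustration of next-state prediction (this is not the ML-Agents implementation; the `ForwardModel` class and all names here are made up for the example), we can treat the intrinsic reward as the squared error of a simple learned forward model:

```python
from collections import defaultdict

class ForwardModel:
    """Predicts the next state as a running average per (state, action) pair."""
    def __init__(self):
        self.predictions = defaultdict(float)  # predicted next state
        self.counts = defaultdict(int)         # how often we saw each pair

    def predict(self, state, action):
        return self.predictions[(state, action)]

    def update(self, state, action, next_state):
        key = (state, action)
        self.counts[key] += 1
        # Incremental mean: move the prediction toward the observed next state.
        self.predictions[key] += (next_state - self.predictions[key]) / self.counts[key]

def intrinsic_reward(model, state, action, next_state):
    error = next_state - model.predict(state, action)
    return error ** 2  # high error = unfamiliar transition = high curiosity

model = ForwardModel()
# First visit to a transition: the model has never seen it, so curiosity is high.
r1 = intrinsic_reward(model, state=0, action=1, next_state=1)
model.update(0, 1, 1)
# Same transition again: the model now predicts it well, so curiosity drops.
r2 = intrinsic_reward(model, state=0, action=1, next_state=1)
print(r1, r2)  # 1.0 0.0
```

The real thing uses neural networks over high-dimensional observations rather than a lookup table, but the principle is the same: the worse the agent predicts a transition, the bigger the intrinsic reward for taking it.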

There are also other methods to calculate curiosity. ML-Agents uses a more advanced one called curiosity through random network distillation. This is out of the scope of this tutorial, but if you’re interested, I wrote an article explaining it in detail.

Let’s destroy some pyramids! 💥

Ok, we’re now ready to train our agent 🔥

Step 1: Clone and install the Hugging Face ML-Agents fork

We need to clone the repository that contains the experimental version of the library, which allows you to push your trained agent to the Hub.

# Clone the repository
git clone https://github.com/huggingface/ml-agents
# Go inside the repository and install the package
cd ml-agents
pip3 install -e ./ml-agents-envs
pip3 install -e ./ml-agents

Step 2: Download the Environment Executable

  • Download from here:
  • Unzip it and place it inside the ML-Agents cloned repo, in a new folder called trained-envs-executables/windows.

Step 3: Modify the PyramidsRND config file

In ML-Agents, you define the training hyperparameters in .yaml config files.

In our case, we use PyramidsRND (Pyramids with Random Network Distillation).

For this first training, we’ll modify one thing:

  • The total training steps hyperparameter is too high since we can hit the benchmark in only 1M training steps.

To do that, we go to config/ppo/PyramidsRND.yaml and change max_steps to 1000000.
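After the edit, the relevant part of config/ppo/PyramidsRND.yaml looks roughly like this (a sketch: the surrounding keys and their default values may differ between ML-Agents versions):

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    # ... other hyperparameters (batch size, learning rate, reward signals) ...
    max_steps: 1000000  # lowered so training stops after 1M steps
```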

As an experiment, you should also try to modify some other hyperparameters; Unity provides very good documentation explaining each of them here.

We’re now ready to train our agent 🔥.

Step 4: Train our agent

To train our agent, we just need to launch mlagents-learn and select the executable containing the environment.

We define four parameters:

  • mlagents-learn <config>: the path to the hyperparameter config file.
  • --env: where the environment executable is.
  • --run-id: the name you want to give to your training run id.
  • --no-graphics: to not launch the visualization during the training (you can remove this parameter, but that might slow down the training).
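Putting those four parameters together, the full command looks something like this (a sketch: the executable path and run id below are examples — adjust them to where you placed the environment and to the name you chose):

```shell
mlagents-learn ./config/ppo/PyramidsRND.yaml \
  --env=./trained-envs-executables/windows/Pyramids \
  --run-id="First Training" \
  --no-graphics
```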

You should see something like this:

You can monitor your training by launching Tensorboard using this command:
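Assuming your runs are saved under results/ (the default output folder), a typical invocation is:

```shell
tensorboard --logdir results
```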

The TensorBoard files will also be uploaded to the Hub, so you will be able to access them whenever you want:

The training will take 30 to 45 minutes depending on your machine. Go take a ☕️, you deserve it 🤗.

Step 5: Push the agent to the 🤗 Hub

Now that we’ve trained our agent, we’re ready to push it to the Hub and see it playing online 🔥.

First, we need to log in to Hugging Face to be able to push a model:

  • Copy the token
  • Run this and paste the token:
huggingface-cli login

Then, we simply need to run mlagents-push-to-hf.

We define four parameters:

  • --run-id: the name of the training run id.
  • --local-dir: where the agent was saved. It’s results/<run_id name>, so in my case results/First Training.
  • --repo-id: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>.

If the repo does not exist, it will be created automatically.

  • --commit-message: since HF repos are git repositories, you need to define a commit message.
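With those four parameters, the full command looks something like this (the local dir matches the run id from the training step; the repo id and commit message below are examples — use your own):

```shell
mlagents-push-to-hf \
  --run-id="First Training" \
  --local-dir="./results/First Training" \
  --repo-id="ThomasSimonini/MLAgents-Pyramids" \
  --commit-message="First Pyramids trained agent"
```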

If you have a rebase error:

OSError: error: cannot pull with rebase: Your index contains uncommitted changes. error: please commit or stash them.

The best fix is to delete the hub folder inside the ml-agents folder, and then run mlagents-push-to-hf again.

Otherwise, if everything worked, you should see this at the end of the process (but with a different url 😆):

Your model is pushed to the hub. You can view your model here:

It’s the link to your model. It contains a model card that explains how to use it, your TensorBoard logs, and your config file. What’s awesome is that it’s a git repository, which means you can have different commits, update your repository with a new push, etc.

But now comes the best: being able to visualize your agent online.

Step 6: Watch our agent play 👀

For this step it’s simple:

  • In step 1, choose your model repository, which is the model id (in my case, ThomasSimonini/MLAgents-Pyramids).
  • In step 2, choose which model you want to replay:
  • I have multiple ones, since we saved a model every 500000 timesteps. If I want the most recent, I choose Pyramids.onnx.
  • What’s nice is to try different model checkpoints to see the improvement of the agent.

🎁 Bonus: Why not train on another environment?

Now that you know how to train an agent using ML-Agents, why not try another environment? ML-Agents provides 18 different environments, and we’re building some custom ones. The best way to learn is to try things on your own. Have fun!

You can find the full list of the ones currently available on Hugging Face here 👉

You can also play with the agents we’ve already uploaded here:

And don’t forget we have a discord server where you can ask questions and exchange with your classmates 👉

That’s all for today. Congrats on finishing this tutorial! You’ve just trained your first ML-Agent and shared it on the Hub 🥳.

The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own. Check the documentation and have fun!

If you have any thoughts, comments, or questions, feel free to comment below.

And if you liked my article, please click the 👏 below as many times as you liked the article so that other people will see this here on Medium.

Finally, we want to improve and update the course iteratively with your feedback. If you have some, please fill this form 👉

See you next time!

Keep learning, stay awesome 🤗,



Thomas Simonini

Developer Advocate 🥑 at Hugging Face 🤗| Founder Deep Reinforcement Learning class 📚 |