Building a smart Robot AI using Hugging Face 🤗 and Unity

A Robot obeying your orders

Thomas Simonini
9 min read · Sep 16, 2021

I’ve updated this tutorial; the new version is here 👉

Today we’re going to build this adorable smart robot that will perform actions based on player text input.

It uses a deep language model to understand any text input and find the most appropriate action from its list.

What’s interesting about this system is that, contrary to classical game development, you don’t need to hard-code every interaction. Instead, you use a language model that selects which of the robot’s possible actions is the most appropriate given the user input.

To make this project, we’re going to use:

  • Unity Game Engine (2020.3.18f1 or later).
  • The Jammo Robot asset, made by Mix and Jam.
  • Hugging Face 🤗

This article is intended for people who already have some basic Unity skills. If that’s not the case and you just want to interact with the robot, you can go to the project page, download it and follow the documentation.

So let’s get started!

The power of Sentence Similarity 🤖

Before diving into the implementation, we need to understand how the project works and what sentence similarity is.

How does the project work?

With this project, we want to give more liberty to the player. Instead of giving an order to the robot by just clicking a button, we want them to interact with it through text.

The robot has a list of actions and uses a sentence similarity model that selects the closest action (if any) given the player’s order.

For instance, if I write “hey grab me the red box”, the robot wasn’t programmed to know what “hey grab me the red box” means. But the sentence similarity model makes the connection between this order and the “bring me red cube” action.

Thanks to this technique, we can build believable character AI without the tedious process of mapping every possible player input to a robot response by hand: we let the sentence similarity model do the job.

What’s Sentence Similarity?

Sentence similarity is the task of calculating, given a source sentence and a list of other sentences, how similar each of those sentences is to the source sentence.

For instance, if our source sentence is “Hey there!”, it’s very close to the sentence “Hello!”.

In our case, we use all-mpnet-base-v2, a sentence transformer model. Using this model, we let our robot “decide”, given the input text, which action is the most appropriate to take. The model is already trained, so we can use it directly.
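To get an intuition for what a similarity score is, here is a minimal sketch in Python (not the project’s C#). Sentence transformer models map each sentence to a vector, and vectors are typically compared with cosine similarity. The 3-dimensional vectors below are made up for illustration; all-mpnet-base-v2 produces 768-dimensional vectors, and in this project the hosted model computes the scores for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings", for illustration only (assumption, not model output).
source = [0.9, 0.1, 0.0]  # stands in for "Hey there!"
candidates = {
    "Hello!": [0.8, 0.2, 0.1],
    "Bring me the red cube": [0.0, 0.3, 0.9],
}

# One score per candidate sentence; a higher score means "closer in meaning".
scores = {text: cosine_similarity(source, vec) for text, vec in candidates.items()}
```

With these toy vectors, “Hello!” scores much higher than “Bring me the red cube”, which is the behavior we rely on to pick an action.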

Step 1: Select the Sentence Similarity Model

Getting Started with HuggingFace 🤗

HuggingFace hosts a lot of amazing language models and an API (Accelerated Inference API) to plug them directly into your projects.

But first, you need to create an account:

When your account is created, go to the Accelerated Inference API dashboard, click on your profile (top right-hand corner), then on API Token.

Copy the API Token; it’s the key you’ll need to use the API.

⚠️ For security reasons, DO NOT SHARE THIS KEY, it’s a private key.

Accelerated Inference API 🚀

Now that we have the API key, we can choose and try our model.

I chose this sentence similarity model (all-mpnet-base-v2):

But feel free to try other Sentence Similarity models.

What’s nice is that we can try the model directly on the website with the Hosted Inference API. Let’s make a test:

We see that “fetch the red block” is close to the robot action “bring me red cube”, so the model is working correctly.

Now that we’ve chosen this model, we need to get the API URL. To do that, click on Deploy > Accelerated Inference.

This opens a modal where you can copy the API_URL:

We’re now ready to connect the API to Unity 🔗.

Step 2: Connect HuggingFace API 🤗 to Unity 🔗

Now, we need to connect the Hugging Face model API to Unity to be able to use it.

Open Scenes/Tutorial and in Scripts/Tutorial_:

The scene looks like this: you have your robot surrounded by different objects (BlueCube, RedCube, RedPillar, etc).

In Scripts/Tutorial open HuggingFaceAPI_Tutorial.cs:

This script handles the POST request to the API. Given a player input, the API returns the sentence similarity score for each robot action in the action list. From those scores, we can select the action with the highest similarity score.

HFScore(): call the API

To call the API and handle the result, we use a Coroutine function. This type of function can pause its execution, and we want to wait for the API to return a response before continuing.

  • First, we form the JSON for the POST Request from the player input and the list of action sentences.
  • Then we make the web request and return the response.
  • The response is a string that looks like “[0.7777, 0.19, 0.01]”. To work with it, we need to transform it into an array of floats. That is what ProcessResult(data) does.
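As a rough sketch of the first step, here is how the request could be assembled, written in Python rather than the project’s C#. The body uses the source_sentence / sentences input format of Hugging Face’s sentence similarity task. The function name build_request, the token placeholder, and the two extra action sentences are assumptions for illustration, not the project’s actual code.

```python
import json

def build_request(api_key, source_sentence, sentences):
    """Return (headers, body) for a sentence similarity POST request."""
    # The API key is sent as a bearer token.
    headers = {"Authorization": f"Bearer {api_key}"}
    # The player's order is the source; the robot's actions are the candidates.
    body = json.dumps({
        "inputs": {
            "source_sentence": source_sentence,
            "sentences": sentences,
        }
    })
    return headers, body

headers, body = build_request(
    "hf_xxx",  # placeholder, not a real token
    "hey grab me the red box",
    # Only the first sentence comes from the article; the others are hypothetical.
    ["bring me red cube", "say hello", "go to the yellow pillar"],
)
```

The API answers with one score per entry in sentences, in the same order, which is why the index of the highest score tells us which action to pick.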

ProcessResult(): convert result to an array of floats and find the max score and max score index.

Because the API returns a string such as “[0.77, 0.19, 0.01,…]”, we can’t search it directly for the highest score and its index.

Therefore, we convert the result to an array of floats such as [0.77, 0.19, 0.01] and find the max score and its index. That’s exactly what the ProcessResult() function does.

Thanks to this function, the maxScoreIndex and maxScore variables are updated, and we’ll read them in the Robot Behavior to select the correct action.
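The same idea can be sketched in a few lines of Python (again, not the project’s C# code). Since the response string is a valid JSON array, this sketch parses it with the standard json module; the project’s ProcessResult() may convert the string differently.

```python
import json

def process_result(result):
    """Parse a string like "[0.77, 0.19, 0.01]" and return (max_score, max_score_index)."""
    scores = [float(value) for value in json.loads(result)]
    if not scores:
        raise ValueError("the API returned no scores")
    # Index of the highest score; on ties, the first one wins.
    max_score_index = max(range(len(scores)), key=lambda i: scores[i])
    return scores[max_score_index], max_score_index

max_score, max_score_index = process_result("[0.77, 0.19, 0.01]")
```

Here max_score is 0.77 and max_score_index is 0, so the first action in the list would be selected.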

Fill the inspector

The last step before working on our robot behaviors is to enter your API key and model URL in the inspector.

Click on the HuggingFaceAPI object in the scene and, in the inspector, update Model_url and Hf_api_key.

⚠️ It’s important that you don’t share your project if you define the API Key.

Step 3: Build the Robot Behavior 🤖

Now that we’ve connected our Unity Project to the Hugging Face Model API, we need to define the behavior of our robot.

The idea is that our robot has different possible actions and the choice of the actions will depend on the API output.

First, we need to define the Finite State Machine, a simple AI where each State defines a certain behavior.

Then, we’ll make the utility function that selects the State, and therefore the series of actions to perform.

Let’s define the State Machine

In a state machine, each state represents a behavior, for instance moving to a column, saying hello, etc. Based on the state the agent is in, it will perform a series of actions.

In our case, we have 7 states:

The first thing we need to do is create an enum called State that contains each of the possible States:

Because we need to constantly check the state, we define the state machine in the Update() method using a switch statement where each case is a state.

For each state case, we define the behavior of our agent. For instance, in our Hello state, the robot must move towards the player, face them correctly, launch its hello animation, and then go back to the Idle state.
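To make the pattern concrete, here is a simplified sketch in Python (the project itself uses a C# enum and a switch inside Update()). It only includes the four states named in this article (Idle, Hello, GoTo, Puzzled); the remaining states are omitted. The Robot class and its log list are assumptions for illustration, and each behavior finishes in a single update() call, whereas in the game a behavior runs over many frames.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    HELLO = auto()
    GO_TO = auto()
    PUZZLED = auto()

class Robot:
    def __init__(self):
        self.state = State.IDLE
        self.log = []  # records what the robot did, so we can inspect it

    def update(self):
        """Called once per frame, in the spirit of Unity's Update()."""
        if self.state is State.IDLE:
            return  # nothing to do until an order changes the state
        if self.state is State.HELLO:
            self.log.append("move to player, face player, play hello animation")
        elif self.state is State.GO_TO:
            self.log.append("walk to the goal object")
        elif self.state is State.PUZZLED:
            self.log.append("play perplexed animation")
        # Every behavior ends by going back to Idle.
        self.state = State.IDLE

robot = Robot()
robot.state = State.HELLO
robot.update()
```

After this call the robot has logged its hello behavior and is back in the Idle state, waiting for the next order.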

Let’s define all of them:

We have now defined the behavior for each State. The magic here comes from the fact that it’s the language model that decides which State is the closest to the Player input. The utility function then sets that State.

Let’s define the Utility Function

Our action list looks like this:

  • Sentence is what will be fed to the API
  • Verb is the State
  • Noun (if any) is the object to interact with (Pillar, Cube, etc)

This utility function will select the Verb and Noun associated with the sentence having the highest similarity score with the player input text.

But first, to filter out irrelevant input text, we need a similarity score threshold. This threshold should depend on the number of different actions; since we have 9, I set it to 0.20.

For example, if I say “Look all the rabbits”, none of our possible actions is relevant. Hence, instead of choosing the action with the highest score, we’ll call the Puzzled State, which plays a perplexed animation on the robot.

If the score is higher, then we’ll get the verb corresponding to a State and the noun (goalObject) if any.

We set the state corresponding to the verb. That will activate the behavior corresponding to it.
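Put together, the selection logic could look like the following Python sketch (not the project’s C# code). The Action fields mirror the Sentence / Verb / Noun items above and the 0.20 threshold comes from the article. The function name select_action, the example actions, the BringObject verb name, and the strict “less than” comparison are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Action:
    sentence: str   # what is fed to the API
    verb: str       # the State to activate
    noun: str = ""  # the object to interact with, if any

THRESHOLD = 0.20

def select_action(actions, max_score, max_score_index):
    """Return (verb, noun) for the best action, or Puzzled if the score is too low."""
    if max_score < THRESHOLD:
        return "Puzzled", ""
    chosen = actions[max_score_index]
    return chosen.verb, chosen.noun

# Hypothetical action list; "BringObject" is an assumed verb name.
actions = [
    Action("Bring me red cube", "BringObject", "RedCube"),
    Action("Say hello", "Hello"),
    Action("Go to yellow pillar", "GoTo", "YellowPillar"),
]

verb, noun = select_action(actions, 0.77, 0)
```

With a score of 0.77 for the first sentence, the robot gets the BringObject verb and the RedCube noun; with a best score of, say, 0.12, it would switch to Puzzled instead.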

And that’s it, now we’re ready to interact with our robot!

Step 4: Let’s interact with our Robot 🤖

In this step, you just need to click on the play button in the editor. Then you can type some orders and see the results.

Don’t forget to hit enter when you’ve finished typing your order.

You may run into some bugs, especially the model loading one: if the robot stays in idle mode, wait 45 seconds, because the model may be loading for the first time. Also check the console for any error messages.

What’s Next? 🧭

How can I add more actions?

Let’s take an example:

  • Copy YellowPillar gameobject and move it
  • Change the name to GreenPillar
  • Create a new material and set it to green (RGB:)
  • Place the material on GreenPillar

Now that we’ve placed the new game object, we need to add this possibility to the sentences. Click on Jammo_Player.

In the list of actions click on the plus button and fill this new action item:

  • Sentence: Go to green column
  • Verb: GoTo
  • Noun: GreenColumn

And that’s it!

That’s all for today. With a few steps, you’ve just built a robot that’s able to perform actions based on your orders, and that’s amazing!

What you can do next is dive deeper into the code in Scripts/Final and look at the final Scene in Scenes/Final. I commented every part of it, so it should be relatively straightforward.

If you want to know more about language models and transformers check the amazing HuggingFace course on it (it’s free):

And if you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium.

And don’t forget to follow me on Medium, on Twitter, and Youtube.

See you next time,

Keep learning, stay awesome,



Thomas Simonini

Developer Advocate 🥑 at Hugging Face 🤗 | Founder Deep Reinforcement Learning class 📚