Building a smart Robot AI using Hugging Face 🤗 and Unity
A Robot obeying your orders

Today we're going to build this adorable smart robot that will perform actions based on player text input.

It uses a deep language model to understand any text input and find the most appropriate action in its list.
What's interesting about this system is that, contrary to classical game development, you don't need to hard-code every interaction. Instead, you use a language model that selects which of the robot's possible actions is the most appropriate given the user input.

To make this project, we're going to use:
- Unity Game Engine (2020.3.18f1 or later).
- The Jammo Robot asset, made by Mix and Jam.
- Hugging Face 🤗
This article is intended for people who already have some basic Unity skills. If that's not the case and you just want to interact with the robot, you can go to the project page, download it, and follow the documentation.
So let's get started!
The power of Sentence Similarity
Before diving into the implementation, we need to understand how the project works and what sentence similarity is.
How does the project work?
With this project, we want to give more freedom to the player. Instead of giving an order to the robot by just clicking a button, we want the player to interact with it through text.
The robot has a list of actions and uses a sentence similarity model that selects the closest action (if any) given the player's order.

For instance, if I write "hey grab me the red box", the robot wasn't programmed to know what "hey grab me the red box" means. But the sentence similarity model made the connection between this order and the "bring me red cube" action.

Therefore, thanks to this technique, we can build believable character AI without the tedious process of mapping by hand every possible player input to a robot response. We let the sentence similarity model do the job.
What's Sentence Similarity?
Sentence similarity is a task in which a language model, given a source sentence and a list of other sentences, computes how similar each of those sentences is to the source sentence.
For instance, if our source sentence is "Hey there!", it's very close to the sentence "Hello!".

In our case, we use all-mpnet-base-v2, a sentence transformer model. Using this model, we're able to let our robot "decide", given the input text, what the most appropriate action is. The model is already trained, so we can use it directly.
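To make the idea of "closeness" concrete, here is a minimal sketch in Python (the project's own scripts are C#; this is only an illustration). It assumes that each sentence has already been turned into a vector and that similarity is measured as the cosine between vectors, which is a common choice for sentence transformer models. The three-dimensional vectors below are made up for the example; a real model produces its own vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this illustration only.
source = [0.9, 0.1, 0.0]  # stands in for "Hey there!"
candidates = {
    "Hello!": [0.8, 0.2, 0.1],
    "Bring me the red cube": [0.0, 0.3, 0.9],
}

# Score every candidate against the source and keep the closest one.
scores = {text: cosine_similarity(source, vec) for text, vec in candidates.items()}
closest = max(scores, key=scores.get)
```

With these toy vectors, `closest` is the greeting rather than the order, which mirrors what the real model does with real sentences.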
Step 1: Select the Sentence Similarity Model
Getting Started with Hugging Face 🤗
Hugging Face hosts a lot of amazing language models and an API (Accelerated Inference API) to plug them directly into your projects.
But first, you need to create an account:

When your account is created, go to the Accelerated Inference API dashboard, click on your profile (top right-hand corner), then on API Token.

Copy the API Token. It's the key you'll need to be able to use the API.
⚠️ For security reasons, DO NOT SHARE THIS KEY. It's a private key.

Accelerated Inference API
Now that we have the API key, we can choose and try our model.
I chose the all-mpnet-base-v2 sentence similarity model, but feel free to try other sentence similarity models.
What's nice is that we can try the model directly on the website with the Hosted Inference API. Let's run a test:

We see that "fetch the red block" is close to the robot action "bring me red cube", so the model is working correctly.
Now that we've chosen this model, we need to get the API URL. To do that, click on Deploy > Accelerated Inference.

This opens a modal where you can copy the API_URL:

We're now ready to connect the API to Unity.
Step 2: Connect the Hugging Face API 🤗 to Unity
Now, we need to connect the Hugging Face model API to Unity to be able to use it.
Open Scenes/Tutorial and in Scripts/Tutorial_:

The scene looks like this: your robot is surrounded by different objects (BlueCube, RedCube, RedPillar, etc.).

In Scripts/Tutorial open HuggingFaceAPI_Tutorial.cs:

This script handles the POST request that asks the API, given a player input, to return the sentence similarity score for each robot action in the action list. From those scores, we can select the action with the highest similarity.
HFScore(): call the API
To call the API and handle the result, we use a coroutine, since a coroutine can pause its execution and resume later, and we want to wait for the API to return a response before continuing.
- First, we form the JSON for the POST request. It looks like this:

- Then we make the web request and return the response.
- The response is a string that looks like "[0.7777, 0.19, 0.01]". To work with it, we need to transform it into an array of floats. That is what ProcessResult(data) does.
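The three steps above can be sketched outside Unity as follows. This is a Python illustration, not the project's C# coroutine. It assumes the request body has the shape used by hosted sentence similarity models (a source sentence plus a list of sentences to compare) and that the key is sent as a Bearer token. MODEL_URL is an assumed URL pattern; use the API_URL you copied from the model page. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed URL pattern: replace it with the API_URL copied from the model page.
MODEL_URL = (
    "https://api-inference.huggingface.co/models/"
    "sentence-transformers/all-mpnet-base-v2"
)


def build_payload(player_input, action_sentences):
    """JSON body: one source sentence compared against the action sentences."""
    return json.dumps(
        {
            "inputs": {
                "source_sentence": player_input,
                "sentences": action_sentences,
            }
        }
    )


def build_request(api_key, player_input, action_sentences, url=MODEL_URL):
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=build_payload(player_input, action_sentences).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def hf_score(api_key, player_input, action_sentences):
    """Send the request and return the raw response text (a list of scores)."""
    request = build_request(api_key, player_input, action_sentences)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Keeping the payload construction separate from the network call makes it easy to check the JSON shape without spending API calls.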
ProcessResult(): convert result to an array of floats and find the max score and max score index.
Because the API returns a string such as "[0.77, 0.18, 0.01, …]", we can't work with it directly: we need to find the highest score and its index.
Therefore, we convert the result to an array of floats [0.77, 0.18, 0.01] and find the max score and its index. That's exactly what the ProcessResult() function does.
Thanks to this function, we update the maxScoreIndex and maxScore variables, which we'll read in the robot behavior to select the correct action.
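As a rough illustration of that conversion, here is a Python sketch (not the project's C# ProcessResult()). It assumes the response is a valid JSON array of numbers, so a JSON parser can do the string-to-floats step. The name process_result and its return shape are choices made for this example.

```python
import json


def process_result(data):
    """Parse a response such as "[0.77, 0.18, 0.01]" and locate the best score.

    Returns (scores, max_score, max_score_index).
    """
    scores = [float(value) for value in json.loads(data)]
    if not scores:
        raise ValueError("the API returned an empty score list")
    # Index of the highest score; on ties, the first one wins.
    max_score_index = max(range(len(scores)), key=scores.__getitem__)
    return scores, scores[max_score_index], max_score_index


scores, max_score, max_score_index = process_result("[0.77, 0.18, 0.01]")
```

Here the first action has the highest score, so `max_score_index` points at the first entry of the action list.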
Fill the inspector
The last step before working on our robot behaviors is to enter your API key and model URL in the inspector.
Click on the HuggingFaceAPI object in the scene and, in the inspector, update Model_url and Hf_api_key.
⚠️ It's important that you don't share your project once you've set the API key.

Step 3: Build the Robot Behavior 🤖
Now that we've connected our Unity project to the Hugging Face model API, we need to define the behavior of our robot.
The idea is that our robot has different possible actions and the choice of the actions will depend on the API output.
We first need to define the finite state machine, a simple AI where each State defines a certain behavior.
Then, we'll write the utility function that selects the State, and hence the series of actions to perform.
Let's define the State Machine
In a state machine, each state represents a behavior, for instance, moving to a column, saying hello, etc. Based on the state the agent is in, it will perform a series of actions.
In our case, we have 7 states:

The first thing we need to do is create an enum called State that contains each of the possible States.
Because we need to constantly check the state, we define the state machine in the Update() method using a switch statement where each case is a state.
For each state case, we define the behavior of our agent. For instance, in the Hello state, the robot must move towards the player, face them correctly, launch its hello animation, and then go back to the Idle state.

Let's define all of them:
We have now defined the behavior for each State. The magic here comes from the fact that it's the language model that decides which State is the closest to the player input. In the utility function, we then set this state.
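The state machine described above can be sketched like this in Python (the project itself does this in C# inside Unity's Update()). Only the states named in this article are included, not the full set of seven. The Robot class and its log list are hypothetical, and each behavior is collapsed into a single step, whereas the real robot moves and animates over many frames.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    HELLO = auto()
    GO_TO = auto()
    PUZZLED = auto()


class Robot:
    def __init__(self):
        self.state = State.IDLE
        self.goal_object = None  # target of GO_TO, if any
        self.log = []  # records what the robot did, for illustration

    def update(self):
        """Called once per frame, in the spirit of Unity's Update()."""
        if self.state is State.IDLE:
            return  # nothing to do until a new state is set
        if self.state is State.HELLO:
            self.log.append("move to player, face them, play hello animation")
        elif self.state is State.GO_TO:
            self.log.append(f"move to {self.goal_object}")
        elif self.state is State.PUZZLED:
            self.log.append("play puzzled animation")
        # Simplification: every behavior finishes at once and returns to idle.
        self.state = State.IDLE
```

Setting `robot.state = State.HELLO` and calling `robot.update()` once runs the hello behavior and leaves the robot idle again, waiting for the next order.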
Let's define the Utility Function
Our action list looks like this:

- Sentence is what will be fed to the API
- Verb is the State
- Noun (if any) is the object to interact with (Pillar, Cube, etc)
This utility function will select the Verb and Noun associated with the sentence having the highest similarity score with the player input text.
But first, to filter out a lot of strange input text, we need a similarity score threshold. This threshold must be proportional to the number of different actions; since we have 9, I set it to 0.20.
For example, if I say "Look all the rabbits", none of our possible actions is relevant. Hence, instead of choosing the action with the highest score, we'll switch to the Puzzled state, which plays a perplexed animation on the robot.

If the score is above the threshold, we get the verb corresponding to a State and the noun (goalObject), if any.
We set the state corresponding to the verb. That will activate the behavior corresponding to it.
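Put together, the selection logic might look like the following Python sketch (again an illustration, not the project's C# utility function). The Action fields mirror the Sentence, Verb, and Noun columns described above. The function name, the "Say hello" entry, and the choice to accept a score exactly equal to the threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

SCORE_THRESHOLD = 0.20  # the value the article uses for 9 actions


@dataclass
class Action:
    sentence: str  # what is sent to the API
    verb: str  # name of the State to activate
    noun: Optional[str] = None  # object to interact with, if any


def utility(actions, max_score, max_score_index, threshold=SCORE_THRESHOLD):
    """Return (verb, noun) for the best action, or Puzzled if the score is too low."""
    if max_score < threshold:
        return "Puzzled", None
    best = actions[max_score_index]
    return best.verb, best.noun


# Hypothetical action list entries, for illustration.
actions = [
    Action("Say hello", "Hello"),
    Action("Go to green column", "GoTo", "GreenColumn"),
]
```

For example, `utility(actions, 0.77, 1)` selects the GoTo verb with the GreenColumn noun, while a best score of 0.05 falls back to Puzzled.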
And that's it, now we're ready to interact with our robot!
Step 4: Let's interact with our Robot 🤖
In this step, you just need to click on the play button in the editor. You can then type some orders and see the results.
Don't forget to hit Enter when you've finished typing your order.
There may be some bugs, especially around model loading. If you see the robot staying in idle mode, wait 45 seconds, because the model may be loading for the first time, and check the console for any error messages.
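One way to handle the loading delay in code, rather than waiting by hand, is to retry the request. The Python sketch below is only an illustration under an assumption: that while the model is loading, the hosted API answers with HTTP status 503 and a JSON body that may contain an estimated_time field in seconds. Check the API documentation before relying on this. The send callable and the function name are hypothetical.

```python
import time


def call_with_retry(send, max_attempts=5, default_wait=5.0, sleep=time.sleep):
    """Call send() until it stops reporting that the model is loading.

    send is a hypothetical callable returning (status_code, parsed_json_body).
    Assumption: status 503 means "model still loading".
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        # Use the server's estimate when present, else a default wait.
        wait = default_wait
        if isinstance(body, dict) and "estimated_time" in body:
            wait = float(body["estimated_time"])
        if attempt < max_attempts - 1:
            sleep(wait)
    raise TimeoutError("the model was still loading after all attempts")
```

Passing the sleep function as a parameter keeps the helper easy to try out without actually waiting.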
What's Next?
How can I add more actions?
Let's take an example:
- Copy YellowPillar gameobject and move it
- Change the name to GreenPillar
- Create a new material and set it to green (RGB:)
- Place the material on GreenPillar
Now that we've placed the new game object, we need to add this possibility to the sentences. Click on Jammo_Player.
In the list of actions click on the plus button and fill this new action item:


- Sentence: Go to green column
- Verb: GoTo
- Noun: GreenColumn
And that's it!
That's all for today. With a few steps, you've just built a robot that's able to perform actions based on your orders. That's amazing!
What you can do next is dive deeper into the code in Scripts/Final and look at the final scene in Scenes/Final. I commented every part of it, so it should be relatively straightforward.
If you want to know more about language models and transformers, check out the amazing Hugging Face course (it's free).
And if you liked my article, please click the 👏 below as many times as you liked it, so other people will see it here on Medium.
And don't forget to follow me on Medium, Twitter, and YouTube.
See you next time,
Keep learning, stay awesome,