This is a module of my AI-powered animatronics pipeline. It takes input text and generates an audio file along with timestamped markers for each viseme change. The visemes can be translated directly into jaw movements on the animatronics.
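For example, a downstream driver might map each viseme symbol to a jaw-openness value. This is a minimal sketch only; the JAW_OPENNESS values are illustrative guesses and not part of this module:

# Hypothetical viseme -> jaw-openness mapping (0.0 = closed, 1.0 = fully open).
# The symbols match the ones emitted in the results array below; the
# openness values are illustrative, not calibrated for any real servo.
JAW_OPENNESS = {
    "T": 0.2, "t": 0.2, "s": 0.15,
    "i": 0.4, "e": 0.5, "@": 0.6,
}

def jaw_position(symbol: str) -> float:
    """Return a jaw-openness value for a viseme symbol, defaulting to closed."""
    return JAW_OPENNESS.get(symbol, 0.0)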
Currently the whole thing is powered solely by ElevenLabs. You will need an account and an API token set in your environment as ELEVENLABS_TOKEN, e.g. export ELEVENLABS_TOKEN=<your-token>.
You can run the server with ./main.py api and make a request like this:
curl -L http://127.0.0.1:5000/generate \
-X POST \
-d voiceName="[ElevenVoices] American Female Teen" \
-d text="this is a test" \
-d name="test 1"
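The same request can be made from Python; here is a minimal sketch using the requests library, with the endpoint and form fields taken from the curl example above:

import requests

resp = requests.post(
    "http://127.0.0.1:5000/generate",
    data={
        "voiceName": "[ElevenVoices] American Female Teen",
        "text": "this is a test",
        "name": "test 1",
    },
)
resp.raise_for_status()
payload = resp.json()  # same JSON structure as the response shown below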
Response:
{
"audio": "<base64 encoded hex string of the mp3 audio>",
"audioLength": 1.1493877551020408,
"emitRatio": 0.7,
"mp3File": "/home/ken/projects/ai-skeletons/phone-generation/test 1.mp3",
"outputName": "test 1",
"prompt": "this is a test",
"results": [
["0.060", "T"],
["0.100", "i"],
["0.190", "s"],
["0.250", "i"],
["0.340", "s"],
["0.380", "@"],
["0.480", "t"],
["0.530", "e"],
["0.700", "s"],
["0.810", "t"]
],
"voiceID": "FxXx1SvSMrk96HmqFCUS",
"voiceName": "[ElevenVoices] American Female Teen"
}
The results field is an array of pairs, each containing a timestamp (in seconds) and the viseme symbol active from that point, based on this table.
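As a sketch of how the markers could drive an animatronic, the loop below walks the results array in real time and fires a callback at each marker. The set_jaw callback is a hypothetical stand-in for whatever moves your jaw servo:

import time

def play_visemes(results, set_jaw):
    """Step through [timestamp, viseme] pairs in real time.

    results: the "results" array from the response, e.g. [["0.060", "T"], ...]
    set_jaw: hypothetical callback that moves the jaw for a viseme symbol
    """
    start = time.monotonic()
    for timestamp, viseme in results:
        # Sleep until this marker's offset from the start of playback.
        delay = float(timestamp) - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        set_jaw(viseme)

In practice you would start the mp3 playing at the same moment you call play_visemes so the jaw stays in sync with the audio.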
The audio can be decoded by reversing the encoding:
jq -r '.audio' test.json \
| xxd -r -p \
| base64 -d \
> test-1-decoded.mp3
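The same decoding in Python, assuming the response has been saved to test.json and mirroring the hex-then-base64 reversal above:

import base64
import json

with open("test.json") as f:
    payload = json.load(f)

# The audio field is hex-encoded base64 text, so reverse both layers:
# hex -> base64 text, then base64 -> raw mp3 bytes.
b64_text = bytes.fromhex(payload["audio"])
mp3_bytes = base64.b64decode(b64_text)

with open("test-1-decoded.mp3", "wb") as f:
    f.write(mp3_bytes)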
You can also use the tool for generation without standing up the API by running ./main.py generateFull. See the options here.