Audio Chunking Text Tags
Audio chunking text tags fix two rare, but annoying, defects in the audio analysis. When audio with very long silences is analyzed, the recognizer has a tendency to get confused and misplace results. Also, when very long audio files (usually longer than 90 seconds) are analyzed without audio chunking text tags, they must be chunked automatically and passed into the recognizer with no text. Both of these issues can significantly degrade automatic result quality. Prior to audio chunking text tags, the best workaround was to split these files into smaller, more manageable files.
That workaround didn't fit many game pipelines, so we implemented audio chunking text tags, which let you mark in the text where you would like the audio to be split. The analysis treats one file as multiple smaller chunks, but the game asset itself is left unchanged.
- Tags are encapsulated with angle brackets: < >
- Tags contain only one number: the time in seconds of the marker: <5.4>
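The tag syntax above is simple enough to check with a short script before running an analysis. The following is a minimal sketch, not part of the product: the names TAG_PATTERN, extract_tag_times, and strip_tags are hypothetical, and the pattern assumes a tag is a plain non-negative decimal number with nothing else between the angle brackets.

```python
import re

# Assumed tag shape: "<", a non-negative decimal number, ">", e.g. <5> or <5.4>.
TAG_PATTERN = re.compile(r"<(\d+(?:\.\d+)?)>")

def extract_tag_times(text):
    """Return the marker times (in seconds) in the order they appear."""
    return [float(match.group(1)) for match in TAG_PATTERN.finditer(text)]

def strip_tags(text):
    """Return the text with every chunk tag removed."""
    return TAG_PATTERN.sub("", text)

if __name__ == "__main__":
    sample = "<5.4>Hello there<7>"
    print(extract_tag_times(sample))
    print(strip_tags(sample))
```

A script like this can be handy for confirming that a line of dialogue still reads correctly once its tags are removed.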
Imagine a single audio file used in a cinematic. The audio file is 120 seconds long, but the character only speaks two short phrases: "Yes, I'm here", starting at 5.6 seconds and ending at 6.8 seconds, and "Who? Sawyer? I haven't seen him in three days.", starting at 100.2 seconds and ending at 104.3 seconds. Mark up the text as such:
<5>Yes, I'm here<7.5> <99.5>Who? Sawyer? I haven't seen him in three days.<105>
Notice the padding given on either end of the speech. If you place the marker too close to the onset of the speech, the recognizer won't give its best results. Use between 0.5 and 1.0 seconds of padding on either end of the speech for best results.
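If you already know where the speech starts and stops, the padding arithmetic can be scripted. This is only a sketch under stated assumptions: padded_markup is a hypothetical helper, the 0.7 second default is an arbitrary value inside the recommended 0.5 to 1.0 second range, and clamping to zero and to the audio duration is an assumption based on the rules for legal marker times.

```python
def padded_markup(phrase, speech_start, speech_end, padding=0.7, audio_duration=None):
    """Wrap a phrase in chunk tags placed `padding` seconds outside the speech."""
    # Never emit a negative marker time.
    start = max(0.0, speech_start - padding)
    end = speech_end + padding
    # Never emit a marker past the end of the audio, when the duration is known.
    if audio_duration is not None:
        end = min(end, audio_duration)
    # ":g" drops trailing zeros and floating-point noise, so 7.5 prints as 7.5.
    return f"<{start:g}>{phrase}<{end:g}>"

if __name__ == "__main__":
    print(padded_markup("Yes, I'm here", 5.6, 6.8))
```

With the default padding, the first phrase from the cinematic example comes out as <4.9>Yes, I'm here<7.5>, which is within the recommended range on both sides.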
Now imagine another audio file, one that has a lot of silence at the beginning, and the character's speech "Whoa! You scared me", right at the end, starting at about 15 seconds into the audio. Mark up the text as such:
<14.5>Whoa! You scared me
Notice that you don't need to mark the end of the chunk if it's near the end of the audio.
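To see how marked-up text maps onto chunks, here is a rough sketch that pairs each phrase with its surrounding tags. It is an illustration only, not the analyzer's actual logic: split_chunks is a hypothetical name, a missing closing tag is assumed to mean "the end of the audio", and a missing opening tag is assumed to mean "the start of the audio".

```python
import re

# A token is either a chunk tag such as <14.5> or a run of ordinary text.
TOKEN = re.compile(r"<(\d+(?:\.\d+)?)>|([^<]+)")

def split_chunks(markup, audio_duration):
    """Return (start, phrase, end) tuples for each tagged phrase."""
    chunks = []
    start = None   # time of the pending opening tag, if any
    phrase = None  # text seen since the opening tag, if any
    for match in TOKEN.finditer(markup):
        time_str, text = match.group(1), match.group(2)
        if time_str is not None:
            time = float(time_str)
            if phrase is None:
                start = time  # tag before any text: opens a chunk
            else:
                # Tag after text: closes the current chunk.
                chunks.append((0.0 if start is None else start, phrase, time))
                start, phrase = None, None
        elif text.strip():
            phrase = text.strip()
    if phrase is not None:
        # Assumption: an unclosed chunk runs to the end of the audio.
        chunks.append((0.0 if start is None else start, phrase, audio_duration))
    return chunks

if __name__ == "__main__":
    print(split_chunks("<14.5>Whoa! You scared me", 16.0))
```

For the example above, and assuming a 16 second file, this yields a single chunk running from 14.5 seconds to the end of the audio.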
Imagine one last audio file, where the character immediately says "Duck!", just a half second into the audio file, and stops talking at around 1 second, but there is a large amount of trailing silence with no further speech. Chances are this doesn't need to be specially marked up, but if you find a problem with a similar audio file, mark it up as such:
- Tag times should always increase through the text. The one exception is the following legal markup: <1.0>Phrase one<10.0> <10.0>Phrase Two<14.2>. The two middle tags have the same time, but because they close one phrase and open the next, that's fine.
- You should never overlap chunks.
- Don't place chunk markers with negative times.
- Don't place chunk markers with times greater than the audio duration.
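The rules above lend themselves to an automated check in a batch pipeline. The sketch below is a hypothetical validator, not a shipped tool: validate_tag_times and its messages are made up for illustration, the audio duration must be supplied by you, and equal adjacent times are always accepted without verifying that they really close one phrase and open the next.

```python
import re

# Accept an optional minus sign so negative times can be reported, not ignored.
TAG_PATTERN = re.compile(r"<(-?\d+(?:\.\d+)?)>")

def validate_tag_times(text, audio_duration):
    """Return a list of rule violations; an empty list means none were found."""
    times = [float(match.group(1)) for match in TAG_PATTERN.finditer(text)]
    problems = []
    for index, time in enumerate(times):
        if time < 0:
            problems.append(f"tag {index} has a negative time: {time:g}")
        if time > audio_duration:
            problems.append(f"tag {index} is past the end of the audio: {time:g}")
        # A time lower than its predecessor means decreasing or overlapping chunks.
        if index > 0 and time < times[index - 1]:
            problems.append(f"tag {index} goes backwards: {times[index - 1]:g} then {time:g}")
    return problems

if __name__ == "__main__":
    print(validate_tag_times("<1.0>Phrase one<10.0> <10.0>Phrase Two<14.2>", 20.0))
    print(validate_tag_times("<5>Too late<3>", 10.0))
```

Because overlapping chunks always produce a tag time that is lower than the one before it, the single ordering check covers both the "always increase" rule and the "never overlap" rule.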