In the Amplified Actors article last September, I spoke of some experiments from 2018 onward at the BBC and NHK with synthetic newsreaders. Those “experiments” are now productized and available from Virtual Human startup Hour One, with their text-driven AI anchors and newsrooms.
In my article I spoke of a multilingual newsreader experiment between the BBC and London start-up Synthesia, which allowed newsreader Matthew Amroliwala to deliver the news in Spanish, Mandarin and Hindi, as well as his native English. Amroliwala speaks only English.
The article also referred to experiments at NHK with a fully artificial presenter, although only a monolingual one.
Since the 2018 BBC experiment, Synthesia.io has become a corporate darling for internal communication, and is a tool I’ve used for Lumberjack System.
Enter Virtual Human startup Hour One, who have introduced a new tool for creating AI anchors from a text script. Synthesia does this as a slower offline render, while Hour One’s ‘reporters’ are real-time synthetic humans driven solely by text. You can explore some examples in voicebot.ai’s article.
The results look a little plastic, and we could sure use some more body movement, but it’s important to realize these are very early explorations. Synthesia’s second generation is far more natural than the first, which was still incredibly valuable for training and education purposes.
On the other hand, your presenters never tire, they’re prepared to work 24/7 without complaint or pay, and you can update them without contractual issues. Obviously not for your showcase news show, but there’s a lot of opportunity about two iterations from now.
Synthetic presenters will take over more and more of the routine, bread-and-butter staples of production because of their utter simplicity. For example, if I were to record myself for a ‘top and tail’ on-camera piece, I’d need to find somewhere to set up. Position some lights. Wire myself for audio. Set up a camera. Rehearse and record the piece.
Compare that with the synthetic workflow: select an avatar, add a background if I want one (still or video), type or paste in the text, listen to the text spoken and make any adjustments, then submit for render.
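To make that concrete, here’s a minimal sketch of what that second workflow looks like as a single API call. The endpoint, service name and payload fields are all hypothetical (this is not Synthesia’s or Hour One’s actual API), but the shape is the point: the entire ‘shoot’ reduces to an avatar choice, an optional background, and the script text.

```python
import requests

# Hypothetical endpoint and key, for illustration only.
API_URL = "https://api.example-avatar-service.com/v1/videos"
API_KEY = "YOUR_API_KEY"

# Hypothetical payload: the whole production is three fields.
payload = {
    "avatar": "presenter_01",           # pick a stock presenter
    "background": "newsroom_loop.mp4",  # optional still or video background
    "script": "Welcome back. In today's update we're looking at...",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

# Services like this typically render asynchronously: you get back a
# job ID, then poll or receive a webhook when the video is ready.
job = response.json()
print("Render submitted, job id:", job.get("id"))
```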
Example one takes an hour or more. Example two takes about 10 minutes, and I don’t leave my desk. Another 10 minutes of rendering and I have a high-quality presenter who’s a lot better looking than me, and passes for genuine almost the entire time. The cost-benefit ratio will make it compelling.
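Putting rough numbers on that claim, using the timings above and a purely assumed hourly rate:

```python
# Back-of-the-envelope comparison using the timings above.
# The hourly rate is an assumed figure for illustration only.
hourly_rate = 75.0        # assumed value of the presenter's time, $/hour

traditional_minutes = 60  # find a location, light, mic, camera, rehearse, record
synthetic_minutes = 20    # ~10 min at the desk plus ~10 min of render time

traditional_cost = hourly_rate * traditional_minutes / 60
synthetic_cost = hourly_rate * synthetic_minutes / 60

print(f"Traditional shoot: ${traditional_cost:.2f} of time")  # $75.00
print(f"Synthetic render:  ${synthetic_cost:.2f} of time")    # $25.00
print(f"Ratio: {traditional_cost / synthetic_cost:.1f}x")     # 3.0x
```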
Training, corporate communications and personalized marketing are already using them. How far they will penetrate is unpredictable because, at the other end of the spectrum, we have Epic Games’ MetaHumans, which are (mostly) human performance driven.