Although I’ve shamelessly stolen the title from Joe B (@zbutcher on Twitter) I think it does represent a shift in the way we work with our source media.
Now, before I start let me be clear. I am NOT saying timecode is unimportant. I’m NOT saying that timecode is passé and suddenly irrelevant. Timecode remains incredibly important for any tape based access.
What I am saying is that text search – or phonetic search derived from text – is becoming a highly viable, and in many ways superior, way to search and find content. Timecode’s primary role was in being able to identify any given frame from a tape by tape and frame number. There’s nothing wrong with that approach, but as humans we don’t think in “reel and Timecode”, which is why text is a superior option.
In this technology summary, I’m going to consider:
- What tools are at our fingertips right now and their relative merits,
- Why speech transcription is ultimately more valuable than phonetic search (long term),
- Developments in speech transcription, and
- Transcription technologies.
Software and tools available now
There are two broad approaches to text search in the marketplace today: those that transcribe the speech into text and those that use phonetic search. In the first category we have Adobe Premiere/Audition/Soundbooth using Autonomy’s technology to transcribe speech to text. In the second we have Avid and Boris using Nexidia’s phonetic search technology in Media Composer (for PhraseFind and ScriptSync) and Soundbite (formerly Get!) respectively. Our own prEdit edits video using a text transcript.
Phonetic search makes no attempt to understand meaning. Essentially, Nexidia’s technology “understands” what an audio waveform would look like when the text is input. Nexidia scans all the audio files ahead of search and indexes them for search. When you input a word, it estimates the waveform then finds matches for that waveform by comparing the index with the target waveform. This technique is great when you don’t have a transcript, or have a transcript without any time stamps.
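To make the idea concrete, here is a toy sketch of phonetic matching. It is emphatically not Nexidia’s algorithm, which is proprietary and works at the acoustic level. It assumes an indexing pass has already turned the audio into a list of phonemes with start times, and it uses a tiny hypothetical pronunciation table (`PRONUNCIATIONS`) in place of a real lexicon. All names in it are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical pronunciation table; a real engine would use a full lexicon
# or letter-to-sound rules rather than two hand-written entries.
PRONUNCIATIONS = {
    "timecode": ["T", "AY", "M", "K", "OW", "D"],
    "text": ["T", "EH", "K", "S", "T"],
}

# Assumed output of the indexing pass: (phoneme, start time in seconds).
# The "N" at 1.7 s stands in for a misrecognised "M" in "timecode".
TRACK: List[Tuple[str, float]] = [
    ("T", 0.0), ("EH", 0.1), ("K", 0.2), ("S", 0.3), ("T", 0.4),
    ("IH", 0.6), ("Z", 0.7), ("DH", 0.9), ("AH", 1.0),
    ("N", 1.2), ("UW", 1.3),
    ("T", 1.5), ("AY", 1.6), ("N", 1.7), ("K", 1.8), ("OW", 1.9), ("D", 2.0),
]


def phonetic_search(track: List[Tuple[str, float]], query: str,
                    threshold: float = 0.8) -> List[Tuple[float, float]]:
    """Return (start_time, score) for each window that sounds like the query."""
    target = PRONUNCIATIONS[query.lower()]
    n = len(target)
    hits = []
    # Slide a window the length of the query across the phoneme track.
    for i in range(len(track) - n + 1):
        window = track[i:i + n]
        matches = sum(1 for (ph, _), want in zip(window, target) if ph == want)
        score = matches / n
        if score >= threshold:
            hits.append((window[0][1], score))
    return hits


print(phonetic_search(TRACK, "timecode"))
print(phonetic_search(TRACK, "text"))
```

Note that the search still finds “timecode” even though one phoneme was misheard, because it scores closeness of sound rather than requiring exact text. That tolerance is the appeal of the phonetic approach, and also why it returns positions in the audio rather than words you can carry elsewhere.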
This technology is also used to align a text file (script) with the audio in audio and video files in Avid’s Media Composer, with the optional ScriptSync package. This also works well if you have a transcript of interviews and want to align the audio in the video with it. However, if you have a transcript available, working with prEdit may be faster, since you manipulate the audio and video by editing and modifying text.
Going the other way – from speech to text – provides an additional advantage in that we have meaning associated with the text, and we have a transcript that carries through production into distribution. If you want searchable text for distribution then Adobe’s professional tools are the only automated tools: phonetic search isn’t viable because it would require shipping the Nexidia engine along with the distributed media (and with their licensing model, I don’t see that happening any time soon).
However, the speech to text engine used by Adobe (licensed from Autonomy) is still a work in progress: results can be quite good but the average result is less spectacular, and sometimes completely useless. That is why Adobe have added, and strongly pushed, the ability to provide the speech transcript engine a guide script. Surprisingly, a guide script alone – even one where it is an exact transcript – does not transcribe perfectly. The best results require a trip via Adobe Story. This has the advantage of keeping all punctuation and paragraph breaks (and automatic subclipping into paragraph subclips in prEdit). For a comparison on accuracy you might be interested in Colin Brougham’s comparisons.
The big disadvantage to this approach is that it requires a transcript, the very thing most people want to automate because of the cost.
Why speech transcription is more valuable than phonetic search
Speech transcription carries meaning. Phonetic search does not carry meaning. This is an important distinction, because it means that speech transcription is valuable metadata as well as a production tool, while phonetic search is a useful production tool, but has no value as metadata away from the Nexidia engine.
Speech transcription can be carried with the media throughout its life, even into edited versions. Indeed, this is Adobe’s intent. They tend to focus on the speech transcript metadata as part of a distribution strategy more than its use in postproduction. The speech transcript metadata is carried inside media files as XMP metadata. (There are other alternatives for speech transcript metadata in distribution files, but I’ll discuss that further down.)
Speech transcription can be searched by anyone, at any stage, without needing a proprietary engine.
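As a rough illustration of how little is needed, the sketch below searches a time-stamped transcript for a phrase and reports where it starts as timecode. The `Word` structure is a hypothetical stand-in for whatever time-stamped transcript you have. It is not the XMP layout Adobe uses, and the timecode is simple non-drop-frame at an assumed 25 fps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the clip


def seconds_to_timecode(seconds: float, fps: int = 25) -> str:
    """Format a time in seconds as non-drop-frame HH:MM:SS:FF."""
    total_frames = int(round(seconds * fps))
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours = total_seconds // 3600
    minutes = (total_seconds % 3600) // 60
    secs = total_seconds % 60
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


def find_phrase(words: List[Word], phrase: str) -> List[float]:
    """Return the start time of every occurrence of the phrase."""
    target = phrase.lower().split()
    # Ignore case and surrounding punctuation when comparing words.
    tokens = [w.text.lower().strip(".,!?\"'") for w in words]
    n = len(target)
    if n == 0:
        return []
    return [words[i].start
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == target]


transcript = [
    Word("Text", 12.48), Word("is", 12.9), Word("the", 13.1),
    Word("new", 13.32), Word("timecode.", 13.6),
]

for start in find_phrase(transcript, "new timecode"):
    print(seconds_to_timecode(start))  # 00:00:13:08
```

Nothing here is proprietary: plain text plus a time per word is enough to get from a phrase you remember to a frame you can cut on.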
Speech transcription can be used to derive keywords and other expressions of meaning, which is valuable not only for automating some types of production (some types folks, only some) but extremely valuable as metadata for later finding content.
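A deliberately naive sketch of keyword derivation: count the words in a transcript, drop common “stop words”, and keep the most frequent. Real keyword extraction is more sophisticated than this, and the stop word list here is a short made-up sample, but it shows why text is such useful raw material.

```python
import re
from collections import Counter
from typing import List

# A short illustrative stop word list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "and", "are", "but", "for", "in", "is", "it", "of",
    "on", "that", "the", "this", "to", "was", "we", "with",
}


def keywords(transcript: str, top_n: int = 5) -> List[str]:
    """Return the most frequent non-trivial words in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words
                     if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


sample = ("Timecode matters for tape. But text search finds the story, "
          "and text carries meaning. Search the text.")
print(keywords(sample, top_n=2))  # ['text', 'search']
```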
Transcribed speech is the input to prEdit. Briefly, prEdit allows you to easily add metadata (log notes), break interviews into thought segments and eliminate less useful material, before searching and building a story by dragging and dropping text blocks. Editing can continue through the story building process and narration can be added (converted to voice instantly). At any time you can preview a clip or clips, the full story, or any selected part of the story before exporting to Final Cut Pro 7 or Premiere Pro CS 5.5 or later.
So, while phonetic search is a great post-production tool, transcription into real text has a wider range of uses and uses outside post-production. The only trouble is, speech transcription is expensive!
Developments in speech transcription
It’s a great time to live if you’re interested in speech transcription. The most significant developments have been hybrid computer-human approaches used by 3PlayMedia, SpeakerText and SpeechPad to reduce the cost and time of transcription. These companies particularly are focused on the need for transcription for video in distribution, but are excellent choices for transcriptions for prEdit or other postproduction needs.
Until we get fully automatic speech transcription with adequate accuracy, these will help. Even with human correction some uncommon words or names will not be transcribed accurately.
Transcription technologies
I’ve talked about Nexidia and the phonetic search technology and why my long term preference is for speech transcription, so it’s no surprise that I spend some time following what’s happening with the technology. What is interesting to me is that the two companies who are generally recognized as having the most accurate technology under the widest range of conditions are not (yet) available for postproduction work.
Google has been amassing huge numbers of examples of speech for recognition – an important first step to accurate speech recognition – via the (now defunct) Goog411 initiative. Google have been using this technology for automated captioning for YouTube videos and for voicemail transcriptions within Google Voice. As a Google Voice customer I’d say that the results are definitely much better than the attempts by Vonage (laughable) and comparable to or better than Premiere Pro’s use of Autonomy.
More information on Google’s voice recognition plans can be found in this TechCrunch article with Mike Cohen, head of Google’s voice recognition efforts. Also interesting to note is that Google is slowly opening up an API for their speech recognition efforts, starting with Chrome version 11. How open that is for third party developers to use remains unknown, but it’s an interesting direction from the search giant. If the API became open, and this is one of the two most accurate speech transcription technologies, why wouldn’t savvy developers like us start to use it and integrate it into our software?
Equally prominent in the speech transcription/recognition community is Nuance, probably best known for powering Dragon Dictate, Dragon NaturallySpeaking (and the variations) and the speech recognition component of Apple’s Siri technology. (Siri adds a lot of powerful tools on top of this basic recognition layer, but if the speech isn’t recognized accurately nothing good can come from it downstream.) There is no public API for any of Nuance’s technology (nor Siri for that matter). Nuance tends to do direct deals with companies who want to license its technology – a fairly standard practice in the technology world.
My fondest hope is that Apple’s license from Nuance will be extended to OS X and a speech recognition framework be included in OS X for developers. It’s a fond hope, not anything real!
Google and Nuance have the most accurate technologies that do not require speech training. I’m a Dictate user but that product uses training to obtain high (very high) accuracy in transcription. To be able to be accurate without needing training is what we need for interview transcriptions for post.
Tip: If you do have Dictate or any of the PC variants of Nuance’s products, one technique is to listen to the interview and speak it with your own (trained) voice. I’m not yet there, but it is possible to speak-as-you-hear (the basis of the common ear-bud presenter trick) for fast, accurate transcription.
Beyond the giants, there are many other technology companies, or open source projects, in the speech recognition field that are worth mentioning. One thing that should be noted is that most of these technologies are cloud based (as is Apple’s Siri). They work as long as they have a connection to their primary servers.
I’ll let these folk speak for themselves. Whichever technologies prevail, we’re definitely seeing a surge in the accuracy and flexibility of speech recognition, that is going to factor in post production in the coming years, beyond where we are at right now.
Speech Recognition for your iPhone application. The claim is that the technology is:
- currently unavailable from Apple APIs
- easy to add to your application
- convenient for all types of apps: games, fun and promo applications or utilities
- suitable for the iPhone and the iPod Touch
- cheaper than you might expect.
Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java™ programming language. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
Sphinx-4 started out as a port of Sphinx-3 to the Java programming language, but evolved into a recognizer designed to be much more flexible than Sphinx-3, thus becoming an excellent platform for speech research.
Speech Recognition with Javascript; speechapi.com
With speechapi.com’s javascript API, it is possible to build interesting speech-web mashups that include both speech-to-text as well as text-to-speech.
A combination of several technologies and open source tools makes this possible. In the browser, Flash is used to access the microphone and stream the audio to an RTMP server. Red5 is used because it’s a versatile media server that has the benefit of being open source and free.
Open-Source Large Vocabulary CSR Engine Julius
“Julius” is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. Based on word N-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in 60k word dictation task. Major search techniques are fully incorporated such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also modularized carefully to be independent from model structures, and various HMM types are supported such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to cope with other free modeling toolkit such as HTK, CMU-Cam SLM toolkit, etc.
I have no idea what that means either!
Scripto is a light-weight, open source tool that will allow users to contribute transcriptions to online documentary projects. The tool will include a versioning history and full set of editorial controls, so that project staff can manage public contributions. The design and development of the tool is being supported by grant funding from the National Endowment for the Humanities, Office of Digital Humanities, and the National Historical Publications and Records Commission.
From an audio or video media, transcription can automatically generate a file of subtitles, the keyword list in XML format and the entire plain text. (That’s the translation from the French – speech recognition is not limited to English!)
And interesting, although not currently speech recognition:
Currently, SoundHound’s specialty is delivering information about music. Users can sing or hum a tune into its SoundHound app and the app returns the song name, as well as other information. Last week, the company released its Hound app, which can identify when a user says the name of an artist or album.
The slightly frivolous-seeming “name that tune” aspects of SoundHound’s applications belie the seriousness of the technology and business underneath it all. SoundHound has raised $16 million in venture capital and currently has 55 full-time employees. Investors have been attracted to the company by the future potential of SoundHound’s core technology, Mohajer told me. “We own all of our technology, while a lot of other apps in this space license their core technology,” he said. “We built everything in-house and we own all of our intellectual property.”
14 replies on “Text is the New Timecode”
Phillip,
Great summary of the state of the art. I’m now clicking through all of your references.
What’s your guess as to when one of these companies will have an accurate, reliable solution that replaces the “send it out for transcription” solution that most of the industry now uses?
Mark
Given the results being delivered by Google voice search and Nuance with Siri, I’m thinking one of those is getting close with the technology. Whether they’ll make it available in a manner that’s suitable for our industry is another matter, although an open API would give developers a good starting point to build a purely software service on.
At Bisk, we have a crew working on manual word for word transcription. Now once that’s done, we utilize some automation in MacCaption which syncs up the words in the transcript with the video. As XMP metadata, is the video actually time synchronized with the text? How does that work with editing?
The Adobe tools (PPro, Audition, Soundbooth) will lock the text to the audio by using Autonomy’s analysis engine. That is then saved as XMP metadata and is available in PPro.
For editing you can highlight a section of the text, and drag that “clip” to a timeline.
Well, that sounds more elegant than the way we do it currently. Can the text be exported separately with time references? Or does it become part of your distributable files? I would imagine the latter would depend upon your output format support.
Transcriptize (AssistedEditing.com) will export the text with time references, and it does become part of the distributable files. Searching text in “Flash” was the primary purpose. Most output formats are supported.
OK, but what about other languages? For instance, Nuance (Dragon NaturallySpeaking) is the only one I know that transcribes into Spanish. Spanish is the second most spoken language in the Western world (after English). Yet not even Apple computers are able to “read out loud” a Spanish text.
So the thing I’m asking here is, when will researchers (and enterprises) in this field consider other cultures (and revenues)?
Will Adobe’s speech recognition or Google’s soon allow Spanish transcription?
Thanks
Good news, Adobe supports Spanish: “One aspect of this feature is the ability to choose a language for the speech to be analyzed: English, French, German, Italian, Korean, Japanese, or Spanish. Unfortunately, it’s not clear from within Premiere Pro that you need to download separate speech analysis models for each language in order to choose and use one of these additional languages.”
http://blogs.adobe.com/premiereprotraining/2010/12/download-speech-analysis-libraries-for-premiere-pro-cs4-or-cs5.html
As you noted so does Nuance, and so, apparently does Google. And then I found this http://en.wikipedia.org/wiki/List_of_speech_recognition_software that I had missed. (View it tomorrow or hit esc before the redirect)
Although I found the overview of current speech-to-text technologies very interesting, I think you should probably put quite a bit more nuance (heh) on the opening disclaimer:
“Now, before I start let me be clear. I am NOT saying timecode is unimportant. I’m NOT saying that timecode is passé and suddenly irrelevant. Timecode remains incredibly important for any tape based access.”
Timecode is not really all that linked to tape-based access nowadays. It is equally important in order to sync up multi-camera shoots where often the cameras can’t be guaranteed to stop and start automatically and in sync, and in general to sync (either on-set or in post) devices that aren’t necessarily using the same protocols for REC, genlock, etc. – anywhere from audio recorders and slates up to motion control rigs.
At the same time, timecode and speech recognition aren’t really that much in competition with one another, since, well, for one, not every shot has speech in it! It might not even have a text equivalent at all (e.g. because there is no script), so even in post there will be no way to auto-link text to it; it will still have to be keyworded by hand.
Maybe one day we’ll have face, object and action recognition (the technology is already available in various forms in various products), and maybe then we will have a truly automatic way to generate text encoding the meaning of each and every shot, but until then…
Anyway, I agree with the basic premise that text is a better way to search and identify content, but people have been using text to do that for many years, even if it is as basic as searching and filtering in your bins by (relevant) clip name. Additional text metadata that works on an intra-clip basis (like script-linking or transcription) is just an evolution of that concept. I am guessing few editors are truly using timecode to identify stuff anymore, in this world of bins, clips, tags, subclips and markers.
Time of Day is the new timecode! All modern devices are equipped with high accuracy TOD clocks that applications like Adobe OnLocation are using to match log notes with media files – without timecode. I actually believe, despite my disclaimer, that timecode is less and less important every day. Whereas once we could only synchronize double system on timecode (automatically – manual sync is always possible), these days much more is done with audio waveforms (FCP X and PluralEyes) than is done with timecode (and yes, we sell the product for timecode, not audio, so my best interest is in timecode).
There’s a huge advantage when the text is integral with the media, not separate. I think you make my point at the end.
The internal timestamp clocks of most newish audio & video devices are indeed good enough to not drift significantly within one shooting day, but on pretty much all of them, there is no way to jam-sync the internal clocks to frame or sub-frame accuracy (i.e., make all devices agree that it’s 12:23:03.364 at the same time), which makes them pretty useless for precise synchronization of multi-camera shoots, let alone audio or motion-control rigs. Not to mention that many of them don’t even store millisecond information in their timestamp metadata, making them even more useless for this purpose. One day, when the internal clocks of a/v devices are automatically synced via radio clock broadcasts or with the GPS clock signal, maybe you will be able to avoid using timecode, but until that day…
Audio waveform analysis is undoubtedly a very useful tool to synchronize clips, but you use that only when you don’t have access to the right tool (i.e. timecode), since it has its own set of problems, starting with the most obvious and basic one: that pesky speed of light/speed of sound difference, and working its way up through the more advanced ones (quality of the scratch tracks, etc.).
I am sure that many low(er)-end productions try to hack around these issues, especially when using equipment that doesn’t even have timecode or a way to jam-sync it externally, but it remains a hack that involves some amount of manual correction in post. For productions where you need frame-accurate automatic syncing, there is really no way to beat timecode currently, and it’s being used every day, everywhere in the world, including in tapeless shoots.
Right now there’s probably 20x more media synchronized with audio via PluralEyes than all the “pro” productions using jam-synced timecode. Trust me, I know the sales numbers for the software to do both! And the synced audio in post in FCP X has been perfect to date.
Timecode is faster but actually no more accurate. Even shoots sync’d with timecode require manual correction on some shots.
The world has changed Andrei. Not saying that TC is a bad alternative – it’s great and we have software to support it, so my business is in timecode and pushing it, but reality is, the days of TC being the only solution are long gone. Nor the most accurate solution. Technology evolves.
I have no idea where you get your numbers, but I bet there are orders of magnitude fewer Pluraleyes licenses than there are NLE licenses out there, so I am guessing that the difference is probably not quite what you think it is.
Also, I am not quite sure where these arguments-from-numbers come from (nothing particular against you, mind you – just a more general sidenote). In the film business there have always been different areas with wildly different budgets and wildly different needs and with wildly different tools that serve those budgets and those needs, and if there are millions of people doing YouTube videos one way, that doesn’t mean that their way is necessarily better or more “modern” than, say, the way a few tens of thousands of people work on Hollywood film sets.
Case in point, there is also 100 times more media shot without adequate lighting, or without adequate grip, or without adequate camera technology than media that is shot with all those things – many times, you simply have to make do with what your budget and/or shoot time allows, and live with the consequences (and less often, you forgo high(er)-end tools because of artistic intent).
That’s how it is with timecode sync – if you can’t afford to have it on set, you use a workaround, and there is nothing wrong with that, but that doesn’t make the workaround better, more accurate or more advanced. It remains a workaround, and yes, auto-sync by waveform is a better workaround than doing it by hand with a clapboard.
It is also true that timecode sync can itself cause headaches, 99 times out of 100 not because of the principle, but rather because of operator error or device/firmware bugs – the various run modes, the need for repeated jam synching to avoid drift, DF and non-DF, using the right tool as master, etc. are all things that can go wrong on a shoot, even though people should really know better by now.
But then again, I have also seen cases of people forgetting to record scratch tracks on their DSLRs, and then wondering why PluralEyes doesn’t sync the takes with their external audio recorder – well, duh! Or wondering why they have to slip everything by hand in sync after the auto-sync process, when they shot a concert with cameras spread all over a stadium – damn physics! Or wondering why their H4 recorder doesn’t stay in sync over longer takes – heh, cheap is cheap! That’s how things are, just about any workflow has the potential for errors at every step of the way.
Anyway, this is probably veering into really offtopic: the point was that I would be very happy to see a replacement for timecode that can be used to ensure consistent sync between multiple devices and that is more automated, easier to use, cross-brand compatible and just as accurate or more. This technology, alas, simply does not exist today.
You must be aware that I am a software developer who works making software for Non Linear Editors. Specifically Sync-N-Link for Final Cut Pro (7) designed to batch sync by timecode.
Pretty much if you work with double system audio and video within the high end production market and edit with FCP 7 you use our software.
I know our sales, while respectable, are a fraction of the sales of PluralEyes. I get my numbers from the source. Where are yours coming from?