In the Overview I pointed out that most of what is being written up as Artificial Intelligence (AI) is really the work of Learning Machines/Machine Learning. We learnt that Learning Machines can improve your tax deduction, do the work of a paralegal, predict court results, analyze medical scans, and much more. It seems that every day I read of yet another application.
Learning Machines are readily available to all comers, but there are ways to benefit from them without ever training one yourself.
Whether they are used for finding tax deductions, identifying skin cancers in diagnostic images, winning at poker, or calculating insurance payouts, these machines have been specifically trained for the task by showing them thousands of examples and providing feedback.
I’ll be discussing the possible application of Learning Machines to post production tasks in part three, but we don’t have to wait until we can create our own Machine Learning application to benefit.
Learning Machines have already been applied to ‘common’ tasks like speech to text, keyword extraction, image recognition, facial recognition, emotion detection, sentiment analysis, etc. and the resulting ‘smart Application Programming Interfaces (APIs)’ are available to anyone who wants to pay the (quite small) fees for use.
Let me be clear.
The technology for speech to text with accuracy equal to a human transcriber is available now from multiple vendors.
The technology for extracting keywords from that transcript is available now from multiple vendors.
The technology for recognizing images and returning keywords describing their content exists now from multiple vendors.
The technology to identify faces and link instances of the same face is available from multiple vendors.
The technology for recognizing emotion (that this is an emotionally charged moment) is available now from multiple vendors, while the ability to identify what emotion is being expressed exists from fewer sources.
I could go on, but I think those are the most directly applicable of these new “smart APIs” to post production workflows.
These technologies are available now. To anyone. For very small fees.
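Most of these services are plain web APIs: upload media or text, get structured data back. As a minimal sketch in Python (the endpoint, key and response shape here are made up, and every vendor’s details differ), sending an interview recording to a cloud speech-to-text service looks roughly like this:

```python
import requests

# Hypothetical endpoint and key; every vendor's API differs, but the
# pattern (upload media, get back timed text) is much the same.
API_URL = "https://api.example-speech.com/v1/transcribe"
API_KEY = "YOUR_API_KEY"

def transcribe(audio_path):
    """Send an audio file to a cloud speech-to-text service and return its JSON result."""
    with open(audio_path, "rb") as audio:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"audio": audio},
        )
    response.raise_for_status()
    # Assumed response shape:
    # {"transcript": "...", "words": [{"word": ..., "start": ..., "end": ...}]}
    return response.json()

result = transcribe("interview_01.wav")
print(result["transcript"])
```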
It is also a logical direction for my own Lumberjack System to be heading. At NAB 2016 we introduced Magic Keywords, which was our first use of these smart APIs; specifically one to derive keywords from transcripts. Lumberjack relies on transcript input. For the moment!
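Magic Keywords relies on one of those smart APIs rather than anything home-grown, but as a toy illustration of the underlying idea (deriving candidate keywords from transcript text), a simple word-frequency count is a reasonable mental model:

```python
import re
from collections import Counter

# A toy illustration only; not how Magic Keywords works internally.
# Count the non-trivial words in a transcript and treat the most
# frequent ones as candidate keywords.
STOPWORDS = {"the", "and", "that", "this", "with", "was", "were", "have",
             "because", "for", "you", "your", "but", "not", "are"}

def candidate_keywords(transcript, top_n=10):
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

print(candidate_keywords(
    "We shot the interview at the harbour because the harbour light "
    "was perfect for the documentary."
))
# 'harbour' comes out on top because it appears twice.
```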
The only other app I know of that is currently making use of these types of technologies in the production sphere is SpeedScriber (currently in beta, but an app I’ve been testing). SpeedScriber uses a computer service to create the basic transcript, which is then corrected (where necessary) in a well-designed editing interface.
SpeedScriber’s first pass is very accurate on the words, less so on speaker identification, but it is improving in that respect. I am aware of at least two other organizations that are planning, or working on, a SpeedScriber competitor.
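As a side note on the “corrected where necessary” workflow: most speech-to-text services return a per-word confidence score, so an interface only has to put the doubtful words in front of a human. A generic sketch (not SpeedScriber’s actual data model) of flagging those words:

```python
# Generic illustration of "machine first pass, human correction":
# flag only the words the recognizer was unsure about for review.
words = [
    {"word": "lumber", "start": 12.4, "confidence": 0.97},
    {"word": "jack",   "start": 12.8, "confidence": 0.95},
    {"word": "sistem", "start": 13.1, "confidence": 0.41},  # likely mis-heard
]

REVIEW_THRESHOLD = 0.80

for w in words:
    if w["confidence"] < REVIEW_THRESHOLD:
        print(f'Review "{w["word"]}" at {w["start"]}s '
              f'(confidence {w["confidence"]:.0%})')
```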
Until we see actual implementations, let me describe how we will be using these technologies.
- At the end of a shoot, interviews are analyzed and delivered to the editor fully transcribed, ready to search by word – a real phrase find instead of one that requires phonetic spellings to get a match.
- Key concepts are derived from the transcript and used to segment and label the transcripts with keywords, which are then used to organize the clips. Subject (keyword) based timelines are automatically generated as an editing starting point.
- For scripted production, transcripts would be used to align all takes (or angles) of the same section of script to the script. (Not unlike Avid’s implementation of Nexidia’s technology in ScriptSync, but without Avid’s ScriptSync interface.) There’s a rough sketch of this idea just after this list.
- Areas of strong emotion are highlighted, with each emotion tagged. (How useful would that be for reality TV?)
- People (talent, actors/characters) are recognized and grouped by face. Faces will need to be identified by direct entry (once), by comparison with social media, or from a cast list.
- B-roll is analyzed and identified by shot type (W, M or C) and content, with the results provided as keywords and the clips organized into bins or collections.
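To make the script-alignment item a little more concrete, here is a rough sketch of the idea (nothing to do with Avid’s or Nexidia’s actual technology): fuzzy-match each scripted line against the automatic transcript of a take, so every take of the same line can be found and grouped.

```python
import difflib

# Script lines we want to locate inside a take's automatic transcript.
script_lines = [
    "I never meant for any of this to happen.",
    "Then why did you sign the papers?",
]

take_transcript = ("i never meant for any of this to happen "
                   "then why did you sign the papers")

def best_match(line, transcript, window=12):
    """Slide a window of words over the transcript and keep the closest match."""
    words = transcript.split()
    best = (0.0, None)
    for i in range(len(words)):
        chunk = " ".join(words[i:i + window])
        score = difflib.SequenceMatcher(None, line.lower(), chunk).ratio()
        if score > best[0]:
            best = (score, i)
    return best

for line in script_lines:
    score, word_index = best_match(line, take_transcript)
    print(f"{score:.2f} @ word {word_index}: {line}")
```

A production tool would work from word timings rather than plain text, but the principle is the same: once the dialogue exists as text, lining takes up against a script is essentially string matching.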
All of these features could be implemented now. I’m sure there are people other than us working on implementing some of them, but I want to emphasize that this is just the beginning. These tools have matured dramatically in the last two years. For example, the article cited above about the Japanese insurance company using IBM Watson to calculate insurance payouts includes this:
It’s noteworthy that IBM’s Watson Explorer is being used by the insurance company in this way barely a year after the head of the Watson project stated flatly that his system wouldn’t be replacing humans any time soon.
Similarly, the difference two years made in AI music composition is dramatic.
This is a very, very quickly developing technology and we’re at the very, very beginning. When Greg and I created First Cuts back in 2008, the necessary metadata had to be entered manually. Not anymore. That editing algorithm becomes infinitely more powerful if we have automatic generation of the necessary metadata, which we (theoretically) do now.
Except that, with the advent of Learning Machines, the way we created it now seems like a backward approach!
Instead of spending hundreds of hours analyzing how I made edits (and why) and turning those summations into many interactive ‘rules of thumb’, the modern approach would be to show hundreds (or thousands) of hours of examples to a Learning Machine, and have it generate its own rules for making good edits.
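In very simplified terms, that is ordinary supervised learning: describe each candidate cut point with a few features, label it with what the human editor actually did, and let the model derive its own rules. The sketch below uses toy data and made-up features (a real system would need picture, sound and story features, and vastly more examples):

```python
from sklearn.linear_model import LogisticRegression

# Each row describes a candidate cut point in previously edited footage:
# [seconds since last cut, speaker changed (0/1), pause length in seconds].
# These features are invented for illustration only.
X = [
    [2.0, 0, 0.1],
    [8.5, 1, 0.9],
    [4.0, 0, 0.2],
    [12.0, 1, 1.4],
    [3.5, 0, 0.3],
    [9.0, 1, 0.8],
]
# 1 = the human editor cut here, 0 = they did not.
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Ask the model about a new candidate cut point.
candidate = [[10.0, 1, 1.0]]
print("Cut probability:", model.predict_proba(candidate)[0][1])
```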
How far we can take Machine Learning, and start to integrate Applied AI, is the subject of my next installment.