I first wrote about derived metadata back at the end of January 2009. Derived metadata is metadata created by computer analysis of the video source. There are now technologies for speech-to-text, meaning extraction, facial detection, facial recognition, emotion detection, image recognition, and more. One company has been accumulating these somewhat diverse technologies: Apple.
Apple license speech-to-text technology from Nuance, makers of the Dragon dictation apps. It does appear they are building an internal team, as Apple prefer to own the core technologies they use.
Through the purchase of Siri, they gained Natural Language Processing technology for deriving meaning from speech. Keyword extraction is a normal part of NLP tech. It’s worth noting that the Siri team is reportedly the largest team within Apple.
Facial detection technology has been part of Apple’s ecosystem long enough to have built-in frameworks in iOS. FCP X also uses facial detection to provide metadata, including the shot type: wide, medium, close-up (W, M, CU), etc. I contemplated how it might be used in August 2011.
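For a sense of how little code that kind of detection takes these days, here’s a minimal sketch using Core Image’s built-in CIDetector on a single frame. The file name and the shot-type thresholds are my own illustrative assumptions, not how FCP X actually classifies shots.

```swift
import Foundation
import CoreImage

// A minimal sketch, assuming a single still frame saved to disk. Core Image's
// built-in CIDetector finds faces, and the size of the largest face relative
// to the frame height gives a rough shot-type guess. The thresholds are
// illustrative assumptions, not how FCP X classifies shots.
guard let frame = CIImage(contentsOf: URL(fileURLWithPath: "frame.jpg")),
      let detector = CIDetector(ofType: CIDetectorTypeFace,
                                context: nil,
                                options: [CIDetectorAccuracy: CIDetectorAccuracyHigh])
else { fatalError("Couldn't load the frame or create the detector") }

let faces = detector.features(in: frame).compactMap { $0 as? CIFaceFeature }

if let largest = faces.max(by: { $0.bounds.height < $1.bounds.height }) {
    // A face filling most of the frame suggests a close-up; a small face, a wide.
    let ratio = largest.bounds.height / frame.extent.height
    let shotType = ratio > 0.5 ? "CU" : ratio > 0.25 ? "M" : "W"
    print("\(faces.count) face(s) found; largest suggests a \(shotType) shot")
} else {
    print("No faces found; possibly b-roll")
}
```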
In 2010 Apple purchased Polar Rose for facial recognition technology. (The difference being between “this is a face” – detection – and “this is Philip’s face” – recognition.) I wrote about this (and a lot more about derived metadata) in early 2012.
Today the Wall Street Journal is reporting that Apple have purchased Emotient for its emotion detection technology: whether you’re noticing something, whether you’re paying any attention to it, and what emotion you are displaying. There’s no indication of how they plan on using it.
Apple also have patents on image recognition, apparently as part of the Polar Rose acquisition.
UPDATE: I totally forgot GPS and mapping technology, already fairly mature and now technology Apple own. Location metadata, along with business lookup at a location, is valuable.
I should point out that Google also has most of these technologies in-house, either by internal development or via a purchase. I consider Google’s speech-to-text tech the equal of Nuance’s.
I have to say, that’s an impressive suite of metadata-deriving technologies. Speech is translated to text, keyword ranges are extracted, person detection already gives us shot type (in FCPX), people are identified and named, emotion is detected, and the content of b-roll is labelled. All without a single minute of human attention.
It’s not editing, but it’s a damned good start on organization. Add some basic string-out building algorithms like those we used in First Cuts and there’s a starting point for non-scripted shows, ready for editors to work their magic on, with more creative time and less organizational time.
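To make that concrete, here’s a rough sketch of what a string-out builder could look like once the derived metadata exists. It is not First Cuts’ actual algorithm; the type names, properties and the ordering heuristic are all my own assumptions.

```swift
import Foundation

// A rough sketch, not First Cuts' actual algorithm. Clip ranges already carry
// the derived metadata described above; the job here is just to pick the
// ranges that match a topic and lay them out as a string-out. Every type and
// property name is an illustrative assumption.

enum ShotType { case wide, medium, closeUp }

struct ClipRange {
    let clipName: String
    let start: Double          // seconds into the source clip
    let duration: Double
    let transcript: String     // from speech-to-text
    let keywords: Set<String>  // from NLP keyword extraction
    let people: Set<String>    // from facial recognition
    let shotType: ShotType     // from facial detection
}

/// Keep the ranges whose keywords touch the requested topic, then put wides
/// first to establish the scene and leave the rest in shooting order.
func buildStringOut(from ranges: [ClipRange], topic: String) -> [ClipRange] {
    let relevant = ranges.filter { $0.keywords.contains(topic) }
    let wides    = relevant.filter { $0.shotType == .wide }
    let others   = relevant.filter { $0.shotType != .wide }
    return wides + others
}
```

An editor would still reorder, trim and reject most of what comes out, but the machine has already done the logging.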