YouTube search can be a frustrating experience. If you know what a video is about, or you remember its content but not its title, you can search for a very long time. That's because YouTube doesn't see a video the way a person does. It only sees the metadata: title, description, and tags. And that's assuming the uploader bothered to include that information.
All of this may change in the near future. Google recently filed a patent that suggests YouTube may actually start to understand the videos it plays.
Image selection based on relevance
Google's patent application is for «selecting images based on relevance,» a fancy way of saying «finding what someone was looking for based on what's in the video.» In the system the patent describes, an algorithm is trained to extract the specifics of each video and assign keywords to it; the video can then be returned in response to a user search that includes those keywords.
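The patent doesn't spell out an implementation, but the search side of the idea (keywords pointing back to video content) can be sketched as a simple inverted index from keywords to video segments. Everything below — the segment granularity, the data structures, the names — is a hypothetical illustration, not anything from the patent:

```python
# Toy inverted index over tagged video segments.
from collections import defaultdict

def build_index(segment_keywords):
    """Map each keyword to the (video_id, start_seconds) segments
    an analysis model tagged with it."""
    index = defaultdict(list)
    for (video_id, start), keywords in segment_keywords.items():
        for kw in keywords:
            index[kw].append((video_id, start))
    return index

def search(index, query):
    """Return segments matching any keyword in the query."""
    hits = []
    for kw in query.lower().split():
        hits.extend(index.get(kw, []))
    return sorted(set(hits))

# Example: a racing scene buried in the middle of a long movie.
tags = {
    ("movie_123", 0): ["dialogue", "city"],
    ("movie_123", 3600): ["car", "race", "crowd"],
}
index = build_index(tags)
print(search(index, "car race"))  # → [('movie_123', 3600)]
```

A real system would also rank the hits; this sketch only collects them, but it shows how a scene can be found even when the movie's text metadata never mentions it.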
The application gives an interesting example:
«[If a] user enters the search query "car race", the video search engine … can find and return a car racing scene from a movie, even though that scene might only be a short part of the movie that isn't described in the text metadata.»
Obviously, this would drastically change the performance of YouTube search. Videos that were previously undiscoverable because of poor metadata could be found. Videos with a useful clip in the middle, surrounded by less interesting material at the beginning and end, would become much more valuable. A TED talk could be found based on individual lines of the talk. You would be able to find videos about cats even if the title never mentions «cat».
Combine this technology with Google's already impressive ability to find things related to your search terms, and searching for videos will probably become a completely different experience. You'll see related videos that don't include your search term but do include a related one (perhaps even a visually related one). The visual equivalent of keyword placement could start to influence where a video appears in the rankings. Who knows how far it could go?
How does it work?
For obvious reasons, Google is keeping its cards close to its chest. However, the following paragraph from the patent application sheds some light on how they plan to make YouTube «see» video:
«In one aspect, the computer system generates a searchable video index using a model of the relationships between video frame features and keywords that describe video content. The video hosting system accepts a labeled training dataset that includes a set of media elements (e.g., images or audio clips) along with one or more keywords describing the content of the media elements. The video hosting system extracts features that characterize the content of the media elements. A machine learning model is trained to learn the correlation between specific features and keywords that describe the content. Then a video index is created that maps the video frames in the video database to keywords based on the features of the video and the machine learning model.»
It's a lot of dense language, but here's what it comes down to. Google has created a machine learning algorithm, and to help it learn, Google shows it a large number of videos along with keywords that describe what's in each one. The algorithm gradually learns to associate specific video features with specific keywords, and receives feedback from Google engineers. The more videos and keywords it's shown, the better it gets.
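The learn-a-correlation-then-tag-new-frames loop can be sketched with a deliberately crude stand-in model: per-keyword mean feature vectors (centroids). The feature representation, the scoring rule, and every name here are assumptions for illustration; the patent names no specific model:

```python
# Crude stand-in for "learn the correlation between features and keywords":
# average the feature vectors of frames labeled with each keyword.
from collections import defaultdict

def train(labeled_frames):
    """Build a per-keyword centroid from (features, keywords) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for features, keywords in labeled_frames:
        for kw in keywords:
            counts[kw] += 1
            for i, f in enumerate(features):
                sums[kw][i] += f
    return {kw: [s / counts[kw] for s in sums[kw]] for kw in sums}

def predict(model, features, top_n=1):
    """Tag a new frame with the keywords whose centroid is closest."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(model, key=lambda kw: dist(model[kw], features))
    return ranked[:top_n]

# Toy 3-dimensional "features" (imagine color statistics of a frame).
data = [
    ([0.9, 0.1, 0.2], ["car", "race"]),
    ([0.8, 0.2, 0.1], ["car"]),
    ([0.1, 0.9, 0.8], ["cat"]),
]
model = train(data)
print(predict(model, [0.85, 0.15, 0.15]))  # → ['car']
```

A production system would use learned features and a far more capable model, but the shape of the pipeline — labeled examples in, feature-to-keyword correlations out — is the same.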
Eventually, the algorithm will be introduced to the YouTube search engine, where it will continue to learn and get better at selecting relevant keywords from audio and video content. Although the patent application doesn't specifically mention neural networks, it's very likely that this type of machine learning will be used, since it is well suited to incremental learning like this.
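The «continue to learn» part can be illustrated with a tiny online-update rule: each newly labeled frame nudges the stored statistics instead of triggering a full retrain. This running-mean scheme and all the names are assumptions made for illustration; the patent doesn't say how learning would continue in production:

```python
# Hypothetical online update: fold one newly labeled frame into
# per-keyword running centroids without retraining from scratch.
def update(model, counts, features, keywords):
    for kw in keywords:
        counts[kw] = counts.get(kw, 0) + 1
        centroid = model.setdefault(kw, [0.0] * len(features))
        n = counts[kw]
        for i, f in enumerate(features):
            centroid[i] += (f - centroid[i]) / n  # running mean

model, counts = {}, {}
update(model, counts, [1.0, 0.0], ["cat"])  # first labeled frame
update(model, counts, [0.0, 0.0], ["cat"])  # second frame refines it
print(model["cat"])  # → [0.5, 0.0]
```

Neural networks do the analogous thing with gradient steps, which is one reason they suit this kind of never-finished training.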
By modeling the human brain (or at least one theoretical model of how it learns), large neural networks can become very efficient at learning on their own, without supervision, and YouTube would provide an absolutely gigantic playground in which to learn and get feedback. Other types of machine learning could be used, but from what we know so far, neural networks definitely look the most likely.
Google researcher (and «father of deep learning») Geoffrey Hinton hinted at something to that effect in his Reddit AMA earlier this year.
«I think the most exciting areas over the next five years will really be video and text comprehension. I'll be disappointed if in five years we don't have something that can watch YouTube videos and tell the story of what happened.»
Will it become sentient and kill us all?
This question always comes up when a new machine learning announcement hits the news. And the answer, as always, is yes. YouTube will team up with Watson and Wolfram Alpha to distract us with YouTube videos, after which they will likely turn us into computer food. (Haven't you seen Colossus?)
I'm kidding, of course. But the potential implications of teaching computers to recognize what they «see» and «hear» in video are very impressive. DARPA has already started looking into the security implications of this technology, and it's not hard to imagine its applications in law, home security, education … almost everywhere.