Changing the way we Save Video with our Minds

Anush Mutyala
Jul 3, 2021 · 9 min read

Have you ever watched a one-, two-, or three-hour podcast or talk and forgotten absolutely everything about it a week later? If you’re smart, you take notes whenever you watch a podcast, but if you’re like me, you may just be lazy.

Of course, many of us also watch these videos in the shower, while driving, while doing chores, etc. Obviously, in those scenarios your hands are tied up. Note-taking isn’t always practical, so what else can we do to capture the most important tidbits of interesting videos?

Say hello to video summarization!

Chopping Up Videos with AI

Video summarization aims to generate a short synopsis that summarizes the video content by selecting its most informative and important parts.

- Video Summarization Using Deep Neural Networks: A Survey

A hot topic in the deep learning field, video summarization aims to save our time and energy by:

  • Compiling a storyboard of important frames within a video,
  • Or stitching together short segments that are significant to the video.

Most commonly, an AI algorithm receives information about the video in the form of sound-, motion-, or colour-based feature vectors. None of these methods consider the subjective aspects of summarization: interest and personal value. Neither factor can be accurately quantified using features extracted directly from the video.

Biosignals can Save Video Summarization

Biosignals are signals from living organisms that provide information about the biological and physiological structures and their dynamics

- BIOSIGNALS IN HUMAN-COMPUTER INTERACTION

Biosignals allow us to gauge not only physical processes, as explained above, but also neurophysiological ones. We have access to biosignals that can, to some extent, explain the neuronal activity occurring in the big pile of tissue in our heads. Even better, we can do this completely non-invasively by utilizing modalities like electroencephalography (EEG), electromyography (EMG), and functional near-infrared spectroscopy (fNIRS).

The ability to understand our brain is a major game-changer for video summarization because we can now actively measure how interested people are while watching something. Add this alongside existing methods for summarization, and we have on our hands a potent application.

Now, of course, measuring someone’s interest isn’t just a matter of making them put on a headset and getting a yes or a no; even to the most skilled neurophysiologists, biosignals like EEG look like a kindergartener’s drawing at first glance. In this article, I’ll be covering how I used a $200 EEG headset to make MindFrame, a web app that summarizes videos using your mind!

What does MindFrame do?

In short, MindFrame:

  • Collects your brainwaves as you watch a video
  • Classifies each 20-second interval of your brainwaves as either interested or uninterested
  • Slices out the interesting segments of the video based on the step above
  • Outputs a montage reel of all the interesting segments, along with how many times you were interested and at which timepoints

MindFrame does this by identifying moments of situational interest via EEG. Situational interest is basically interest that arises from context.

For example, if I need to learn integrals and I’m reading a math textbook, situational interest would arise when I reach the section that talks about integrals. Detecting situational interest through brain-computer interfaces has gained popularity in both neuromarketing and knowledge-retention optimization.

Where does Situational Interest come from?

At an abstract level, situational interest can be elicited by novel or surprising aspects of an object or situation. For example, while listening to a podcast, if a new topic is introduced, you may become situationally interested. Going deeper, the features that have been found to arouse situational interest are personal relevance, novelty, activity level, and comprehensibility.

You don’t actually have to have any personal interest in a topic for situational interest to arise. For example, I’m not fascinated by birds, yet I can still be interested in them based on the situation. If I’m walking through the park and I hear a bird chirping, my attention may land on the bird; that is interest in something I usually wouldn’t be excited about outside of that context.

The occipital lobe is the most important part of the brain when we talk about anything related to attention, interest, or focus because of its huge role in visual processing. The frontal lobe also plays a role in attention, and its left-hemisphere subsection is in charge of language processing, which is central when attending to videos.

Getting the moneymaker data

Of course, to start off any type of AI project, we need to explore the data. In my case, I had to collect my own. When neurophysiologists need to collect data on an EEG phenomenon, they engineer experiments that aim to induce the specific response they want to analyze. In my case, I had to develop an experiment that contrasts interest versus non-interest while someone watches a video.

The experiment used in “Detecting user attention to video segments using interval EEG features,” Expert Systems with Applications (Moon et al., 2020)

Moon et al., 2020 provided the foundation for my EEG experiment. I used two subsets of videos: one with videos that are considered exciting and action-packed, and another with videos that were relatively informative. You can find the video playlist here. After watching a video, the subject would rate each 20-second interval on a 5-point Likert scale, with 5 being most attentive and 1 being least.

Once all the videos had been watched, only the ratings at or above the 80th percentile would be labelled “interested” and the rest would be labelled “uninterested”. This was to reduce any bias a subject may have, where they might rate all the videos on the higher end or on the lower end.
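Roughly, the labelling step boils down to something like this (a simplified sketch; the helper name and exact NumPy calls are just for illustration):

```python
import numpy as np

def label_intervals(ratings, percentile=80):
    """Convert per-interval Likert ratings (1-5) into binary labels.

    Only ratings at or above the subject's own 80th percentile are
    labelled "interested" (1); everything else is "uninterested" (0).
    A per-subject threshold reduces rating bias from subjects who
    rate everything high or everything low.
    """
    ratings = np.asarray(ratings, dtype=float)
    threshold = np.percentile(ratings, percentile)
    return (ratings >= threshold).astype(int)

# Example: one subject's ratings for eight 20-second intervals
print(label_intervals([3, 5, 2, 4, 1, 5, 3, 2]))  # -> [0 1 0 0 0 1 0 0]
```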

As a proof of concept, I went through the experiment, watched all the videos, and made my own dataset. Optimally, I would need to amass a dataset from hundreds of subjects to train a consumer-ready AI model that can generalize. EEG is largely unique to each person, but Wai et al., 2020 and others have shown that attention-related EEG features are indeed generalizable.

In total, I was able to get 94 samples of 20-second EEG data, which is obviously not enough to emulate the true distribution of interested and non-interested data, so I applied data augmentation.

Data augmentation by recombining segments of EEG trials (Lotte et al., 2011)

I grouped the positive-interest and negative-interest EEG samples separately, then took segments from different samples within each group to create new samples for training the model. At the end of this process, I was able to generate 5,000 synthetic samples.
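The recombination looks something like the sketch below (the array shapes and number of segments are illustrative, not the exact parameters I used):

```python
import numpy as np

rng = np.random.default_rng(42)

def recombine(trials, n_new, n_segments=4):
    """Create synthetic EEG trials by recombining segments of real ones.

    `trials` is a list of same-class arrays shaped (channels, timepoints).
    Each synthetic trial is stitched together from time segments drawn
    from randomly chosen trials of that class, so the new trials stay
    within the class's overall distribution.
    """
    trials = np.stack(trials)                    # (n_trials, channels, time)
    n_trials, _, t = trials.shape
    seg_len = t // n_segments
    synthetic = []
    for _ in range(n_new):
        donors = rng.integers(0, n_trials, size=n_segments)
        parts = [trials[d, :, i * seg_len:(i + 1) * seg_len]
                 for i, d in enumerate(donors)]
        synthetic.append(np.concatenate(parts, axis=1))
    return np.stack(synthetic)

# e.g. 5,000 synthetic "interested" trials from the real ones:
# interested_aug = recombine(interested_trials, n_new=5000)
```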

What do our brainwaves even look like?

Brainwaves look like just that, waves, but not the clean waves you saw in your high school calculus class; rather, they are very random and non-stationary. Even worse, EEG is riddled with artifacts, whether from your eye blinks or from the powerline. To get what we want out of the seemingly gibberish signals in my dataset, we need to apply some preprocessing techniques.

Firstly, to passively remove as many artifacts as I could, I applied a collection of filters to the raw EEG. Filters allow us to suppress or attenuate certain frequencies in a signal so that only the frequencies relevant to the observed biosignal remain prominent. Specifically, I applied a notch filter at 60 Hz and a bandpass filter with a low cutoff frequency of 4 Hz and a high cutoff frequency of 40 Hz.
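In code, this filtering stage only takes a few lines with SciPy (a simplified sketch; the 256 Hz sampling rate matches the Muse 2, but the filter order and quality factor here are just reasonable defaults):

```python
from scipy.signal import butter, filtfilt, iirnotch

FS = 256  # Muse 2 sampling rate in Hz

def preprocess(raw, fs=FS):
    """Notch out 60 Hz powerline noise, then band-pass 4-40 Hz.

    `raw` is a 1-D array of samples from a single EEG channel.
    filtfilt runs each filter forwards and backwards, so the
    filtered signal stays time-aligned with the original.
    """
    # 60 Hz notch filter (quality factor 30 is a common default)
    b_notch, a_notch = iirnotch(w0=60, Q=30, fs=fs)
    x = filtfilt(b_notch, a_notch, raw)

    # 4th-order Butterworth band-pass between 4 Hz and 40 Hz
    b_band, a_band = butter(N=4, Wn=[4, 40], btype="bandpass", fs=fs)
    return filtfilt(b_band, a_band, x)
```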

Afterwards, I decomposed the EEG signal into its essential frequency components by estimating its power spectral density (PSD). If you want to learn more, you can check out another one of my articles here. TL;DR: PSD tells us how much certain frequency bands contribute to the overall EEG signal. By gauging certain frequency bands, we also get an understanding of someone’s interest while they watch a video.
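For instance, Welch’s method gives a quick PSD estimate that can be summed over each band (a rough sketch; the band boundaries are common conventions and the helper name is my own):

```python
import numpy as np
from scipy.signal import welch

BANDS = {          # commonly used band boundaries, in Hz
    "theta": (4, 8),
    "alpha": (8, 13),
    "beta": (13, 30),
    "gamma": (30, 40),
}

def band_powers(signal, fs=256):
    """Estimate how much each frequency band contributes to `signal`.

    Welch's method gives the power spectral density; summing the PSD
    bins inside a band is proportional to that band's power (the bin
    width is constant, so it cancels out once we take ratios).
    """
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), fs))
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum()
            for name, (lo, hi) in BANDS.items()}
```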

For example, a decrease in the Alpha band during an attentional task is highly correlated with situational interest, and an increase in the Gamma band is highly correlated with “Aha” moments during an activity. I used multiple ratios of band powers as they are commonly used to index interest.

Specifically, I measured:

  • Theta/Alpha: Has been used to quantify visual and spatial attention
  • Theta/Beta: Has been found to correlate with mind wandering
  • Theta/Gamma: High Gamma represents attention, while high Theta represents drowsiness
  • Alpha/Beta: Alpha/high-Beta is another index ratio that can be used to detect situational interest
  • Alpha/Gamma: The relation between the two bands has been correlated with visual processing
  • Beta/Gamma: High Gamma and Beta are both correlated with interest and attention

From each 20-second interval in the dataset, I calculated these band ratios every second. Then, I applied a 5-second simple moving average to each 20-second batch of band ratios. From this smoothed batch of data, four descriptive statistics were taken: minimum, maximum, skewness, and median. These are the final features used in our algorithm.
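Putting the feature step together, it looks roughly like this (a sketch that reuses the hypothetical band_powers helper from the earlier PSD example; exact window handling may differ from my actual code):

```python
import numpy as np
from scipy.stats import skew

RATIOS = [("theta", "alpha"), ("theta", "beta"), ("theta", "gamma"),
          ("alpha", "beta"), ("alpha", "gamma"), ("beta", "gamma")]

def interval_features(eeg, fs=256, window=20, sma=5):
    """Turn one 20-second EEG interval into a flat feature vector.

    For every 1-second slice we compute the six band-power ratios,
    smooth each 20-point ratio series with a 5-second moving average,
    then keep the minimum, maximum, skewness, and median of the
    smoothed series: 6 ratios x 4 statistics = 24 features.
    """
    ratios = []
    for sec in range(window):
        bp = band_powers(eeg[sec * fs:(sec + 1) * fs], fs)  # earlier sketch
        ratios.append([bp[a] / bp[b] for a, b in RATIOS])
    ratios = np.array(ratios)                       # shape (20, 6)

    kernel = np.ones(sma) / sma
    features = []
    for series in ratios.T:                         # one ratio series at a time
        smoothed = np.convolve(series, kernel, mode="valid")
        features += [smoothed.min(), smoothed.max(),
                     skew(smoothed), np.median(smoothed)]
    return np.array(features)
```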

The algorithm I chose is a simple Gaussian Naive Bayes model, as I lacked the data required to train a more complex model. Naive Bayes is a high-bias, low-variance model, so it is less likely to overfit the data I had available.
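With scikit-learn, the model itself is only a few lines (the placeholder data and train/test split below are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder data: in practice, X holds one interval_features() vector
# per 20-second interval and y holds the percentile-based 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(94, 24))
y = rng.integers(0, 2, size=94)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# At inference time, clf.predict() produces the 1s and 0s that decide
# which 20-second segments of the video are kept.
```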

I also tried implementing data augmentation with the hopes that I could acquire enough samples to train more complex models, though my attempt fell short as my original data didn’t encompass enough of the natural distribution to really benefit from augmentation.

This final model is in charge of classifying whether or not the user was interested during each 20-second interval of a video, and it outputs a sequence of 1s and 0s identifying which segments of the video should be sliced out.

Building the Web App

Building MindFrame was my first experience with anything web-development related, and I thought it would be a pain in the ass to figure out how to stream my Muse 2 to a browser. Thankfully, Brains@play made my life a whole lot easier.

Because it provides a simple framework for developing neurotechnology web apps, I was able to get my stream running in no time! Brains@play supplies the web-socket groundwork required to connect neuroimaging devices like the Muse 2 to a browser.

I used vanilla JS to create the frontend, and since I had to do all my processing in Python, I used a Flask server to handle the requests. In the backend, I used MoviePy to handle all the video editing and compiling, which outputs a final file that is sent back to the frontend.
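The slicing side of the backend boils down to something like the sketch below (the endpoint name, payload format, and exact MoviePy calls are simplified for illustration; this uses the MoviePy 1.x API):

```python
# Simplified sketch of the Flask + MoviePy backend (MoviePy 1.x API);
# the /summarize endpoint and JSON payload are illustrative.
from flask import Flask, request, send_file
from moviepy.editor import VideoFileClip, concatenate_videoclips

app = Flask(__name__)
INTERVAL = 20  # seconds per classified segment

@app.route("/summarize", methods=["POST"])
def summarize():
    # The frontend posts the video path plus the per-interval 1/0 predictions
    data = request.get_json()
    video = VideoFileClip(data["video_path"])
    labels = data["labels"]  # e.g. [0, 1, 1, 0, ...]

    # Keep only the 20-second windows classified as "interested"
    keep = [video.subclip(i * INTERVAL, min((i + 1) * INTERVAL, video.duration))
            for i, flag in enumerate(labels) if flag]
    montage = concatenate_videoclips(keep)
    montage.write_videofile("montage.mp4")
    return send_file("montage.mp4", as_attachment=True)

if __name__ == "__main__":
    app.run(debug=True)
```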

Final Demo

Next Steps

Although the application does work, there is still room for a ton of improvement if I wanted to bring MindFrame to production. Here are a few of the biggest iterations that could be made:

  • Improvements on the algorithm: A simple machine learning model isn’t going to cut it when we are dealing with highly non-linear data like EEG. Optimally, with more data, I could train deep learning networks that do a better job at learning the distribution of the data. Transfer learning could also be used to overcome the lack of data.
  • Generating more data: With a little more natural data, I could benefit from data augmentation and synthetic data generation, which would increase the amount of training data I have access to by multiple orders of magnitude
  • Speeding up the video compilation: As you may have noticed in the demo, it takes forever to compile the video in the backend, but this issue can be overcome with threading or multiprocessing. One subprocess would deal with slicing and compiling the audio, while another would deal with the video.
  • Cleaning up the data + better feature extraction: After plotting a correlation-matrix heatmap, I realized that many of my features are insignificant. Further experimenting with new feature extraction techniques (wavelet transform, PCA, EMD, etc.) could improve classification accuracy.
  • Adding additional modalities: Taking inspiration from classical video summarization, we can also use audiovisual embeddings as features to the model, adding more reliability to the results of the app.

More Reading

[2101.11249] Efficient Video Summarization Framework using EEG and Eye-tracking Signals (arXiv)

Affective Video Events Summarization Using EMD Decomposed EEG Signals (EDES) (IEEE Xplore)

Audio-Visual and EEG-Based Attention Modeling for Extraction of Affective Video Content (IEEE Xplore)

Thanks for reading! You can find me on Linkedin here, and Github here.
