# MP7: Shot Boundary Detection in Videos

## Overview

In this machine problem, you will use visual and audio features to develop a shot boundary detection algorithm. A shot is a continuous video segment with no significant content change between pairs of successive frames; frame pairs with a high content change are termed shot boundaries. Most existing methods detect shot boundaries by measuring the frame-to-frame content change with some distance measure and comparing the resulting distance against a predefined threshold.

## Data

In this MP, you will work with the audio and the video independently to determine the shot boundaries. The features you will work with are:

1. Video: Primarily, you are expected to use color features. Divide each image into four parts. Each pixel is represented by its r, g, b values, but you are advised to work in the normalized r, g domain: define nr = r/(r+g+b) and ng = g/(r+g+b). These are the two color components corresponding to each pixel, and the advantage of this representation is that the color is more robust to lighting variations. For each of the four parts of the image, form an 8x8-bin joint histogram over (nr, ng): both nr and ng lie between 0 and 1, so divide each range into 8 parts and count the number of pixels falling into each (nr, ng) bin. Each of the four parts of the image is then represented by an 8x8 = 64-dimensional vector, and the four such vectors combined give you a 256-dimensional vector per frame. The distance between any two consecutive frames is measured as the Euclidean distance between their 256-dimensional vectors. If the distance exceeds some threshold (which you have to choose), declare that pair of frames a shot boundary.
2. Audio: For audio, you are expected to use 12 cepstral coefficients, the energy (the sum of squares of the raw signal values), and the zero crossing rate (ZCR). Given a signal S(1), …, S(N), ZCR = sum(|sign(S(2:N)) - sign(S(1:N-1))|). Form a 14-dimensional feature vector for each window and detect shot boundaries by measuring the distance between successive windows with respect to this feature vector. In general the energy term may dominate the distance, so you may want to normalize it. Since you want to report the shot boundaries in terms of video frames, choose the (non-overlapping) window size appropriately.
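
The video pipeline above can be sketched as follows. This is a minimal NumPy sketch, not a required implementation; the function names `frame_histogram` and `shot_boundaries` and the tiny example threshold are illustrative choices, and the threshold you use on real frames must be tuned yourself.

```python
import numpy as np

def frame_histogram(frame, bins=8):
    """256-D color feature for one RGB frame (H x W x 3 uint8 array).

    The frame is split into four quadrants; each quadrant contributes
    an 8x8 joint histogram over normalized chromaticity (nr, ng).
    """
    h, w, _ = frame.shape
    rgb = frame.astype(np.float64)
    s = rgb.sum(axis=2)
    s[s == 0] = 1.0                      # avoid division by zero on black pixels
    nr = rgb[..., 0] / s                 # normalized red, in [0, 1]
    ng = rgb[..., 1] / s                 # normalized green, in [0, 1]
    feats = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            hist, _, _ = np.histogram2d(
                nr[rows, cols].ravel(), ng[rows, cols].ravel(),
                bins=bins, range=[[0, 1], [0, 1]])
            feats.append(hist.ravel())   # 8x8 = 64 counts per quadrant
    return np.concatenate(feats)         # shape (256,)

def shot_boundaries(frames, threshold):
    """Indices i such that frames i and i+1 form a shot boundary."""
    hists = [frame_histogram(f) for f in frames]
    dists = [np.linalg.norm(hists[i + 1] - hists[i])
             for i in range(len(hists) - 1)]
    return [i for i, d in enumerate(dists) if d > threshold]
```

Note that the histogram counts are raw pixel counts; you may also want to normalize each histogram by the quadrant's pixel count so the threshold does not depend on the frame resolution.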

You are given two videos, each a little over 1 minute long. For your convenience, the frames have already been extracted. You can download all the frames (one zip file per video, about 2200 frames each) and the audio files (in WAV format) from the course website (links provided as follows).