In this project we intend to implement an algorithm that automatically registers and stitches together the frames from multiple video sources to create a wide-angle or 360-degree video. This algorithm will complete the task using only the video frames (no other camera orientation data needed). The core assumption is that the images will all have been taken from approximately the same vantage point at the same time using one camera that pans from one side to the other. Image features will be found using a method known as SIFT [1], an algorithm which has been covered extensively in class. Using this stitched video, we intend to apply a tracking algorithm to discover how it performs at the overlapping regions. The tracking algorithm we intend to use implements Kernelized Correlation Filters and hence is labeled as the KCF Tracker [2]. This object tracking method was chosen as it is both scale invariant and has the ability to recover from object occlusion, all while being a model that is lightweight enough to be run in real time [3].
Tracking objects across multiple cameras has several real world applications, including surveillance cameras and entertainment. In recent years the surveillance camera industry has begun producing panoramic array security cameras. These cameras use multiple image sensors and multiple lenses contained within a single housing to simultaneously capture high resolution video across a wide field of view (sometimes up to 360 degrees). These cameras can be an efficient and cost effective solution when there is a need to cover a large, open area for surveillance. Compared to using multiple traditional cameras, just one panoramic camera may be able to view the entire area of interest. In order for the video captured by the cameras to be useful to an end-user, the separate video signals coming from the separate sensors should be combined in a way that allows the user to “pan” and “tilt” a viewport in software over a large continuous video canvas. This continuous canvas of video is achieved by stitching each frame of video from each camera. A powerful feature that has emerged in surveillance software is the ability to track the position of objects like persons, vehicles, high value items, etc. When combining this tracking technology with panoramic cameras, it is very important to characterize and understand how a real-time object tracking algorithm will perform when subjected to the various image artifacts that arise when stitching video.
Another important application that combines video stitching with object tracking is live sports camera automation. Many high school and college gymnasiums host a large number of different sporting events over the course of a year. While a few high-profile events may have the budget to afford a dedicated live television production, the vast majority of games do not have these resources. However, many games played in high school and college gyms still have fans that would like to watch from home. Especially during disruptions like the Covid-19 pandemic, it is important to provide fans with a way to enjoy these events remotely. One solution to this problem that has emerged in recent years is an automated live streaming camera system. While a static camera with a wide lens may be able to capture the action during a game, viewers usually prefer to see a more zoomed-in perspective that follows the action (e.g. panning left and right during a basketball game as players run back and forth). This system employs two or three static cameras mounted inside an enclosure, each with their optical centers placed close together, and angled to capture a different part of the playing surface. [5]
The video signals from each camera are stitched onto a single image plane, and an object detection and tracking algorithm finds the locations of players in the game. The software fits a bounding box around the players, and, using a smoothing algorithm to avoid abrupt changes, pans a cropped viewing window back and forth on the image plane to simulate the motion of a camera on a tripod following the players. This cropped view can then be live-streamed to social media platforms to allow parents and other viewers to watch the game from home. Two key advantages of this system are that no human operator is required to operate the camera, and there are no moving parts in the system, which reduces mechanical wear over the life of the system.
Our group was unable to find a proper dataset that contained split video files and ground truth bounding boxes so we made the decision to generate our own data. To accomplish this we first found a standard object tracking dataset that provided us with videos of single objects and the ground truth bounding boxes those objects were contained in. We ultimately decided on using LaSOT: Large-scale Single Object Tracking (found here) which provides over 1500 videos belonging to 70 unique object categories. The average video length is approximately 2000 frames, and each video offers a varying level of tracking difficulty by means of full or partial occlusion.
Model overview has been split up into three sections: Data Preparation, Image Stitching, and Object Tracking. In each section we will discuss the mathematical and logical reasons behind the decisions we made in the construction of our model. Additionally, we will briefly touch on how certain elements of the model were implemented in the code, but to better understand the inner workings of the entire project please visit our repository here.
In order to have videos that we could stitch back together, we first had to take each video and split it into a left and right video source. To allow for proper reconstruction, each frame of the split videos was given an overlap. As the size of the overlap increased, the number of optimal matched points between the two videos increased as well. Each frame in the split videos had a random rotation and translation applied to it in an attempt to add some noise. We quickly realized that static augmentations were not putting a great deal of pressure on our model, as the transformation matrix would only need to be calculated once for the first pair of frames. To counter this, additional rotations and translations were applied to consecutive frames. An example input and output pair are shown below.
Furthermore, we wanted to simulate the differences between camera sensors and settings by applying some color and brightness distortion to each frame for one of the videos. This created a distinct boundary between the left and right video sources that needed to be smoothed out in the section of our model dealing with image stitching.
To stitch the split frames back together we need any algorithm that is able to extract a set of matching points found within both frames. There exist many algorithms that can accomplish such a task, but we decided on using the SIFT algorithm as it offers a quick and efficient framework that we are both familiar with implementing. However, not all point pairs returned by the SIFT algorithm are going to be optimal. At its base, image stitching is an overdetermined least squares problem which attempts to solve the following equation, where h represents our homography matrix. $$ A\vec{h} = \vec{0} $$ To further constrain the problem we restrict h such that it can only have unit norm. With this new constrained least squares problem we are able to redefine our equation in the following form with the eigenvector corresponding to the smallest eigenvalue being the vector which minimizes the least squares loss. $$ A^{T}A \vec{h} = \lambda \vec{h} $$ However, minimizing squared loss does not guarantee that we have found the most optimal solution, as the SIFT algorithm is likely to return some non-optimal point pairs. Alternatively, iterating through all matched SIFT points and finding the global minimum is a computationally expensive process, so instead we implement random sampling (RANSAC).
Using the SIFT and RANSAC implementations detailed above we were able to reliably generate real-time stitched frames using the data from the split video sources. A side effect of the random rotations and homography projection was that each stitched video had large black boxes on the perimeter. While this would not be an issue for typical usage, we found the offset shifted the image enough that the ground truth bounding boxes that we needed to evaluate our model were no longer in the correct locations. To move the bounding boxes back to their correct location we apply a dynamic crop that works to remove any blank areas from the final stitched video. Shown below are examples of output videos before and after the cropping occurs.
As mentioned earlier in the Data Preparation section, in an attempt to simulate inputs from different recording devices we added random color distortion and brightness offsets to the frames in the right split video. Our final step in the image stitching process was to smooth out the overlap between the left and right frames to make the distinct boundary less apparent. This is achieved by taking an average of the pixels within the overlap. Pixels closer to certain edge of the overlap will be influenced greater by information from that image (e.g. left pixels and left image). The following videos outline the contrast between smoothed and unsmoothed frames.
Our tracking algorithm makes use of kernelized filters to follow objects from frame to frame. By initializing the tracker with the ground truth bounding box from the first frame of the original video, the algorithm is able to learn the general structure and color of the tracked object. Using this information, the algorithm is able to make predictions on the object's location by finding areas of the video that correlate most greatly with the learned filter. With each consecutive frame, the algorithm is better able to update its learned filter by using a combination of data collected from all previous and current frames [2]. Illustrated below is the general architecture of the KCF model we implemented.
Examining the figure, we are able to extract a few of the highlights and drawbacks of the model. The first major takeaway is the model's ability to store and remember previous kernel filters. Becuase of this the tracking algorithm is resistant to object occlusion and dynamic object velocities. The model does not rely on continuity, as each update looks for the tracked object on a global scale instead of locally. Additionally, this allows the model to be invariant to changes in scale when using a scale-invariant kernel function to generate the filters [3]. One of the main drawbacks of this model is its inability to track objects that have a fluid or variable shape/structure. Knowing that the kernelized filters are constructed using information only present in previous frames, it makes sense that an object undergoing a rapid transition or transformation might cause the algorithm to misidentify the tracked object. A few of the failure cases highlight this issue by failing to track a plane after it experiences a swift change in brightness and color [3].
We do not intend to empirically evaluate the stitched video frames and instead focused primarily on testing the effectiveness of the tracking algorithms. We compared the output of the tracking algorithms against the ground truth by assessing the commonly used Intersection over Union (IoU) metric. This metric takes the ratio of the intersection of the bounding boxes against their union yielding a range of [0, 1], with an optimally predicted box returning a value of 1. Shown below is the equation for calculating the IoU as well as a few examples demonstrating varying degrees of bounding box overlap.
For our tests we took 20 different videos from the aircraft category of the LaSOT dataset. Each video was clipped to include 100 frames, and the data augmentations described above were applied. We measured the IoU of every frame (except for the first frame as it will always have an IoU of 1) for each of the 20 videos and recorded the average IoU across all frames. To test the model's resilience to noise and distortion we additionally tested a wide range of hyperparameters, seeing how the performance degraded by minimizing the amount of frame overlap or by maximizing the amount of color and brighness distortion introduced.
We found that a majoirty of our tests performed very well. Performing the test measuring IoU, mentioned above, we found that the tracking algorithm achieved an average IoU of 0.700 on stitched videos. Compared to the average IoU of 0.775 recorded when running the algorithm on the original videos, this puts our model's performance within 10% of the baseline. Shown below are two videos to emphasize the similarities and differences between a successful trial and its corresponding original video. It is clear to see that the tracking algorithm performs identically on both videos for a majority of the frames.
Despite performing well on normal hyperparameter values, we were able to draw out failure cases by pushing these values to their respective extremes. For brightness, we found that adjusting that brightness of one frame by +/- 40 with respect to a frame from the other video would cause too much contrast in the overlap region. For overlap, we found that reducing the amount of overlap to 2% of the total video width became too narrow for SIFT to find enough optimal points. This lack of points resulted in the video frames stitching together incorrectly and would completely derail the tracking algorithm. Exemplified below are two example cases, each belonging to one of the two cases detailed above.
One opportunity for further research is to experiment with additional forms of distortion on the split videos. For example, the white balance of the videos can be shifted, which may simulate a mismatch between the different image sensors used. Also, more complicated manipulations of the contrast and color curves may further simulate the imbalances between different sensors.
It would also be interesting to expand this research to include different tracking algorithms. For this project we chose the KCF tracker primarily because it was well-suited for real-time tracking. However, there are many other tracking algorithms and techniques available, including some which use deep learning models to locate and follow objects. It would be interesting to experiment with these models to evaluate their performance on stitched video, and to see how their performance compares to the KCF tracker we used here.
Yanfang Li, Yaming Wang, Wenqing Huang and Zuoli Zhang, "Automatic image stitching using SIFT." 2008 International Conference on Audio, Language and Image Processing, pp. 568-571. IEEE, 2008.
Bolme, David S., J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. "Visual object tracking using adaptive correlation filters." 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 2544-2550. IEEE, 2010.
Yang, Yuebin, and Guillaume-Alexandre Bilodeau. "Multiple object tracking with kernelized correlation filters in urban mixed traffic." 2017 14th Conference on Computer and Robot Vision (CRV), pp. 209-216. IEEE, 2017.
Held, David & Thrun, Sebastian & Savarese, Silvio. (2016). "Learning to Track at 100 FPS with Deep Regression Networks." https://arxiv.org/abs/1604.01802
“Live sports production,” Synergy Sports, 14-Jan-2022. [Online]. Available: https://synergysports.com/solutions/live-sport-production/.