I spent a little time looking at machine learning work that can track all the objects in a video. (I was wondering, for fun, whether I could take video of assembling something and reproduce plans for it by automatically extracting and modeling the parts. This looks doable but would take some coding.)

# Segment Anything Model

First of all, Facebook released something called Segment Anything (SAM) in April of this year (2023). It is a set of models and a Python library that can automatically bound objects in a single still image. It will either bound what the user requests or enumerate every object in the frame, and each call can return multiple candidate masks with confidence scores, so you can keep the output fuzzy if you want. (A minimal usage sketch for both modes appears at the end of these notes.)

SAM URLs:

- https://segment-anything.com/
- https://github.com/facebookresearch/segment-anything
- https://arxiv.org/abs/2304.02643

# Video

As with other releases of this kind, there has been a flurry of attempts to extend SAM to larger domains, such as video. I briefly tried three video segmenters released in the past few months; at the moment they look like the most accessible of the readily findable cutting-edge attempts. All three have demo code that either runs online or can be run in Colab. All three use SAM to segment a single frame and then use further modeling to extend that segmentation to adjacent frames. They do this in a forward-time manner, but the underlying code is agnostic to the direction in which frames are fed (the second sketch at the end of these notes illustrates the propagation loop).

- Track-Anything (TAM) at https://github.com/gaomingqi/Track-Anything . This is a small cutting-edge video segmentation research codebase built on SAM. The research code does not wrap SAM's ability to label every object in a scene, so the demo only tracks what the user selects. That would be trivial to add given some time spent understanding the code. TAM's package tree is at Track-Anything/tracker/ and is used in Track-Anything/track_anything.py , though some manual path organization is needed. The API is simple.
- Segment-and-Track-Anything (SAM-Track) at https://github.com/z-x-yang/Segment-and-Track-Anything . This is also a small cutting-edge video segmentation research codebase built on SAM. Unlike TAM, SAM-Track does wrap SAM's ability to enumerate every object. SAM-Track's main source file is Segment-and-Track-Anything/SegTracker.py . As with TAM, a little manual path organization is needed.
- HQTrack at https://github.com/jiawen-zhu/HQTrack . This is the third small cutting-edge video segmentation research codebase I considered. The developers tout it as having the second-best score in an upcoming leaderboard release at a major conference; it is a little sketchy that they are sharing this result before its release. The codebase comes highly recommended, but I didn't end up trying it. It looks similar to SAM-Track.

None of these has matured into an installable package in the developers' codebases. I didn't look in depth, but I did find one incomplete attempt to port one of them as a ROS (Robot Operating System) package. The underlying function calls do not appear complex, however.

# Upcoming leaderboard

The upcoming leaderboard HQTrack uses for advertisement is VOTS2023 at https://www.votchallenge.net/vots2023/ . Results will be released at ICCV 2023 in Paris on October 3rd, at which point people will learn which work outcompeted HQTrack. The results paper was opened to public review on June 30th, but I did not immediately find it; it looks like, whether or not it is on the internet, the intent is to hide the results until October.
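
# Appendix: code sketches

The SAM behavior described above maps onto two entry points in the segment-anything library: a promptable predictor for segmenting what the user requests, and an automatic mask generator for enumerating every object. The sketch below shows both, assuming the standard ViT-H checkpoint from the SAM repository; the image path and the click coordinate are placeholders.

```python
# Minimal sketch of the two SAM modes, using the public segment-anything API.
# The checkpoint is downloaded separately from the SAM repository; the image
# path and click coordinate are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# Load a SAM checkpoint (ViT-H here).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# segment-anything expects RGB uint8 images.
image = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2RGB)

# Mode 1: segment what the user requests, here a single foreground click.
# multimask_output=True returns several candidate masks with scores, which is
# the "fuzzy" output mentioned above.
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel of the clicked object
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)

# Mode 2: enumerate every object in the frame.
mask_generator = SamAutomaticMaskGenerator(sam)
all_masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'area', 'bbox', ...
```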
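
All three video trackers share the same loop: SAM segments one frame, and a propagation model carries those masks to the remaining frames. The sketch below is hypothetical: MaskPropagator is not a real class from TAM, SAM-Track, or HQTrack, just a stand-in for their respective tracker objects. It only illustrates the shared pattern and why the frame order, not the model, fixes the time direction.

```python
# Hypothetical sketch of the shared mask-propagation pattern, assuming a
# tracker object with init/propagate methods. MaskPropagator does NOT
# correspond to any real class in TAM, SAM-Track, or HQTrack.
from typing import List, Protocol
import numpy as np

class MaskPropagator(Protocol):
    """Assumed interface for a video mask-propagation model."""
    def init(self, frame: np.ndarray, masks: np.ndarray) -> None: ...
    def propagate(self, frame: np.ndarray) -> np.ndarray: ...

def track_video(frames: List[np.ndarray],
                seed_masks: np.ndarray,
                tracker: MaskPropagator,
                reverse: bool = False) -> List[np.ndarray]:
    """Propagate masks produced by SAM on one frame across the whole clip.

    The loop is direction-agnostic: reverse=True feeds the frames back to
    front, so backward-in-time tracking needs no change to the model itself.
    """
    ordered = list(reversed(frames)) if reverse else list(frames)
    tracker.init(ordered[0], seed_masks)   # seed the tracker with SAM's masks
    results = [seed_masks]
    for frame in ordered[1:]:
        results.append(tracker.propagate(frame))
    # Return masks in original frame order.
    return list(reversed(results)) if reverse else results
```

In TAM's case the seed masks would come from a user-selected SAM prompt; feeding it masks from SamAutomaticMaskGenerator instead is, presumably, the trivial addition mentioned above that would let it track every object the way SAM-Track does.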