[ot][ml] Current Video Object Segmentation Models

Undescribed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Fri Sep 1 19:17:41 PDT 2023


I spent a little time looking at machine learning work that could
track all the objects in a video.

(I was wondering, for fun, whether I could use video of something
being assembled to reconstruct plans for it by automatically
extracting and modeling the parts. This looks doable, but would take
some coding.)

# Segment Anything Model

First of all, Facebook released something called Segment Anything (or
SAM) in April of this year. It is a set of models and a python
library that can automatically segment objects in a single still image.

It will either segment what the user prompts for, or enumerate every
object in the image, and each call can return several candidate masks
with confidence scores, so you can treat the output as fuzzy if you
want.

SAM urls:
- https://segment-anything.com/
- https://github.com/facebookresearch/segment-anything
- https://arxiv.org/abs/2304.02643
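
As a concrete example, here is roughly how the segment-anything
library is driven in both modes, following its README; the checkpoint
filename, image path, and point coordinate are placeholders, not
anything specific to the projects below.

```python
import cv2
import numpy as np
from segment_anything import (
    SamAutomaticMaskGenerator,
    SamPredictor,
    sam_model_registry,
)

# Load a SAM checkpoint (this is the ViT-H checkpoint name from the
# repository's model zoo; substitute whichever one you downloaded).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# SAM expects an HxWx3 RGB uint8 image.
image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)

# Mode 1: segment what the user requests, here the object under a point.
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # placeholder pixel coordinate
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,                # several candidate masks + scores
)

# Mode 2: enumerate every object in the image.
mask_generator = SamAutomaticMaskGenerator(sam)
all_masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", ...
```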

# Video

As with other prominent model releases, there has been a flurry of
attempts to extend SAM to larger domains, such as video. I briefly
tried three video segmenters released in the past few months, and
presently consider them the most accessible and easiest-to-find
cutting-edge attempts.

They all have demo code that either runs online or can be run in
colab. They all use SAM to segment a single frame, and then use
further modeling to propagate the segmentation to adjacent frames.
They do this in a forward-time manner, but the underlying code is
agnostic to which direction frames are fed in.
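
The shared pattern looks roughly like the sketch below;
segment_first_frame and propagate_mask are hypothetical stand-ins for
SAM and for each project's propagation model, not functions from any
of these codebases.

```python
from typing import Callable, Iterable, List, Optional

import numpy as np


def segment_video(
    frames: Iterable[np.ndarray],
    segment_first_frame: Callable[[np.ndarray, Optional[object]], np.ndarray],
    propagate_mask: Callable[[np.ndarray, np.ndarray], np.ndarray],
    prompt: Optional[object] = None,
) -> List[np.ndarray]:
    """Sketch of the SAM-plus-tracker pattern the three projects share.

    segment_first_frame wraps SAM (prompted, or enumerate everything);
    propagate_mask wraps whatever tracking model the project uses.
    Frames may be fed forward or backward; only adjacency matters.
    """
    it = iter(frames)
    first = next(it)

    # SAM segments one reference frame.
    mask = segment_first_frame(first, prompt)
    masks = [mask]

    # The tracker carries that mask to each adjacent frame in turn.
    for frame in it:
        mask = propagate_mask(frame, mask)
        masks.append(mask)
    return masks
```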

- Track-Anything, or TAM, at https://github.com/gaomingqi/Track-Anything
. This is a small cutting-edge video segmentation research codebase
based on SAM. The research code does not wrap SAM's ability to label
every object in a scene, so the demo only tracks what the user
selects; this would be trivial to add after spending some time
understanding the code. TAM's package tree is under
Track-Anything/tracker/ and is used by Track-Anything/track_anything.py ,
but a little manual path organization is needed to import it. The API
is simple.

- Segment-and-Track-Anything, or SAM-Track, at
https://github.com/z-x-yang/Segment-and-Track-Anything . This is also
a small cutting-edge video segmentation research codebase based on
SAM. Unlike TAM, SAM-Track does wrap SAM's ability to enumerate every
object. SAM-Track's main source file is
Segment-and-Track-Anything/SegTracker.py , and, like TAM, it needs a
little manual path organization (see the sketch after this list).

- HQTrack at https://github.com/jiawen-zhu/HQTrack . This is the third
small cutting-edge video segmentation research codebase I considered.
The developers tout it as having the second-best score on a
leaderboard whose results have not yet been released, ahead of a major
conference; it is a little sketchy that they are sharing this result
before the release. The codebase is highly recommended, but I didn't
end up trying it. It looks similar to SAM-Track.
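
For flavor, the sketch below shows roughly how SAM-Track's SegTracker
class is driven in its demo, reconstructed from memory: treat the
constructor arguments and the seg / add_reference / track method names
as assumptions that may not match the current code, and the *_args
dicts and video path as placeholders for configuration the repository
actually provides.

```python
import cv2

# Assumes the cloned repo root is on sys.path (see the note below);
# class and method names are my recollection of the demo and may differ.
from SegTracker import SegTracker

# Placeholders: the repository ships real default dicts for these.
segtracker_args, sam_args, aot_args = {}, {}, {}

tracker = SegTracker(segtracker_args, sam_args, aot_args)

cap = cv2.VideoCapture("assembly.mp4")  # placeholder input video
ok, frame = cap.read()
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# SAM enumerates every object in the first frame...
mask = tracker.seg(frame)
tracker.add_reference(frame, mask)

# ...and the tracking model propagates the masks through the rest.
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    mask = tracker.track(frame)

cap.release()
```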

None of them have matured into installable packages in the developer
codebases. I didn't look in depth, but when I briefly looked I did
find one incomplete attempt to port one of them as a ROS (robot
operating system) package. The underlying function calls do not appear
complex, however.
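
Since they don't install as packages, the simplest way to call them
from outside their repositories is to clone them and put the relevant
directories on sys.path by hand, something like the sketch below; the
clone location is a placeholder, and the subdirectory list is just the
layout described above, not a documented convention.

```python
import subprocess
import sys
from pathlib import Path

# Clone the research repo somewhere local (placeholder location).
repo = Path.home() / "src" / "Track-Anything"
if not repo.exists():
    subprocess.run(
        ["git", "clone",
         "https://github.com/gaomingqi/Track-Anything", str(repo)],
        check=True,
    )

# The "manual path organization": make the repo root and its tracker/
# subtree importable, since there is no installable package to pip
# install.
for path in (repo, repo / "tracker"):
    sys.path.insert(0, str(path))

# After this, modules such as track_anything.py at the repo root can be
# imported like ordinary modules.
```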

# Upcoming leaderboard

The upcoming leaderboard HQTrack uses for advertisement is VOTS2023,
at https://www.votchallenge.net/vots2023/ . Results will be released
at ICCV 2023 in Paris on October 3rd, at which point people will learn
which research outcompeted HQTrack. The results paper was opened to
public review on June 30th, but I did not immediately find it; it
looks like, whether it is on the internet or not, the intent is to
hide the results until October.

