[ot][notes] how to automatically tag actions in video data

Undescribed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Thu Dec 29 15:55:08 PST 2022


just tried this; it isn't hard. i followed the guide at
https://mmaction2.readthedocs.io/en/1.x/get_started.html . i pursued
this toolkit because it is used by the major new research models
(although those models aren't in it quite yet).

1. install mmengine and mmcv using mim

pip3 install -U openmim
mim install mmengine 'mmcv>=2.0.0rc1'

2. install mmaction2 from source (to quickstart with the demo data)

git clone https://github.com/open-mmlab/mmaction2.git
cd mmaction2
git checkout 1.x # or dev-1.x for cutting edge
pip3 install -v -e .
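
(optional) sanity-check the installs from a python shell; a minimal
check, assuming each package exposes the usual __version__ attribute:

import mmengine, mmcv, mmaction
print(mmengine.__version__, mmcv.__version__, mmaction.__version__)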

3. download an action recognition model. this model (tsn) is
relatively old (ECCV 2016), but it is the example in their docs, and
it is the one i just tried successfully.

mim download mmaction2 --config \
    tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb --dest .
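
mim can also be driven from python rather than the command line; a
sketch, assuming mim's documented python interface (i have not tried
this exact call myself):

from mim import download

# fetches the config .py and checkpoint .pth into the current directory
download('mmaction2',
         ['tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb'],
         dest_root='.')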

4. use the demo script and the downloaded model to print the top-5
predicted labels from a list for an mp4 video. note: add --device=cpu
if you want to run without nvidia cuda. (a python sketch of doing the
same thing programmatically follows the notes below.)

python demo/demo.py tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py \
    tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth \
    demo/demo.mp4 tools/data/kinetics/label_map_k400.txt

- the .py is the model config
- the .pth is the model weights (the larger file)
- the .txt appears to be the list of labels for it to choose from
- the two filenames must match whatever step 3 actually downloaded; if
mim saved them under the longer tsn_imagenet-pretrained-r50_... names,
pass those instead
- i tried a different .mp4 and it seemed to work despite my not
normalizing the resolution, which surprised me; the config's test
pipeline presumably resizes and crops frames before inference, which
would explain it
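
for scripting instead of the demo CLI, mmaction2 1.x exposes
init_recognizer and inference_recognizer in mmaction.apis. a minimal
sketch, which i have not verified end to end; note the attribute
holding the scores has been renamed across the 1.x prereleases:

from mmaction.apis import init_recognizer, inference_recognizer

# filenames from steps 3-4; substitute whatever mim actually downloaded
config = 'tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py'
checkpoint = 'tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth'

model = init_recognizer(config, checkpoint, device='cpu')  # or 'cuda:0'
result = inference_recognizer(model, 'demo/demo.mp4')

# result is an ActionDataSample; on the 1.x branch the demo script reads
# the per-label scores from result.pred_scores.item (a torch tensor);
# other releases may expose result.pred_score instead
scores = result.pred_scores.item.tolist()
labels = [line.strip() for line in open('tools/data/kinetics/label_map_k400.txt')]
for score, label in sorted(zip(scores, labels), reverse=True)[:5]:
    print(f'{label}: {score:.4f}')

(this is essentially what demo/demo.py does, minus the argument parsing.)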

I don't know how to do this "properly" or if all the above is correct,
but it worked for me. I like to store information once I have
successes because I never know when I might dissociate away.

The newest cutting-edge model for doing this is InternVideo, one URL
for which is https://github.com/OpenGVLab/InternVideo . The weights
were just released publicly 2 days ago, although I'm still waiting on
confirmation to access the google drive myself. It's built on
mmaction.

It looks like the latest model available in stable mmaction is
videoswin (2022); i have not tried it myself at this time. The
surrounding OpenMMLab framework can also perform many, many tasks
other than simple action recognition, such as plotting human skeleton
keypoints or filling in missing video data.

