just tried this; it isn't hard. i followed https://mmaction2.readthedocs.io/en/1.x/get_started.html . i pursued this toolkit because it is used by the major new research models (although they aren't in it quite yet).

1. install mmengine and mmcv using mim

   pip3 install -U openmim
   mim install mmengine 'mmcv>=2.0.0rc1'

2. install mmaction2 from source (to quickstart with the demo data)

   git clone https://github.com/open-mmlab/mmaction2.git
   cd mmaction2
   git checkout 1.x   # or dev-1.x for cutting edge
   pip3 install -v -e .

3. download an action recognition model. i believe this model (tsn) is relatively old (2018?), but it is their example, and it is the one i just tried successfully.

   mim download mmaction2 --config \
     tsn_imagenet-pretrained-r50_8xb32-1x1x8-100e_kinetics400-rgb --dest .

4. use the demo script and the downloaded model to select the top 5 labels from a list for an mp4 video. note: add --device=cpu if you want to run without nvidia cuda.

   python demo/demo.py tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb.py \
     tsn_r50_8xb32-1x1x8-100e_kinetics400-rgb_20220818-2692d16c.pth \
     demo/demo.mp4 tools/data/kinetics/label_map_k400.txt

- the .py file is the model config
- the .pth file is the model weights (the larger file)
- the .txt file is the list of labels the model chooses from
- i tried a different .mp4 and it seemed to work despite my not normalizing the resolution, which surprised me

I don't know whether this is the "proper" way or whether all of the above is correct, but it worked for me. I like to write information down once I have a success because I never know when I might dissociate away. The next cutting-edge model for this task is InternVideo ( https://github.com/OpenGVLab/InternVideo ). Its weights were released publicly two days ago, although I'm still waiting on confirmation to access the google drive myself. It is built on mmaction.
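To make step 4 less magic: the demo scores every class in the label file, then prints the five best. Here is a minimal sketch of that last step in plain Python. This is my own illustration, not mmaction2's actual code; the function name and the example scores are made up, but the label map really is one class name per line, so line i names class i.

```python
import math

def top5_labels(scores, labels):
    # scores: one raw model score per class, in the same order as the
    # label file; labels: the class names, one per line of that file
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # rank classes by probability and keep the five best
    ranked = sorted(zip(labels, probs), key=lambda lp: lp[1], reverse=True)
    return ranked[:5]

# the label map is plain text, so loading it is just (path from step 4):
# labels = [line.strip() for line in open("tools/data/kinetics/label_map_k400.txt")]
```

With a toy label list and scores, `top5_labels([0.1, 2.5, 7.0, -1.0], ["arm wrestling", "juggling", "surfing water", "yoga"])` ranks "surfing water" first with its softmax probability attached.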
It looks like the latest model available in stable mmaction2 is VideoSwin (2022); i have not tried it myself at this time. The surrounding framework can also perform many tasks beyond simple action recognition, such as plotting human skeleton vertices or filling in missing video data.