Abstract:
Human action recognition is a fundamental task in computer vision with applications in many fields. Deep learning methods have proven effective at learning spatiotemporal representations from data sources ranging from extracted skeletons to accelerometers and radar. This article compares three classifiers, an LSTM, a Bi-LSTM, and a Transformer, all using the same skeleton extraction method, YOLO-PoseV11. On the N-UCLA dataset, the best approach is a Bi-LSTM with a hidden size of 100, achieving an F1 score of 74.18%. Although the results do not reach the state of the art, this paper presents a comparison between LSTM-based models and the Transformer for action recognition, providing a basis for future research on Transformer networks in this field.