Human Action Anticipation

Predicting the Future: A Jointly Learnt Model for Action Anticipation

Comparison between action recognition and action anticipation

We present an action anticipation model that predicts plausible future actions by forecasting both the visual and temporal future. In contrast to current state-of-the-art methods, which first learn a model to predict future video features and then perform action anticipation using these features, the proposed framework jointly learns two tasks: future visual and temporal representation synthesis, and early action anticipation. The joint learning framework ensures that the predicted future embeddings are informative for the action anticipation task.

This approach is inspired by recent theories of how humans are able to predict actions. Recent psychology literature has shown that humans build a mental image of the future, including future actions and interactions (such as interactions between objects), before initiating muscle movements or motor controls. These representations capture both the visual and temporal information of the expected future. Mimicking this biological process, our action anticipation method jointly learns to anticipate future scene representations while predicting the future action, and outperforms current state-of-the-art methods.

Action Anticipation GAN (AA-GAN): The model receives RGB and optical flow streams as the visual and temporal representations of the given scene. Rather than utilising the raw streams, we extract a semantic representation of each stream by passing it through a pre-trained feature extractor. The two streams are then merged via an attention mechanism, which embeds these low-level feature representations in a high-level context descriptor. This context descriptor is utilised by two GANs, one for future visual representation synthesis and one for future temporal representation synthesis, and it is also used to obtain the anticipated future action. Hence, context descriptor learning is influenced by both the future representation prediction and the action anticipation task.
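The attention-based merging step can be illustrated with a minimal sketch. This is not the AA-GAN implementation: the function name `fuse_streams`, the hand-set `query` vector (which stands in for learned attention parameters) and the three-dimensional toy features are all assumptions made for illustration. It only shows the general idea of weighting two stream features and summing them into a single context vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_streams(rgb_feat, flow_feat, query):
    """Attention-weighted fusion of two stream features into one context vector.

    `query` is a hypothetical stand-in for learned attention parameters.
    """
    streams = [rgb_feat, flow_feat]
    # One attention score per stream, normalised to sum to 1.
    weights = softmax([dot(query, s) for s in streams])
    # The context vector is the weighted sum of the stream features.
    context = [
        sum(w * s[i] for w, s in zip(weights, streams))
        for i in range(len(rgb_feat))
    ]
    return context, weights


# Toy features standing in for the outputs of a pre-trained extractor.
rgb = [1.0, 0.0, 2.0]
flow = [0.0, 1.0, 2.0]
query = [0.5, 0.5, 0.0]

context, weights = fuse_streams(rgb, flow, query)
print(weights, context)
```

In a jointly trained model, the same context vector would feed both the future-representation generators and the action classifier, so gradients from both tasks would shape the attention weights.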

Forecasting Future Action Sequences with Neural Memory Networks

Future action sequence forecasting

We propose a novel neural memory network-based framework for future action sequence forecasting. This is a challenging task in which we have to consider short-term relationships within a sequence as well as relationships between sequences, in order to understand how sequences of actions evolve over time. To capture these relationships effectively, we introduce neural memory networks to our modelling scheme.

Most existing and related methods utilise LSTM (Long Short-Term Memory) networks to handle video sequence information. However, as this task relies on partial information, such methods are vulnerable to ambiguities. For instance, the observed action “wash vegetables” could lead to numerous subsequent actions such as “cut vegetables”, “put in fridge” or “peel vegetables”. Therefore, considering only the information from the observed input is not sufficient. It is essential to consider the current environment context as well as the historic behaviour of the actor, and to map long-term dependencies in order to generate more precise predictions. In our previous example, this means understanding the sequence of events preceding “wash vegetables”, and how such event sequences have progressed in the past, in order to better predict the future.
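The ambiguity in the “wash vegetables” example can be made concrete with a small counting sketch. The sequences below are invented toy data (the actions “take vegetables” and “buy vegetables” do not come from any dataset), and `next_action_counts` is a hypothetical helper, not part of the proposed model. It only shows that conditioning on more history can narrow the set of candidate next actions.

```python
from collections import Counter


def next_action_counts(sequences, context):
    """Count which action follows `context`, a tuple of consecutive actions."""
    n = len(context)
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n):
            if tuple(seq[i:i + n]) == context:
                counts[seq[i + n]] += 1
    return counts


# Invented toy history of action sequences, for illustration only.
toy_sequences = [
    ["take vegetables", "wash vegetables", "cut vegetables"],
    ["take vegetables", "wash vegetables", "peel vegetables"],
    ["buy vegetables", "wash vegetables", "put in fridge"],
    ["take vegetables", "wash vegetables", "cut vegetables"],
]

# Conditioning on the observed action alone leaves three candidates.
short_context = next_action_counts(toy_sequences, ("wash vegetables",))

# Adding the preceding action leaves a single candidate in this toy data.
long_context = next_action_counts(
    toy_sequences, ("buy vegetables", "wash vegetables")
)

print(dict(short_context))
print(dict(long_context))
```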

Proposed action sequence forecasting model

Memory networks store historical facts and, when presented with an input stimulus (a query), they generate an output based on knowledge that persists in the memory. Recent works on memory networks have shown encouraging results when mapping long-term dependencies among the stored facts, compared to using LSTMs, which map the dependencies within the input sequence. Inspired by these findings, we incorporate neural memory networks and propose a framework for generating long-term predictions for the action sequence prediction task.
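A generic memory read operation can be sketched as follows. This is a simplified, attention-style read commonly used in memory networks, not the specific memory module of the proposed framework; the name `memory_read`, the dot-product scoring and the two-dimensional toy slots are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(memory, query):
    """Return an attention-weighted summary of the stored facts.

    memory: list of slot vectors (the stored historical facts).
    query:  vector encoding the current input stimulus.
    """
    # Score each stored fact by its similarity to the query.
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    weights = softmax(scores)
    # The output is the weighted sum of memory slots.
    output = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return output, weights


# Toy memory with three stored facts.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

output, weights = memory_read(memory, query)
print(weights, output)
```

Because the output depends on every slot in the memory rather than only on the current input sequence, a read of this kind can draw on facts stored long before the query was observed.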


Furthermore, we show the significance of using two input streams, the observed frames and the corresponding action labels, which provide different information cues for our prediction task. The proposed method maps the long-term relationships among individual input sequences through separate memory modules, which enables better fusion of the salient features. Our method outperforms the state-of-the-art approaches by a large margin on two publicly available datasets: Breakfast and 50 Salads.