Not so Social GAN
We take a look at Social GAN’s pooling module, which, according to the authors, helps the model “predict socially acceptable trajectories which avoid collisions”. Surprisingly, our experiments show that models without the pooling module tend to have as many or fewer collisions than models with it. We also show that models with a Random-Pooling function match the performance of the original Social GAN models with a Max-Pooling function. We conclude that Social GAN’s pooling module may not be as effective at encouraging socially acceptable trajectories as claimed in the paper.
What is Social GAN about?
While moving, people obey a large number of unwritten common-sense rules that comply with social conventions. The ability to model these rules and use them to understand and predict human motion in complex real-world environments is valuable for the development of socially aware systems. For example, an autonomous vehicle should be able to predict the future positions of pedestrians and adjust its path to avoid collisions. The problem of trajectory prediction can be viewed as a sequence generation task, where we are interested in predicting the future trajectories of people based on their past positions. In this case, future and past positions are sequences of coordinates (x, y tuples). Following the recent successes of Recurrent Neural Networks (RNNs) for sequence prediction tasks, the authors use an RNN encoder-decoder framework to predict people’s future trajectories, given their past trajectories.
The main contributions of Social GAN are the way it models interactions between people (the pooling module) and the way it encourages diverse sample generation (the variety loss function). Next, we will take a closer look at Social GAN’s pooling module.
What’s wrong with Social GAN?
According to the authors, the pooling module causes the models to “predict socially acceptable trajectories which avoid collisions.” Surprisingly, the experiments described below contradict this claim. We will see that models without the pooling module tend to have as many or fewer collisions than models with the pooling module.
The authors provide pre-trained models with and without the pooling module (the 20V-20 and 20VP-20 models from the paper). They trained two models (prediction lengths 8 and 12) for each of the five datasets. That’s twenty models in total: ten with and ten without the pooling module. You can find these exact models here. Note that these models were pre-trained by the authors and we didn’t modify them in any way. Running the evaluation script provided by the authors on these models reproduces the results they report.
Here is one example situation in which two pedestrians walk next to each other. The prediction differs from the ground truth, but it looks socially acceptable. Note that colors differentiate pedestrians, dots are ground truth trajectories, and dashed lines are the predictions of a model:
Each situation consists of a sequence of frames, and each frame consists of the current positions (x, y coordinates) of all pedestrians in the situation. We say that there is a collision in a frame if at least two pedestrians are closer to each other than a given threshold, and that a situation contains a collision if any of its frames does. A higher threshold therefore yields more collisions, and for a high enough threshold we expect all situations to contain collisions. We are interested in a threshold for which the resulting collisions are clearly not socially acceptable. We can then use this threshold to detect predictions that are not socially acceptable.
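The definition above can be sketched in a few lines of Python. This is only an illustration under assumptions: a frame is represented as a list of (x, y) tuples, a situation as a list of frames, and the function names and toy data are hypothetical rather than taken from the authors' code.

```python
import math
from itertools import combinations


def frame_has_collision(frame, threshold):
    """True if any two pedestrians in the frame are closer than `threshold`."""
    return any(
        math.dist(a, b) < threshold
        for a, b in combinations(frame, 2)
    )


def situation_has_collision(situation, threshold):
    """True if at least one frame of the situation contains a collision."""
    return any(frame_has_collision(frame, threshold) for frame in situation)


def collision_rate(situations, threshold):
    """Percentage of situations that contain at least one collision."""
    if not situations:
        return 0.0
    hits = sum(situation_has_collision(s, threshold) for s in situations)
    return 100.0 * hits / len(situations)


# Toy data (hypothetical): two situations with two pedestrians and two frames each.
walking_apart = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(0.0, 1.0), (2.0, 1.0)],
]
crossing = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.5, 0.5), (0.55, 0.5)],  # pedestrians end up 0.05 apart
]
situations = [walking_apart, crossing]

# Sweeping the threshold shows the collision rate growing with it.
for threshold in (0.01, 0.1, 3.0):
    print(threshold, collision_rate(situations, threshold))
```

On this toy data the rate goes from 0% to 50% to 100% as the threshold grows, which mirrors the expectation that a large enough threshold flags every situation.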
For each frame we compute the Euclidean distance between every pair of pedestrians. If any of these distances is below the threshold, we classify the situation that this frame belongs to as containing a collision. The following figure shows the percentage of situations that contain collisions for different thresholds, for all of the pretrained models that the authors published (20V-20 and 20VP-20 with prediction lengths 8 and 12). Each of the four charts covers five of the twenty models that the authors provided:
Interestingly, the pooling module doesn’t seem to reduce the number of collisions. The models without the pooling module tend to have as many or fewer collisions than the models with the pooling module. This contradicts the rationale behind the pooling module, which is supposed to help avoid collisions. As expected, we see that the more crowded datasets UNIV and ZARA2 have more collisions. This makes sense: when more pedestrians pass each other there is less space, so we expect them to get closer to each other. Also as expected, the number of collisions doesn’t change significantly between the prediction lengths. If anything, a larger prediction length tends to result in more collisions, which makes sense since longer predicted trajectories offer more chances for collisions.
Next we are interested in inspecting specific situations which have been classified as having a collision. We evaluate the pretrained 20VP-20 model that is provided by the authors on ETH (prediction length 8). Let’s take a look at two arbitrarily chosen situations that were classified as having a collision (threshold = 0.1):
We see that despite the pooling module, the predicted trajectories deviate from the ground truth in a way that’s not socially acceptable. In the first example, the trajectories of the two pedestrians converge into one trajectory. In the second example, the blue and green pedestrians collide while crossing each other’s path.
As described in the paper, Social GAN uses Max-Pooling as the pooling function. To see how important a specific pooling function is, we replace Max-Pooling with Mean-Pooling (instead of selecting the maximum value, compute the mean over all values) and with Random-Pooling (instead of selecting the maximum value, randomly select one of the values). Note that we only swap out the max pooling function. We leave the rest of the pooling module as well as the training and evaluation procedures as they are.
For the different pooling modules we compute the same two error metrics as in the paper. Average Displacement Error (ADE) is the average L2 distance between the ground truth and the prediction over all predicted time steps. Final Displacement Error (FDE) is the distance between the predicted final destination and the true final destination at the last time step. We add a third metric to evaluate the social acceptability of the predicted trajectories: the percentage of situations that contain a collision for a given threshold.
The following table contains the three metrics across all datasets for the pretrained models provided by the authors (20V-20 and 20VP-20 with prediction length 8), the retrained 20VP-20 model (no modifications, simply the models from the paper retrained with exactly the same arguments as used by the authors), and the two pooling module variations (exactly the same as the original 20VP-20 models except for the swapped pooling function).
We see that none of the pooling functions improves the collision metric. Surprisingly, as measured by the ADE and FDE, the 20VP-20 model with the random pooling function matches the performance of the other models.
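To make the three pooling functions and the two displacement metrics concrete, here is a minimal pure-Python sketch. It is not the code used in the experiments. It assumes that pooling is applied element-wise across a list of per-neighbour feature vectors and that Random-Pooling draws independently for each feature. The names `pool`, `pool_features`, `ade` and `fde` are hypothetical.

```python
import math
import random


def pool(values, mode, rng=random):
    """Reduce the values of one feature across neighbours to a single number."""
    if mode == "max":
        return max(values)
    if mode == "mean":
        return sum(values) / len(values)
    if mode == "random":
        return rng.choice(values)
    raise ValueError(f"unknown pooling mode: {mode}")


def pool_features(neighbour_features, mode, rng=random):
    """Pool element-wise over a list of equal-length feature vectors (assumed layout)."""
    return [pool(list(column), mode, rng) for column in zip(*neighbour_features)]


def ade(pred, truth):
    """Average L2 distance over all predicted time steps of one trajectory."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def fde(pred, truth):
    """L2 distance between predicted and true position at the last time step."""
    return math.dist(pred[-1], truth[-1])


# Toy feature vectors for three neighbours (hypothetical values).
features = [[1.0, 5.0], [3.0, 2.0], [2.0, 8.0]]
print(pool_features(features, "max"))   # largest value per feature
print(pool_features(features, "mean"))  # average value per feature
print(pool_features(features, "random", random.Random(0)))  # one sampled value per feature

# Toy trajectories: the prediction drifts away from the ground truth.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(ade(pred, truth), fde(pred, truth))
```

Only the reduction step differs between the three variants, which matches the setup described above where the rest of the pooling module is left unchanged.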
The code for the discussed experiments can be found here.