Explainable Defense Coverage Classification in NFL Games using Deep Neural Networks

Huan Song1, Mohamad Al Jazaery1, Haibo Ding1, Lin Lee Cheong1, Jonathan Jung2, Mike Band2, Michael Chi2 and Tom Bliss2
1 Amazon ML Solutions Lab, 2 NFL Next Gen Stats (NGS)

1. Introduction

Machine learning (ML)-powered football analytics has received considerable interest in recent years [1, 2, 3, 4, 5, 6, 7], with the majority of existing analytic measures centered on offensive strategy and performance [5, 6]. In contrast, the defensive side of the game has received less attention and development. At the core of understanding and analyzing any defensive strategy is the coverage scheme, i.e., the rules and responsibilities of each defender tasked with stopping the pass.

Classifying the coverage scheme for every pass play gives teams, broadcasters and fans alike new insight into the game. The preferences of play callers become apparent through coverage scheme data; for example, Bill Belichick has used Cover 1 at a top-5 rate in five consecutive seasons. Coverage scheme classification also allows a deeper understanding of how coaches and teams continuously adjust their strategies to their opponents' strengths. For example, the Packers and Chiefs both faced significantly more man coverage through the first 11 weeks of the 2022 season than they did in 2021, after both teams traded away their leading receivers during the offseason (Davante Adams and Tyreek Hill, respectively). Finally, coverage scheme classification enables the development of new defense-oriented analytics such as the uniqueness of coverages [18]. In 2020, Brandon Staley designed the most unique set of coverages for the Rams, while that of the fired Gregg Williams was the least unique.

Manual identification of these coverages on a per-play basis is both laborious and difficult, as it requires football specialists to carefully inspect the game footage.
Thus, there is a need for an automated coverage classification model that scales effectively and efficiently to reduce cost and turnaround time. This model also needs to address the inherent ambiguity of the deployed coverage schemes, which can be difficult to grasp even for expert reviewers. For example, the defensive coaching staff will often disguise their coverages to mislead the quarterback. It is thus important to develop a model explanation method that clarifies what the machine learning model relied on to classify a coverage and how it arrived at a given conclusion.

Figure 1 below shows the location of all offensive and defensive players at the start of an example play (left) and in the middle of the same play (right). The model showed relatively low confidence in its coverage classification on this play, with the top two predictions (Cover 3 Zone and Cover 1 Man) each falling under 50 percent. The play-action fake and the defenders' reactions to it, along with the route distribution, made it harder for the model to determine whether it was man or zone coverage.

Figure 1: Example of an ambiguous play that shows the complexity of the task. Left: at the start of the play. Right: in the middle of the play. Full list of player acronyms is in Appendix.

To the best of our knowledge, ML-based coverage classification has not been fully studied. The earlier effort in [8] touched on this topic by adapting the convolutional neural network (CNN)-based Zoo solution that won the 2020 Big Data Bowl on Kaggle [17], but ignored the temporal progression of the play. Based on our analysis, this approach struggled to achieve the accuracy needed for productionization and for reducing manual review. Production readiness is defined here as achieving >95% accuracy in identifying man versus zone-type plays, as well as the ability to flag plays that require further expert review.
In this paper, we present a novel deep learning approach that significantly outperforms [8] for automatic coverage classification. Raw sensor data comprising location, speed and acceleration are collected for every player and used as inputs to an automatic coverage classification pipeline. As baselines, we use the published CNN-based model [8, 17] as well as improved versions of it that incorporate a long short-term memory (LSTM) component. We find that our proposed addition of attention layers improves classification accuracy, as these layers enable the model to learn to focus on specific aspects of a play. Further performance gains are achieved by applying label smoothing, to tackle the inherent difficulty of distinguishing the intricate coverage schemes, and a model ensemble that aggregates the decisions of multiple independently trained base models. Finally, we incorporate model explanations via play embedding analysis and gradient-based approaches. These provide confidence that the notoriously opaque deep learning model correctly captures football knowledge and aligns with human experts' understanding. The model explanations also help speed up visual review processes and bring additional insights about defense coverage schemes.

The remainder of the paper is organized as follows. We review the related work on tracking-based football analytics, coverage classification, and model explanation in Section 2. In Section 3, we present our coverage modeling and model explanation approaches. In Section 4, we describe our evaluation results on coverage classification and model explanation. In Section 5, we conclude by summarizing our approach and results and outlining planned future work.

2. Related work

2.1 Tracking-based football analytics

Football tracking data contains rich information on the game dynamics, including the player and ball location, speed, and acceleration in real time.
This rich, large-scale data has attracted multiple in-depth studies of team and individual player performance, including trajectory prediction [7], quarterback evaluation [5, 6], pass interference penalty prediction [9], receiver openness and expected gain prediction [3], and run vs. pass prediction [4]. Other published works focused on expanding the analytics capability, either with additional data sources or with improved model architecture design. In [1], the authors demonstrated the importance of incorporating charting annotations with tracking data. The authors of [2] focused on developing more advanced architecture components for feature representation learning to tackle the variable duration of events and the ordering of players. In [10], a graph neural network was developed to better capture the player interactions and their fast progression over time.

2.2 Football coverage classification

Despite the importance of analyzing and understanding defensive strategies, the topic has been investigated in only a few works so far. Dutta et al. [11] developed an unsupervised learning approach to group each player's pass coverage into the high-level man vs. zone categories. The approach from [12] focused on team-wide defense coverages, but is based only on vision data. The most relevant work to ours is [8], where B. Baldwin developed a convolutional neural network to identify eight defensive coverage schemes. However, only a single frame from each play is used for the identification. The temporal progression of the player locations and interactions contains critical information about the coverage scheme, and relying on the static features of a single frame could significantly limit the predictive power. In this paper, we design and describe new architectural components that tackle the temporal modeling challenge and beyond, leading to a performant classification model.
2.3 Model explanation for sports analytics

Although deep neural network models have achieved remarkable results in various sports analytics problems, their black-box nature prevents interpretation of how they reach their conclusions. Explainability, however, is critical for 1) extracting additional insights on the data and the predictive task, 2) verifying that the model correctly captured the related sport knowledge, and 3) indicating when human experts should be brought into the loop to resolve prediction issues.

The explainability of sports analytics models has been studied only recently. In [13], interpretable decision tree-based models were developed along with neural network models for football pass vs. rush prediction, to study how much of the accuracy of deep neural networks (DNNs) they can retain. A case study on outcome prediction of volleyball matches was conducted in [14], using different explanation approaches including Boolean Rule Column Generation, ProtoDash, and SHAP (SHapley Additive exPlanations). For baseball predictions, [15] used Shapley values to obtain both local and global feature importance for batter vs. pitcher plate appearances (PA). [16] leveraged LIME (Local Interpretable Model-agnostic Explanations) for NBA gameplay predictions and uncovered insights into what leads to the success of a given NBA team. These works focused on the explanation of high-level statistical features such as players' historical performance. Our work provides explanations at two levels: at the global level, it discovers important samples of interest for manual review; and, for the first time, at the instance level, it uncovers the leading evidence in the fine-grained play tracking data.

3. Task Definition, Data, and Methods

3.1 Task Definition

We define the defensive coverage classification problem as a multi-class classification task, with three types of man coverage (where each defensive player covers a certain offensive player) and five types of zone coverage (where each defensive player covers a certain area of the field). These eight classes are depicted in Figure 2 below: Cover 0 Man, Cover 1 Man, Cover 2 Man, Cover 2 Zone, Cover 3 Zone, Cover 4 Zone, Cover 6 Zone and Prevent (also a zone coverage). A multitude of information over time must be accounted for to identify the correct coverage, including the way defenders line up before the snap, their adjustments to offensive player movement once the ball is snapped, coverage disguises, and even blown coverage assignments.

Figure 2. Defensive coverage types considered in our classification task. Circles in blue are the defensive players laid out in a particular type of coverage; circles in red are the offensive players. Full list of player acronyms is in Appendix.

Figure 3. Player tracking data illustration on the snapshots of the 1st frame (left) and the 10th frame (right) of a Cover 1 Man play. A human reviewer visually inspects the entire play, taking into account a multitude of interactions and positions, before making the final determination that this is a Cover 1 Man play. Full list of player acronyms is in Appendix.

The complexity and time-dependency of correctly identifying a coverage are illustrated in Figure 3, which shows two timed snapshots of a Cover 1 Man play. The offensive players are depicted in red, and the defensive players in blue. The letters within the blue and red circles denote each player's position on the field.
To correctly determine the coverage as Cover 1 Man, the human reviewer or the model needs to account for 1) the interaction between the wide receivers (WR) and cornerbacks (CB), 2) the interaction between the running back (RB) and linebackers (OLB, ILB), and 3) the location of the safety in the middle of the field (SS) as a single-high safety patrolling the deep middle area, over the duration of the play.

3.2 Data

Game tracking data is captured at 10 samples per second and includes the player location, speed, acceleration and orientation. It is available from NFL's Next Gen Stats for every player and every play from 2018 to 2021. We use the 2018-2020 seasons for model training and validation, and the 2021 season for model evaluation. Each season consists of around 17,000 plays. Initial data cleaning was applied to remove noise introduced by sensor errors.

For model training, we use the tracking data and the manually annotated coverage labels. We plot the coverage class distribution and its change over seasons in Figure 4. The data shows an unbalanced distribution over the classes, where Cover 1 Man and Cover 3 Zone are dominant and the Prevent class is in the minority. This is to be expected: Cover 1 Man and Cover 3 Zone are the two base coverages in modern football, and Prevent coverage is a situational play call mostly saved for end-of-regulation situations. Additionally, the distribution over the seasons shows that the popularity of man-type coverages generally decreased season by season from 2018 to 2021 relative to zone-type coverages.

Figure 4. Coverage class distribution over 2018-2021 seasons.

Figure 5. The explainable coverage classification framework, starting with inputs from the top of the sketch. Detailed information about the model is in Section 3.5 and about the explanations in Section 3.6.

3.3 Explainable Coverage Classification Framework

Figure 5 illustrates our overall modeling framework, with the input of player tracking data and coverage labels starting at the top of the figure. Given the input, we first conduct feature engineering to construct the player pair-wise relative features, similar to [8], and then use a convolutional neural network (CNN) to model the complex player interactions, similar to the Kaggle Zoo solution [17]. Unlike [8] and [17], we apply a self-attention module that learns to aggregate the frame embeddings so as to focus on the most critical time steps, and an ensemble model that pools the decisions made by each model individually. The pooled decision is the output coverage classification.

In addition, we develop a comprehensive model explanation method based on the learned play embeddings to provide both global and instance explanations. Global explanation uses embedding analysis to uncover potentially problematic plays for manual review, whereas instance explanation uses gradient-based CNN explanation to highlight the most critical player interactions leading up to the identified coverage. In the next sections, we describe the feature and data engineering (Section 3.4), the CNN-attention model architecture (Section 3.5), and the model explanation methods (Section 3.6).

3.4 Data and Feature Engineering

Figure 6. Data processing x, y definition shown on football field. Player raw features including the location, speed, acceleration etc. are decomposed onto the two axes.

Algorithm 1. Play trimming algorithm.

We perform data processing steps similar to those described in [8] and [17]: decomposing raw features onto the x and y axes (as defined in Figure 6), unifying all play directions to left-to-right, and augmenting the y-axis location during training using random flipping with a probability of 0.5.
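
As a rough illustration of the direction unification and flip augmentation described above, the following NumPy sketch assumes tracking coordinates in yards on a 120 x 53.3 field (x along the field length including end zones, y across its width). The constants, function names and array layout are assumptions made for illustration and are not taken from [8] or [17]; in particular, direction unification is implemented here as a 180-degree rotation of the play, which is one plausible reading of "left-to-right".

```python
import numpy as np

# Assumed coordinate convention (hypothetical constants, in yards).
FIELD_LENGTH = 120.0  # x extent, including both end zones
FIELD_WIDTH = 53.3    # y extent


def unify_direction(xy, v_xy, moving_left):
    """Rotate a play by 180 degrees if the offense moves right-to-left.

    xy and v_xy have shape (..., 2): positions and velocity components.
    """
    xy = np.array(xy, dtype=float)      # np.array copies, inputs stay intact
    v_xy = np.array(v_xy, dtype=float)
    if moving_left:
        xy[..., 0] = FIELD_LENGTH - xy[..., 0]
        xy[..., 1] = FIELD_WIDTH - xy[..., 1]
        v_xy = -v_xy                    # both velocity components change sign
    return xy, v_xy


def random_y_flip(xy, v_xy, rng, p=0.5):
    """Training-time augmentation: mirror the play across the field's long axis."""
    xy = np.array(xy, dtype=float)
    v_xy = np.array(v_xy, dtype=float)
    if rng.random() < p:
        xy[..., 1] = FIELD_WIDTH - xy[..., 1]
        v_xy[..., 1] = -v_xy[..., 1]    # only the y component is mirrored
    return xy, v_xy


rng = np.random.default_rng(0)
pos, vel = unify_direction([[10.0, 20.0]], [[1.0, -2.0]], moving_left=True)
pos_aug, vel_aug = random_y_flip(pos, vel, rng, p=0.5)
```

Other direction-dependent quantities (acceleration components, orientation angles) would need the same treatment; they are left out to keep the sketch short.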
We highlight the key differences from [8] and [17] that we implement to improve the model's performance:

• Full offensive players. [8] limited the feature engineering to 5 non-quarterback offensive players. We expand this to all 10 non-quarterback offensive players to maximize the input information. This gives the model the flexibility to learn to capture the most important signals for coverage classification.

• Play trimming. We use a sequence of frames in the play to make the prediction, whereas both [8] and [17] were based on a single frame. As such, the duration of the play needs to be taken into account. Since play lengths vary dramatically, we trim longer plays to focus on the first several seconds, which contain the most important coverage indicators. The detailed trimming logic is described in Algorithm 1.

• Temporal downsampling. Because we incorporate all offensive players and additional frames from the play, the size of the input tensor to the model increases significantly. To reduce the memory footprint and make both training and inference more efficient, we experimented with temporal downsampling of the play by different factors. We found that downsampling by a factor of 2 (reducing to 5 frames per second) did not reduce the classification performance, and we use it for all experiments in this paper.

After the raw data has been processed, we perform feature engineering to construct the play feature sequence used as model input. For a given frame, our representation is inspired by the Zoo model from the 2020 Big Data Bowl Kaggle solution [17]: we construct an “image” for each time step with the defensive players as the rows and the offensive players as the columns. Each “pixel” of the “image” thus represents the features for the intersecting pair of players. Different from [17], we extract a sequence of these frame representations, which effectively generates a mini-“video” that characterizes the play.

Figure 7. Illustration of the example features for the 1st frame (left) and the 10th frame (right) in correspondence to Figure 3. The x axis definition is given in Figure 6. Player acronyms are the same as in Figure 3 and the full list is in Appendix.

Figure 7 visualizes how the features evolve over time for the two snapshots given in Figure 3. For visual clarity, we show only four of the features we extracted: “x position to LOS (line of scrimmage)” and “x speed” for defenders, which capture their location and speed along the horizontal direction of the field; and “relative x position” and “relative x speed” for each interacting defensive and offensive player pair, where the feature value is shown at the corresponding “pixel”. The pixel color encodes the value according to the color bar. Notice how the features progress over time as players move. For example, at the 10th frame of the “relative x speed” feature, the columns of the 3 wide receivers (WR) have generally larger values, indicating aggressive movement. On the other hand, in the “relative x position” feature, their intersecting “pixels” with the SS and the 3 CBs have relatively smaller values, indicating the close proximity these players have reached. Comparatively, in the “x position to LOS” feature, large values for the SS and the 3 CBs confirm their locations on the field.

Altogether, we construct the following two sets of features: 1) defender features, consisting of the defender position, speed, acceleration and orientation along the x and y axes, which correspond to the horizontal and vertical directions of the field; and 2) defender-offense relative features, consisting of the same attributes but calculated as the difference between the defensive and offensive players. Aside from the player movement features, we also experimented with incorporating game contextual information, including the down, yards to endzone, yards to go, number of pass rushers, running routes, etc.
These extra features did not show a clear improvement in coverage classification performance and were thus removed from the productionized pipeline. We conjecture that the rich tracking data inherently covers the game and play information the model needs, so the context did not provide additional perspective.

3.5 Coverage Classification Model

We develop an ensemble CNN-attention model that uses the features constructed in Section 3.4 for coverage classification. We describe the key architectural designs that are important for performant modeling in the next few subsections.

Figure 8. Diagram of the convolutional module.

Figure 9. Self-attention mechanism for temporal modeling.

3.5.1 CNN module

The “image” feature construction of Section 3.4 makes it possible to model each play frame with a CNN. Figure 8 shows the internal structure of our CNN: we modified the convolutional (Conv) block used by the Zoo solution [17] with a branching structure composed of a shallow 1-layer CNN and a deep 3-layer CNN. Batch normalization is applied after each convolution layer, and dropout is applied at the end of the block. An important detail of the convolution layer is its internal 1x1 kernel: having the convolutional kernel look at each player pair individually ensures that the model is invariant to the player ordering. In data processing, the ordering needs to be consistent over time within a play. For simplicity, we order the players by their NFL ID in all play samples.

After the 2D Conv Block, pooling is applied along the offense axis (the “image” columns). The results are then fed into a one-dimensional (1D) Conv Block with a structure similar to the 2D Conv Block, but with 1D convolutional layers. Following [17], we use a weighted combination of average and max pooling with weights of 0.7 and 0.3, respectively. We experimented with modified weights, but the modifications did not provide any performance improvement.
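
The two ideas above, a 1x1 kernel applied to each defender-offense pair and a weighted average/max pooling over the offense axis, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the number of input features, the channel width, the random weights and the function names (`conv1x1_relu`, `pool_offense_axis`) are assumptions for illustration, and batch normalization, dropout and the branching structure are omitted.

```python
import numpy as np


def conv1x1_relu(frame, weight, bias):
    """Apply a 1x1 convolution followed by ReLU.

    frame:  (C, D, O) features for D defenders x O offensive players
    weight: (K, C) kernel, shared by every player pair
    bias:   (K,)
    """
    out = np.einsum("kc,cdo->kdo", weight, frame) + bias[:, None, None]
    return np.maximum(out, 0.0)


def pool_offense_axis(x, w_avg=0.7, w_max=0.3):
    """Weighted combination of average and max pooling over the last axis."""
    return w_avg * x.mean(axis=-1) + w_max * x.max(axis=-1)


rng = np.random.default_rng(0)
frame = rng.normal(size=(4, 11, 10))   # 4 hypothetical features, 11 defenders, 10 offensive players
weight = rng.normal(size=(8, 4))       # 8 hypothetical output channels
bias = rng.normal(size=8)

pooled = pool_offense_axis(conv1x1_relu(frame, weight, bias))

# Shuffling the offensive players (columns) leaves the pooled result unchanged.
perm = rng.permutation(frame.shape[2])
pooled_shuffled = pool_offense_axis(conv1x1_relu(frame[:, :, perm], weight, bias))
print(pooled.shape, np.allclose(pooled, pooled_shuffled))
```

Because the kernel never mixes neighbouring "pixels", each pair is transformed independently, and the pooling then discards the column order. The same argument applies to the defender axis once the 1D block and its pooling are applied.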
At the end of the CNN module is a linear block that consists of 3 fully connected layers with batch normalization and dropout in between. We obtain the frame embeddings as the output of the CNN module.

3.5.2 Temporal modeling

Once the ball is snapped, a play takes only a few seconds to complete. Within this short period, the fast-progressing, rich temporal dynamics contain key indicators for identifying the coverage. The ML model needs not only to aggregate the information contained in individual frames, but also to capture the correlations among the frames and potentially weigh them differently. We design a self-attention module [21] for the temporal modeling and compare it with a more conventional bidirectional LSTM approach (quantitative comparison in Section 4.1).

A high-level illustration of the self-attention module is given in Figure 5, where the self-attention module is stacked on top of the frame embeddings learned by the CNN. The attention embeddings it outputs are then averaged to obtain the embedding of the whole play. Finally, a fully connected layer determines the coverage class of the play. We illustrate the internal structure of the self-attention module in Figure 9, where the attention weights are calculated as the scaled dot-product between each query frame and every key frame. The weights are then used in the linear combination of the value frames to compute the frame representations. Specifically,

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \cdots, \mathrm{head}_h)$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $K$, $V$, $Q$ are frame embeddings learned from the constructed “image” feature, $W_i^Q$, $W_i^K$, $W_i^V$ are the layer weights, and $h$ is the number of attention heads.

3.5.3 Model ensemble and label smoothing

As described in Section 3.2, the 8 coverage schemes have an imbalanced distribution: for example, Cover 1 Man and Cover 3 Zone are frequently used, while Prevent and Cover 2 Man are rare.
In addition, we identified adjustments in more specific coverage calls that can lead to ambiguity among the 8 general coverage classes, for both manual charting and model classification. The coverage imbalance and ambiguity make a clear separation among coverages challenging.

We use a model ensemble to tackle these challenges during model training. We experimented with the following ensemble methods: fusion, voting, and gradient boosting, along with different numbers of base models. In fusion, all base models are jointly trained, with the training loss calculated from the averaged output of all base models. Voting differs from fusion by fitting the base models independently and averaging their outputs only during inference. For gradient boosting, the base models are trained sequentially, where the training target is associated with the outputs of previously fitted base models. Our study shows that the more straightforward voting method achieves the best classification result, and that a 5-model ensemble works best. In the voting-based ensemble, each base model has the same CNN-attention architecture and is trained independently from a different random seed. The final classification takes the average over the outputs of all base models.

We further incorporate label smoothing into the cross-entropy loss to handle the label ambiguity. The idea is to encourage the model to adapt to the inherent coverage ambiguity instead of overfitting to potentially biased annotations. To smooth the labels, the original one-hot class distribution is combined in the loss calculation with a small amount of a uniform class distribution to introduce uncertainty. For example, a Cover 3 Zone label might become a 90% probability of Cover 3 Zone, with the remaining probability split equally among the other classes. Denote the original one-hot encoded label vector for sample $x$ as $y_x^{\text{one-hot}}$, and the number of classes as $K = 8$. Label smoothing is calculated as

$$y_x^{\text{label-smooth}} = (1 - \alpha)\, y_x^{\text{one-hot}} + \alpha / K$$

where $\alpha$ is the tunable weighting parameter that controls the smoothing strength. $y_x^{\text{label-smooth}}$ is then used in the cross-entropy loss calculation.

3.6 Model Explanations

Figure 10. Global explanation: t-SNE embeddings of a downsampled subset of 2018-2020 season training plays. The plays are color-encoded according to the ground-truth annotation shown in the legend. The legends are shortened class labels of the original 8 classes as depicted in task definition, with M representing Man and Z representing Zone, and the word Cover removed.

Figure 11. Potentially mislabeled plays highlighted on the t-SNE embeddings. The top-ranked identifications by the KNN algorithm are shown with triangles. The color encodes ground-truth coverage annotation.

The black-box nature of deep neural networks prevents interpretation of how the model determines the coverage scheme from tracking data. Our analysis reveals the inherent challenges in ensuring that the model captures football knowledge, and in reviewing the model's decisions under coverage ambiguities and wrong classifications. To tackle this, we develop a two-stage, top-down model explanation approach.

The first stage analyzes the learned play embeddings from the coverage classification model to discover any patterns that require manual review. We use t-distributed stochastic neighbor embedding (t-SNE) [26] and experimented with parameters including the perplexity, the number of iterations and the random seed to extract stable 2D projections. To reduce visual clutter, we perform stratified sampling and analyze a subset of the training data consisting of around 9,000 plays. The projected 2D embeddings are visualized in Figure 10. We find that the majority of plays of each coverage scheme are well separated, demonstrating the classification capability gained by the model.
However, we highlight two important patterns that need further investigation: 1) A small number of plays deviate significantly from their respective coverage cluster. This could be attributed to mislabeling of the coverage or to a high degree of coverage ambiguity. 2) Among certain coverages, there is significant overlap of plays. For example, we identify a long, curved cluster consisting of a mixture of Cover 1 Man and Cover 3 Zone plays (blue and green samples), and this cluster deviates from the main clusters of both types. This could reflect inherent ambiguity between these two coverage concepts, and specific adjustments on play calls that are not accounted for in the general ground-truth labeling. To effectively extract the example plays associated with these patterns for manual review, we use basic outlier detection and unsupervised clustering methods. The detailed methods and our findings from manual review are described in Section 4.2.1.

Figure 12. Instance explanation: we utilized Guided GradCAM algorithm to extract the highlighted pixels and mapped them back to football field where the line thickness corresponds to the player interaction strength.

To shed light on the model's decisions and speed up the manual review of individual plays, we develop the second stage, instance explanation. It zooms into the individual play of interest and extracts frame-by-frame player interaction highlights that contribute the most to the identified coverage scheme. This is achieved through the Guided GradCAM algorithm [22], and the extraction process is illustrated in Figure 12. Starting from the coverage classification score obtained by the model (bottom of the figure), the algorithm consists of two steps. The first step (left branch in Figure 12) uses Guided Backpropagation [27] to extract the salient pixels of the input image that activate the neurons. These highlights are class-agnostic, general contributing features.
The second step (right branch in Figure 12) uses GradCAM to back-propagate the coverage score, using the gradients to localize class-discriminative pixels. Note that we use the feature maps at the output of the 2D Conv Block (as in Figure 8) to extract the GradCAM result. The results from these two steps are then multiplied element-wise, and we select the frame with the highest activation as the most critical time step for explanation. Considering the multiple base models used in the ensemble (not shown in Figure 12 for conciseness), we also select the base model that outputs the highest activation. The explanation result is a set of coverage-discriminative, pixel-level highlights on the transformed “image” feature, as in Figure 7. As the final step, to illustrate the highlights intuitively, we map them back onto the football field and visualize the corresponding player interactions. The line thickness indicates the interaction strength. Detailed results on example plays are shown in Section 4.2.2.

4. Metrics and results

In this section, we describe the experimental metrics and results for our explainable coverage classification model. We first introduce the quantitative experiment setup and the performance comparison to baseline models (Section 4.1). Next, we provide results from our model explanation methods and the insights discovered by them (Section 4.2).

4.1 Quantitative evaluation

Table 1. Best model and training parameters from hyperparameter optimization.

Parameter                                       Value
CNN output dimensionality                       128
Learning rate                                   0.0054
Weight decay                                    0.0005
Label smoothing weight                          0.07
Dropout rate for fully connected layers         0.3
Dropout rate for convolutional layers           0.2
Number of heads in self-attention module        16

As mentioned in Section 3.2, we use the 2018-2020 seasons for model training and validation, and the 2021 season for quantitative evaluation. We performed 5-fold cross-validation to select the best model during training.
We apply the Adam optimizer with weight decay and perform hyperparameter optimization to select the best settings for multiple model architecture and training parameters. The best parameters are shown in Table 1.

To evaluate the model performance, we computed the coverage accuracy, F1 score, top-2 accuracy, and accuracy on the man vs. zone task. The CNN-based Zoo model used in [8] is the most relevant for coverage classification, and we used it as the baseline. In addition, we consider improved versions of the baseline that incorporate temporal modeling components for a comparative study: a CNN-LSTM model that uses a bidirectional LSTM to perform the temporal modeling, and a single CNN-attention model that is used as the backbone of our model, but without the ensemble and label smoothing components. We obtain the performance results from 5 runs with different random seeds and report the average and standard deviation. The results are shown in Table 2.

Table 2. Quantitative evaluation of the coverage classification model in comparison with the baseline and improved versions of it.

Model                                       Test acc.,          Top-2 acc.,         F1 score,       Test acc.,
                                            8 coverages (%)     8 coverages (%)     8 coverages     Man vs. Zone (%)
Baseline: Zoo model                         68.8±0.4            87.7±0.1            65.8±0.4        88.4±0.4
CNN-LSTM                                    86.5±0.1            93.9±0.1            84.9±0.2        94.6±0.2
CNN-attention                               87.7±0.2            94.7±0.2            85.9±0.2        94.6±0.2
Ours: ensemble of 5 CNN-attention models    88.9±0.1            97.6±0.1            87.4±0.2        95.4±0.1

We observe that incorporating a temporal modeling module significantly improves on the baseline Zoo model, which was based on a single frame. Compared to the strong CNN-LSTM baseline, our proposed modeling components (the self-attention module, model ensemble and label smoothing) together raise the 8-coverage test accuracy from 86.5% to 88.9% and the man vs. zone accuracy from 94.6% to 95.4%. In addition, we observe a very high top-2 accuracy (97.6%) and a significant gap to the top-1 accuracy (88.9%).
This can be attributed to the coverage ambiguity: when the top classification is incorrect, the second guess often matches the human annotation.

4.2 Model explanation results

4.2.1 Global explanations

As shown in Figure 10 and described in Section 3.6, we observe interesting cluster patterns among different coverage types. In this experiment, we use basic outlier detection and clustering algorithms to investigate these patterns further.

First, we notice that some plays are “mixed” into other coverage types. These plays could potentially be mislabeled and deserve manual inspection. To automatically identify the candidates for review, we design a self-verification method that compares each play’s coverage label with the labels of its neighbors in the learned embedding space. This is achieved with a K-Nearest Neighbors (KNN) classifier. For each example, we compute its correctness score as the KNN classification probability of its annotated class label. We experimented with different values of K, the number of neighbors, and chose a relatively large value of K=80 to avoid prioritizing samples inside the ambiguity regions. The lowest-score examples are shown in Figure 11. We randomly sampled 13 plays from the highlighted examples for expert review and found that 12 of the 13 plays were indeed labeled incorrectly. The remaining play was designated as a zone match split-safety coverage that falls in between Cover 2 Zone (label) and Cover 2 Man (model classification). Inspection of the play footage revealed that the two outside cornerbacks (CBs) kept their eyes on the QB the entire time, which could not be captured by the tracking data.

The second interesting observation from Figure 10 is that there are several overlapping regions among the coverage types, indicating coverage ambiguity. We identify the most prominent ambiguities and use a clustering algorithm to extract the associated example plays.
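Returning to the self-verification step described above, the correctness score can be sketched as follows. This is a simplified illustration rather than the implementation used in the paper: the function name and the toy 2-D embeddings are hypothetical, the score is computed as the fraction of the K nearest neighbors that share the play's annotated label (which is what a uniformly weighted KNN classifier reports as its class probability), and we assume the play itself is excluded from its own neighborhood.

```python
import math

def knn_correctness_scores(embeddings, labels, k):
    """For each play, return the fraction of its k nearest neighbors
    (excluding the play itself, an assumption) that share its annotated label."""
    scores = []
    for i, (x, y) in enumerate(zip(embeddings, labels)):
        # Distances from play i to every other play, nearest first.
        dists = sorted(
            (math.dist(x, z), j)
            for j, z in enumerate(embeddings)
            if j != i
        )
        neighbors = [labels[j] for _, j in dists[:k]]
        scores.append(sum(lbl == y for lbl in neighbors) / len(neighbors))
    return scores

# Toy embeddings, made up for illustration: two well-separated groups,
# with one play labeled "B" sitting in the middle of the "A" group.
embeddings = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # group A
    (0.5, 0.5),                                              # labeled B, inside A
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # group B
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

scores = knn_correctness_scores(embeddings, labels, k=3)

# Plays with the lowest scores are the first candidates for manual review.
review_order = sorted(range(len(scores)), key=scores.__getitem__)
print(review_order[0], scores[review_order[0]])
```

In this toy example the play at index 4 receives a score of 0 and is ranked first for review. With real embeddings, a larger K such as the K=80 reported above averages over a wider neighborhood.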
Considering the complex topology, we apply the spectral clustering algorithm [28] to the play embeddings. We experimented with the number-of-clusters parameter, starting with a small value and gradually increasing it until the visually identified ambiguity region was covered by one of the clusters. Note that the clustering is not aimed at the optimal separation of the plays, but rather at effectively selecting the plays associated with the ambiguity region. The identification results on three prominent regions are visualized in Figure 13. Our expert review uncovered interesting patterns in the adopted coverages:
• The first ambiguity region, shown in Figure 13(a), involves two different single-high coverage concepts: Cover 3 Zone vs. Cover 1 Man. The main distinction between these two coverages is man vs. zone coverage. Most of the example plays in this region involve some sort of “pattern matching”. In these plays, the coverage responsibilities are contingent upon how the offensive receivers’ routes are distributed, and adjustments can make the play look like a mix of zone and man coverages. For example, one such adjustment we identified applies to Cover 3 Zone, when the cornerback (CB) on one side is locked into man coverage (“Man Everywhere he Goes” or MEG) and the other has a traditional zone drop.
• The second ambiguity region, Cover 4 Zone vs. Cover 6 Zone as shown in Figure 13(b), involves another pair of coverages whose assignments overlap. Cover 6 Zone is best understood as a split-field coverage, where one side of the defense is playing Cover 4 Zone and the other is playing Cover 2 Zone. This means one side’s cornerback (the Cover 2 side) is responsible for the “flat” area, while the other is responsible for the deep outside quarter of the field.
A key indicator distinguishing Cover 6 from Cover 4 is the flat cornerback initially staying in place or stepping down at the snap, while the deep-quarter cornerback starts by backpedaling. On a number of plays in this region, the flat area was not threatened by any receivers, so that cornerback eventually had the freedom to drop back, making it look more like Cover 4. Another pattern in the examples was the relative depth of the safeties. On most plays, the defense presents more of a single-high safety shell pre-snap, with one safety significantly deeper than the other. The spacing of the safeties made it appear that they were responsible for a deep half rather than a deep quarter, especially if they were aligned wider.
• A majority of the example plays in the third region, Cover 0 Man vs. Cover 1 Man as shown in Figure 13(c), are in the red zone, especially within the 5-yard line. Given the reduced space in this area of the field, it becomes more difficult to determine whether there is a “deep safety” (an indicator of Cover 1 Man). On the plays outside the red zone, the defense showed a single-high safety at the snap. However, that player did not drop into the deep middle on any of those plays. Instead, that player would end up in man coverage to replace a blitzing player or to help double a dangerous receiver.

(a) t-SNE embeddings for Cover 3 Zone (left), Cover 1 Man (middle), and the identified ambiguity cluster in red, with 10 randomly sampled plays marked with a black “x” for manual review (right).
(b) t-SNE embeddings for Cover 4 Zone (left), Cover 6 Zone (middle), and the identified ambiguity cluster in red, with 10 randomly sampled plays marked with a black “x” for manual review (right).
(c) t-SNE embeddings for Cover 0 Man (left), Cover 1 Man (middle), and the identified ambiguity cluster in red, with 10 randomly sampled plays marked with a black “x” for manual review (right).
Figure 13.
Ambiguity analysis on the 3 prominent overlapping regions from t-SNE embeddings: Cover 3 Zone vs. Cover 1 Man (a), Cover 4 Zone vs. Cover 6 Zone (b), and Cover 0 Man vs. Cover 1 Man (c).

4.2.2 Instance explanations

We demonstrate the instance explanation results in this subsection. We first inspect the extracted explanations for “easier” examples whose coverage strategy is clear, to verify that the explanations capture meaningful player interactions. Then, we use the explanation method to shed light on the model’s decisions on some low-confidence plays.

Figure 14. Instance explanation results on a Cover 1 Man play (top) and a Cover 3 Zone play (bottom).

Figure 14 visualizes the instance explanations of a Cover 1 Man play and a Cover 3 Zone play. Note that the frames are selected using the method described in Section 3.6 and Figure 12. In the top plot, the explanation picks the frame 2.7 seconds into the play and highlights the strong interaction identified by the model between the left slot WR and the slot CB. This aligns with a clear indicator of man coverage: the CB squares up on the receiver and follows him inside and then outside on a whip route. On the other side of the formation, the explanation correctly identifies that the two defensive backs follow the receivers they align across from even as the receivers switch inside and outside, a key man coverage indicator. The TE aligned in the slot is followed by the FS on an out-breaking route, while the WR aligned wide is followed by the CB on an in-breaking route. When we also consider the deep middle FS, the play is clearly Cover 1 Man. In the bottom plot of Figure 14, the initial drops of both outside corners to the outside thirds, without any regard for the routes being run, clearly show Cover 3 Zone. The explanation picks the frame 5.5 seconds into the play, when the pass rush has forced the QB to scramble, but the deep-third players have all maintained their responsibilities.
The strong interaction between the TE in the deep middle of the field and both the MLB and the CB is consistent with a Cover 3 framework: he would not be that open if the defense were playing man or match coverage. At the same time, the MLB having strong interactions with both inside TEs who aligned on his side of the formation pre-snap is another clear piece of evidence: he is in zone, so he did not follow either TE, even as they entered and exited the area he was responsible for.

After confirming the utility of the instance explanation method, we use it to shed light on the model’s decision when the prediction confidence is low. These plays deserve manual inspection, and the instance explanation can help speed up the process. Figure 15 demonstrates the explanation result for a play where the model identifies Cover 1 Man with 61.7% probability and Cover 0 Man with 28.4% probability. When asked to explain the decision of Cover 1 Man, the algorithm identifies the frame (Figure 15, top plot) that comes after the play action fake. At that point it is clearer that the SS is patrolling the deep middle. The highlighted interactions between the WR and the CB are indeed correct evidence of man coverage. When asked to explain Cover 0 Man, the algorithm picks the frame (Figure 15, bottom plot) that comes significantly earlier in the play, right after the snap, as the quarterback has turned his back to fake the handoff to the RB. The highlighted interaction between the SS and the right WR is due to the safety moving in that direction, which may have led the model to conclude that he is in man coverage instead of playing the deep middle. This play also conforms to our findings from the third ambiguity region (Figure 13(c)): the condensed space near the goal line makes it harder for the model to identify whether there is a “deep safety”.

Figure 15.
Instance explanation result on a play with 61.7% predicted probability of Cover 1 Man (top) and 28.4% predicted probability of Cover 0 Man (bottom).

Looking back at the play we illustrated in Figure 1, the model predicted Cover 3 Zone with 44.5% probability and Cover 1 Man with 31.3% probability. We generate the explanation results for both classes, as shown in Figure 16. The top plot, for the Cover 3 explanation, comes right after the ball snap. The CB on the offense’s right has the strongest interaction lines, because he is facing the QB and stays in place. He ends up squaring off and matching with the receiver on his side who threatens him deep.

Figure 16. Instance explanation result on a play with 44.5% predicted probability of Cover 3 Zone (top) and 31.3% predicted probability of Cover 1 Man (bottom). This is the same play as the one we illustrated in Figure 1.

The bottom plot, for the Cover 1 explanation, comes a moment later, as the play action fake is happening. One of the strongest interactions is with the CB on the offense’s left, who is dropping with the WR. Play footage reveals that he keeps his eyes on the QB before flipping around and running with the WR who is threatening him deep. The SS on the offense’s right also has a strong interaction with the TE on his side, as he starts to shuffle when the TE breaks inside. He ends up following him across the formation, but the TE starts to block him, indicating that the play was likely a run-pass option. This explains the uncertainty of the model’s classification: the TE is sticking with the SS by design, creating biases in the data.

5. Conclusion

This paper presents a novel ensemble CNN-attention model to classify defense coverage schemes. It significantly outperformed the existing frame-based model (88.9% vs. 68.8% accuracy on eight coverages) and achieved production-ready performance. This approach is easily generalizable and extensible to additional types of coverages beyond the eight we considered in the paper.
The classification model has been deployed to production by the NFL NGS engineering and product teams. To extract insights regarding coverage ambiguity and the model’s decision-making process, we further developed a comprehensive model explanation method. Through global explanation that uncovers coverage ambiguity patterns and instance explanation that highlights critical signals in the player interactions, our approach revealed interesting insights about team and player behaviors. This also enables intelligent selection of plays for efficient human review. In future work, we plan to investigate game-theoretic approaches [23, 24] for the explanation of the coverage classification model. In addition, we would like to study temporal graph neural networks (GNNs) that can directly model the player interactions from raw data, as well as GNN-based model explanation approaches [25].

6. Acknowledgements

We would like to thank Keegan Abdoo from NFL NGS for providing invaluable football insights, and Huzefa Rangwala from Amazon ML Solutions Lab for paper writing suggestions.

References

[1] Eric Eager, George Chahrouri, Timo Riske, Brad Spielberger, LauSze Yui, Zach Drapkin, Tej Seth. “Using Tracking and Charting Data to Better Evaluate NFL Players: A Review.” In MIT Sloan Sports Analytics Conference. 2022.
[2] Horton, Michael. "Learning feature representations from football tracking." In MIT Sloan Sports Analytics Conference. 2020.
[3] Hochstedler, Jeremy H. "Incorporating spatiotemporal machine learning into Major League Baseball and the National Football League." PhD diss., Massachusetts Institute of Technology, 2016.
[4] Goyal, Udgam. "Leveraging machine learning to predict playcalling tendencies in the NFL." PhD diss., Massachusetts Institute of Technology, 2020.
[5] Reyers, Matthew, and Tim B. Swartz. "Quarterback evaluation in the national football league using tracking data." AStA Advances in Statistical Analysis (2021): 1-16.
[6] da Silva, Gustavo Pompeu, and Rafael de Andrade Moral. "Frame by frame completion probability of an NFL pass." arXiv preprint arXiv:2109.08051 (2021).
[7] Lin Lee Cheong, Xiangyu Zeng and Ankit Tyagi. “Prediction of Defensive Player Trajectories in NFL Games with Defender CNN-LSTM Model.” In MIT Sloan Sports Analytics Conference. 2021.
[8] Ben Baldwin. “Computer Vision with NFL Player Tracking Data using torch for R: Coverage classification Using CNNs.” https://www.opensourcefootball.com/posts/2021-05-31-computer-vision-in-r-using-torch/
[9] Skoki, Arian, Jonatan Lerga, and Ivan Štajduhar. "ML-Based Approach for NFL Defensive Pass Interference Prediction Using GPS Tracking Data." In 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO), pp. 1038-1043. IEEE, 2021.
[10] Raabe, Dominik, Reinhard Nabben, and Daniel Memmert. "Graph representations for the analysis of multi-agent spatiotemporal sports data." Applied Intelligence (2022): 1-21.
[11] Dutta, Rishav, Ronald Yurko, and Samuel L. Ventura. "Unsupervised methods for identifying pass coverage among defensive backs with NFL player tracking data." Journal of Quantitative Analysis in Sports 16, no. 2 (2020): 143-161.
[12] Dickmanns, Ludwig. "Pose Estimation and Analysis for American Football Videos." (2021).
[13] Joash Fernandes, Craig, Ronen Yakubov, Yuze Li, Amrit Kumar Prasad, and Timothy CY Chan. "Predicting plays in the national football league." Journal of Sports Analytics 6, no. 1 (2020): 35-43.
[14] Lalwani, Abhinav, Aman Saraiya, Apoorv Singh, Aditya Jain, and Tirtharaj Dash. "Machine Learning in Sports: A Case Study on Using Explainable Models for Predicting Outcomes of Volleyball Matches." arXiv preprint arXiv:2206.09258 (2022).
[15] Silver, Joshua, and Tate Huffman. "Baseball Predictions and Strategies Using Explainable AI." In The 15th Annual MIT Sloan Sports Analytics Conference. 2021.
[16] Wang, Yuanchen, Weibo Liu, and Xiaohui Liu. "Explainable AI techniques with application to NBA gameplay prediction." Neurocomputing 483 (2022): 59-71.
[17] Dmitry Gordeev, Philipp Singer. “1st place solution The Zoo.” https://www.kaggle.com/c/nfl-big-data-bowl-2020/discussion/119400
[18] Tej Seth, Ryan Weisman. “PFF Data Study: Coverage scheme uniqueness for each team and what that means for coaching changes.” https://www.pff.com/news/nfl-pff-data-study-coverage-scheme-uniqueness-for-each-team-and-what-that-means-for-coaching-changes
[19] Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. "Grad-cam: Visual explanations from deep networks via gradient-based localization." In Proceedings of the IEEE international conference on computer vision, pp. 618-626. 2017.
[20] NFL Football Operations. “Which NFL teams mix up defensive coverages the most week-to-week.” https://operations.nfl.com/gameday/analytics/stats-articles/which-nfl-teams-mix-up-defensive-coverages-the-most-week-to-week/
[21] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[22] Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. "Grad-cam: Visual explanations from deep networks via gradient-based localization." In Proceedings of the IEEE international conference on computer vision, pp. 618-626. 2017.
[23] Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." Advances in neural information processing systems 30 (2017).
[24] Lundberg, Scott M., Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. "From local explanations to global understanding with explainable AI for trees." Nature machine intelligence 2, no. 1 (2020): 56-67.
[25] Ying, Zhitao, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. "Gnnexplainer: Generating explanations for graph neural networks." Advances in neural information processing systems 32 (2019).
[26] Van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of machine learning research 9, no. 11 (2008).
[27] Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. "Striving for simplicity: The all convolutional net." arXiv preprint arXiv:1412.6806 (2014).
[28] Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17, no. 4 (2007): 395-416.

Appendix

Player position acronyms in Figure 2

Defensive positions
W: "Will" Linebacker, or the weak side LB
M: "Mike" Linebacker, or the middle LB
S: "Sam" Linebacker, or the strong side LB
CB: Cornerback
DE: Defensive End
DT: Defensive Tackle
NT: Nose Tackle
FS: Free Safety
SS: Strong Safety
S: Safety

Offensive positions
X: Usually the number 1 WR in an offense, they align on the LOS. In trips formations, this receiver will often align isolated on the backside.
Y: Usually the starting TE, this player will often align in-line and to the opposite side as the X.
Z: Usually more of a slot receiver, this player will often align off the LOS and on the same side of the field as the TE.
H: Traditionally a fullback, this player is more often a third WR or a second TE in the modern league. They can align all over the formation, but are almost always off the line of scrimmage. Depending on the team, this player could also be designated as an F.
T: The featured running back. Other than empty formations, this player will align in the backfield and be a threat to receive the handoff.
QB: Quarterback
C: Center
G: Guard

Player position acronyms in other figures, if not in the above

Defensive positions
LB: Linebacker
ILB: Inside Linebacker
OLB: Outside Linebacker
MLB: Middle Linebacker

Offensive positions
RB: Running Back
FB: Fullback
WR: Wide Receiver
TE: Tight End
LG: Left Guard
RG: Right Guard
T: Tackle
LT: Left Tackle
RT: Right Tackle