Evidence of Similarity Preference Among Diving Judges in the Summer Olympics

 
 

Are the Olympic Diving Judge Panels susceptible to Compatriot Preference?

The report for this post can also be downloaded by clicking this link.

Many sports, such as gymnastics, diving, ski jumping, and figure skating, rely on judges’ evaluations to determine the winner of a competition. Judges usually follow a consistent rating scale (e.g., diving: 0.0 – 10.0), and sport governing bodies are responsible for setting and enforcing quality-control parameters for judge performance. Following the figure skating judging scandals at the 1998 and 2002 Olympics, judge performance has received greater scrutiny. This article investigates whether nationality affects judges’ grading, in either direction, in diving at the Tokyo 2020 Olympic Games (held in 2021). Empirical studies have already examined judges’ nationality bias, for example, Nationalistic Judging Bias in the 2000 Olympic Diving Competition (Emerson and Meredith 2010) and Racial Bias in National Football League Officiating (Eiserloh, Foreman, and Heintz 2020). This case study aims to stand on the shoulders of those giants and, by investigating the cognitive compatriot preference of judges, hopefully contribute to improving judging systems in sports. In this report, we call a judge and diver pair “compatriot” if they share the same NOC code (Wikipedia, collective, n.d.) and “alien” if they do not.

Tokyo 2020 Olympics – Diving – Women’s 3m Springboard – Final – Tokyo Aquatics Centre, Tokyo, Japan – August 1, 2021. Tingmao Shi of China in action during the last dive REUTERS/Annegret Hilse

Insights

Although the evolution of our species spans a far longer period, humanity as we know it today is only about 200,000 years old. Over that history we have developed specialized psychological and sociological tendencies that persist through the millennia. These tendencies are rarely conscious, but they are a vital part of our complex thought processes, shaped by the species’ long history. It has long been known, for example, that familiar stimuli are perceived as more attractive than novel ones, an effect called the “mere-exposure effect” or the “familiarity principle” (The Decision Lab, n.d.). Such tendencies often carry evolutionary importance: the familiarity principle has helped individuals survive, or attain particular social roles, so that we can build large communities in which solidarity and harmony matter more than any individual’s well-being. Although these tendencies certainly have many positive characteristics, they carry a risk of systematic error in decision-making and interpretation: we may value what we know more highly than what is unfamiliar. Performance assessments, in particular, need to be examined with these effects in mind. Performance evaluation is part of almost every workplace; workers are evaluated by superiors, students by teachers, and team leaders by executives. This chain of evaluations creates an environment in which most careers depend not only on the performance itself but also on how that performance is perceived and evaluated. Interestingly, there seems to be sizeable scope for bias in such judgments (Meyer and Booker 2001). For example, a recent study (Lyngstad, Härkönen, and Rønneberg 2020) found strong evidence of bias in sport performance evaluations in ski jumping.

In individual diving competitions, divers are evaluated by a panel of seven judges under strict rules for determining diving excellence. After each dive, all judges evaluate the performance simultaneously; while they know the other panel members, they can neither observe their evaluations nor communicate with them. Scores range from zero to ten based on execution, and each dive carries a degree of difficulty. For each dive, the highest and lowest awards are excluded and the remaining scores are weighted by the dive’s difficulty; each diver performs six dives in total. In grading a dive, judges must evaluate five elements, each contributing to the overall score: starting position, approach, take-off, flight (including overall height achieved), and entry into the water. Weighing these different aspects into a single score can be challenging, and the multivariate nature of observing the dive while simultaneously assessing its various aspects opens the door to unconscious preference.
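To make the aggregation concrete, here is a minimal sketch of how a single dive’s total could be computed under the rule described above; the function name, the example scores, and the exact number of awards trimmed are illustrative assumptions rather than the official FINA formula.

```python
# Minimal sketch of the scoring rule described above: judges award 0.0-10.0,
# the extreme awards are dropped, and the remainder is weighted by the dive's
# degree of difficulty. The drop count (n_drop) and function name are illustrative.

def dive_total(judge_scores, degree_of_difficulty, n_drop=1):
    """Return a dive total after trimming the n_drop highest and lowest awards."""
    trimmed = sorted(judge_scores)[n_drop:-n_drop]
    return sum(trimmed) * degree_of_difficulty

# Example: a seven-judge panel scoring a dive with difficulty 3.0
print(dive_total([7.5, 8.0, 8.0, 8.5, 8.5, 9.0, 7.0], 3.0))
```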

Imagine five people in a room watching a film about fencing; if you then split them up and ask each separately what they thought of it, you will most likely get five very different answers. Hence it is vital to investigate the cognitive preferences of judges and to try to select judging panels that produce the fairest possible result.

Data

The modern Olympics comprise all the Games from Athens 1896 to Tokyo 2020, held in 2021. The Olympics are more than just a quadrennial multi-sport world championship; they are a lens through which to understand global history, including shifting geopolitical power dynamics, women’s empowerment, and society’s evolving values. The Olympic movement has gathered data from the Games for a long time but only recently started digging into the patterns in that data. This report’s data comes from 24 (fairly) identical PDFs produced by the Olympic Committee and distributed to students enrolled in Statistical Case Studies at Yale University. From talking to Elliot Schwartz, the Performance Data Liaison for the US Olympic Committee, we have no reason to doubt that the data is consistent with the official results, names, and judges.

Along with this data, we also have information about the gender and nationality of each judge on the panel for the round and event in which a dive is performed. These two datasets are combined for the analysis. For statistical reasons, dives that scored zero have been excluded from the dataset: the rules clearly define when a dive has failed and must score zero, so little to no judge assessment goes into such a score and there is no personal preference to estimate. For example, rule D 6.28 states that “When a diver refuses to execute a dive, the Referee shall declare a failed dive” (Fédération Internationale De Natation 2017). In the original data, four dives scored zero; two of them can be seen here: Arantxa Chavez’s failed dive (s9v, YouTube channel 2021) and Pamela Ware’s failed dive (Arvi Viva, YouTube channel 2021).
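As an illustration of this preparation step, the sketch below merges hypothetical score and panel tables and drops the zero-scored dives; the file names and column names are assumptions about the layout of the PDF-extracted data, not the actual files used in the report.

```python
import pandas as pd

# Illustrative data-preparation step; "dive_scores.csv", "judge_panels.csv" and
# column names such as "JudgeScore", "Event", "Round" are assumed, not the
# report's actual files.
scores = pd.read_csv("dive_scores.csv")   # one row per judge award per dive
panels = pd.read_csv("judge_panels.csv")  # judge name, gender, NOC per event/round

df = scores.merge(panels, on=["Event", "Round", "Judge"], how="left")

# Failed dives are scored zero by rule (e.g. FINA D 6.28) and involve no judge
# discretion, so they are excluded from the analysis.
df = df[df["JudgeScore"] > 0]
```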

Statistical Methods

Calculations and baseline T-Tests

In statistics, a null hypothesis, H0, is a statement assumed to be valid unless it can be shown to be incorrect beyond a reasonable doubt. The idea is that the null hypothesis generally assumes there is nothing new or surprising in the population. Our test has the following hypothesis schema:

H0: There is no grade inflation, i.e., no difference between the grades given by compatriot judges and those given by alien judges.
H1: There is grade inflation, i.e., compatriot judges give higher grades than alien judges.

First, we calculate the differences between the grades given by non-compatriot judges and those given by compatriot judges. The differences are visualised below:

In these calculations, we need to take into account that we have far more observations of non-compatriot grading than of compatriot grading.

We have noted the average difference between each compatriot and alien grade. However, we also need the difference between each judge on a panel and the panel’s assessment of the dive, so that we can account for the strength of the dive itself. We do that in three ways (a short code sketch follows the list below):

Average of all seven judges for a dive [Noted as 7]
Average of 6 judges for a dive (excluding the judge we are looking at) [Noted as 6]
Median of all seven judges for a dive [Noted as Median]
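The sketch below illustrates, under assumed variable names, how the three blunder baselines for a single judge’s award could be computed, together with a one-sided Welch t-test comparing compatriot and alien blunders as the baseline test; this is an illustrative reading of the method, not the report’s actual code.

```python
import numpy as np
from scipy import stats

def blunders(panel, i):
    """Three baselines for the award of judge i, given the seven panel awards."""
    panel = np.asarray(panel, dtype=float)
    others = np.delete(panel, i)
    return {
        "7": panel[i] - panel.mean(),           # vs. average of all seven judges
        "6": panel[i] - others.mean(),          # vs. average of the other six judges
        "Median": panel[i] - np.median(panel),  # vs. the panel median
    }

def baseline_ttest(compatriot_blunders, alien_blunders):
    """One-sided Welch t-test: are compatriot blunders larger on average?"""
    return stats.ttest_ind(compatriot_blunders, alien_blunders,
                           equal_var=False, alternative="greater")
```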

The averages of these differences, which we call blunders, along with the count of each group, are shown below:

Randomization with Permutation tests

To counteract the scarcity of compatriot observations, we use a statistical procedure called the permutation test. A permutation test computes the sampling distribution of any test statistic under the strong null hypothesis. We first exclude the judges whose p-values are non-significant for all three differences (7, 6, and Median). For the remaining judges, we run a permutation test to determine statistical significance, using the t-test results as baselines and computing many random permutations of the data. If the effect is real, the original test statistic should lie in one of the tails of the null distribution. The results of the permutation tests are visualised below; in the plot, “yes” indicates that a judge from that country shows a compatriot preference.
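Here is a minimal sketch of such a permutation test, assuming the blunders and compatriot labels for one judge (or country) are already available as arrays; the variable names and the choice of the mean difference as the test statistic are illustrative assumptions.

```python
import numpy as np

def permutation_pvalue(blunder_values, is_compatriot, n_perm=10_000, seed=None):
    """One-sided permutation p-value for the compatriot-minus-alien mean blunder."""
    rng = np.random.default_rng(seed)
    blunder_values = np.asarray(blunder_values, dtype=float)
    is_compatriot = np.asarray(is_compatriot, dtype=bool)

    # Observed test statistic: mean blunder gap between compatriot and alien awards
    observed = blunder_values[is_compatriot].mean() - blunder_values[~is_compatriot].mean()

    # Null distribution: the same statistic after shuffling the labels many times
    null = np.empty(n_perm)
    for k in range(n_perm):
        shuffled = rng.permutation(is_compatriot)
        null[k] = blunder_values[shuffled].mean() - blunder_values[~shuffled].mean()

    # One-sided p-value: how often a random relabelling is at least as extreme
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```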

From the plot above, we see that six judges seem to show a similarity preference towards their compatriots. These judges are from Canada, Germany, Ukraine, China, Mexico, and Korea.

Clustering By Country in Europe

Now that we have an idea of the similarity preference the judges show, we wonder whether there is more under the surface. According to the article Genes mirror geography within Europe (Novembre, Johnson, Bryc, et al. 2008), genetic variation within Europe closely mirrors geography, so our genetic makeup tends to resemble that of our compatriots. It might therefore be the case that we also share preferences with others in similar social networks. A thorough treatment of this question belongs more to the social sciences, since it would need to account for derby rivalries, histories of war, and other such disputes. We instead take a simpler approach: we look at continents and search for similarity preferences within them. We split our data into Europe and the rest, as follows:

We now apply the same methods described above, but with the compatriot relation redefined at the continent level: European judges paired with European divers versus everyone else. With the t-test baselines and the permutation-test randomization, we again gather the relationships that are significant against the null hypothesis of no grade inflation between same-continent (European) judges and divers.
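Continuing the earlier illustrative data frame, the sketch below shows one way the continent-level pairing could be encoded before re-running the same blunder and permutation machinery; the NOC-to-continent mapping and column names are hypothetical.

```python
# Illustrative continent-level pairing; EUROPE is a hypothetical subset of NOC codes
# and df is the merged data frame from the earlier preparation sketch.
EUROPE = {"GER", "UKR", "GBR", "ITA", "SWE"}

df["judge_europe"] = df["JudgeNOC"].isin(EUROPE)
df["diver_europe"] = df["DiverNOC"].isin(EUROPE)

# "Compatriot" at the continent level: both judge and diver are European
df["same_continent"] = df["judge_europe"] & df["diver_europe"]
```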

From the plot above, only the judge from New Zealand seems to show a similarity preference towards non-European divers. This clustering-by-country chapter should be viewed as an example of possible further analysis rather than as exact results, since it disregards many historical facts and the evolution of individual countries.

Results and Discussions

These tests give an idea of what can be concluded, but they disregard many other possible similarity preferences and thus cannot be taken as a definitive result. We are, however, confident in our statistical analysis. The judges showing a similarity preference towards their compatriots are those from Canada, Germany, Ukraine, China, Mexico, and Korea.

As noted in the introduction, the so-called familiarity principle has helped individuals survive, or attain particular social roles, allowing us to build large communities in which solidarity and harmony matter more than any individual’s well-being. It might therefore be interesting to look even further for similarity preferences based on divers’ and judges’ characteristics such as height, ethnicity, and gender.
