How can we test improvement over time in crucial leadership skills?

The FLIGBY serious game is an interactive leadership simulation where participants’ leadership skills are reliably assessed against the 29 factors in FLIGBY’s Leadership Skillset. The competencies themselves have been rigorously defined by expert development teams and mapped to several validated competency frameworks such as the Executive Core Qualifications (ECQ) System and Gallup’s StrengthsFinder framework, the predecessor to CliftonStrengths (Marer, Buzady & Vecsey, 2019).

The nature of FLIGBY’s Master-Analytics Profiler means that many of the typical biases we look to combat in performance rating are removed, but how can we assess whether participating in the game itself improves performance against the flow leadership competencies?

To answer this, we would need a comparison against a baseline, i.e. an improvement in the same measure that demonstrates a before-and-after effect. Unfortunately, it is not that simple: most available rating approaches carry a range of possible biases, which means the data is not always reliable. Let’s consider the most common (and most damaging) biases we tend to come across in personality and performance ratings.

Response bias – “garbage in, garbage out”

Response bias refers to (usually unconscious) flaws in the way people respond, which result in inaccurate or misleading responses to questions.

  • Socially desirable responding is a tendency for people to choose responses that will be viewed favorably by others, whether that is around values and beliefs, traits, competence or cultural norms. This is particularly likely where people have concerns that their responses will be judged and that the outcomes of judgments could have important consequences.
  • Acquiescence bias is the tendency for people to agree with the statements presented to them in surveys, even when agreement does not reflect their true view. For example, when given the statement “I think that providing regular feedback is important when managing others”, some may be disproportionately inclined to agree if their options are limited.

Rater bias – “it’s all in the eye of the beholder”

Rater biases or idiosyncratic rater effects refer to errors in judgment that can occur when raters have pre-existing biases that influence how they rate others’ performance.

  • Leniency and severity effects relate to raters judging people more kindly or more harshly, irrespective of actual performance. For example, a teacher may be reputed to be a “harsh marker” with higher expectations than average.
  • The halo and horns effects occur when raters’ judgments are disproportionately positive or negative based on a single attribute. An example might be where a colleague is impolite in social situations and this unconsciously influences raters’ judgments of their technical skills.
  • Affinity bias, which occurs when raters judge people who are similar to them more favorably, can be closely linked to these effects. The opposite is the basis of alienation bias.

Carryover effects – “practice makes perfect”

Unfortunately, repeating the same assessment can lead to practice effects, which are changes or improvements in performance that result from participants already being familiar with the requirements of the task.

Most relevant to systems that lead to personal growth, such as the FLIGBY game, is response-shift bias. This can occur when a person’s frame of reference changes between the two points of measurement. For example, somebody rating their emotional intelligence before playing FLIGBY could conceivably rate themselves as lower in this skill afterward, because their understanding of the concept and their self-awareness could have improved.

Assessment design – “the devil is in the detail”

However, all hope is not lost for performance ratings. No rating is 100% accurate, which is why we work within what we define to be acceptable levels of reliability as a scientific community. That is to say that rigorous testing and validation procedures are followed to ensure that mutually agreed standards of reliability are upheld.

It’s important to be able to establish whether results are reliable, and many statistical tests can assist with this. For example, Stochastic Frontier Estimation (SFE) is a statistical approach that can reveal when response biases are present and can even reveal differences in participants’ responses as a result of response-shift bias!

More good news is that many response and rater biases are a result of poor test design or systematic biases based on the idiosyncrasies of people. This means that they can be relatively easily identified and eliminated, or ideally prevented altogether.

Prevention may include using questions that measure the same variable but are worded in different ways, e.g. asking both “I am able to manage conflicting demands well” and “I often find it hard to manage my time when I have lots of things to do”.
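One way such a paired-item check might work is sketched below, assuming a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree). The function names, the example responses and the 2-point threshold are illustrative assumptions, not part of FLIGBY or of any published instrument.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point Likert scale


def reverse_score(response: int) -> int:
    """Flip a negatively worded item so that a high score always means 'more of the skill'."""
    return SCALE_MIN + SCALE_MAX - response


def is_inconsistent(positive_item: int, negative_item: int, threshold: int = 2) -> bool:
    """Flag a respondent whose answers to the paired items disagree by at least `threshold`."""
    return abs(positive_item - reverse_score(negative_item)) >= threshold


# Hypothetical respondents: (answer to positively worded item, answer to negatively worded item)
respondents = {
    "A": (4, 2),  # agrees with one, disagrees with the other: consistent
    "B": (5, 5),  # strongly agrees with both: possible acquiescence
    "C": (3, 3),  # neutral on both: consistent
}

flagged = [name for name, (pos, neg) in respondents.items() if is_inconsistent(pos, neg)]
print(flagged)
```

A respondent who agrees strongly with both a statement and its near-opposite is worth a second look, although a flag on its own does not prove that acquiescence bias was at work.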

Research also suggests that the choice of rating scale is important: forced-choice questions, for example, are more resistant to response bias than the Likert-style questions we are all familiar with (strongly agree, agree, etc.).

Furthermore, many idiosyncratic rater effects can largely be mitigated by two things: having more than one rater and the same people performing the ratings before and after. When measuring performance, taking an aggregated view from several raters can dilute the effects of one biased rater. In terms of showing before and after effects, having the same raters complete both assessments means that the same idiosyncrasies are likely to be acting in both conditions and therefore a difference from baseline can be established, regardless of what the baseline score was.
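The reasoning above can be illustrated with a small sketch, under the simplifying assumption that each rater’s idiosyncrasy acts as a constant offset added to a participant’s true score. All names and numbers here are invented for illustration.

```python
from statistics import mean

# Hypothetical constant offsets: a lenient, a neutral and a severe rater
rater_offsets = {"lenient": 1.0, "neutral": 0.0, "severe": -1.5}

true_before, true_after = 5.0, 6.0  # assumed true skill levels on an arbitrary scale


def observed(true_score: float) -> dict[str, float]:
    """Score each rater would record, given their constant offset."""
    return {rater: true_score + offset for rater, offset in rater_offsets.items()}


before, after = observed(true_before), observed(true_after)

# Change per rater: the offset cancels because the same rater scored both times
change_same_rater = {rater: after[rater] - before[rater] for rater in rater_offsets}

# Change when a different rater scores the second time: the offsets no longer cancel
change_mixed = after["severe"] - before["lenient"]

# Aggregated view across raters at a single time point
mean_before = mean(before.values())

print(change_same_rater)
print(change_mixed)
print(mean_before)
```

Under this assumption every rater reports the same improvement even though their baselines differ, whereas swapping raters between the two measurements turns a real gain into an apparent loss. Averaging across raters brings the single-time-point score closer to the true value than the most biased rater’s score, but does not remove the bias entirely. Real rater effects are rarely this tidy, so the sketch shows the logic rather than a guarantee.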

So while there is no silver bullet to ensure performance data is always reliable, there are checks and balances that can assist with data quality. Of crucial importance are selecting data collection procedures appropriate to exactly what you’re looking to measure and conducting pilots: test, tweak and retest. When it comes to measuring the difference that makes the difference, rather than “the devil is in the detail”, “the devil is in the design” would perhaps be more appropriate.

Kristina Risley

Learning and Organisational Development Specialist,

Lecturer in Leadership and Human Resources; Ph.D. Candidate

University of Westminster, London, U.K.