
Michie S, Wood CE, Johnston M, et al. Behaviour change techniques: the development and evaluation of a taxonomic method for reporting and describing behaviour change interventions (a suite of five studies involving consensus methods, randomised controlled trials and analysis of qualitative data). Southampton (UK): NIHR Journals Library; 2015 Nov. (Health Technology Assessment, No. 19.99.)


Chapter 5 Reliability of identification of behaviour change techniques in interventions using Behaviour Change Technique Taxonomy version 1 (study 4)

Abstract

Objectives: To assess frequency and reliability (intercoder and test–retest) of identifying BCTs in written intervention descriptions.

Methods: 40 coders were trained to identify BCTs defined in BCTTv1. Coders identified BCTs in 40 intervention descriptions published in protocols and repeated this task 1 month later. A consensus of judgements reached by coders who were experienced in coding BCTs (and who had also developed BCTTv1) was compared with the judgements of trained coders and used as the index of concurrent validity.

Results: 80 out of 93 (86%) defined BCTs were identified by at least one trained coder and 22 (28%) of these were identified in 16 or more of the 40 descriptions. Good intercoder reliability was observed across the 80 BCTs identified in the protocols; 64 (80%) achieved mean prevalence- and bias-adjusted kappa (PABAK) scores of ≥ 0.70 and 59 (74%) achieved mean scores of ≥ 0.80. There was good test–retest reliability, with good within-coder agreement observed between baseline and 1 month: for the 32 coders providing data at both time points, mean PABAK scores for the two occasions ranged from 0.84 to 0.99. Concurrent validity was assessed against the 15 BCTs agreed to be present by the experienced coders; a mean PABAK score of ≥ 0.70 was achieved for 14 out of the 15 BCTs.

Conclusions: BCTTv1 can be used by trained coders to identify BCTs in intervention descriptions reliably (both in terms of agreement with each other and over time) and validly (assessed by agreement with experienced coder consensus). Some BCT definitions require further clarification.

Introduction

Results from two initial phases of reliability testing indicated that BCTTv1 was reliable in specifying 26 frequently occurring BCTs when used by the taxonomy developers.40 These findings helped to identify BCT labels and definitions requiring refinement. They also suggested that user training may be required, as high reliability depends not only on the content of the taxonomy but also on the extent to which the user is skilled in using it. Since this initial assessment of reliability, the taxonomy has been developed further and we have developed a user-training programme for BCTTv1 delivered through face-to-face workshops, distance group tutorials and, more recently, an online training programme (see Chapter 4). Study 4, also reported in Abraham et al.,107 provides a detailed assessment of the reliability of use of the taxonomy. It investigates the concurrent validity of BCTTv1 and evaluates the use of BCTTv1 by newly trained users (primary researchers, systematic reviewers and practitioners) in specifying BCTs across a wide range of complex BCIs.

Since Abraham and Michie’s20 initial paper, intercoder reliability has been demonstrated for various subsequent BCT taxonomies.20,23,27,28,30,48 One indication of intercoder reliability is the percentage of BCTs for which there is agreement between coders that a BCT is present or absent. Abraham and Michie20 reported > 93% agreement between three pairs of coders (including the authors). As agreement scores tend to be inflated by the number of occasions on which neither coder reports the BCT to be present, Cohen’s kappa scores were calculated excluding BCTs agreed to be absent. Across 78 reliability tests (of identification of 26 defined BCTs and BCT clusters across 195 published intervention descriptions) the average kappa per technique was 0.79, with only three BCTs yielding kappa scores of < 0.70. Landis and Koch102 have suggested that kappa scores of 0.60–0.79 can be described as ‘substantial’ and that scores of ≥ 0.80 can be described as ‘outstanding’ but, conventionally, 0.70 is taken to be indicative of acceptable intercoder reliability. Thus, these findings demonstrated that BCTs can be reliably identified in published descriptions. Similar methods were applied to BCTTv1.
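For reference, Cohen’s kappa for two coders takes the standard form shown below; the formula is not reproduced in the papers cited here and is included only as a reminder of what the reported scores represent:

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = p_{A+}\,p_{B+} + p_{A-}\,p_{B-}, \]

where \(p_o\) is the observed proportion of agreement and \(p_e\) is the agreement expected by chance from coders A’s and B’s marginal proportions of ‘present’ (+) and ‘absent’ (−) codes. Raw percentage agreement is inflated when both coders code most BCTs as absent, which is why chance-corrected statistics such as kappa are reported instead.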

Initial assessment of BCTTv1 used six experienced coders on the BCTTv1 study team and calculated kappa (PABAK; for a full description see Analysis) when there were at least five identifications of a BCT in the 65 coded intervention descriptions.40 We set the criterion of ‘at least five identifications’ because the more frequent the BCT, the greater the confidence that PABAK is a useful indicator of the reliability of judging the BCT to be present. Of the 26 BCTTv1 BCTs meeting this criterion, nine were found to have ‘outstanding’ and 14 ‘substantial’ reliabilities. Although these are good kappa results, they raise several questions. First, would equally good reliability be obtained by newly trained coders who have not been involved in the development of the taxonomy? The extensive discussions involved in taxonomy development described by Michie et al.40 might result in enhanced agreement owing to tacit knowledge beyond the explicit BCT labels and definitions. Therefore, it is important to examine the extent to which the BCT taxonomy enables newly trained coders (i.e. coders who have not been involved in the development of BCTTv1) to reach agreement about the content of interventions.

Second, are these levels of agreement good enough to enable replicable implementation of intervention content? Most researchers would probably accept that BCTs whose definitions generate kappa scores of ≥ 0.70 can be reliably identified, but may have reservations about those with lower scores.108 Although these reliability statistics indicate levels of agreement, they do not quantify the degree of error that might be incurred when scientists try to replicate interventions from their descriptions. When two practitioners read a report of an effective intervention with kappa agreement of 0.80, which is ‘outstanding’, how different might their implementation of the intervention be? For example, in coding 40 BCTs it would be possible for each of two coders to identify two BCTs but for each to have identified different BCTs, or for one coder to identify and implement five BCTs while the other identified and implemented nine (i.e. a further four BCTs); either case would result in quite different interventions.

Third, it is possible to assess reliability of coding of specific BCTs, or of an intervention description as a whole. So, for example, if an intervention contained some BCTs that were reliably coded and some that were not, replication of the intervention might fail owing to the omission of poorly recognised BCTs. There might also be failure to replicate if the intervention description was badly written, either by omitting aspects of the intervention or through lack of clarity in the description. Improved coding systems cannot facilitate detection of a BCT if the description omits to mention that it was included in the intervention; Lorencatto et al.12 found that published descriptions reported, on average, less than half the BCTs that were included in the longer protocol descriptions. BCTTv1 might enable authors to report fuller, clearer descriptions of their evaluated interventions.17 On the other hand, when the description is simply unclear, BCTTv1 should enable agreement between coders as it clearly specifies that coders should not infer the presence of a BCT from vague descriptions. It is therefore important not only to assess the reliability of coding of each BCT, but also to investigate whether or not intervention descriptions are satisfactorily coded.

Fourth, it is quite possible for two coders to reach good agreement but be wrong; that is, coding may be reliable but not valid. While the reliability of use of BCT taxonomies has typically been evaluated, it is also important to assess the ‘validity’ of these judgements. Poor reliability, or a good level of reliability between newly trained coders but without agreement with expert consensus, would both be unsatisfactory outcomes. High reliability with poor validity might occur for a variety of reasons, such as misleading pre-coding discussion or a misleading cue in the intervention description. Estimates of validity require a criterion against which the codes are judged but, in the case of intervention descriptions, such a criterion does not exist.

Abraham and Michie20 assessed reliability between the two coders who developed the 26 BCT definitions (the authors) and also between the first author and two other coders trained to use the taxonomy. Findings showed no significant difference between reliability scores, indicating that trained coders could identify BCTs as reliably as coders involved in defining BCTs. In order to assess the concurrent validity of BCTTv1, we compared trained coders’ identification of BCTs with BCTs judged to be present in descriptions by a consensus of experienced coders. This consensus took as its starting point the independent coding by pairs from a pool of six taxonomy developers who were experienced in identifying BCTs in intervention descriptions. This resulted in the reliability data for 15 BCT definitions reported in both Michie et al.40 (see Table 3) and Abraham et al.107 It was then further developed by discussion of any discrepancies by each of the pairs. Further discrepancies were identified during BCT training of new coders. If a resolution was not obvious, SM and the study researcher (CW) reviewed the remaining discrepancies and proposed a coding. The list of BCT codes resulting from this process was circulated to the group of six researchers and each agreed the final codes. In the absence of any better criterion for valid codes, consensus regarding the presence of these 15 BCTs was used as the criterion by which to judge the validity of trained coders’ judgements across the same 40 intervention descriptions.

Fifth, BCT identifications rated with low confidence by coders could indicate problems with specific BCTs or ambiguity of specific intervention descriptions, which might prevent achievement of good intercoder reliability and accurate replication. However, our analyses have shown that reliability of BCT identification was not positively correlated with coder confidence and indeed tended to be negatively correlated.109 Therefore, it is important to ascertain, if possible, whether or not codes made with high confidence are also reliable and valid ratings.

Finally, various authors have proposed different methods of assessing kappa that adjust for its limitations. Byrt et al.99 proposed the PABAK, which gives considerably more meaningful indices for data such as BCT coding, in which the prevalence of each BCT is low and bias may occur; it was therefore the statistic chosen by Michie et al.40 as the most appropriate measure of intercoder reliability in applying BCTTv1. Gwet110 suggests that a further adjustment should be made for the chance agreement that can occur when two coders make random ratings, and makes allowance for this in the AC1 statistic, an alternative chance-corrected statistic to kappa110 (see equations 7 and 8 in Gwet110 and equation 4.1 in Gwet111). We compare reliability results generated by applying the PABAK and AC1 formulae.
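For reference, the two-rater, two-category forms of these statistics for binary present/absent coding are given below; they are written in their standard form rather than transcribed from the cited equations:

\[ \text{PABAK} = 2p_o - 1, \qquad \text{AC1} = \frac{p_o - p_e^{(\gamma)}}{1 - p_e^{(\gamma)}}, \quad p_e^{(\gamma)} = 2\pi(1 - \pi), \quad \pi = \tfrac{1}{2}\left(p_{A+} + p_{B+}\right), \]

where \(p_o\) is the observed proportion of agreement and \(\pi\) is the mean of the two coders’ proportions of ‘present’ codes. PABAK fixes the chance-agreement term at 0.5 (hence its robustness to low BCT prevalence), whereas AC1 estimates chance agreement from the probability that a rating is made at random.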

The aims of study 4 were:

  1. To assess the reliability (intercoder and test–retest) and validity of labels and definitions in BCTTv1 when used by newly trained coders to specify BCTs across a range of intervention descriptions.
  2. To assess whether or not confidence in codes relates to validity of codes.
  3. To identify labels and definitions in BCTTv1 requiring further refinement.

First, we investigated how many of the 93 BCTs were used by participants to describe the interventions coded in order to ascertain whether or not BCTTv1 contained too many BCTs.

i. How often are particular behaviour change techniques identified in intervention descriptions?

We then addressed the following research questions (RQs).

Replicability of coding

ii. To what extent do trained coders agree on BCTs identified in intervention descriptions; how good is intercoder reliability? And does this change over a 1-month period?

iii. Does intercoder reliability vary across different intervention descriptions?

iv. How good is test–retest reliability? Do coders identify the same BCTs at baseline as they do 1 month later?

v. Are meaningfully different patterns of reliability data generated by different indices of intercoder reliability?

Concurrent validity of coding

vi. What is the extent of agreement between trained coders and a consensus reached by experienced coders about whether or not BCTs are present? Is this stable over a 1-month period?

Confidence of coding

vii. How confident are trained coders in identifying BCTs from intervention descriptions and is this stable over a 1-month period?

viii. How does confidence in identification relate to observed intercoder reliability?

ix. How does trained coders’ confidence of BCT identification relate to trained coders’ agreement with experienced coders’ consensus about which BCTs are present?

Feedback on using Behaviour Change Technique Taxonomy version 1

x. Are there any BCT labels and definitions that trained coders judge to be in need of refinement and/or clarification? How do such judgements relate to observed intercoder reliability?

Method

This study is also published as Abraham et al.107

Materials

Forty descriptions of BCIs were used to test the reliability of identification of BCT labels and definitions. Protocols were selected from those published between 2009 and 2010 in three interdisciplinary journals that publish BCI protocols related to health improvement, namely BioMed Central (BMC) Public Health (n = 24), Implementation Science (n = 11) and BMC Health Services Research (n = 5). The 40 descriptions included interventions designed to promote or change behaviours to prevent illness (n = 13), behaviours to improve illness management (n = 13) and behaviours of health-care professionals (n = 14). Quota sampling ensured that protocols were selected from each of these three broad categories.

A coding task booklet (consisting of the 40 intervention descriptions and task instructions) was developed and sent to each coder. Coders used BCTTv140 to identify the absence/presence of BCTs in the intervention descriptions. For a list of the 40 protocols from which intervention descriptions were extracted, see Table 10; for the coding booklet, see Appendix 2, Table 20.

TABLE 10

List of 40 protocols from which intervention descriptions were extracted: sampled across BMC Public Health, Implementation Science and BMC Health Services Research

Participants

For participant demographic data, see Table 11. Forty-eight coders who had not been involved in the development of BCTTv1 were trained to use the taxonomy96 (for more details of recruitment and training, see Chapter 4, Materials). Of these, 72.5% were from the UK, 17.5% from other European countries, 5% from the USA and 5% from Australia. They ranged in age from 24 to 60 years (mean = 37.13 years, SD = 7.45 years) and 70% were women. Eighty per cent had obtained a research or clinical doctorate and 13% identified themselves as active practitioners in their field. Eighty-eight per cent rated themselves as being highly confident in using the taxonomy to specify intervention content after training.

TABLE 11

Demographic information for trained coders (study 4)

Procedure

Development of expert consensus for behaviour change technique identification

During the development of BCTTv1, six experienced BCT coders (study team members: CA, SM, MJ, JF, WH and MR), working in pairs, independently identified BCTs in the intervention descriptions using BCTTv1. They used the same coding method as that used by the trained coders (described in the next section). The agreement within pairs produced reliability data for the 86 BCTs available at that time, as reported in Michie et al.40 The expert consensus for the current study took this as its starting point. Expert consensus was developed by discussion of any discrepancies within each of the pairs and was also informed by feedback from trained coders. If a resolution was not obvious, SM and the study researcher (CW) reviewed the remaining discrepancies and proposed a coding. The list of BCT codes resulting from this process was circulated to the whole study team, who agreed the final codes. We used this consensus about the presence of BCTs in the descriptions as the criterion against which trained coder codings were judged and concurrent validity was assessed.107

Behaviour change technique identification by trained coders

The trained coders (n = 48) were randomised into 24 pairs using a random number generator. Both members of the coding pair received the same set of 20 (out of the 40) intervention descriptions to code. A random number generator was also used to allocate descriptions to coder pairings. The trained coders completed the coding task at two time points, 1 month apart.

At time 1, 40 of the 48 coders (20 coding pairs) completed the exercise, generating 8–12 (as opposed to the planned 12) sets of reliability data for each of the 40 intervention descriptions. Coders used BCTTv1 to identify BCTs in each intervention description. They indicated which BCT was identified and where in the description it was identified, and also rated their confidence in their identification, using ‘+’ to mean ‘present in all probability but evidence not clear’ and ‘++’ to mean ‘present beyond all reasonable doubt and clear evidence’. After completing the task for the first time (‘time 1’), coders returned all materials and were asked to delete any copies they had made.

One month later (‘time 2’), coders were resent coding materials (including the same 20 descriptions but in a different order) and completed the coding exercise again. Thirty-two coders completed the exercise at time 2 (16 coding pairs, comprising the same trained coders as time 1). Coding took approximately 1 day of work for each coder on each occasion. Coders were paid an honorarium for their time.

Feedback on behaviour change technique definitions and labels

After completing the coding exercise, coders were asked to provide free response written feedback about using BCTTv1. They were invited to identify BCT definitions and labels that they believed remained unclear.

Analysis

The extent to which coders agreed on the absence/presence of BCTs in the descriptions (‘intercoder reliability’) was assessed using PABAK.99 Coders at time 1 were randomly allocated to pairs working on the same set of intervention descriptions. Mean PABAK scores for each trained coder pairing, for each BCT and for each intervention description were calculated using the number of agreements and disagreements between each coding pair. Trained coder agreement with expert consensus was calculated by pairing each individual trained coder with the expert consensus BCT identifications (as described above). Trained coder agreement with expert consensus was represented by mean PABAK scores for each trained coder–consensus pairing and for each BCT.

Test–retest reliability of trained coder judgements was also assessed using PABAK. Coders at time 1 were paired with themselves at time 2. Mean PABAK scores for each intervention description were calculated using the number of agreements and disagreements. A PABAK score for each coder was then calculated using the mean of these scores.

Prevalence- and bias-adjusted kappa was used rather than Cohen’s kappa statistic.101 Many authors have discussed inter-rater agreement measures.111 Kappa tends to underestimate identification reliability when the number of instances is small or when agreements and disagreements are asymmetrically distributed; PABAK overcomes these problems. Guidance on interpretation of Cohen’s kappa and other reliability statistics has been published and, conventionally, 0.70 is regarded as indicative of acceptable or good inter-rater reliability. In the absence of evidence-based guidance on interpretation of PABAK scores, we report means above and below 0.70 and 0.80.
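As a minimal illustration of these calculations (not the study’s analysis code; the judgement vectors below are hypothetical), PABAK and, for contrast, Cohen’s kappa for one coder pair can be computed as follows:

```python
# PABAK and Cohen's kappa for one coder pair judging the presence (1) or
# absence (0) of each BCT in one intervention description (hypothetical data).

def pabak(codes_a, codes_b):
    """Prevalence- and bias-adjusted kappa for two binary coders: 2 * Po - 1."""
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)
    return 2 * p_o - 1

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa, which is deflated when most codes are agreed absences."""
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    p_yes_a, p_yes_b = sum(codes_a) / n, sum(codes_b) / n
    p_e = p_yes_a * p_yes_b + (1 - p_yes_a) * (1 - p_yes_b)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for the 93 BCTs in one description (most BCTs absent).
coder_1 = [1, 1, 0, 0, 1] + [0] * 88
coder_2 = [1, 0, 0, 0, 1] + [0] * 88
print(round(pabak(coder_1, coder_2), 2))         # high: 92 of 93 codes agree
print(round(cohens_kappa(coder_1, coder_2), 2))  # lower: low prevalence penalises kappa
```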

Test–retest of intercoder reliability between time 1 and time 2 and of trained coder agreement with experienced coder consensus was assessed by bivariate correlations, and stability of reliability was assessed by paired t-tests. For each of the BCTs identified, the number of high confidence ratings by each trained coder at time 1 and at time 2 was calculated. The percentage of high confidence ratings was then calculated for each BCT taking into account the frequency of BCT identification across the 40 intervention descriptions. Paired t-tests (i.e. by pairing percentage of high confidence ratings made at time 1 with percentage of high confidence ratings made at time 2) were used to assess stability of high confidence ratings. A multivariate ANOVA was carried out to assess whether or not percentage of high confidence ratings at time 1 and at time 2 differed according to frequency of BCT identification (i.e. between categories of frequently, occasionally and rarely).
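The stability analyses can be sketched as below, assuming per-BCT mean PABAK scores are available for the two time points (the values shown are illustrative only, not study data; the multivariate ANOVA on confidence percentages by frequency category would be set up analogously):

```python
# Bivariate correlation and paired t-test across the two coding occasions.
from scipy.stats import pearsonr, ttest_rel

pabak_time1 = [0.92, 0.88, 0.67, 0.95, 0.71, 0.84]   # illustrative per-BCT means, time 1
pabak_time2 = [0.90, 0.89, 0.66, 0.96, 0.73, 0.82]   # illustrative per-BCT means, time 2

r, p_r = pearsonr(pabak_time1, pabak_time2)    # correlation across occasions
t, p_t = ttest_rel(pabak_time1, pabak_time2)   # paired t-test for change in level

print(f"r = {r:.2f} (p = {p_r:.3f}); t = {t:.2f} (p = {p_t:.3f})")
```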

A number of alternatives to Cohen’s kappa have been developed. Gwet110 tested a number of such reliability indices and concluded that the AC1 statistic (see equations 7 and 8 in Gwet110) had optimal output characteristics. This is particularly true when the frequencies of occurrence are small. In the present study, a two-tailed chi-squared test applying the Yates’ correction for continuity was used to explore which statistic, PABAK or AC1, gave the higher number of BCTs achieving good reliability (i.e. ≥ 0.70).
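A sketch of the two-rater, two-category AC1 calculation and of the Yates-corrected chi-squared comparison is given below. The threshold counts in the contingency table are taken from the Results (6 of 22 frequently identified BCTs reaching ≥ 0.70 under PABAK vs. 20 of 22 under AC1); the code itself is illustrative rather than the study’s own script.

```python
# Gwet's AC1 for two coders with binary present/absent codes, plus the
# Yates-corrected chi-squared test comparing threshold attainment under
# PABAK versus AC1.
from scipy.stats import chi2_contingency

def gwet_ac1(codes_a, codes_b):
    """Two-rater, two-category AC1: (Po - Pe) / (1 - Pe), with Pe = 2*pi*(1 - pi),
    where pi is the mean of the two coders' proportions of 'present' codes."""
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    pi = (sum(codes_a) + sum(codes_b)) / (2 * n)
    p_e = 2 * pi * (1 - pi)
    return (p_o - p_e) / (1 - p_e)

coder_1 = [1, 1, 0, 0, 1] + [0] * 88   # hypothetical codes for 93 BCTs
coder_2 = [1, 0, 0, 0, 1] + [0] * 88
print(round(gwet_ac1(coder_1, coder_2), 2))

# 2 x 2 table: rows = statistic, columns = reached / did not reach >= 0.70.
table = [[6, 16],    # PABAK: 6 of 22 frequently identified BCTs reached 0.70
         [20, 2]]    # AC1:   20 of 22 reached 0.70
chi2, p, dof, _ = chi2_contingency(table, correction=True)  # Yates' continuity correction
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")              # approx. 15.89, p < 0.001
```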

The free response feedback provided by coders was subjected to analyses. The lead researcher on these analyses (CW) read coders’ feedback and sorted responses into themes. Themes were developed during inspection of the data. Additional themes were added if data did not fit the existing themes. An independent coder (KS) then checked allocation of feedback to themes. Any discrepancies were resolved through discussion between the two researchers.

Results

Reliability of behaviour change technique identification by trained coders

Agreement rates were similar across the 20 trained coder pairs at time 1. All coding pair means fell within one SD of the overall PABAK mean scores (mean = 0.86, SD = 0.02) indicating that there were no outlying pairs. Consequently, all pairs were included in subsequent analyses.

Table 12 lists the 93 BCT labels from BCTTv1, ordered by the frequency of identification by trained coders at time 1. The columns in the table also show (i) the mean PABAK scores between trained coder pairs; (ii) the range of mean PABAK scores between trained coder pairs; (iii) the percentage of identifications made with high confidence ratings; (iv) the mean AC1 score between trained coder pairs; and (v) the number of intervention descriptions (out of 40) in which each BCT was identified. Time 2 data for (i) and (iii) are given in parentheses in the respective columns.

TABLE 12

Mean agreement between trained coder pairs and confidence about the presence or absence of 93 BCTs in 40 intervention descriptions, time 1

Number of behaviour change techniques identified

How often were specific behaviour change techniques identified?

Of the 93 BCTs in BCTTv1, 80 (86%) were identified by at least one trained coder, at time 1. Twenty-two BCTs were identified in 16 or more intervention descriptions (i.e. in at least 40% of the 40 descriptions; referred to as ‘frequently identified’). Twenty-five BCTs were identified in between 6 and 15 intervention descriptions (‘occasionally’ identified) and 33 BCTs were identified in 1–5 descriptions (‘rarely’ identified). Thirteen BCTs were not identified in any description at time 1, but two of these were identified at time 2.

Replicability of coding

To what extent do trained coders agree on BCTs identified in intervention descriptions; how good is intercoder reliability? And does this change over a 1-month period?

Mean PABAK scores ranged from 0.30 to 1.00 over the 80 BCTs identified; 64 of the 80 observed BCTs (80%) achieved mean PABAK scores of ≥ 0.70 and 59 (74%) achieved mean PABAK scores of ≥ 0.80, indicating that good intercoder reliability was observed across BCTs by trained coders. All of the BCTs in the ‘occasionally’ and ‘rarely’ identified categories achieved mean PABAK scores of ≥ 0.70, but only 6 of the 22 (27%) BCTs in the ‘frequently’ identified category reached the 0.70 threshold.

Sixteen BCTs failed to reach the 0.70 PABAK threshold with nine falling below 0.60. All were frequently identified. Four of these were very close to the threshold, scoring 0.68 or 0.69: problem-solving (0.68), action planning (0.68), behavioural practice/rehearsal (0.68) and monitoring of behaviour by others without feedback (0.69). The remaining 12 were goal-setting (outcome) (0.61), prompts/cues (0.66), feedback on outcome(s) of behaviour (0.62), adding objects to the environment (0.57), information about health consequences (0.50), feedback on behaviour (0.57), social support (practical) (0.51), goal-setting (behaviour) (0.48), information about social and environmental consequences (0.50), instruction on how to perform behaviour (0.31), credible source (0.40) and social support (unspecified) (0.30).

Intercoder reliability at time 1 and at time 2 was strongly correlated, r(14) = 0.97; p < 0.001. The pattern of mean PABAK scores at time 1 and time 2 did not significantly change, t(15) = −0.05; p = 0.97. The pattern of mean PABAK scores across the BCTs identified also did not significantly change, [t(79) = 0.56; p = 0.58, time 1 PABAK mean = 0.88, SD = 0.17; time 2 PABAK: mean = 0.87, SD = 0.17]. Twelve out of the 22 frequently observed BCTs at time 2 had mean PABAK scores < 0.70, and 10 out of these also fell below this threshold at time 1. This suggests that intercoder reliability remained stable over time.

Does intercoder reliability vary across different intervention descriptions?

We considered whether or not mean PABAK scores differed across descriptions (as opposed to BCTs). Across the 40 intervention descriptions, mean PABAK scores ranged between 0.73 and 0.96 (mean = 0.87, SD = 0.05). Four of the 40 descriptions had mean PABAK scores of < 0.80 (mean PABAK for the four descriptions = 0.76, SD = 0.04). All four descriptions were taken from protocols published in 2010 by BMC Public Health. Two described interventions targeting behaviours to prevent illness (physical activity: mean PABAK = 0.73, and healthy eating: mean PABAK = 0.77), one described an intervention targeting behaviours to improve illness management (reduction in the use of methamphetamine: mean PABAK = 0.79) and one described an intervention targeting the behaviour of health-care professionals (health promotion and obesity prevention: mean PABAK = 0.78). There was no indication, therefore, that any particular description or subset of the descriptions influenced intercoder reliability.

How good is test–retest reliability? Do coders identify the same BCTs at baseline as they do 1 month later?

Across the 32 coders providing data at both time points, mean PABAK scores for the two occasions ranged from 0.84 to 0.99: 14 of the 32 coders (44%) achieved mean PABAK scores of ≥ 0.80 but < 0.90 and 18 (56%) achieved mean PABAK scores of ≥ 0.90. Good test–retest reliability was observed across all coders, indicating that those coders who performed well at time 1 were also those who performed well at time 2.

Are meaningfully different patterns of reliability data generated by different indices of intercoder reliability?

Correlational analyses showed that the relationship between mean PABAK scores and mean AC1 scores across BCTs at time 1 was near perfect (r = 0.96; p < 0.001, two-tailed). This reflects the mathematical similarity of the two formulae. For all occasionally and rarely identified BCTs, both the PABAK and AC1 statistics show reliable identification of BCTs (i.e. > 0.70). Twenty of the 22 BCTs in the ‘frequently’ identified category meet the 0.70 threshold when reliability is represented by AC1 but only six meet this threshold when represented by PABAK [χ2 (degrees of freedom = 1, n = 44) = 15.89; p < 0.001]. The AC1 statistic generated higher reliability scores than PABAK for 59 of the BCTs and lower scores for 10 of the BCTs. AC1 was significantly more likely than PABAK to generate reliability scores that exceeded the 0.70 threshold [χ2 (degrees of freedom = 1, n = 160) = 10.58; p < 0.001].

Validity of coding

What is the agreement between trained coders and developer consensus about which behaviour change techniques are present? Is this stable over a 1-month period?

Table 13 shows data for 15 BCTs out of 86. These BCTs were identified at least five times across the 40 intervention descriptions by the experienced coders and appear in Table 13 ordered by frequency of identification.40 Subsequent columns show (i) the mean PABAK scores for trained coder agreement with the experienced coders, (ii) the mean PABAK scores for experienced coders, (iii) the mean PABAK scores for the 15 BCTs between trained coder pairs at time 1, (iv) the percentage of trained coder identifications for the 15 BCTs made with high confidence ratings and (v) the number of descriptions (out of 40) in which the BCT was identified by at least one of the experienced coders. For reference, time 2 data are provided in parentheses for (iii) and (iv).

TABLE 13

Trained coder agreement with consensus of experienced coders about BCTs present in 40 intervention descriptions

Mean PABAK scores for experienced coders for the 15 BCTs (see Table 13) ranged from 0.60 to 0.90 (overall mean = 0.77); all but one BCT achieved a mean PABAK score of ≥ 0.70. Mean trained coders’ PABAK scores ranged from 0.40 to 0.85 (overall mean = 0.70); six BCTs achieved mean PABAK scores of < 0.70. The mean reliability scores for experienced coders and trained coders were correlated with a large effect size [r(13) = 0.69; p < 0.01, two-tailed] but were significantly different [t(14) = 3.01; p < 0.01], with experienced coders achieving higher PABAK scores for 13 out of the 15 BCTs. The BCT ‘credible source’ reduced the overall intercoder reliability for both the experienced coders and the trained coders (PABAK score = 0.60 and 0.40 for experienced coders and trained coders, respectively). Discounting this particular BCT, the overall mean PABAK scores for the 15 BCTs would increase to 0.78 and 0.72 for experienced coders and trained coders, respectively.

Concurrent validity scores for the 15 BCTs at time 1 ranged from 0.49 to 0.83, with four BCTs having scores of < 0.70. Three of these four were BCTs with low intercoder reliabilities; the fourth, ‘review behaviour goal(s)’, had a validity score of 0.68. The poorest validity was for ‘credible source’, with ‘social support (practical)’ and ‘goal-setting (outcome)’ also being low; all three of these had also shown low intercoder reliability. The mean PABAK validity scores across trained coder–consensus pairs did not significantly change with time [t(31) = 0.84; p = 0.41] and scores on the two occasions were correlated [r(30) = 0.67; p < 0.001]. Validity was reasonably high and remained stable over a 1-month period.

Confidence of coding

How confident are trained coders in identifying behaviour change techniques from intervention descriptions and is this stable over a 1-month period?

The percentage of identifications judged by coders to have been made with high confidence (‘++’) across the 80 BCTs identified ranged from 0% to 100%, with an average of 57%. The trained coders appeared to be more confident about identifying BCTs at time 2 than the same BCTs at time 1: t(79) = −2.08; p < 0.05 (mean percentage of judgements made with high confidence at time 1 = 57.00, SD = 23.65; time 2 = 63.04, SD = 23.11). The percentage of trained coders’ high confidence ratings also increased for the 15 BCTs agreed as present by the experienced coders: t(14) = −2.78; p < 0.05 (mean percentage of judgements made with confidence ‘++’: time 1 = 66.47, SD = 11.51; time 2 = 71.73, SD = 10.55).

The percentage of high confidence ratings did not differ according to frequency of BCT identification at time 1 [F(2,77) = 2.55; p = 0.08]; however, it did differ significantly at time 2 [F(2,77) = 3.48; p < 0.05]. Frequently identified BCTs were rated with more confidence (mean percentage = 70.45, SD = 9.68) than rarely identified BCTs [p < 0.05 (mean percentage = 55.33, SD = 32.53)]; there were no significant differences between the rarely and occasionally identified categories (p = 0.18) or between the occasionally and frequently identified categories (p = 1.00).

How does confidence relate to intercoder reliability?

For the 80 BCTs identified by trained coders, greater confidence was not associated with greater reliability. Perhaps surprisingly, mean PABAK scores were not positively correlated with confidence, but rather tended to be negatively correlated: r(78) = −0.37; p = 0.07 (two-tailed).

How does trained coders’ confidence of behaviour change technique identification relate to trained coders’ agreement with experienced coder consensus about which behaviour change techniques are present?

Confidence ratings were not associated with validity; trained coders’ confidence in their identification of the 15 BCTs agreed by experienced consensus was not associated with mean PABAK scores [r(13) = 0.27; p > 0.10].

Feedback on use of Behaviour Change Technique Taxonomy version 1

Are there any BCT labels and definitions that trained coders judge to be in need of refinement and/or clarification? How do such judgements relate to observed intercoder reliability?

Free response feedback was categorised into three themes: (1) BCT definitions and labels that remain unclear/require refinement, (2) improving reliable and valid application and (3) methods to improve the usability of BCTTv1.

Behaviour change technique definitions and labels that remain unclear/require refinement

Of the 93 BCTs in BCTTv1, trained coders highlighted 12 as being unclear or requiring further refinement. Nine (75%) of these BCTs were among the 12 BCTs achieving mean PABAK reliability scores of < 0.70 (listed above). Three of these were among the 15 BCTs identified by the experienced coders that achieved mean PABAK validity scores of < 0.70. This suggests that there was considerable correspondence between coders’ individual judgements of BCT definition clarity, BCT intercoder reliability and validity.

Fifty per cent of the BCT definitions judged by trained coders as being in need of further clarification belonged to the same BCT grouping in BCTTv1 and therefore referred to the same underlying mechanism of change. For example, coders noted that if the distinctions between the three social support BCTs were clearer (i.e. Social support – unspecified, Social support – practical and Social support – emotional), users would find it easier to decide when unspecified forms of support could more accurately be specified as practical:

Make the definitions of ‘social support unspecified’ clearer and in tune with the definition of Social support (practical)

Participant 7

Would be useful to have ‘social support’ better defined

Participant 11

. . . still finding the social support BCTs vague – e.g. aside from the definition of ‘social support’, what defines ‘practical’ in practical social support?

Participant 4

Similar refinement was recommended for the three information provision BCT variants (i.e. Information about health consequences, Information about emotional consequences and Information about social and environmental consequences):

Some BCTs, e.g. social support (unspecified) or information about social and environmental consequences are too vague and used for too many techniques.

Participant 18

Improving reliable and valid application

Four BCTs achieved intercoder reliability mean PABAK scores of 0.68 or 0.69 at time 1. These were: problem-solving (0.69), demonstration of the behaviour (0.69), self-monitoring of behaviour (0.69) and monitoring of outcome(s) of behaviour without feedback (0.68). We considered these as being close enough to the 0.70 threshold to be counted as reliably identified. Three additional BCT definitions were identified by trained coders as being in need of clarification and/or refinement. These were: demonstration of the behaviour, action planning and restructuring the physical environment. Coders tended to find several ‘pairs’ of BCTs difficult to distinguish: instruction on how to perform the behaviour versus demonstration of the behaviour; goal-setting (behaviour) versus goal-setting (outcome); goal-setting (behaviour) versus action planning; and restructuring the physical environment versus adding objects to the environment. To further facilitate reliable and valid specification of these BCTs, guidance was requested on when to code one rather than both members of these BCT pairs:

Why is this (excerpt) not coded as ‘instruction on how to perform the behaviour’? (This is a wider question about how demonstration differs from instruction.) Does demonstration encompass instruction?

Participant 9

Two BCTs should not be inferred as always co-occurring. It is possible to have a demonstration of a behaviour without practice and vice versa. So the definitions of these two BCTs should be revised.

Participant 23

The BCT ‘credible source’ created considerable difficulty, as indicated by low reliability and validity for trained coders, but also low reliability for the experienced coders. Coders noted that identification required inferences regarding intervention recipients’ evaluation of the credibility of those delivering interventions and that this could sometimes be unclear. They suggested that intercoder reliability could be improved if additional guidance documents were made available:

What determines/constitutes ‘credibility’ of a source? Is a parent providing information to a child a ‘credible source’? Also, is telling someone to do a behaviour a case of ‘presenting information . . . in favour of or against the behaviour’?

Participant 33

Methods to improve the usability of Behaviour Change Technique Taxonomy version 1

Coders’ suggestions for improving usability focused on improving the speed at which specific BCTs could be located in the taxonomy. Frequent suggestions included having a hyperlinked index page at the beginning of version 1 to minimise the need to manually search through BCTs:

Would be useful to have a first page list of all codes and where to find them and any linked codes as this was difficult to scroll back and forth finding the codes on each page during doing.

Participant 14

. . . presentation of the taxonomy – a front page with all BCT labels numbered and with page of definition to enable access

Participant 2

Discussion

Eighty of the 93 BCTs defined in BCTTv1 were identified in the 40 intervention descriptions by at least one trained coder, 22 BCTs were identified in at least 16 descriptions and 47 BCTs were identified in at least 6 of the 40 descriptions. Thus, coders made extensive use of BCTTv1, justifying the large number of BCTs included. Clearly, specification of BCIs requires at least this range of BCTs to describe the active content.

Good intercoder reliability was observed across the 80 identified BCTs, with 80% achieving mean PABAK scores of at least 0.70 and 74% with scores of ≥ 0.80. Poorer reliabilities were more common for frequently occurring BCTs; just 12 of the 22 (55%) achieved mean PABAK scores of ≥ 0.68 and six (27%) achieved ≥ 0.70.

Intercoder reliability was equally good across intervention descriptions and remained similar across two tests separated by 1 month, indicating temporal stability. There was also good trained coder agreement with developer consensus about the 15 BCTs agreed on as present, thus indicating concurrent validity.

Trained coders’ confidence in their BCT identifications increased from time 1 to time 2 and varied across BCTs. Coders’ confidence in identifying BCTs was not related to either intercoder agreement or agreement with experienced coders; indeed, there tended to be a negative (albeit non-significant) correlation between confidence and intercoder agreement. This suggests that perceived confidence is not a useful indicator of accuracy of identification of BCTs in intervention descriptions. Reliability was consistently higher when represented by the AC1 statistic than by the PABAK statistic.

The BCTTv1 taxonomy of BCT labels and definitions is an important development of previous work and provides the most comprehensive listing of BCTs to date. Of the 22 BCTs observed in 16 or more of 40 intervention descriptions, 18 (82%) were also identified in the Abraham and Michie20 BCT taxonomy, confirming the frequent occurrence of these 18 BCTs. However, while Abraham and Michie defined only 22 BCTs and four packages of BCTs, BCTTv1 defines 93 separate BCTs, 80 of which were identified by coders in 40 published intervention descriptions. The data reported here provide good overall intercoder and test–retest reliability for 80 of those definitions among trained coders. Furthermore, the data suggest that newly trained coders can code accurately as defined by agreement with experienced coder consensus. Given the expertise derived from the experience of developing BCT definitions and the previous experience of the developers in identifying BCTs in intervention descriptions, correspondence between trained coder judgements and experienced developers’ judgements was regarded as a measure of the validity of the trained coders’ judgements. Thus, validity of BCTTv1 is demonstrated among new users and does not depend on additional discussion and cues that may have influenced the experienced coders. Therefore, the BCTTv1 taxonomy can be used successfully, with appropriate training, by a wide range of users to identify a wide range of BCTs included in intervention descriptions.

Beyond testing the reliable and valid use of current BCT labels and definitions, a primary purpose of this research was to identify definitions that might be applied more reliably with further clarification or guidance. Sixteen frequently identified BCTs achieved PABAK scores of < 0.70 with nine of these falling below 0.60. These included four pairs of BCTs defining variants of the same underlying change mechanism, namely (1) Information about (a) health consequences and (b) social and environmental consequences; (2) goal-setting in relation to (a) outcomes and (b) behaviours; (3) feedback on (a) behaviour and (b) outcome(s) of behaviour; and (4) social support both (a) practical and (b) unspecified.

It may be appropriate to remain somewhat circumspect about trained coders’ feedback on the clarity of BCT definitions, as confidence in BCT identification was not related to observed intercoder reliability (and indeed tended to be negatively correlated with it) or to trained coder agreement with experienced coder consensus. Nonetheless, coders provided clear feedback suggesting that further distinction between such BCT groupings was needed.

Intercoder reliability was poor for ‘instruction on how to perform behaviour’ and individual coder feedback suggested confusion with ‘demonstration of the behaviour’. This was one of four pairs of BCTs (including two goal-setting definitions) with which coders reported difficulty. Both prompts/cues and adding objects to the environment also fell below the 0.70 threshold. Further testing of the definitions of these BCTs is therefore warranted.

Trained coders commented on the definition of credible source. This BCT showed poor intercoder reliability among trained coders and poor validity, and it also had the lowest intercoder reliability score among experienced BCT coders. This is clearly a BCT that requires further clarification and specification, but it should be retained because of its relevance to behaviour change. The importance of message–source credibility has been recognised for over half a century151 and coding for source characteristics in intervention descriptions has been shown to predict intervention effectiveness in meta-analytic studies. Perceived professional competence of intervention facilitators has also been found to predict the effectiveness of human immunodeficiency virus-preventative interventions.152 Similarities between intervention facilitators and recipients (e.g. age, ethnicity and behaviour risk group membership) have also explained differential effectiveness of interventions. This highlights that coding intervention descriptions for message source or facilitator characteristics is important. In addition, it may be important to distinguish between the perceived professional competence of those delivering the interventions and their ‘credibility’, which could also be based on common group membership. Finally, it may be important to distinguish between the characteristics of those delivering interventions and the content of what they deliver. Certainly, further work on the label and/or definition of this BCT is needed.

The coders trained for this study seem typical of those likely to want to use BCT labels and definitions to identify intervention components associated with greater intervention effectiveness, for example when conducting meta-analyses.20,23,48,56 Although we used data from 40 trained coders, we recommend that future studies continue to use at least two coders to check (and report) intercoder reliability.

We found that reliability was high and similar over the 40 intervention protocols coded. However, as reported above, we used a sample of descriptions included in well-cited journals that publish BCI protocols related to health improvement and this reliability might not be maintained in intervention descriptions of poorer quality.

We compared intercoder reliability assessments generated by the PABAK and AC1 statistics. Reassuringly, our findings indicate that these indices generate very similar patterns of intercoder scores across identified BCTs. We found, however, that for the most frequently identified BCTs the two indices differ significantly in their representation of reliability, with PABAK generating lower reliability scores than AC1. These frequently identified BCTs were less reliably identified than other BCTs. Thus, our findings suggest that, at least in some contexts, the two statistics may not be directly interchangeable when data similar to the current data are investigated. Guessing may be more likely for frequently occurring BCTs than for rare BCTs, meaning that random effects, such as those that AC1 controls for, could emerge. AC1 may therefore be a more appropriate index. It may also be helpful to interpret these statistics alongside other results (e.g. frequency of item observations). The consistently higher AC1 scores suggest that our data may have been influenced by random judgement effects that are controlled for in AC1 but not in PABAK, resulting in lower PABAK scores. Nevertheless, even PABAK takes account of the high prevalence of negative codes and of the extent to which coders use similar marginal totals and, therefore, is more appropriate for this type of data than a simple kappa. The use of more than one statistic to assess agreement is a relatively new approach and the consistent difference between the scores indicates that AC1 takes account of more sources of error in these data and, therefore, is more appropriate for establishing the level of agreement in BCT coding. The high correlation between the measures indicates that the pattern of relative agreement across BCTs is adequately represented by either score, although PABAK may result in fewer frequently observed BCTs achieving reliability above thresholds such as the 0.70 used here.

Comparison of the pattern of results for frequent and rare BCTs suggests that PABAK is as good as AC1 for the rare but not for the frequent BCTs, and the poor reliabilities (using PABAK) were more common for frequently occurring BCTs. As was suggested by our comparison of confidence ratings across BCT categories, guessing may be more likely for common than for rare BCTs. This would result in the occurrence of the random effects that AC1 controls for, and confirms the need to use AC1 rather than PABAK in studies assessing reliability. Further developments of reliability indices111 that control for additional sources of error, and more recently proposed indices, may give an even better indication of the true reliability of coding.

In conclusion, our results show that BCTTv1 can be used by trained coders to reliably identify BCTs included in intervention descriptions. Feedback from our coders highlights several BCT definitions that require further clarification and refinement before they can be identified reliably. BCTTv1 provides a uniquely comprehensive list of BCTs. The taxonomy can be used both by developers and implementers of BCIs and by those wishing to identify effective BCTs within complex interventions (e.g. in systematic reviews).

Copyright © Queen’s Printer and Controller of HMSO 2015. This work was produced by Michie et al. under the terms of a commissioning contract issued by the Secretary of State for Health. This issue may be freely reproduced for the purposes of private research and study and extracts (or indeed, the full report) may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NIHR Journals Library, National Institute for Health Research, Evaluation, Trials and Studies Coordinating Centre, Alpha House, University of Southampton Science Park, Southampton SO16 7NS, UK.

Included under terms of UK Non-commercial Government License.
