Equating scores of the University of Pennsylvania Smell Identification Test and Sniffin' Sticks test in patients with Parkinson's disease

Background: Impaired olfaction is an important feature in Parkinson's disease (PD) and other neurological diseases. A variety of smell identification tests exist, such as "Sniffin' Sticks" and the University of Pennsylvania Smell Identification Test (UPSIT). An important part of research is being able to replicate findings or to combine studies in a meta-analysis. This is difficult if olfaction has been measured using different metrics. We present conversion methods between the UPSIT, Sniffin' 16, and Brief-SIT (B-SIT), and between the Sniffin' 12 and Sniffin' 16 odour identification tests. Methods: We used two incident cohorts of patients with PD who were tested with either the Sniffin' 16 (n = 1131) or UPSIT (n = 980) and a validation dataset of 128 individuals who took both tests. We used the equipercentile and Item Response Theory (IRT) methods to equate the olfaction scales. Results: The equipercentile conversion suggested some bias between UPSIT and Sniffin' 16 tests across the two groups. The IRT method shows very good characteristics between the true and converted Sniffin' 16 (delta mean = 0.14, median = 0) based on UPSIT. The equipercentile conversion between the Sniffin' 12 and 16 item worked well (delta mean = 0.01, median = 0). The UPSIT to B-SIT conversion showed evidence of bias but amongst PD cases worked well (mean delta = −0.08, median = 0). Conclusion: We have demonstrated that one can convert UPSIT to B-SIT or Sniffin' 16, and Sniffin' 12 to 16 scores in a valid way. This can facilitate direct comparison between tests, aiding future collaborative analyses and evidence synthesis. © 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Impaired olfaction is an important non-motor feature of Parkinson's disease (PD). It is thought to be an early pre-clinical sign of PD [1] and can be used to help in the diagnosis of PD before the development of definite motor features [2,3]. Olfactory impairment may also be an early marker of other neurological diseases such as Alzheimer's disease [4], multiple sclerosis [5], idiopathic rapid eye movement sleep behaviour disorder [6], Huntington's disease [7], multiple system atrophy [8], progressive supranuclear palsy [9] and parkinsonism dementia complex seen in Guam [10]. Differences in olfactory dysfunction between neurological diseases may be helpful in the differential diagnosis [11] of parkinsonian disorders [12]. Detailed reviews of olfactory dysfunction in neurological disorders have been previously published [11,13].
Many research studies collect data on olfaction, and an important aspect of high quality research is the ability to replicate findings from studies or to undertake systematic reviews, with or without a meta-analysis, to synthesise evidence and examine heterogeneity. This is more difficult if olfaction has been measured using a different metric in different studies, leading to potentially artefactual differences. The ability to estimate scores on one test from scores on another test helps reduce this problem. Olfaction is often measured using smell identification tests such as Sniffin' Sticks [14] or the University of Pennsylvania Smell Identification Test (UPSIT) [15].
Both the Sniffin' [16] and UPSIT [17] tests have published normative data centiles stratified by age and gender, allowing us to determine the olfactory changes that are likely to be caused by disease in addition to those due to the natural aging process. This is particularly important in PD, which predominantly affects the older population. Whilst the published normative data for Sniffin' stratified age as 5–15; 16–35; 36–55; and >55, the UPSIT stratified using five year age bands up to 85 and above. The stratification method employed by UPSIT is arguably more sensible given that olfactory impairment rises dramatically between 65 and 80 years [18].
We aimed to create conversion tables from an UPSIT score to a standard Sniffin' 16 item odour identification score, between the Sniffin' 12 and 16 item odour identification versions and between the UPSIT and Brief Smell Identification test (B-SIT) using two large cohorts of individuals with PD to help researchers pool data in future collaborative studies. An additional useful by-product of our conversion is that we can convert the published age/gender stratified centiles for the UPSIT to equivalent Sniffin' scores.

Study populations
Data were available from two incidence cohorts of patients with PD. The Oxford Parkinson's Disease Centre Discovery cohort consists of individuals from 11 hospitals across the Thames Valley. Patients were recruited between study onset in September 2010 and May 2015. Full details of this study are described elsewhere [19]. Patients were eligible for study inclusion if they met the UK PD Brain Bank Criteria according to a neurologist with a special interest in PD. We included any individuals diagnosed within the last three and a half years who were given a probability of PD ≥ 90% as rated by a clinician based on their clinical opinion. This was to try to exclude similar conditions that had been incorrectly diagnosed as PD. All individuals in this study had their olfaction measured using the standard Sniffin' test.
Tracking Parkinson's is a large incidence cohort of patients with PD recruited from around the UK. Patients were recruited between February 2012 and May 2014 if they were diagnosed within the last 3.5 years and met Queen Square Brain Bank criteria. Full details of this study are described elsewhere [20]. Again we only included individuals who were given a probability of PD ≥ 90% as rated by a clinician. In this cohort, olfaction was initially measured using the UPSIT. However, during the course of the study a difficulty arose in obtaining the UPSIT kits and the study was forced to switch to using the Sniffin' test instead. This means we have two groups of individuals within the same cohort completing different tests.
We also have a third dataset of subjects, "Testing of olfaction in Parkinson's and controls" (TOPC), who undertook both tests (Sniffin' and UPSIT) concurrently so we could validate our conversion algorithms. This comprised 128 subjects (61 PD and 67 controls) who were recruited as a convenience sample from the regional, West of Scotland, Movement Disorder Clinic. The order in which individuals took the two tests was randomised, thus minimising any order effects, such as patients scoring worse on the second test due to fatigue. All three studies had ethical approval and were undertaken with the understanding and written consent of each subject and in compliance with the Declaration of Helsinki.

Olfaction tests
The UPSIT test has 40 items, where each item has one correct answer and three incorrect answers or "distractors". The test is a forced choice paradigm, that is, if an individual is unsure of an answer they are forced to guess a response; hence a score of 25% on average would reflect random guessing. An UPSIT result is scored out of 40 where a higher score indicates better olfaction. There is also a reduced 12 item version [21] of the UPSIT called the Brief-Smell Identification Test (B-SIT), previously called the Cross-Cultural Smell Identification Test (CC-SIT).
The standard Sniffin' test has 16 odour identification items, where each item has one correct answer and three incorrect answers or "distractors". Again the test is a forced choice paradigm. A Sniffin' result is scored out of 16 where a higher score indicates better olfaction. There is also a Sniffin' 12 item version [22] which is a subset of the 16 item version.
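As a rough illustration of the chance level implied by a four-alternative forced-choice format, the sketch below computes the expected score under pure guessing and the binomial probability of reaching a given score by guessing alone. The function names are hypothetical and ours; the only inputs taken from the text are the item counts (40 and 16) and the four response options per item. Independent guessing across items is an assumption.

```python
from math import comb


def chance_expected_score(n_items, n_options=4):
    """Expected number correct if every item is answered by random guessing."""
    return n_items / n_options


def prob_score_at_least(score, n_items, n_options=4):
    """Binomial probability of scoring `score` or more by guessing alone.

    Assumes each item is guessed independently with success chance 1/n_options.
    """
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(score, n_items + 1)
    )


if __name__ == "__main__":
    for name, n_items in [("UPSIT", 40), ("Sniffin' 16", 16)]:
        print(name, "chance-level score:", chance_expected_score(n_items))
```

Scores close to these chance levels carry little information about an individual's olfaction, which is worth keeping in mind when reading the low end of any conversion table.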

Statistical analysis
The first and simplest method of equating one scale to another is equipercentile equating with log-linear smoothing, which matches scores on the two tests using their percentile ranks after first smoothing the distributions. This method requires that the two groups are equivalent in olfaction, usually either through a design that creates randomly equivalent groups or by carrying out both tests on the same population. In our case it would mean assuming the groups taking the Sniffin' and UPSIT tests are equivalent with regards to olfaction.
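The core idea of equipercentile equating can be sketched in a few lines. The sketch below is a simplified, unsmoothed variant and not the procedure used in this study: it omits the log-linear presmoothing and uses mid-percentile ranks with a linearly interpolated empirical quantile, then rounds to whole scores. All function names are hypothetical.

```python
def percentile_rank(scores, score):
    """Mid-percentile rank (0 to 1): share below `score` plus half the share at it."""
    n = len(scores)
    below = sum(1 for s in scores if s < score) / n
    at = sum(1 for s in scores if s == score) / n
    return below + 0.5 * at


def empirical_quantile(sorted_scores, p):
    """Linearly interpolated quantile of an already sorted list, p in [0, 1]."""
    pos = p * (len(sorted_scores) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_scores) - 1)
    frac = pos - lo
    return sorted_scores[lo] * (1 - frac) + sorted_scores[hi] * frac


def equipercentile_table(x_scores, y_scores, x_max, y_max):
    """Map each integer score 0..x_max on test X to a score on test Y
    that has the same percentile rank in the Y sample."""
    y_sorted = sorted(y_scores)
    table = {}
    for score in range(x_max + 1):
        p = percentile_rank(x_scores, score)
        y_equiv = round(empirical_quantile(y_sorted, p))
        table[score] = max(0, min(y_max, y_equiv))  # keep within the Y scale
    return table
```

For example, `equipercentile_table(upsit_scores, sniffin_scores, 40, 16)` would return a 41-entry lookup from UPSIT to Sniffin' 16 scores, where both score lists are hypothetical inputs. The result is only meaningful if the two samples can be regarded as equivalent in olfaction, which is the assumption discussed above.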
Our second method used Item Response Theory (IRT), which models individuals' responses at the item level by fitting a latent variable model for each item. The power of the IRT approach is that we calibrated our model between groups with potentially different olfaction by using items that are common to both tests. We assumed that the two groups are linearly related by their olfaction and calculated a calibration slope and intercept between the two groups. After calibration we built the distribution of scores and then equated using equipercentile methods.
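To make the calibration step more concrete, the sketch below assumes a three-parameter logistic (3PL) item model with discrimination a, difficulty b and guessing parameter c, and shows how item parameters are carried from one latent scale to another by a linear transformation. The 3PL form, the absence of a scaling constant, the direction of the transformation and all names are assumptions made for illustration; the models and software actually used are described in the Web appendix.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability of a correct answer at latent olfaction level theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


def rescale_item(a, b, c, slope, intercept):
    """Express item parameters on a new scale where theta_new = slope * theta + intercept."""
    return a / slope, slope * b + intercept, c


def expected_score(theta, items):
    """Expected total score on a test: sum of item probabilities (test characteristic curve)."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)


if __name__ == "__main__":
    item = (1.2, -0.3, 0.25)      # hypothetical item parameters
    slope, intercept = 1.1, 0.2   # hypothetical calibration constants
    theta = 0.5
    theta_new = slope * theta + intercept
    # The response probability is unchanged when person and item move together.
    print(p_correct(theta, *item), p_correct(theta_new, *rescale_item(*item, slope, intercept)))
```

Because the response probability is invariant under this joint rescaling, items shared by two tests can anchor both sets of parameters to one scale, after which expected scores on each test can be compared at the same latent olfaction level.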
Both the equipercentile and IRT methods are described in detail by Kolen and Brennan [23] whilst the details of how we used the IRT method and the computing programs we used are discussed further in the Web appendix.
We used both methods to convert between the UPSIT and Sniffin' 16 item test. Since the Sniffin' 12 item test is a subset of the Sniffin' 16 item test and the B-SIT is a subset of the UPSIT, each pair of tests was completed by the same population. Hence we only used the equipercentile method for the UPSIT to B-SIT and Sniffin' 12 to 16 item conversions. We used our validation dataset to test how well the conversions performed by comparing the concordance correlation coefficient [24] (a measure of agreement between two continuous variables) between true and equivalent results as well as the characteristics of the difference (or delta) between the true and equivalent scores.
We also converted the centile position stratified by age and gender from the UPSIT normative data charts to an equivalent Sniffin' score to provide more detailed normative comparative data. We used at or below the 15th centile as a cut-point for determining whether an individual has impaired olfaction corrected for age and gender, as we have done in previous research [25]. There are some inconsistent and random fluctuations in the centiles (probably due to sample size issues), hence we used LOWESS techniques to smooth the cut-points before applying our conversion.

Table 1 compares the data we have from the Tracking Parkinson's cohort for the 980 individuals who took the UPSIT test and the 294 who took the Sniffin' test at the baseline visit. These two sub-groups of the Tracking Parkinson's cohort have a similar proportion of females, age when the testing took place, motor severity (measured by the Movement Disorder Society Unified PD Rating Scale or MDS-UPDRS part 3), disease severity (measured by Hoehn and Yahr stage) and cognitive impairment (measured by the education adjusted Montreal Cognitive Assessment or MoCA). However, the UPSIT sub-group had slightly longer disease duration. This is not surprising given that the UPSIT sub-group was recruited first in the study and would include both incident and some prevalent cases (up to 3.5 years), whereas the cases recruited later in the centres would consist mainly of incident cases since the prevalent pool of cases would already have been recruited.
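The two validation summaries can be sketched as follows. The concordance function uses the standard definition of Lin's concordance correlation coefficient with population (divide by n) moments, and the delta is taken as true minus equivalent, consistent with a positive mean delta indicating under-prediction. The function names are hypothetical, and the delta summary needs at least two paired scores.

```python
import statistics


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((xi - mx) ** 2 for xi in x) / n
    vy = sum((yi - my) ** 2 for yi in y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def delta_summary(true_scores, equivalent_scores):
    """Mean, median and standard deviation of delta = true - equivalent."""
    deltas = [t - e for t, e in zip(true_scores, equivalent_scores)]
    return statistics.mean(deltas), statistics.median(deltas), statistics.stdev(deltas)
```

Unlike the Pearson correlation, the concordance coefficient is reduced by a systematic shift between the two sets of scores, so it penalises a biased conversion even when the converted scores track the true scores perfectly.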

Demographic and clinical data for Tracking Parkinson's and Oxford Discovery cohorts
In the Oxford Discovery cohort we have 837 individuals who took the Sniffin' 16-item odour identification test at the baseline visit. When compared to the group who took the UPSIT test from the Tracking Parkinson's cohort they had slightly shorter disease duration, a similar proportion of females and similar age at testing. They also had worse motor severity, disease severity and more cognitive impairment. Comparing the Tracking Parkinson's Sniffin' subset and Oxford Discovery groups they show similar gender, age and cognitive impairment but Oxford Discovery has worse motor and disease severity and longer disease duration from diagnosis. Of paramount importance is that there is no evidence (p = 0.12) of a difference in Sniffin' scores between the Tracking Parkinson's subset and Oxford Discovery groups. We therefore pooled the Sniffin' data from the two cohorts for our UPSIT to Sniffin' 16 conversion. Web table 1 shows the demographic data from the TOPC validation study and Web Fig. 2 shows the distribution of UPSIT and Sniffin' 16 scores stratified by patient type. The correlation between the UPSIT and Sniffin' 16 scores was 0.81 in this sample. Table 2 shows the conversions from the UPSIT to a Sniffin' 16 equivalent using the two methods. In general, most UPSIT scores were grouped into 2-point values equivalent to 1 Sniffin' point but this could be as wide as 5 points for the (0–4) group using the IRT method. Table 3 presents the characteristics of these different conversions when tested on the TOPC validation data in which we compared an UPSIT predicted Sniffin' 16 to a true Sniffin' 16 score. The concordance correlation coefficient between the true and equivalent Sniffin' is very good and similar using both the equipercentile (0.79) and IRT methods (0.80). The difference between equipercentile predicted and true Sniffin' was acceptable although there was some evidence of under-prediction bias (positive mean delta).
The individual IRT parameter estimates (a, b, c) for the UPSIT data and the combined Sniffin' data can be found in Web Tables 2 and 3. When using the IRT method we found that the calibration slope was 1.093 and the calibration intercept was 0.180. This is equivalent to saying that the individuals taking the UPSIT test had marginally better olfaction and also a slightly larger spread of olfaction when compared to the Sniffin' group. However, a mean olfaction that is 0.180 higher is small considering the groups are scaled to a mean of 0 and s.d. of 1. The validation of the IRT method on the TOPC data resulted in a delta that has a mean very close to zero and a median of zero, showing that this conversion appears to have little evidence of bias. Web Fig. 3 shows graphically the degree of agreement between the true Sniffin' and the UPSIT equivalent Sniffin' using the two methods.

UPSIT to Sniffin' 16 conversion
Comparison of these calibration estimates to the conversions carried out using the equipercentile method showed some agreement. Assuming these calibration estimates are correct implies that olfaction was slightly different in the two populations and hence the assumptions for the equipercentile method do not hold. Considering these calibration estimates, individuals taking the UPSIT test seem to have slightly better olfaction when compared to those taking the Sniffin' test. In agreement with this, the bias observed with the equipercentile method is consistent with such a difference in olfaction. Table 4 shows the cut-points corresponding to the 15th centile of olfaction score stratified by age and gender from the UPSIT normative data. The table also shows the smoothed cut-points using LOWESS techniques and the equivalent Sniffin' score when applying our conversion chart from the IRT method in Table 2. This allows researchers to define a binary hyposmic group (Yes/No) based on poor olfaction (≤ 15th centile) for each gender and different age groups which can be used in analyses testing predictors of hyposmia.
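The smoothing of centile cut-points across age bands can be sketched with a simplified locally weighted linear regression. This is an assumption-laden illustration rather than the procedure used here: it uses tricube weights and a nearest-neighbour bandwidth but omits the robustness iterations of full LOWESS, the bandwidth fraction is arbitrary, and the input values are invented for the example.

```python
def lowess_smooth(x, y, frac=0.5):
    """Locally weighted linear fit at each x, with tricube weights.

    `frac` is the share of points used in each local fit (an assumed value).
    """
    n = len(x)
    k = max(2, int(round(frac * n)))
    smoothed = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


if __name__ == "__main__":
    age_midpoints = [17, 22, 27, 32, 37, 42, 47, 52, 57, 62]
    raw_cutpoints = [33, 34, 33, 34, 32, 33, 31, 30, 27, 28]  # invented values
    print([round(v) for v in lowess_smooth(age_midpoints, raw_cutpoints)])
```

Smoothing before conversion avoids carrying sampling noise in individual age bands through to the equivalent Sniffin' cut-points.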

Sniffin' 12 to Sniffin' 16 conversion
In the conversion from Sniffin' 12 to 16 we no longer need to assume that the groups are equivalent, because the same individuals provide both scores. This means that we can use data from each visit in the Discovery cohort rather than only using the baseline data. The numbers of individuals eligible for analysis were 837, 564, and 275 from visits 1, 2, and 3 respectively from the Discovery cohort along with the 294 from the Tracking Parkinson's cohort. The 1970 observations of combined Sniffin' 16 data have a mean of 7.0 and s.d. of 2.8 whilst the combined Sniffin' 12 data have a mean of 5.6 and s.d. of 2.4. Web table 4 shows the conversion scores from Sniffin' 12 to a Sniffin' 16 equivalent and Table 3 shows the validation of this conversion using the TOPC data. With these two tests being so similar it is not surprising that the concordance between true and equivalent Sniffin' 16 was very high (0.97), that the average delta between the two was so close to zero and that the standard deviation of the delta was also low at 0.96. Web Fig. 4 shows graphically the degree of agreement using the true Sniffin' 16 and the Sniffin' 12 equivalent Sniffin' 16. It could be argued that the percentiles used in the equipercentile method should not include an individual more than once; re-running this method using only the baseline data from the Discovery cohort gave an identical conversion.

UPSIT to B-SIT conversion
Web table 5 shows the conversion scores from UPSIT to B-SIT and Table 3 shows the validation of this conversion. The concordance coefficient is relatively high (0.82); however, when looking at the delta there is some evidence of over-prediction bias (negative average delta) in our conversion, mean = −0.63 and median = −1. However, if we stratify the delta by PD cases (mean delta = −0.08 and median = 0) and controls (mean delta = −1.13 and median = −1), there is only evidence of bias for the controls. Web Fig. 5 shows graphically the degree of agreement using the true B-SIT and the UPSIT equivalent B-SIT.

Discussion
We used two methods to equate scores on the UPSIT test to scores on the Sniffin' 16 smell identification test, scores on the Sniffin' 12 item to Sniffin' 16 item smell identification tests and also scores on the UPSIT and B-SIT tests.
It has been shown that the differences in olfaction between PD patients and controls are not related to any particular odour type [26]. This suggests that although our conversions have been created using only PD patients they could potentially be used for controls and/or other diseases where olfactory dysfunction is not related to particular odour types.
A previous paper reported that the correlation between the Sniffin' and UPSIT scores was 0.85 [14], which is similar to 0.81, the value we found in our TOPC data. Another reported that the test-retest correlation of the UPSIT was 0.9 [27] and was 0.86 in the Sniffin' [28]. These results are of a similar magnitude to our correlation between true and UPSIT equivalent Sniffin' 16 of 0.8. Both variability in test-retest performance and inadequate conversion may have contributed to the differences between the true and converted scores, though our results are consistent with the test-retest correlations.
There were a number of limitations to our work. The validation dataset we used was small and does not cover the entire range of scores for the two olfaction tests. Also, if we had designed our two incidence cohorts with these conversions in mind it would have been better to randomise patients to receive either the UPSIT or the Sniffin' test. There are also clear differences between the Tracking and Discovery groups, especially in cognition which is related to olfaction, which could be the reason why the equipercentile method on the UPSIT to Sniffin' conversion showed some evidence of bias and made it necessary to use the IRT method. Another consideration is that the UPSIT normative data was derived using a US version. The cohorts that we studied used a newer UK version adapted due to cultural differences, as some smells in the US version were unfamiliar in the UK population. Despite this, the UK and US versions are still very similar, sharing 33 items with some changes to distractors.
Our UPSIT to B-SIT conversion had high concordance but some evidence of bias. However, this disappeared when only considering the PD cases from the TOPC data. None of our other conversions showed evidence of difference in the delta when stratified by PD or control. This discrepancy could be because (a) this conversion is not valid; (b) the conversion is valid and the differential observation between PD cases and controls was a chance finding; or (c) our conversion is only valid for PD patients, contradicting our belief that differences in olfaction between PD patients and controls are not related to any particular odour type.
The choice of what olfaction test to use in a study will be determined by several factors: (i) time available and burden on participants; (ii) cost of administering tests; and (iii) sample size. Another issue to consider is that shorter tests may be less sensitive (e.g. 40-item UPSIT versus 16-item Sniffin'), thereby reducing the ability to differentiate between groups. However, statistical power is also related to sample size, and measuring the UPSIT on a large sample would take considerably more time than a quicker test like the B-SIT. In some circumstances one may be happy to trade off sensitivity against increased sample size. Longer tests are also less likely to be affected by random measurement error and will therefore have greater reliability. The association between reliability and test length is most famously highlighted by the Spearman-Brown prediction formula [29] and has been modelled before in olfaction [27]. In olfactory tests this is emphasised by the fact that the test-retest correlation was 0.9 in the UPSIT and 0.71 in the B-SIT [27].
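The Spearman-Brown relationship between test length and reliability is simple enough to sketch directly. The example below treats the reported UPSIT test-retest correlation of 0.9 as the reliability of the 40-item test and assumes the items are parallel, both of which are simplifying assumptions; the function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    # Shortening a 40-item test with reliability 0.9 to 12 items.
    print(round(spearman_brown(0.9, 12 / 40), 2))  # prints 0.73
```

Under these assumptions the predicted reliability of a 12-item version is about 0.73, which is of a similar size to the test-retest correlation of 0.71 reported for the B-SIT.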
We created a valid and reliable conversion of UPSIT scores to Sniffin' scores and from Sniffin' 12 item to 16 item. Also we have arguably created a valid and reliable conversion from UPSIT to B-SIT scores for PD patients. These conversions will be used to merge olfaction data from the Oxford Discovery and Tracking Parkinson's cohorts to investigate the influence of baseline olfaction and hyposmia in predicting future cognitive and motor decline in these longitudinal cohorts of early PD. We believe that these conversion charts will facilitate more replication of research findings and greater data sharing across many neurological diseases and studies that measure olfaction using these tests.

Table 4. Cut-points at the 15th centile of UPSIT score by age band, with LOWESS-smoothed cut-points and Sniffin' 16 equivalents, shown as two groups of three columns (the gender labels of the two column groups and the rows from 65–69 onwards are incomplete in this copy).

Age     UPSIT  Smoothed  Sniffin' 16    UPSIT  Smoothed  Sniffin' 16
15–19   33     33        14             35     35        14
20–24   33     33        14             35     34        14
25–29   34     33        14             34     34        14
30–34   33     32        13             34     34        14
35–39   33     32        13             34     33        14
40–44   32     31        13             34     33        14
45–49   33     30        12             34     32        13
50–54   29     29        12             32     31        13
55–59   26     27        11             32     30        12
60–64   28     24        10             31     27        11
65–69   22

H Morris reports grants from Parkinson's UK and grants from Medical Research Council UK during the conduct of the study; grants from Welsh Assembly Government, personal fees from Teva, personal fees from Abbvie, personal fees from UCB, personal fees from Boehringer-Ingelheim, personal fees from GSK, non-financial support from Teva, grants from Ipsen Fund, non-financial support from Medtronic, grants from MNDA, grants from PSP Association, grants from CBD Solutions, grants from Drake Foundation and personal fees from Acorda, outside the submitted work. In addition, H Morris is a co-applicant on a patent application related to C9ORF72 - Method for diagnosing a neurodegenerative disease (PCT/GB2012/052140), pending. DG Grosset has received payment for advisory board attendance from AbbVie, and honoraria from UCB Pharma, GE Healthcare, and Acorda.