A survey of IRT model evaluation software for test equating and MCQ assessment using IRT (1PL–4PL)
Keywords: 1PL–4PL, IRT, jMetrik, test equating error, XCalibre
Item Response Theory (IRT) is a measurement framework used in the design and evaluation of psychological and educational assessments based on rating scales, instruments, achievement tests, and other tools that measure mental traits. It is increasingly popular in academic research for evaluating cognitive and non-cognitive measures. In higher education assessment, multiple-choice questions (MCQs) are widely used because they are simple, easily scored, and offer good coverage of the instructional content. However, statistical evaluation is necessary to ensure that only high-quality items are used as a basis for inference. Hence, this paper reviews the IRT models (1PL–4PL), their assumptions, test equating, and various other models. Software used for IRT model evaluation, such as XCalibre and jMetrik, is also reviewed. Further, MCQ assessment using IRT is analysed with a focus on the item parameters, and a comparative study is presented. The study concludes that IRT is a well-suited framework, used by many researchers to evaluate educational and psychological data, for identifying high-quality items and selecting the anchor items used in test equating.
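For readers unfamiliar with the 1PL–4PL family, the models can be written as a single logistic item response function. The sketch below is illustrative only and is not taken from the paper. The function name `irt_4pl` and the parameter names `a` (discrimination), `b` (difficulty), `c` (lower asymptote, or guessing) and `d` (upper asymptote) are assumptions that follow common textbook notation.

```python
import math


def irt_4pl(theta, a=1.0, b=0.0, c=0.0, d=1.0):
    """Probability of a correct response under the 4PL model.

    theta: examinee ability
    a: item discrimination, b: item difficulty
    c: lower asymptote (guessing), d: upper asymptote
    Restricted cases (hypothetical defaults chosen for illustration):
      3PL -> d = 1
      2PL -> c = 0, d = 1
      1PL -> c = 0, d = 1, with a held equal across items
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


# When ability equals difficulty, the probability is midway between c and d.
p_2pl = irt_4pl(theta=0.0, a=1.2, b=0.0)                # (0 + 1) / 2
p_3pl = irt_4pl(theta=0.0, a=1.2, b=0.0, c=0.2)         # (0.2 + 1) / 2
p_4pl = irt_4pl(theta=0.0, a=1.2, b=0.0, c=0.2, d=0.9)  # (0.2 + 0.9) / 2
```

Writing the simpler models as restrictions of the 4PL makes it easy to see which item parameters each model estimates, which is the focus of the MCQ analysis described above.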
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.