A survey on different IRT model evaluation softwares for test equating and MCQ assessment using IRT (1PL–4 PL)


  • Rishi Kumar Loganathan, Department of Human Development, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, 35900, Malaysia
  • Zahari Suppian, Department of Human Development, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, 35900, Malaysia
  • Siti Eshah Binti Mokshein, Department of Human Development, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, 35900, Malaysia


1PL-4PL, IRT, jMetrik, test equating error, XCalibre


Item Response Theory (IRT) is a measurement framework used in psychological and educational assessment design and evaluation, covering rating scales, instruments, achievement tests, and other measures of mental traits. It has become increasingly popular in academic research for evaluating both cognitive and non-cognitive measurement. In higher-education assessment, multiple-choice questions are widely used because they are simple to administer, easy to score, and provide good coverage of the instructional content. However, statistical evaluation is necessary to ensure that only high-quality items are used as a basis for inference. Hence, this paper reviews the IRT models (1PL–4PL), their assumptions, test equating, and various related models. The software commonly used for IRT model evaluation, such as XCalibre and jMetrik, is also reviewed and presented. Further, multiple-choice question (MCQ) assessment using IRT is analysed with a focus on item parameters, and a comparative study is performed. The study concludes that IRT is a strong framework for evaluating educational and psychological data, particularly for identifying the high-quality items that serve as anchor items in the test equating approach.
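For readers unfamiliar with the 1PL–4PL family discussed above, the sketch below shows the standard four-parameter logistic (4PL) item response function, of which the 1PL, 2PL, and 3PL models are special cases. This is a minimal illustration of the well-known formula, not code from any of the software packages reviewed; the function name and parameter defaults are the author's own choices.

```python
import math

def irt_probability(theta, a=1.0, b=0.0, c=0.0, d=1.0):
    """Probability of a correct response under the 4PL model.

    theta : examinee ability
    a     : item discrimination
    b     : item difficulty
    c     : lower asymptote (pseudo-guessing)
    d     : upper asymptote (carelessness/slipping)

    Special cases: d=1 gives the 3PL model; additionally c=0
    gives the 2PL; additionally a=1 gives the 1PL (Rasch) model.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))
```

For example, with the 1PL defaults an examinee whose ability equals the item difficulty (theta = b = 0) has a response probability of exactly 0.5, and the probability rises monotonically with ability.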






How to Cite

Rishi Kumar Loganathan, Zahari Suppian, & Siti Eshah Binti Mokshein. (2023). A survey on different IRT model evaluation softwares for test equating and MCQ assessment using IRT (1PL–4 PL). Journal of Current Science and Technology, 12(3), 582–891. Retrieved from https://ph04.tci-thaijo.org/index.php/JCST/article/view/304