• YANNIK FLEISCHER, Universität Paderborn
• ROLF BIEHLER, Universität Paderborn
• CARSTEN SCHULTE, Universität Paderborn




Statistics education research, Machine Learning, Decision Trees, Jupyter Notebook


This study examines modelling with machine learning. In the context of a year-long data science course, it explores how upper secondary students apply machine learning in Jupyter Notebooks and document the modelling process as a computational essay that covers the steps of the CRISP-DM cycle. The students’ work builds on a teaching module about decision trees in machine learning and on a worked example of such a modelling process. The study outlines the students’ performance in carrying out the technical steps of machine learning and in reasoning about bias in the data, the different data preparation steps, the application context, and the resulting decision model. The context of the study and its theoretical background are also presented.

