
Data Mining in Education: It should be there

“Data Mining Icon” by mcmurryjulie, licensed under CC0 Creative Commons

Guest Blog Post by Crystal Burns

As society spends more time using the vast array of technological devices and tools available, more information is generated digitally about us. Data mining is the field of discovering novel and potentially useful information from large amounts of data. It has been applied in a significant number of areas, including retail sales, bioinformatics, and counter-terrorism. More recently, there has been increasing interest in the use of data mining to investigate scientific questions within educational research, known as educational data mining. In a data-driven world and with the high demand for results and improvements in education, where does educational data mining fit in?

What is Data Mining?

Data mining is an interdisciplinary subfield of computer science, defined as the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. It is also known as Knowledge Discovery in Data (KDD). Some of the key elements of data mining are automatic discovery of patterns, prediction of likely outcomes, creation of actionable information, and a focus on large datasets and databases. Specifically for education, Educational Data Mining (EDM) is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings and using those methods to better understand students and the environments in which they learn. (Slater, Joksimović, Kovanovic, Baker, and Gasevic, pp. 85-86) “EDM researchers are addressing questions of cognition, metacognition, motivation, affect, language, social discourse, etc. using data from intelligent tutoring systems, massive open online courses, educational games and simulations, and discussion forums.” (Koedinger, D'Mello, McLaughlin, Pardos, and Rosé, p. 333)


Data repositories, data sets that have been created to serve as a base for research, make it increasingly easy to access data rapidly and begin researching. “Researchers who use data from these repositories can dispense with traditionally time-consuming steps such as subject recruitment, scheduling of studies, and data entry.” (Koedinger, D'Mello, McLaughlin, Pardos, and Rosé, p. 337) Once a construct of educational interest (such as off-task behavior, or whether or not a skill is known) has been defined in data, it can be transferred to new data sets. Transfer learning and rapid labeling methods have been successful in speeding up the process of developing or validating a model for a new context. (Slater, Joksimović, Kovanovic, Baker, and Gasevic, p. 102) It has been difficult to study how the many differences between teachers and classroom cohorts influence specific aspects of the learning experience; this sort of analysis becomes much easier with educational data mining. Similarly, the concrete impacts of fairly rare individual differences have been difficult to study statistically with traditional methods, which has led case studies to be a dominant research method in this area. Educational data mining has the potential to extend a much wider toolset to the analysis of important questions in individual differences. (Koedinger, D'Mello, McLaughlin, Pardos, and Rosé, p. 339)

    “Office” by FirmBee, licensed under CC0 Creative Commons
    Concerns about personal privacy have grown enormously in recent years, especially as the internet booms with social networks, e-commerce, forums, blogs, YouTube, and other online activities and sites. (Berendt, p. 698) People are apprehensive that their personal information will be collected and used unethically. Businesses gather information about their customers in many ways to understand their purchasing behavior and trends. However, companies don’t last forever, and some may be acquired by others or closed. The personal information that a particular business owns can be (and most likely will be) sold to the new owners or leaked. The same concerns apply in the educational realm, where young learners are involved. Security is a big issue. In addition to concerns over what companies may do with the data they collect, many parents are fearful of what may happen if that information falls into the wrong hands. The news is full of incidents of data breaches in which individuals' financial and other personal data were illegally accessed. Existing legislation does put restrictions on the collection and storage of personally identifiable information (PII) of minors. (Berendt, p. 701) However, the rapid increase in the quantity of data collected and the sophistication of data mining procedures increase the likelihood that data could be combined to identify individuals. Information obtained through data mining for ethical purposes can still be misused. It can be exploited by unethical people or businesses to gain an advantage, damage reputations, discriminate against learners, or take advantage of vulnerable learners.
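To make the point about combining data concrete, here is a minimal sketch in Python of how a "de-identified" learning log might be linked back to names using a separate roster. Everything in it is an assumption made for illustration: the field names, the choice of ZIP code, birth year, and gender as linking fields, and the toy records are all hypothetical and do not come from the sources cited above.

```python
from collections import Counter

# Hypothetical linking fields; none of these is a name, but together they can single out a person.
QUASI_IDENTIFIERS = ("zip_code", "birth_year", "gender")


def key(record):
    """Build the combination of linking fields for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(deidentified, roster):
    """Map record ids to names when a record's field combination matches exactly one roster entry."""
    counts = Counter(key(person) for person in roster)
    unique_names = {key(p): p["name"] for p in roster if counts[key(p)] == 1}
    return {
        rec["record_id"]: unique_names[key(rec)]
        for rec in deidentified
        if key(rec) in unique_names
    }


# Toy roster (for example, a class list that includes names). Entirely made up.
roster = [
    {"name": "Student A", "zip_code": "30301", "birth_year": 2010, "gender": "F"},
    {"name": "Student B", "zip_code": "30301", "birth_year": 2010, "gender": "M"},
    {"name": "Student C", "zip_code": "30302", "birth_year": 2011, "gender": "M"},
    {"name": "Student D", "zip_code": "30302", "birth_year": 2011, "gender": "M"},
]

# Toy learning log with names removed. Entirely made up.
learning_log = [
    {"record_id": "r1", "zip_code": "30301", "birth_year": 2010, "gender": "F", "score": 0.42},
    {"record_id": "r2", "zip_code": "30302", "birth_year": 2011, "gender": "M", "score": 0.91},
    {"record_id": "r3", "zip_code": "30301", "birth_year": 2010, "gender": "M", "score": 0.67},
]

matches = reidentify(learning_log, roster)
print(matches)
```

In this toy data, two of the three log records match a single roster entry, while the third stays ambiguous because two students share the same combination of fields. The more fields a data set carries, the fewer people share any one combination.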

    Educational Usage
    There are a variety of applications of educational data mining, and more arise as new programs are created for this purpose and as the challenges our learners face change. Educational data mining can be used to predict student responses to intelligent tutor tasks, to discover cognitive models in data that can be used to improve instruction, and to create models of student affect and focus during discussion in a dialog-based tutoring system. Discussion data can also be used to produce automated agents that support student learning as students collaborate in a chat room or on a discussion board. (Koedinger, D'Mello, McLaughlin, Pardos, and Rosé, p. 333) Educational data mining can be used to improve student models that provide detailed information about a student’s characteristics, such as knowledge, motivation, metacognition, and attitudes. Modeling the individual differences between students enables software to respond to those differences. In particular, educational data mining methods have enabled researchers to make higher-level inferences about students’ behavior, such as when a student is gaming the system, when a student has “slipped” (making an error despite knowing a skill), and when a student is engaging in self-explanation. (Koedinger, D'Mello, McLaughlin, Pardos, and Rosé, p. 345) These sophisticated student models have been used in two ways. First, these models have increased our ability to predict student knowledge and future performance by incorporating models of guessing and slipping into predictions of student future performance. Second, these models have enabled researchers to study what factors lead students to make specific choices in a learning setting, a type of scientific discovery with models. (Romero and Ventura, p. 14) Educational software provides a variety of types of pedagogical support to students. Discovering which pedagogical support is most effective has been a key area of interest for educational data miners.
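As a rough illustration of how guessing and slipping can enter a prediction of student performance, the sketch below follows the general shape of Bayesian Knowledge Tracing, one widely used student model with guess and slip parameters. The sources quoted above do not name a specific model, so the choice of this one is an assumption, and the parameter values and function names are invented for the example.

```python
def predict_correct(p_known, p_guess=0.2, p_slip=0.1):
    """Chance of a correct answer: knows the skill and does not slip, or does not know it and guesses."""
    return p_known * (1 - p_slip) + (1 - p_known) * p_guess


def bkt_update(p_known, correct, p_guess=0.2, p_slip=0.1, p_learn=0.15):
    """Update the estimate that a student knows a skill after one observed answer."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # The student may also learn the skill during this practice opportunity.
    return posterior + (1 - posterior) * p_learn


def trace(responses, p_init=0.3, **params):
    """Run the update over a sequence of True/False answers and return each estimate."""
    p_known = p_init
    history = []
    for correct in responses:
        p_known = bkt_update(p_known, correct, **params)
        history.append(p_known)
    return history


# Hypothetical answer sequence: wrong, then three correct answers in a row.
estimates = trace([False, True, True, True])
for step, p in enumerate(estimates, start=1):
    print(f"after answer {step}: P(known) = {p:.3f}, P(next correct) = {predict_correct(p):.3f}")
```

A wrong answer does not drive the estimate to zero, because the student may have slipped, and a correct answer does not drive it to one, because the student may have guessed.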
Learning decomposition, a type of relationship mining, fits exponential learning curves to performance data, relating student success to the amount of each type of pedagogical support a student has received (with a weight for each type of support). The weights indicate how effective each type of pedagogical support is for improving learning. (Koedinger, D'Mello, McLaughlin, Pardos, and Rosé, p. 347)
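The idea of fitting a weighted exponential learning curve can be sketched in a few lines of Python. This is a simplified illustration and not the procedure used in the cited work: it assumes only two hypothetical support types (A and B), assumes the curve error = A * exp(-b * (n_a + w * n_b)), and fits it by ordinary least squares on the logarithm of the error rate instead of a full nonlinear regression. The data are synthetic and noise-free.

```python
import math


def solve_linear(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_learning_decomposition(observations):
    """Fit error = A * exp(-b * (n_a + w * n_b)) and return (A, b, w).

    observations: list of (n_a, n_b, error_rate) with error_rate > 0, where n_a and
    n_b count prior practice with support type A and type B.
    """
    # Taking logs gives a linear model: log(error) = log(A) - b * n_a - (b * w) * n_b.
    rows = [(1.0, -n_a, -n_b) for n_a, n_b, _ in observations]
    ys = [math.log(err) for _, _, err in observations]
    # Normal equations (X^T X) beta = X^T y with beta = (log A, b, b * w).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    log_a, b, bw = solve_linear(xtx, xty)
    return math.exp(log_a), b, bw / b


# Synthetic data generated with A = 0.6, b = 0.2, w = 2.0 (made-up values).
observations = [
    (n_a, n_b, 0.6 * math.exp(-0.2 * (n_a + 2.0 * n_b)))
    for n_a in range(4)
    for n_b in range(4)
]

a_hat, b_hat, w_hat = fit_learning_decomposition(observations)
print(f"A = {a_hat:.3f}, b = {b_hat:.3f}, w = {w_hat:.3f}")
```

Here the weight w is read relative to support type A: a value above 1 would suggest that one practice opportunity with type B reduces errors as much as w opportunities with type A. With real, noisy data the log transform changes how errors are weighted, so a nonlinear fit would usually be preferred.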

    Data Mining Tools
      “Data Mining Icon” by mcmurryjulie, licensed under CC0 Creative Commons
      The following is a list of data mining programs with a brief description of how each operates, drawn from Slater, Joksimović, Kovanovic, Baker, and Gasevic (2017). It is not a comprehensive list of the programs available, as this field is a “rapidly changing area, and new tools are emerging constantly” (p. 102).

      • EDM Workbench: Available for free download at* alls/downloads-2, this is a tool for automated feature distillation and data labeling. Much of the automated feature distillation functionality of EDM Workbench is aimed at shortcomings of Excel and Sheets for specific tasks of relevance to data scientists, such as the generation of complex sequential features, data sampling, labeling, and the aggregation of data into subsets of student-tutor transactions based on user-defined criteria (referred to as “clips”). The EDM Workbench enables researchers to create features through XML-based authoring and also has built-in functionality to distill a set of 26 features used in existing literature and intelligent tutoring systems. (p. 89)
      • RapidMiner: A package for conducting data mining analyses and creating models. It has limited functionality for engineering new features out of existing features (such as the creation of multiplicative interactions) and for feature selection (e.g., based on intercorrelation). However, RapidMiner has an extremely extensive set of classification and regression algorithms as well as algorithms for clustering, association rule mining, and other applications. Other algorithms can often be composed out of the operators contained in RapidMiner. (p. 90)
      • Orange: A data visualization and analysis package. While it has considerably fewer algorithms and tools than RapidMiner, WEKA, or KNIME, it has a cleaner and easier-to-understand interface, with color-coded widgets differentiating between data input and cleaning, visualization, regression, and clustering. (p. 92)
      • Tableau: A family of products for interactive data analysis and visualization. Although the primary focus of the Tableau toolset is support for business intelligence, it has been commonly applied in educational settings to analyze student data, provide actionable insights, enhance teaching practices, and streamline educational reporting. (p. 93)
      • Apache Spark: A framework for large-scale processing of data across multiple computer processors in a distributed fashion. Spark can connect with several programming languages, including Java, Python, and SQL, through an API, allowing these languages to be used for distributed processing. Spark’s MLlib machine learning framework provides implementations of several standard machine learning and data mining algorithms. (p. 93)
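The RapidMiner entry above mentions two feature-related steps: creating multiplicative interactions and selecting features based on intercorrelation. The following sketch shows what those two steps can look like in plain Python. It does not use RapidMiner or any of the other tools listed; the function names, the 0.9 threshold, and the sample columns are hypothetical, and the correlation helper assumes no column is constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def add_interaction(features, first, second):
    """Return a copy of the feature table with a new column: first times second."""
    out = dict(features)
    out[f"{first}*{second}"] = [x * y for x, y in zip(features[first], features[second])]
    return out


def drop_intercorrelated(features, threshold=0.9):
    """Keep a column only if it is not strongly correlated with a column already kept."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up per-student features; minutes_on_task duplicates seconds_on_task in other units.
features = {
    "hints_requested": [0, 2, 1, 4, 3],
    "seconds_on_task": [30, 45, 60, 90, 120],
    "minutes_on_task": [0.5, 0.75, 1.0, 1.5, 2.0],
}

selected = drop_intercorrelated(features)
with_interaction = add_interaction(selected, "hints_requested", "seconds_on_task")
print(list(selected))
print(with_interaction["hints_requested*seconds_on_task"])
```

The redundant minutes column is dropped because it carries the same information as the seconds column, and the interaction column lets a later model treat "many hints over a long time" differently from either signal alone.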


      Educational data mining is concerned with developing methods for exploring the unique types of data that come from educational environments. “Its goal is to better understand how students learn and identify the settings in which they learn to improve educational outcomes and to gain insights into and explain educational phenomena.” (Romero and Ventura, p. 12) Data mining is a powerful tool that can help you find patterns and relationships within your data, though it does not eliminate the need to know your learners, to understand your data, or to understand analytical methods. It discovers hidden information in your data, but it cannot state the value of that information. Data mining can confirm or qualify empirical observations, in addition to finding new patterns that may not be discernible through simple observation. It is important to remember that the predictive relationships discovered through data mining are not necessarily causes of an action or behavior. The programs do not automatically discover solutions without guidance. The patterns you find will be very different depending on how you formulate the problem. To obtain meaningful results, you must learn how to ask the right questions, and you must understand the data that was used to build the model in order to properly interpret the results when the model is applied.

      What do you think about data mining for educational purposes?




