Abstract: This article presents an overview of the development and use of analytics in the context of education. Using Buckingham Shum's three levels of analytics, the authors present a critical analysis of current developments in the domain of learning analytics, and contrast the potential value of analytics research and development with real world educational implementation and practice. The article also focuses on the development of education content analytics and considers the legal and ethical implications of collecting and analysing educational data. Looking to the future, the authors also highlight new developments including exploration of data from massive open online courses (MOOCs).
Keywords: Analytics, learning analytics, educational data mining, big data, paradata, educational data, learning technology
Introduction and Overview
Over the past five years a number of factors, not least the changing economic climate and its impact on the educational sector, have led to increased awareness of the need for greater research, analysis and understanding of all aspects of data relating to education. Researchers in the emerging fields of educational data mining (EDM) and learning analytics have been active in exploring the potential of data created through the processes of teaching and learning. In parallel with these developments, interest has also increased in the potential of "big data" and the application of business intelligence approaches to the educational sector. For example, can the recommendation systems commonly used by companies such as Amazon and Netflix be employed in educational contexts? The emergence of MOOCs (massive open online courses) can be seen as a key area where big data approaches can be applied to learning data sets due to the massive scale of participation. In wider society the 'quantified self movement' (Quantified Self, 2012), which involves using wearable sensors to self-monitor and gather data on different aspects of an individual's life, physical state and performance, is giving rise to increased access to and re-use of personal data from multiple sources, including social network sites, geo-location services and data collected from mobile devices.
This article will provide an overview of analytics within the domain of education, highlighting potential areas of development and the challenges faced by the education sector. Given the plethora of data available and the growing number of technologies and terminologies being used, it can be difficult for practitioners to gain a comprehensive overview of the education sector data landscape. The article draws heavily on the Analytics Series (Cetis Analytics Series, 2012), produced by Cetis, the UK Centre for Education Technology, Interoperability and Standards, which provides a broad perspective of the role and potential of analytics within higher education, together with an overview of current practice across the sector.
Analytics in the Context of Education
As MacNeill (2012) highlights, the use of analytics is far from new to education; the collection, use and sharing of data is well established in the sector. Despite generating increasing volumes of data, until recently few institutions have been able to exploit the wealth of information they routinely collect through the core business of teaching and learning. This position is starting to change, partly as a result of growing interest in big data and business intelligence, and partly due to the development of increasingly accessible analytics tools and applications. However, Cooper (2012) has cautioned that this increased interest in analytics comes at a cost:
"The problem with defining any buzz-word, including "analytics" is that over-use and band-wagon jumping reduces the specificity of the word. This problem is not new and any attempt to create a detailed definition seems to be doomed, no matter how careful one might be, because there will always be someone with a different perspective or a personal or commercial motivation to emphasize a particular aspect or nuance. The rather rambling Wikipedia entry for analytics illustrates this difficulty" (p. 3)
Cooper (2012) goes on to propose a working definition of "analytics", which we will adopt within the context of this article:
"Analytics is the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data" (p. 3)
When discussing analytics within the domain of education it is also useful to consider how these techniques are being applied within the context of the institution. Buckingham Shum (2012) has introduced the concept of three levels of learning analytics:
- Macro level analytics enable data sharing across institutions for a range of purposes including benchmarking.
- Meso level analytics work at the level of individual institutions, and include analytics based on business intelligence approaches.
- Micro level analytics support the tracking and interpretation of process-level data for individual learners.
Long and Siemens (2011) make a similar distinction between academic analytics and learning analytics. Academic analytics equates to Buckingham Shum's macro and meso level analytics, and learning analytics to micro level analytics. In Long and Siemens' view, learning analytics focuses explicitly on the learning process.
Analytics can also be applied to educational content, and can be used to track and record how and in what context resources are used. Such information, which may be referred to as "paradata", has the potential to provide a useful supplement to educational metadata. It may help to address the problem of effectively describing educational context, a problem that formal educational metadata standards have struggled to address.
In terms of analytics relating specifically to teaching and learning, two main areas of research are emerging: educational data mining and learning analytics. The two are complementary, but have a different emphasis. Bienkowski, Feng and Means' (2012) report, Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics, for the US Department of Education defines data mining and learning analytics as follows:
"Educational data mining (EDM) develops methods and applies techniques from statistics, machine learning, and data mining to analyze data collected during teaching and learning. EDM tests learning theories and informs educational practice.
Learning analytics applies techniques from information science, sociology, psychology, statistics, machine learning and data mining to analyze data collected during education administration and services, teaching and learning. Learning analytics creates applications that directly influence educational practice" (p. 9)
Learning analytics may also incorporate data from formal and informal learning environments, and Ferguson and Buckingham Shum (2012) have introduced the concept of social learning analytics to provide mechanisms for identifying patterns and behaviours at both individual and group level.
As these research fields are developing, commercial vendors are introducing new analytics tools to their educational systems and applications, e.g. Blackboard Learn (Blackboard Learn, n.d.). These tools promise to improve the performance and engagement of both staff and students, and to provide measurable insights into the educational process. However, these applications are still in their infancy, and further research is required in order to quantify the impact of the dashboard views of data that these systems provide. In addition, it is debatable whether these tools are currently capable of engaging students and enhancing their learning experience, as they are often based on data that can be collected easily rather than on any substantiated pedagogical theory.
Learning analytics is an emerging field and, to date, there have been relatively few large scale implementations of the potential approaches that comprise much of the body of the research literature. There are a few notable exceptions, including the Course Signals project (Course Signals, 2013), initially developed at Purdue University, which has amassed almost a decade's worth of evidence, and from which a commercial product has been created. Such examples are the exception rather than the rule, and considerable work is still required to translate current research into the wide scale adoption and application of learning analytics approaches to teaching and learning.
Macro Level Analytics
Education authorities and funding bodies are increasingly aware of the need to be able to identify nuanced sector level patterns, e.g. overall attrition levels and geographical and socio-economic trends, to help determine priorities for spending and development.
At the institutional level, senior managers are increasingly aware of the need to integrate analytic approaches with current business intelligence methodologies. This would enable them to gain actionable insights that will allow them to make effective decisions in terms of operating within new economic models, while addressing strategic priorities such as student retention and achievement.
Meso Level Analytics
A number of issues need to be considered when attempting to align business intelligence solutions. For example, it is not uncommon for problems to arise when attempting to share data between centrally managed administrative systems, such as student record systems, and teaching and learning applications, such as virtual learning environments, due to lack of data interoperability. Consequently, institutional approaches to data management, sharing, re-use and data protection need to be considered at the strategic level. Serious consideration needs to be given to apparently simple questions such as: Which systems hold the most useful and valuable data? Which formats are the data available in? Who has access to the data and how can it be used to develop actionable insights? In addition to considering technical issues such as data formats, it is equally important to develop the cultural and human capacity of the institution to enable it to make effective use of this data. Policies are required to govern the ethical use of data, and opportunities need to be provided for staff and students to develop their knowledge and understanding of data and analytics.
Micro Level Analytics
Powell and MacNeill (2012) have identified a number of drivers for the application of learning analytics. These include:
- Individual learners using analytics to reflect on their achievements and patterns of behavior in relation to their peers.
- Identification of students who may require extra support and attention.
- Helping teachers and support staff to plan supporting interventions with individuals and groups.
- Enabling functional groups, such as course teams, to improve current courses or develop new curriculum offerings.
- Providing information to help institutional administrators to take decisions on matters such as marketing and recruitment or efficiency and effectiveness measures.
As previously highlighted, the use of analytics within the education sector is still in its infancy. Different institutional stakeholders may have very different motivations for employing analytics and, in some instances, the needs of one group of stakeholders, e.g. individual learners, may be in conflict with the needs of other stakeholders, e.g. managers and administrators. Educational institutions have a duty of care towards students and staff and must be aware that certain data can only be used sensitively, appropriately and with consent (MacNeill & Ellis, 2013). Koulocheri and Xenos (2013) have demonstrated that using social network analysis visualization techniques within learning environments could have a positive impact on learning. However, they have also highlighted ethical issues, including student consent, that must be considered when sharing individual assessment data.
Despite these reservations, there are a growing number of examples of how analytics can be used to have a positive impact on teaching, learning and the wider student experience. The University of Huddersfield in the UK is developing innovative approaches to team assessment and learning design by sharing data from their e-assessment system with students to help them improve their grades and overall achievement (MacNeill & Ellis, 2013). E-submission and e-marking tools have made this possible by collecting and providing access to more detailed assessment data than has previously been available. In addition, staff are developing pre- and post-assessment workshops where assessment rubrics are shared with students alongside real (anonymized) data, to highlight the impact of common mistakes on overall grades. Although still in the early stages of development, course teams are already finding that this method of providing feedback to students is improving their results. This approach has also enabled course teams to develop new data-driven design methodologies; collaborative approaches are emerging for developing assessment criteria and feedback, and staff have greater understanding of what data is of greatest use in different contexts.
As part of their on-going aim of improving the student experience, the University of Derby in the UK is working on integrating a range of internal data sources and systems, to provide greater insight into how students are engaging with the university both pre- and post-enrolment. In addition to helping to enhance the student experience, these "engagement analytics" are providing increased opportunities for cross-institutional development (MacNeill & Mutton, 2013). By focusing on the identification of key institutional systems that provide the touch points for developing students' digital footprints, e.g. library systems, learning environments and student record systems, and by working through the holistic theme of student engagement, the team leading the work has been able to bring together key stakeholders (staff and students) and focus discussions around the collection and use of data.
Currently analytics developments are focused primarily at the macro and meso level. Although the use of analytics may be increasing at the institutional level, there are still very few compelling examples of analytics being used at scale to benefit learners. There are a number of reasons why this may be the case. Despite recent initiatives, such as the advent of open courseware and massive open online courses, which claim to be breaking down educational barriers, formal education is still institutionally focused. Furthermore, despite claims to the contrary, many commercial education technology vendors develop tools that are primarily designed to meet the requirements of institutions in the first instance, with the needs of individual learners often appearing to be of secondary consideration. However, it is also important to acknowledge that capturing learner data from the myriad systems that students engage with throughout their learning journeys is a non-trivial task. It is considerably more problematic than surfacing data from large scale institutional systems. In addition, students arguably lack the data literacy skills required to realise the potential benefits of the data that they generate through the process, and educational data is still widely regarded as belonging to institutions rather than to learners. Notwithstanding these issues, emerging activity suggests that both institutions and commercial vendors are beginning to lay foundations that may ultimately lead to the development of student centric analytics tools.
Educational Content Analytics
In addition to being applicable to administrative and teaching and learning activities and data, analytics can also be applied to educational resources. Data about how learning resources are used, by whom and in what context, is sometimes referred to as paradata (Campbell & Barker, 2013). The term paradata was first used in this context by the US National Science Digital Library (NSDL) in 2010 (McIlvain, 2013) to refer to data about user interactions with learning resources in their STEM Exchange. The concept was later adopted by the Learning Registry (Paradata in 20 minutes or less, 2011), an initiative funded initially by the U.S. Department of Education and the U.S. Department of Defense, which developed an open source decentralized content-distribution network for storing and sharing information about learning resources and their use.
The ability to gather and analyse data about how educational content is used in real world learning contexts is potentially of considerable value, as it could help to address some of the problems that metadata standards have struggled to resolve. Systems such as digital repositories, which are designed for managing and sharing information about learning resources, generally rely on formal metadata application profiles to describe characteristics of the resources that are likely to be of relevance to their users. However, learning resources represent a diverse class of objects, used in a wide variety of contexts and, as a result, formal metadata standards and controlled vocabularies struggle to identify and describe all the characteristics that may be of interest or relevance to users (Barker & Campbell, 2010). While metadata generally attempts to record objective or authoritative descriptions of a resource, paradata can record the opinion of the users together with how, where and with what outcome a resource has been used. By capturing the user activity related to a resource, paradata complements metadata by providing an additional layer of contextual information that can help to elucidate the potential educational utility of the resource (Campbell & Barker, 2013).
Paradata is generated as learning resources are used, reused, adapted, contextualized, favourited, tweeted and shared. Some paradata is deliberately created by users, e.g. likes and comments, while some is generated automatically as a result of the resource being used, e.g. hits and download statistics. On the simplest level, paradata can be used to record how users interact with learning resources by viewing, downloading, sharing, liking, commenting and tagging. Although paradata primarily refers to data about learning resources, it can also encompass information about users of a resource; e.g. age, educational level, geographical location. It can also record contextual information by linking resources with educational standards and curricula, course catalogues, pedagogic approaches and methodologies. In the context of decentralised content distribution networks, such as the Learning Registry, paradata can also be used to record complex aggregations of activities relating to a single resource, e.g. "between January 2011 and January 2012 lecturers in Engineering, Physics and Maths used this resource six times for undergraduate teaching activities" (Thomas, Campbell, Barker, & Hawksey, 2012).
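As a rough illustration of the kind of aggregation described above, the following minimal sketch rolls individual usage events up into a summary for a single resource. It is not the Learning Registry's paradata format or API; the `ParadataEvent` fields, the `summarise` function and the sample records are all hypothetical and exist only for this example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ParadataEvent:
    """One hypothetical usage record for a learning resource."""
    resource_id: str
    action: str      # e.g. "used", "viewed", "downloaded", "liked"
    actor_role: str  # e.g. "lecturer", "student"
    subject: str     # e.g. "Engineering"
    level: str       # e.g. "undergraduate"
    when: date


def summarise(events, resource_id, action, start, end):
    """Count matching events for one resource between start and end (inclusive)."""
    by_subject = Counter(
        e.subject
        for e in events
        if e.resource_id == resource_id
        and e.action == action
        and start <= e.when <= end
    )
    return {
        "resource": resource_id,
        "action": action,
        "total": sum(by_subject.values()),
        "by_subject": dict(by_subject),
    }


# Invented sample data: six in-range uses, one out-of-range use, one view.
events = [
    ParadataEvent("res-1", "used", "lecturer", "Engineering", "undergraduate", date(2011, 2, 14)),
    ParadataEvent("res-1", "used", "lecturer", "Engineering", "undergraduate", date(2011, 5, 3)),
    ParadataEvent("res-1", "used", "lecturer", "Engineering", "undergraduate", date(2011, 10, 20)),
    ParadataEvent("res-1", "used", "lecturer", "Physics", "undergraduate", date(2011, 3, 9)),
    ParadataEvent("res-1", "used", "lecturer", "Physics", "undergraduate", date(2011, 11, 1)),
    ParadataEvent("res-1", "used", "lecturer", "Maths", "undergraduate", date(2012, 1, 12)),
    ParadataEvent("res-1", "used", "lecturer", "Maths", "undergraduate", date(2012, 6, 1)),
    ParadataEvent("res-1", "viewed", "student", "Physics", "undergraduate", date(2011, 4, 4)),
]

summary = summarise(events, "res-1", "used", date(2011, 1, 1), date(2012, 1, 31))
print(summary)
```

A summary of this shape could be rendered as a sentence like the one quoted above, while the underlying events remain available for other aggregations, for example by educational level or by type of action.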
While the development of tools and systems to capture and analyse paradata has been driven by U.S. initiatives, a small number of innovative projects have successfully experimented with the use of Learning Registry paradata within the context of the UK higher education sector. These include the JLeRN Experiment (About the JLeRN Experiment, n.d.) project, which successfully set up a Learning Registry test node, contributed data to it, including data from the national learning resource repository Jorum (Jorum, n.d.), developed an open source tool for exploring Learning Registry data, and supported a special interest community of developers and practitioners from across the UK higher education sector (Campbell, Barker, Currier, & Syrotiuk, 2013). A second project, ENGrich, at the University of Liverpool, has developed Kritikos (Kritikos, n.d.), a custom search engine for visual media relevant to engineering education. Using Google Custom Search, with filters such as tags, file-types and domains, as a primary search engine, Kritikos pushes and pulls corresponding metadata and paradata to and from the Learning Registry. A user-interface enables academics and students to add comments and additional information about how resources are being used. This additional paradata is then published to a Learning Registry node and used to order subsequent search results (Campbell & Barker, 2013).
Many institutional systems automatically generate paradata, including learning management systems, virtual learning environments, digital repositories, library systems and social media applications; however, it is not always simple to surface this data or query it in meaningful ways. The application of analytics to learning resources and the use of paradata are still in their infancy and, although the Learning Registry potentially offers an innovative solution to surfacing and sharing this data, it has not yet been widely adopted outside the U.S. and, as with learning analytics, applications have yet to emerge that are of real benefit to learners.
As institutional managers, administrators and researchers are well aware, any practice involving data collection and reuse has inherent legal and ethical implications. Most institutions have clear guidelines and policies in place governing the collection and use of research data; however it is less common for institutions to have legal and ethical guidelines on the use of data gathered from internal systems (Prinsloo & Slade, 2013). As is often the case, the development of legal frameworks has not kept pace with the development of new technologies.
The Cetis Analytics Series paper on Legal, Risk and Ethical Aspects of Analytics in Higher Education (Kay, Korn, & Oppenheim, 2012) outlines a set of common principles that have universal application:
- Clarity - open definition of purpose, scope and boundaries, even if that is broad and in some respects open-ended.
- Comfort and care - consideration for both the interests and the feelings of the data subject and vigilance regarding exceptional cases.
- Choice and consent - informed individual opportunity to opt-out or opt-in.
- Consequence and complaint - recognition that there may be unforeseen consequences and therefore provision of mechanisms for redress. (p. 6)
In short, it is fundamental that institutions are aware of the legal and ethical implications of any activity requiring data collection before undertaking any form of data analysis activity.
Given the diversity of research strands that feed into the area of analytics in education, together with the increased ease of data storage, the field is expanding rapidly in a wide range of new directions. Until recently, the focus of most analytics developments to support teaching and learning has been on integrating tools with existing institutional learning management systems (Ferguson, 2013). This is primarily because such integration provides relatively easy access to available student data. However, the increased adoption of third party services such as social network tools and applications, and the emergence of massive open online courses (MOOCs), have created new opportunities for large-scale experimentation with analytics. Recent examples of projects that are seeking to explore the use of data from MOOCs and social networks include Stanford University's Lytics Lab (Lytics, n.d.) which, amongst other work, runs randomised control trials of MOOC courses offered by Coursera. In addition to using analytics to identify potential "threshold concepts" that might be exposed by tens of thousands of students taking multiple choice question tests, there are opportunities to identify, analyse and define wider engagement patterns within subpopulations of learners (Kizilcec, Piech, & Schneider, 2013).
Large scale analytics initiatives are also taking place at a national level, with varying degrees of success. Launched in early 2013, inBloom is a US non-profit organisation, backed by the Carnegie Corporation and the Bill and Melinda Gates Foundation, that aims to create infrastructure to integrate, analyse and provide solutions to personalise student learning for schools at state and district level. By creating a common interface, inBloom set out to stimulate educational technology providers to develop new tools utilising the growing database and infrastructure around student data, without the cost of having to develop custom connections to existing local infrastructure such as student management systems (inBloom, 2013). Shortly after the project was launched, however, parents and civil liberties organisations began raising concerns about centralising sensitive student data in this manner and asking questions about who would have access to the data (Campbell, 2013). By August 2013 these concerns had resulted in a significant number of states pulling out of the project altogether, leaving only four school districts participating (Nelson, 2013).
Within the UK, the Department for Education launched an Analytical Review looking at the role of research, analysis and data within the Department. The Review focused on two key areas: data systems for the collection, sharing and retrieval of data generated by English schools; and the role of randomised controlled trials for "building evidence into education" (Department for Education, 2013). Whilst it is still unclear what data exchange models will be adopted by the Department, the announcement, following the publication of two randomised controlled trials, is a clear indication that analytics is playing an increasingly significant role at all levels of education (Department for Education, 2013).
As analytics initiatives continue developing, it is highly likely that commercial practices will continue transferring into the educational sector. Recommendation systems and targeted advertising are the backbone of commercial giants such as Amazon and Google, but they are increasingly finding their way into learning analytics systems. Emerging products in this area include Talis Aspire (Talis Aspire, 2013), which offers complete reading-list management solutions based on usage data to provide both staff and students with insights into catalogue use, thus creating opportunities for personalised learning.
Associated with these developments are "analytics as a service" products offered by companies that specialize in providing analytic services for a fee. Companies such as Narrative Science, which specializes in automatically producing text-based summaries of numeric data, have already highlighted opportunities for creating personalised feedback with actionable insights by combining data from test results (Hammond, 2012).
There is undoubtedly potential for analytics-based approaches to provide new and increasingly nuanced data about users' interactions with content and systems in teaching and learning contexts. Such approaches have the potential to lead to greater understanding of patterns of learner behaviour and learner networks and interactions, based on data that has previously been difficult or impossible to access.
The emergence of MOOCs could provide the context to leverage big data techniques, e.g. large-scale data warehousing of education-specific big data sets. How useful such data will be is unclear at present, as MOOCs are still in their infancy and their long term viability is still to be established. Analysis of MOOC data sets may not progress far beyond current recommendation systems (similar to Amazon, Netflix etc.), or generic guidance and advice for learners. At this stage, it seems more likely that the most useful application of analytics approaches will be at the institutional or meso level, as this will provide greater contextual information for strategic planning and creativity in designing and teaching courses. It remains to be seen whether these developments will have a direct impact on students and lead to more successful learning experiences and outcomes.
It is important to recognize that access to data alone will not have a significant impact on the higher education sector; people are needed to contextualize, interpret and act upon the data. Consequently, developing staff and students' data literacies will be critical to enabling the cultural shift required to move towards data driven design and decision making approaches within education.
The authors would like to thank all their colleagues involved in producing the Cetis Analytics Series, in particular Adam Cooper and Stephen Powell, whose work we have drawn on extensively for this article.
References
Bienkowski, M., Feng, M., & Means, B. (2012). Enhancing teaching and learning through educational data mining and learning analytics: An issue brief. US Department of Education, Office of Educational Technology. Retrieved from https://www.ed.gov/edblogs/technology/files/2012/03/edm-la-brief.pdf
Campbell, L. M., Barker, P., Currier, S., & Syrotiuk, N. (2013). The Learning Registry: Social networking for open educational resources? Open Educational Resources 13 Conference. Retrieved from http://www.medev.ac.uk/oer13/108/view/
Department for Education. (2013). New randomised controlled trials will drive forward evidence-based research. [Press release]. Retrieved from https://www.gov.uk/government/news/new-randomised-controlled-trials-will-drive-forward-evidence-based-research
Ferguson, R., & Buckingham Shum, S. (2012). Social learning analytics: Five approaches. 2nd International Conference on Learning Analytics & Knowledge, Vancouver, Canada. Retrieved from http://oro.open.ac.uk/32910/
Kizilcec, R. F., Piech, C., & Schneider, E. (2013). Deconstructing disengagements: Analyzing learner subpopulations in massive open online courses. Proceedings of the Third International Conference on Learning Analytics and Knowledge, Leuven, Belgium. Retrieved from http://hcibib.org/LAK13#S2
Koulocheri, E., & Xenos, M. (2013). Considering formal assessment in learning analytics within a PLE: the HOU2LEARN case. Proceedings of the Third International Conference on Learning Analytics and Knowledge, Leuven, Belgium. Retrieved from http://hcibib.org/LAK13#S2
Long, P., & Siemens, G. (2011). Penetrating the fog: Analytics in learning and education. EDUCAUSE Review, 46(5). Retrieved from http://www.educause.edu/ero/article/penetrating-fog-analytics-learning-and-education
Prinsloo, P., & Slade, S. (2013). An evaluation of policy frameworks for addressing ethical considerations in learning analytics. Proceedings of the Third International Conference on Learning Analytics and Knowledge, Leuven, Belgium. Retrieved from http://hcibib.org/LAK13#S2
 Not to be confused with survey paradata, which refers to administrative data about the processes by which survey data is collected.