Open Access Paper
24 May 2022 Research on student portrait system based on canopy and k-means algorithm
Suzhi Zhang D.D.S., Minzhe Shen II, Yiran Yu III
Proceedings Volume 12260, International Conference on Computer Application and Information Security (ICCAIS 2021); 122601W (2022) https://doi.org/10.1117/12.2637367
Event: International Conference on Computer Application and Information Security (ICCAIS 2021), 2021, Wuhan, China
Abstract
To address the problems of incomplete information and low data mining efficiency in current student management systems, a college student portrait system is established based on Hadoop big data processing technology. The system collects student data from various business platforms in colleges and universities and stores it in HDFS; applies a clustering algorithm based on canopy and k-means for multi-dimensional analysis of the student data; and uses the ECharts tool to visualize the analysis results and generate student portraits. Experiments show that the student portrait system based on canopy and k-means can describe students in multiple dimensions and help schools understand students more comprehensively.

1. INTRODUCTION

With the construction of digital campuses and smart campuses, various intelligent application systems in colleges and universities have accumulated a large amount of student data [1-3]. Using this data to construct multi-dimensional student data portraits and to help school administrators classify and manage students more scientifically is a goal of smart campus construction [4].

Research on campus big data has made significant progress [5-7]. Yin aggregated and analyzed the data generated during students' learning and built student portrait tag libraries of different dimensions from the results; longitudinal labels were then used for variance analysis of the statistical data to understand the relationship between each label and grades and to help students learn professional courses more efficiently [8]. Perikos used decision tree mining to extract rules from students' routine assessment scores in order to predict course performance, help teachers understand students' learning, and improve the quality of teaching [9]. Liu applied k-means clustering to a total of 23.843 million online records generated over 4 years by 3,245 students of one grade at a university, and constructed a student network portrait [10]. Using personalized multiple regression and matrix factorization methods, Elbadrawy accurately predicted student performance in future courses and in-class assessments, helping students choose their majors and courses [11]. Ge Suhui obtained the dynamic behavior trajectory data of students on campus through various intelligent terminals in the smart campus, performed cluster analysis on the data, and generated student portraits based on the feature matrix, so that university management and teachers could grasp the living conditions of students [12].

Existing research shows that the dimensions covered by campus big data analysis are not comprehensive enough. To address this problem, a Hadoop-based college student portrait system is established using the student data held in the various information management platforms of colleges and universities. By designing a student behavior description index system and using machine learning algorithms to mine and analyze student behavior data, the system constructs student data portraits from multiple dimensions and helps school administrators manage students in a personalized manner.

2. SYSTEM DESIGN

The student portrait system architecture is divided into a data acquisition and processing layer, a data storage layer, a data analysis layer, and a visual display layer. First, an experimental environment is built on the Hadoop big data framework. Since the student data comes from various intelligent application systems in the university, the Sqoop data exchange tool is used to synchronize the student data from the school's Oracle database into the Hive data warehouse, with the data files stored in HDFS. The student behavior data is then mined and analyzed with the improved k-means algorithm. Finally, Java Web technology and ECharts are used to visualize the analysis results and provide friendly user interaction. The system workflow is shown in Figure 1.

Figure 1. Student data portrait system workflow.

3. ANALYSIS METHODS

3.1 Principle of k-means algorithm

K-means is an iteratively solved cluster analysis algorithm, where k refers to the number of clusters and means is understood as the mean of the data in each class. K-means clustering is one of the most popular and simple clustering algorithms, and is still widely used in many fields today due to its simplicity, efficiency, and ease of implementation.

The basic idea of the k-means algorithm is that a given sample set X = {x1, x2, …, xm} is divided into k clusters according to the distance d between samples. For the resulting clusters C = {C1, C2, …, Ck}, the squared error E that k-means minimizes is calculated as follows, where μi is the mean vector of cluster Ci.

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2$$

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

The distance between sample objects is generally measured by Euclidean distance. For two samples xi and xj with p features each, it is calculated as follows.

$$d(x_i, x_j) = \sqrt{\sum_{u=1}^{p} (x_{iu} - x_{ju})^2}$$

The flow chart of the k-means algorithm is shown in Figure 2.

Figure 2. Flow chart of k-means algorithm.
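The iteration described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual implementation: the function names `euclidean`, `mean_vector` and `kmeans`, the `max_iter` parameter, and the toy data are all assumptions introduced here. The returned `sse` corresponds to the squared-error criterion that k-means minimizes.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(samples, centers, max_iter=100):
    """Plain k-means starting from the given initial centers.

    Returns (centers, labels, sse); sse is the sum of squared distances
    from every sample to the center of its cluster.
    """
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda i: euclidean(x, centers[i]))
            for x in samples
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            new_centers.append(mean_vector(members) if members else old)
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    sse = sum(euclidean(x, centers[lab]) ** 2 for x, lab in zip(samples, labels))
    return centers, labels, sse


# Toy 2-D data, made up for illustration only.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [8.0, 8.0], [8.2, 7.8], [7.8, 8.2]]
init = random.Random(0).sample(data, 2)  # random initial centers, as in classic k-means
centers, labels, sse = kmeans(data, init)
print(centers, labels, sse)
```

Because the initial centers are drawn at random, different seeds can lead to different final clusters, which is the sensitivity to initialization that motivates the improved algorithm.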

3.2 Improved k-means algorithm

The k value of the traditional k-means algorithm must be set in advance, and the initial cluster centers are selected at random, so the result of the algorithm changes with the choice of centers and may converge to a local optimum. The canopy algorithm is a fast, coarse clustering technique. Although its accuracy is low and it cannot give precise clustering results, it can provide a suitable number of clusters. Canopy clustering can therefore be used first to cluster the data coarsely, and the number of clusters and the cluster centers it produces can then be passed to the k-means algorithm as input parameters to complete the fine clustering of the data set. The process of using the canopy algorithm to help the k-means algorithm select the initial cluster centers is shown in Figure 3.

Figure 3. Canopy algorithm to determine cluster centers.

The specific algorithm is as follows:

Input: dataset X = {x1, x2, …, xn}

Step 1: Determine two distance thresholds T1 and T2 through cross-validation tuning or prior knowledge, where T1 > T2.

Step 2: Randomly select a sample point P from the dataset X as the first canopy cluster center, and add P to the set C. Remove P from X.

Step 3: Randomly select a sample Q from the dataset X, calculate the Euclidean distance from Q to each canopy cluster center in set C, and take the minimum of these distances as d.

Step 4: Compare d with T1. If d < T1, add the sample point Q to the nearest canopy and give it a weak label. If d > T1, set Q as a new canopy cluster center, add Q to set C, and remove Q from dataset X.

Step 5: Compare d with T2. If d < T2, give Q a strong label, update the center of this canopy to the center position of all its strongly labeled samples, and remove Q from dataset X.

Step 6: Repeat steps 3, 4 and 5 until dataset X is empty.

Output: Cluster center set C
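The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code, and it rests on two assumptions that the steps leave open: every selected sample is removed from the dataset once it has been examined (including weakly labeled samples with T2 ≤ d ≤ T1, so that the loop is guaranteed to terminate), and the first center of each canopy counts as one of its strongly labeled samples. The names `canopy_centers`, `t1`, `t2` and `seed` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def canopy_centers(samples, t1, t2, seed=0):
    """Coarse canopy pass that returns the list of canopy centers.

    t1 is the loose threshold and t2 the tight one, with t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = [list(x) for x in samples]
    rng.shuffle(remaining)  # popping from a shuffled list acts as random selection

    # Step 2: the first selected sample becomes the first canopy center.
    first = remaining.pop()
    centers = [first]
    strong = [[first]]  # strongly labeled members of each canopy (assumption: includes its seed)

    while remaining:  # Step 6: repeat until the dataset is empty
        q = remaining.pop()  # Step 3: select a sample (assumption: always removed)
        dists = [euclidean(q, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        d = dists[nearest]
        if d > t1:
            # Step 4: too far from every canopy, so q starts a new one.
            centers.append(q)
            strong.append([q])
        elif d < t2:
            # Step 5: strong member; move the center to the mean of the strong members.
            strong[nearest].append(q)
            n = len(strong[nearest])
            centers[nearest] = [sum(col) / n for col in zip(*strong[nearest])]
        # Otherwise q is only a weak member and the center stays where it is.
    return centers


# Toy 2-D data, made up for illustration only.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [8.0, 8.0], [8.2, 7.8], [7.8, 8.2]]
centers = canopy_centers(data, t1=3.0, t2=1.0)
print(len(centers), centers)
```

The length of the returned list serves as k, and the list itself serves as the initial cluster centers for the subsequent k-means pass.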

4. EXPERIMENT AND ANALYSIS

4.1 Data collection

The collected student behavior data is desensitized before use. The original data includes campus card consumption data of some undergraduates at a university in Henan, library access control data, and educational administration system data. For details, see Tables 1-3.

Table 1. Student card canteen consumption data.

Student ID | Name | Amount | Balance | Settlement department | Time
*******0135 | Zhang** | 9.0 | 161.5 | Chicken soup | 2019/3/12 11:24
*******0145 | Wu* | 5.5 | 116.0 | Meat pie | 2019/3/12 12:06
*******0124 | Li* | 10.5 | 65.0 | Duck leg rice | 2019/3/12 11:45

Table 2. Library loan data.

Date | Operation type | Book title | Index number | Reader ID
2019/3/23 16:21 | Lend | Principles of Education | G40/323 | 1540*******
2019/3/24 8:54 | Lend | Counseling Psychology | C932/42 | 1540*******
2019/3/24 9:15 | Lend | Chrysanthemum and the Sword | H3189/3927 | 1540*******

Table 3. Academic affairs system grade data.

School year | Semester | Student ID | Name | Course title | Score | GPA | Retest mark
2019-2020 | 1 | ********0136 | Peng* | Machine learning | 86 | 3.6 | -
2019-2020 | 1 | ********0137 | Wang** | Machine learning | 69 | 2.9 | -
2019-2020 | 1 | ********0138 | Li** | Machine learning | 55 | 0 | 80

4.2 Data processing

The raw student behavior data recorded in the school's information systems is too large and unwieldy to mine directly. In this paper, statistical methods are used to compress, generalize, and normalize the student behavior data to make it more suitable for analysis. For example, for the student card data, the average monthly consumption amount, consumption frequency, and consumption peak of each student are calculated on a monthly cycle. After preprocessing, the evaluation indicators of students are shown in Table 4.

Table 4. Student evaluation metrics.

Dimension | Indicator name | Description
Consumption law | Amount of consumption | Average monthly spending by students
Consumption law | Consumption frequency | Average number of purchases by students per month
Consumption law | Single consumption | Average single spending by students
Consumption law | Peak consumption | Maximum student monthly spending
Living habits | Online time | Average monthly time spent online
Living habits | Work and rest rules | Average number of early wakes per month
Learning situation | Number of library visits | Average number of visits to the library per month
Learning situation | Number of books borrowed | Number of books borrowed from the library
Learning situation | Average scores | Grade point average per semester
Learning situation | Course pass rate | Number of courses passed / Number of all courses
Learning situation | Class attendance | Actual attendance times / required attendance times
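As an illustration of the monthly aggregation described in this section, the sketch below derives the four consumption indicators from raw card records and then rescales a column of values. It is a hypothetical example, not the system's code: the function names, the record layout `(student_id, amount, timestamp)`, and the sample records are assumptions, and min-max scaling is used only because the text does not say which normalization method was applied.

```python
from collections import defaultdict
from datetime import datetime


def consumption_indicators(records):
    """Aggregate raw card records into per-student monthly indicators.

    records: iterable of (student_id, amount, "YYYY/M/D H:MM") tuples.
    Returns {student_id: dict with the four consumption indicators}.
    """
    monthly = defaultdict(lambda: defaultdict(list))  # student -> month -> amounts
    for student_id, amount, stamp in records:
        t = datetime.strptime(stamp, "%Y/%m/%d %H:%M")
        monthly[student_id][(t.year, t.month)].append(amount)

    result = {}
    for student_id, months in monthly.items():
        totals = [sum(a) for a in months.values()]
        counts = [len(a) for a in months.values()]
        result[student_id] = {
            "amount": sum(totals) / len(months),     # average monthly spending
            "frequency": sum(counts) / len(months),  # average purchases per month
            "single": sum(totals) / sum(counts),     # average spending per purchase
            "peak": max(totals),                     # highest monthly spending
        }
    return result


def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up records for one student, for illustration only.
records = [
    ("s1", 10.0, "2019/3/12 11:24"),
    ("s1", 6.0, "2019/3/13 12:00"),
    ("s1", 8.0, "2019/4/1 8:05"),
]
print(consumption_indicators(records)["s1"])
print(min_max_normalize([2.0, 4.0, 6.0]))
```

Normalizing each indicator before clustering keeps indicators with large numeric ranges, such as monthly spending, from dominating the Euclidean distance.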

4.3 Experimental result

Due to space limitations, only the student consumption data is clustered here with the canopy and k-means algorithms to analyze students' consumption patterns and construct their on-campus consumption portraits. The results are shown in Table 5.

Table 5. Clustering results of students' consumption.

Number | Average monthly consumption | Monthly peak consumption | Monthly consumption frequency | Average monthly single consumption | Percentage of students
1 | 237.5 | 317.0 | 37.8 | 6.3 | 24.2%
2 | 523.6 | 562.2 | 113.2 | 4.6 | 40.9%
3 | 840.7 | 910.5 | 121.4 | 6.9 | 16.2%
4 | 391.4 | 423.9 | 96.3 | 4.1 | 18.7%

The average monthly consumption of students in group 1 is the lowest among all groups, but this alone does not mean that their overall consumption level is low. Their consumption frequency is also the lowest, while their average single consumption is the second highest among all groups. It can therefore be inferred that these students tend to spend off campus and that their overall consumption level is relatively high.

For group 2, the peak monthly consumption is close to the average monthly consumption, and the average monthly consumption and single consumption are at a middle level among the groups. These characteristics suggest that their consumption is relatively regular, takes place mostly on campus, and is at a moderate level overall.

Group 3 students make up 16.2% of the total. Their average monthly consumption, monthly consumption frequency, and single consumption are all the highest among the groups. It can be inferred that their overall consumption level is relatively high and their living expenses are relatively sufficient.

For group 4, the average monthly consumption and single consumption are relatively low. This suggests that the overall consumption level of these students is relatively low and that they are likely to be economically disadvantaged. Schools should give these students priority in financial aid evaluations.
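The figures in Table 5 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the average single consumption should roughly equal average monthly consumption divided by monthly frequency, and uses the ratio of peak to average monthly consumption as a simple measure of how regular a group's spending is; both are interpretations introduced here, not definitions given by the paper, and the names `implied_single` and `peak_ratio` are hypothetical.

```python
# Rows of Table 5: (group, average monthly consumption, monthly peak,
# monthly frequency, average single consumption, share of students in %).
TABLE_5 = [
    (1, 237.5, 317.0, 37.8, 6.3, 24.2),
    (2, 523.6, 562.2, 113.2, 4.6, 40.9),
    (3, 840.7, 910.5, 121.4, 6.9, 16.2),
    (4, 391.4, 423.9, 96.3, 4.1, 18.7),
]


def implied_single(row):
    """Average spend per purchase implied by monthly amount / frequency."""
    _, amount, _, frequency, _, _ = row
    return round(amount / frequency, 1)


def peak_ratio(row):
    """Peak monthly consumption relative to the monthly average."""
    _, amount, peak, _, _, _ = row
    return peak / amount


for row in TABLE_5:
    # Compare the implied value with the reported single-consumption column.
    print(row[0], implied_single(row), row[4], round(peak_ratio(row), 3))

total_share = round(sum(row[5] for row in TABLE_5), 1)  # shares should cover all students
steadiest = min(TABLE_5, key=peak_ratio)[0]             # group whose peak is closest to its average
print(total_share, steadiest)
```

Under these assumptions the implied single-consumption values match the reported column to one decimal place, and group 2 has the smallest peak-to-average ratio, consistent with the description of its consumption as relatively regular.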

4.4 Student portrait

According to the clustering results for the three dimensions of consumption patterns, living patterns, and effort level, student categories with different behavior characteristics are summarized, categorized, and labeled. The label for each dimension is then combined with the student's basic information to describe the student's behavioral portrait. A student portrait is shown in Figure 4.

Figure 4. Student image.
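The label integration step can be pictured as a simple lookup and merge. The following sketch is hypothetical: the function name `build_portrait`, the dimension keys, and every label string are placeholders chosen for illustration, since the paper does not list the labels the system actually uses.

```python
def build_portrait(basic_info, cluster_ids, label_maps):
    """Combine a student's basic info with one text label per dimension.

    basic_info:  dict of descriptive fields (e.g. masked ID, name).
    cluster_ids: {dimension: cluster number assigned by the clustering step}.
    label_maps:  {dimension: {cluster number: human-readable label}}.
    """
    portrait = dict(basic_info)  # copy so the input record is left untouched
    portrait["labels"] = {
        dim: label_maps[dim].get(cid, "unlabeled")
        for dim, cid in cluster_ids.items()
    }
    return portrait


# Placeholder labels, for illustration only.
label_maps = {
    "consumption": {1: "mostly off-campus spending", 2: "regular, moderate spending"},
    "living": {1: "early riser"},
}
student = {"student_id": "*******0124", "name": "Li*"}
print(build_portrait(student, {"consumption": 2, "living": 1}, label_maps))
```

Keeping the cluster-to-label mapping in a separate table means the wording of a label can be revised without re-running the clustering.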

5. CONCLUSION

The college student portrait system uses the relevant components of the Hadoop framework to integrate, clean, and store the student data, and uses the canopy and k-means algorithms to perform cluster analysis on the processed data. The system constructs student portraits from the perspectives of student consumption patterns, life patterns, and effort levels, and objectively displays the characteristics of student groups. Follow-up work can extend the analysis to teachers, building teacher portraits from teachers' personal information, scientific research results, and teaching evaluations to understand their current behavior status and provide a quantitative basis for personnel management decisions.

REFERENCES

[1] Muhamad, W., Kurniawan, N. B., Suhardi and Yazid, S., “Smart campus features, technologies, and applications: A systematic literature review,” in 2017 International Conference on Information Technology Systems and Innovation, 384-391 (2017).

[2] Valks, B., Arkesteijn, M. H., Koutamanis, A. and Heijer, A. C. D., “Towards a smart campus: Supporting campus decisions with Internet of Things applications,” Building Research & Information, 49 (01), 1-20 (2021). https://doi.org/10.1080/09613218.2020.1784702

[3] Fischer, C., Pardos, Z. A. and Baker, R. S., “Mining big data in education: Affordances and challenges,” Review of Research in Education, 44 (1), 130-160 (2020). https://doi.org/10.3102/0091732X20903304

[4] Zheng, Y. F., “Survey of big data visualization in education,” Journal of Frontiers of Computer Science and Technology, 15 (03), 403-422 (2021).

[5] Dwivedi, S. and Roshni, V. S. K., “Recommender system for big data in education,” in 2017 5th National Conference on E-Learning & E-Learning Technologies, 1-4 (2017).

[6] Viloria, A., Naveda, A. S. and Palma, H. H., “Using big data to determine potential dropouts in higher education,” in Journal of Physics: Conference Series, 012077 (2020).

[7] Khan, A. and Ghosh, S. K., “Student performance analysis and prediction in classroom learning: A review of educational data mining studies,” Educ. Inf. Technol., 26, 205-240 (2021). https://doi.org/10.1007/s10639-020-10230-3

[8] Yin, Y. and Yu, X., “Portrait of college students and personalized teaching mode under the blended learning mode based on big data technology,” in International Conference on Cognitive based Information Processing and Applications, 992-997 (2022).

[9] Grivokostopoulou, F., Perikos, I. and Hatzilygeroudis, I., “Utilizing semantic web technologies and data mining techniques to analyze students learning and predict final performance,” in 2014 IEEE International Conference on Teaching, Assessment and Learning for Engineering, 488-494 (2014).

[10] Liu, K. S., Ni, Y. K., Li, Z. and Duan, B., “Data mining and feature analysis of college students’ campus network behavior,” in 2020 5th IEEE International Conference on Big Data Analytics, 231-237 (2020).

[11] Elbadrawy, A., Polyzou, A., Ren, Z., Sweeney, M., Karypis, G. and Rangwala, H., “Predicting student performance using personalized analytics,” Computer, 61-69 (2016). https://doi.org/10.1109/MC.2016.119

[12] Ge, S. H., Wan, Q. and Bai, C. J., “Hadoop-based college student behavior warning decision system,” Computer Applications and Software, 38 (01), 6-12 (2021).
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
KEYWORDS
Data centers

Analytical research

Intelligence systems

Machine learning

Statistical analysis
