import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
The dataset was obtained from a higher education institution in Portugal by M. V. Martins, D. Tolledo, J. Machado, L. M. T. Baptista, and V. Realinho, who also wrote the paper “Early prediction of student’s performance in higher education: a case study” (2021) using this dataset. Funding was provided by the program SATDAP - Capacitação da Administração Pública. The dataset was created to help reduce academic dropout and failure in higher education through machine learning techniques. It includes information on the academic path, demographics, and socioeconomic factors of students at the time they enrolled in undergraduate degrees such as education, nursing, and journalism, as well as their academic performance at the end of the first and second semesters. The ‘Target’ column takes the values ‘Dropout’, ‘Enrolled’, and ‘Graduate’. Using the provided features and classification techniques, I classified the students into these three categories. This matters because the institution can identify students who are on the “dropout path” and intervene early to ensure they have the necessary means to continue their education.
The dataset consists of 36 features, a “Target” column, and 4424 student records, as can be seen in Table 1 in the Appendix. The features describe each student's demographics, socioeconomic status, and academic standing. Demographic features include nationality, gender, age, and whether the student is international. Socioeconomic features include marital status, father’s occupation, mother’s occupation, and whether tuition fees are up to date. Academic features include the grade of the previous qualification, the admission grade, the grades of the first and second semesters, and the number of curricular units the student was credited with. All values in the dataset are numeric except the “Target” column, whose values are labeled ‘Dropout’, ‘Enrolled’, and ‘Graduate’. I used all the features in my analysis.
The dataset can be found at https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success
Before applying any classification model, I needed to preprocess the data. The authors who provided the data had already performed rigorous preprocessing to handle anomalies, unexplainable outliers, and missing values. Hence, the only preprocessing steps I performed were encoding the "Target" values as integers, scaling the data with a StandardScaler, and splitting the data into training, validation, and test sets. Twenty percent of the data was set aside as the test set, twenty percent of the remaining data was set aside as the validation set, and the rest was used as the training set. The training set had 2831 instances, the validation set 708, and the test set 885. All hyperparameter tuning was carried out on the training data and checked on the validation data. Finally, the tuned models were applied to the test set to measure their performance.
Four classification algorithms were applied: K Nearest Neighbors (KNN), Support Vector Machines (SVM), SGDClassifier (used as a Logistic Regression classifier), and Random Forests. For each model, manual cross validation was performed, and the mean and standard deviation of the scores were computed to check how stable the results were across folds. Additionally, grid search cross validation was performed to find the best hyperparameters. The models tuned with these hyperparameters were then applied to the test set. Finally, confusion matrices, classification reports, and accuracy scores were printed for the validation and test sets, and the features each model relied on were ranked and plotted.
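The reported set sizes follow from applying two successive 20% splits to 4424 rows. A minimal sketch of that arithmetic, assuming the rounding that scikit-learn's train_test_split applies to a fractional test_size (the hold-out count is rounded up and the remainder is kept for training); the helper name split_sizes is illustrative:

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out) for a fractional hold-out split.

    Assumes the hold-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 4424
n_remaining, n_test = split_sizes(n_total, 0.2)   # first split: hold out the test set
n_train, n_val = split_sizes(n_remaining, 0.2)    # second split: hold out the validation set

print(n_train, n_val, n_test)  # 2831 708 885
```

Because the second split takes 20% of the remaining 80%, the validation set is 16% of the full data and the training set 64%.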
df = pd.read_csv('data.csv', sep=';')
df.head()
Marital status | Application mode | Application order | Course | Daytime/evening attendance\t | Previous qualification | Previous qualification (grade) | Nacionality | Mother's qualification | Father's qualification | ... | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.000000 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.000000 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.400000 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.000000 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
5 rows × 37 columns
df.shape
(4424, 37)
df.Target.value_counts()
Graduate    2209
Dropout     1421
Enrolled     794
Name: Target, dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #  Column  Non-Null Count  Dtype
 0  Marital status  4424 non-null  int64
 1  Application mode  4424 non-null  int64
 2  Application order  4424 non-null  int64
 3  Course  4424 non-null  int64
 4  Daytime/evening attendance  4424 non-null  int64
 5  Previous qualification  4424 non-null  int64
 6  Previous qualification (grade)  4424 non-null  float64
 7  Nacionality  4424 non-null  int64
 8  Mother's qualification  4424 non-null  int64
 9  Father's qualification  4424 non-null  int64
 10  Mother's occupation  4424 non-null  int64
 11  Father's occupation  4424 non-null  int64
 12  Admission grade  4424 non-null  float64
 13  Displaced  4424 non-null  int64
 14  Educational special needs  4424 non-null  int64
 15  Debtor  4424 non-null  int64
 16  Tuition fees up to date  4424 non-null  int64
 17  Gender  4424 non-null  int64
 18  Scholarship holder  4424 non-null  int64
 19  Age at enrollment  4424 non-null  int64
 20  International  4424 non-null  int64
 21  Curricular units 1st sem (credited)  4424 non-null  int64
 22  Curricular units 1st sem (enrolled)  4424 non-null  int64
 23  Curricular units 1st sem (evaluations)  4424 non-null  int64
 24  Curricular units 1st sem (approved)  4424 non-null  int64
 25  Curricular units 1st sem (grade)  4424 non-null  float64
 26  Curricular units 1st sem (without evaluations)  4424 non-null  int64
 27  Curricular units 2nd sem (credited)  4424 non-null  int64
 28  Curricular units 2nd sem (enrolled)  4424 non-null  int64
 29  Curricular units 2nd sem (evaluations)  4424 non-null  int64
 30  Curricular units 2nd sem (approved)  4424 non-null  int64
 31  Curricular units 2nd sem (grade)  4424 non-null  float64
 32  Curricular units 2nd sem (without evaluations)  4424 non-null  int64
 33  Unemployment rate  4424 non-null  float64
 34  Inflation rate  4424 non-null  float64
 35  GDP  4424 non-null  float64
 36  Target  4424 non-null  object
dtypes: float64(7), int64(29), object(1)
memory usage: 1.2+ MB
df.isna().sum()
Marital status  0
Application mode  0
Application order  0
Course  0
Daytime/evening attendance\t  0
Previous qualification  0
Previous qualification (grade)  0
Nacionality  0
Mother's qualification  0
Father's qualification  0
Mother's occupation  0
Father's occupation  0
Admission grade  0
Displaced  0
Educational special needs  0
Debtor  0
Tuition fees up to date  0
Gender  0
Scholarship holder  0
Age at enrollment  0
International  0
Curricular units 1st sem (credited)  0
Curricular units 1st sem (enrolled)  0
Curricular units 1st sem (evaluations)  0
Curricular units 1st sem (approved)  0
Curricular units 1st sem (grade)  0
Curricular units 1st sem (without evaluations)  0
Curricular units 2nd sem (credited)  0
Curricular units 2nd sem (enrolled)  0
Curricular units 2nd sem (evaluations)  0
Curricular units 2nd sem (approved)  0
Curricular units 2nd sem (grade)  0
Curricular units 2nd sem (without evaluations)  0
Unemployment rate  0
Inflation rate  0
GDP  0
Target  0
dtype: int64
#Convert Target variables to numbers
def label(row):
    if row == 'Graduate':
        return 2
    elif row == 'Enrolled':
        return 1
    else:
        return 0
df['Target'] = df['Target'].apply(label)
df['Target'].value_counts()
2    2209
0    1421
1     794
Name: Target, dtype: int64
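The `label` function sends every value other than 'Graduate' or 'Enrolled' to 0, so a misspelled or unexpected label would silently be counted as a dropout. A minimal sketch of a stricter alternative, using a hypothetical TARGET_CODES dictionary and encode_target helper that are not part of the original notebook:

```python
from collections import Counter

# Same integer codes as in the notebook; the dictionary name is illustrative.
TARGET_CODES = {"Dropout": 0, "Enrolled": 1, "Graduate": 2}

def encode_target(value):
    """Map a Target label to its integer code, failing loudly on unknown labels."""
    try:
        return TARGET_CODES[value]
    except KeyError:
        raise ValueError(f"Unexpected Target label: {value!r}") from None

# Target values of the first five rows shown by df.head()
labels = ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"]
encoded = [encode_target(v) for v in labels]
print(encoded)  # [0, 2, 0, 2, 2]
print(Counter(encoded))
```

With pandas, the same dictionary could be applied as `df['Target'].map(TARGET_CODES)`, which leaves NaN for any label missing from the dictionary instead of assigning it a class.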
The frequency distribution of all the features and the “Target” column can be seen in the following figure. Most of the features are discrete, while continuous features such as the grade columns are approximately Gaussian. “Target” values are encoded as “Graduate” => 2, “Enrolled” => 1, and “Dropout” => 0. There are roughly 2.8 times as many graduates as students still enrolled and roughly 1.6 times as many graduates as dropouts, so the classes are imbalanced.
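The class ratios can be checked directly from the value counts, and the same counts give the accuracy of a classifier that always predicts the majority class, a useful floor when reading the model scores later on. A small sketch using only the counts printed earlier:

```python
counts = {"Graduate": 2209, "Dropout": 1421, "Enrolled": 794}
total = sum(counts.values())

grad_to_enrolled = counts["Graduate"] / counts["Enrolled"]
grad_to_dropout = counts["Graduate"] / counts["Dropout"]
# Accuracy of always predicting the most frequent class
majority_baseline = max(counts.values()) / total

print(total)                                                  # 4424
print(round(grad_to_enrolled, 2), round(grad_to_dropout, 2))  # 2.78 1.55
print(round(majority_baseline, 4))                            # 0.4993
```

Any model scoring close to 0.50 is therefore doing no better than predicting 'Graduate' for every student.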
import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()
The 15 columns with the highest absolute correlation with “Target” are listed below. They include the grades and curricular units students obtained in the first and second semesters, as well as tuition and scholarship information, among others. The second-semester features show the highest correlations, plausibly because completing the second semester is a good indicator of how invested students are in the program.
corr_df = pd.DataFrame(df.corr()['Target'].abs().sort_values(ascending=False)[:16])[1:]
corr_df.index
corr_df
Target | |
---|---|
Curricular units 2nd sem (approved) | 0.624157 |
Curricular units 2nd sem (grade) | 0.566827 |
Curricular units 1st sem (approved) | 0.529123 |
Curricular units 1st sem (grade) | 0.485207 |
Tuition fees up to date | 0.409827 |
Scholarship holder | 0.297595 |
Age at enrollment | 0.243438 |
Debtor | 0.240999 |
Gender | 0.229270 |
Application mode | 0.221747 |
Curricular units 2nd sem (enrolled) | 0.175847 |
Curricular units 1st sem (enrolled) | 0.155974 |
Admission grade | 0.120889 |
Displaced | 0.113986 |
Previous qualification (grade) | 0.103764 |
SS = StandardScaler()
df_scaled = pd.DataFrame(SS.fit_transform(df.iloc[:,:-1]), columns=df.columns[:-1])
df_scaled['Target'] = df['Target']
df_scaled.head()
Marital status | Application mode | Application order | Course | Daytime/evening attendance\t | Previous qualification | Previous qualification (grade) | Nacionality | Mother's qualification | Father's qualification | ... | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.294829 | -0.095470 | 2.490896 | -4.209520 | 0.350082 | -0.35023 | -0.804841 | -0.126298 | -0.036018 | -0.669778 | ... | -0.282442 | -2.838337 | -2.042630 | -1.471527 | -1.963489 | -0.199441 | -0.287638 | 0.124386 | 0.765761 | 0 |
1 | -0.294829 | -0.209869 | -0.554068 | 0.192580 | 0.350082 | -0.35023 | 2.076819 | -0.126298 | -1.189759 | -1.256427 | ... | -0.282442 | -0.105726 | -0.522682 | 0.518904 | 0.659562 | -0.199441 | 0.876222 | -1.105222 | 0.347199 | 2 |
2 | -0.294829 | -1.010660 | 2.490896 | 0.103404 | 0.350082 | -0.35023 | -0.804841 | -0.126298 | 1.117723 | 0.959802 | ... | -0.282442 | -0.105726 | -2.042630 | -1.471527 | -1.963489 | -0.199441 | -0.287638 | 0.124386 | 0.765761 | 0 |
3 | -0.294829 | -0.095470 | 0.207173 | 0.444115 | 0.350082 | -0.35023 | -0.804841 | -0.126298 | 1.181819 | 0.959802 | ... | -0.282442 | -0.105726 | 0.490616 | 0.187165 | 0.416450 | -0.199441 | -0.813253 | -1.466871 | -1.375511 | 2 |
4 | 1.356212 | 1.162916 | -0.554068 | -0.408389 | -2.856470 | -0.35023 | -2.473171 | -0.126298 | 1.117723 | 1.024985 | ... | -0.282442 | -0.105726 | -0.522682 | 0.518904 | 0.531608 | -0.199441 | 0.876222 | -1.105222 | 0.347199 | 2 |
5 rows × 37 columns
X and y are what remains after the test set (X_test, y_test) is held out. They are then split into training and validation sets.
X_df = df_scaled.drop('Target', axis=1).values
y_df = df['Target'].values
X, X_test, y, y_test =train_test_split(X_df, y_df, test_size=0.2, random_state=21, stratify=y_df)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=21, stratify=y)
print('X_train shape: ', len(X_train))
print('y_train shape:', len(y_train))
print('X_val shape: ', len(X_val))
print('y_val shape:', len(y_val))
print('X_test shape: ', len(X_test))
print('y_test shape: ', len(y_test))
X_train shape: 2831
y_train shape: 2831
X_val shape: 708
y_val shape: 708
X_test shape: 885
y_test shape: 885
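In the cells above, the StandardScaler is fit on all 4424 rows before the split, so the validation and test rows contribute to the scaling statistics. The effect is often small, but a pipeline avoids it by refitting the scaler on the training portion only. A minimal sketch on synthetic data (the make_classification data and the variable names are stand-ins, not the student dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the unscaled feature matrix.
X_raw, y_raw = make_classification(
    n_samples=600, n_features=12, n_informative=5, n_classes=3, random_state=21
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=21, stratify=y_raw
)

# The scaler is a pipeline step, so each CV fold fits it on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {"kneighborsclassifier__n_neighbors": [8, 11, 15, 18]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 4))
```

The step name `kneighborsclassifier` is the lowercase class name that make_pipeline assigns automatically; the double underscore routes the grid values to that step.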
train_acc_knn = {}
test_acc_knn = {}
neighbors = np.arange(1, 26)
for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_acc_knn[neighbor] = knn.score(X_train, y_train)
    test_acc_knn[neighbor] = knn.score(X_val, y_val)
plt.title("KNN: Varying Number of Neighbors")
# Plot training accuracies
plt.plot(neighbors, train_acc_knn.values(), label="Training Accuracy")
# Plot validation accuracies (scored on X_val, not on the held-out test set)
plt.plot(neighbors, test_acc_knn.values(), label="Validation Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
# Display the plot
plt.show()
print("Number of neighbors that results in the highest accuracy: {}".format(max(test_acc_knn, key=test_acc_knn.get)))
Number of neighbors that results in the highest accuracy: 8
knn_manual = KNeighborsClassifier()
kf = KFold(n_splits=6, shuffle=True, random_state=42)
cv_results = cross_val_score(knn_manual, X, y, cv=kf)
cv_results
array([0.68135593, 0.68135593, 0.70677966, 0.70677966, 0.70677966, 0.68590832])
print('Mean of cross val. results is {}, standard deviation is {}'.format(np.mean(cv_results).round(4), np.std(cv_results).round(4)))
Mean of cross val. results is 0.6948, standard deviation is 0.012
print('95% confidence interval is {}'.format(np.quantile(cv_results, [0.025, 0.975]).round(4)))
95% confidence interval is [0.6814 0.7068]
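The quantile call above reports the empirical 2.5%-97.5% range of the six fold scores, which describes their spread rather than a confidence interval for the mean accuracy. A minimal sketch of a Student-t interval for the mean, assuming the fold scores are roughly independent (they are not exactly, since folds share training data, so the interval is indicative only). The critical values are hard-coded from standard t tables to avoid a SciPy dependency, and the function name is illustrative:

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {4: 2.776, 5: 2.571, 9: 2.262}

def mean_confidence_interval(scores):
    """Return (mean, low, high) for a 95% t-interval on the mean of the scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(n)  # sample std (ddof=1)
    margin = T_CRIT_95[n - 1] * std_err
    return mean, mean - margin, mean + margin

# Fold scores of the default KNN model printed above
knn_scores = [0.68135593, 0.68135593, 0.70677966, 0.70677966, 0.70677966, 0.68590832]
mean, low, high = mean_confidence_interval(knn_scores)
print(round(mean, 4), round(low, 4), round(high, 4))
```

With only five or six folds the interval is wide relative to the differences between models, which is worth keeping in mind when comparing their mean scores.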
knn_tuned = KNeighborsClassifier(n_neighbors=18)
knn_tuned.fit(X_train, y_train)
knn_tuned.score(X_val,y_val)
0.7288135593220338
kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'n_neighbors':[8,11, 15, 18], 'metric':['euclidean', 'manhattan']}
knn = KNeighborsClassifier()
knn_gs = GridSearchCV(knn, param_grid, cv=kf)
knn_gs.fit(X_train, y_train)
print(knn_gs.best_params_, knn_gs.best_score_)
{'metric': 'manhattan', 'n_neighbors': 18} 0.7276646661805672
knn_random = RandomizedSearchCV(knn, param_grid, cv=kf, n_iter=2)
knn_random.fit(X_train, y_train)
print(knn_random.best_params_, knn_random.best_score_)
{'n_neighbors': 11, 'metric': 'manhattan'} 0.7262474993923757
Manual tuning (validation accuracy 0.729) and grid search (mean cross-validated accuracy 0.728) both give about 73%. We will use the parameters found by grid search.
knn_tuned_grid = KNeighborsClassifier(n_neighbors=18, metric='manhattan')
knn_tuned_grid.fit(X_train, y_train)
knn_tuned_grid.score(X_train, y_train).round(4)
0.7549
knn_tuned_grid = KNeighborsClassifier(n_neighbors=18, metric='manhattan')
knn_tuned_grid.fit(X_train, y_train)
knn_tuned_grid.score(X_val, y_val).round(4)
0.7161
result = permutation_importance(knn_tuned_grid, X_val, y_val, n_repeats=10, random_state=42)
importances = result.importances_mean.round(5)
important_feat_knn = pd.DataFrame(importances, df_scaled.columns[:-1], columns=['importance'])
important_feat_knn.sort_values(by='importance', ascending=False)
importance | |
---|---|
Curricular units 2nd sem (approved) | 0.02062 |
Tuition fees up to date | 0.01412 |
Debtor | 0.00847 |
Curricular units 2nd sem (evaluations) | 0.00593 |
Scholarship holder | 0.00551 |
Curricular units 1st sem (approved) | 0.00395 |
Curricular units 2nd sem (grade) | 0.00353 |
Curricular units 2nd sem (credited) | 0.00212 |
Curricular units 2nd sem (without evaluations) | 0.00155 |
Application mode | 0.00085 |
Mother's occupation | 0.00042 |
Marital status | 0.00028 |
Educational special needs | 0.00014 |
Curricular units 1st sem (credited) | 0.00014 |
Curricular units 1st sem (without evaluations) | -0.00056 |
Curricular units 1st sem (evaluations) | -0.00071 |
Father's occupation | -0.00085 |
Curricular units 1st sem (grade) | -0.00127 |
Inflation rate | -0.00184 |
Previous qualification (grade) | -0.00212 |
Displaced | -0.00212 |
Admission grade | -0.00226 |
Age at enrollment | -0.00254 |
Nacionality | -0.00282 |
International | -0.00297 |
Mother's qualification | -0.00381 |
Daytime/evening attendance\t | -0.00395 |
Application order | -0.00395 |
Father's qualification | -0.00438 |
GDP | -0.00452 |
Previous qualification | -0.00480 |
Curricular units 1st sem (enrolled) | -0.00537 |
Curricular units 2nd sem (enrolled) | -0.00579 |
Unemployment rate | -0.00692 |
Gender | -0.00706 |
Course | -0.00763 |
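Permutation importance measures how much the score drops when one feature column is shuffled, which breaks its link to the target. Features the model does not rely on score near zero, and small negative values such as those at the bottom of the table are usually shuffling noise rather than evidence that a feature hurts. A self-contained sketch of the idea on a toy rule-based model (the toy data, toy_model, and helper names are illustrative; the notebook itself uses sklearn.inspection.permutation_importance):

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats=10, seed=42):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled version
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Toy classifier that looks only at the sign of feature 0."""
    return int(row[0] > 0)

# Feature 0 alternates in sign and determines the label; feature 1 is ignored by the model.
X_toy = [[(-1) ** i * (i + 1), i % 3] for i in range(20)]
y_toy = [int(row[0] > 0) for row in X_toy]

imp = permutation_importance_sketch(toy_model, X_toy, y_toy)
print(imp)
```

Here the ignored feature gets an importance of exactly zero, while the feature the rule depends on shows a clear drop.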
important_feat_knn.plot(kind='barh', figsize=(12,10))
predicted = knn_tuned_grid.predict(X_test)
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))
print(accuracy_score(y_test, predicted).round(4))
sns.heatmap(confusion_matrix(y_test, predicted)/np.sum(confusion_matrix(y_test, predicted)), fmt='.2%',annot=True, cmap='Blues',cbar=False)
[[173  23  88]
 [ 24  34 101]
 [  6  11 425]]
              precision    recall  f1-score   support
           0       0.85      0.61      0.71       284
           1       0.50      0.21      0.30       159
           2       0.69      0.96      0.80       442
    accuracy                           0.71       885
   macro avg       0.68      0.59      0.60       885
weighted avg       0.71      0.71      0.68       885
0.7141
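The classification report can be reproduced from the confusion matrix alone: recall divides each diagonal entry by its row sum (true class), and precision divides it by its column sum (predicted class). A small sketch using the KNN test matrix printed above, which makes the weak recall on the 'Enrolled' class (label 1) explicit:

```python
cm = [
    [173, 23, 88],   # true Dropout (0)
    [24, 34, 101],   # true Enrolled (1)
    [6, 11, 425],    # true Graduate (2)
]

n_classes = len(cm)
row_sums = [sum(row) for row in cm]
col_sums = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]

recall = [cm[k][k] / row_sums[k] for k in range(n_classes)]
precision = [cm[k][k] / col_sums[k] for k in range(n_classes)]
accuracy = sum(cm[k][k] for k in range(n_classes)) / sum(row_sums)

print([round(r, 2) for r in recall])     # [0.61, 0.21, 0.96]
print([round(p, 2) for p in precision])  # [0.85, 0.5, 0.69]
print(round(accuracy, 4))                # 0.7141
```

Of the 159 students who were actually enrolled, 101 were predicted as graduates, which is where most of the lost recall on that class comes from.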
svm = SVC()
cv_results = cross_val_score(svm, X,y, cv=kf)
print('Mean of cross validation results is {}, standard deviation is {}'.format(np.mean(cv_results).round(4), np.std(cv_results).round(4)))
Mean of cross validation results is 0.7601, standard deviation is 0.0049
The standard deviation is low and all folds give similar results, so the model's performance is stable across folds. This alone does not rule out overfitting, which shows up as a gap between training and validation scores.
print('95% confidence interval is {}'.format(np.quantile(cv_results, [0.025, 0.975]).round(4)))
95% confidence interval is [0.7531 0.7651]
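Overfitting is easier to see by comparing training and validation scores within each fold than by looking at the spread of validation scores alone. A minimal sketch with cross_validate and return_train_score=True, run on synthetic stand-in data rather than the student dataset (all variable names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVC

# Synthetic three-class data standing in for the scaled features X and labels y.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=5, n_classes=3, random_state=42
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(SVC(), X_demo, y_demo, cv=kf_demo, return_train_score=True)

train_mean = results["train_score"].mean()
val_mean = results["test_score"].mean()
print(f"train {train_mean:.4f}  validation {val_mean:.4f}  gap {train_mean - val_mean:.4f}")
```

A large positive gap points to overfitting. The tuned SVM reported further down, with a training accuracy of 0.8516 against a validation accuracy of 0.7698, shows a gap of this kind.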
svm =SVC()
params = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, param_grid=params)
searcher.fit(X_train,y_train)
pred = searcher.predict(X_val)
accuracy_score(y_val, pred)
0.769774011299435
searcher.cv_results_
{'mean_fit_time': array([0.27595596, 0.55183644, 0.39222798, 0.2324347 , 0.30942407, 0.49004912, 0.29670954, 0.24211607, 0.16929913, 0.36431842, 0.25194054, 0.20276361, 0.16316075, 0.18086963, 0.42357111]), 'std_fit_time': array([0.03921296, 0.23337329, 0.09773754, 0.04982113, 0.03287002, 0.09668777, 0.06666036, 0.0211803 , 0.00628583, 0.13873721, 0.01022663, 0.0109675 , 0.00762966, 0.01226799, 0.02233543]), 'mean_score_time': array([0.15382695, 0.27605562, 0.19099441, 0.10318069, 0.12657542, 0.26565924, 0.13917098, 0.10622325, 0.08159723, 0.13483315, 0.11810298, 0.08284287, 0.07531404, 0.07977457, 0.12237706]), 'std_score_time': array([0.0348076 , 0.08090776, 0.07325376, 0.01809595, 0.02352529, 0.06000496, 0.02939992, 0.02223646, 0.00626639, 0.03520782, 0.01340029, 0.00863925, 0.00618585, 0.01213592, 0.01178965]), 'param_C': masked_array(data=[0.1, 0.1, 0.1, 0.1, 0.1, 1, 1, 1, 1, 1, 10, 10, 10, 10, 10], mask=[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], fill_value='?', dtype=object), 'param_gamma': masked_array(data=[1e-05, 0.0001, 0.001, 0.01, 0.1, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1e-05, 0.0001, 0.001, 0.01, 0.1], mask=[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], fill_value='?', dtype=object), 'params': [{'C': 0.1, 'gamma': 1e-05}, {'C': 0.1, 'gamma': 0.0001}, {'C': 0.1, 'gamma': 0.001}, {'C': 0.1, 'gamma': 0.01}, {'C': 0.1, 'gamma': 0.1}, {'C': 1, 'gamma': 1e-05}, {'C': 1, 'gamma': 0.0001}, {'C': 1, 'gamma': 0.001}, {'C': 1, 'gamma': 0.01}, {'C': 1, 'gamma': 0.1}, {'C': 10, 'gamma': 1e-05}, {'C': 10, 'gamma': 0.0001}, {'C': 10, 'gamma': 0.001}, {'C': 10, 'gamma': 0.01}, {'C': 10, 'gamma': 0.1}], 'split0_test_score': array([0.49911817, 0.49911817, 0.63844797, 0.69664903, 0.63844797, 0.49911817, 0.64373898, 0.70017637, 0.7319224 , 0.72310406, 0.64373898, 0.70194004, 0.7319224 , 0.73544974, 0.70194004]), 'split1_test_score': array([0.49823322, 
0.49823322, 0.66077739, 0.70848057, 0.66784452, 0.49823322, 0.66254417, 0.71378092, 0.77738516, 0.7614841 , 0.66254417, 0.71378092, 0.76678445, 0.80388693, 0.74558304]), 'split2_test_score': array([0.49823322, 0.49823322, 0.65724382, 0.71378092, 0.65547703, 0.49823322, 0.6590106 , 0.72791519, 0.73674912, 0.7155477 , 0.66254417, 0.72614841, 0.74381625, 0.74028269, 0.69257951]), 'split3_test_score': array([0.5 , 0.5 , 0.66254417, 0.73144876, 0.65724382, 0.5 , 0.66254417, 0.74381625, 0.78091873, 0.74911661, 0.66254417, 0.74381625, 0.77031802, 0.78975265, 0.72791519]), 'split4_test_score': array([0.5 , 0.5 , 0.66607774, 0.71201413, 0.66431095, 0.5 , 0.66607774, 0.72261484, 0.75441696, 0.7155477 , 0.66431095, 0.72438163, 0.75265018, 0.75971731, 0.7155477 ]), 'mean_test_score': array([0.49911692, 0.49911692, 0.65701822, 0.71247468, 0.65666486, 0.49911692, 0.65878313, 0.72166072, 0.75627847, 0.73296003, 0.65913649, 0.72201345, 0.75309826, 0.76581786, 0.7167131 ]), 'std_test_score': array([0.00079013, 0.00079013, 0.00971234, 0.01121016, 0.01016906, 0.00079013, 0.00784704, 0.01452285, 0.02015298, 0.01885829, 0.00772911, 0.01391966, 0.01427309, 0.02695779, 0.01876695]), 'rank_test_score': array([13, 13, 11, 8, 12, 13, 10, 6, 2, 4, 9, 5, 3, 1, 7])}
gs_svm =pd.DataFrame(searcher.cv_results_['params'])
gs_svm['test_score'] = searcher.cv_results_['mean_test_score']
gs_svm
C | gamma | test_score | |
---|---|---|---|
0 | 0.1 | 0.00001 | 0.499117 |
1 | 0.1 | 0.00010 | 0.499117 |
2 | 0.1 | 0.00100 | 0.657018 |
3 | 0.1 | 0.01000 | 0.712475 |
4 | 0.1 | 0.10000 | 0.656665 |
5 | 1.0 | 0.00001 | 0.499117 |
6 | 1.0 | 0.00010 | 0.658783 |
7 | 1.0 | 0.00100 | 0.721661 |
8 | 1.0 | 0.01000 | 0.756278 |
9 | 1.0 | 0.10000 | 0.732960 |
10 | 10.0 | 0.00001 | 0.659136 |
11 | 10.0 | 0.00010 | 0.722013 |
12 | 10.0 | 0.00100 | 0.753098 |
13 | 10.0 | 0.01000 | 0.765818 |
14 | 10.0 | 0.10000 | 0.716713 |
#Non interactive 3D
feat1 = gs_svm['C']
feat2 = gs_svm['gamma']
feat3 = gs_svm['test_score']
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(feat1, feat2, feat3)
# Set labels for each axis
ax.set_xlabel('C')
ax.set_ylabel('gamma')
ax.set_zlabel('score')
# Show the plot
plt.title('Validation Results for Various C and gamma Values')
plt.show()
#Plotting them in interactive form
import plotly.graph_objects as go
feat1 = gs_svm['C']
feat2 = gs_svm['gamma']
feat3 = gs_svm['test_score']
fig = go.Figure(data=[go.Scatter3d(x=feat1, y=feat2, z=feat3, mode='markers')])
fig.update_layout(scene=dict(xaxis_title='C', yaxis_title='gamma', zaxis_title='Score'), title='Validation Results for Various C and gamma Values')
fig.show()
searcher.best_params_
{'C': 10, 'gamma': 0.01}
searcher.best_score_
0.7658178622842934
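The selected C=10 is the largest value in the grid, which suggests that the optimum may lie outside the searched range. A small helper, hypothetical and not part of scikit-learn, can flag such cases so the grid can be widened:

```python
def params_on_grid_edge(best_params, param_grid):
    """Return {name: 'lower' | 'upper'} for best values that sit on a grid boundary."""
    flagged = {}
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        if len(values) < 2:
            continue  # a single candidate has no interior
        if best == values[0]:
            flagged[name] = "lower"
        elif best == values[-1]:
            flagged[name] = "upper"
    return flagged

# Grid and best parameters copied from the SVM search above
svm_grid = {"C": [0.1, 1, 10], "gamma": [0.00001, 0.0001, 0.001, 0.01, 0.1]}
svm_best = {"C": 10, "gamma": 0.01}
print(params_on_grid_edge(svm_best, svm_grid))  # {'C': 'upper'}
```

Here gamma=0.01 is interior to its grid, so only C is flagged; adding larger values such as 100 to the C grid would show whether the score keeps improving.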
svm_tuned = SVC(C=10, gamma=0.01)
svm_tuned.fit(X_train, y_train)
print("Training accuracy: ", svm_tuned.score(X_train, y_train).round(4))
predicted = svm_tuned.predict(X_val)
print(confusion_matrix(y_val, predicted))
print(classification_report(y_val, predicted))
print('Validation accuracy: ', accuracy_score(y_val, predicted).round(4))
Training accuracy: 0.8516
[[169  29  29]
 [ 29  49  49]
 [  8  19 327]]
              precision    recall  f1-score   support
           0       0.82      0.74      0.78       227
           1       0.51      0.39      0.44       127
           2       0.81      0.92      0.86       354
    accuracy                           0.77       708
   macro avg       0.71      0.68      0.69       708
weighted avg       0.76      0.77      0.76       708
Validation accuracy: 0.7698
result = permutation_importance(svm_tuned, X_val, y_val, n_repeats=10, random_state=42)
importances = result.importances_mean.round(5)
important_feat_svm = pd.DataFrame(importances, df_scaled.columns[:-1], columns=['importance'])
important_feat_svm.sort_values(by='importance', ascending=False)
importance | |
---|---|
Curricular units 2nd sem (approved) | 0.20042 |
Curricular units 1st sem (approved) | 0.12429 |
Curricular units 2nd sem (enrolled) | 0.04958 |
Curricular units 2nd sem (grade) | 0.04294 |
Tuition fees up to date | 0.03729 |
Curricular units 1st sem (enrolled) | 0.02161 |
Course | 0.02020 |
Age at enrollment | 0.01271 |
Curricular units 1st sem (credited) | 0.01215 |
Unemployment rate | 0.00847 |
Curricular units 1st sem (evaluations) | 0.00833 |
Mother's occupation | 0.00749 |
Scholarship holder | 0.00593 |
Application order | 0.00466 |
Previous qualification | 0.00438 |
Debtor | 0.00424 |
Admission grade | 0.00424 |
Previous qualification (grade) | 0.00424 |
Application mode | 0.00353 |
Father's occupation | 0.00339 |
Father's qualification | 0.00339 |
Displaced | 0.00240 |
Curricular units 2nd sem (credited) | 0.00226 |
Curricular units 2nd sem (evaluations) | 0.00212 |
Educational special needs | 0.00198 |
Curricular units 1st sem (grade) | 0.00169 |
Inflation rate | 0.00099 |
GDP | 0.00099 |
International | 0.00028 |
Curricular units 2nd sem (without evaluations) | 0.00014 |
Nacionality | -0.00042 |
Curricular units 1st sem (without evaluations) | -0.00141 |
Mother's qualification | -0.00240 |
Gender | -0.00268 |
Daytime/evening attendance\t | -0.00339 |
Marital status | -0.00353 |
important_feat_svm.plot(kind='barh', figsize=(12, 10))
predicted = svm_tuned.predict(X_test)
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))
print(accuracy_score(y_test, predicted).round(4))
sns.heatmap(confusion_matrix(y_test, predicted)/np.sum(confusion_matrix(y_test, predicted)), fmt='.2%',annot=True, cmap='Blues',cbar=False)
[[204  33  47]
 [ 43  50  66]
 [ 18  27 397]]
              precision    recall  f1-score   support
           0       0.77      0.72      0.74       284
           1       0.45      0.31      0.37       159
           2       0.78      0.90      0.83       442
    accuracy                           0.74       885
   macro avg       0.67      0.64      0.65       885
weighted avg       0.72      0.74      0.72       885
0.7356
SGDClassifier acts like a linear SVM when it uses the hinge loss function. With the log loss function it behaves like a Logistic Regression classifier. Since I already ran an SVM, I will use loss='log' and fit a Logistic Regression classifier. (From scikit-learn 1.1 this option is spelled loss='log_loss'; the 'log' alias was removed in version 1.3.)
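For reference, with labels $y \in \{-1, +1\}$ and a linear score $f(x) = w^\top x + b$, the two losses mentioned above take their standard textbook forms (stated here as background, not taken from the notebook):

$$
L_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr),
\qquad
L_{\text{log}}\bigl(y, f(x)\bigr) = \log\bigl(1 + e^{-y\,f(x)}\bigr)
$$

Both penalize small or negative margins $y\,f(x)$. The hinge loss is exactly zero once the margin exceeds 1, while the log loss stays positive and yields class-probability estimates. With three target classes, SGDClassifier fits one such binary problem per class in a one-versus-all scheme.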
sgd = SGDClassifier()
cv_results = cross_val_score(sgd, X,y, cv=kf)
print('Mean of cross validation results is {}, standard deviation is {}'.format(np.mean(cv_results).round(4), np.std(cv_results).round(4)))
Mean of cross validation results is 0.7344, standard deviation is 0.0125
print('95% confidence interval is {}'.format(np.quantile(cv_results, [0.025, 0.975]).round(4)))
95% confidence interval is [0.722 0.7549]
linear_classifier = SGDClassifier(random_state=0)
parameters = {'alpha':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
'loss':['log'], 'penalty':['l1','l2']}
searcher = GridSearchCV(linear_classifier, parameters, cv=10)
searcher.fit(X_train, y_train)
# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)
print("Test accuracy of best grid search hypers:", searcher.score(X_val, y_val))
Best CV params {'alpha': 0.01, 'loss': 'log', 'penalty': 'l2'} Best CV accuracy 0.7576892450107003 Test accuracy of best grid search hypers: 0.7754237288135594
sgd_tuned = SGDClassifier(alpha=0.01, loss='log', penalty='l2')
gs_sgd =pd.DataFrame(searcher.cv_results_['params'])
gs_sgd['test_score'] = searcher.cv_results_['mean_test_score']
gs_sgd
alpha | loss | penalty | test_score | |
---|---|---|---|---|
0 | 0.00001 | log | l1 | 0.702930 |
1 | 0.00001 | log | l2 | 0.700817 |
2 | 0.00010 | log | l1 | 0.739312 |
3 | 0.00010 | log | l2 | 0.735780 |
4 | 0.00100 | log | l1 | 0.755565 |
5 | 0.00100 | log | l2 | 0.755915 |
6 | 0.01000 | log | l1 | 0.756617 |
7 | 0.01000 | log | l2 | 0.757689 |
8 | 0.10000 | log | l1 | 0.699755 |
9 | 0.10000 | log | l2 | 0.742500 |
10 | 1.00000 | log | l1 | 0.499117 |
11 | 1.00000 | log | l2 | 0.691640 |
#Non interactive 3D
feat1 = gs_sgd['alpha']
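# Encode the penalty for plotting: l1 -> 0, l2 -> 1 (the comparison with x was missing, so every row mapped to 0)
feat2 = gs_sgd['penalty'].apply(lambda x: 0 if x == 'l1' else 1)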
feat3 = gs_sgd['test_score']
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(feat1, feat2, feat3)
# Set labels for each axis
ax.set_xlabel('alpha')
ax.set_ylabel('penalty (L1=0, L2=1)')
ax.set_zlabel('score')
# Show the plot
plt.title('Validation Results for Various alpha and penalty values')
plt.show()
import plotly.graph_objects as go
feat1 = gs_sgd['alpha']
feat2 = gs_sgd['penalty']
feat3 = gs_sgd['test_score']
fig = go.Figure(data=[go.Scatter3d(x=feat1, y=feat2, z=feat3, mode='markers')])
fig.update_layout(scene=dict(xaxis_title='alpha', yaxis_title='penalty', zaxis_title='Score'), title='Validation Results for Various alpha and Penalty Values')
fig.show()
sgd_tuned.fit(X_train, y_train)
print("Training accuracy: ", sgd_tuned.score(X_train, y_train).round(4))
predicted = sgd_tuned.predict(X_val)
print(confusion_matrix(y_val, predicted))
print(classification_report(y_val, predicted))
print("Validation accuracy: ", accuracy_score(y_val, predicted).round(4))
Training accuracy: 0.7637
[[186  11  30]
 [ 35  30  62]
 [ 13   6 335]]
              precision    recall  f1-score   support
           0       0.79      0.82      0.81       227
           1       0.64      0.24      0.34       127
           2       0.78      0.95      0.86       354
    accuracy                           0.78       708
   macro avg       0.74      0.67      0.67       708
weighted avg       0.76      0.78      0.75       708
Validation accuracy: 0.7782
rfe = RFE(estimator=sgd_tuned, n_features_to_select=10)
rfe.fit(X_train, y_train)
sgd_ranking = pd.DataFrame(rfe.ranking_, df_scaled.columns[:-1], columns=['Ranking'])
sgd_ranking.sort_values(by='Ranking', ascending=True)
Ranking | |
---|---|
Curricular units 1st sem (approved) | 1 |
Tuition fees up to date | 1 |
Curricular units 1st sem (enrolled) | 1 |
Curricular units 2nd sem (credited) | 1 |
Curricular units 2nd sem (enrolled) | 1 |
Scholarship holder | 1 |
Course | 1 |
Curricular units 2nd sem (approved) | 1 |
Curricular units 2nd sem (grade) | 1 |
Curricular units 2nd sem (evaluations) | 1 |
Curricular units 1st sem (evaluations) | 2 |
Age at enrollment | 3 |
Mother's occupation | 4 |
Previous qualification (grade) | 5 |
Gender | 6 |
Debtor | 7 |
Curricular units 1st sem (credited) | 8 |
International | 9 |
Nacionality | 10 |
Marital status | 11 |
Admission grade | 12 |
Mother's qualification | 13 |
Unemployment rate | 14 |
Application order | 15 |
Curricular units 1st sem (without evaluations) | 16 |
Curricular units 1st sem (grade) | 17 |
Displaced | 18 |
GDP | 19 |
Father's occupation | 20 |
Father's qualification | 21 |
Inflation rate | 22 |
Application mode | 23 |
Previous qualification | 24 |
Daytime/evening attendance\t | 25 |
Curricular units 2nd sem (without evaluations) | 26 |
Educational special needs | 27 |
sgd_ranking.plot(kind='barh', figsize=(12,10))
predicted = sgd_tuned.predict(X_test)
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))
print(accuracy_score(y_test, predicted).round(4))
sns.heatmap(confusion_matrix(y_test, predicted)/np.sum(confusion_matrix(y_test, predicted)), fmt='.2%',annot=True, cmap='Blues',cbar=False)
[[225  12  47]
 [ 50  29  80]
 [ 19   6 417]]
              precision    recall  f1-score   support

           0       0.77      0.79      0.78       284
           1       0.62      0.18      0.28       159
           2       0.77      0.94      0.85       442

    accuracy                           0.76       885
   macro avg       0.72      0.64      0.64       885
weighted avg       0.74      0.76      0.72       885

0.7582
rfc = RandomForestClassifier()
print(rfc.get_params())
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
cv_results = cross_val_score(rfc,X,y, cv=kf)
cv_results
array([0.76977401, 0.76553672, 0.77259887, 0.79096045, 0.79207921])
print('Mean of cross validation results is {}, standard deviation is {}'.format(np.mean(cv_results).round(4), np.std(cv_results).round(4)))
Mean of cross validation results is 0.7782, standard deviation is 0.0111
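# Empirical 2.5% and 97.5% quantiles of the five fold scores (a spread measure, not a confidence interval for the mean)
print('2.5%-97.5% quantile range of the fold scores is {}'.format(np.quantile(cv_results, [0.025, 0.975]).round(4)))
2.5%-97.5% quantile range of the fold scores is [0.766 0.792]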
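With only five fold scores, the empirical 2.5% and 97.5% quantiles mostly reflect the gap between the lowest and highest fold rather than uncertainty about the mean accuracy. A common alternative is a t-based interval for the mean. The sketch below is illustrative only: it copies the fold scores from the output above, hard-codes the two-sided 95% critical value of Student's t for 4 degrees of freedom (about 2.776), and treats the folds as roughly independent, which is an approximation because the training sets of the folds overlap.

```python
import math
import statistics

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.76977401, 0.76553672, 0.77259887, 0.79096045, 0.79207921]

# Assumption: two-sided 95% critical value of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776

mean_score = statistics.mean(fold_scores)
# statistics.stdev is the sample standard deviation (ddof=1), unlike np.std's default (ddof=0).
std_err = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
ci_lower = mean_score - T_CRIT_DF4 * std_err
ci_upper = mean_score + T_CRIT_DF4 * std_err

print(f"mean = {mean_score:.4f}, 95% t-interval = [{ci_lower:.4f}, {ci_upper:.4f}]")
```

The interval comes out somewhat wider than the quantile range because it accounts for the small number of folds.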
params = {'n_estimators':[120, 140, 160], 'max_depth':[8,10, 12], 'min_samples_leaf': [2, 3]}
rf_cv= GridSearchCV(estimator=rfc, param_grid=params, cv=3)
rf_cv.fit(X_train, y_train)
pred = rf_cv.predict(X_val)
accuracy_score(y_val, pred)  # scikit-learn's argument order is (y_true, y_pred); accuracy is the same either way
0.7909604519774012
rf_cv.best_params_
{'max_depth': 12, 'min_samples_leaf': 2, 'n_estimators': 140}
gs_rf =pd.DataFrame(rf_cv.cv_results_['params'])
gs_rf['test_score'] = rf_cv.cv_results_['mean_test_score']
gs_rf
index | max_depth | min_samples_leaf | n_estimators | test_score |
---|---|---|---|---|
0 | 8 | 2 | 120 | 0.761918 |
1 | 8 | 2 | 140 | 0.757681 |
2 | 8 | 2 | 160 | 0.758388 |
3 | 8 | 3 | 120 | 0.757329 |
4 | 8 | 3 | 140 | 0.760154 |
5 | 8 | 3 | 160 | 0.761566 |
6 | 10 | 2 | 120 | 0.762628 |
7 | 10 | 2 | 140 | 0.762626 |
8 | 10 | 2 | 160 | 0.765451 |
9 | 10 | 3 | 120 | 0.762629 |
10 | 10 | 3 | 140 | 0.761565 |
11 | 10 | 3 | 160 | 0.761919 |
12 | 12 | 2 | 120 | 0.762626 |
13 | 12 | 2 | 140 | 0.767924 |
14 | 12 | 2 | 160 | 0.763330 |
15 | 12 | 3 | 120 | 0.764036 |
16 | 12 | 3 | 140 | 0.761213 |
17 | 12 | 3 | 160 | 0.762979 |
import matplotlib.pyplot as plt  # needed for plt below; harmless if already imported earlier

# Mean CV score for each value of one hyperparameter, averaged over the other two
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(12, 10))
for axis, param in zip(ax, ['max_depth', 'min_samples_leaf', 'n_estimators']):
    # groupby returns the values in sorted order, which iterating over a set does not guarantee
    mean_scores = gs_rf.groupby(param)['test_score'].mean()
    axis.plot(mean_scores.index, mean_scores.values)
    axis.set_title(f'{param} vs mean score')
plt.show()
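# The grid search above reported n_estimators=140 as best; 160 is used here, so the results below are for 160 trees.
rfc_tuned = RandomForestClassifier(max_depth=12, min_samples_leaf=2, n_estimators=160)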
rfc_tuned.fit(X_train, y_train)
print("Training accuracy: ", rfc_tuned.score(X_train, y_train).round(4))
predicted = rfc_tuned.predict(X_val)
print(confusion_matrix(y_val, predicted))
print(classification_report(y_val, predicted))
print("Validation accuracy: ", accuracy_score(y_val, predicted).round(4))
Training accuracy:  0.941
[[177  17  33]
 [ 32  49  46]
 [  8  14 332]]
              precision    recall  f1-score   support

           0       0.82      0.78      0.80       227
           1       0.61      0.39      0.47       127
           2       0.81      0.94      0.87       354

    accuracy                           0.79       708
   macro avg       0.75      0.70      0.71       708
weighted avg       0.78      0.79      0.77       708

Validation accuracy:  0.7881
importances_rf = pd.DataFrame(rfc_tuned.feature_importances_, index=df_scaled.columns[:-1], columns=['importance'])
importances_rf.plot(kind='barh', figsize=(12,10))
importances_rf.sort_values(by='importance', ascending=False)
Feature | importance |
---|---|
Curricular units 2nd sem (approved) | 0.191417 |
Curricular units 1st sem (approved) | 0.120816 |
Curricular units 2nd sem (grade) | 0.113790 |
Curricular units 1st sem (grade) | 0.056863 |
Tuition fees up to date | 0.046207 |
Curricular units 2nd sem (evaluations) | 0.040774 |
Age at enrollment | 0.035964 |
Admission grade | 0.034893 |
Curricular units 1st sem (evaluations) | 0.032394 |
Course | 0.030323 |
Previous qualification (grade) | 0.027412 |
Father's occupation | 0.021902 |
Curricular units 2nd sem (enrolled) | 0.021896 |
Mother's occupation | 0.020304 |
GDP | 0.019955 |
Application mode | 0.019282 |
Curricular units 1st sem (enrolled) | 0.019224 |
Unemployment rate | 0.018000 |
Inflation rate | 0.017083 |
Mother's qualification | 0.016027 |
Father's qualification | 0.015746 |
Scholarship holder | 0.013344 |
Debtor | 0.010588 |
Gender | 0.010182 |
Application order | 0.009520 |
Displaced | 0.006856 |
Curricular units 1st sem (credited) | 0.005336 |
Previous qualification | 0.005263 |
Curricular units 2nd sem (credited) | 0.004465 |
Curricular units 2nd sem (without evaluations) | 0.004153 |
Curricular units 1st sem (without evaluations) | 0.003754 |
Marital status | 0.002501 |
Daytime/evening attendance\t | 0.001990 |
Nacionality | 0.001062 |
International | 0.000501 |
Educational special needs | 0.000214 |
predicted = rfc_tuned.predict(X_test)
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))
print(accuracy_score(y_test, predicted))
sns.heatmap(confusion_matrix(y_test, predicted)/np.sum(confusion_matrix(y_test, predicted)), fmt='.2%',annot=True, cmap='Blues',cbar=False)
[[210  22  52]
 [ 37  53  69]
 [ 12  22 408]]
              precision    recall  f1-score   support

           0       0.81      0.74      0.77       284
           1       0.55      0.33      0.41       159
           2       0.77      0.92      0.84       442

    accuracy                           0.76       885
   macro avg       0.71      0.67      0.68       885
weighted avg       0.74      0.76      0.74       885

0.7581920903954802
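Combining the top 15 most important features of each classifier. Concatenating the four tables column-wise and dropping rows with missing values keeps only the features that appear in all four top-15 lists. Note that the SGD column holds RFE ranks (1 = selected), not importance scores.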
imp_features = pd.concat(
    [
        important_feat_knn.sort_values(by='importance', ascending=False).iloc[:15, :],
        important_feat_svm.sort_values(by='importance', ascending=False).iloc[:15, :],
        sgd_ranking.sort_values(by='Ranking').iloc[:15, :],
        importances_rf.sort_values(by='importance', ascending=False).iloc[:15, :],
    ],
    axis=1,
)
imp_features.columns=['KNN', "SVM", "SGD", "RF"]
imp_features.dropna(inplace=True)
imp_features
Feature | KNN | SVM | SGD | RF |
---|---|---|---|---|
Curricular units 2nd sem (approved) | 0.02062 | 0.20042 | 1.0 | 0.191417 |
Tuition fees up to date | 0.01412 | 0.03729 | 1.0 | 0.046207 |
Curricular units 1st sem (approved) | 0.00395 | 0.12429 | 1.0 | 0.120816 |
Curricular units 2nd sem (grade) | 0.00353 | 0.04294 | 1.0 | 0.113790 |
Mother's occupation | 0.00042 | 0.00749 | 4.0 | 0.020304 |
In this project, I used four classification models to predict student outcomes: dropout, enrolled, and graduate. For each model, I ran cross-validation on the combined training and validation data, used grid search to find good hyperparameters, and refit the model with them. I also identified the features that most influenced each model's predictions. Finally, I applied the tuned models to unseen test data.

Random Forest reached a training accuracy of 94%, but its accuracy fell to 79% on the validation set and 76% on the test set. This gap indicates that Random Forest overfits, and it is the most affected of the four models. SVM also overfits, though to a lesser extent. KNN has the lowest accuracy of the models.

The most generalizable model appears to be the SGD classifier, which behaves similarly to a logistic regression classifier: its training (76%), validation (78%), and test (76%) accuracies are close to one another. Its test accuracy of 76% is not exceptional, and Random Forest matches it on the test set (both 0.7582), but no other model scored higher.

All models performed reasonably well on the "Dropout" and "Graduate" classes but struggled with the "Enrolled" class. On the test set, recall for "Enrolled" was only 0.18 for SGD and 0.33 for Random Forest, compared with 0.74 to 0.94 for the other two classes. This may be because "Enrolled" is not a final outcome in the way "Dropout" and "Graduate" are. Dropout and graduation are mutually exclusive: a student who graduated did not drop out, and vice versa. A student in the "Enrolled" class, however, may still go on to either drop out or graduate, and every student who dropped out or graduated was once enrolled. If the dataset were restricted to the two classes "Graduate" and "Dropout", I believe the models would achieve significantly higher accuracy.
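The per-class behaviour described above can be checked directly from the test-set confusion matrices printed earlier for SGD and Random Forest. The following is a minimal sketch, not part of the original analysis: the matrices are copied by hand from those outputs, the helper names `per_class_recall` and `overall_accuracy` are hypothetical, and the label order 0 = Dropout, 1 = Enrolled, 2 = Graduate is assumed.

```python
# Test-set confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class).
# Assumption: labels are encoded as 0 = Dropout, 1 = Enrolled, 2 = Graduate.
CLASSES = ["Dropout", "Enrolled", "Graduate"]
cm_sgd = [[225, 12, 47], [50, 29, 80], [19, 6, 417]]
cm_rf = [[210, 22, 52], [37, 53, 69], [12, 22, 408]]


def per_class_recall(cm):
    """Recall for each class: diagonal entry divided by its row total."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Sum of the diagonal divided by the total number of samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


for name, cm in [("SGD", cm_sgd), ("Random Forest", cm_rf)]:
    recalls = per_class_recall(cm)
    summary = ", ".join(f"{c}: {r:.2f}" for c, r in zip(CLASSES, recalls))
    print(f"{name}: accuracy {overall_accuracy(cm):.4f}; recall {summary}")
```

Reading recall per class alongside overall accuracy makes it clear that two models with the same accuracy can differ considerably in how they treat the "Enrolled" class.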