The goal of this project was to explore the data with a supervised model, then apply a variety of cross-validation methods to evaluate which method was the most appropriate and to determine whether our model of choice was being tuned properly.
a) What is cross-validation and when is it usually used?
Cross-validation is a statistical method that repeatedly splits the data and trains a separate model on each split. Evaluating performance across several different splits and averaging the scores (accuracy for a classifier, or an error measure such as mean squared error for a regressor) yields a more stable result than any single split. Its purpose is to estimate how well a specified algorithm will perform when trained on a specific dataset, and it is usually used when comparing models or tuning their hyperparameters.
b) Explain the differences between cross-validation strategies in evaluating your chosen machine model’s performance.
First, I should point out that I ultimately chose the k-Nearest Neighbors algorithm for this dataset. The cross-validation (CV) methods used were standard k-fold with the default of 3 folds, then 5 and 10 folds, followed by stratified k-fold, leave-one-out, and shuffle split. Three folds were little better than not doing CV at all: there simply aren’t enough folds to show the variation or to mix up the training and testing data. Five folds are somewhat more common and begin to show that the splits can produce quite a range of results depending on the “luck” of how the data happens to be divided. Ten folds are the most commonly used number because they produce enough splits to show whether the model is reasonably consistent no matter the split, or whether its accuracy swings widely from one split to the next. Stratified k-fold additionally keeps the class proportions roughly the same in every fold, which matters when some classes are rare. Leave-one-out is one of the most thorough forms of cross-validation, but it is a huge resource hog, especially as the dataset grows, because it trains one model per sample. Of the methods we tested, I found shuffle split to make the most sense: it shuffles the order of the data before making each split. The benefit of this shuffling is that datasets are often sorted in some fashion, and depending on which variable(s) the ordering follows, you could end up with a badly overfitted or underfitted model simply because of where the split fell in the original dataset.
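As a rough sketch of how these strategies can be compared side by side (assuming a local, semicolon-separated copy of the red-wine data named winequality-red.csv; the filename and the default kNN settings are assumptions, not the project's actual code):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import (KFold, StratifiedKFold, LeaveOneOut,
                                     ShuffleSplit, cross_val_score)

# Assumed local copy of the UCI red-wine quality data (semicolon-separated).
wine = pd.read_csv("winequality-red.csv", sep=";")
X = wine.drop(columns="quality").values
y = wine["quality"].values

knn = KNeighborsClassifier(n_neighbors=5)

strategies = {
    "3-fold":             KFold(n_splits=3),
    "5-fold":             KFold(n_splits=5),
    "10-fold":            KFold(n_splits=10),
    "stratified 10-fold": StratifiedKFold(n_splits=10),
    "shuffle split":      ShuffleSplit(n_splits=10, test_size=0.2, random_state=42),
    # "leave-one-out": LeaveOneOut(),  # works the same way, but trains one
    # model per sample, which is why it gets expensive as the dataset grows.
}

for name, cv in strategies.items():
    scores = cross_val_score(knn, X, y, cv=cv)
    print(f"{name:>18}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

The standard deviation printed next to each mean is what makes the differences visible: a strategy whose fold scores swing widely is telling you the split “luck” matters for this model and data.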
c) What’s the purpose of splitting into training, validation and testing sets for certain models?
Models often tend to memorize their training data. By keeping the testing set out of the mix while training the model and using the validation set to help tune the hyperparameters, we still have a “clean” set of data to use to test the final tuned version of the model.
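A minimal sketch of that three-way split, using synthetic data and an assumed 60/20/20 division rather than the project's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the real dataset here.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First carve off a held-out test set, then split the remainder into
# training and validation sets (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune a hyperparameter (n_neighbors) using only the validation set.
best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9]:
    score = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Only the final, tuned model ever sees the "clean" test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print(f"best k={best_k}, validation={best_score:.3f}, test={final.score(X_test, y_test):.3f}")
```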
d) What is a false positive and a false negative in binary classification?
A false positive is when the model predicts that an example belongs to the positive class when its true label is negative. On the opposite side of the spectrum, a false negative is when the model predicts that an example does not belong to the positive class when in fact it does.
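In code, both kinds of mistakes can be read straight off a confusion matrix; the tiny labels below are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# True labels vs. what a hypothetical binary classifier predicted.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives:", fp)  # predicted 1 where the true label was 0
print("false negatives:", fn)  # predicted 0 where the true label was 1
```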
e) Why is accuracy alone not a good measure for machine learning algorithms?
Because the data may be imbalanced toward particular classes, a model can score a high accuracy simply by predicting the majority class, so relying solely on accuracy can leave us with a faulty, improperly tuned model. By using cross-validation strategies, we can get a glimpse into how the individual splits behave and decide whether the score from each test is near enough to the mean to indicate that the model fits consistently across a variety of data splits.
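A quick illustration of the imbalance problem (with synthetic, deliberately skewed data rather than the wine data), using a do-nothing classifier that always predicts the majority class:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Roughly 95% of samples belong to class 0, 5% to class 1.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A "model" that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = majority.predict(X_test)

print("accuracy:         ", accuracy_score(y_test, y_pred))           # looks impressive
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))  # 0.5, no better than chance
print("recall (class 1): ", recall_score(y_test, y_pred))             # 0.0, minority class never found
```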
Sample of the Dataset
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
With n_splits=9 and a fixed random_state value, I was able to get consistent results from my k-fold cross-validation testing.
K-Fold Cross-validation scores:
[0.60264901 0.60927152 0.62251656 0.63576159 0.62913907 0.61589404 0.56953642 0.55629139 0.61589404]
Mean/Average K-Fold cross-validation score: 0.61
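A sketch of roughly how those scores could be produced; the seed of 42 is a placeholder, and shuffle=True is an assumption (KFold only uses random_state when shuffling):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score

# Assumed local copy of the UCI red-wine quality data (semicolon-separated).
wine = pd.read_csv("winequality-red.csv", sep=";")
X = wine.drop(columns="quality").values
y = wine["quality"].values

# shuffle=True is required for random_state to have any effect; a fixed seed
# makes the nine folds, and therefore the scores, reproducible between runs.
kfold = KFold(n_splits=9, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=kfold)

print("K-Fold Cross-validation scores:", scores)
print(f"Mean/Average K-Fold cross-validation score: {scores.mean():.2f}")
```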
Then, running our test set against the model:
PCA component shape: (6, 10)
Test set accuracy: 0.61
K-Fold Cross-validation scores:
[0.60927152 0.59602649 0.61589404 0.58278146 0.66887417 0.65562914 0.60264901 0.54304636 0.59602649]
Mean/Average K-Fold cross-validation score: 0.61
With the modifications of scaling the features and keeping only the strongest principal components of the dataset, we can now see that the model’s test set accuracy matches the mean/average k-fold cross-validation score, which is what we would hope to see from a properly tuned model.
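A sketch of how that final setup might look as a pipeline; the StandardScaler, PCA(n_components=6), and default kNN settings here are assumptions inferred from the outputs above, not the exact code that produced them:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Assumed local copy of the UCI red-wine quality data (semicolon-separated).
wine = pd.read_csv("winequality-red.csv", sep=";")
X = wine.drop(columns="quality").values
y = wine["quality"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale first, then keep the strongest principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=6), KNeighborsClassifier())

kfold = KFold(n_splits=9, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("K-Fold Cross-validation scores:", scores)
print(f"Mean/Average K-Fold cross-validation score: {scores.mean():.2f}")

model.fit(X_train, y_train)
print("PCA component shape:", model.named_steps["pca"].components_.shape)
print(f"Test set accuracy: {model.score(X_test, y_test):.2f}")
```

Bundling the scaler and PCA into the pipeline also ensures they are refit on the training portion of every fold, so no information from the held-out portion leaks into the preprocessing.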