Cross-validation: Revisiting the Wine Project

This project's goal was to perform a supervised model exploration of the wine data, then apply a variety of cross-validation methods in order to evaluate which method was most appropriate and to determine whether we were properly tuning our model of choice.

a) What is cross-validation and when is it usually used?

Cross-validation is a statistical method that repeatedly splits the data and trains a model on each split. Evaluating performance across the various splits yields a more stable estimate of the chosen scoring metric (mean squared error for regression, for example, or accuracy for a classification task like this one) than any single split would. Its purpose is to evaluate how well a specified algorithm will function when trained on a specific dataset.
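As a quick illustration, here is a minimal sketch of cross-validation in scikit-learn; the data URL, fold count, and default k-NN settings are assumptions for illustration, not necessarily what this project used:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumed data source: the UCI red-wine quality CSV (semicolon-delimited).
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

# cross_val_score splits the data cv times, trains a fresh model on each
# training portion, and scores it on the corresponding held-out portion.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print("Per-fold scores:", scores)
print("Mean score:", scores.mean())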

b) Explain the differences between cross-validation strategies in evaluating your chosen machine learning model’s performance.

First, I should point out that I ultimately chose the k-Nearest Neighbors algorithm for this dataset. The cross-validation (CV) methods used were standard k-fold with the default of 3 folds, then 5 and 10 folds, followed by stratified k-fold, leave-one-out, and shuffle-split k-fold.

Three folds were no better than not doing CV at all; that simply is not enough folding to show the variation or to mix up the training and testing data. Five folds are somewhat more common and begin to show that the splits can produce quite a variety of results depending on the “luck” of the random division of the data. Ten folds are the most commonly used number because they produce enough splits to show whether the model is reasonably consistent no matter the split, or whether its accuracy varies hugely from split to split. Leave-one-out is one of the best methods of cross-validation; however, it is a huge resource hog, especially as the dataset grows, because it fits one model per sample. If I had to choose one of the methods we tested, shuffle-split makes the most sense to me because it mixes up the order of the data before it makes each split. The benefit of this shuffling is that datasets are often sorted in some fashion; depending on which variable(s) serve as the critical pivot points, you could end up with a badly overfitted or underfitted model simply because of where the split fell in the original dataset.
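To make the comparison concrete, here is a sketch that runs each of the strategies discussed above on the same k-NN model. The data URL, the default n_neighbors, and the split parameters are illustrative assumptions:

import pandas as pd
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

knn = KNeighborsClassifier()

strategies = {
    "k-fold (3 splits)": KFold(n_splits=3),
    "k-fold (5 splits)": KFold(n_splits=5),
    "k-fold (10 splits)": KFold(n_splits=10),
    # Stratified k-fold keeps the class proportions equal in every fold.
    "stratified k-fold (5 splits)": StratifiedKFold(n_splits=5),
    # ShuffleSplit randomizes the row order before each split is taken.
    "shuffle split": ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    # "leave-one-out": LeaveOneOut(),  # one model per sample -- very costly
}

for name, cv in strategies.items():
    scores = cross_val_score(knn, X, y, cv=cv)
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")

Comparing the standard deviations across strategies is what reveals how sensitive the model is to where a split happens to fall.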

c) What’s the purpose of splitting into training, validation and testing sets for certain models?

Models often tend to memorize their training data. By keeping the test set out of the mix while training the model, and using the validation set to tune the hyperparameters, we still have a “clean” set of data with which to evaluate the final tuned version of the model.
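A minimal sketch of that three-way split, assuming the same UCI data; the split proportions and the grid of k values are illustrative assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

# First carve off the untouched test set, then split the remainder
# into training and validation (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune the hyperparameter against the validation set only.
best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9]:
    score = KNeighborsClassifier(n_neighbors=k).fit(
        X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Only the final, tuned model ever sees the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("Best k:", best_k, "Test accuracy:", final.score(X_test, y_test))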

d) What is a false positive and a false negative in binary classification?

A false positive is when the model incorrectly classifies a sample as belonging to the positive class when it should not be included. On the opposite side of the spectrum, a false negative is when the model classifies a sample as not belonging to the class when it actually should have been included.
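Here is a sketch that surfaces both error types with scikit-learn's confusion_matrix. Binarizing quality into “good” (6 or above) versus “not good” is an assumption made here purely to obtain a binary problem:

import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X = wine.drop(columns="quality")
y = (wine["quality"] >= 6).astype(int)  # 1 = "good" wine (assumed cutoff)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"False positives: {fp}, false negatives: {fn}")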

e) Why is accuracy alone not a good measure for machine learning algorithms?

Because we don’t really know whether the training data is biased toward particular classes, relying solely on accuracy can lead to an extremely faulty model that then becomes improperly tuned. By using cross-validation strategies, we get a glimpse into the imbalance of the data splits and can decide whether the scores from each of the tests are close enough to the mean to indicate that the model fits well across a variety of data splits.
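A short sketch of the pitfall: on an imbalanced binary target, a baseline that always predicts the majority class already looks accurate. The “excellent wine” cutoff used here (quality of 7 or above, a rare class) is an assumption for illustration:

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X = wine.drop(columns="quality")
y = (wine["quality"] >= 7).astype(int)  # rare positive class (assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# A "model" that always predicts the majority class, never the rare one.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

print("Accuracy:", dummy.score(X_test, y_test))  # looks respectable
print("F1 score:", f1_score(y_test, dummy.predict(X_test)))  # 0.0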

Sample of the Dataset

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
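For reference, a sample like the one above can be produced as follows, assuming the red-wine data comes from the UCI repository (the post does not state its source):

import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")  # the UCI file is semicolon-delimited
print(wine.head())  # first five rows, matching the sample above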


With n_splits=9 and a static random_state value, I was able to get consistent results from my K-Fold cross-validation testing.

K-Fold Cross-validation scores:
[0.60264901 0.60927152 0.62251656 0.63576159 0.62913907 0.61589404 0.56953642 0.55629139 0.61589404]

Mean/Average K-Fold cross-validation score: 0.61
Then, running our test set against the model:
PCA component shape: (6, 10)
Test set accuracy: 0.61
K-Fold Cross-validation scores:
[0.60927152 0.59602649 0.61589404 0.58278146 0.66887417 0.65562914 0.60264901 0.54304636 0.59602649]

Mean/Average K-Fold cross-validation score: 0.61
With the modifications of scaling the features and selecting the best principal components of the dataset, we can now see that the model's test set accuracy matches the mean K-Fold cross-validation score, which suggests the model is fitting consistently.
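Putting the pieces together, here is a sketch of a pipeline along the lines described above: scale the features, reduce to the best principal components, then run k-NN with 9-fold cross-validation under a fixed random_state. The component count, neighbor count, and random_state value are assumptions; note that the output above shows a PCA component shape of (6, 10), implying ten input features, while this sketch keeps all eleven and will therefore print (6, 11):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

# Hold out a final test set; everything else feeds cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling and PCA sit inside the pipeline, so each CV fold is fitted
# using only its own training portion (no leakage from held-out data).
model = make_pipeline(StandardScaler(),
                      PCA(n_components=6),   # assumed component count
                      KNeighborsClassifier())

kfold = KFold(n_splits=9, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("K-Fold cross-validation scores:", scores)
print("Mean K-Fold cross-validation score:", round(scores.mean(), 2))

model.fit(X_train, y_train)
print("PCA component shape:", model.named_steps["pca"].components_.shape)
print("Test set accuracy:", round(model.score(X_test, y_test), 2))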
