Cross-validation: Revisiting the Wine Project

This project's goal was to perform a supervised model exploration of the wine data, then apply a variety of cross-validation methods in order to evaluate which method was most appropriate and to determine whether we were properly tuning our model of choice.

a) What is cross-validation and when is it usually used?

Cross-validation is a statistical method that repeatedly splits the data and trains a model on each split. Evaluating performance across the various splits yields a more stable estimate of the chosen scoring metric (mean squared error for regression, for example, or accuracy for a classification task like this one) than any single split would. Its purpose is to evaluate how well a specified algorithm will function when trained on a specific dataset.
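As a quick illustration, here is a minimal sketch of cross-validation in scikit-learn; the data URL, fold count, and default k-NN settings are assumptions for illustration, not necessarily what this project used:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumed data source: the UCI red-wine quality CSV (semicolon-delimited).
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

# cross_val_score splits the data cv times, trains a fresh model on each
# training portion, and scores it on the corresponding held-out portion.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print("Per-fold scores:", scores)
print("Mean score:", scores.mean())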

b) Explain the differences between cross-validation strategies in evaluating your chosen machine learning model’s performance.

First, I should point out that I ultimately chose the k-Nearest Neighbors algorithm for this dataset. The cross-validation (CV) methods used were standard k-fold with the default of 3 folds, then 5 and 10 folds, followed by stratified k-fold, leave-one-out, and shuffle-split k-fold.

Three folds were no better than not doing CV at all; that simply is not enough folding to show the variation or to mix up the training and testing data. Five folds are somewhat more common and begin to show that the splits can produce quite a variety of results depending on the “luck” of the random division of the data. Ten folds are the most commonly used number because they produce enough splits to show whether the model is reasonably consistent no matter the split, or whether its accuracy varies hugely from split to split. Leave-one-out is one of the best methods of cross-validation; however, it is a huge resource hog, especially as the dataset grows, because it fits one model per sample. If I had to choose one of the methods we tested, shuffle-split makes the most sense to me because it mixes up the order of the data before it makes each split. The benefit of this shuffling is that datasets are often sorted in some fashion; depending on which variable(s) serve as the critical pivot points, you could end up with a badly overfitted or underfitted model simply because of where the split fell in the original dataset.
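To make the comparison concrete, here is a sketch that runs each of the strategies discussed above on the same k-NN model. The data URL, the default n_neighbors, and the split parameters are illustrative assumptions:

import pandas as pd
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

knn = KNeighborsClassifier()

strategies = {
    "k-fold (3 splits)": KFold(n_splits=3),
    "k-fold (5 splits)": KFold(n_splits=5),
    "k-fold (10 splits)": KFold(n_splits=10),
    # Stratified k-fold keeps the class proportions equal in every fold.
    "stratified k-fold (5 splits)": StratifiedKFold(n_splits=5),
    # ShuffleSplit randomizes the row order before each split is taken.
    "shuffle split": ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    # "leave-one-out": LeaveOneOut(),  # one model per sample -- very costly
}

for name, cv in strategies.items():
    scores = cross_val_score(knn, X, y, cv=cv)
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")

Comparing the standard deviations across strategies is what reveals how sensitive the model is to where a split happens to fall.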

c) What’s the purpose of splitting into training, validation and testing sets for certain models?

Models often tend to memorize their training data. By keeping the test set out of the mix while training the model, and using the validation set to tune the hyperparameters, we still have a “clean” set of data with which to evaluate the final tuned version of the model.
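A minimal sketch of that three-way split, assuming the same UCI data; the split proportions and the grid of k values are illustrative assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

# First carve off the untouched test set, then split the remainder
# into training and validation (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune the hyperparameter against the validation set only.
best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9]:
    score = KNeighborsClassifier(n_neighbors=k).fit(
        X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Only the final, tuned model ever sees the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print("Best k:", best_k, "Test accuracy:", final.score(X_test, y_test))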

d) What is a false positive and a false negative in binary classification?

A false positive is when the model incorrectly classifies a sample as belonging to the positive class when it should not be included. On the opposite side of the spectrum, a false negative is when the model classifies a sample as not belonging to the class when it actually should have been included.
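Here is a sketch that surfaces both error types with scikit-learn's confusion_matrix. Binarizing quality into “good” (6 or above) versus “not good” is an assumption made here purely to obtain a binary problem:

import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X = wine.drop(columns="quality")
y = (wine["quality"] >= 6).astype(int)  # 1 = "good" wine (assumed cutoff)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(f"False positives: {fp}, false negatives: {fn}")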

e) Why is accuracy alone not a good measure for machine learning algorithms?

Because we don’t really know whether the training data is biased toward particular classes, relying solely on accuracy can lead to an extremely faulty model that then becomes improperly tuned. By using cross-validation strategies, we get a glimpse into the imbalance of the data splits and can decide whether the scores from each of the tests are close enough to the mean to indicate that the model fits well across a variety of data splits.
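A short sketch of the pitfall: on an imbalanced binary target, a baseline that always predicts the majority class already looks accurate. The “excellent wine” cutoff used here (quality of 7 or above, a rare class) is an assumption for illustration:

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X = wine.drop(columns="quality")
y = (wine["quality"] >= 7).astype(int)  # rare positive class (assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# A "model" that always predicts the majority class, never the rare one.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

print("Accuracy:", dummy.score(X_test, y_test))  # looks respectable
print("F1 score:", f1_score(y_test, dummy.predict(X_test)))  # 0.0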

Sample of the Dataset

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
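For reference, a sample like the one above can be produced as follows, assuming the red-wine data comes from the UCI repository (the post does not state its source):

import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")  # the UCI file is semicolon-delimited
print(wine.head())  # first five rows, matching the sample above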


With n_splits=9 and a static random_state value, I was able to get consistent results from my K-Fold cross-validation testing.

K-Fold Cross-validation scores:
[0.60264901 0.60927152 0.62251656 0.63576159 0.62913907 0.61589404 0.56953642 0.55629139 0.61589404]

Mean/Average K-Fold cross-validation score: 0.61
Then, running our test set against the model:
PCA component shape: (6, 10)
Test set accuracy: 0.61
K-Fold Cross-validation scores:
[0.60927152 0.59602649 0.61589404 0.58278146 0.66887417 0.65562914 0.60264901 0.54304636 0.59602649]

Mean/Average K-Fold cross-validation score: 0.61
With the modifications of scaling the features and selecting the best principal components of the dataset, we can now see that the model's test set accuracy matches the mean K-Fold cross-validation score, which suggests the model is fitting consistently.
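Putting the pieces together, here is a sketch of a pipeline along the lines described above: scale the features, reduce to the best principal components, then run k-NN with 9-fold cross-validation under a fixed random_state. The component count, neighbor count, and random_state value are assumptions; note that the output above shows a PCA component shape of (6, 10), implying ten input features, while this sketch keeps all eleven and will therefore print (6, 11):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(URL, sep=";")
X, y = wine.drop(columns="quality"), wine["quality"]

# Hold out a final test set; everything else feeds cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling and PCA sit inside the pipeline, so each CV fold is fitted
# using only its own training portion (no leakage from held-out data).
model = make_pipeline(StandardScaler(),
                      PCA(n_components=6),   # assumed component count
                      KNeighborsClassifier())

kfold = KFold(n_splits=9, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("K-Fold cross-validation scores:", scores)
print("Mean K-Fold cross-validation score:", round(scores.mean(), 2))

model.fit(X_train, y_train)
print("PCA component shape:", model.named_steps["pca"].components_.shape)
print("Test set accuracy:", round(model.score(X_test, y_test), 2))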
