In machine learning, many umbrella terms describe the different processes or setups through which “learning” takes place. Today’s discussion covers two of the most commonly used, supervised and unsupervised learning. Here, the distinction between the two lies within the raw data itself.
When setting up a machine learning project, there are a few considerations that shape the nature of the project. The first is the type of data available, either categorical or continuous. The second is whether a ground truth is available. It should be noted that not all numerical data is continuous.
The commonly used Likert scale, in which a question is answered with a response of 1 to 5 (1 = worst; 5 = best), produces categorical data, because no fractional answer such as 1.5034 is possible. A 1.5034 response would have to be sorted into either the 1 or the 2 category, which makes the data type categorical.
By contrast, with continuous data, any number along a number line or curve is a valid data point. An example of categorical data in the field of oil and gas, specifically in drilling, is lithology. An example of continuous data would be something like mechanical specific energy (MSE) or weight on bit (WOB). Whether the data is categorical or continuous is only one consideration.
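To make the Likert point concrete, here is a minimal sketch of sorting a fractional response into a category. The nearest-level rule, the clamping of out-of-range answers, and the names `LIKERT_LEVELS` and `to_likert` are assumptions made for illustration; a real survey might reject such responses instead.

```python
# Allowed answers on a 1-to-5 Likert scale (1 = worst, 5 = best).
LIKERT_LEVELS = (1, 2, 3, 4, 5)


def to_likert(value):
    """Snap a numeric response to the nearest allowed Likert category."""
    # Clamp out-of-range responses to the ends of the scale first (assumption).
    clamped = min(max(value, LIKERT_LEVELS[0]), LIKERT_LEVELS[-1])
    # Pick the level with the smallest distance to the response.
    return min(LIKERT_LEVELS, key=lambda level: abs(level - clamped))


responses = [1.5034, 4.0, 2.49, 7.2]  # made-up survey responses
categories = [to_likert(r) for r in responses]
print(categories)
```

Under this rule the 1.5034 response lands in category 2. A continuous measurement such as weight on bit would simply be kept as the float it is.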
Additionally, there needs to be a decision about whether or not there is some sort of “ground truth” associated with the data.
This “ground truth” functions like an answer key. You, as the human, already know the correct answer before asking the question of an AI or constructing the machine learning model. One version of this is having labeled data available for categorical data, such as a distinct and clear lithology for every depth when trying to classify lithology. Another version would be having something like a wireline log available to compare a predicted wireline log against. Having this ground truth available means that a supervised machine learning model can be selected.
When this type of “ground truth” data isn’t available, the options for a machine learning project are restricted to unsupervised techniques. If you do have this answer key, you can choose either a supervised or an unsupervised approach. Sometimes you can try both to see which yields better results. Some workflows even pit models against each other, an idea closer to adversarial training than to reinforcement learning, which learns from rewards rather than from an answer key.
Basically, if you start with some knowledge that is reliable in the real world, a “ground truth”, then you have more options in creating your artificial world using machine learning.
With this in mind, data quality is a huge consideration when embarking on a project. If only partially labeled data is available, is there enough of it to construct a supervised learning model? Or would an unsupervised approach, which can also draw on the unlabeled data, be the better choice?
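One way to start answering the partial-label question is simply to measure how much of the data is labeled and how the labels are spread across classes. The sketch below assumes unlabeled rows are stored as `None`; the lithology values and the `label_coverage` helper are hypothetical.

```python
from collections import Counter

# Hypothetical lithology labels by depth sample; None marks an unlabeled row.
lithology = ["sandstone", None, "shale", "shale", None, "limestone", "shale", None]


def label_coverage(labels):
    """Return the fraction of rows that carry a label."""
    if not labels:
        return 0.0
    labeled = [label for label in labels if label is not None]
    return len(labeled) / len(labels)


coverage = label_coverage(lithology)
# Count how many examples each class has among the labeled rows.
class_counts = Counter(label for label in lithology if label is not None)

print(f"labeled fraction: {coverage:.3f}")
print(f"examples per class: {dict(class_counts)}")
```

How much coverage is "enough" depends on the problem. A class with only one or two labeled examples is a hint that a supervised model may struggle with it.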
One way to assist in making these decisions is to consider the problem we must address. Is there any previous work that connects the variables or features at play? What is the nature of this relationship? Is it predictable? In the case of oil and gas, which uses geomechanical properties that have been studied for over a century, the answer often lies within the physical relationships being studied. This means that sometimes there’s a way to solve for a missing label algebraically.
This takes some time, because code has to be written to execute that formula. It also means working closely with a petroleum engineer to guarantee the correct use and referencing of all the geomechanical relationships. There’s also the additional step of associating that newly solved label with the dataset in use. But doing so turns a dataset that could only be used for unsupervised learning into a labeled one, which opens up the options under the supervised learning umbrella.
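As an illustration of solving for a missing quantity and attaching it to the dataset, the sketch below computes mechanical specific energy from other drilling measurements. It assumes a widely cited field-unit form of Teale's equation, MSE = WOB/A + 120·π·RPM·T / (A·ROP). The formula choice, the units, the column names, and the sample row are all assumptions to confirm with a petroleum engineer before use.

```python
import math


def bit_area_in2(bit_diameter_in):
    """Cross-sectional area of the bit in square inches."""
    return math.pi * bit_diameter_in ** 2 / 4.0


def teale_mse_psi(wob_lbf, rpm, torque_ftlbf, rop_ft_per_hr, bit_diameter_in):
    """Mechanical specific energy in psi (assumed field-unit Teale form)."""
    if rop_ft_per_hr <= 0:
        raise ValueError("ROP must be positive to compute MSE")
    area = bit_area_in2(bit_diameter_in)
    thrust_term = wob_lbf / area
    rotary_term = 120.0 * math.pi * rpm * torque_ftlbf / (area * rop_ft_per_hr)
    return thrust_term + rotary_term


# Hypothetical rows of drilling data without an MSE column.
rows = [
    {"depth_ft": 9000, "wob_lbf": 20000, "rpm": 120,
     "torque_ftlbf": 5000, "rop_ft_per_hr": 60},
]

# Associate the newly solved value with each row of the dataset.
for row in rows:
    row["mse_psi"] = teale_mse_psi(
        row["wob_lbf"], row["rpm"], row["torque_ftlbf"],
        row["rop_ft_per_hr"], bit_diameter_in=8.5,
    )

print(round(rows[0]["mse_psi"]))
```

Once the derived column exists, it can serve as the target that a supervised model is trained and checked against.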
Testing or validation of the machine learning model plays an important role in the decision-making process. In supervised learning the testing is relatively easy. You compare the results from the model to that “ground truth” or answer key to determine accuracy.
There are some considerations to take into account with under- and over-fitting. However, at the end of the day, calculating that accuracy is just like your teacher grading your test in school. For some business questions, an accuracy of 70% is good enough. For other questions, often those with high risk associated with failure, an accuracy of 95% or better might be required. It should be noted that an accuracy of 100% is a strong warning sign that the model is overfit. This means that when introduced to a new dataset, the model will likely fail because the new data differs from the training dataset. This under- and over-fitting problem is a drawback to using a supervised approach. You have to hit that sweet spot of accuracy.
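The grading analogy can be sketched in a few lines. The example below scores made-up lithology predictions against their answer key on both training data and held-out data; a large gap between the two scores is one common hint of overfitting. The sample labels and the 0.1 gap threshold are assumptions chosen for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the ground truth."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        raise ValueError("cannot score an empty answer key")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical results: the model reproduces its training labels perfectly...
train_truth = ["sand", "shale", "shale", "lime", "sand", "shale", "lime", "sand"]
train_pred = list(train_truth)

# ...but gets only half of the held-out samples right.
test_truth = ["shale", "sand", "lime", "shale"]
test_pred = ["shale", "shale", "lime", "sand"]

train_acc = accuracy(train_pred, train_truth)
test_acc = accuracy(test_pred, test_truth)

# Assumed rule of thumb: flag the model when training accuracy exceeds
# held-out accuracy by more than 0.1.
overfit_suspected = (train_acc - test_acc) > 0.1

print(train_acc, test_acc, overfit_suspected)
```

The score on the held-out data is the one to weigh against the business requirement, since it better reflects how the model behaves on data it has not seen.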
All of these considerations, while they still apply in unsupervised learning situations, are much more difficult to assess in an unsupervised environment. In this case, measuring some inherent statistical properties in model outputs and making sure that they align with other similar problems is a good route to take in determining model performance.
In the case of unsupervised learning, the conversation surrounding accuracy does not apply because there’s no answer key. So the conversation must shift to answering these questions:
Is this output realistic? Does it make sense? Is it reproducible and to what scale?
Often in unsupervised scenarios a multitude of models are generated so that their performance can be assessed in comparison to each other. This is much like taste testing after selecting a different quality of ingredients for each preparation of the same dish.
Some will behave similarly, others will be awful, and a few might be good. There’s not really a way to know beforehand that the store brand does better than the name brand, not until you try it.
Each type of unsupervised machine learning model has its own metric for a thorough evaluation, and this is how model performance can be compared and discussed in an unsupervised scenario. Before beginning any modeling, it is worth discussing whether that metric will do enough to assess performance to an acceptable standard.
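For clustering, one such metric is the silhouette coefficient, which rewards clusters that are tight internally and well separated from each other. The sketch below is a plain-Python version for one-dimensional data, used to compare two candidate groupings of the same readings. The readings, the two label assignments, and the function name are made up for illustration, and other model types would call for other metrics.

```python
def silhouette_score_1d(values, labels):
    """Mean silhouette coefficient for 1-D values and their cluster labels."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for value, label in zip(values, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # Mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is excluded via len - 1).
        a = sum(abs(value - other) for other in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(value - other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


readings = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]  # hypothetical measurements
candidate_a = [0, 0, 0, 1, 1, 1]  # groups the low and high readings
candidate_b = [0, 1, 0, 1, 0, 1]  # mixes low and high readings

score_a = silhouette_score_1d(readings, candidate_a)
score_b = silhouette_score_1d(readings, candidate_b)
print(f"candidate A: {score_a:.3f}, candidate B: {score_b:.3f}")
```

Scores range from -1 to 1, and the higher-scoring candidate is the better separated of the two. Whether it is also realistic is still a question for someone who knows the data.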
With all of this in mind, conversations about what data is available and how the results will be evaluated help decide not only whether machine learning is a viable path forward for a business question or industry need, but also which type of machine learning algorithm to use. This conversation should also cover the resources allocated to the problem. While all types of machine learning take tuning, unsupervised learning models present more of a challenge and often require more staff time and more computation to generate the variety of models needed for comparison.