Well, first of course: everything that can go wrong in any other kind of project. But I think you are aware of this.
I found the following issues to be great obstacles to the successful completion of an ML project:
Unclear goals: The customer has unclear goals or does not make them transparent. As with any other project, the goals of an ML project should be defined and fixed in advance. The conclusion I have drawn in the past: either apply proper requirements engineering or reject the project.
No data/improper data: I experienced two cases where the customer had a clear goal, but the data were insufficient. In a classification problem, the user-assigned classes were unreliable. In a prediction problem, the data came from the "long tail" of the product distribution, so that for a large number of products the data were too sparse. If you become aware of such defects, make them transparent to the customer and explain what he can expect under these conditions, so he can decide whether to take the risk or not.
Data produced under false assumptions: A customer of mine collected data by experimenting under false assumptions. I developed a method for predicting the parameters of the model he had in mind. But when running the approach on real data, it was not possible to derive any useful parameters. A rational reconstruction of the real data revealed two erroneous assumptions the customer had made. Lesson learned for the customer: involve experts already in the experiment design. The outcome was different from what the customer had expected.
Wrongly preprocessed data: During feature engineering of a categorical variable, a one-hot encoding was applied without need. During feature selection over the one-hot-encoded features, a subtle selection bias was introduced into the encoded data. The customer neither made the original data available nor recognized the bias problem, but expected that a better-fitting model could be derived from these buggy data. Only solution: communicate the problems, and how to resolve them, clearly to the customer; if he does not understand, don't continue. (A sketch of how such a bias can arise follows below.)
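For illustration, here is a minimal, hypothetical sketch (my own toy example in Python, not the customer's actual pipeline) of how selecting individual one-hot columns can silently merge categories:

```python
import pandas as pd

# Toy data: one categorical feature with four levels (a hypothetical
# stand-in for the customer's variable).
df = pd.DataFrame({"color": ["red", "blue", "green", "yellow", "red", "green"]})

# One-hot encoding expands the single column into four indicator columns.
encoded = pd.get_dummies(df["color"], prefix="color")

# A selector that scores the indicator columns individually may keep only
# the "strongest" ones and silently drop the rest ...
selected = encoded[["color_red", "color_green"]]

# ... after which "blue" and "yellow" rows are both encoded as (0, 0):
# two distinct categories have been merged, and without access to the
# original column nobody notices the bias.
print(selected)
```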
Security vendors – Sophos included – continue touting the benefits of machine learning-based malware analysis. But, as we’ve written in recent weeks, it must be managed properly to be effective. The technology can be abused by bad actors and corrupted by poor data entry.
Sophos data scientists spoke about the challenges and remedies at length during Black Hat USA 2017 and BSidesLV, and have continued to do so. The latest example is an article by data scientist Hillary Sanders about the importance of proper labeling.
Sometimes, says Sanders, the labels companies inject into their models are wrong.
Dirty labels, bad results
As she put it, supervised machine learning works like this:
Researchers give a model (a function) some data (like some HTML files) and a bunch of associated desired output labels (like 0 and 1 to denote benign and malicious).
The model looks at the HTML files, looks at the available labels 0 and 1 and then tries to adjust itself to fit the data so that it can correctly guess output labels (0,1) by only looking at input data (HTML files).
Researchers define the ground truth for the model by telling it that “this is the perfectly accurate state of the world, now learn from it so you can accurately guess labels from new data”.
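As a minimal sketch of that workflow (my own illustration using scikit-learn and synthetic stand-ins for the HTML features, not Sophos's actual pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for features extracted from HTML files; in reality each row
# would be derived from one file.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

# Labels supplied by researchers: 0 = benign, 1 = malicious ("ground truth").
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model adjusts itself to fit the labeled training data ...
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ... so that it can guess the labels (0, 1) from input data alone.
print("held-out accuracy:", model.score(X_test, y_test))
```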
The problem, she says, is when researchers give their models labels that aren’t correct:
Perhaps it’s a new type of malware that our systems have never seen before and hasn’t been flagged properly in our training data. Perhaps it’s a file that the entire security community has cumulatively mislabeled through a snowball effect of copying each other’s classifications. The concern is that our model will fit to this slightly mislabeled data and we’ll end up with a model that predicts incorrect labels.
To top it off, she adds, researchers won't be able to estimate their errors properly, because they'll be evaluating their model against incorrect labels. The validity of this concern depends on a few factors (a toy simulation follows the list):
The amount of incorrect labels in a dataset
The complexity of the model
Whether incorrect labels are randomly distributed across the data or highly clustered
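To make the first of those factors concrete, here is a small, hypothetical simulation (mine, not from the article): random label noise both degrades the model and distorts the measured accuracy, because the test labels are just as dirty as the training labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 10))
y_clean = (X[:, 0] > 0).astype(int)  # the true (clean) labels

def flip(y, rate):
    """Randomly flip a fraction `rate` of the labels."""
    y = y.copy()
    mask = rng.random(len(y)) < rate
    y[mask] = 1 - y[mask]
    return y

X_tr, X_te, y_tr, y_te = train_test_split(X, y_clean, random_state=0)

for rate in (0.0, 0.1, 0.3):
    noisy_tr, noisy_te = flip(y_tr, rate), flip(y_te, rate)
    model = LogisticRegression(max_iter=1000).fit(X_tr, noisy_tr)
    # What the researcher measures (against noisy labels) vs. the truth.
    print(f"noise={rate:.1f}  measured={model.score(X_te, noisy_te):.3f}"
          f"  true={model.score(X_te, y_te):.3f}")
```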
In the article, Sanders uses plots to show examples of how things can go wrong. Those charts are in the “problem with labels” section.
Getting it right
After guiding readers through the examples of what can go wrong, Sanders outlines what her team does to get it right. To minimize the amount and effects of bad labels in their data, the team…
Only uses malware samples that have been verified as inherently malicious through sandbox analysis and confirmed by multiple vendors.
Tries not to overtrain, and thus overfit, their models. “The goal is to be able to detect never-before-seen malware samples, by looking at similarities between new files and old files, rather than just mimic existing lists of known malware,” she says.
Attempts to improve their labels by analyzing false positives and false negatives found during model testing. In other words, she explains, “we take a look at the files that we think our model misclassified (like the red circled file in the plot below), and make sure it actually misclassified them”.
She adds:
What’s really cool is that very often – our labels were wrong, and the model was right. So our models can actually act as a data-cleaning tool.
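A rough sketch of what such an auditing step can look like, assuming out-of-fold predictions (this is my reconstruction, not Sophos's code): samples where a confident model disagrees with the stored label are queued for manual re-checking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 15))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
labels[rng.choice(len(labels), 40, replace=False)] ^= 1  # plant a few bad labels

# Out-of-fold probabilities: each sample is scored by a model that never
# saw its (possibly wrong) label during training.
proba = cross_val_predict(
    RandomForestClassifier(random_state=0), X, labels,
    cv=5, method="predict_proba")[:, 1]

# Flag samples where the model confidently disagrees with the stored label;
# these go to an analyst for manual verification (and often the label,
# not the model, turns out to be wrong).
suspect = np.where(np.abs(proba - labels) > 0.9)[0]
print(f"{len(suspect)} samples queued for relabeling, e.g. {suspect[:5]}")
```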
The data science team will continue writing about the challenges of machine learning: we know what can possibly go wrong, and we have procedures for using machine learning effectively.
Machine learning is also widely used in scientific applications such as bioinformatics, medicine, and astronomy. I am not sure machine learning goes wrong there.
Like many other areas of Artificial Intelligence, the technical capabilities of machine learning approaches are regularly oversold, and this hype overshadows the real advances. Machine learning algorithms have become an increasingly important part of our lives. They are integral to all sorts of applications, from the speech recognition technology in Siri to Google's search engine. Unfortunately, machine learning systems are often more noticeable in our lives because of failures rather than successes. We come face-to-face with the limitations of auto-text recognition daily, while spam filtering algorithms quietly remove mass mail from our inboxes completely unnoticed. Improvements to machine learning algorithms are allowing us to do more sophisticated computational tasks. But it is often unclear exactly what these tools can do, what their limitations are, and what the implications of their use are, especially in such a fast-moving field.
Article: Fundamental Factor Models Using Machine Learning. https://www.gartner.com/binaries/content/assets/events/keywords/catalyst/catus8/preparing_and_architecting_for_machine_learning.pdf
There are several limitations to the use of machine learning.
Mostly the problem lies in the selection of variables, and of the dependent variables.
Even though we can predict cancer treatment outcomes, machine-learning-based decisions are not yet implemented in cancer therapy because of the interacting genes.
Most machine learning applications are for risk classification (e.g., myeloma classification using random forests).
The treatment can vary because of a specific gene, so there are lots of data to integrate into a decision. Machine learning is not yet capable of understanding such things.
Strangely, machine learning enables us to recognize things we cannot notice ourselves. These are things we do not know, but machine learning gives us a notion of how they work.
There are also concepts we do not know of, and whose workings we do not know either.
Machine learning learns through specific model families; if the function you are trying to model is not close to one of these families, it won't work out.
In conclusion, to get the best results, it is important to try every solution, every model, every feature selection. The supposedly "better working" model will sometimes not work, and another one will work better (see the sketch at the end of this answer).
For efficiency, run these experiments in parallel; this gets you the maximum number of candidate solutions.
To improve the results even more, work with people who understand the underlying subject. The best feature selection is background knowledge.
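As a minimal sketch of that advice (the dataset and the candidate list are placeholders of my choosing): cross-validate several models on the same splits and let the scores, not prior assumptions, pick the winner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder dataset; in practice, use your own features and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The "supposedly better" model is not known in advance: score each
# candidate on the same cross-validation folds and compare. n_jobs=-1
# evaluates the folds in parallel, echoing the advice above.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")
```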