How To Handle Missing Values In Machine Learning Data With Weka

Data is rarely clean, and it often contains corrupt or missing values.

It is important to identify, mark and handle missing data when developing machine learning models in order to get the very best performance.

In this post you will discover how to handle missing values in your machine learning data using Weka.

After reading this post you will know:

  • How to mark missing values in your dataset.
  • How to remove data with missing values from your dataset.
  • How to impute missing values.

Let’s get started.

How To Handle Missing Data For Machine Learning in Weka
Photo by Peter Sitte, some rights reserved.

Predict the Onset of Diabetes

The problem used for this example is the Pima Indians onset of diabetes dataset.

It is a classification problem where each instance represents medical details for one patient and the task is to predict whether the patient will have an onset of diabetes within the next five years.

You can learn more about this dataset on the UCI Machine Learning Repository page for the Pima Indians dataset. You can download the dataset directly from this page. You can also access this dataset in your Weka installation, under the data/ directory in the file called diabetes.arff.

Mark Missing Values

The Pima Indians dataset is a good basis for exploring missing data.

Some attributes such as blood pressure (pres) and Body Mass Index (mass) have values of zero, which are impossible. These are examples of corrupt or missing data that must be marked manually.

You can mark missing values in Weka using the NumericCleaner filter. The recipe below shows you how to use this filter to mark the 11 missing values on the Body Mass Index (mass) attribute.

1. Open the Weka Explorer.

2. Load the Pima Indians onset of diabetes dataset.

3. Click the “Choose” button for the Filter and select NumericCleaner. It is under unsupervised.attribute.NumericCleaner.

Weka Select NumericCleaner Data Filter

4. Click on the filter to configure it.

5. Set the attributeIndices to 6, the index of the mass attribute.

6. Set minThreshold to 0.1E-8 (close to zero), which is the minimum value allowed for the attribute.

7. Set minDefault to NaN, which is unknown and will replace values below the threshold.

8. Click the “OK” button on the filter configuration.

9. Click the “Apply” button to apply the filter.

Click “mass” in the “attributes” pane and review the details of the “selected attribute”. Notice that the 11 attribute values that were formerly set to 0 are now marked as Missing.

Weka Missing Data Marked

In this example we marked values below a threshold as missing.

You could just as easily mark them with a specific numerical value. You could also mark values as missing when they fall between an upper and a lower bound.
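
To make the thresholding rule concrete, here is a minimal plain-Java sketch of the same idea: any value below a minimum threshold is replaced by a default, here NaN. It does not call Weka and is not the filter's implementation; the class name MarkMissing, the method names, and the sample mass values are made up for illustration.

```java
import java.util.Arrays;

class MarkMissing {
    // Return a copy of values in which every entry below minThreshold is
    // replaced by minDefault (hypothetical helper mirroring the recipe's
    // minThreshold and minDefault settings).
    static double[] markBelowThreshold(double[] values, double minThreshold, double minDefault) {
        double[] out = Arrays.copyOf(values, values.length);
        for (int i = 0; i < out.length; i++) {
            if (out[i] < minThreshold) {
                out[i] = minDefault;
            }
        }
        return out;
    }

    // Count the entries that are marked as missing (NaN).
    static int countMissing(double[] values) {
        int count = 0;
        for (double v : values) {
            if (Double.isNaN(v)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up mass values; the zeros stand in for impossible readings.
        double[] mass = {33.6, 0.0, 26.6, 0.0, 23.3};
        double[] marked = markBelowThreshold(mass, 0.1E-8, Double.NaN);
        System.out.println(Arrays.toString(marked));
        System.out.println("missing: " + countMissing(marked));
    }
}
```

Passing a number instead of Double.NaN as minDefault gives the "specific numerical value" variant mentioned above.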

Next, let’s look at how we can remove instances with missing values from our dataset.

Remove Missing Data

Now that you know how to mark missing values in your data, you need to learn how to handle them.

A simple way to handle missing data is to remove those instances that have one or more missing values.

You can do this in Weka using the RemoveWithValues filter.

Continuing on from the above recipe to mark missing values, you can remove instances with missing values as follows:

1. Click the “Choose” button for the Filter and select RemoveWithValues. It is under unsupervised.instance.RemoveWithValues.

Weka Select RemoveWithValues Data Filter

2. Click on the filter to configure it.

3. Set the attributeIndex to 6, the index of the mass attribute.

4. Set matchMissingValues to “True”.

5. Click the “OK” button to use the configuration for the filter.

6. Click the “Apply” button to apply the filter.

Click “mass” in the “attributes” section and review the details of the “selected attribute”.

Notice that the 11 instances whose mass value was marked Missing have been removed from the dataset.

Weka Missing Values Removed

Note, you can undo this operation by clicking the “Undo” button.
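
The removal step can be sketched in plain Java as a filter over rows: keep an instance only if its value for the chosen attribute is not NaN. This is an illustration of the idea rather than Weka's code; the class RemoveMissing, the method name, and the two-column sample rows are hypothetical. Note that the sketch uses 0-based array indices, whereas the Weka dialog counts attributes from 1.

```java
import java.util.ArrayList;
import java.util.List;

class RemoveMissing {
    // Return a new list holding only the rows whose value at attributeIndex
    // (0-based) is present, that is, not NaN. The input list is not modified.
    static List<double[]> removeRowsWithMissing(List<double[]> rows, int attributeIndex) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : rows) {
            if (!Double.isNaN(row[attributeIndex])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows of {pres, mass}; NaN marks a missing mass value.
        List<double[]> rows = new ArrayList<>();
        rows.add(new double[]{72.0, 33.6});
        rows.add(new double[]{66.0, Double.NaN});
        rows.add(new double[]{64.0, 23.3});

        List<double[]> kept = removeRowsWithMissing(rows, 1);
        System.out.println("rows before: " + rows.size());
        System.out.println("rows after:  " + kept.size());
    }
}
```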

Impute Missing Values

Instances with missing values do not have to be removed. Instead, you can replace the missing values with some other value.

This is called imputing missing values.

It is common to impute missing values with the mean of the numerical distribution. You can do this easily in Weka using the ReplaceMissingValues filter.

Continuing on from the first recipe above to mark missing values, you can impute the missing values as follows:

1. Click the “Choose” button for the Filter and select ReplaceMissingValues. It is under unsupervised.attribute.ReplaceMissingValues.

Weka ReplaceMissingValues Data Filter

2. Click the “Apply” button to apply the filter to your dataset.

Click “mass” in the “attributes” section and review the details of the “selected attribute”.

Notice that the 11 attribute values that were marked Missing have been set to the mean value of the distribution.

Weka Imputed Values
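
Mean imputation itself is simple enough to sketch in plain Java. The example below assumes the mean is computed over the observed (non-NaN) values only and then written into every missing slot; it is a conceptual illustration, not Weka's implementation, and the class ImputeMean, its method names, and the sample values are made up.

```java
import java.util.Arrays;

class ImputeMean {
    // Mean of the values that are present (not NaN).
    // Returns NaN if every value is missing.
    static double meanOfObserved(double[] values) {
        double sum = 0.0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    // Return a copy of values with each NaN replaced by the observed mean.
    static double[] imputeMean(double[] values) {
        double mean = meanOfObserved(values);
        double[] out = Arrays.copyOf(values, values.length);
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up mass values with two entries marked as missing.
        double[] mass = {30.0, Double.NaN, 20.0, Double.NaN, 40.0};
        System.out.println(Arrays.toString(imputeMean(mass)));
    }
}
```

Because every missing entry receives the same value, the attribute's mean is preserved but its spread shrinks, which is worth keeping in mind when many values are missing.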

Summary

In this post you discovered how you can handle missing data in your machine learning dataset using Weka.

Specifically, you learned:

  • How to mark corrupt values as missing in your dataset.
  • How to remove instances with missing values from your dataset.
  • How to impute mean values for missing values in your dataset.

Do you have any questions about missing data or about this tutorial? Ask your questions in the comments below and I will do my best to answer.

