How to build a good dataset for ML?

All datasets are defective, which is why preparing data is a major step in the ML process. You can use the guidelines on this page to verify that you have created a good dataset.

ML relies heavily on data. Data preparation includes making the dataset more suitable for ML and establishing the right data collection mechanism (1). To prepare your dataset properly for ML, ask yourself the following questions:

1. Is your data suitable for the stated problem?

Stating the problem you are trying to solve helps you decide which data is most valuable to collect and how to prepare it appropriately (1) (2). For instance, if you want to recognize a person's emotional state from their facial expressions, your dataset should contain images or videos of people's faces (2).

2. Is your dataset representative for the given problem?

When the sample data does not adequately represent the behavior of your population, data bias occurs. This means that any analysis or model based on the sample will not generalize to your population or target group. One way to check for bias is to examine how your data fields are distributed and verify that the distributions make sense given what you know about your target population or group (2).
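
The distribution check described above can be sketched as follows. This is a minimal illustration, not a complete bias audit: the function names, the age groups, the expected population shares, and the 0.10 tolerance are all assumptions made for the example.

```python
from collections import Counter

def category_shares(values):
    """Return the share of each category in `values` as a dict of fractions."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

def flag_skewed_categories(values, expected_shares, tolerance=0.10):
    """Return categories whose observed share differs from the expected
    share by more than `tolerance` (absolute difference)."""
    observed = category_shares(values)
    flagged = {}
    for category, expected in expected_shares.items():
        actual = observed.get(category, 0.0)
        if abs(actual - expected) > tolerance:
            flagged[category] = (actual, expected)
    return flagged

# Hypothetical sample: the age group recorded for each row of a dataset.
sample_age_groups = ["18-34"] * 70 + ["35-54"] * 25 + ["55+"] * 5
# Hypothetical shares of the target population (assumed known from elsewhere).
population_shares = {"18-34": 0.30, "35-54": 0.30, "55+": 0.40}

skewed = flag_skewed_categories(sample_age_groups, population_shares)
for category, (actual, expected) in skewed.items():
    print(f"{category}: observed {actual:.2f}, expected {expected:.2f}")
```

In this made-up sample, the youngest group is over-represented and the oldest group is under-represented, so both are flagged for review.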

Not all variables in your dataset will contain information relevant to the problem at hand. For example, if you want to estimate the price of a house according to its specifications or sales projections and probabilities, it is unlikely that the name of the homeowner will be one of the relevant attributes (4).

3. Is your data suitable for your task?

The data you need will also depend on the type of task you want to perform:

Task 1: Classification

If you want an algorithm that answers yes-or-no questions (e.g., does the patient have a fever? vomiting? diarrhea?), your task is likely to be classification. You will therefore need annotations or labels indicating which class each piece of data in your dataset belongs to, so that the algorithm can learn from them (1).

In this task, the annotations must be discrete, since there is only a finite number of possible classes. For example, if we want to classify images of dogs and cats, the annotations can only have 0 or 1 as values.
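
As a small sketch of what discrete annotations look like in practice, the snippet below maps class names to integer indexes. The function name `encode_labels` and the sample labels are hypothetical, and sorting the class names is just one possible convention for assigning indexes.

```python
def encode_labels(labels):
    """Map each distinct class name to an integer index.

    Class names are sorted so that the mapping is stable between runs.
    """
    classes = sorted(set(labels))
    class_to_index = {name: index for index, name in enumerate(classes)}
    encoded = [class_to_index[name] for name in labels]
    return encoded, class_to_index

# Hypothetical annotations for five images.
raw_labels = ["cat", "dog", "dog", "cat", "dog"]
encoded_labels, mapping = encode_labels(raw_labels)

print(mapping)         # {'cat': 0, 'dog': 1}
print(encoded_labels)  # [0, 1, 1, 0, 1]
```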

Task 2: Clustering

If you want an algorithm that groups data according to similarity, your task is likely to be clustering. Then you will need similarity measures or criteria, which will depend on the type of data.

In this problem we do not have annotations or labels, so we must resort to looking for characteristics within the data that allow us to determine criteria for grouping them into different clusters.
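
To illustrate a similarity criterion for numeric data, the sketch below uses Euclidean distance and assigns each point to its closest center. It shows a single assignment step only, not a full clustering algorithm; the points, the starting centers, and the function names are assumptions for the example, and other data types would need a different similarity measure.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

def group_by_nearest_center(points, centers):
    """Group points into one list per center, keyed by the center's index."""
    clusters = {index: [] for index in range(len(centers))}
    for point in points:
        clusters[nearest_center(point, centers)].append(point)
    return clusters

# Hypothetical two-dimensional data points and starting centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]

clusters = group_by_nearest_center(points, centers)
print(clusters)
```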

Task 3: Regression

If you want an algorithm that produces a numerical value (1), your task is likely to be regression. You will therefore need annotations or labels indicating the number associated with each piece of data.

The annotations in this problem must be continuous: since the domain is numerical, there is an infinite number of possible values. For example, if we want to estimate the bone age of a person from an X-ray image, the annotations could take any value from 0 to 100.
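
Because regression needs numeric annotations, one possible preparation step is to convert the raw annotations to floating-point numbers and list those that cannot be converted. The sketch below assumes the raw annotations arrive as a mix of strings and numbers; the function name and the sample values are hypothetical.

```python
def parse_numeric_labels(raw_labels):
    """Convert raw annotations to floats.

    Returns the parsed values and the indexes of annotations that could
    not be interpreted as numbers.
    """
    parsed, rejected = [], []
    for index, raw in enumerate(raw_labels):
        try:
            parsed.append(float(raw))
        except (TypeError, ValueError):
            rejected.append(index)
    return parsed, rejected

# Hypothetical bone-age annotations as they might come out of a spreadsheet.
raw_bone_ages = ["12.5", 47, "unknown", "63.2", None]
bone_ages, rejected_indexes = parse_numeric_labels(raw_bone_ages)

print(bone_ages)         # [12.5, 47.0, 63.2]
print(rejected_indexes)  # [2, 4]
```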

4. Is your dataset of good quality?

Low-quality data may hinder and slow down the integration of business intelligence and predictive ML analysis. According to a Data Trust Pulse survey conducted by PricewaterhouseCoopers, "Much of a company's historical data, acquired haphazardly, may lack the detail and demonstrable accuracy needed for use with AI and other advanced automation" (3).

To check the quality of your data, it is important to ask yourself the following questions:

  • Do you trust your data?

  • Were your data collected or labeled by humans? If yes, verify a subset of the data and estimate how frequently errors occur.

  • Were there any technical problems when transferring your data? If yes, it is important to evaluate whether one of the following events has affected your data: a server error, a storage crash, or a cyber attack.

  • How many missing values does your data have? While there are ways to handle missing records, determine whether their number is critical.
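
Two of the checks above, estimating how often labels are wrong and counting missing values, can be sketched as follows. This assumes records are stored as dictionaries and that a manually verified label is available for the checked subset; the field names, the sample records, and the function names are hypothetical.

```python
import random

def estimate_error_rate(records, is_correct, sample_size, seed=0):
    """Check a random subset of `records` and return the share judged incorrect."""
    rng = random.Random(seed)
    subset = rng.sample(records, min(sample_size, len(records)))
    errors = sum(1 for record in subset if not is_correct(record))
    return errors / len(subset)

def missing_share_per_field(records, fields):
    """Return, for each field, the fraction of records where it is None or ''."""
    shares = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        shares[field] = missing / len(records)
    return shares

# Hypothetical records; "verified_label" stands for a manual re-check.
records = [
    {"age": 34, "country": "Brazil", "label": "cat", "verified_label": "cat"},
    {"age": None, "country": "Chile", "label": "dog", "verified_label": "dog"},
    {"age": 51, "country": "", "label": "cat", "verified_label": "dog"},
    {"age": 27, "country": "Peru", "label": "dog", "verified_label": "dog"},
]

error_rate = estimate_error_rate(
    records,
    is_correct=lambda record: record["label"] == record["verified_label"],
    sample_size=4,
)
missing = missing_share_per_field(records, ["age", "country"])

print(f"Estimated label error rate: {error_rate:.2f}")
print(f"Share of missing values per field: {missing}")
```

With a real dataset the sample size would be much smaller than the dataset itself, and whether the resulting rates are acceptable depends on your problem.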

5. Is your dataset formatted consistently?

Your dataset has a consistent format if the input format is the same across the entire dataset. If it is not, ensure that all values of a given attribute are written consistently (1). For example, if you set a numeric range from 0 to 100 for the age attribute, make sure there is no 150 in it.
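
The age example can be checked with a few lines of code. The sketch below assumes the attribute is already numeric and reports the position and value of every entry outside the agreed range; the function name and the sample ages are made up for illustration.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs for values outside the range [low, high]."""
    return [
        (index, value)
        for index, value in enumerate(values)
        if not (low <= value <= high)
    ]

# Hypothetical age attribute with an agreed range of 0 to 100.
ages = [23, 150, 47, -2, 88]
violations = out_of_range(ages, low=0, high=100)

print(violations)  # [(1, 150), (3, -2)]
```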

Additionally, make sure that the data in categorical attributes is free of typographical errors. For example, if the country attribute contains both Brasil and Brazil, it is very likely that one of them is a typographical error and that the two values refer to the same country.
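
One way to surface candidates such as Brasil and Brazil is to compare every pair of distinct category names and list those that are nearly identical. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.8 similarity threshold, the function name, and the sample values are assumptions, and flagged pairs still need a human decision on which spelling to keep.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_category_pairs(values, threshold=0.8):
    """Return pairs of distinct category names whose similarity ratio
    is at least `threshold` (comparison ignores letter case)."""
    categories = sorted(set(values))
    pairs = []
    for first, second in combinations(categories, 2):
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            pairs.append((first, second))
    return pairs

# Hypothetical country attribute.
countries = ["Brazil", "Brasil", "Chile", "Peru", "Brazil", "China"]
suspects = similar_category_pairs(countries)

print(suspects)  # [('Brasil', 'Brazil')]
```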

Citations

(1) Preparing Your Dataset for Machine Learning: 10 Basic Techniques That Make Your Data Better. (2017). AltexSoft. Retrieved February 24, 2023, from: https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/

(2) Jodie Burchell. (2022, November 7). How to Prepare Your Dataset for Machine Learning and Analysis | The JetBrains Datalore Blog. The JetBrains Blog. Retrieved February 27, 2023, from: https://blog.jetbrains.com/datalore/2022/11/08/how-to-prepare-your-dataset-for-machine-learning-and-analysis/

(3) Data Quality Management: Roles, Processes, Tools. (2019). AltexSoft. Retrieved February 27, 2023, from: https://www.altexsoft.com/blog/data-quality-management-and-tools/

(4) ProjectPro. (2022, June 6). Data Preparation for Machine Learning Projects: Know It All Here. ProjectPro; ProjectPro. Retrieved February 27, 2023, from: https://www.projectpro.io/article/data-preparation-for-machine-learning/595
