Data Best Practices
Well-designed data increases the quality of the resulting AI model. Use the guidelines on this page to prepare training data that produces a better model.
1. Avoid high cardinality for your target
When selecting a column for classification, make sure the number of classes is between 2 and 10; keeping the target's cardinality in this range gives better performance and a better distribution of examples per class.
Avoid columns or rows with >30% missing values
We recommend working with columns and rows that have less than 30% missing values. Depending on the scenario, columns with more missing values can be imputed instead of dropped, as shown in the sketch below.
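As a rough illustration (not a platform feature), the following pandas sketch drops columns with more than 30% missing values and imputes the remaining gaps with simple statistics; the file name is hypothetical.

```python
import pandas as pd

# Hypothetical training table; replace with your own data source.
df = pd.read_csv("training_data.csv")

# Drop columns where more than 30% of the values are missing.
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.30].index)

# Simple imputation for the remaining gaps: median for numeric columns,
# most frequent value for everything else.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```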
2. Avoid target leakage
Target leakage happens when your training data includes predictive information that is not available when you ask for a prediction. Target leakage can cause your model to show excellent evaluation metrics but perform poorly on real data.
For example, suppose you want to know how much ice cream your store will sell tomorrow. You cannot include the target day's actual temperature in your training data, because you will not know it at prediction time (it hasn't happened yet). However, you could use the temperature forecast issued the previous day, because that forecast is also available when you make the prediction request.
3. Avoid training-serving skew
Training-serving skew happens when you generate your training data differently than you generate the data you use to request predictions.
For example, if you plan to predict user lifetime value (LTV) over the next 30 days, make sure the training data is generated in the same way the serving data will be: features that describe the context as of a given day, and an outcome measured 30 days after that day.
In general, any difference between how you generate your training data and your serving data (the data you use to generate predictions) should be reviewed to prevent training-serving skew.
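To make the idea concrete, here is a minimal pandas sketch (with hypothetical table and column names) that builds features from everything known up to a cutoff date and takes the outcome from the 30 days after it, mirroring what will be available at serving time.

```python
import pandas as pd

# Hypothetical purchase events: one row per transaction.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-02-20", "2024-01-10", "2024-01-25"]),
    "amount": [20.0, 35.0, 15.0, 40.0],
})

cutoff = pd.Timestamp("2024-02-01")  # plays the role of "today"

# Features: only what is known up to the cutoff, exactly as at serving time.
features = (events[events["event_date"] <= cutoff]
            .groupby("user_id")["amount"].agg(["sum", "count"]))

# Label: spend observed in the 30 days *after* the cutoff.
in_window = ((events["event_date"] > cutoff) &
             (events["event_date"] <= cutoff + pd.Timedelta(days=30)))
labels = events[in_window].groupby("user_id")["amount"].sum().rename("ltv_30d")

training_table = features.join(labels).fillna({"ltv_30d": 0.0})
```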
4. Provide a time signal
For classification and regression models, if the underlying pattern in your data is likely to shift over time (it is not randomly distributed in time), make sure you provide that information. You can provide a time signal in several ways; for example:
If each row of data has a timestamp, make sure that column is included, has a transformation type of Timestamp, and is set as the Time column when you train your model.
5. Make information explicit
You can improve model quality by making information explicit. For example, if your data includes longitude and latitude, these columns are treated as plain numeric values, with no special calculations. If location or distance provides a signal for your problem, you must engineer a feature that provides that information explicitly, for example a postal code or the distance to a point of interest (see the sketch after the list below). As a side benefit, replacing raw coordinates with coarser features also helps you anonymize the data further.
Some data types that often benefit from feature engineering:
Longitude/Latitude
URLs
IP addresses
Email addresses
Phone numbers
Addresses
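As one example of this kind of feature engineering, the sketch below replaces raw coordinates with a distance to a reference point using the haversine formula; the column names and the store location are hypothetical.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical data: customer coordinates and one store location.
df = pd.DataFrame({"lat": [4.711, 6.244], "lon": [-74.072, -75.581]})
store_lat, store_lon = 4.60971, -74.08175

# Explicit feature: distance to the store, instead of raw coordinates.
df["distance_to_store_km"] = haversine_km(df["lat"], df["lon"],
                                          store_lat, store_lon)
```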
6. Include calculated or aggregated data in a row
Arkangel AI uses only the input data in a single row to predict the target value for that row. If you have calculated or aggregated data from other rows or sources that would be valuable in determining the predicted value for a row, include that data in the source row.
For example, if you want to predict next week's demand for a healthcare product, you can improve the quality of the prediction by including columns with the following values:
The total number of items in stock from the same category as the product.
The average price of items in stock from the same category as the product.
The number of days before a known holiday when the prediction is requested.
And so on...
In another example, if you want to predict whether a specific user will buy a product, you can improve the quality of the prediction by including columns with the following values (see the sketch after this list):
The average historic conversion rate or click-through rate for the specific user.
How many products are currently in the user's shopping cart.
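A minimal pandas sketch of this aggregation, assuming hypothetical table and column names, could look like the following: historic per-user rates are computed from past interactions and then attached to the row used for training and prediction.

```python
import pandas as pd

# Hypothetical historical interactions: one row per product impression.
history = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "clicked":   [1, 0, 1, 0, 0],
    "converted": [1, 0, 0, 0, 0],
})

# Aggregate other rows into per-user columns...
user_stats = history.groupby("user_id").agg(
    historic_ctr=("clicked", "mean"),
    historic_conversion_rate=("converted", "mean"),
).reset_index()

# ...and attach them to the rows you will train and predict on.
rows = pd.DataFrame({"user_id": [1, 2], "items_in_cart": [3, 1]})
rows = rows.merge(user_stats, on="user_id", how="left")
```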
7. Avoid bias
Make sure that your training data is representative of the entire universe of potential data that you will be making predictions for. For example, if you have customers that live all over the world, you should not use training data from only one country.
Classification problems
1. Represent null values appropriately
If you are importing from CSV, use empty strings to represent null values.
If your data uses special characters or numbers, including zero, to represent null values, those values are misinterpreted as real data, which reduces model quality.
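If your source data does use sentinels, a small cleanup step like the following sketch (with hypothetical sentinel values and file names) converts them to proper nulls before import; when pandas writes the CSV, missing values become empty strings by default.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("training_data.csv")

# Hypothetical sentinel values used in the source system to mean "unknown".
# Only add 0 to this list if zero really means "missing" in your data.
sentinels = ["N/A", "NULL", "-", -999]
df = df.replace(sentinels, np.nan)

# Missing values are written as empty strings, the representation
# recommended above for null values in CSV files.
df.to_csv("training_data_clean.csv", index=False)
```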
2. Avoid missing values where possible
Check your data for missing values, and correct them if possible. Otherwise, you can leave the value blank, and it is treated as a null value; Arkangel AI applies different imputation techniques to missing values to improve your training dataset.
3. Use spaces to separate text
Arkangel AI tokenizes text strings and can derive training signals from individual words. It uses spaces to separate words; words separated by other characters are treated as a single entity.
For example, if you provide the text "red/green/blue", it is not tokenized into "red", "green", and "blue". If those individual words might be important for training the model, you should transform the text to "red green blue" before including it in your training data.
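A small, hypothetical cleanup function for this transformation might look like the sketch below, which replaces common non-space separators with spaces.

```python
import re

def space_separate(text: str) -> str:
    """Replace common separators (/, |, ;, commas, underscores) with spaces."""
    return re.sub(r"[/|;,_]+", " ", text).strip()

space_separate("red/green/blue")  # -> "red green blue"
```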
4. Make sure your categorical features are accurate and clean
Data inconsistencies can cause categories to be incorrectly split. For example, if your data includes "Brown", "bròwn" and "brown", Arkangel AI treats those values as separate categories, when you might have intended them to be the same. Misspellings can have a similar effect. Make sure you remove these kinds of inconsistencies from your categorical data before creating your training data.
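One possible cleanup, sketched below with a hypothetical column name, lowercases values, trims whitespace, and strips accents so that variants like these collapse into one category; misspellings still need a manual mapping.

```python
import unicodedata
import pandas as pd

def normalize_category(value: str) -> str:
    """Lowercase, trim, and strip accents so 'Brown', 'bròwn' and ' brown ' match."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.strip().lower()

df = pd.DataFrame({"color": ["Brown", "bròwn", " brown "]})
df["color"] = df["color"].map(normalize_category)  # all become "brown"
```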
5. Use extra care with imbalanced classes for classification models
If you have imbalanced classes (a classification problem with one or more outcomes that are seen rarely), review the following tips.
6. Provide sufficient training data for the minority class
Having too few rows of data for one class degrades model quality. If possible, you should provide at least 100 rows of data for every class.
7. Consider using a manual split
Arkangel AI selects the rows for the test dataset randomly (but deterministically). For imbalanced classes, you could end up with a small number of the minority class in your test dataset, or even none, which causes training to fail.
If you have imbalanced classes, you might want to assign a manual split to make sure enough rows with the minority outcomes are included in every split.
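If you prepare the split yourself, a stratified split keeps the class proportions in every subset. The sketch below uses scikit-learn with hypothetical file, target, and split-column names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")   # hypothetical file
target = "churned"                      # hypothetical target column

# Stratified 80/10/10 split: each subset keeps the class proportions,
# so the minority class is guaranteed to appear in validation and test.
train_df, rest_df = train_test_split(
    df, test_size=0.20, stratify=df[target], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df[target], random_state=42)

# Record the assignment in a split column, if that is how your
# platform expects a manual split to be provided.
df.loc[train_df.index, "split"] = "TRAIN"
df.loc[val_df.index, "split"] = "VALIDATE"
df.loc[test_df.index, "split"] = "TEST"
```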
8. Provide enough training data
The following table provides some heuristics for how much training data to provide, depending on your objective.
Leave all other preprocessing and transformations to Arkangel AI
Arkangel AI does the feature engineering for you when you train a model, but the platform performs best when it has access to your underlying data in the highest quality possible.
Forecasting data preparation best practices
Training data for forecasting models has some special considerations.
1. Time series identifier
One of your columns in your training data for a forecasting model must be specified as the time series identifier. Forecasting training data usually includes multiple time series, and the identifier tells Arkangel AI which time series a given observation in the training data is part of. All of the rows in a given time series have the same value in the time series identifier column.
Some common time series identifiers might be the service ID, a provider ID, or a region ID. When you have multiple time series in your training data, there should be a specific column that differentiates them.
You can train a forecasting model on a single time series (for example, using just one service ID). For best results, you should have at least 10 time series for every column used to train the model.
2. Considerations for choosing the data granularity
When you train a forecasting model, you specify the data granularity, or the time interval between the training data rows. It can be hourly, daily, weekly, monthly, or yearly. In addition, it can be every 1, 5, 10, 15, or 30 minutes.
The data granularity must be consistent throughout the training data and all batch prediction data. If you specify a daily granularity and there are 2 days between two training data rows, Arkangel AI treats the interim day as missing data. Multiple rows in the same time series with the same timestamp (as determined by the granularity) cause a validation error at training time.
Generally, your data collection practices determine your data granularity.
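Before training, it can help to verify the granularity yourself. The sketch below, with hypothetical column names, flags duplicate timestamps within a series and lists missing days for a daily granularity.

```python
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])  # hypothetical columns

# 1. Duplicate timestamps within one time series cause a validation error.
dupes = df.duplicated(subset=["product_id", "date"], keep=False)
print("duplicate rows:", int(dupes.sum()))

# 2. Gaps in a daily series are treated as missing data, so list them.
for pid, grp in df.groupby("product_id"):
    expected = pd.date_range(grp["date"].min(), grp["date"].max(), freq="D")
    missing = expected.difference(grp["date"])
    if len(missing) > 0:
        print(f"series {pid}: {len(missing)} missing day(s)")
```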
About data format
You can create your training data in either wide or narrow format. For regression and classification models, the wide format is common and can be easier to assemble and review. For forecasting models, using the narrow format can help you avoid setting up unintentional connections between your data and your target (data leakage).
When you create data to train a forecasting model, each row should represent a single observation on a single time series. You must have a column that represents your time series identifier (how the time series are distinguished from each other), and a column that represents the value that you will be predicting (your target). Then every other value in the row that is used to train the model must be present at the time you request a prediction for your target.
Consider the following (simplified and abbreviated) sample training data:
This table, in wide format, shows business data by date, but it would not be usable for a forecasting model in its current form. There is no single target column, no time series ID column, and for any given date, you will not know the demand for the other widgets at the time of prediction.
The solution is to convert to a narrow format, so that each row represents a single observation. Any data that is independent of the time series is repeated for each row:
Now we have a time series identifier (Product), a target column (Demand), and a Time column (Date). In addition, each row is based on a single observation, which can be used to predict the target value. The Region and Promo columns are used as features to train the model.
In reality, you will have many more rows and many more columns than these examples. But you must follow the guidelines here to structure your data to avoid data leakage.
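As a rough illustration of the wide-to-narrow conversion (with entirely hypothetical products and values, not the sample data above), pandas melt can reshape one demand column per product into one observation per row.

```python
import pandas as pd

# Hypothetical wide-format data: one demand column per product.
wide = pd.DataFrame({
    "Date":   ["2024-03-01", "2024-03-02"],
    "Region": ["North", "North"],
    "Promo":  [0, 1],
    "Widget_A_Demand": [120, 135],
    "Widget_B_Demand": [80, 60],
})

# Narrow format: one row per (Date, Product) observation, with Product
# as the time series identifier and Demand as the target.
narrow = wide.melt(
    id_vars=["Date", "Region", "Promo"],
    var_name="Product",
    value_name="Demand",
)
narrow["Product"] = narrow["Product"].str.replace("_Demand", "", regex=False)
```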