Preparing Learning Data

Data is the core resource for AI. This guide shows you how to prepare your dataset for use in Arkangel AI.

We structured this guide into two sections divided by the type of data:

  1. When your input is tabular data:

    1. Classification (multiclass/multilabel): Use this case if you want to predict the category of a record. For example, predicting which people will drop out of a program or which user is more likely to move to the next stage.

    2. Regression: Use this case when you want to predict a number. For example, predicting a numeric lab result.

  2. When your input is imaging data:

    1. Classification (multiclass/multilabel): Use this case if you want to predict the category of an image. For example, radiological patterns in a chest X-ray.

    2. Object detection: Use this case if you want to locate objects inside an image. For example, skin lesions, polyps, or tumors.

Tabular Data:

Basic requirements:

  • The data must be tabular and saved as a flat CSV file.

  • Each record must be a line in the file and the supporting variables must be columns.

  • The first column must be the subject_id which must be a number for each record.

1. For Classification (multiclass/multilabel) projects

  • You must have a column that includes the target you are trying to predict, which must be categorical.

As a rule of thumb, we recommend a maximum of 10 categories. If you need more than 10, try to divide the problem into multiple prediction steps.

2. For Regression projects

  • You must have a column that includes the target you are trying to predict, which must be a number.
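The requirements above can be checked before uploading. The following is a minimal sketch using only the Python standard library; the function name `check_tabular_csv`, the `task` values, and the rule that a "number" is anything `float()` can parse are illustrative assumptions, not part of Arkangel AI.

```python
import csv
import io

MAX_CATEGORIES = 10  # rule of thumb from this guide, not a hard limit


def _is_number(value):
    """True if the cell parses as a number (empty or missing cells do not)."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def check_tabular_csv(csv_text, target, task):
    """Return a list of problems found; an empty list means the checks passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    problems = []

    # The first column must be subject_id, with a number in every record.
    if not columns or columns[0] != "subject_id":
        problems.append("first column must be subject_id")
    elif not all(_is_number(row["subject_id"]) for row in rows):
        problems.append("subject_id must be a number for each record")

    # The target column must exist and match the project type.
    if target not in columns:
        problems.append(f"missing target column: {target}")
    elif task == "regression":
        if not all(_is_number(row[target]) for row in rows):
            problems.append("target must be a number for regression")
    elif task == "classification":
        categories = {row[target] for row in rows}
        if len(categories) > MAX_CATEGORIES:
            problems.append(f"{len(categories)} categories; consider splitting the problem")
    return problems
```

For example, `check_tabular_csv(text, "target", "classification")` returns an empty list when the file has `subject_id` as its first column, numeric IDs, and at most 10 distinct target values.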

Example of a Tabular Classification Project:

To prepare this learning data, we require a minimum of 3 columns:

  • First column [Subject_id]: You must assign an identification number to each record in this column. Because the information must be anonymized, this number cannot be a person's ID; it must be a number assigned when the database is created. This number is vital for creating the algorithm, as it is used to properly handle multiple entries in the dataset for the same person.

  • Second column [Target]: In this column, you will assign the gold standard that the algorithm must recognize. In other words, you will give each entry a category that the model will learn from. For example, if there are two classes, you will have “Benign” or “Malignant” for each record.

  • Third column [Supporting variables]: In this column, you will assign a specific characteristic to the entry that will be used for training the algorithm. This characteristic can be a string, a boolean, or a number. We strongly recommend including columns that are complete, or that have a maximum of 20% empty cells. You can have as many “third columns” as you want. That is to say, you can include as many supporting variables as you wish, as long as all of them fulfill the requirements mentioned above.
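The 20% guideline for empty cells can be measured per column. This is a rough sketch with the standard library; the helper names are hypothetical, and it assumes a cell counts as empty only when it is blank or whitespace (placeholders such as "NA" are not detected).

```python
import csv
import io

MAX_EMPTY_FRACTION = 0.20  # recommendation from this guide


def empty_fraction_by_column(csv_text):
    """Map each column name to the fraction of its cells that are empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fractions = {}
    for column in reader.fieldnames or []:
        # Assumption: blank or whitespace-only cells are the empty ones.
        empty = sum(1 for row in rows if not (row.get(column) or "").strip())
        fractions[column] = empty / len(rows) if rows else 0.0
    return fractions


def columns_over_limit(csv_text, limit=MAX_EMPTY_FRACTION):
    """List the columns whose empty fraction exceeds the limit."""
    fractions = empty_fraction_by_column(csv_text)
    return [column for column, fraction in fractions.items() if fraction > limit]
```

Columns returned by `columns_over_limit` are candidates to complete or drop before upload.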

If your data is not in good shape, Arkangel AI will apply techniques to improve it. Still, try to provide the best data quality you can, because this will improve your results in the end.

Below you can find an example of a CSV file created for a data algorithm:

In this particular case, as you may recognize, the user included the first column named "subject_id", a second column named "target" which is categorical, and 6 more columns as supporting variables.
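As an illustration of this layout, the sketch below replaces a private identifier with an assigned number and writes the rows as CSV text. Sequential numbering is only one possible scheme; the `national_id` field, the sample values, and the function names are hypothetical and used purely for demonstration.

```python
import csv
import io


def assign_subject_ids(records, person_key):
    """Replace a private identifier with a number assigned per person."""
    ids = {}  # private mapping from person to number; keep it out of the upload
    anonymized = []
    for record in records:
        person = record[person_key]
        # Repeated entries for the same person receive the same subject_id.
        subject_id = ids.setdefault(person, len(ids) + 1)
        row = {"subject_id": subject_id}
        row.update({k: v for k, v in record.items() if k != person_key})
        anonymized.append(row)
    return anonymized


def to_csv(rows):
    """Serialize the rows with subject_id as the first column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records: the first and third belong to the same person.
records = [
    {"national_id": "A-100", "target": "Benign", "age": 52},
    {"national_id": "B-200", "target": "Malignant", "age": 61},
    {"national_id": "A-100", "target": "Benign", "age": 53},
]
print(to_csv(assign_subject_ids(records, "national_id")))
```

The output keeps `subject_id` first and `target` second, followed by the supporting variables, and the original identifier never reaches the file.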

Zip file

Once you have the CSV file, you must ZIP it and you are ready to upload it to Arkangel AI.
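If you prefer to create the ZIP file from a script, a minimal sketch with Python's standard `zipfile` module could look like this. The function name `zip_csv` is illustrative, and placing the CSV at the root of the archive is an assumption, since this guide does not specify a layout for tabular uploads.

```python
import zipfile
from pathlib import Path


def zip_csv(csv_path, zip_path):
    """Write a ZIP archive that contains only the given CSV file."""
    csv_path = Path(csv_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Assumption: store the file at the archive root, without parent folders.
        archive.write(csv_path, arcname=csv_path.name)
    return Path(zip_path)
```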

Imaging Data:

Basic requirements:

  • Create a ZIP file that includes a CSV file with the annotations of the images and the path to each image in the folder.

  • Each record must be a line in the CSV file.

  • The first column must be the subject_id which must be a number for each record.

1. For Classification (multiclass/multilabel) projects

  • You must include the labels of each image in the CSV file. We recommend using our Free Image Annotator to prepare your data for classification. Simply select "Image Recognition", then export it to Arkangel AI and we will generate the CSV file for you.

2. For Object Detection projects

  • You must include the coordinates of the bounding boxes to detect inside each image. We recommend using our Free Image Annotator to prepare your data for object detection. Simply select "Object Detection", then export it to Arkangel AI and we will generate the CSV file for you.

Example of an Image Classification Project:

Let’s pretend we have a set of mammography images as shown below.

These images are stored in separate directories named Benign_Masses and Malignant_Masses, respectively.

Therefore, when we create the CSV file, we have to include the whole path to each image, as we will show you below.

CSV column requirements

This file requires a minimum of 3 columns:

  • First column [Subject_id]: You must assign an identification number to each record in this column. Because the information must be anonymized, this number cannot be a person's ID; it must be a number assigned when the database is created. This number is vital for creating the algorithm, as it is used to properly handle multiple entries in the dataset for the same person.

  • Second column [Path]: In this column, you will define the path or location of each image. Make sure that the paths have no spaces or special characters such as parentheses or square brackets. If your paths contain such characters, replace them with an underscore (“_”). For example, if the path to an image is “Directory/SubDirectory name/image (1).png”, you must correct it to “Directory/SubDirectory_name/image_1.png”.

  • Third column [Diagnosis]: In this column, you will assign the gold standard or target that the algorithm must recognize in the images. In other words, you will give a class to each image.
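The path cleanup described for the second column can be automated. The sketch below is one possible approach, not an official tool: it treats every character other than letters, digits, dots, hyphens, and underscores as unsafe (an assumption broader than the spaces, parentheses, and brackets named above), replaces each run of them with a single underscore, and trims underscores left at the edges of a name.

```python
import os
import re

# Assumption: anything outside this set counts as a "special character".
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize_component(name):
    """Clean one folder or file name, keeping the file extension intact."""
    stem, extension = os.path.splitext(name)
    cleaned = _UNSAFE.sub("_", stem).strip("_")
    return cleaned + extension


def sanitize_path(path):
    """Clean every component of a forward-slash separated path."""
    return "/".join(sanitize_component(part) for part in path.split("/"))


print(sanitize_path("Directory/SubDirectory name/image (1).png"))
# prints: Directory/SubDirectory_name/image_1.png
```

Remember to rename the files and folders on disk in the same way, so that the paths in the CSV still point to real images.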

Below you can find an example of a CSV file created for the case study shown above.

In this particular case, as you may recognize, there are several images for the same patient, and there is a fourth column that we used as a prediction target. Please note that in this case the images are stored in separate folders, meaning that the path must include the name of the folder: "Malignant_Masses/20586934.png". If your images are stored in the same folder as your CSV, you don't need to include the full path: "20586934.png".

Zip file

Once you have the CSV file organized as described, you must ZIP it together with the images into a .zip file. It is vital to remember that the directory structure you use to define the image paths in the CSV must be the same one you use when zipping the files.
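One way to keep the two structures aligned is to build the archive from the CSV itself. The sketch below assumes the CSV sits in a dataset folder, that image paths are written relative to that folder with forward slashes, and that the path column is named `path`; the function name and these conventions are illustrative assumptions, so adjust them to your file.

```python
import csv
import zipfile
from pathlib import Path


def zip_imaging_dataset(root, csv_name, zip_path, path_column="path"):
    """Zip the CSV plus every image it lists, keeping the relative paths."""
    root = Path(root)
    csv_file = root / csv_name
    with csv_file.open(newline="") as handle:
        image_paths = [row[path_column] for row in csv.DictReader(handle)]

    # Stop early if the CSV references an image that is not on disk.
    missing = [p for p in image_paths if not (root / p).is_file()]
    if missing:
        raise FileNotFoundError(f"paths listed in the CSV but not found: {missing}")

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_file, arcname=csv_name)
        for relative in sorted(set(image_paths)):
            # arcname reuses the CSV path, so the archive layout matches the CSV.
            archive.write(root / relative, arcname=relative)
    return Path(zip_path)
```

Because each archive entry name is taken directly from the CSV, a path such as "Malignant_Masses/20586934.png" resolves to the same location inside the ZIP file.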
