Arkangel AI Docs

Preparing Learning Data

Data is the core resource for AI. This guide shows you how to prepare your dataset for use in Arkangel AI.

We structured this guide into two sections divided by the type of data:

  1. When your input is tabular data:

    1. Classification (multiclass/multilabel): Use this case if you want to predict the category of a record. For example, predicting which people will drop out of a program or which users are most likely to move to the next stage.

    2. Regression: Use this case when you want to predict a number. For example, predicting a numeric lab result.

  2. When your input is imaging data:

    1. Classification (multiclass/multilabel): Use this case if you want to predict the category of an image. For example, radiological patterns in a chest X-ray.

    2. Object detection: Use this case if you want to detect objects inside an image. For example, skin lesions, polyps, or tumors.

Tabular Data:

Basic requirements:

  • The data must be tabular and saved as a flat CSV file.

  • Each record must be a line in the file, and the supporting variables must be columns.

  • The first column must be subject_id, which must contain a number for each record.
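
As a minimal sketch, the basic requirements above can be checked locally before uploading. The function name `check_basic_requirements` and the sample rows are hypothetical, and the header spelling `subject_id` is assumed from the example later in this guide. This is not Arkangel AI's own validation.

```python
import csv
import io


def check_basic_requirements(csv_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or not rows[0]:
        return ["file is empty or has no header line"]

    problems = []
    header, records = rows[0], rows[1:]

    # Assumption: the first column header is spelled "subject_id" (case-insensitive).
    if header[0].strip().lower() != "subject_id":
        problems.append(f"first column is {header[0]!r}, expected 'subject_id'")

    for line_no, record in enumerate(records, start=2):
        # Each record is one line with one value per column.
        if len(record) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} values, got {len(record)}"
            )
            continue
        # subject_id must be a number for each record.
        if not record[0].strip().isdigit():
            problems.append(f"line {line_no}: subject_id {record[0]!r} is not a number")
    return problems


# Hypothetical sample data, for illustration only.
good = "subject_id,target,age\n1,yes,34\n2,no,51\n"
bad = "name,target\nx,yes\n"
good_problems = check_basic_requirements(good)
bad_problems = check_basic_requirements(bad)
```

An empty result means the file passed these three checks only; it says nothing about the quality of the data itself.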

1. For Classification (multiclass/multilabel) projects

  • You must have a column that includes the target you are trying to predict, which must be categorical.

As a rule of thumb, we recommend having at most 10 categories. If you need more than 10, try to divide the problem into multiple prediction steps.
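
To see how many categories a target column actually contains, a short sketch like the following can help. The column name `target` is taken from the example in this guide; the function name `count_categories` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_categories(csv_text: str, target_column: str = "target") -> Counter:
    """Count how many records fall into each category of the target column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[target_column].strip() for row in reader)


# Hypothetical sample data, for illustration only.
sample = "subject_id,target\n1,Benign\n2,Malignant\n3,Benign\n"
counts = count_categories(sample)

if len(counts) > 10:
    print(f"{len(counts)} categories found; consider splitting the problem into steps.")
else:
    print(f"{len(counts)} categories: {dict(counts)}")
```

The per-category counts are also a quick way to spot misspelled labels, which would otherwise show up as extra categories.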

2. For Regression projects

  • You must have a column that includes the target you are trying to predict, which must be a number.

Example of a Tabular Classification Project:

To prepare this learning data, we require a minimum of 3 columns:

  • First column [Subject_id]: You must assign an identification number to each record in this column. As the information must be anonymized, this identification number cannot be the ID of a person but a number assigned in the creation of the database. This number is vital for creating the algorithm, as it will be used to properly manage the presence of numerous entries in the dataset for the same person.

  • Second column [Target]: In this column, you will assign the gold standard that the algorithm must recognize. In other words, you will give a category to each entry that the model will learn from. For example, if there are two classes, you will have “Benign” or “Malignant” for each record.

  • Third column [Supporting variables]: In this column, you will assign a specific characteristic to the entry that will be used for training the algorithm. This characteristic can be a string, a boolean, or a number. We strongly recommend including columns that are complete, or that have at most 20% empty cells. You can have as many “third columns” as you want; that is, you can include as many supporting variables as you wish, as long as all of them fulfill the requirements mentioned above.
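
The 20% recommendation for empty cells can be checked with a small sketch like the one below. It assumes that a cell counts as empty when it is blank or contains only whitespace; the function name `empty_cell_fraction` and the sample rows are hypothetical.

```python
import csv
import io


def empty_cell_fraction(csv_text: str) -> dict[str, float]:
    """Map each column name to the fraction of its cells that are empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return {}
    return {
        # Assumption: blank or whitespace-only cells count as empty.
        name: sum(1 for row in rows if not (row[name] or "").strip()) / len(rows)
        for name in reader.fieldnames
    }


# Hypothetical sample data: "age" has 1 of 5 cells empty, "bmi" has 2 of 5.
sample = (
    "subject_id,target,age,bmi\n"
    "1,yes,34,\n"
    "2,no,,22.5\n"
    "3,yes,51,\n"
    "4,no,47,30.1\n"
    "5,no,29,27.8\n"
)
fractions = empty_cell_fraction(sample)
too_sparse = [name for name, fraction in fractions.items() if fraction > 0.20]
```

Here `too_sparse` lists the columns that exceed the recommended maximum, which are candidates for cleaning or removal before upload.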

Below you can find an example of a CSV file created for a tabular classification project:

In this particular case, the user included a first column named "subject_id", a second column named "target", which is categorical, and 6 more columns as supporting variables.
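
The subject_id column must hold an assigned number rather than a personal ID, and the same person must keep the same number across all of their records. One simple way to do this is sketched below. The function name `assign_subject_ids` and the sample identifiers are hypothetical, and sequential numbering is only an assumption about what an acceptable assigned number looks like.

```python
def assign_subject_ids(person_keys):
    """Give each distinct person a sequential number, reused across their records."""
    mapping = {}
    subject_ids = []
    for key in person_keys:
        if key not in mapping:
            # First time this person appears: assign the next free number.
            mapping[key] = len(mapping) + 1
        subject_ids.append(mapping[key])
    return subject_ids, mapping


# Hypothetical original identifiers; the first and third record are the same person.
subject_ids, mapping = assign_subject_ids(["A-100", "B-200", "A-100"])
```

Only the assigned numbers go into the CSV. The mapping back to the original identifiers is itself sensitive and should stay out of the uploaded dataset.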

Zip file

Imaging Data:

Basic requirements:

  • Create a ZIP file that includes the images and a CSV file containing the annotations and the path to each image.

  • Each record must be a line in the CSV file.

  • The first column must be subject_id, which must contain a number for each record.

1. For Classification (multiclass/multilabel) projects

2. For Object Detection projects

Example of an Image Classification Project:

Let’s pretend we have a set of mammography images.

These images are stored in separate directories named Benign_Masses and Malignant_Masses, respectively.

Therefore, when we create the .csv file, we have to include the whole path to each image, as described below.

CSV column requirements

This file requires a minimum of 3 columns:

  • First column [Subject_id]: You must assign an identification number to each record in this column. As the information must be anonymized, this identification number cannot be the ID of a person but a number assigned in the creation of the database. This number is vital for creating the algorithm, as it will be used to properly manage the presence of numerous entries in the dataset for the same person.

  • Second column [Path]: In this column, you will define the path or location of each image. Make sure that the paths have no spaces or special characters such as parentheses or square brackets. If your paths have such characters, please replace them with “_”. For example, if the path to an image is “Directory/SubDirectory name/image (1).png”, you must correct it using underscores as follows: “Directory/SubDirectory_name/image_1.png”.

  • Third column [Diagnosis]: In this column, you will assign the gold standard or target that the algorithm must recognize in the images. In other words, you will give a class to each image.
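
The path clean-up described for the Path column can be sketched as follows. The function name `sanitize_path` is hypothetical, and the exact set of characters treated as safe (letters, digits, ".", "_", and "-") is an assumption; the guide only names spaces, parentheses, and square brackets.

```python
import re


def sanitize_path(path: str) -> str:
    """Replace spaces and special characters in each path segment with "_"."""
    clean_parts = []
    for part in path.split("/"):
        # Collapse every run of unsafe characters into a single underscore.
        clean = re.sub(r"[^A-Za-z0-9._-]+", "_", part)
        # Drop underscores left directly before the file extension.
        clean = re.sub(r"_+(\.[A-Za-z0-9]+)$", r"\1", clean)
        clean_parts.append(clean.strip("_"))
    return "/".join(clean_parts)


cleaned = sanitize_path("Directory/SubDirectory name/image (1).png")
```

For the example path from this guide, `cleaned` is "Directory/SubDirectory_name/image_1.png". The files and folders themselves must be renamed in the same way, otherwise the paths in the CSV will no longer match the images.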

Below you can find an example of a CSV file created for the case study shown above.

In this particular case, there are several images for the same patient, and there is a fourth column that was used as the prediction target. Please note that in this case the images are stored in separate folders, meaning that the path must include the name of the folder: "Malignant_Masses/20586934.png". If your images are stored in the same folder as your CSV, you don't need to include the full path: "20586934.png".
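
Because each path is resolved relative to the folder that holds the CSV, it can be worth checking that every listed image exists before zipping. The sketch below assumes the path column is named `path`; the function name `missing_images`, the file name `annotations.csv`, and the second sample row are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def missing_images(csv_path: str, path_column: str = "path") -> list[str]:
    """Return the CSV paths that do not point to a file next to the CSV."""
    csv_file = Path(csv_path)
    with csv_file.open(newline="") as handle:
        reader = csv.DictReader(handle)
        return [
            row[path_column]
            for row in reader
            if not (csv_file.parent / row[path_column]).is_file()
        ]


# Build a tiny throwaway dataset to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Malignant_Masses").mkdir()
    (root / "Malignant_Masses" / "20586934.png").write_bytes(b"")
    (root / "annotations.csv").write_text(
        "subject_id,path,diagnosis\n"
        "1,Malignant_Masses/20586934.png,Malignant\n"
        "2,Benign_Masses/not_there.png,Benign\n"
    )
    missing = missing_images(str(root / "annotations.csv"))
```

In this demonstration only the second row is reported, because its image was never created.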

Zip file

Once you have the CSV file organized as described, you must compress it together with the images into a .zip file. It is vital to remember that the directory structure you use to define the image paths in the CSV must be the same one you use when zipping the files.
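
One way to keep the directory structure identical is to store every file in the archive under its path relative to the dataset folder, as in this sketch. The function name `zip_dataset`, the file name `annotations.csv`, and the sample image `img_1.png` are hypothetical, and placing the CSV at the root of the archive is an assumption.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir: str, zip_path: str) -> list[str]:
    """Zip every file under dataset_dir, storing paths relative to that folder."""
    root = Path(dataset_dir)
    # Write the archive outside dataset_dir so it does not try to include itself.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(root.rglob("*")):
            if file.is_file():
                archive.write(file, arcname=file.relative_to(root).as_posix())
        return archive.namelist()


# Build a tiny throwaway dataset to demonstrate the layout.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    (dataset / "Benign_Masses").mkdir(parents=True)
    (dataset / "Malignant_Masses").mkdir()
    (dataset / "Benign_Masses" / "img_1.png").write_bytes(b"")
    (dataset / "Malignant_Masses" / "20586934.png").write_bytes(b"")
    (dataset / "annotations.csv").write_text("subject_id,path,diagnosis\n")
    names = zip_dataset(str(dataset), str(Path(tmp) / "dataset.zip"))
```

The names stored in the archive, such as "Malignant_Masses/20586934.png", then match the values in the CSV path column exactly.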


For tabular data: If your data is not in good shape, Arkangel AI will apply techniques to improve your data. Still, try to provide the best-quality data you can; this will improve your results in the end.

For the tabular ZIP file: Once you have the CSV file, you must ZIP it, and then you are ready to upload it to Arkangel AI.

For image Classification (multiclass/multilabel) projects: You must include the labels of each image in the CSV file. We recommend using our Free Image Annotator to prepare your data for classification. Simply select "Image Recognition", then export it to Arkangel AI and we will generate the CSV file for you.

For Object Detection projects: You must include the coordinates of the bounding boxes to detect inside each image. We recommend using our Free Image Annotator to prepare your data for object detection. Simply select "Object Detection", then export it to Arkangel AI and we will generate the CSV file for you.

Caption of the tabular example: CSV data example for high-risk cirrhosis patients.