5 Pitfalls to Avoid When Developing AI Tools

By Chirag Mudsa - May 13, 2020

Developing a tool that runs on artificial intelligence is mostly about training a machine with data. But you can’t just feed it information and expect AI to wave a magic wand and produce results. The type of data sets you use and how you use them to train the tool are important.

Here are five pitfalls to be wary of when building out your AI tools.

Overfitting

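It may seem peculiar, but a model that is too accurate on its training data can be a problem. A model that learns its training set too well, memorizing noise and quirks along with the real patterns, overfits: it scores highly on the data it has seen and fails to recognize anything else. A small change in the data can then render the model useless. Instead, your model needs to generalize, so evaluate it on data held out from training and treat a large gap between training and validation accuracy as a warning sign.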
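
To make the gap between training and held-out accuracy concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic data, the 20% label noise, and the two toy "models" (a lookup table that memorizes its training rows and a single fixed threshold). It is not a real training pipeline, only a way to see what overfitting looks like in numbers.

```python
import random


def make_rows(n, rng, noise=0.2):
    """Build hypothetical (x, label) rows: label is x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        rows.append((x, label))
    return rows


def accuracy(predict, rows):
    """Fraction of rows where predict(x) matches the label."""
    return sum(predict(x) == label for x, label in rows) / len(rows)


rng = random.Random(0)
train_rows = make_rows(200, rng)
val_rows = make_rows(200, rng)  # held out: never shown to either model

# Overfit "model": a lookup table that memorizes every training row.
memory = dict(train_rows)


def memorizer(x):
    return memory.get(x, False)  # unseen inputs fall back to a fixed guess


# Simple model: a single threshold, which cannot memorize the label noise.
def threshold_model(x):
    return x > 0.5


for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    train_acc = accuracy(model, train_rows)
    val_acc = accuracy(model, val_rows)
    print(f"{name}: train={train_acc:.2f} val={val_acc:.2f} gap={train_acc - val_acc:.2f}")
```

The memorizer is perfect on the rows it has seen and much weaker on the held-out rows, while the threshold model scores about the same on both. A gap like the memorizer's is the signal to look for in a real project.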

Class imbalance

When the class you want to detect is rare, a model trained on the raw data can score well by simply ignoring it, so it is often advisable to include a disproportionate share of the target class in your training data. Say you are training the AI tool to detect iron bars in a warehouse, which make up only 1% of the entire inventory. You may need to make as much as 60% of your training samples iron bars. Amplifying the minority class in this way gives the model enough examples to learn from. Keep your evaluation data at real-world proportions so that test results reflect what the tool will see in production.
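
One simple way to amplify the minority class is random oversampling: duplicating minority rows at random until they reach a target share of the training set. The sketch below assumes rows are `(features, label)` tuples; the function name `oversample_minority`, the labels, and the placeholder features are hypothetical, and the 1% and 60% figures come from the warehouse example above.

```python
import math
import random


def oversample_minority(rows, minority_label, target_fraction, seed=0):
    """Return rows plus random duplicates of minority rows, so that the
    minority class makes up at least `target_fraction` of the result."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be between 0 and 1")
    minority = [row for row in rows if row[1] == minority_label]
    majority_count = len(rows) - len(minority)
    if not minority:
        raise ValueError("no minority rows to duplicate")
    # Solve m / (m + majority_count) >= target_fraction for m.
    needed = math.ceil(target_fraction * majority_count / (1 - target_fraction))
    extra = max(0, needed - len(minority))
    rng = random.Random(seed)
    resampled = list(rows) + rng.choices(minority, k=extra)
    rng.shuffle(resampled)
    return resampled


# Hypothetical inventory: 1% iron bars, 99% everything else.
inventory = [("features", "iron_bar")] * 10 + [("features", "other")] * 990
training_set = oversample_minority(inventory, "iron_bar", target_fraction=0.6)

iron_share = sum(label == "iron_bar" for _, label in training_set) / len(training_set)
print(f"{len(training_set)} training rows, iron bar share = {iron_share:.2f}")
```

Apply this only to the training split, after the test data has been set aside. Duplicated rows that end up in both splits would make the test scores misleading.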

Data leakage

Failure to control data leakage is a major fault. A common instance is when data used to test the model gets into the training data. The model may not appear to underperform, because its test scores look better than they should, but its predictions may not be accurate once it is deployed in production. To address this problem, keep track of all changes made to your AI tools. Besides code reviews and configuration change reviews, hold data set reviews to make sure AI engineering follows a process that leaves no room for such leakage. Any change to the parameters of your AI tools must be well documented.
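
Part of a data set review can be automated. As a minimal sketch, assuming each row is a hashable tuple, the hypothetical helper `find_leaked_rows` below reports test rows that also appear verbatim in the training data. It catches only exact duplicates; subtler leaks, such as features derived from the target, still need a human review.

```python
def find_leaked_rows(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training rows."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


# Hypothetical rows: (record id, measurement, label).
train_rows = [(1, 5.1, "a"), (2, 4.9, "b"), (3, 6.3, "a")]
test_rows = [(4, 5.8, "b"), (2, 4.9, "b")]

leaked = find_leaked_rows(train_rows, test_rows)
if leaked:
    print(f"{len(leaked)} test row(s) also appear in the training data: {leaked}")
```

A check like this can run whenever either data set changes, alongside the code and configuration reviews.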

Concept drift

Most AI tools depend on the organization, size, and shape of the data you bring into the system. For instance, an AI model trained on response times between 1 and 100 seconds may suddenly receive response times of up to 1,000 seconds, and it will not be able to handle the new data points reliably. This kind of shift between the data a model was trained on and the data it sees in production is referred to as concept drift. Instead of waiting for feedback from customers, set up alerts to detect concept drift on incoming data so you can update the AI model.
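
A basic alert of this kind can compare each incoming batch with the range seen during training. The sketch below uses the 1 to 100 second range from the example above; the function names, the sample batches, and the 5% alert threshold are assumptions chosen for illustration, and a production monitor would likely compare whole distributions, not just ranges.

```python
TRAIN_LOW, TRAIN_HIGH = 1.0, 100.0  # response-time range seen in training (seconds)


def out_of_range_share(values, low, high):
    """Fraction of values that fall outside the closed range [low, high]."""
    return sum(1 for value in values if not low <= value <= high) / len(values)


def drift_alert(values, low=TRAIN_LOW, high=TRAIN_HIGH, max_share=0.05):
    """True when more than `max_share` of a batch lies outside the training range."""
    return out_of_range_share(values, low, high) > max_share


# Hypothetical batches of incoming response times, in seconds.
normal_batch = [12.0, 48.5, 97.0, 3.2]
drifted_batch = [15.0, 640.0, 1000.0, 88.0]

for name, batch in [("normal", normal_batch), ("drifted", drifted_batch)]:
    print(f"{name}: alert={drift_alert(batch)}")
```

Wired into the ingestion path, a check like this raises the flag as soon as the data shifts, before customers notice degraded predictions.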

Using personal data

Meeting established norms and industry guidelines for data is extremely important when creating data models for AI tools. The EU has the GDPR, and several US states have their own regulations. To be on the safe side, instead of training AI models and tools on identifiable customer data, it is advisable to use anonymized data.