Agricultural AI holds immense promise for transforming farming. We envision AI-powered tractors, drones, and robots that dramatically improve efficiency and reduce labor on the farm. But the reality is that these transformative changes aren't arriving as quickly or as effectively as we'd hoped. One key reason is the way we're approaching data, specifically the growing reliance on synthetic data. Synthetic data is artificially created information that mimics real-world data; in agricultural AI, this usually means generated images of plants, fields, and other agricultural scenes.
Synthetic data seems like a good solution at first. Real farm data is hard to come by: it's difficult to collect many pictures of rare plant diseases, or to capture the full range of lighting, weather, and growth conditions in the field. So people generate the data artificially, which seems easier and faster than collecting it. Some also believe that more data, even synthetic data, will always make their AI models better, valuing quantity over quality. And finally, researchers are used to training models on datasets, so they naturally turn to creating synthetic ones.
But this approach can cause problems. Synthetic data, while it looks good, often misses the small details and variations of real-world situations. Think of it like a photograph of a painting. The photograph might capture the main colors and shapes, but it won't have the texture and depth of the original painting. Similarly, synthetic data might capture the general look of a diseased plant, but it won't have the subtle signs that an expert farmer would notice.
Even worse, the AI models we train on synthetic data sometimes become too specialized to the artificial data. They become good at recognizing fake plants in fake fields, but they struggle with real plants in real fields. It's like learning to drive a car in a video game and then trying to drive a real car on a real road. The skills don't always transfer.

And here's the biggest problem: We already have powerful AI models that can create this synthetic data. These models have already learned a lot about plants, diseases, and fields. So why are we training separate, and inherently less informative, models on their output? It's like making a copy of a copy: the quality inevitably degrades. And the problem isn't just reduced fidelity; synthetic data can also introduce biases and unrealistic patterns that don't exist in the real world.
This brings me to my main point: We're using the wrong tool. We're trying to drive a screw with a hammer. Training a model on synthetic data is like using a hammer – it might seem like it works, but it's clumsy, inefficient, and can even damage the screw.
What we need is a screwdriver. The screwdriver is the direct use of these powerful generative AI models. Instead of the resource-intensive process of creating a synthetic dataset and training a separate model, we should be finding ways to use the generative models themselves for tasks like recognizing diseases or classifying crops. They already hold so much of the knowledge we need; we just need to learn how to use it directly.

This means exploring approaches like prompt engineering, where carefully crafted text prompts guide the generative model to perform specific tasks, such as identifying diseases in images or classifying crop types. For a deeper dive into this topic, I encourage you to read my article, "Are Your Machine Learning Models Already Obsolete? The Generative AI Revolution in Agriculture," which explores the potential of generative AI and prompt engineering in greater detail. This approach is not only more effective but often far more computationally efficient.

While generative AI holds immense promise, it's crucial to acknowledge its limitations. These models are trained on vast datasets, and if those datasets contain biases, the models may perpetuate or even amplify them. Furthermore, generative models can sometimes produce inaccurate or nonsensical outputs, especially when faced with inputs that deviate significantly from their training data. Therefore, it's essential to validate the outputs of generative models with real-world data and expert knowledge. Careful monitoring, testing, and refinement are necessary to ensure that these models are used responsibly and effectively in agricultural applications.
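The validation step above can be as simple as comparing model predictions against a held-out set of expert-labeled images. Here is a minimal sketch; the prediction and label lists are illustrative placeholders, and in practice they would come from your model and from agronomists reviewing the same images.

```python
# Sketch: validating a generative model's predictions against expert labels.
# Reporting errors per true label helps surface systematic failures
# (e.g. one disease consistently misread), not just overall accuracy.

from collections import Counter

def validation_report(predictions: list[str], expert_labels: list[str]) -> dict:
    """Compare model predictions to expert ground truth."""
    assert len(predictions) == len(expert_labels), "lists must align image-for-image"
    correct = sum(p == e for p, e in zip(predictions, expert_labels))
    errors = Counter(e for p, e in zip(predictions, expert_labels) if p != e)
    return {
        "accuracy": correct / len(expert_labels),
        "errors_by_true_label": dict(errors),
    }

# Illustrative data: four images, one mistake (an "early blight" missed).
preds = ["healthy", "late blight", "healthy", "leaf mold"]
truth = ["healthy", "late blight", "early blight", "leaf mold"]
print(validation_report(preds, truth))
```

A report like this, reviewed regularly as conditions change across the season, is the "careful monitoring" the paragraph above calls for.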
Of course, there will be times when training smaller, more specific models is necessary. Consider situations where we're dealing with limited hardware resources, like embedded systems. These systems often have constraints on memory, processing power, and energy consumption. In such cases, we might need to create a more compact and optimized model, perhaps using tools like TensorFlow Lite, to enable deployment on these devices.

For example, imagine a small, battery-powered sensor deployed in a field to monitor plant health. Such a sensor has limited processing capability, making it impractical to run a large generative AI model directly. In this case, a smaller, specialized model, optimized using TensorFlow Lite, could be deployed on the device to analyze captured images and detect early signs of disease. However, for many agricultural uses, particularly those running on standard computers or more powerful edge devices, using the generative AI models directly will be much more efficient and effective.
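To see why this kind of optimization matters, here is a minimal sketch of post-training weight quantization, one of the core transformations tools like TensorFlow Lite apply when shrinking a model for embedded devices. A real converter operates on whole model graphs; this toy version shows the idea on a single weight array, with illustrative numbers.

```python
# Sketch: int8 post-training quantization of a weight array.
# float32 weights take 4 bytes each; int8 takes 1 byte, a 4x size reduction,
# at the cost of a small, bounded rounding error. Tools like TensorFlow Lite
# apply this (and more) across an entire model.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights onto the int8 range [-127, 127] with one scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid division by zero
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]

w = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
print(q, scale)
print(max(abs(a - b) for a, b in zip(w, restored)))  # error bounded by scale/2
```

The per-weight error never exceeds half the scale factor, which is why accuracy usually drops only slightly while the model becomes small enough for a battery-powered sensor.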
We also need to put more effort into gathering good quality real data. Even a small amount of real data, carefully labeled, can be very helpful for improving the generative AI models or training specialized models.
And finally, we need to change how we judge success in agricultural AI. We should focus more on practical results and less on just coming up with new model designs. Let's reward researchers who find clever ways to use the existing AI tools to solve real farming problems.
The future of agricultural AI depends on it. Let's put down the hammer and pick up the screwdriver. Let's stop building simulated farms in our computers and start building real solutions for real farmers.