Abstract
There has been a recent surge in the number of studies that aim to model crop yield using data-driven approaches. This has largely come about due to the increasing availability of remote sensing data (e.g. satellite imagery) and precision agriculture data (e.g. high-resolution crop yield monitor data), as well as the abundance of machine learning modelling approaches. However, there are several common issues in published studies in the field of precision agriculture (PA) that must be addressed. These include the terminology used in relation to crop yield modelling, prediction, forecasting, and interpolation, as well as the way that models are calibrated and validated. As a typical example, many studies will take a crop yield map, or several plots within a field, from a single season, build a model with satellite or Unmanned Aerial Vehicle (UAV) imagery, validate it using data-splitting or some form of cross-validation (e.g. k-fold), and describe the result as a 'prediction' or 'forecast' of crop yield. This is problematic because the approach does not test the forecasting ability of the model: the model is built on the same season that it is validated with, substantially overestimating its value for decision-making, such as an in-season application of fertiliser. This is an all-too-common flaw in the logic of many published studies. Moving forward, it is essential that clear definitions and guidelines for data-driven yield modelling and validation are outlined so that there is a greater connection between the goal of a study and its actual outputs and outcomes. To demonstrate this, the current study uses a case study dataset from a collection of large neighbouring farms in New South Wales, Australia. The dataset includes 160 yield maps of winter wheat (Triticum aestivum) covering 26,400 hectares over a 10-year period (2014–2023).
Machine learning crop yield models are built at 30 m spatial resolution with a suite of predictor data layers that relate to crop yield, including datasets that represent soil variation, terrain, weather, and satellite imagery of the crop. Predictions are made at both within-field (30 m) and field resolutions. Crop yield predictions are useful for an array of applications, so four experiments were set up to reflect different scenarios: Experiment 1, forecasting yield mid-season (e.g. for mid-season fertilisation); Experiment 2, forecasting yield late-season (e.g. for late-season logistics/forward selling); Experiment 3, predicting yield in a previous season for a field with no yield data in that season; and Experiment 4, predicting yield in a previous season for a field with some yield data (e.g. two combine harvesters, but only one fitted with a yield monitor). This study showcases how different model calibration and validation approaches clearly affect prediction quality, and therefore how they should be interpreted in data-driven crop yield modelling studies. This is key to ensuring that the wealth of data-driven crop yield modelling studies not only contributes to the science, but also delivers actual value to growers, industry, and governments.
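The validation pitfall described above can be illustrated with a minimal sketch. The example below uses entirely synthetic data and a simple linear model (not the study's dataset or machine learning approach): each simulated season has its own baseline yield, and a vegetation index explains only part of the variation. A random train/test split within one season lets the season-level effect leak into the score, while holding out a whole unseen season (leave-one-season-out) gives a more honest estimate of forecasting skill.

```python
import random

random.seed(42)

# Hedged sketch with synthetic data: yield = season baseline + 4.0 * NDVI + noise.
# A deterministic upward trend in the baseline stands in for season-to-season
# variation (weather, management, etc.).
def simulate_season(baseline, n=50):
    data = []
    for _ in range(n):
        ndvi = random.uniform(0.3, 0.9)
        yld = baseline + 4.0 * ndvi + random.gauss(0, 0.3)
        data.append((ndvi, yld))
    return data

seasons = {year: simulate_season(1.5 + 0.25 * (year - 2014))
           for year in range(2014, 2024)}

def fit_linear(pairs):
    # Ordinary least squares for y = m*x + c.
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c = (sy - m * sx) / n
    return m, c

def rmse(model, pairs):
    m, c = model
    return (sum((y - (m * x + c)) ** 2 for x, y in pairs) / len(pairs)) ** 0.5

# (a) Random split *within* a single season: the model is trained and tested on
# the same season, so the season effect leaks into the validation score.
one = seasons[2020][:]
random.shuffle(one)
within = rmse(fit_linear(one[:40]), one[40:])

# (b) Leave-one-season-out: train on 2014-2022, test on the unseen 2023 season.
# This is the scenario an in-season forecast actually faces.
train = [p for yr, d in seasons.items() if yr != 2023 for p in d]
across = rmse(fit_linear(train), seasons[2023])

print(f"within-season RMSE:  {within:.2f} t/ha")
print(f"across-season RMSE: {across:.2f} t/ha")
```

The within-season score reflects little more than the simulated measurement noise, whereas the across-season score also carries the unmodelled season effect, which is why the former overstates the model's value for forward-looking decisions such as in-season fertilisation.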