The Centaurus implementation consists of a user-facing web service and a distributed, cloud-enabled backend.

Precision farming integrates cyber infrastructure and computational data analysis to overcome the challenges associated with extracting useful information and actionable insights from the vast amount of information that surrounds the crop life cycle. Precision ag attempts to help growers answer key questions about irrigation and drainage, plant disease, insect and pest control, fertilization, crop rotation, soil health, weather protection, and crop production. Existing precision ag solutions include sensor-software systems for irrigation, mapping, and image capture/processing; intelligent implements; and, more recently, public cloud software-as-a-service solutions that provide visualization and analysis of farm data over time (OnFarm, Climate Corporation, MyAgCentral, gThrive, WatrHub, PowWow).

Current precision ag technologies fall short in three key ways that have severely limited their impact and widespread use. First, they fail to provide growers with control over the privacy of their data. Second, they lock growers into proprietary, closed, inflexible, and potentially costly technologies and methodologies. Third, they rely on a centralized, cloud-hosted approach. In terms of data privacy, extant solutions require that farmers relinquish control over and ownership of their most valuable asset: their data. Farm data reveals private and personal information about grower practices, crop inputs, farm implement use, purchasing and sales details, water use, disease problems, and other details that define a grower’s business and competitiveness. Revealing such information to vendors in exchange for the ability to visualize it puts farmers at significant risk (Federation, Russo, Vogt). The second limitation of extant precision ag solutions is “lock-in”. Lock-in is a well-known business strategy in which vendors seek to create barriers to exit for their customers as a way of ensuring revenue from continued use, new or related products, or add-ons in the future.

In the precision ag sector, this manifests as proprietary, closed, and fragmented solutions that preclude advances in sustainable agriculture science and engineering by anyone other than the companies themselves. Lock-in also manifests as a lack of support for cross-vendor technologies, including observation and sensing devices, farm implements, and data management and analysis applications. Because farmers face many challenges in switching vendors once they have chosen one, the chosen vendor can charge fees for training, customizations, add-ons, and use of its online resources without limit, given the lack of competition. The third limitation is that most precision ag solutions today employ the centralized approach described above. As solutions move increasingly online, lock-in also requires that farmers upload all of their data to the cloud, giving vendors full control and access and leaving growers without recourse when vendors go out of business (Rodrigues). In addition to these risks, such network communication of potentially terabytes of image and sensor data is expensive and time consuming for many growers because of the poor network connectivity and costly data rates that are typical of rural areas. Finally, many of these technologies impose high premiums and yearly subscriptions (ArcGIS). The goal of our work is to address these limitations and to provide a scalable data analytics platform that facilitates open precision ag advances. To enable this, we leverage recent advances in the Internet of Things, cloud computing, and data analytics, and extend them with new research that defines a software architecture tailoring each to agricultural settings, applications, and sustainability science. These constituent technologies cannot be used off the shelf, however, because they require significant expertise and staffing to set up, manage, and maintain, requirements that are show stoppers for today’s growers.
We attempt to overcome these challenges with a comprehensive, end-to-end system for scalable agriculture analytics that is open source and can run anywhere, precluding lock-in. To enable this, we contribute new advances in scalable analytics, low-cost sensing, easy-to-use data visualization, data-driven decision support, and automatic edge-cloud scheduling, all within a single, unified distributed platform. In the next chapter, we begin by focusing on an important analytics building block and tailoring its use for farm management zone identification using soil electrical conductivity data.

Statistical clustering, the separation of measurements into related groups, is a key requirement for solving many analytics problems. Lloyd’s algorithm (Lloyd), commonly called k-means, is one of the most widely used approaches (Duda et al.). K-means is an unsupervised learning algorithm, requiring no training or labeling, that partitions data into K clusters based on their “distance” from K centers in a multi-dimensional space. Its basic form is simple to implement, and it has become an indispensable component of pattern recognition, data mining, image processing, information retrieval, and recommendation applications across fields ranging from marketing and advertising to astronomy and agriculture. While k-means is conceptually simple, there are a myriad of algorithm variants that differ in how distances are calculated in the problem space. Some k-means implementations also require “hyperparameters” that control for the amount of statistical variation in clustering solutions. Identifying which algorithm variant and set of implementation parameters to use in a given analytics setting is often challenging and error-prone for novices and experts alike. In this chapter, we present Centaurus as an approach to simplifying the application of k-means through the use of cloud computing. Centaurus is a web-accessible, cloud-hosted service that automatically deploys and executes multiple k-means variants concurrently, producing multiple models. It then scores the models to select the one that best fits the data, a process known as model selection. It also allows for experimentation with different hyperparameters and provides a set of data and diagnostic visualizations so that users can best interpret its results. From a systems perspective, Centaurus defines a pluggable framework into which clustering algorithms and k-means variants can be plugged.
When users upload their data, Centaurus runs the k-means variants concurrently and automatically scales their execution using public or private cloud resources. To perform model selection, Centaurus employs a scoring component based on information criteria. Centaurus computes a score for each result and recommends the best clustering to the user. Users can also employ Centaurus to visualize their data, its clusterings, and their scores, and to experiment with different parameterizations of the system.
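The model selection loop described above can be illustrated with a minimal, standard-library Python sketch. It is not the Centaurus implementation: the function names (`kmeans`, `bic`, `select_model`) are hypothetical, only Euclidean distance is used, and the score assumes a simplified hard-assignment Gaussian model with a single shared spherical variance, with the Bayesian Information Criterion (lower is better in this convention) standing in for the information criterion. Runs that produce an empty cluster are treated as degenerate and discarded.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, rng: random.Random,
           max_iter: int = 100) -> Optional[Tuple[List[Point], List[int]]]:
    """One run of Lloyd's algorithm; returns None for a degenerate run."""
    centers = rng.sample(points, k)  # K starting centers drawn from the data
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                return None  # empty cluster: treat the run as degenerate
            centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def bic(points: List[Point], centers: List[Point], labels: List[int]) -> float:
    """BIC = -2 log L + p log n under the simplified model (assumption)."""
    n, d, k = len(points), len(points[0]), len(centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    var = max(sse / (n * d), 1e-12)  # shared variance; guard against log(0)
    sizes = [labels.count(j) for j in range(k)]
    log_lik = (sum(s * math.log(s / n) for s in sizes)
               - 0.5 * n * d * math.log(2.0 * math.pi * var)
               - sse / (2.0 * var))
    n_params = k * d + (k - 1) + 1  # means, mixing weights, variance
    return -2.0 * log_lik + n_params * math.log(n)


def select_model(points: List[Point], k_values: Sequence[int],
                 restarts: int = 30, seed: int = 0):
    """Score every non-degenerate run and keep the one with the lowest BIC."""
    rng = random.Random(seed)
    best = None  # (score, k, centers, labels)
    for k in k_values:
        for _ in range(restarts):
            run = kmeans(points, k, rng)
            if run is None:
                continue
            score = bic(points, *run)
            if best is None or score < best[0]:
                best = (score, k, run[0], run[1])
    return best


# Synthetic data with three known, well-separated clusters.
demo_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
data = [(cx + demo_rng.gauss(0, 0.5), cy + demo_rng.gauss(0, 0.5))
        for cx, cy in blob_centers for _ in range(50)]
best_score, best_k, best_centers, best_labels = select_model(data, range(1, 6))
print("selected K:", best_k)
```

Multiple restarts matter here because a single run of Lloyd's algorithm can converge to a poor local optimum; scoring all runs on a common scale lets runs with different K and different starting centers be compared directly.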

We implement Centaurus using production-quality, open-source software and validate it using synthetic datasets with known clusters. We also apply Centaurus in the context of a real-world agricultural analytics application and compare its results to the industry-standard clustering approach. The application analyzes fine-grained soil electrical conductivity (EC) measurements, GPS coordinates, and elevation data from a field to produce a “map” of differing soil zones. These zones can then be used by farmers and farm consultants to customize the management of different zones on the farm (Fridgen et al., Moral et al., Fortes et al., Corwin & Lesch). We compare Centaurus to the state-of-the-art clustering tool for farm management zone identification and show that Centaurus is more robust, obtains more accurate clusters, and requires significantly less input and effort from its users. In the sections that follow, we provide some background on the use of EC for agricultural zone management. We then describe the general form of the k-means algorithm, the variants for computing covariance matrices, and the scoring method that Centaurus employs. Following this, we present our datasets, an empirical evaluation of Centaurus, and related research, and we summarize our contributions.

The soil health of a field can vary significantly and change over time due to human activity and forces of nature. To optimize yields, farmers increasingly rely on site-specific farming in which a field is divided into contiguous regions, called zones, with similar soil properties. Agronomic strategies are then tailored to specific zones to apply inputs precisely, to lower costs and input use, and ultimately to increase yields. Management zone boundaries can be determined with many different procedures: soil surveys with or without other measurements (Bell et al., Kitchen et al.); spatial distribution estimates of soil properties obtained by interpolating soil sample data (Mausbach et al., Wollenhaupt et al.); fine-grained soil electrical conductivity measurements (Mulla et al., Jaynes et al., Sudduth, Rhoades et al., Sudduth et al., Corwin & Lesch, Veris); and a combination of sensing technologies (Adamchuk et al.). EC-based zone identification is widely used because it addresses many of the limitations of the other approaches: it is inexpensive, it can be repeated over time to capture changes, and it produces useful and accurate estimates of many yield-limiting soil properties, including compaction, water holding capacity, and chemical composition.

As a result, EC-based management tools are used extensively for a wide variety of field plants (Peeters et al., Aggelopooulou et al., Gili et al.). To collect EC data, EC sensors are typically attached to a GPS-equipped tractor or all-terrain vehicle and pulled across a field to collect measurements at multiple depths and at a very fine spatial grain. EC maps generated from this data can be used either to define management zones directly or to inform the locations of future, more extensive soil sampling (Veris, Lund et al.). Alternatively, EC values can be clustered into related regions using fast, automated, unsupervised statistical clustering techniques such as k-means and its variants (Bezdek, Murphy, Fridgen et al., Molin & Castro, Fraisse et al.). Given the potential and widespread use of EC-based zone identification tools that rely on automated unsupervised algorithms, in this chapter we investigate the impact of using different k-means implementations and deployment strategies for EC-based management zone identification. We consider different algorithm variants, different numbers of randomized runs, and the frequency of degenerate runs: algorithm solutions that are statistically questionable because they include empty clusters, clusters with too few data points, or clusters that share the same cluster center (Brimberg & Mladenovic). To compare k-means solutions, we define a model selection framework that uses the Bayesian Information Criterion (BIC) (Schwarz) to score and select the best model. Past work has used BIC to score models for the univariate normal distribution (Pelleg et al.). Our work extends this use to multivariate distributions and multiple k-means variants.

The k-means algorithm attempts to find a set of cluster centers that describe the distribution of the points in the dataset by minimizing the sum of the squared distances between each point and its cluster center.
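In symbols, the quantity being minimized can be written as follows. The notation is introduced here only for illustration and is an assumption rather than the source's own: $x_i$ is a data point, $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is the center of that cluster.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}
```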
For a given number of clusters K, it first assigns the cluster centers by randomly selecting K points from the dataset. It then alternates between assigning points to the cluster represented by the nearest center and recomputing the centers (Lloyd, Bishop), while decreasing the overall sum of squared distances (Linde et al.). The sum of squared distances between data points and their assigned cluster centers provides a way to compare local optima: the lower the sum of the distances, the closer a specific clustering is to a global optimum. Note that it is possible to use distance metrics other than Euclidean distance to account for per-cluster differences in variance or covariance between data features. Thus, for a given dataset, the algorithm can generate a number of different k-means clusterings, one for each combination of starting centers, distance metric, and method used to compute the covariance matrix. Centaurus integrates both Euclidean and Mahalanobis distance. The computation of Mahalanobis distance requires computation of a covariance matrix for the dataset. In addition, each of these approaches for computing the covariance matrix can be Tied or Untied. Tied means that we compute a covariance matrix per cluster, take the average across all clusters, and then use the averaged covariance matrix to compute distance. Untied means that we compute a separate covariance matrix for each cluster, which we use to compute distance. Using a tied set of covariance matrices assumes that the covariance among dimensions is the same across all clusters, and that the variation in the observed covariance matrices is due to sampling variation. Using an untied set of covariance matrices assumes that each cluster is different in terms of its covariance between dimensions.

Users upload their datasets to the web service frontend as files in a simple format: comma-separated values (CSV).
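To make the Tied and Untied distinction concrete, the following standard-library Python sketch computes Mahalanobis distances both ways for a toy example. It is an illustration under stated assumptions, not the Centaurus code: the helper names (`covariance`, `tied_covariance`, `mahalanobis_sq`, `nearest`) are hypothetical, the data is restricted to two dimensions so that the 2x2 matrix inverse can be written out by hand, and the tied matrix is taken as the unweighted average of the per-cluster matrices.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Cov = Tuple[float, float, float]  # (s_xx, s_xy, s_yy) of a symmetric 2x2 matrix


def centroid(points: Sequence[Point]) -> Point:
    """Mean of a set of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def covariance(points: Sequence[Point]) -> Cov:
    """Maximum-likelihood covariance matrix of one cluster."""
    mx, my = centroid(points)
    n = len(points)
    s_xx = sum((x - mx) ** 2 for x, _ in points) / n
    s_xy = sum((x - mx) * (y - my) for x, y in points) / n
    s_yy = sum((y - my) ** 2 for _, y in points) / n
    return (s_xx, s_xy, s_yy)


def tied_covariance(clusters: Sequence[Sequence[Point]]) -> Cov:
    """Tied case: average the per-cluster covariance matrices."""
    covs = [covariance(c) for c in clusters]
    k = len(covs)
    return (sum(c[0] for c in covs) / k,
            sum(c[1] for c in covs) / k,
            sum(c[2] for c in covs) / k)


def mahalanobis_sq(p: Point, center: Point, cov: Cov) -> float:
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    s_xx, s_xy, s_yy = cov
    det = s_xx * s_yy - s_xy * s_xy  # must be non-zero (non-singular matrix)
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (s_yy * dx * dx - 2.0 * s_xy * dx * dy + s_xx * dy * dy) / det


def nearest(p: Point, centers: List[Point], covs: List[Cov]) -> int:
    """Index of the cluster whose center is closest in Mahalanobis distance."""
    return min(range(len(centers)),
               key=lambda j: mahalanobis_sq(p, centers[j], covs[j]))


cluster_a = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]  # wide in x
cluster_b = [(5.0, 0.0), (7.0, 0.0), (6.0, -2.0), (6.0, 2.0)]   # wide in y
centers = [centroid(cluster_a), centroid(cluster_b)]

untied = [covariance(cluster_a), covariance(cluster_b)]  # one matrix per cluster
tied = [tied_covariance([cluster_a, cluster_b])] * 2     # one shared matrix

query = (3.5, 0.0)
print(nearest(query, centers, untied))  # 0: cluster A's wide x-spread reaches the point
print(nearest(query, centers, tied))    # 1: with a shared matrix, B's center is closer
```

The same query point is assigned to different clusters under the two assumptions, which is why the choice between tied and untied covariance yields distinct candidate clusterings to be scored.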