Hypatia presents users with an easy-to-use interface that is accessible from any web browser.

In the third column, we show results for training during sunny days followed by prediction during rainy periods. January 2nd, 3rd, and 4th were days without precipitation, followed by three days with 1.29, 1.06, and 1.0 inches of precipitation, respectively. The results show that the model trained only on three rainy days had errors slightly higher than when tested on sunny days, while the model trained on sunny days behaved similarly to the models we discussed before, even when tested on rainy days. Part of our future work is to expand the test cases to more variable weather conditions. However, these results indicate that the prediction errors are robust to what are essentially “shocks” to the temperature time series in the explanatory weather data and the predicted variables. Because the CPUs were in sealed containers, the effects of precipitation on the CPU series are less pronounced. Still, the errors are largely unaffected by precipitation.

Figure 4.7 illustrates the errors when predicting DHT-1 temperature with different subsets of explanatory variables. We observe that if we rely only on the nearby weather station, the error is much higher than for a subset that includes at least one of the CPU temperatures. Farmers today often use only a weather station temperature reading when implementing manual frost prevention practices. Often, though, the weather station they choose to use for the outdoor temperature is even farther away from the target growing block than the station we use in this study. Notice, also, that when the CPU that is directly connected to the DHT is not included, the errors are higher than when it is included. Thus, as one might expect, proximity plays a role in determining the error. However, using only the attached CPU generates a higher MAE than all CPUs and the weather station together. Indeed, the best performing model is the one that uses all four CPU temperatures and WU-T measurements as explanatory variables, yielding an MAE < 0.5 °F across all time frames.
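To make the subset comparison concrete, the following is a minimal sketch of how one might score every subset of explanatory variables by its MAE, assuming the aligned temperature series live in a pandas DataFrame. The column names (cpu1..cpu4, wu_t, dht1) and the simple chronological split are illustrative placeholders, not the actual dataset schema or evaluation protocol.

```python
# Sketch: compare MAE across subsets of explanatory variables when
# predicting the DHT-1 temperature. Column names are hypothetical.
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def evaluate_subsets(df: pd.DataFrame, target: str = "dht1") -> dict:
    features = ["cpu1", "cpu2", "cpu3", "cpu4", "wu_t"]
    # Simple chronological split: first half trains, second half tests.
    train, test = df.iloc[: len(df) // 2], df.iloc[len(df) // 2:]
    results = {}
    for r in range(1, len(features) + 1):
        for subset in combinations(features, r):
            cols = list(subset)
            model = LinearRegression().fit(train[cols], train[target])
            pred = model.predict(test[cols])
            results[subset] = mean_absolute_error(test[target], pred)
    return results  # min(results, key=results.get) is the best subset
```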

Thus, using the nearest CPU improves accuracy, but using only the nearest CPU does not yield the most accurate prediction. Finally, while the weather station data does not generate an accurate prediction by itself, including it does improve the accuracy over leaving it out. In summary, our methodology is capable of automatically synthesizing a “virtual” temperature sensor from a set of CPU measurements and externally available weather data. By including all of the available temperature time series, it automatically “tunes” itself to generate the most accurate predictions even when one of the explanatory variables is, by itself, a poor predictor. These predictions are durable, with errors often at the threshold of measurement error on average, and relatively insensitive to seasonal and meteorological effects, as well as to typical CPU loads in the frost-prevention setting where we have deployed it as part of an IoT system.

There are no studies of which we are aware that use the devices themselves as thermometers. To enable this, we estimate the outdoor temperature from CPU temperature using linear regression (Hastie et al.) over temperature time series. Others have shown that such estimation is useful for other applications and analyses (Guestrin et al., Xie et al., Lane et al., Yao et al.). Our work is complementary to these and is unique in that it combines SSA with regression to improve prediction accuracy. As in other work, we leverage edge computing to facilitate low-latency response and actuation for IoT systems (Alturki et al., Feng et al.).

With the prior chapters, we have contributed new methods for clustering correlated, multidimensional data and for synthesizing virtual sensors using the data produced from combinations of other sensors. We next unify these advances into a scalable, open-source, end-to-end system called Hypatia. We design Hypatia to permit multiple analytics algorithms to be “plugged in” and to simplify the implementation and deployment of a wide range of data science applications. Specifically, Hypatia is a distributed system that automatically deploys data analytics jobs across different cloud-like systems. Our goal with Hypatia is to provide low-latency, reliable, and actionable analytics, machine learning model selection, error analysis, data visualization, and scheduling in a unified, scalable system. To enable this, Hypatia places this functionality “near” the sensing devices that generate data, at the edge of the network. It then automates the process of distributing the application execution across different computational tiers: “edge clouds” and public/private cloud systems.
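As a sketch of the SSA-plus-regression combination mentioned above, the following smooths each explanatory series with a basic singular spectrum analysis reconstruction (keeping the leading components) before fitting the regression. The window length and component count are illustrative choices, not the tuned values used in this work.

```python
# Sketch: SSA smoothing followed by linear regression. Window and
# component counts are illustrative, not the dissertation's settings.
import numpy as np
from sklearn.linear_model import LinearRegression

def ssa_smooth(x: np.ndarray, window: int = 48, n_components: int = 3) -> np.ndarray:
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: each column is a lagged window of x.
    traj = np.column_stack([x[i : i + window] for i in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    # Rank-truncated reconstruction keeps the slow-varying signal.
    approx = (u[:, :n_components] * s[:n_components]) @ vt[:n_components]
    # Diagonal averaging maps the matrix back to a length-n series.
    smoothed = np.zeros(n)
    counts = np.zeros(n)
    for col in range(k):
        smoothed[col : col + window] += approx[:, col]
        counts[col : col + window] += 1
    return smoothed / counts

def fit_ssa_regression(cpu_series: np.ndarray, y: np.ndarray) -> LinearRegression:
    # cpu_series: (n_samples, n_features) raw temperatures; y: target series.
    X = np.column_stack([ssa_smooth(col) for col in cpu_series.T])
    return LinearRegression().fit(X, y)
```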

Hypatia does so to reduce the response latency of applications so that data-driven decisions can be made by people and devices at the edge more quickly. Such edge decision making is important for a wide range of application domains, including agriculture, smart cities, and home automation, where decisions, actuation, and control are all local and make use of information from the surrounding environment. Hypatia automatically deploys and scales tasks on-demand, both locally and remotely, if/when there are insufficient resources at the edge.

Users can choose the algorithms they need for data analysis and prediction and select the dataset they are interested in. Hypatia iterates through the list of available parameters, training and scoring multiple models for each parameter set. It then selects those with the best score. Such model selection can be used to provide data-driven decision support for users as well as to actuate and control digital and physical systems. In this chapter, we focus on Hypatia support for clustering and regression. The Hypatia scheduler automates distributed deployment across edge and cloud systems to minimize time to completion. It uses the computational and communication requirements of model training, testing, and inference to make placement decisions for the independent jobs that comprise a workload. For data-intensive workloads, Hypatia prioritizes the use of the edge cloud. For compute-intensive jobs, Hypatia prioritizes public/private cloud use.

Hypatia is an online platform for distributed cloud services that implement common data analytics utilities. It takes advantage of cloud-based, large-scale distributed computation, provides automatic scaling, and implements data management and user interfaces in support of visualization and browser-based user interaction. Hypatia currently supports two key building blocks for popular statistical analysis and machine learning applications: clustering and linear regression. For clustering, Hypatia implements different variants of k-means clustering. The variants include different distance computations, input data scaling, and the six combinations of covariance matrices. Hypatia runs each configuration for successive values of K ranging from 1 to a user-assigned maximum, max_k. For each clustering, Hypatia computes a pair of scores based on the Bayesian Information Criterion (Schwarz) and the Akaike Information Criterion (Akaike). Hypatia allows the user to change the number of independent, randomly seeded runs to account for statistical variation. Finally, it provides ways for the user to graph and visualize both two-dimensional “slices” of all clusterings as well as the relative BIC and AIC scores.
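The K-sweep with BIC/AIC scoring can be sketched as follows. Here we use scikit-learn's GaussianMixture as a stand-in for Hypatia's k-means variants because it exposes bic() and aic() directly; max_k and n_runs mirror the user-set knobs, and the lowest-BIC recommendation mirrors the report-page behavior described below.

```python
# Sketch: sweep K from 1 to max_k with multiple random seeds, scoring
# each fit by BIC and AIC. GaussianMixture stands in for Hypatia's
# k-means variants; it is not Hypatia's actual implementation.
import numpy as np
from sklearn.mixture import GaussianMixture

def sweep_k(X: np.ndarray, max_k: int = 10, n_runs: int = 5):
    scores = {}
    for k in range(1, max_k + 1):
        best_bic, best_aic = np.inf, np.inf
        for seed in range(n_runs):  # independent, randomly seeded runs
            gm = GaussianMixture(n_components=k, random_state=seed).fit(X)
            best_bic = min(best_bic, gm.bic(X))
            best_aic = min(best_aic, gm.aic(X))
        scores[k] = (best_bic, best_aic)
    best_k = min(scores, key=lambda k: scores[k][0])  # lowest BIC wins
    return best_k, scores
```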

It uses these scores to provide decision support for the user – e.g., presenting the user with the “best” clustering across all variants. For linear regression, Hypatia implements different approaches for analyzing correlated, multidimensional data (Golubovic et al.). Since we focus on synthesizing new sensors, we are looking for the most important inputs from other sensors that can be used to accurately estimate a synthesized measurement. Hypatia allows users to decide on the number of input variables and which ones to use. They can also specify the start time of the test, the duration of the training and testing periods, and the scoring metric to use. Users also choose whether or not to smooth the input data using different techniques. Finally, to predict outdoor temperature, users can select nearby single-board computers and/or weather stations. Once the user makes these choices or accepts/modifies the defaults, Hypatia creates an experiment with as many tasks as there are parameter choices (see the sketch below). Each task produces a linear regression model with coefficients for each input variable and a score that can be used for model selection. As is done for clustering, Hypatia scores the various parameterizations using the scoring metric to provide decision support to users. The user can then use the visualization tools to verify the similarity between input variables and estimated sensor measurements.

Hypatia is unique in that it is extensible – different data analytics algorithms can be “plugged in” easily, and automatically deployed with and compared to others. Users can also extend the platform with both scoring and visualization tools. Visualization is particularly important when some of the sensors are faulty or unreliable, or when some of the smoothing or filtering techniques do not produce the desired outcome. Figure 5.1 shows such an example, where visualization is used to show growers how soil moisture responds to precipitation and temperature on the east and west sides of a tree in an almond grove at depths of 1 foot and 2 feet. Being able to understand how significant each parameter is to soil moisture provides decision support that can be used to guide irrigation and harvest.

To implement Hypatia, we have developed a user-facing web service and a distributed, cloud-enabled backend. Users upload their datasets to the web service front end as files in a common, simple format: comma-separated values (CSV). The user interface also enables users to modify the various algorithms and their parameters, or accept the defaults.
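The experiment-generation step referenced above can be illustrated by expanding the cross-product of user choices into one task per parameter combination. The parameter names and values below are hypothetical examples, not Hypatia's actual configuration schema.

```python
# Sketch: expand user choices into one task per parameter combination.
# Parameter names and values are illustrative, not Hypatia's schema.
from itertools import product

choices = {
    "inputs": [("cpu1",), ("cpu1", "wu_t"),
               ("cpu1", "cpu2", "cpu3", "cpu4", "wu_t")],
    "train_days": [3, 7, 14],
    "smoothing": [None, "ssa"],
    "metric": ["mae"],
}

tasks = [dict(zip(choices, combo)) for combo in product(*choices.values())]
# Each task trains one regression model and reports its score; the
# platform then ranks all tasks by the chosen metric for model selection.
print(len(tasks))  # 3 * 3 * 2 * 1 = 18 tasks for this experiment
```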

Hypatia considers each parameterization that the user chooses as a “job”. Each job consists of multiple tasks that Hypatia deploys. Users can also use the service to check the status of a job or to view the report and results for a job. The status page provides an overview of all the tasks for a job, showing a progress bar for the percentage of tasks completed and a table showing task parameters and outcomes. Hypatia uses a report page to provide its recommendation for both analysis building blocks, clustering and regression. For clustering, the recommendation consists of the number of clusters and the k-means variant that produces the best BIC score. This page also shows the cluster assignments, spatial plots using longitude and latitude, and BIC and AIC score plots. Hypatia also provides cluster labels in CSV files that the user can download. For regression, the report page consists of a map of error analysis for each model, grouped by their parameters. Users can quickly navigate to the model with the smallest error.

The software architecture of Hypatia is shown in Figure 5.2. We implement Hypatia using Python v3.6 and integrate a number of open-source software packages and cloud services. At the edge, Hypatia uses a small private cloud that runs Eucalyptus software v4.4 (Nurmi et al., Aristotle). The public cloud is Amazon Web Services Elastic Compute Cloud. Hypatia integrates virtual servers from these two cloud systems with different capabilities, which we describe in our empirical methodology. Hypatia is deployed on an edge cloud and on a private/public cloud if available. We assume that the edge cloud has limited resources and is located near where data is produced by sensors. The public cloud provides vast resources and is located across a long-haul network with varying performance and perhaps intermittent connectivity. We use N_EC to denote the number of machines available in the edge cloud and N_PC to denote the number of machines available in the public cloud, where N_EC << N_PC.

Users submit multiple jobs to the edge system. Each job describes the datasets to be used for training, testing, and inference or analysis. In some jobs we can assume that the entire dataset is needed, while in others we can assume that the data can be split and tasks within the job can operate on different parts of the dataset in parallel. Each job has n tasks. In the numerous jobs that we have evaluated over the course of this dissertation, we have observed that, for the applications we have studied, n can range from tens of tasks to millions of tasks. We consider tasks from the same job as having the same “type”. To estimate the time each task will take to complete its data transfer and computation, we compute an average t_i for each job i across past tasks of the same type. Each task fetches its dataset upon invocation.
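A minimal sketch of this placement idea follows: each task's runtime is estimated from the running average of past tasks of the same type, and the task is greedily assigned to whichever tier (edge or public cloud) would finish it soonest, with a transfer penalty over the long-haul link. The greedy rule and the penalty value are simplifications for illustration, not Hypatia's actual scheduling algorithm.

```python
# Sketch: estimate per-type task runtimes from history and greedily
# place tasks on the edge or public cloud. Illustrative only.
import heapq
from collections import defaultdict

class Placer:
    def __init__(self, n_ec: int, n_pc: int):
        # One entry per machine: the time at which it becomes free.
        self.edge = [0.0] * n_ec
        self.cloud = [0.0] * n_pc
        self.history = defaultdict(list)  # job type -> past runtimes

    def estimate(self, job_type: str) -> float:
        past = self.history[job_type]
        return sum(past) / len(past) if past else 1.0  # default guess

    def place(self, job_type: str, transfer_penalty: float = 2.0) -> str:
        t = self.estimate(job_type)
        # Data-intensive tasks pay a transfer penalty over the long-haul
        # link to the public cloud (penalty value is a placeholder).
        edge_done = self.edge[0] + t
        cloud_done = self.cloud[0] + t * transfer_penalty
        if edge_done <= cloud_done:
            heapq.heapreplace(self.edge, edge_done)
            return "edge"
        heapq.heapreplace(self.cloud, cloud_done)
        return "cloud"

    def record(self, job_type: str, runtime: float) -> None:
        self.history[job_type].append(runtime)
```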