A major limitation of this study is that the reference dataset was not entirely accurate.

The plotted detections in Figure 10 and Figure 12 are counted in the overall AP and AR metrics as successful detections because the IoU of the detection and reference label is greater than 50%. Yet, many pixels that belong to individual center pivot fields are not included in the detection. Across similar scenes, Mask R-CNN does not appear to have the consistent boundary eccentricity bias that FCIS has. Because Mask R-CNN showed better boundary accuracy along with comparably high performance metrics relative to the FCIS model, further comparisons and visual examples comparing accuracy across different field size ranges and with different training dataset sizes used Mask R-CNN instead of FCIS. Many scenes are more complex than the arid landscape with fully cultivated center pivots shaped like full circles in Figure 10. Figure 11 is a representative example of a scene with more complex fields. These include non-center pivot fields; center pivots containing two halves, quarters, or other fractional portions in different stages of development; and partial pivots, which are semicircular not because the rest of the circle is in a different stage of development or cultivation, but because another landscape feature restricts the field's boundary. At least 25% of the detections in this scene fall below the 90% confidence threshold, and many atypical pivots are missed at this threshold. Figure 12 is a simpler scene that, like Figure 10, has a high density of center pivot fields. In this case the detections more closely match the reference labels and detection confidence scores are higher, either because of the FCIS model's tendency to produce higher confidence scores or because the scene has less variation in center pivot field type. Figure 13 highlights another common issue when testing both models: reference labels that are truncated by the tiling grid used to make 128 × 128 pixel samples from Landsat 5 scenes.
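To make the matching criterion concrete, the following is a minimal sketch of how a detection would be counted as correct under the IoU > 50% rule, assuming the predicted and reference instance masks are boolean NumPy arrays; the function names are illustrative rather than the actual evaluation code used in this study.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, ref_mask: np.ndarray) -> float:
    """Intersection over union of two boolean instance masks."""
    intersection = np.logical_and(pred_mask, ref_mask).sum()
    union = np.logical_or(pred_mask, ref_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# A detection counts as a true positive in the AP/AR calculation when its
# IoU with a reference label exceeds 0.5, even if many pixels belonging
# to the reference pivot are missing from the predicted mask.
def is_true_positive(pred_mask: np.ndarray, ref_mask: np.ndarray,
                     iou_threshold: float = 0.5) -> bool:
    return mask_iou(pred_mask, ref_mask) > iou_threshold
```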

These tend to be missed detections under the 90% confidence threshold since they are not mapped with a high confidence score. There are cases where no detections above the 90% confidence threshold are produced at all, such as in Figure 14. In this scene, No Data values in the Landsat 5 imagery, partial pivots near non-center pivot fields, and mixed pivots with indistinct boundaries all result in a scene that is not mapped with high confidence. However, in cases where there is high contrast between center pivot fields and their surrounding environment, they are mapped nearly perfectly by the Mask R-CNN model with high confidence scores. Figures 16 through 18 were selected to illustrate the impact that size range has on detection accuracy, since center pivots come in circular and semicircular shapes of various sizes. Figure 16 shows that in a scene with no reference labels, no high confidence detections were produced for any size category. The highest confidence score associated with an erroneous detection in this case was ~0.66, which is relatively low for both models. Figure 17 shows a case where a large center pivot is mapped accurately, whereas smaller and medium center pivots in the scene are not. This example shows the scale invariance of the Mask R-CNN model in that it can accurately map the large center pivot because it looks similar to a medium sized center pivot, only larger. On the other hand, smaller center pivots, partial pivots, and mixed pivots are detected with lower confidence or not at all. Figure 18 highlights a case where a large center pivot is not mapped with a confidence score above 90%. Unlike Figure 17, where the large center pivot is uniform in appearance, the large center pivot in Figure 18 contains a mixture of three different land cover types. This indicates that inaccuracies in large center pivot detection may come from pivots that were partially cultivated or divided into multiple portions with different crop types or cultivation stages. Many false negatives in this scene and others are the result of partial pivots, mixed pivots, or pivots that had not yet been annotated in the 2005 dataset.
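As a sketch of how the 90% confidence screening used throughout these examples could be applied, the filter below keeps only high confidence detections; the detection structure (a list of dicts with a 'score' key) is an assumption for illustration, not the actual output format of either model.

```python
def high_confidence_detections(detections, threshold=0.9):
    """Filter model outputs down to the high confidence set.

    `detections` is assumed to be a list of dicts, each carrying a
    'score' key (and, e.g., a 'mask' key) produced by an inference
    step; anything below the threshold is treated as a missed
    detection for mapping purposes.
    """
    return [d for d in detections if d["score"] >= threshold]
```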

Small fields are more difficult to detect than large ones, so as expected, removing 50% of the training data available to train the Mask R-CNN model caused large drops in performance. Having more training data available to improve features that are attuned to detecting small fields is particularly important for overall model performance. The metric results for small fields are likely biased toward a worse result because many full pivots overlapped a sample image boundary, leading to small, partial areas of pivot irrigation at scene edges being overrepresented after Landsat scenes were tiled into 128 × 128 image chips. Since these fields have a less distinctive shape, some full pivots at scene edges were missed. However, since small fields make up a minority of the total population of fields and the medium and large categories were more accurate by 20 or more percentage points for both AR and AP, both the FCIS and Mask R-CNN models can map a substantial majority of pivots with greater than 50% intersection over union. Zhang et al. tested their model on the same geographic locations in a different year and used samples produced from two Landsat scenes to train their model over a 21,000 km² area, versus the Nebraska dataset, which spans 200,520 km². These results extend the work of Zhang et al., as the test set is geographically independent from both the training and validation sets and 32 Landsat 5 scenes across a large geographic area were used to train and test the model. Furthermore, Zhang et al.'s approach produces bounding boxes for each field, while Mask R-CNN produces instance segmentations that can be used to count fields and identify pixels belonging to individual fields. While comparing metrics is useful, they do not indicate how performance varies across different landscapes or how well a detection's boundary matches the reference, given that a detection is determined to be correct by having an IoU over 50%. While the FCIS model slightly outperformed the Mask R-CNN model in terms of average precision for the medium size category, it also exhibited poorer boundary fidelity. Figure 10 demonstrates arbitrarily boxy boundaries that appear to be truncated by the position sensitive score map voting process that is the final stage of the FCIS model.
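The edge truncation bias described above can be illustrated with a minimal sketch of the tiling step, assuming a scene held as a NumPy array; the grid logic is illustrative rather than the exact preprocessing code used here.

```python
import numpy as np

def tile_scene(scene: np.ndarray, tile_size: int = 128):
    """Split a (height, width, bands) scene into non-overlapping tiles.

    Pivots that straddle a tile boundary are truncated, so full circles
    near tile edges appear in the chips as small partial arcs, inflating
    the small size category with hard-to-detect fragments.
    """
    height, width = scene.shape[:2]
    tiles = []
    for row in range(0, height - tile_size + 1, tile_size):
        for col in range(0, width - tile_size + 1, tile_size):
            tiles.append(scene[row:row + tile_size, col:col + tile_size])
    return tiles
```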

The Mask R-CNN model's high confidence detections, those with a confidence score of 0.9 or higher, matched the reference boundaries much more closely, showing that this model can be usefully applied to delineate center pivot agriculture in a complex, humid landscape. While the FCIS model could also be employed with post-processing to assign a perfect circle to the center of an FCIS detection, this would introduce further errors by overestimating the size and misrepresenting the shape of partial center pivots. The results from Mask R-CNN on medium sized fields are encouraging because they indicate that the model could potentially generalize well to semi-arid and arid regions outside of Nebraska. In addition, the results from Figures 10, 12, and 15 indicate that where many uniform center pivots are densely colocated and few non-center pivot fields are present, a greater number of fields will be mapped correctly. This is encouraging, since in many parts of the world center pivots are densely colocated or are cultivated in semi-arid or arid environments where contrast is high. Therefore, the model can be expected to generalize well outside of Nebraska, though testing the model in other regions remains future work. False negatives are present in many heavily cultivated scenes, and in many cases it is ambiguous whether the absence of a high confidence detection is due to the absence of a center pivot or to Landsat's inability to resolve fuzzier boundaries between a field and its surrounding environment. A time series based approach similar to Deines et al. could improve detections so that only pivots exhibiting a pattern of increased greenness would be detected in a given year. However, this requires multiple images within a growing season from a Landsat sensor, which are not always available due to clouds, and it is difficult to incorporate into a CNN based segmentation method because it precludes the use of pretrained networks, which ease the computational burden of training and detection.
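As for the circle-assignment post-processing mentioned above, one possible implementation is sketched below using OpenCV's minimum enclosing circle; this is an illustrative routine, not part of either model as tested, and it exhibits exactly the failure mode noted in the text when applied to partial pivots.

```python
import numpy as np
import cv2

def circle_from_detection(mask: np.ndarray) -> np.ndarray:
    """Replace a detection mask with its minimum enclosing circle.

    Useful for tidying full pivots, but it overestimates the area and
    misrepresents the shape of partial pivots whose boundary is clipped
    by another landscape feature.
    """
    ys, xs = np.nonzero(mask)
    points = np.column_stack([xs, ys]).astype(np.float32)
    (cx, cy), radius = cv2.minEnclosingCircle(points)
    yy, xx = np.mgrid[:mask.shape[0], :mask.shape[1]]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
```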

Another alternative is to develop higher quality annotated datasets that make meaningful semantic distinctions between agriculture in different stages of cultivation. For example, in Figure 11, brown center pivots that are not detected could instead be labeled as "fallow" or "uncultivated", and this information could be used to refine the samples used to train a model to segment pivots within specific cultivation stages (a sketch of this filtering step appears at the end of this section). With 4 hours of training on 8 GPUs, the original implementation of Mask R-CNN achieved 37.1 AP using a ResNet-101-FPN backbone on the COCO dataset, a large improvement over the best FCIS model tested, which achieved 33.6 AP. This amounts to a difference of 3.5 AP percentage points, with Mask R-CNN performing better on the COCO dataset. On the Nebraska dataset, for the medium size category, the difference in AP was 3.2 percentage points, with the FCIS model outperforming Mask R-CNN. However, Mask R-CNN outperformed FCIS in terms of AR by 5.1 percentage points. These results indicate that COCO detection baselines are not necessarily reflective of overall metric performance, given that FCIS outperformed Mask R-CNN in the more numerous size category. The improvements on the COCO baseline do, however, reflect the improved boundary accuracy of Mask R-CNN relative to the FCIS model. The AR and AP results on the Nebraska center pivot dataset are higher than those of Rieke, which is to be expected since center pivots are a simpler detection target than fields in the Denmark dataset, which vary more in shape and size. What is especially notable is that even though Rieke trained the FCIS model on approximately 11 times the training data used in this study, the AP and AR results on the Nebraska dataset were about 10 to 20 points higher for each of the size categories. The difference in AP was 0.42 − 0.28 = 0.14 for the small category, 0.732 − 0.473 = 0.259 for the medium category, and 0.734 − 0.51 = 0.224 for the large category. Rieke used 159,042 samples compared to the 13,625 samples used in this study; both studies used samples of the same size, 128 × 128 pixels. Even though the size categories used in this study and in Rieke are not exactly the same, the fact that each category saw substantially better performance for the FCIS model on the Nebraska dataset, despite an order of magnitude less training data, indicates that the relative simplicity of the center pivot detection target played a substantial role in the jump in performance. This is an important lesson for remote sensing researchers looking to use CNN techniques to map fields or other land cover objects: the feasibility of mapping the detection target can be even more important than using an order of magnitude more training data to improve a model's ability to generalize. These results are comparable to those achieved for other detection targets. Wen et al. applied a slightly modified version of Mask R-CNN that can produce rotated bounding boxes to segment building footprints in Fujian Province, China from Google Earth imagery. The model was trained on manually labeled annotations across a range of scenes containing buildings with different shapes, materials, and arrangements.
Though an independent test set separate from the validation set was not used, the model was tested on half of the imagery collected, while the other half was used to train the model, providing a large number of samples for testing. The total dataset split between training and testing/validation amounted to 2,000 images of 500 × 500 pixels containing 84,366 buildings.
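Returning to the annotation refinement idea raised earlier in this section, the sketch below shows how hypothetical cultivation-stage labels such as "fallow" or "uncultivated" could be used to filter training annotations; the 'stage' field is an assumption for illustration and is not part of the existing reference dataset.

```python
def filter_by_stage(annotations, keep_stages=("cultivated",)):
    """Keep annotations whose cultivation stage is in `keep_stages`.

    Each annotation is assumed to carry a hypothetical 'stage' field
    (e.g. 'cultivated', 'fallow', 'uncultivated') added during a
    refined labeling pass; annotations without the field are dropped.
    """
    return [a for a in annotations if a.get("stage") in keep_stages]
```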