
The top three diseases that compose this data set are Bacterial Spot, Huanglongbing, and Yellow Leaf Curl. Bacterial Spot, which makes up 10% of the data, is a bacterial disease that affects many crops, causing their leaves to develop yellow spots that turn brown in the middle; it also causes black or brown spots of rot on their fruits. Huanglongbing makes up another 10% of the data set. This bacterial disease affects citrus trees, causing their fruits to stay green and fall to the ground early, before becoming ripe. The disease is common across citrus, but keep in mind that this data set only has images of Huanglongbing affecting oranges. Yellow Leaf Curl makes up 9.9% of the data set and is a viral infection that affects only tomatoes. “Yellow leaf curl virus is undoubtedly one of the most damaging pathogens of tomatoes, and it limits the production of tomatoes in many tropical and subtropical areas of the world. It is also a problem in many countries that have a Mediterranean climate, such as California. Thus, the spread of the virus throughout California must be considered a serious potential threat to the tomato industry.” Note that diseased images make up 72.2% of this data set; the remaining 27.8% are healthy crop images. See Table 2.4 for a comparison of the number of diseased and healthy crop images in this data set. The crops that contributed diseased images are Apple, Bell Pepper, Cherry, Grape, Maize, Orange, Peach, Potato, Strawberry, Squash, and Tomato. The crops that contributed healthy images are Apple, Bell Pepper, Blueberry, Cherry, Grape, Maize, Peach, Potato, Raspberry, Strawberry, and Tomato. For a visual representation of what the diseased and healthy crop images look like, see Figure 2.1, which shows nine different crop images.

The crop images have their classification labels above them, identifying the crop name and the disease (or healthy status). The images in Figure 2.1 carry the classification labels Apple-Apple Scab, Apple-Healthy, Peach-Healthy, Grape-Healthy, Raspberry-Healthy, Soybean-Healthy, Grape-Black Rot, and Peach-Bacterial Spot.

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. This was the motivation for Dosovitskiy et al. to investigate transformer models for image classification tasks, which led to the vision transformer. The transformer architecture for natural language processing works similarly to a vision transformer. In natural language processing, sentences are broken down into words, and each word is treated as a sub-token of the original sentence. Similarly, the vision transformer breaks an image down into smaller patches, each representing a small sub-section of the original image (a short sketch of this patching step follows this paragraph). To see how sentences are broken into word tokens and images into patches, see Figure 2.2. Keep in mind that the position of each image patch is very important: if the image patches are out of order, then the original image will also be out of order. The vision transformer is designed to start with an extra learnable class embedding at position 0, which represents the start of the image and the sequence of patches to come. This extra learnable class embedding allows the model to learn embeddings specific to each classification label. “The pre-training function of the vision transformer is based solely on the classification label given; therefore, the learnable class embedding is even more important to successfully pre-training the vision transformer model.” Without the learnable class embedding, the transformer will not understand the classification labels attached to each image.
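As an illustration of the patching step, the following minimal PyTorch sketch (not from the thesis; the shapes assume the standard ViT-Base configuration of 16 × 16 pixel patches on a 224 × 224 image) carves an image into its sequence of flattened patches:

```python
import torch

# A 224 x 224 RGB image, as used by the pre-trained ViT model.
image = torch.randn(3, 224, 224)   # (channels, height, width)
patch_size = 16                    # ViT-Base uses 16 x 16 pixel patches

# Carve the image into non-overlapping patches along height and width:
# (3, 14, 14, 16, 16) -> 14 * 14 = 196 patches of 16 x 16 pixels each.
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)

# Flatten each patch into a single vector, keeping the patch order intact,
# since positional information matters to the model.
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([196, 768])
```

Each row of the resulting matrix is one patch, and the row order encodes the patch's original position in the image, which the positional embeddings described next are meant to preserve.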

To keep the order of the sequence of patches that make up the image, the patches are instilled with positional embeddings. These positional embeddings are learned during pre-training and, in some cases, during fine-tuning. “For the vision transformer, these positional embeddings are learned vectors with the same dimensionality as our patch embeddings. After creating the patch embeddings and pre-pending the classification label embedding, they are summed with the positional embeddings.” Finally, the summed embeddings are fed to the transformer encoder. After the entire image has been shown to the transformer encoder, the model has learned that image under the given classification label. For a visual example, see Figure 2.4, which demonstrates how the image patches are linearly projected and embedded, how the sequence receives a learnable class embedding, and how the result is fed to the transformer encoder so the model learns the given image under the given classification label. Dosovitskiy et al. also found that “ViT attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer data points. When pre-trained on the public ImageNet-21k data set or the in-house JFT-300M data set, ViT approaches or beats state of the art on multiple image recognition benchmarks.” Pre-training a vision transformer on a large data set such as ImageNet-21k thus helps make the model more effective [DBK21].

This project will implement the vision transformer developed by Dosovitskiy et al. and described in their paper. The framework of their ViT model will be accessed through the Hugging Face platform and its transformers package in Python. The vision transformer model comes pre-trained on ImageNet-21k, a benchmark data set consisting of 14 million images and 21k classes. The model has been pre-trained on images of pixel size 224 × 224; therefore, any data used to further train this model must also be of pixel size 224 × 224.
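As a rough sketch of how loading this pre-trained model might look in code (the checkpoint name is the publicly hosted ImageNet-21k ViT on Hugging Face; the class count of 38 for the Plant Village labels is an assumption, not a figure stated above):

```python
from transformers import ViTForImageClassification, ViTImageProcessor

checkpoint = "google/vit-base-patch16-224-in21k"  # pre-trained on ImageNet-21k

# The processor resizes and normalizes incoming images to the
# 224 x 224 pixel size the model was pre-trained on.
processor = ViTImageProcessor.from_pretrained(checkpoint)

model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=38,  # assumption: one output per Plant Village class label
)
```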

“Data augmentation is the process of transforming images to create new ones for training machine learning models.” “Data augmentation increases the number of examples in the training set while also introducing more variety in what the model sees and learns from. Both these aspects make it more difficult for the model to memorize mappings while also encouraging the model to learn general patterns. Data augmentation can be a good substitute when resources are constrained” because it artificially creates more of your data when it is not possible to get more data. In the case of this project, the function used to perform the data augmentations is set_transform from the Hugging Face Datasets package in Python. This function applies the data transformations only when model training begins; transformations can therefore be done on the fly, saving computational resources. At each epoch, the transformations are applied to every image given to the model, so the amount of training data stays constant while variation is added to the original data through transformations. This does not increase the number of training images as other data augmentation packages would; instead, it artificially augments the data with transformations and variation. A sketch of this setup appears below.

Data augmentation is an important step when training machine learning models: these models are very powerful, and if the given training data set is too small, they can start to over-fit, which is a problem because the model will then memorize mappings between the inputs and expected outputs. There are 54,306 images in this data set, which may seem like a lot of images, but for a machine learning model it is not that much. That is why data augmentation is implemented as a step to reduce possible model over-fitting.
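A minimal sketch of this on-the-fly augmentation, assuming the Plant Village images are loaded from a local folder and using torchvision transformations (the folder path and the specific transformations are illustrative, not taken from the thesis):

```python
from datasets import load_dataset
from torchvision.transforms import (Compose, Normalize, RandomHorizontalFlip,
                                    RandomResizedCrop, ToTensor)

dataset = load_dataset("imagefolder", data_dir="plant_village")  # hypothetical path

train_transforms = Compose([
    RandomResizedCrop(224),        # keep the 224 x 224 size ViT expects
    RandomHorizontalFlip(),        # add variation without adding new images
    ToTensor(),
    Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

def augment(batch):
    # set_transform applies this lazily, only when examples are accessed,
    # so each epoch sees a freshly re-transformed copy of every image.
    batch["pixel_values"] = [train_transforms(img.convert("RGB"))
                             for img in batch["image"]]
    return batch

dataset["train"].set_transform(augment)
```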
See Table 3.1 for the training results of the pre-trained ViT image classification model trained on the Plant Village data set. The table displays the model's Training Loss, Validation Loss, Precision, Recall, F1 score, and Accuracy over the 10 epochs. Training loss indicates how well the model fits the training data; validation loss indicates how well the model fits new data. The best model chosen was from epoch 10, shown in bold in Table 3.1. The model from epoch 10 has a training loss of 0.088 and a validation loss of 0.073, the lowest pair out of all the epochs. For this model, Precision, Recall, F1 score, and Accuracy all have the value 1.00. These results indicate that the model has a high positive identification rate. The model appears to have converged between epochs 8 and 10. The training results for epoch 10 show an Evaluation Samples per Second of 50.29, meaning the model can process and make predictions on 50.29 samples in one second. The Evaluation Steps per Second is 1.58, the number of iterations the model can complete in one second.

Model testing was done with the 15% of the data that was reserved for testing and never shown to the model during training. See Table 3.2 for the overall testing results of the model. The model has an Accuracy of 99%, meaning it guesses the classification label correctly 99% of the time. Total Time in Seconds, the time it took to test the image classification model on the test data set, is 1460.88 seconds (24.34 minutes). Samples per Second is 1.15, the number of data samples or instances the model can process and make predictions on in one second. Latency in Seconds is 0.86, the time it takes the model to make a single prediction. To see how the image classification model classifies an individual image, see Figure 3.2 and Table 3.3.
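For reference, the F1 score reported in Tables 3.1 and 3.2 is the harmonic mean of precision and recall, F1 = 2 × (Precision × Recall) / (Precision + Recall), so an F1 score of 1.00 is only possible when both precision and recall are perfect.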

Figure 3.2 shows an example image on which the image classification model was tested. The image has its true classification label above it, which is Apple-Apple Scab. Table 3.3 lists the scores and labels of the model's top five classification label predictions for the image in Figure 3.2. The score is on a scale from 0 to 1, with 1 meaning 100% confidence in the predicted classification label. The score is split across the five predictions, so the score values of all five predictions sum to 1. The model's top prediction, with a score of 0.91, is the classification label Apple-Apple Scab, which is the true classification label of the image in Figure 3.2. A code sketch reproducing this kind of top-five prediction appears at the end of this section.

The overall performance of the pre-trained ViT image classification model with data augmentation shows good promise. In the best epoch, F1 score, Accuracy, Precision, and Recall were all equal to 1.0. This is not an ideal outcome, but it shows there is room for improvement: the values of F1 score and Accuracy across epochs 1 to 8 suggest the model could be improved with more fine-tuning, and possibly with an early stop in training so the model does not over-train its way to a perfect 1.0 F1 score and Accuracy. Another possible explanation for this result is the large imbalance of images between classification labels, together with the fact that the Plant Village data set is considered small. Such results look good but are not realistic estimates of the model's predictive power. Even so, the training and validation losses are fairly low and slowly converging toward 0, which means the model is learning over time; this is a good sign and shows room for improvement.

A possible future improvement for this project would be to find another data set with more specific plant disease information to train the model on. Efforts will be made to look for data that includes more plant pests. That variation would be beneficial because the data set for this project mainly contains crop disease images. Spider Mite is labeled as a disease in this data set, but in reality it is a pest, not a disease; it is the only classification label with pest-damaged crop images in this data set. Having more data on crop pests would be beneficial because crop pests also cause a significant amount of crop loss and damage.
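A prediction like the one in Table 3.3 can be obtained with the Hugging Face image-classification pipeline; the model directory and image file name below are hypothetical placeholders, not paths from this project:

```python
from PIL import Image
from transformers import pipeline

# Hypothetical path to the fine-tuned ViT model from this project.
classifier = pipeline("image-classification", model="./vit-plant-village")

image = Image.open("apple_scab_example.jpg")  # hypothetical test image

# top_k=5 returns the five highest-scoring labels on a 0-to-1 scale,
# matching the five predictions shown in Table 3.3.
for prediction in classifier(image, top_k=5):
    print(f"{prediction['label']}: {prediction['score']:.2f}")
```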