CMSFF and CA are an effective combination to enhance category representation under few-shot conditions

The principle is that the features of samples belonging to the same category are close to each other, while the features of samples belonging to different categories are far apart. The earliest representative work is the Siamese Network, which is trained with positive and negative sample pairs. Vinyals et al. proposed Matching Networks, borrowing the "seq2seq+attention" concept to train an end-to-end nearest-neighbor classifier. Snell et al. proposed the Prototypical Network, which learns to match the prototype center of each class in semantic space from few samples. Sung et al. proposed the Relation Network, which concatenates the feature vectors of the support samples and the query samples to discover the relationships between classes. Li et al. proposed CovaMNet, based on covariance representation and a covariance metric of distribution consistency; the network extracts the second-order statistics of each category through a local covariance embedding to measure the consistency of query samples with the novel classes. Chen et al. proposed the Meta-Baseline method, which achieves good performance on several FSL benchmarks, reaching 83.74% on the 5-way, 5-shot task of Tiered-ImageNet and 90.95% on the 1-way, 5-shot task of Mini-ImageNet. Recently, FSL has begun to be used in research on plant disease identification. Argüeso et al. used the Siamese Network on the Plant Village (PV) dataset. Jadon proposed SSM-Net, which uses the Siamese framework and combines features from a convolutional network and a VGG16. Zhong et al. proposed a novel generative model for zero-shot and few-shot recognition of Citrus aurantium L. diseases using conditional adversarial auto-encoders. Afifi et al. compared the Triplet network, Baseline, Baseline++, and DAML on the PV and coffee leaf datasets; the results show that the Baseline has the best performance. Li and Chao proposed a semi-supervised FSL method and tested it on PV.

Nuthalapati and Tunga introduced transformers into plant disease recognition. Chen et al. used meta-learning on the Miniplant-disease dataset and PV. Li and Yang used the Matching Network and tested cross-domain performance by mixing in pest data. These methods have been tried from various perspectives and have made important progress. Nevertheless, FSL still faces two common challenges: the limited features extracted from few samples are less representative of a class, and the generalization requirements are high and varied. In this work, we tackle these two issues by using multi-scale feature fusion and by improving training strategies. CNNs are widely used in image-based deep learning methods. In a CNN architecture, local features with more detail and small receptive fields are extracted from low-level layers, while global features with rich semantic information and large receptive fields are extracted from high-level layers. MSFF is the technique of exploiting multi-scale features extracted from different layers of a CNN. In object detection and semantic segmentation, many excellent networks have been built with MSFF, such as the Feature Pyramid Network, U-Net, and the Fully Convolutional Network. MSFF is also used in image restoration, image dehazing, image super-resolution, etc. These methods fuse features by dense connection, feature concatenation, or weighted element-wise summation, and they share a common encoder-decoder framework: the multi-scale features extracted by the encoder are reused in the decoder to enhance feature representation. In a conventional classification task, however, MSFF is seldom used because the network has no decoder; generally, only the top semantic features are fed into the classifier, and the features at other scales are discarded. In fact, high-level features and low-level features are not in a subordinate relationship: local features, which carry rich fine-grained information, can effectively complement the global ones and formulate a richer feature representation of a sample. Under data-limited conditions, it is necessary to extract as many features as possible from a limited amount of data.
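To make this concrete, the following minimal PyTorch sketch (our illustration; the toy four-block backbone is an assumption standing in for a real CNN) collects the feature maps of every scale during one forward pass instead of keeping only the top-level ones:

    import torch
    import torch.nn as nn

    class TinyBackbone(nn.Module):
        # Illustrative four-block CNN: each block halves resolution and doubles channels.
        def __init__(self, width=64):
            super().__init__()
            blocks, in_c = [], 3
            for i in range(4):
                out_c = width * 2 ** i
                blocks.append(nn.Sequential(
                    nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),
                    nn.BatchNorm2d(out_c),
                    nn.ReLU(inplace=True)))
                in_c = out_c
            self.blocks = nn.ModuleList(blocks)

        def forward(self, x):
            feats = []  # low-level (local, detailed) to high-level (global, semantic)
            for block in self.blocks:
                x = block(x)
                feats.append(x)
            return feats

    feats = TinyBackbone()(torch.randn(2, 3, 84, 84))
    print([tuple(f.shape) for f in feats])
    # [(2, 64, 42, 42), (2, 128, 21, 21), (2, 256, 11, 11), (2, 512, 6, 6)]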

Therefore, in this work, we propose to leverage MSFF to enhance feature representation. Multi-scale features can be fused in different ways; in our work, we use cascaded multi-scale feature fusion. The number of channels of the feature maps increases after fusion, but this does not mean that all channels are equally significant. The contribution of each channel is different: some channels should be emphasized and some suppressed, and attention can help to focus on the meaningful ones. The attention mechanism plays an important role in human perception, selectively focusing on salient parts in order to better capture visual structure. It has been introduced into areas of machine learning such as computer vision and natural language processing, and significantly improves performance. It not only tells where to focus, but also improves the representation of what is of interest. Recently, some light-weight attention modules have been proposed. Wang et al. proposed the Residual Attention Network, which uses an encoder-decoder style attention module. Hu et al. introduced a compact module to exploit the inter-channel relationship, named the Squeeze-and-Excitation module. Woo et al. proposed the Convolutional Block Attention Module, which includes both channel attention and spatial attention. These light-weight attention modules can be easily embedded into deep learning networks as plug-ins. In this work, we use the CA to weight the accumulated channels obtained from CMSFF, as sketched below.
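As a concrete illustration of this reweighting, here is a minimal Squeeze-and-Excitation style channel attention module in PyTorch. It is a sketch of the general mechanism rather than our exact implementation; the channel count and reduction ratio are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        # Squeeze-and-Excitation style channel attention (after Hu et al.).
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid())

        def forward(self, x):                         # x: (B, C, H, W)
            w = x.mean(dim=(2, 3))                    # squeeze: global average pooling -> (B, C)
            w = self.fc(w).view(x.size(0), -1, 1, 1)  # excitation: per-channel weights in (0, 1)
            return x * w                              # emphasize useful channels, suppress others

    fused = torch.randn(2, 960, 42, 42)   # e.g., channels accumulated by CMSFF (hypothetical)
    weighted = ChannelAttention(960)(fused)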

By the definition of FSL, a model is asked to generalize to novel categories or novel domains. Generalizing to new categories within the training domain is defined as intra-domain classification, while generalizing to a novel domain is defined as cross-domain classification. A long-tail distribution of data is common in plant disease datasets; to identify the categories with few samples, the model can be trained on the diseases that have more samples. This generalization happens within the same domain. Cross-domain classification arises when a set of categories with few shots must be identified but does not belong to any existing dataset. Cross-domain adaption happens between different datasets and is more difficult than intra-domain adaption. However, researchers have found that it is a frequently encountered and inescapable situation for bringing FSL to practical application. Guo et al. established a new, broader cross-domain few-shot learning benchmark and pointed out that all meta-learning methods underperform simple fine-tuning methods, which indicates the difficulty of the cross-domain issue. Adler et al. proposed representation fusion by an ensemble of Hebbian learners acting on different layers of a deep neural network, an approach from the feature representation perspective. Li W.-H. et al. proposed task-specific adapters for the cross-domain problem from the perspective of network architecture. Qi et al. proposed a meta-based adversarial training framework, also from the perspective of network architecture. To our knowledge, no research has approached the problem from a training strategy perspective. These efforts are general explorations on general benchmarks and rarely discuss specific domains. In fact, each domain has its own characteristics and resources to utilize when crossing domains. Hence, in this work, we propose a set of training strategies to match various cases of generalization using the available data resources.

The contributions of this work are summarized as follows: we propose a Meta-Baseline based FSL approach merged with CMSFF and CA for plant disease recognition; we propose a group of training strategies to meet different generalization requirements; and through extensive comparative and ablation experiments, we validate the superiority of our method and analyze various factors of FSL. Compared with existing related works under the same data conditions, our method achieves the best accuracy.

In this research, three public datasets are used in our experiments. Mini-ImageNet is a subset of ImageNet that includes 100 classes with 600 images per class; we select 64 classes for our experiments. The second is PV, released in 2015 by Pennsylvania State University. It is the most frequently used and most comprehensive dataset in plant disease recognition research to date. In total, it includes 50,403 images spanning 14 crop species and 38 classes, as shown in Table 1. Because the number of samples per class in PV is unbalanced, we use augmented data and select 1,000 images per class to keep the classes balanced. The third is the apple foliar disease (AFD) dataset, published in the FGVC8 Plant Pathology 2021 Competition. All AFD images were taken in the wild with complicated backgrounds, as shown in Figure 1A.

Like a classical classification architecture, our framework contains two components, an encoder and a classifier, as illustrated in Figure 2A.

The encoder, noted as fθ, is a CNN-based network merged with CMSFF and CA. It is trained in two stages: base-training and meta-learning. In base-training, the network contains fθ and a base-training classifier and is trained with image-wise data. The goal in this stage is to learn general features as prior knowledge, so large-scale general datasets with many classes and diverse data, such as ImageNet and Mini-ImageNet, are good choices. The classifier can be a linear classifier, a fully connected layer, an SVM, or another classifier. The cross-entropy loss is calculated to update the parameters of fθ during back-propagation. After base-training is completed, the classifier is removed and the trained model is delivered to the meta-learning stage. In meta-learning, fθ is initialized with the trained model from base-training. Meta-learning is the concept of learning to learn: the purpose is not to learn the knowledge of the training classes, but to learn how to differentiate between classes. To this end, the classifier in meta-learning is replaced by a distance measurement module, and the classification result is decided by the distances from the support samples to the query sample. Meta-learning is a task-driven paradigm in which training data is formulated as N-way, K-shot tasks. Based on a simple machine learning principle, that test and training conditions must match, the data of Cnovel is also formatted into tasks at test time.

After embedding, the 2D color image becomes a high-dimensional vector in semantic space, and the distance from the query sample to each class centroid is calculated by a distance metric. A distance metric uses a distance function that provides a relationship measure between elements of a dataset; in many machine learning algorithms, distance metrics are used to capture the pattern of the input data in order to make data-based decisions. The most commonly used measures of the distance between two vectors are cosine similarity, dot product, and Euclidean distance. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between the vectors, determines whether they point in roughly the same direction, and equals the inner product after normalization. In Euclidean geometry, the dot product of the Cartesian coordinates of two vectors is widely used; it is often called the inner product or projection product of Euclidean space, and the length of the projection represents the distance between the two vectors. The Euclidean distance between two high-dimensional vectors is the square root of the sum of the squared differences in each dimension. These measures, and the assignment of a query to its nearest class centroid, are sketched below.
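The following is a minimal PyTorch sketch of the three measures and of the centroid-based decision rule. Shapes, variable names, and the choice of cosine similarity for scoring are illustrative assumptions, not our exact implementation.

    import torch
    import torch.nn.functional as F

    def cosine_sim(q, p):
        # cosine of the angle between vectors: the inner product after L2 normalization
        return F.normalize(q, dim=-1) @ F.normalize(p, dim=-1).T

    def dot_product(q, p):
        # unnormalized inner (projection) product
        return q @ p.T

    def neg_euclidean(q, p):
        # negative square root of the sum of squared per-dimension differences
        return -torch.cdist(q, p)

    # A 5-way, 5-shot task with 512-d embeddings (all values are placeholders).
    N, K, D = 5, 5, 512
    support = torch.randn(N * K, D)                  # embedded support samples
    query = torch.randn(7, D)                        # embedded query samples
    prototypes = support.view(N, K, D).mean(dim=1)   # per-class centroid in semantic space
    scores = cosine_sim(query, prototypes)           # higher score = closer
    pred = scores.argmax(dim=-1)                     # label of the nearest centroid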
Basically, the structure of MSFF falls into two categories: parallel multi-scale feature fusion (PMSFF) and cascaded multi-scale feature fusion (CMSFF). The two fusion methods are illustrated in Figure 2B. PMSFF concatenates the features from different layers of the CNN simultaneously, with the different-resolution feature maps unified before concatenation. Comparatively, CMSFF fuses the different-resolution feature maps step by step. Taking ResNet12 as the backbone network, four convolutional blocks are linked, and each block generates a group of feature maps with twice the channels and half the resolution of the previous one. In the backward fusion, the smaller feature maps are up-sampled by a factor of two and concatenated with the feature maps of the previous block. After this series of up-sampling and concatenation steps, all channels are fused together into the full-scale fused feature, noted as F. CMSFF is used in this work; a minimal sketch is given at the end of this section.

The domain of training is noted as the source domain (SD), and the domain of test is noted as the target domain (TD). Data from different domains can be used in the three stages: base-training, meta-learning, and test. A particularity of our method is that it has two training stages, and the datasets used in the two stages can differ; we consider the domain of the meta-learning stage to be the SD. When the SD is the same as the TD, it is intra-domain adaption; otherwise, it is cross-domain adaption. In order to mimic different adaption situations, we design different data configurations.
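To make the cascade concrete, here is a minimal sketch of the backward fusion over four ResNet12-style feature maps. The nearest-neighbor up-sampling and the specific shapes are our assumptions for illustration, not the exact implementation.

    import torch
    import torch.nn.functional as F

    def cmsff(feats):
        # Cascaded multi-scale feature fusion: starting from the deepest (smallest)
        # map, repeatedly up-sample 2x and concatenate with the previous block's map.
        x = feats[-1]
        for f in reversed(feats[:-1]):
            x = F.interpolate(x, size=f.shape[-2:], mode='nearest')  # 2x up-sampling
            x = torch.cat([x, f], dim=1)     # channel-wise concatenation
        return x                             # the full-scale fused feature F

    # Four illustrative ResNet12-style maps: channels double, resolution halves.
    sizes, chans = [42, 21, 11, 6], [64, 128, 256, 512]
    feats = [torch.randn(2, c, s, s) for c, s in zip(chans, sizes)]
    fused = cmsff(feats)
    print(tuple(fused.shape))   # (2, 960, 42, 42): 512 + 256 + 128 + 64 channels

Note that the 960 accumulated channels in this toy setting are what a channel attention module, such as the sketch given earlier, would then reweight.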