Local-weighted Citation-kNN algorithm for breast ultrasound image classification
Introduction
Multiple-instance learning (MIL) was proposed to solve learning problems in which the label information of the training data is incomplete. In a traditional supervised learning problem, each training example is represented by a fixed-length feature vector with a known label. In MIL, however, each example is called a bag and is represented by multiple instances; in other words, each example is described by a variable number of feature vectors. Labels are provided only for the training bags, while the labels of the individual instances are unknown. The task is to learn a model that predicts the labels of new bags [1], [2], [3].
The first work on MIL was done by Dietterich et al. while investigating the problem of drug activity prediction [4]. The problem was to determine whether a given drug molecule would bind strongly to a target protein, and the axis-parallel rectangle (APR) algorithm was introduced to solve it. Since then, many MIL methods have been developed for a wide range of applications, such as diverse density (DD) for stock market prediction [5], natural scene classification [6] and content-based image retrieval [7], the MIL support vector machine for image classification [8], and Citation-kNN for web mining [9].
Citation-kNN is an improved kNN algorithm suited to the MIL setting. It is a lazy learning method, which defers processing of the training data until a query needs to be answered [11]. It borrows the concepts of citation and reference from the scientific literature: not only the neighbor bags of a bag b are taken into account, but also the bags that count b as a neighbor [9]. The Hausdorff distance [10] is used to measure distances between bags, which shifts the MIL problem from discriminating instances to discriminating bags [3].
However, in the Citation-kNN algorithm, the contribution of each training bag to the classification is either 0 or 1; the distribution of the bags in feature space is not considered. In most cases, however, this distribution can affect the final decision or the decision confidence.
To address this problem, an improved Citation-kNN algorithm called locally weighted Citation-kNN (LWCKNN) is proposed in this paper. Distribution features, such as the relative distance and the sparseness among bags, are taken into account. The algorithm is tested on the Musk data sets and on breast ultrasound images, and it shows better results than the traditional Citation-kNN.
The rest of the paper is organized as follows. The Citation-kNN algorithm is reviewed in Section 2. The LWCKNN algorithm is presented in Section 3. The experimental results are shown in Section 4. Finally, the discussion and conclusions are given in Section 5.
Section snippets
Citation-kNN
The standard k-nearest neighbor algorithm (k-NN) classifies a test sample based on its k closest training examples in feature space: the test sample is assigned to the class occurring most often among its k nearest neighbors. Usually, the Euclidean distance is used to measure the closeness of samples. For two different samples a and b in feature space, the distance between them can be written as:

Dist(a, b) = sqrt(Σ_i (a_i − b_i)^2)
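As a minimal sketch (not code from the paper), the standard kNN rule with Euclidean distance and majority voting might look like this:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two fixed-length feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    # majority vote among the k nearest labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0.0, 0.0), 'neg'), ((0.1, 0.2), 'neg'),
         ((1.0, 1.0), 'pos'), ((0.9, 1.1), 'pos')]
print(knn_predict(train, (0.2, 0.1), k=3))  # 'neg'
```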
But for MIL, the distance between bags cannot be measured in this way directly, since each bag contains a variable number of instances; the Hausdorff distance between bags is used instead.
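A minimal sketch of the classical (max–min) Hausdorff distance between two bags of instances follows; note that Citation-kNN variants may use a modified form such as the minimal Hausdorff distance, so this is an illustration rather than the paper's exact measure:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two instance vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def directed_hausdorff(A, B):
    # h(A, B): for each instance in A, take the distance to its nearest
    # instance in B, then keep the largest of these values
    return max(min(euclidean(a, b) for b in B) for a in A)

def hausdorff(A, B):
    # Symmetric Hausdorff distance between two bags of instances
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

bag1 = [(0.0, 0.0), (1.0, 0.0)]
bag2 = [(0.0, 1.0), (1.0, 1.0)]
print(hausdorff(bag1, bag2))  # 1.0
```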
Locally weighted Citation-kNN (LWCKNN)
In the Citation-kNN algorithm, when a test bag X is to be classified, its reference set and citer set are computed using the Hausdorff distance, and together they form the voter set of X. Majority voting among the training bags in the voter set is usually used to decide the label of X. This process does not take the distribution of the samples into consideration: each element of the voter set contributes equally to the prediction for X, no matter where it lies relative to X and to the other elements in the voter set.
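The reference/citer construction described above can be sketched as follows; the function and parameter names are illustrative and not taken from the paper:

```python
def voter_set(dist, train_bags, X, k=2, c=2):
    # dist: a bag-level distance function (e.g. the Hausdorff distance)
    # References: the k training bags nearest to the test bag X
    refs = sorted(train_bags, key=lambda B: dist(B, X))[:k]
    # Citers: training bags that rank X among their own c nearest
    # neighbours, drawn from the other training bags plus X
    citers = []
    for B in train_bags:
        pool = [T for T in train_bags if T is not B] + [X]
        nearest = sorted(pool, key=lambda T: dist(B, T))[:c]
        if any(T is X for T in nearest):
            citers.append(B)
    return refs, citers

# Toy demo: single-instance 1-D bags with absolute difference as distance
train = [[0.0], [0.1], [1.0], [1.1]]
X = [0.05]
d = lambda A, B: abs(A[0] - B[0])
refs, citers = voter_set(d, train, X)
print(refs, citers)
```

The labels of the bags in `refs + citers` would then be combined by a vote to predict the label of X.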
Experimental results
The data sets in our experiments are Musk1 and Musk2, which are benchmark data sets for MIL, and a set of breast ultrasound images acquired by the Department of Ultrasound of the Second Affiliated Hospital of Harbin Medical University. Different weighting methods are selected and combined in the experiments, and the results are compared with those of the traditional Citation-kNN algorithm.
In the experiments, 10-fold cross validation is used: all the data are randomly divided into 10 groups.
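An illustrative sketch of such a random 10-group split (the group-assignment scheme is an assumption, not the paper's exact procedure; Musk1 contains 92 bags):

```python
import random

def ten_fold_indices(n, seed=0):
    # Randomly partition indices 0..n-1 into 10 roughly equal groups
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(92)  # e.g. the 92 bags of Musk1
print([len(f) for f in folds])
```

Each group is then held out once as the test set while the remaining nine are used for training.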
Discussions and conclusions
The distribution of samples is an important factor in classification. To improve the Citation-kNN decision rule, the local distribution of the samples is considered in this paper: different voters should make different contributions to the classification. The Distance-Weighted Decision weights each voter's contribution according to its distance from the test bag; a voter closer to the test bag receives a higher weight. The Sparseness-Weighted Decision method weights each voter according to the local sparseness of the bags around it.
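A generic distance-weighted vote along these lines can be sketched as follows; the weighting w = 1/(d + ε) is an illustrative choice, not necessarily the paper's exact formula:

```python
def weighted_vote(voters, eps=1e-6):
    # voters: list of (distance_to_test_bag, label) pairs
    # Closer voters receive larger weights: w = 1 / (d + eps)
    scores = {}
    for d, label in voters:
        scores[label] = scores.get(label, 0.0) + 1.0 / (d + eps)
    return max(scores, key=scores.get)

voters = [(0.1, 'benign'), (0.2, 'benign'), (0.05, 'malignant')]
print(weighted_vote(voters))  # 'malignant'
```

Note that a plain majority vote over these three voters would return 'benign'; the distance weighting lets the single nearby 'malignant' voter outweigh the two more distant 'benign' ones.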
References (19)
- Dietterich et al., Solving the multiple-instance problem with axis-parallel rectangles, Artif. Intell. (1997)
- et al., Improving the distinction between benign and malignant breast lesions: the value of sonographic texture analysis, Ultrasound Imag. (1993)
- et al., Fully automatic and segmentation-robust classification of breast tumors based on local texture analysis of ultrasound images, Pattern Recogn. (2010)
- Multi-Instance Learning: A Survey, Technical Report, AI Lab, Computer Science a… (2004)
- et al., A review of multi-instance learning assumptions, Knowl. Eng. Rev. (2010)
- Multi-instance learning from supervised view, J. Comput. Sci. Technol. (2006)
- et al., A framework for multiple-instance learning (1998)
- et al., Multiple-instance learning for natural scene classification
- et al., Image database retrieval with multiple-instance learning techniques