Nearest Neighbor Analysis

Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity.

Cases that are near each other are said to be “neighbors.” When a new case (holdout) is presented, its distance from each of the cases in the model is computed. The classifications of the most similar cases – the nearest neighbors – are tallied and the new case is placed into the category that contains the greatest number of nearest neighbors.

You can specify the number of nearest neighbors to examine; this value is called k.

Nearest neighbor analysis can also be used to compute values for a continuous target. In this situation, the average or median target value of the nearest neighbors is used to obtain the predicted value for the new case.

Nearest Neighbor Analysis Data Considerations

Target and features. The target and features can be:
An icon next to each variable in the variable list identifies the measurement level and data type.
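The classification and prediction rules described in the introduction can be illustrated with a short sketch. This is not the procedure's own implementation: the Euclidean distance, the function names, and the toy data are assumptions made for illustration, and ties in the vote are broken here in favor of the category seen first among the nearest cases, which the text does not specify.

```python
import math
from collections import Counter
from statistics import mean, median


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(train_x, new_case, k):
    """Return the indices of the k training cases closest to new_case."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], new_case))
    return order[:k]


def knn_classify(train_x, train_y, new_case, k):
    """Tally the neighbors' categories and return the most frequent one."""
    idx = nearest_neighbors(train_x, new_case, k)
    votes = Counter(train_y[i] for i in idx)
    return votes.most_common(1)[0][0]


def knn_predict(train_x, train_y, new_case, k, use_median=False):
    """Predict a continuous target as the mean or median of the neighbors' values."""
    idx = nearest_neighbors(train_x, new_case, k)
    values = [train_y[i] for i in idx]
    return median(values) if use_median else mean(values)


# Hypothetical toy data: two features per case.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]          # categorical target
prices = [10.0, 12.0, 11.0, 50.0, 52.0]     # continuous target

predicted_label = knn_classify(train_x, labels, (1.1, 1.0), k=3)
predicted_price = knn_predict(train_x, prices, (5.1, 5.0), k=2)
```

With this toy data, the three nearest neighbors of (1.1, 1.0) all belong to category "a", and the two nearest neighbors of (5.1, 5.0) have target values 50.0 and 52.0, so the mean prediction is 51.0.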
Categorical variable coding. The procedure temporarily recodes categorical predictors and dependent variables using one-of-c coding for the duration of the procedure. If there are c categories of a variable, then the variable is stored as c vectors, with the first category denoted (1,0,...,0), the next category (0,1,0,...,0), ..., and the final category (0,0,...,0,1).

This coding scheme increases the dimensionality of the feature space. In particular, the total number of dimensions is the number of scale predictors plus the number of categories across all categorical predictors. As a result, this coding scheme can lead to slower training. If your nearest neighbors training is proceeding very slowly, you might try reducing the number of categories in your categorical predictors by combining similar categories or dropping cases that have extremely rare categories before running the procedure.

All one-of-c coding is based on the training data, even if a holdout sample is defined (see Partitions (Nearest Neighbor Analysis)). Thus, if the holdout sample contains cases with predictor categories that are not present in the training data, then those cases are not scored. If the holdout sample contains cases with dependent variable categories that are not present in the training data, then those cases are scored.

Rescaling. Scale features are normalized by default. All rescaling is performed based on the training data, even if a holdout sample is defined (see Partitions (Nearest Neighbor Analysis)). If you specify a variable to define partitions, it is important that the features have similar distributions across the training and holdout samples. Use, for example, the Explore procedure to examine the distributions across partitions.

Frequency weights. Frequency weights are ignored by this procedure.

Replicating results. The procedure uses random number generation during random assignment of partitions and cross-validation folds. If you want to replicate your results exactly, in addition to using the same procedure settings, set a seed for the Mersenne Twister (see Partitions (Nearest Neighbor Analysis)), or use variables to define partitions and cross-validation folds.

To obtain a nearest neighbor analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Nearest Neighbor...
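The categorical coding and rescaling rules described under Data Considerations can be sketched as follows. This is an illustration under stated assumptions, not the procedure's code: the helper names are hypothetical, categories are ordered alphabetically (the text does not say how the "first" category is chosen), and scale features are rescaled with a plain min-max formula because the text does not give the exact normalization used.

```python
def fit_one_of_c(train_values):
    """Learn the category positions from the training data only."""
    categories = sorted(set(train_values))  # assumed ordering: alphabetical
    return {cat: pos for pos, cat in enumerate(categories)}


def encode_one_of_c(value, index):
    """Return the c-element indicator vector, or None for an unseen category."""
    if value not in index:
        return None  # category absent from the training data
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec


def fit_min_max(train_values):
    """Learn the rescaling range from the training data only."""
    return min(train_values), max(train_values)


def rescale(value, lo, hi):
    """Min-max rescaling (assumed formula); holdout values may fall outside [0, 1]."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def build_row(color, size, index, lo, hi):
    """Combine one categorical and one scale predictor into a feature vector.

    Returns None when the case cannot be scored because its predictor
    category was not present in the training data.
    """
    coded = encode_one_of_c(color, index)
    if coded is None:
        return None
    return coded + [rescale(size, lo, hi)]


# Hypothetical training sample: one categorical and one scale predictor.
train_color = ["red", "green", "blue", "green"]
train_size = [2.0, 4.0, 6.0, 10.0]

index = fit_one_of_c(train_color)
lo, hi = fit_min_max(train_size)

row_seen = build_row("red", 6.0, index, lo, hi)       # 3 categories + 1 scale = 4 dimensions
row_unseen = build_row("purple", 3.0, index, lo, hi)  # None: case is not scored
out_of_range = rescale(12.0, lo, hi)                  # above 1 because 12.0 exceeds the training maximum
```

The last line shows why similar distributions across partitions matter: a holdout value outside the training range is rescaled beyond the interval that the training cases occupy, which distorts its distances to them.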