Predict sepal width ################### We can also use the SOM to predict additional features of the test dataset. To do so, we first need to assign new features to the neurons of the SOM. Let us predict the sepal width of the test dataset using our SOM. To do so we will train a new SOM with more neurons to have finer predictions .. code:: import pandas from SOMptimised import SOM, LinearLearningStrategy, ConstantRadiusStrategy, euclidianMetric import numpy as np # Extract data table = pandas.read_csv('examples/iris_dataset/iris_dataset.csv').sample(frac=1) swidth = table['sepal width (cm)'].to_numpy() data = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']].to_numpy() data_train = data[:-10] data_test = data[-10:] swidth_train = swidth[:-10] swidth_test = swidth[-10:] # Fit SOM m = 5 n = 5 lr = LinearLearningStrategy(lr=1) # Learning rate strategy sigma = ConstantRadiusStrategy(sigma=0.8) # Neighbourhood radius strategy metric = euclidianMetric # Metric used to compute BMUs nf = data_train.shape[1] # Number of features som = SOM(m=m, n=n, dim=nf, lr=lr, sigma=sigma, metric=metric, max_iter=1e4, random_state=None) som.fit(data_train, epochs=1, shuffle=True, n_jobs=1) pred_train = som.train_bmus_ pred_test = som.predict(data_test) Here we used a :math:`5 \times 5` SOM to fit the data. A larger SOM might give more precise results but some neurons might never map to any data point though. We also extract the sepal width column of the train and test datasets. The sepal width for the test dataset will be used to compare with the predicition from the SOM. That of the train dataset will be used to assign a sepal width for each neuron in the SOM. To compute a sepal width estimate for each neuron, we can loop through them and find all data points whose best-matching unit is that neuron. The sepal width of the neuron can then be computed as the median value of the sepal width of all these data points .. code:: # Compute median sepal width and uncertainty for all neurons swidth_med = [] swidth_std = [] for i in range(m*n): tmp = swidth_train[pred_train == i] swidth_med.append(np.nanmedian(tmp)) swidth_std.append(np.nanstd(tmp)) In the code above, we also computed an estimate of the uncertainty on the sepal width as the standard deviation. Lets us predict the sepal width of the test set using the SOM .. code:: # Predict sepal width for test set swidth_test_pred = np.array(swidth_med)[pred_test] swidth_test_pred_std = np.array(swidth_std)[pred_test] print('Predicted Real') for pred, err, true in zip(swidth_test_pred, swidth_test_pred_std, swidth_test): print(f'{pred:.1f} +- {err:.1f} {true:.1f}') .. execute_code:: :hide_code: import warnings import pandas from SOMptimised import SOM, LinearLearningStrategy, ConstantRadiusStrategy, euclidianMetric import numpy as np # Extract data table = pandas.read_csv('examples/iris_dataset/iris_dataset.csv').sample(frac=1) swidth = table['sepal width (cm)'].to_numpy() data = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']].to_numpy() data_train = data[:-10] data_test = data[-10:] swidth_train = swidth[:-10] swidth_test = swidth[-10:] # Fit SOM m = 5 n = 5 lr = LinearLearningStrategy(lr=1) # Learning rate strategy sigma = ConstantRadiusStrategy(sigma=0.8) # Neighbourhood radius strategy metric = euclidianMetric # Metric used to compute BMUs nf = data_train.shape[1] # Number of features som = SOM(m=m, n=n, dim=nf, lr=lr, sigma=sigma, metric=metric, max_iter=1e4, random_state=0) som.fit(data_train, epochs=1, shuffle=False, n_jobs=1) pred_train = som.train_bmus_ pred_test = som.predict(data_test) # Compute median sepal width and uncertainty swidth_med = [] swidth_std = [] for i in range(m*n): tmp = swidth_train[pred_train == i] with warnings.catch_warnings(): warnings.simplefilter("ignore") swidth_med.append(np.nanmedian(tmp)) swidth_std.append(np.nanstd(tmp)) # Predict sepal width for test set swidth_test_pred = np.array(swidth_med)[pred_test] swidth_test_pred_std = np.array(swidth_std)[pred_test] print('Predicted Real') for pred, err, true in zip(swidth_test_pred, swidth_test_pred_std, swidth_test): print(f'{pred:.1f} +- {err:.1f} {true:.1f}') .. note:: Depending on the parameters of the SOM and the initialisation of the weights, it is not possible to predict a sepal width for all the data points in the test dataset.