Predict sepal width

We can also use the SOM to predict additional features of the test dataset. To do so, we first need to assign new features to the neurons of the SOM.

Let us predict the sepal width of the test dataset using our SOM. To do so, we will train a new SOM with more neurons to obtain finer predictions:

import pandas
from   SOMptimised import SOM, LinearLearningStrategy, ConstantRadiusStrategy, euclidianMetric
import numpy       as np

# Extract data
table        = pandas.read_csv('examples/iris_dataset/iris_dataset.csv').sample(frac=1)
swidth       = table['sepal width (cm)'].to_numpy()
data         = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']].to_numpy()

data_train   = data[:-10]
data_test    = data[-10:]
swidth_train = swidth[:-10]
swidth_test  = swidth[-10:]

# Fit SOM
m      = 5
n      = 5
lr     = LinearLearningStrategy(lr=1)      # Learning rate strategy
sigma  = ConstantRadiusStrategy(sigma=0.8) # Neighbourhood radius strategy
metric = euclidianMetric                   # Metric used to compute BMUs
nf     = data_train.shape[1]               # Number of features

som    = SOM(m=m, n=n, dim=nf, lr=lr, sigma=sigma, metric=metric, max_iter=1e4, random_state=None)
som.fit(data_train, epochs=1, shuffle=True, n_jobs=1)

pred_train = som.train_bmus_
pred_test  = som.predict(data_test)

Here we used a \(5 \times 5\) SOM to fit the data. A larger SOM might give more precise results, but some of its neurons might never be the best-matching unit (BMU) of any data point.
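One way to check how many neurons never map to any data point is to count how often each neuron index appears among the BMUs. The sketch below uses a made-up array of BMU indices for a 4-neuron SOM; it assumes that the BMUs are stored as a flat integer array, as the indexing of `som.train_bmus_` in this example suggests.

```python
import numpy as np

# Toy BMU indices for 6 training points on a SOM with 4 neurons (made-up values)
toy_bmus  = np.array([0, 0, 0, 2, 2, 3])
n_neurons = 4

# Number of training points mapped to each neuron
hits  = np.bincount(toy_bmus, minlength=n_neurons)

# Indices of the neurons that are the BMU of no training point
empty = np.flatnonzero(hits == 0)

print('Hits per neuron:', hits)
print('Empty neurons  :', empty)
```

If many neurons are empty, a smaller SOM or a longer training may be worth trying.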

We also extract the sepal width column of the train and test datasets. The sepal width of the test dataset will be compared with the prediction from the SOM. That of the train dataset will be used to assign a sepal width to each neuron in the SOM.

To compute a sepal width estimate for each neuron, we can loop through them and find all data points whose best-matching unit is that neuron. The sepal width of the neuron can then be computed as the median value of the sepal width of all these data points:

# Compute median sepal width and uncertainty for all neurons
swidth_med = []
swidth_std = []

for i in range(m*n):
    tmp    = swidth_train[pred_train == i]

    swidth_med.append(np.nanmedian(tmp))
    swidth_std.append(np.nanstd(tmp))

In the code above, we also computed an estimate of the uncertainty on the sepal width as the standard deviation. Let us now predict the sepal width of the test set using the SOM:

# Predict sepal width for test set
swidth_test_pred     = np.array(swidth_med)[pred_test]
swidth_test_pred_std = np.array(swidth_std)[pred_test]

print('Predicted     Real')
for pred, err, true in zip(swidth_test_pred, swidth_test_pred_std, swidth_test):
    print(f'{pred:.1f} +- {err:.1f}    {true:.1f}')

Results

Predicted     Real
5 +- 0.3    4.1
0 +- 0.3    3.8
5 +- 0.2    2.4
8 +- 0.3    3.3
2 +- 0.3    3.1
0 +- 0.3    3.2
5 +- 0.2    2.4
8 +- 0.2    2.8
9 +- 0.2    2.9
5 +- 0.3    3.4
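A comparison like the one above can also be summarised with a couple of numbers. The sketch below is not part of SOMptimised: `prediction_summary` is a hypothetical helper that computes the root-mean-square error and the fraction of points whose real value lies within one standard deviation of the prediction, skipping points without a prediction. The input arrays are made-up values, not the results of this example.

```python
import numpy as np

def prediction_summary(pred, err, true):
    """Return (RMSE, fraction within one std, number of valid points).

    Points whose prediction is NaN are ignored.
    """
    pred, err, true = (np.asarray(a, dtype=float) for a in (pred, err, true))
    valid = ~np.isnan(pred)
    n_valid = int(valid.sum())

    if n_valid == 0:
        return float('nan'), float('nan'), 0

    resid  = pred[valid] - true[valid]
    rmse   = float(np.sqrt(np.mean(resid**2)))
    within = float(np.mean(np.abs(resid) <= err[valid]))
    return rmse, within, n_valid

# Made-up predictions, uncertainties and real values (third point has no prediction)
toy_pred = [3.0, 2.6, np.nan, 3.4]
toy_err  = [0.2, 0.1, np.nan, 0.3]
toy_true = [3.1, 2.9, 3.0, 3.4]

rmse, within, n_valid = prediction_summary(toy_pred, toy_err, toy_true)
print(f'RMSE = {rmse:.2f} cm over {n_valid} points, {within:.0%} within one std')
```

With the variables of this example, the call would be `prediction_summary(swidth_test_pred, swidth_test_pred_std, swidth_test)`.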

Note

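Depending on the parameters of the SOM and the initialisation of the weights, it may not be possible to predict a sepal width for every data point in the test dataset. A neuron that is the best-matching unit of no training point has no sepal width assigned (its median is NaN), so any test point mapped to such a neuron gets no prediction.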
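Test points without a prediction can be spotted by looking for NaN values in the predicted array. The sketch below uses made-up per-neuron medians and test BMUs. The fallback it applies (the median of the neurons that do have a value) is only an illustrative assumption, not something SOMptimised does for you; other choices, such as leaving these points out, are equally reasonable.

```python
import numpy as np

# Made-up per-neuron medians for a 4-neuron SOM; neuron 1 has no training point
toy_neuron_med = np.array([3.0, np.nan, 2.6, 3.4])

# Made-up BMUs of 5 test points
toy_bmus_test  = np.array([0, 1, 2, 1, 3])

# Prediction for each test point and mask of the missing ones
toy_pred = toy_neuron_med[toy_bmus_test]
missing  = np.isnan(toy_pred)
print(f'{missing.sum()} of {toy_pred.size} test points have no prediction')

# Crude fallback (assumption): use the median over the non-empty neurons
fallback    = np.nanmedian(toy_neuron_med)
pred_filled = np.where(missing, fallback, toy_pred)
print(pred_filled)
```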

SOMptimised 1.1
