Tutorial/Example
================

.. _iris dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

.. toctree::
   :hidden:

   clustering_iris.rst
   predict_sepal_width.rst
   normalisation.rst
   set_and_get.rst
   save_and_load.rst

.. important::

    This example requires pandas_ to be installed. It can be installed with pip:
    
    .. code:: bash
        
        pip install pandas
        
    or with conda:
    
    .. code:: bash
    
        conda install pandas

Preparing the data
##################

This SOM implementation expects the data as a 2-dimensional numpy array: the first dimension indexes the observations (data points) and the second dimension indexes the features of each observation.
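
As a small illustration of this layout, the sketch below builds an array with four observations and three features and reads the two dimensions back. The values are made up for the example, and the reshape step is only an assumption about how a single feature held in a 1-dimensional array would be brought into the 2-dimensional form.

.. code:: python

    import numpy as np

    # Four made-up observations (rows) with three features each (columns).
    data = np.array([
        [1.4, 0.2, 5.1],
        [4.7, 1.4, 7.0],
        [6.0, 2.5, 6.3],
        [1.3, 0.2, 4.7],
    ])

    # First dimension: observations, second dimension: features.
    n_observations, n_features = data.shape
    print(data.ndim, n_observations, n_features)

    # A single feature stored as a 1-D array has no feature axis;
    # reshape(-1, 1) turns it into one column with one row per observation.
    single_feature = np.array([1.4, 4.7, 6.0, 1.3])
    column = single_feature.reshape(-1, 1)
    print(column.shape)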

Let us run the SOM on the `iris dataset`_. To do so, we will use `pandas`_ to load the dataset from the CSV file:

.. execute_code::
    
    import pandas
    
    table = pandas.read_csv('examples/iris_dataset/iris_dataset.csv')
    print(table.head(), end='\n\n')
    print(table.info())
    
Each row represents an observation and, in this case, there are four features: sepal length, sepal width, petal length and petal width. Let us train the SOM on three of them:

.. code::

    # Extract these columns first: they are no longer in table after the selection below
    target = table['target']
    swidth = table['sepal width (cm)']
    table  = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']]
    print(table.info())

.. execute_code::
    :hide_code:

    import pandas
    
    table  = pandas.read_csv('examples/iris_dataset/iris_dataset.csv')
    target = table['target']
    swidth = table['sepal width (cm)']
    table  = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']]
    print(table.info())
    
where we have also extracted the target class of each observation, to compare against at the end, and the sepal width, to predict in the next section. We then convert the data into a numpy array, since this is the format the SOM requires, and check that it has the expected shape (150, 3):

.. code::
    
    data  = table.to_numpy()
    print(data.shape)
    print(data[:5])
    
.. execute_code::
    :hide_code:

    import pandas
    
    table  = pandas.read_csv('examples/iris_dataset/iris_dataset.csv')
    target = table['target']
    swidth = table['sepal width (cm)']
    table  = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']]
    data   = table.to_numpy()
    print(data.shape)
    print(data[:5])
    
Finally, let us set aside the last five observations to test the SOM:

.. code::

    data_train = data[:-5]
    data_test  = data[-5:]
    print(len(data_train), len(data_test))

.. execute_code::
    :hide_code:
    
    import pandas
    
    table  = pandas.read_csv('examples/iris_dataset/iris_dataset.csv')
    target = table['target']
    swidth = table['sepal width (cm)']
    table  = table[['petal length (cm)', 'petal width (cm)', 'sepal length (cm)']]
    data   = table.to_numpy()
    
    data_train = data[:-5]
    data_test  = data[-5:]
    print(len(data_train), len(data_test))
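
The split above takes the final five rows. If the CSV file keeps the usual ordering of the iris data, with rows grouped by species, those five rows would all belong to the same class; this is an assumption about the file and is not checked here. A possible alternative, sketched below with a placeholder array standing in for ``data``, is to hold out five rows chosen at random.

.. code:: python

    import numpy as np

    # Placeholder standing in for the (150, 3) array built above.
    data = np.arange(150 * 3, dtype=float).reshape(150, 3)

    # A fixed seed makes the shuffled order reproducible.
    rng = np.random.default_rng(seed=0)
    indices = rng.permutation(len(data))

    # Hold out the first five shuffled indices, train on the rest.
    test_idx = indices[:5]
    train_idx = indices[5:]

    data_train = data[train_idx]
    data_test = data[test_idx]
    print(data_train.shape, data_test.shape)

With such a split, the same index arrays would also have to be applied to ``target`` and ``swidth`` so that they stay aligned with the rows. The rest of this tutorial uses the simpler ``data[:-5]`` and ``data[-5:]`` split.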

.. include:: clustering_iris.rst
.. include:: predict_sepal_width.rst
.. include:: normalisation.rst
.. include:: set_and_get.rst
.. include:: save_and_load.rst