Neural Network Modeling with Keras: Data Processing Techniques

COMP4332/RMBI4310
Neural Network (Keras) (More Examples)
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse

Raw Data -> Data Collection -> Collected Data -> Data Processing -> Processed Data -> Data Mining -> Data Mining Results -> Result Presenting -> Presentable Forms of Data Mining Results
Do you know what proportion of the overall time these 2 steps (i.e., “Data Collection” and “Data Processing”) take in practice?
70%-90% of the total time!

In this set of lecture notes, although we focus on the neural network model, we also cover some data processing steps.

In the last set of lecture notes, we learned how to implement a neural network model in “Keras” for classification.
• The data used for classification is a table.
• It contains 8 input attributes and 1 target attribute.

Outline
• In this set of lecture notes, we will describe the following.
  • The total number of target attributes used in the table for classification could be changed from 1 to a number greater than 1 (e.g., 2).
  • The data form could be changed from the table form to the time series form (prediction).
  • The data could be normalized.
  • The task could be changed from “classification/prediction” to “regression”.

The total number of target attributes used in the table for classification could be changed from 1 to a number greater than 1 (e.g., 2).
• E.g., let us change it to 2.
• Thus, the “given” table should also contain two target attributes.

We have the following changes.
• The data reading function should be changed so that we could read 2 target attributes.
• The model specification function should be changed so that we could have 2 target attributes in the “output” layer.
• The function handling the output of the prediction of the model should be updated so that we could handle 2 target attributes.

Original Code
Python
    # Step 2: to define the model
    print(" Step 2: to define the model...")
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

Updated Code
Python
    # Step 2: to define the model
    print(" Step 2: to define the model...")
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(2, activation='sigmoid'))
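The other two changes (reading 2 target attributes and handling 2 predicted values per record) could look roughly like the sketch below. The table values, the 0.5 threshold and the variable names are assumptions made for illustration only; they are not taken from the course program.

```python
import numpy

# Hypothetical table: each row has 8 input attributes followed by 2 target attributes.
dataset = numpy.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 0],
    [8, 7, 6, 5, 4, 3, 2, 1, 0, 1],
    [2, 4, 6, 8, 1, 3, 5, 7, 1, 1],
], dtype=float)

num_inputs = 8
X = dataset[:, :num_inputs]   # columns 0..7: the input attributes
Y = dataset[:, num_inputs:]   # columns 8..9: the 2 target attributes

# Hypothetical sigmoid outputs of the 2-unit output layer for the 3 records
# (one column per target attribute).
raw_predictions = numpy.array([[0.91, 0.12], [0.30, 0.77], [0.55, 0.49]])

# Assumed rule: a value above 0.5 is treated as class 1, otherwise class 0.
predicted_labels = (raw_predictions > 0.5).astype(int)
print(X.shape, Y.shape)
print(predicted_labels)
```

Each column of the output is thresholded independently, so the two target attributes are handled as two separate yes/no decisions.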

Although the neural network allows us to specify more than one target attribute, we suggest specifying only one target attribute for the model.
• This is because the “optimization” tool being used is NOT powerful enough to find a solution involving many target attributes.


When the data is a time series, we have to transform it to the “correct” format.
Time Series: 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118
Neural Network Format: (x1, x2, y), where x1 and x2 are the input attributes and y is the output attribute.

The “correct” form is just like a table format.
• In particular, we should have a number of input attributes and a number of target attributes for the neural network model.
• Let us fix the number of target attributes to 1.


Data Processing: Collected Data -> Processed Data
We have to transform the collected data and extract it into the “correct” form so that it can be used by the data mining models in the next step.

Let us revisit the “Data Processing” lecture notes on time series data.

• Currently, we have the following data: 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118
• Suppose that we want to perform a prediction.
• We need some input attributes (X) and the target attribute (Y).

112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118
look_back = 1
• Suppose that we use the 1 previous data point for prediction.
    X     Y
    112   118
    118   132
    132   129
    129   121
    121   135
    …     …
    104   118
How many records are there?

112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118
look_back = 2
• Suppose that we use the 2 previous data points for prediction.
    X1    X2    Y
    112   118   132
    118   132   129
    132   129   121
    129   121   135
    121   135   148
    …     …     …
    119   104   118
How many records are there?

112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118
look_back = 3
• We could also do the same based on the previous 3 data points.
How many records are there?
Let n be the total number of values in the time series data. What is the total number of records in terms of n?
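The counting questions above can be checked with a few lines of plain Python. The sketch below slides a window of look_back values over the 12-value series and counts the resulting records; the helper name make_records is hypothetical and is used only for this illustration.

```python
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

def make_records(values, look_back):
    """Pair each window of look_back consecutive values with the value that follows it."""
    return [
        (values[i:i + look_back], values[i + look_back])
        for i in range(len(values) - look_back)
    ]

n = len(series)
for look_back in (1, 2, 3):
    records = make_records(series, look_back)
    # Every record needs look_back inputs plus 1 target, so there are n - look_back records.
    print(look_back, len(records), n - look_back, records[0], records[-1])
```

With n = 12, this gives 11, 10 and 9 records for look_back = 1, 2 and 3, i.e., n - look_back records in general.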

• Next, we define a function called “create_dataset” for this purpose.
• Input:
  • dataset (in the NumPy array format)
  • look_back (integer)
• Output:
  • dataX (in the NumPy array format)
  • dataY (in the NumPy array format)

Python
    import numpy

    # to convert the dataset (in the NumPy array format) to the “correct” format
    def create_dataset(dataset, look_back=1):
        dataX, dataY = [], []
        for i in range(len(dataset)-look_back):
            # the look_back values starting at position i are the input attributes
            a = dataset[i:(i+look_back), 0]
            dataX.append(a)
            # the value right after the window is the target attribute
            dataY.append(dataset[i + look_back, 0])
        return numpy.array(dataX), numpy.array(dataY)
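As a usage sketch, the function can be applied to the 12 values used in the examples. Note that create_dataset indexes column 0 of its input, so the values must be passed as a two-dimensional array with one column; the reshape call below is one way to build such an array and is an assumption of this illustration.

```python
import numpy

def create_dataset(dataset, look_back=1):
    # Same logic as the function in these notes, repeated so the example runs on its own.
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back):
        dataX.append(dataset[i:(i + look_back), 0])
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)

values = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]
dataset = numpy.array(values, dtype=float).reshape(-1, 1)  # 12 rows, 1 column

dataX, dataY = create_dataset(dataset, look_back=2)
print(dataX.shape, dataY.shape)
print(dataX[0], dataY[0])
print(dataX[-1], dataY[-1])
```

dataX has one row per record and one column per input attribute, while dataY is a one-dimensional array of targets.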

We have revisited the data transformation/processing function.
• The time series data could be transformed to the table format.
• Next, we show the transformation on the time series data to be used.
  • The original dataset D1 used for the training set and the test set (Training-TimeSeriesData.csv)
  • The original dataset D2 used for the new set (New-TimeSeriesData-NoOutput.csv)

Training-TimeSeriesData.csv
    Month,No. of Passengers
    1949-01,112
    1949-02,118
    1949-03,132
    1949-04,129
    1949-05,121
    1949-06,135
    1949-07,148
    1949-08,148
    1949-09,136
    1949-10,119
    1949-11,104
    1949-12,118
    …
We could obtain the following values over time: 112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, …
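The values sit in the second column (index 1) of the file. The sketch below extracts that column using only Python's standard csv module, reading from an in-memory string that holds the first four data rows shown above. It is an illustration of the idea rather than the reading code used in the course program, and the variable names are hypothetical.

```python
import csv
import io

# First rows of the file, held in a string so the example needs no file on disk.
csv_text = """Month,No. of Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # skip the header row

# Keep only column 1 and store each value in its own one-element row,
# which gives a two-dimensional (rows x 1 column) structure.
data_float_two_dim = [[float(row[1])] for row in reader]
print(header[1])
print(data_float_two_dim)
```

Keeping each value in its own row matches the one-column, two-dimensional layout that create_dataset expects.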

The content of this “new” dataset is exactly the same as the content of the “training” dataset in our example for illustration. In real-life applications, it could be different.
New-TimeSeriesData-NoOutput.csv
    Month,No. of Passengers
    1949-01,112
    1949-02,118
    1949-03,132
    1949-04,129
    1949-05,121
    1949-06,135
    1949-07,148
    1949-08,148
    1949-09,136
    1949-10,119
    1949-11,104
    1949-12,118
    …

Let us use this transformation function to create the training set, the test set and the new set.

In the following, we describe two functions.
• generateTrainDataAndTestData: to preprocess the time series data D1 to generate the correct format of the input data for the model (for the training data and the test data)
• generateNewData: to preprocess the time series data D2 to generate the correct format of the input data for the model (for the new data)

generateTrainDataAndTestData
• Input: timeSeriesDataFilename, look_back
• Output: data_float_TwoDim, trainDataX, trainDataY, testDataX, testDataY

Python
    import pandas

    # to preprocess the time series data D1 to generate the correct format of the
    # input data for the model (for the training data and the test data)
    def generateTrainDataAndTestData(timeSeriesDataFilename, look_back):
        # to read the data from the file (only the second column is used)
        dataframe = pandas.read_csv(timeSeriesDataFilename, usecols=[1])
        data_int_TwoDim = dataframe.values
        data_float_TwoDim = data_int_TwoDim.astype(float)

        # to split the data into the training set (first 80%) and the test set (last 20%)
        data_size = len(data_float_TwoDim)
        train_size = int(data_size*0.80)
        test_size = data_size - train_size

        # to generate the training set and the test set
        trainData = data_float_TwoDim[0:train_size, :]
        testData = data_float_TwoDim[train_size:data_size, :]
        trainDataX, trainDataY = create_dataset(trainData, look_back)
        testDataX, testDataY = create_dataset(testData, look_back)

        return data_float_TwoDim, trainDataX, trainDataY, testDataX, testDataY
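The split arithmetic can be traced by hand. The sketch below assumes a series of 144 values; this number is an assumption, since the file listing in these notes is truncated, but it is consistent with the 114 training records and 28 test records that appear in the training output later in these notes. The helper name split_sizes is hypothetical.

```python
def split_sizes(data_size, look_back, train_fraction=0.80):
    """Return the sizes produced by an 80/20 split followed by windowing."""
    train_size = int(data_size * train_fraction)  # int() truncates toward zero
    test_size = data_size - train_size
    # Windowing each part separately costs look_back values per part.
    train_records = train_size - look_back
    test_records = test_size - look_back
    return train_size, test_size, train_records, test_records

print(split_sizes(144, 1))
```

Because the two parts are windowed separately, no record uses training values as inputs for a test target, and each part loses look_back records.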


generateNewData
• Input: newTimeSeriesNoOutputDataFilename, look_back
• Output: newDataX

Python
    import pandas

    # to preprocess the time series data D2 to generate the correct format of the
    # input data for the model (for the new data)
    def generateNewData(newTimeSeriesNoOutputDataFilename, look_back):
        # to read the data from the file (only the second column is used)
        dataframe = pandas.read_csv(newTimeSeriesNoOutputDataFilename, usecols=[1])
        data_int_TwoDim = dataframe.values
        data_float_TwoDim = data_int_TwoDim.astype(float)

        # to generate the new set
        newDataX, newDataY = create_dataset(data_float_TwoDim, look_back)

        # we ignore newDataY here (since we want to create a new dataset
        # for the model to do the prediction)
        return newDataX

We have described the “data processing” part.
• We are ready to describe the Keras program on this time series data.
• This task is “prediction” (not “classification”).
• This program is called “program-NeuralNetwork-TimeSeries-Prediction.py”.

Data Mining: Processed Data -> Data Mining Results
We have to define some “data mining” models to perform some “data mining” tasks.
We could call many existing libraries to complete these “data mining” tasks.

Data Mining: Processed Data -> Data Mining Results
• Phase 1: Model Training (Training/Validation/Test Data -> Model (In Memory))
• Phase 2: Model Storing (Model (In Memory) -> Model (In Disk))
• Phase 3: Model Reading (Model (In Disk) -> Model (In Memory))
• Phase 4: New Data Prediction (New Data and Model (In Memory) -> Predicted Result)

In “Data Mining”, we know that we have Phase 1, Phase 2, Phase 3 and Phase 4.
• Next, we show the step of “Result Presenting”.

Result Presenting: Data Mining Results -> Presentable Forms of Data Mining Results
We have to present the data mining results in a “readable” form and a “presentable” form.
Some data mining results could be presented directly.
Some other data mining results could be presented better by using some existing visualization libraries.
This is Phase 5 in our program.

Python
    timeSeriesDataFilename = "Training-TimeSeriesData.csv"
    newTimeSeriesNoOutputDataFilename = "New-TimeSeriesData-NoOutput.csv"
    newTimeSeriesPredictedOutputDataFilename = "New-TimeSeriesData-NeuralNetwork-PredictedOutput.csv"
    modelFilenamePrefix = "neuralNetworkModel-timeSeries-prediction"
    look_back = 1

    # Phase 0 (Before Phases 1-4): to preprocess the time series data to generate the
    # correct format of the input data for the model
    # (Done!)
    data_float_TwoDim, trainDataX, trainDataY, testDataX, testDataY = generateTrainDataAndTestData(timeSeriesDataFilename, look_back)
    newDataX = generateNewData(newTimeSeriesNoOutputDataFilename, look_back)

    # Phase 1: to train the model
    # (To describe later)
    print("Phase 1: to train the model...")
    model = trainModel(trainDataX, trainDataY, testDataX, testDataY, look_back)

    # Phase 2: to save the model to a file
    # (Skipped (Similar))
    print("Phase 2: to save the model to a file...")
    saveModel(model, modelFilenamePrefix)

Python
    # Phase 3: to read the model from a file
    # (Skipped (Similar))
    print("Phase 3: to read the model from a file...")
    model = readModel(modelFilenamePrefix)

    # Phase 4: to predict the target attribute of a new dataset based on a model
    # (To describe later)
    print("Phase 4: to predict the target attribute of a new dataset based on a model...")
    newDataY_TwoDim = predictNewDatasetFromModel(newDataX, newTimeSeriesPredictedOutputDataFilename, model)

    # Phase 5: to visualize the result
    # (To describe later)
    print("Phase 5: to visualize the result...")
    plotResult(data_float_TwoDim, trainDataY, testDataY, newDataY_TwoDim, look_back)


Phase 1: Model Training (Training/Validation/Test Data -> Model (In Memory))
There are the following 5 steps.
• Step 1: to load the data (to read the data and to split the data into the input attributes and the target attribute)
• Step 2: to define the model (to define the “structure” of the model)
• Step 3: to compile the model (to define how to update the parameters used in the “structure” of the model)
• Step 4: to fit the model (to train the model with the given data)
• Step 5: to evaluate the model (to evaluate the model on the data)

Python
    # to train a model
    def trainModel(trainDataX, trainDataY, testDataX, testDataY, look_back):
        # to set the "fixed" seed of the random number generator used in the
        # "optimization" tool in the neural network model
        # The reason why we fix this is to reproduce the same output each time
        # we execute this program
        # In practice, you could set it to any integer (or the current time,
        # e.g., "numpy.random.seed(int(time.time()))")
        numpy.random.seed(11)

        # Step 1: to load the data
        print(" Step 1: to load the data...")
        # We obtain the data already just before Phase 1
        # The data could be found in the 4 variables from the input argument
        # (i.e., trainDataX, trainDataY, testDataX, testDataY)

        # Step 2: to define the model
        print(" Step 2: to define the model...")
        model = Sequential()
        model.add(Dense(12, input_dim=look_back, activation='relu'))   # rectifier function
        model.add(Dense(8, activation='relu'))                         # rectifier function
        model.add(Dense(1, activation='relu'))                         # rectifier function

Structure of the model (each layer is fully connected to the next layer):
• Input layer: x1, x2, x3, ..., xlook_back
• Hidden layer: N1,1, N1,2, N1,3, ..., N1,12 (rectifier function)
• Hidden layer: N2,1, N2,2, ..., N2,8 (rectifier function)
• Output layer: N3,1 (rectifier function), which gives the output y1
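To make the structure concrete, the sketch below defines the rectifier function and counts the trainable parameters of each fully connected layer, assuming the usual layout of one weight per input-unit pair plus one bias per unit. The helper names are hypothetical and look_back = 1 is taken from the program settings in these notes.

```python
def relu(x):
    """Rectifier function: negative inputs become 0, other inputs pass through."""
    return max(0.0, x)

def dense_parameter_count(num_inputs, num_units):
    """Weights (num_inputs * num_units) plus one bias per unit."""
    return num_inputs * num_units + num_units

look_back = 1
layer_sizes = [look_back, 12, 8, 1]  # input layer, 2 hidden layers, output layer

counts = [
    dense_parameter_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(counts)

print(relu(-3.5), relu(2.0))
print(counts, total)
```

Only the first layer's count depends on look_back, so a larger window adds 12 weights per extra input attribute.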

Python
        # Step 3: to compile the model
        print(" Step 3: to compile the model...")
        model.compile(loss="mean_squared_error", optimizer="adam")

        # Step 4: to fit the model
        print(" Step 4: to fit the model...")
        model.fit(trainDataX, trainDataY, epochs=200, batch_size=2)

        # Step 5: to evaluate the model
        print(" Step 5: to evaluate the model...")
        trainScores = model.evaluate(trainDataX, trainDataY)
        testScores = model.evaluate(testDataX, testDataY)
        print("")
        print("Training Scores --- {}: {}".format(model.metrics_names[0], trainScores))
        print("Test Scores --- {}: {}".format(model.metrics_names[0], testScores))

        return model
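The loss named in Step 3, the mean squared error, is the average of the squared differences between the true and the predicted target values. A minimal plain-Python sketch with made-up predictions (the predicted values below are hypothetical, not model output):

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

y_true = [118.0, 132.0, 129.0]
y_pred = [120.0, 130.0, 130.0]  # hypothetical predictions

# Differences are -2, 2 and -1, so the squares are 4, 4 and 1.
print(mean_squared_error(y_true, y_pred))
```

Squaring makes large errors count more than small ones, which is why a few badly predicted points can dominate the loss.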

Output
    Using TensorFlow backend.
    Phase 1: to train the model...
     Step 1: to load the data...
     Step 2: to define the model...
     Step 3: to compile the model...
     Step 4: to fit the model...
    Epoch 1/200
    114/114 [==============================] - 0s 2ms/step - loss: 2707.0334
    Epoch 2/200
    114/114 [==============================] - 0s 291us/step - loss: 731.5459
    Epoch 3/200
    114/114 [==============================] - 0s 274us/step - loss: 736.4551
    Epoch 4/200
    114/114 [==============================] - 0s 468us/step - loss: 752.3058
    Epoch 5/200
    114/114 [==============================] - 0s 274us/step - loss: 742.1441
    …
    Epoch 199/200
    114/114 [==============================] - 0s 163us/step - loss: 737.9143
    Epoch 200/200
    114/114 [==============================] - 0s 411us/step - loss: 748.3047
     Step 5: to evaluate the model...
    114/114 [==============================] - 0s 0us/step
    28/28 [==============================] - 0s 558us/step

    Training Scores --- loss: 742.4421172560307
    Test Scores --- loss: 3123.26806640625
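Since the loss is a mean squared error, its square root (the root mean squared error) is expressed in the same unit as the target attribute, here the number of passengers. The sketch below converts the two scores printed above; it only illustrates how to read the numbers and is not part of the course program.

```python
import math

# Scores copied from the output above (mean squared errors).
train_mse = 742.4421172560307
test_mse = 3123.26806640625

train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

print("Training RMSE: {:.2f}".format(train_rmse))
print("Test RMSE: {:.2f}".format(test_rmse))
```

The typical error is thus roughly 27 passengers on the training set and roughly 56 passengers on the test set.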

Phase 2: Model Storing (Model (In Memory) -> Model (In Disk))
In Keras, we have to store the neural network model as two components.
• The model structure (stored in JSON format)
• The model weight information (stored in HDF5 format)
Skipped (Similar)!

Phase 3: Model Reading (Model (In Disk) -> Model (In Memory))
In Keras, we have to read the neural network model from the two components.
• The model structure (stored in JSON format)
• The model weight information (stored in HDF5 format)
Skipped (Similar)!
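Since the storing and reading code is skipped in these notes, the sketch below shows one way the two components could be written and read back. It is an assumption, not the course's saveModel/readModel: the function names and the ".json"/".h5" suffixes are hypothetical, and some newer Keras releases may expect a different suffix for weight files. The calls to_json, save_weights, model_from_json and load_weights are standard Keras model methods.

```python
def save_model_sketch(model, filename_prefix):
    """Write the structure to <prefix>.json and the weights to <prefix>.h5."""
    structure_filename = filename_prefix + ".json"  # assumed naming scheme
    weights_filename = filename_prefix + ".h5"      # assumed naming scheme
    with open(structure_filename, "w") as json_file:
        json_file.write(model.to_json())            # model structure (JSON)
    model.save_weights(weights_filename)            # model weights (HDF5)
    return structure_filename, weights_filename

def read_model_sketch(filename_prefix):
    """Rebuild the model from <prefix>.json, then load the weights from <prefix>.h5."""
    # Imported here so that the saving function can be used without Keras installed.
    from keras.models import model_from_json
    with open(filename_prefix + ".json", "r") as json_file:
        model = model_from_json(json_file.read())
    model.load_weights(filename_prefix + ".h5")
    return model
```

The structure has to be restored before the weights, because the weight file alone does not describe the layers.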
