回归:预测燃油效率
回归:预测燃油效率
在一个回归问题中,我们的目标是预测一个连续值的输出,比如价格或概率。这与一个分类问题形成对比,我们的目标是从一系列类中选择一个类(例如,一张图片包含一个苹果或一个橘子,识别图片中的水果)。
本笔记本使用经典的[auto-mpg](https://archive.ics.uci.edu/ml/datasets/auto+mpg)数据集,建立了预测70年代末和80年代初汽车燃油效率的模型。为了做到这一点,我们将为该模型提供从那个时期开始的许多汽车的描述。此描述包括以下属性:气缸、排量、马力和重量。
此示例使用“tf.keras”API,有关详细信息,请参阅[本指南](https://www.tensorflow.org/guide/keras)。
import pathlib import matplotlib.pyplot as plt import pandas as pd import seaborn as sns import keras from keras import layers %matplotlib inlineThe Auto MPG dataset
The dataset is available from the UCI Machine Learning Repository.
Get the data
First download the dataset.
dataset_path = keras.utils.get_file("auto-mpg.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data") dataset_path 'C:\\Users\\YIUYE\\.keras\\datasets\\auto-mpg.data'Import it using pandas
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight','Acceleration', 'Model Year', 'Origin'] raw_dataset = pd.read_csv(dataset_path, names=column_names,na_values = "?", comment='\t',sep=" ", skipinitialspace=True)dataset = raw_dataset.copy() dataset.tail()| 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |
Clean the data
The dataset contains a few unknown values.
dataset.isnull().sum() MPG 0 Cylinders 0 Displacement 0 Horsepower 6 Weight 0 Acceleration 0 Model Year 0 Origin 0 dtype: int64To keep this initial tutorial simple drop those rows.
dataset = dataset.dropna()The "Origin" column is really categorical, not numeric. So convert that to a one-hot:
origin = dataset.pop('Origin') dataset['USA'] = (origin == 1)*1.0 dataset['Europe'] = (origin == 2)*1.0 dataset['Japan'] = (origin == 3)*1.0 dataset.tail()| 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1.0 | 0.0 | 0.0 |
| 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 0.0 | 1.0 | 0.0 |
| 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1.0 | 0.0 | 0.0 |
| 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1.0 | 0.0 | 0.0 |
| 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1.0 | 0.0 | 0.0 |
现在将数据集拆分为一个训练集和一个测试集。
我们将在模型的最终评估中使用测试集。
train_dataset = dataset.sample(frac=0.8,random_state=0) test_dataset = dataset.drop(train_dataset.index) sns.pairplot(train_dataset[[ "Cylinders", "Displacement", "Weight"]], diag_kind="kde") sns.set()Also look at the overall statistics:
train_stats = train_dataset.describe() train_stats.pop("MPG") train_stats = train_stats.transpose() train_stats| 314.0 | 5.477707 | 1.699788 | 3.0 | 4.00 | 4.0 | 8.00 | 8.0 |
| 314.0 | 195.318471 | 104.331589 | 68.0 | 105.50 | 151.0 | 265.75 | 455.0 |
| 314.0 | 104.869427 | 38.096214 | 46.0 | 76.25 | 94.5 | 128.00 | 225.0 |
| 314.0 | 2990.251592 | 843.898596 | 1649.0 | 2256.50 | 2822.5 | 3608.00 | 5140.0 |
| 314.0 | 15.559236 | 2.789230 | 8.0 | 13.80 | 15.5 | 17.20 | 24.8 |
| 314.0 | 75.898089 | 3.675642 | 70.0 | 73.00 | 76.0 | 79.00 | 82.0 |
| 314.0 | 0.624204 | 0.485101 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| 314.0 | 0.178344 | 0.383413 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| 314.0 | 0.197452 | 0.398712 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
Split features from labels
Separate the target value, or “label”, from the features. This label is the value that you will train the model to predict.
train_labels = train_dataset.pop('MPG') test_labels = test_dataset.pop('MPG')Normalize the data
Look again at the train_stats block above and note how different the ranges of each feature are.
规范化使用不同尺度和范围的特征是一个很好的实践。虽然模型可能在没有特征规范化的情况下收敛,但它使训练变得更加困难,并且使生成的模型依赖于输入中使用的单元的选择。
注意:尽管我们有意只从训练数据集生成这些统计信息,但这些统计信息也将用于规范化测试数据集。我们需要这样做,以将测试数据集投影到模型所训练的相同分发中。
def norm(x):return (x - train_stats['mean']) / train_stats['std'] normed_train_data = norm(train_dataset) normed_test_data = norm(test_dataset) def build_model():model = keras.Sequential([layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),layers.Dense(64, activation=tf.nn.relu),layers.Dense(1)])optimizer = keras.optimizers.RMSprop(0.001)model.compile(loss='mean_squared_error',optimizer=optimizer,metrics=['mean_absolute_error', 'mean_squared_error'])return model model = build_model() model.summary() _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_10 (Dense) (None, 64) 640 _________________________________________________________________ dense_11 (Dense) (None, 64) 4160 _________________________________________________________________ dense_12 (Dense) (None, 1) 65 ================================================================= Total params: 4,865 Trainable params: 4,865 Non-trainable params: 0 _________________________________________________________________Now try out the model. Take a batch of 10 examples from the training data and call model.predict on it.
example_batch = normed_train_data[:10] example_result = model.predict(example_batch) example_result array([[-0.03468257],[-0.01342154],[-0.15384783],[-0.18010283],[ 0.03922582],[-0.12172151],[ 0.10603201],[ 0.2442987 ],[ 0.00099315],[ 0.18530795]], dtype=float32)It seems to be working, and it produces a result of the expected shape and type.
Train the model
Train the model for 1000 epochs, and record the training and validation accuracy in the history object.
# Display training progress by printing a single dot for each completed epoch class PrintDot(keras.callbacks.Callback):def on_epoch_end(self, epoch, logs):if epoch % 100 == 0: print('')print('.', end='')EPOCHS = 1000history = model.fit(normed_train_data, train_labels,epochs=EPOCHS, validation_split = 0.2, verbose=0,callbacks=[PrintDot()]) .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... .................................................................................................... hist = pd.DataFrame(history.history) hist['epoch'] = history.epoch hist.tail()| 2.075518 | 0.940943 | 2.075518 | 8.913726 | 2.351839 | 8.913726 | 995 |
| 2.130111 | 0.953561 | 2.130111 | 9.769884 | 2.438282 | 9.769884 | 996 |
| 2.221040 | 0.951258 | 2.221040 | 9.664708 | 2.382888 | 9.664708 | 997 |
| 2.301870 | 0.980407 | 2.301870 | 9.934311 | 2.425505 | 9.934311 | 998 |
| 2.002580 | 0.887644 | 2.002580 | 9.484982 | 2.414742 | 9.484982 | 999 |
此图显示在大约100个周期后,验证错误几乎没有改善,甚至恶化。让我们更新“model.fit”调用,以便在验证分数没有提高时自动停止培训。我们将使用一个早期的回调来测试每个时代的训练条件。如果一个设定的时间段没有显示出改善,那么自动停止训练。
model = build_model()# The patience parameter is the amount of epochs to check for improvement early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)history = model.fit(normed_train_data, train_labels, epochs=EPOCHS,validation_split = 0.2, verbose=0, callbacks=[early_stop, PrintDot()])plot_history(history) ................................................. loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=0)print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae)) Testing set Mean Abs Error: 1.79 MPGMake predictions
Finally, predict MPG values using data in the testing set:
test_predictions = model.predict(normed_test_data).flatten()plt.scatter(test_labels, test_predictions) plt.xlabel('True Values [MPG]') plt.ylabel('Predictions [MPG]') plt.axis('equal') plt.axis('square') plt.xlim([0,plt.xlim()[1]]) plt.ylim([0,plt.ylim()[1]]) _ = plt.plot([-100, 100], [-100, 100]) error = test_predictions - test_labels plt.hist(error, bins = 25) plt.xlabel("Prediction Error [MPG]") _ = plt.ylabel("Count")它不是很高斯的,但是我们可以预期,因为样本的数量非常小。
总结
- 上一篇: 有谁可以说下阿克蒂思卫浴AQ390智能马
- 下一篇: kaggle房价预测问题