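The author's earlier posts in this series:
cs109-energy + Harvard University Energy Exploration Project, Part-1 (Project Background)
cs109-energy + Harvard University Energy Exploration Project, Part-2.1 (Data Wrangling)
This post finishes the data wrangling part.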

Data Wrangling

Weather data

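There are two sources of weather data.

2014: local weather data from the roof of the Gund building at the GSD (Graduate School of Design).
2012 and 2013: weather data purchased from weather stations in Cambridge, Massachusetts. Note that the 2012 and 2013 data come from different weather stations.

Here is the original weather data after simple cleaning and unit conversion. As you can see, the time interval is 5 minutes.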

weather2014 = pd.read_excel('Data/weather-2014.xlsx')
weather2014.head()

Convert to hourly data by resampling.
Here is the clean hourly weather data for 2014.

weather2014 = weather2014.set_index('Datetime')
# Average the 5-minute readings within each hour
weather2014 = weather2014.resample('h').mean()
weather2014.head()
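
As a minimal sketch of what the resampling step does, the snippet below builds a synthetic 5-minute temperature series (the values are made up for illustration and do not come from the real file) and averages it to hourly values. It assumes a recent pandas, where `resample` needs an explicit aggregation such as `.mean()`.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute readings covering two hours (illustrative values only).
index = pd.date_range('2014-01-01 00:00', periods=24, freq='5min')
demo = pd.DataFrame({'T-C': np.arange(24, dtype=float)}, index=index)

# Each hourly value is the mean of the twelve 5-minute readings in that hour.
hourly_demo = demo.resample('h').mean()
print(hourly_demo)
```

The first hour averages the readings 0 to 11 and the second hour averages 12 to 23, so the result has two rows.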

Here is the original weather data for 2012 and 2013 after a little cleaning, including unit conversion. As you can see, the Datetime format is not correct.

Correct the timestep.
Here is the hourly data after cleaning.

weather2012and2013['Datetime'] = pd.to_datetime(weather2012and2013['Datetime'], format='%Y-%m-%d-%H')
weather2012and2013 = weather2012and2013.set_index('Datetime')
weather2012and2013.head()
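
To illustrate the format string used above, here is a small sketch with hypothetical raw strings in the year-month-day-hour layout that `'%Y-%m-%d-%H'` implies. The sample strings are assumptions, not values from the purchased data.

```python
import pandas as pd

# Hypothetical raw strings; the trailing field is the hour of the day.
raw = pd.Series(['2012-01-01-00', '2012-01-01-01', '2012-01-01-13'])

# Parse with the same explicit format string as in the article.
parsed = pd.to_datetime(raw, format='%Y-%m-%d-%H')
print(parsed)
```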

Hourly weather data

Combine two files and add more features including cooling degrees, heating degrees, humidity ratio and dehumidification.
Here is all the hourly weather data.

# Combine two weather files
hourlyWeather = pd.concat([weather2014, weather2012and2013])
hourlyWeather.index.name = None
hourlyWeather.sort_index(inplace = True)

# Add more features

# Convert relative humidity to specific humidity

Mw=18.0160 # molecular weight of water
Md=28.9660 # molecular weight of dry air
R =  8.31432E3 # gas constant
Rd = R/Md # specific gas constant for dry air
Rv = R/Mw # specific gas constant for vapour
Lv = 2.5e6 # heat release for condensation of water vapour [J kg-1]
eps = Mw/Md

#saturation pressure
def esat(T):
    ''' get saturation pressure (units [Pa]) for a given air temperature (units [K])'''
    from numpy import log10
    TK = 273.15
    e1 = 101325.0
    logTTK = log10(T/TK)
    esat =  e1*10**(10.79586*(1-TK/T)-5.02808*logTTK+ 1.50474*1e-4*(1.-10**(-8.29692*(T/TK-1)))+ 0.42873*1e-3*(10**(4.76955*(1-TK/T))-1)-2.2195983) 
    return esat

def rh2sh(RH,p,T):
    '''Convert relative humidity (unitless fraction) to specific humidity [kg/kg], stored below as the humidity ratio feature'''
    es = esat(T)
    W = Mw/Md*RH*es/(p-RH*es)

    return W/(1.+W)


p = hourlyWeather['pressure-mbar'] * 100
RH = hourlyWeather['RH-%'] / 100
T = hourlyWeather['T-C'] + 273.15
w = rh2sh(RH,p,T)

hourlyWeather['humidityRatio-kg/kg'] = w
hourlyWeather['coolingDegrees'] = hourlyWeather['T-C'] - 12
hourlyWeather.loc[hourlyWeather['coolingDegrees'] < 0, 'coolingDegrees'] = 0

hourlyWeather['heatingDegrees'] = 15 - hourlyWeather['T-C']
hourlyWeather.loc[hourlyWeather['heatingDegrees'] < 0, 'heatingDegrees'] = 0

hourlyWeather['dehumidification'] = hourlyWeather['humidityRatio-kg/kg'] - 0.00886
hourlyWeather.loc[hourlyWeather['dehumidification'] < 0, 'dehumidification'] = 0

#hourlyWeather.to_excel('Data/hourlyWeather.xlsx')
hourlyWeather.head()

This code combines the two weather data files and adds more features: cooling degrees, heating degrees, humidity ratio and dehumidification.

First, weather2014 and weather2012and2013 are stacked into a new DataFrame, hourlyWeather, and the index name is set to None. sort_index then sorts the rows by time, in place.
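
`DataFrame.append` was removed in pandas 2.0, and `pd.concat` is the supported way to stack rows. A minimal sketch with two tiny stand-in frames (the values are invented for illustration):

```python
import pandas as pd

# Two tiny frames standing in for the 2014 and 2012-2013 weather tables.
later = pd.DataFrame({'T-C': [5.0, 6.0]},
                     index=pd.to_datetime(['2014-01-01 00:00', '2014-01-01 01:00']))
earlier = pd.DataFrame({'T-C': [1.0, 2.0]},
                       index=pd.to_datetime(['2012-01-01 00:00', '2012-01-01 01:00']))

# pd.concat stacks the rows; sort_index restores chronological order.
combined = pd.concat([later, earlier]).sort_index()
print(combined)
```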

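Next, several constants and functions are defined: the molecular weight of water Mw, the molecular weight of dry air Md, the gas constant R, the specific gas constant for dry air Rd, the specific gas constant for water vapour Rv, the latent heat of condensation of water Lv, and eps, the ratio of Mw to Md. The function esat computes the saturation vapour pressure at a given air temperature, and rh2sh converts relative humidity (RH) to specific humidity, which is stored as the humidity ratio feature.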
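
As a quick sanity check on these two functions, the sketch below restates them in self-contained form and evaluates them at two familiar conditions. The test conditions (0 °C, and 20 °C at 50 % relative humidity and standard pressure) are chosen here for illustration and are not part of the original analysis.

```python
import numpy as np

Mw = 18.0160   # molecular weight of water
Md = 28.9660   # molecular weight of dry air

def esat(T):
    """Saturation vapour pressure [Pa] for air temperature T [K]."""
    TK = 273.15
    e1 = 101325.0
    return e1 * 10 ** (10.79586 * (1 - TK / T) - 5.02808 * np.log10(T / TK)
                       + 1.50474e-4 * (1.0 - 10 ** (-8.29692 * (T / TK - 1)))
                       + 0.42873e-3 * (10 ** (4.76955 * (1 - TK / T)) - 1)
                       - 2.2195983)

def rh2sh(RH, p, T):
    """Relative humidity (fraction) -> specific humidity [kg/kg]."""
    es = esat(T)
    W = Mw / Md * RH * es / (p - RH * es)   # mixing ratio of vapour to dry air
    return W / (1.0 + W)

e0 = esat(273.15)                   # saturation pressure at 0 degrees C
q = rh2sh(0.5, 101325.0, 293.15)    # 50 % RH, standard pressure, 20 degrees C
print(round(e0, 1), round(q, 5))
```

The saturation pressure at 0 °C should come out near 611 Pa, and the humidity at 20 °C and 50 % RH should be roughly 0.007 kg/kg, which is below the 0.00886 dehumidification threshold used in the article.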

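Then the pressure, relative humidity and temperature columns are converted to the required units (Pa, fraction and K), and rh2sh computes the humidity ratio, which is added to the DataFrame. Cooling degrees, heating degrees and dehumidification are added next: cooling degrees is the temperature minus 12, set to zero if negative; heating degrees is 15 minus the temperature, set to zero if negative; dehumidification is the humidity ratio minus 0.00886, set to zero if negative.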
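
The "subtract, then set negatives to zero" pattern can also be written with `Series.clip`. This is an alternative sketch, not the article's code, and the temperatures are illustrative:

```python
import pandas as pd

# Illustrative outdoor temperatures in degrees C (not real data).
temps = pd.Series([-5.0, 10.0, 12.0, 14.0, 20.0])

# clip(lower=0) replaces negative differences with zero in one step.
cooling = (temps - 12).clip(lower=0)
heating = (15 - temps).clip(lower=0)
print(pd.DataFrame({'T-C': temps, 'coolingDegrees': cooling, 'heatingDegrees': heating}))
```

Between 12 °C and 15 °C both features are positive, because the two base temperatures overlap.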

Finally, the first few rows of hourlyWeather are displayed. The commented-out to_excel line can be enabled to write the DataFrame to an Excel file.

ax = hourlyWeather.plot(y = 'T-C', figsize = (15, 6))
ax.set_facecolor('w')
plt.title('All hourly temperature', fontsize = 16)
plt.ylabel(r'Temperature ($^\circ$C)')
plt.show()

ax = hourlyWeather.plot(y = 'solarRadiation-W/m2', figsize = (15, 6))
ax.set_facecolor('w')
plt.title('All hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()

# Select October 2014 with label-based partial string indexing
ax = hourlyWeather.loc['2014-10'].plot(y = 'T-C', figsize = (15, 6), marker = 'o')
ax.set_facecolor('w')
plt.title('Selected hourly temperature', fontsize = 16)
plt.ylabel(r'Temperature ($^\circ$C)', fontsize = 13)
plt.show()

ax = hourlyWeather.loc['2014-10'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker = 'o')
ax.set_facecolor('w')
plt.title('Selected hourly solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 13)
plt.show()

Figures: all hourly temperature (T-C); all hourly solar radiation; hourly temperature and solar radiation for October 2014.

Daily weather data

dailyWeather = hourlyWeather.resample('D').mean()
#dailyWeather.to_excel('Data/dailyWeather.xlsx')
dailyWeather.head()

ax = dailyWeather.plot(y = 'T-C', figsize = (15, 6), marker = 'o')
ax.set_facecolor('w')
plt.title('All daily temperature', fontsize = 16)
plt.ylabel(r'Temperature ($^\circ$C)', fontsize = 13)
plt.show()

# Select the year 2014 with label-based partial string indexing
ax = dailyWeather.loc['2014'].plot(y = 'T-C', figsize = (15, 6), marker = 'o')
ax.set_facecolor('w')
plt.title('Selected daily temperature', fontsize = 16)
plt.ylabel(r'Temperature ($^\circ$C)', fontsize = 13)
plt.show()

ax = dailyWeather.loc['2014'].plot(y = 'solarRadiation-W/m2', figsize = (15, 6), marker = 'o')
ax.set_facecolor('w')
plt.title('Selected daily solar radiation', fontsize = 16)
plt.ylabel(r'$W/m^2$', fontsize = 14)
plt.show()

Figures: daily temperature for all data; daily temperature for 2014; daily solar radiation for 2014.

Features related to occupancy

Occupancy is a number between 0 and 1. 0 means no occupants and 1 means normal occupancy. It is estimated from holidays, weekends and the school academic calendar.

holidays = pd.read_excel('Data/holidays.xlsx')
holidays.head()

hourlyTimestamp = pd.date_range(start = '2011/7/1', end = '2014/10/31', freq = 'h')
occupancy = np.ones(len(hourlyTimestamp)) 

hourlyOccupancy = pd.DataFrame(data = occupancy, index = hourlyTimestamp, columns = ['occupancy'])


Saturdays = hourlyOccupancy.index.weekday == 5
Sundays = hourlyOccupancy.index.weekday == 6
hourlyOccupancy.loc[Saturdays, 'occupancy'] = 0.5 
hourlyOccupancy.loc[Sundays, 'occupancy'] = 0.5 


# Overwrite occupancy for every hour inside each holiday period
for i in range(len(holidays)):
    timestamp = pd.date_range(start = holidays.loc[i, 'startDate'], end = holidays.loc[i, 'endDate'], freq = 'h')
    hourlyOccupancy.loc[timestamp, 'occupancy'] = holidays.loc[i, 'value']

#hourlyHolidays['Datetime'] = pd.to_datetime(hourlyHolidays['Datetime'], format="%Y-%m-%d %H:%M:%S")
# Daily cycle feature, shifted by 3 hours
hourlyOccupancy['cosHour'] = np.cos((hourlyOccupancy.index.hour - 3) * 2 * np.pi / 24)

# Daily occupancy is the mean of the hourly values
dailyOccupancy = hourlyOccupancy.resample('D').mean()
dailyOccupancy.drop('cosHour', axis = 1, inplace = True)

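This code builds a DataFrame of timestamps and occupancy, and estimates occupancy from weekends, holidays and the daily cycle.

First, pandas' date_range generates hourlyTimestamp, an hourly index from July 1, 2011 to October 31, 2014. An array named occupancy of the same length is created with every element equal to 1. This array becomes the occupancy column of a DataFrame named hourlyOccupancy, indexed by the timestamps.

Next, the Saturday and Sunday timestamps are selected and their occupancy is set to 0.5 to reflect the effect of weekends.

Then a loop goes through each holiday stored in the holidays DataFrame, reads its start and end time, and sets the occupancy of the hourly timestamps in that range to the holiday's value.

After that, a cosine of the hour is added to hourlyOccupancy as a new column, cosHour, to capture the daily cycle. The hour of each timestamp is reduced by 3, multiplied by 2π, divided by 24, and passed to the cosine.

Finally, hourlyOccupancy is resampled by day, the cosHour column is dropped, and the result is stored in dailyOccupancy.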
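
The steps above can be sketched on a single week of data. The holiday table below is hypothetical (one all-day holiday on Monday, 2014-09-01), and label-based slicing is used here in place of building a `date_range` for each holiday; both are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# One week of hourly timestamps; 2014-09-01 is a Monday.
idx = pd.date_range('2014-09-01', '2014-09-07 23:00', freq='h')
occ = pd.DataFrame({'occupancy': np.ones(len(idx))}, index=idx)

# Weekends (Saturday = 5, Sunday = 6) get half occupancy.
occ.loc[occ.index.weekday >= 5, 'occupancy'] = 0.5

# Hypothetical holiday table using the column names from the article.
holidays_demo = pd.DataFrame({'startDate': [pd.Timestamp('2014-09-01')],
                              'endDate': [pd.Timestamp('2014-09-01 23:00')],
                              'value': [0.0]})
for row in holidays_demo.itertuples(index=False):
    # Label-based slicing on a DatetimeIndex includes both end points.
    occ.loc[row.startDate:row.endDate, 'occupancy'] = row.value

# Daily cycle: +1 at 03:00 and -1 at 15:00.
occ['cosHour'] = np.cos((occ.index.hour - 3) * 2 * np.pi / 24)

# Daily occupancy is the mean of the 24 hourly values.
daily = occ[['occupancy']].resample('D').mean()
print(daily)
```

The daily result is 0 for the holiday, 1 for the four working days and 0.5 for the weekend.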

Merge energy consumption data with weather and occupancy features

hourlyElectricityWithFeatures = hourlyElectricity.join(hourlyWeather, how = 'inner')
hourlyElectricityWithFeatures = hourlyElectricityWithFeatures.join(hourlyOccupancy, how = 'inner')
hourlyElectricityWithFeatures.dropna(axis=0, how='any', inplace = True)
hourlyElectricityWithFeatures.to_excel('Data/hourlyElectricityWithFeatures.xlsx')
hourlyElectricityWithFeatures.head()

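This code merges the energy consumption data with the weather and occupancy features and saves the result to an Excel file.

First, pandas' join merges hourlyElectricity with hourlyWeather as an inner join, keeping only the timestamps present in both. join is then used again to merge hourlyOccupancy into the result, also as an inner join. Finally, dropna removes rows with missing values. The merged DataFrame is saved to the Excel file 'hourlyElectricityWithFeatures.xlsx' and its first few rows are displayed.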
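
A small sketch of how the inner join and dropna interact, using made-up values and a placeholder column name (`electricity-kWh` is an assumption, not necessarily the column name in the real data):

```python
import pandas as pd

idx_energy = pd.date_range('2014-01-01 00:00', periods=4, freq='h')
idx_weather = pd.date_range('2014-01-01 01:00', periods=4, freq='h')

# Made-up values; the 02:00 energy reading is missing.
energy = pd.DataFrame({'electricity-kWh': [10.0, 11.0, None, 13.0]}, index=idx_energy)
weather = pd.DataFrame({'T-C': [1.0, 2.0, 3.0, 4.0]}, index=idx_weather)

# The inner join keeps 01:00 to 03:00 (the shared hours); dropna then removes 02:00.
merged = energy.join(weather, how='inner').dropna(axis=0, how='any')
print(merged)
```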

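The same step is applied to the chilled water and steam consumption data: each is merged with the weather and occupancy features and saved to an Excel file. Chilled water is handled first, then steam, and finally the daily energy data.

The chilled water and steam data are processed in the same way as the electricity data. join merges the energy data with the weather and occupancy features as an inner join, rows with missing values are dropped, and the merged DataFrame is saved to an Excel file. The daily data is handled the same way.

Because the processing is identical, these code blocks can be reused by copying and pasting.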
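
As an alternative to copying and pasting, the repeated merge could be wrapped in a helper. The function name `merge_with_features`, the demo column names and the synthetic tables below are all hypothetical; this is a sketch of the idea rather than the original code.

```python
import pandas as pd

def merge_with_features(energy, weather, occupancy):
    """Inner-join an energy table with weather and occupancy, then drop incomplete rows."""
    merged = energy.join(weather, how='inner').join(occupancy, how='inner')
    return merged.dropna(axis=0, how='any')

# Tiny synthetic tables so the sketch runs on its own.
idx = pd.date_range('2014-01-01', periods=3, freq='h')
weather_demo = pd.DataFrame({'T-C': [1.0, 2.0, 3.0]}, index=idx)
occupancy_demo = pd.DataFrame({'occupancy': [1.0, 1.0, 0.5]}, index=idx)
energy_demo = {
    'chilledWater': pd.DataFrame({'chilledWater-demo': [0.1, 0.2, 0.3]}, index=idx),
    'steam': pd.DataFrame({'steam-demo': [5.0, None, 7.0]}, index=idx),
}

# Apply the same merge to every energy table.
merged_demo = {name: merge_with_features(df, weather_demo, occupancy_demo)
               for name, df in energy_demo.items()}
print({name: df.shape for name, df in merged_demo.items()})
```

Each merged table could then be written out with `to_excel`, as in the electricity example.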

A note for features

Nomenclature (alphabetical)

coolingDegrees:

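Cooling degrees: if T-C - 12 > 0, then T-C - 12, else 0. This assumes that no cooling is needed when the outdoor temperature is below 12°C, which is true for many buildings. It is useful for daily prediction, because the daily average of hourly cooling degrees is a better predictor than the daily average of hourly temperature.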

cosHour:

$\cos\left((\text{hourOfDay} - 3) \cdot \frac{2\pi}{24}\right)$

dehumidification

If humidityRatio - 0.00886 > 0, then humidityRatio - 0.00886, else 0. This is useful for chilled water prediction, especially daily chilled water prediction.

heatingDegrees

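If 15 - T-C > 0, then 15 - T-C, else 0. This assumes that no heating is needed when the outdoor temperature is above 15°C. It is useful for daily prediction, because the daily average of hourly heating degrees is a better predictor than the daily average of hourly temperature.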

occupancy

A number between 0 and 1. 0 means no occupants and 1 means normal occupancy. It is estimated from holidays, weekends and the school academic calendar.

pressure-mbar

atmospheric pressure

RH-%

Relative humidity

Tdew-C

Dew-point temperature

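humidityRatio-kg/kg

Humidity ratio is an important factor for chilled water prediction, because chilled water is also used to dry the air supplied to rooms. Using the humidity ratio is more efficient and effective than using relative humidity and dew-point temperature.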
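
The claim under coolingDegrees and heatingDegrees, that averaging hourly degrees is better than averaging temperature, can be illustrated with a synthetic day whose mean temperature is exactly at the 12°C base. The sinusoidal profile is an assumption chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic day: cold night, warm afternoon, daily mean of 12 degrees C.
idx = pd.date_range('2014-05-01', periods=24, freq='h')
hours = np.arange(24)
temps = pd.Series(12 + 6 * np.sin((hours - 9) * 2 * np.pi / 24), index=idx)

# Cooling degrees computed from the daily mean temperature: essentially zero.
mean_temp = temps.resample('D').mean().iloc[0]
cooling_from_mean = max(mean_temp - 12, 0)

# Daily mean of hourly cooling degrees: keeps the warm afternoon hours.
mean_cooling = (temps - 12).clip(lower=0).resample('D').mean().iloc[0]
print(cooling_from_mean, mean_cooling)
```

The daily mean temperature suggests no cooling at all, while the mean of the hourly cooling degrees is about 1.9, which reflects the cooling needed during the warm half of the day.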

Reference

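cs109-energy + Harvard University Energy Exploration Project, Part-1 (Project Background)
cs109-energy + Harvard University Energy Exploration Project, Part-2.1 (Data Wrangling)
A complete hands-on machine learning project with code and data analysis: the Harvard University energy consumption prediction project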
Part 1-3 Project Overview, Data Wrangling and Exploratory Analysis-DEC10
Prediction of Buildings Energy Consumption