Unit 1. Preparing Data

Before training a machine learning model, you need to prepare sample data for training.

Prerequisites

Upload the wind power algorithm files and the weather data files accumulated at the wind farm to EnOS HDFS in advance, so that you can create the data tables required for model training. In this tutorial, the HDFS path for storing files is hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/.

Prepare Wind Power Algorithm Data

You can map the files uploaded to EnOS HDFS to a Hive partition table, using the wind farm ID (masterid) as the partition key to simulate the actual scenario, by following these steps.

  1. Use the powers and speeds files to create a Hive partition table (enos_power_speeds) that is partitioned by site (masterid) and type. For the powers type, the value column holds the power; for the speeds type, it holds the wind speed.

    create external table enos_power_speeds(utcTimeStamp timestamp, value double)
    partitioned by (masterid string,type string)
    row format delimited fields terminated by ','
    lines terminated by '\n'
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new'
    tblproperties ("skip.header.line.count"="1");
    
  2. Add partitions for the powers and speeds data:

    alter table enos_power_speeds add partition(masterid='CGNWF0046',type='powers')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/CGNWF0046/powers';
    
    alter table enos_power_speeds add partition(masterid='CGNWF0046',type='speeds')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/CGNWF0046/speeds';
    
  3. Specify the batch data processing queue name in Hive (requested through resource management):

    set mapreduce.job.queuename=root.test_enos_01;
    
  4. Check and verify the powers partition (similar checks for the other data are sketched after this list):

    select count(1) from enos_power_speeds where masterid='CGNWF0046' and type='powers';
    
  5. Use the file speed_to_power.csv to create the table speed_to_power:

    create external table speed_to_power(speed double,power double)
    row format delimited fields terminated by ','
    lines terminated by '\n'
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/speed_to_power'
    tblproperties ("skip.header.line.count"="1");
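
As a further check (an optional sketch, not part of the original steps), you can run similar count queries against the speeds partition and the speed_to_power table to confirm that the data is visible to Hive:

    -- count the rows mapped into the speeds partition and the power curve table
    select count(1) from enos_power_speeds where masterid='CGNWF0046' and type='speeds';
    select count(1) from speed_to_power;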
    

Prepare External Weather Data

Create an external weather data table by following these steps:

  1. Create the table external_weather, partitioned by weather data source type (EC and GFS):

    create external table external_weather(utcTimeStamp timestamp, pres double,tmp double,wd double,ws double, weatherpublishstarttime timestamp)
    partitioned by (type string)
    row format delimited fields terminated by ','
    lines terminated by '\n'
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938'
    tblproperties ("skip.header.line.count"="1");
    
  2. Add partitions for the EC and GFS data:

    alter table external_weather add partition(type='EC')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938/EC';
    
    alter table external_weather add partition(type='GFS')
    location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kongmingmldemo/new/112.647/21.938/GFS';
    
  3. Aggregate the weather data by source type and timestamp into a new table (external_weather1):

    create table external_weather1 as
    select
    utctimestamp,
    avg(pres) as pres,
    avg(tmp) as tmp,
    avg(wd) as wd,
    avg(ws) as ws,
    max(weatherpublishstarttime) as weatherpublishstarttime,
    type
    from
    external_weather
    group by type, utctimestamp;
    
  4. Check that the aggregation took effect (an additional per-source check is sketched after this list):

    select * from external_weather1 where utcTimeStamp='2018-07-03';
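
To compare the row counts per weather source before and after the aggregation, an additional check might look like this (a sketch, not part of the original steps):

    -- rows per source before aggregation
    select type, count(1) from external_weather group by type;
    -- rows per source after aggregation
    select type, count(1) from external_weather1 group by type;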
    

Integrate Final Data

Create the following Hive table (kmmlds2) to integrate the data:

create table kmmlds2 as
select
t1.utctimestamp as X_basic_time,
hour(t1.utctimestamp) as X_basic_hour,
current_timestamp as X_basic_forecast_time,
(row_number() over(partition by 1) - 1)%49 as X_basic_horizon,
datediff(t1.utctimestamp,'2018-07-01') as i_set,
t3.weatherpublishstarttime as EC_nwp_time,
t3.ws as EC_ws,
t3.wd as EC_wd,
t3.tmp as EC_tmp,
t3.pres as EC_press,
t3.rho as EC_rho,
t3.dist as EC_dist,
t4.weatherpublishstarttime as GFS_nwp_time,
t4.ws as GFS_ws,
t4.wd as GFS_wd,
t4.tmp as GFS_tmp,
t4.pres as GFS_press,
t4.rho as GFS_rho,
t4.dist as GFS_dist,
t1.value as speed,
t2.value as power
from
(select utctimestamp,value from enos_power_speeds where masterid='CGNWF0046' and type='speeds' and value is not null) t1
left join
(select utctimestamp,value from enos_power_speeds where masterid='CGNWF0046' and type='powers' and value is not null) t2
on t1.utctimestamp = t2.utctimestamp
left join
(select utctimestamp,pres, tmp, wd, ws, weatherpublishstarttime,type,pres*sqrt(2)/(287.05 * tmp+273.15) as rho,row_number() over(partition by 1) +12 as dist from external_weather1 where type='EC') t3
on t1.utctimestamp = t3.utctimestamp
left join
(select utctimestamp,pres, tmp, wd, ws, weatherpublishstarttime,type,pres*sqrt(2)/(287.05 * tmp+273.15) as rho,row_number() over(partition by 1) +12 as dist from external_weather1 where type='GFS') t4
on t1.utctimestamp = t4.utctimestamp;
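
A quick sanity check that the integration produced rows (an optional sketch, not part of the original steps):

-- confirm that the integrated table is not empty and inspect a few rows
select count(1) from kmmlds2;
select * from kmmlds2 limit 5;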


Copy the data file generated for kmmlds2 into the partition directory of the final table kmmlds1:

hdfs dfs -cp /user/hive/warehouse/data_o15632609593521.db/kmmlds2/000000_0 /user/hive/warehouse/data_o15632609593521.db/kmmlds1/CGNWF0046/


Create the external partitioned table kmmlds1, which uses the wind farm ID (masterid) as the partition key:

CREATE external TABLE `kmmlds1`(
`x_basic_time` timestamp,
`x_basic_hour` int,
`x_basic_forecast_time` timestamp,
`x_basic_horizon` int,
`i_set` int,
`ec_nwp_time` timestamp,
`ec_ws` double,
`ec_wd` double,
`ec_tmp` double,
`ec_press` double,
`ec_rho` double,
`ec_dist` int,
`gfs_nwp_time` timestamp,
`gfs_ws` double,
`gfs_wd` double,
`gfs_tmp` double,
`gfs_press` double,
`gfs_rho` double,
`gfs_dist` int,
`speed` double,
`power` double)
partitioned by(masterid string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/data_o15632609593521.db/kmmlds1';

Add a partition for each wind farm:

alter table kmmlds1 add if not exists partition(masterid='ABCDE0001')
location '/user/hive/warehouse/data_o15632609593521.db/kmmlds1/ABCDE0001';

alter table kmmlds1 add if not exists partition(masterid='CGNWF0046')
location '/user/hive/warehouse/data_o15632609593521.db/kmmlds1/CGNWF0046';
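
Assuming the copy above succeeded, a quick check on the final table might look like this (a sketch; the expected counts depend on your data):

-- rows visible through the CGNWF0046 partition
select count(1) from kmmlds1 where masterid='CGNWF0046';
select * from kmmlds1 where masterid='CGNWF0046' limit 10;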

Use External Weather Data

If you use external weather data directly, create the following two tables. kmmlds91x contains the sequence number, the time fields, and the EC and GFS weather fields, and kmmlds91y contains the sequence number together with the wind speed and power.

CREATE external TABLE `kmmlds91x`(
`sequence` int,
`i_set` int,
`x_basic_forecast_time` timestamp,
`x_basic_horizon` int,
`x_basic_time` timestamp,
`x_basic_hour` int,
`ec_nwp_time` timestamp,
`ec_dist` int,
`ec_ws` double,
`ec_wd` double,
`ec_rho` double,
`ec_press` double,
`ec_tmp` double,
`gfs_nwp_time` timestamp,
`gfs_dist` int,
`gfs_ws` double,
`gfs_wd` double,
`gfs_rho` double,
`gfs_press` double,
`gfs_tmp` double
)
row format delimited fields terminated by ','
lines terminated by '\n'
location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kmmlds91x'
tblproperties ("skip.header.line.count"="1");


CREATE external TABLE `kmmlds91y`(
`sequence` int,
`speed` double,
`power` double
)
row format delimited fields terminated by ','
lines terminated by '\n'
location 'hdfs://azbeta/user/hive/warehouse/data_o15632609593521.db/kmmlds91y'
tblproperties ("skip.header.line.count"="1");
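
Both tables carry the sequence column, so the weather features and the corresponding target values can be combined when needed; the following join is only an illustrative sketch:

-- combine features (kmmlds91x) with targets (kmmlds91y) by sequence number
select x.x_basic_time, x.ec_ws, x.gfs_ws, y.speed, y.power
from kmmlds91x x
join kmmlds91y y on x.`sequence` = y.`sequence`;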

Create Status Table

  1. Use the following command to create a status table that records daily data updates (a sample query for reading the status back is sketched after this list):

    create table status_tbl(masterid string, updatetime timestamp, flag int);
    
  2. Insert a status flag:

    insert into status_tbl values('ABCDE0001','2020-06-13',1);
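
To read the status back, for example to find the latest update for each wind farm, a query such as the following can be used (a sketch, assuming that flag = 1 marks a completed update):

    -- latest update time per wind farm, keeping only rows where flag = 1
    select masterid, max(updatetime) as last_update
    from status_tbl
    where flag = 1
    group by masterid;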
    

Description of Integrated Data Structure

The data type and description of the integrated data are shown in the following table:

Field name              Type       Description
x_basic_time            timestamp  Time field; one record with a UTC time label is generated per hour
x_basic_hour            int        Hour extracted from x_basic_time
x_basic_forecast_time   timestamp  Forecast time; generally taken as 00:00 (the zero point) of the next day
x_basic_horizon         int        Sequence number from 0 to 48 (repeating)
i_set                   int        Set number corresponding to the horizon (0, 1, 2, ...)
ec_nwp_time             timestamp  EC numerical weather prediction time; the weather for the next day is generally forecast at 12:00 every day
ec_ws                   double     Wind speed
ec_wd                   double     Wind direction
ec_tmp                  double     Temperature
ec_press                double     Air pressure
ec_rho                  double     Air density
ec_dist                 int        Counter with no practical meaning
gfs_nwp_time            timestamp  GFS numerical weather prediction time; the weather for the next day is generally forecast at 12:00 every day
gfs_ws                  double     Wind speed
gfs_wd                  double     Wind direction
gfs_tmp                 double     Temperature
gfs_press               double     Air pressure
gfs_rho                 double     Air density
gfs_dist                int        Counter with no practical meaning
speed                   double     Actual wind speed
power                   double     Actual power value