Unit 2. Designing a Pipeline


EAP MI Pipelines provides logic, file, model, and other types of operators to meet the business needs of a variety of model training processes and scenarios. This unit describes how to use these operators to orchestrate a pipeline and develop a wind farm power generation prediction model.

Prerequisites

Before orchestrating a pipeline, create a new experiment in MI Pipelines by following these steps:

  1. Log in to the EnOS Management Console, and select Enterprise Analytics Platform > Machine Intelligence Studio > MI Pipelines from the left navigation bar to open the Experiment List homepage.

  2. Click New Experiment and enter the name (kmmkdsdemo) and description of the experiment.

  3. Click OK to create the experiment and open its Pipeline Designer page, where you can design and develop the pipeline.

    ../_images/creating_experiment.png

Designing Pipelines

The following operators are needed to orchestrate the pipeline in this tutorial:

  1. Hive operator: queries Hive to get the list of sites to be trained (partition or field query), and gets the keytab and Kerberos profiles required by the Hive operator.
  2. Git Directory operator: gets the file transform1.py from the Git directory and uses it as the input of the Python operator.
  3. Python operator: formats the input file and passes it as the input of the ParallelFor operator.
  4. ParallelFor operator: implements the loop processing for each site.


Drag the operators onto the editing canvas. The orchestrated pipeline is shown in the figure below:

../_images/pipeline_overview.png


The configuration instructions for each operator in the pipeline are as follows:

Hive Operator

Name: Hive

Description: query Hive to get the list of sites to be trained, and get the keytab and krb5 profiles

Input parameters

| Parameter Name   | Data Type | Operation Type | Value |
| ---------------- | --------- | -------------- | ----- |
| data_source_name | String    | Declaration    | Name of the registered Hive data source |
| sqls             | List      | Declaration    | ["set tez.am.resource.memory.mb=1024", "select distinct lower(masterid) as masterid from kmmlds1"] |
| queue            | String    | Declaration    | root.eaptest01 (name of the big data queue applied for through resource management) |

Output parameters

| Parameter Name | Data Type |
| -------------- | --------- |
| resultset      | File      |

A sample of the operator configuration is given as follows:

../_images/hive_config.png
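
The resultset file is consumed downstream by the Python operator. The following is a minimal sketch of inspecting its content, assuming the file is serialized as JSON (the nested shape matches the query output format quoted in the Python operator description below); the file name is a hypothetical local copy:

    import json

    # Assumption: Hive.resultset is serialized as JSON. "resultset.json"
    # is a hypothetical local copy used only for illustration.
    with open("resultset.json") as f:
        resultset = json.load(f)

    # The second element holds one single-column row per distinct masterid:
    print(resultset)  # e.g. [[], [["abcde0001"], ["cgnwf0046"]]]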

Git Directory Operator

Name: Git Directory for Transform1

Description: pull the Python code file transform1.py from the Git directory

Input parameters

| Parameter Name   | Data Type | Operation Type | Value |
| ---------------- | --------- | -------------- | ----- |
| data_source_name | String    | Declaration    | Name of the registered Git data source |
| branch           | String    | Declaration    | master |
| project          | String    | Declaration    | workspace1 |
| paths            | List      | Declaration    | ["workspace1/kmmlds/transform1.py"] |

Output parameters

| Parameter Name | Data Type |
| -------------- | --------- |
| workspace      | Directory |
| paths          | List      |

A sample of the operator configuration is given as follows:

../_images/git_directory_1.png

Python Operator

Name: Transform1

Description: format the input file and pass it as the input of the ParallelFor operator. The query output of the Hive operator has the format [[], [["abcde0001"], ["cgnwf0046"]]], which the ParallelFor operator cannot use directly; the Python operator converts it to ["abcde0001", "cgnwf0046"].
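
The following is a minimal sketch of what transform1.py could look like. The exact I/O contract between the Python operator and the script is not shown in this unit, so the command-line flags below (--list_data, --output_list) are assumptions used only to illustrate the conversion:

    import argparse
    import json

    def flatten(resultset):
        # Hive resultset format: [[], [["abcde0001"], ["cgnwf0046"]]]
        # Converted format:      ["abcde0001", "cgnwf0046"]
        rows = resultset[1] if len(resultset) > 1 else []
        return [row[0] for row in rows if row]

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        # Hypothetical flags, named after the operator's input and
        # output parameters for readability.
        parser.add_argument("--list_data")    # path to the Hive.resultset file
        parser.add_argument("--output_list")  # path to write output_list to
        args = parser.parse_args()

        with open(args.list_data) as f:
            resultset = json.load(f)

        with open(args.output_list, "w") as f:
            json.dump(flatten(resultset), f)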

Input parameters

| Parameter Name         | Data Type | Operation Type | Value |
| ---------------------- | --------- | -------------- | ----- |
| workspace              | Directory | Reference      | Git Directory for Transform1.workspace |
| entrypoint             | String    | Declaration    | workspace1/kmmlds/transform1.py |
| requirements_file_path | String    | Declaration    |  |
| list_data              | File      | Reference      | Hive.resultset |

Output parameters

| Parameter Name | Data Type |
| -------------- | --------- |
| output_list    | List      |

A sample of the operator configuration is given as follows:

../_images/python_transform_1.png

ParallelFor Operator

Name: Loop for masterid

Description: orchestrate the pipeline in the sub-canvas to process each site

Input parameters

| Parameter Name | Operation Type | Value |
| -------------- | -------------- | ----- |
| item           | Reference      | Transform1.output_list |

A sample of the operator configuration is given as follows:

../_images/parallel_for.png
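
Conceptually, the ParallelFor operator binds each element of Transform1.output_list to the loop variable item and runs the sub-canvas pipeline once per element. The following sequential sketch illustrates the loop only; train_site is a hypothetical stand-in for the sub-canvas:

    def train_site(masterid: str) -> None:
        # Hypothetical stand-in for the sub-canvas pipeline that trains the
        # power generation prediction model for one site.
        print(f"training model for site {masterid}")

    output_list = ["abcde0001", "cgnwf0046"]  # from Transform1.output_list
    for item in output_list:  # the platform runs the iterations in parallel
        train_site(item)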