metadata platform
This table acts as a registry for all processes (jobs) that run on the cluster. A process could be a file-to-table process, a semantic process, or an export process. Each process is registered using a designated procedure API and given a unique process identifier.
process_id(PK) : This is an auto generated process id
description : Self descriptive
add_ts : Self descriptive
process_name : Meaningful process name without space
bus_domain_id : Application in which a process belongs
process_type_id : If it’s a file to table / semantic / export job
parent_process_id : Each step in a process is a sub-process, and all sub-processes are also registered here. This is null for a main process; sub-process entries hold the main process's id.
can_recover : Applicable to sub-processes only. Indicates whether the process can be restarted from this sub-process when the previous run failed at the following sub-process.
enqueuing_process_id(PK) : Applicable for sub-processes only and will have the upstream process id for these sub-processes. The input batches for this process are added by the enqueuing_process. The enqueuing process must be a main process (parent process). If a process has no enqueuing process, then the entry for this column would be 0, instead of null (since it’s a primary key, it is not nullable).
batch_cut_pattern : Applicable for sub-processes only. It defines the limit on the batch intake for each sub process while beginning a parent process. Pattern could be like EOD-IN that means if a batch is marked with EOD-IN then it’ll cut to that batch. A sample marking of an EOD batch could be EOD-IN@2014-04-25.
next_process_id : Gives a list of sub-processes that are executed immediately after the current sub-process is executed as defined in the workflow. For a parent process, this is defined as the first sub process that starts and for the last sub-process(es) in the workflow it is defined as the parent process itself.
delete_flag : This is to mark the process as deleted. '0' for not deleted and '1' for deleted.
workflow_id : This represents the workflow type to be deployed. The following ids are used:
- 0 : Only for steps
- 1 : Stand alone
- 2 : Oozie
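The field list above can be sketched as a minimal schema. This is an illustrative sketch only, using SQLite and assumed column types (the real platform's DDL may differ); it shows a main process being registered followed by a sub-process pointing back at it via parent_process_id.

```python
import sqlite3

# Minimal sketch of the PROCESS table described above (column types are
# assumptions). Demonstrates main-process vs. sub-process registration.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE process (
    process_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    description          TEXT,
    add_ts               TEXT DEFAULT CURRENT_TIMESTAMP,
    process_name         TEXT NOT NULL,
    bus_domain_id        INTEGER,
    process_type_id      INTEGER,
    parent_process_id    INTEGER,            -- NULL for a main process
    can_recover          INTEGER DEFAULT 0,
    enqueuing_process_id INTEGER DEFAULT 0,  -- 0 when there is no enqueuing process
    batch_cut_pattern    TEXT,
    next_process_id      TEXT,
    delete_flag          INTEGER DEFAULT 0,
    workflow_id          INTEGER
)""")

# Register a main process, then a sub-process pointing back at it.
cur = conn.execute(
    "INSERT INTO process (description, process_name, bus_domain_id, "
    "process_type_id, workflow_id) VALUES (?, ?, ?, ?, ?)",
    ("Hive query workflow", "Hive_Workflow", 1, 1, 2))
parent_id = cur.lastrowid
conn.execute(
    "INSERT INTO process (description, process_name, bus_domain_id, "
    "process_type_id, parent_process_id, can_recover, workflow_id) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Hive query step 1", "hive_query_1", 1, 1, parent_id, 1, 0))
subs = conn.execute(
    "SELECT process_id FROM process WHERE parent_process_id = ?",
    (parent_id,)).fetchall()
print(len(subs))  # 1 sub-process registered under the parent
```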
Every time a process runs, it is assigned a new auto-generated run id, which is inserted as a row with other information in the INSTANCE_EXEC table. This run id is used as an ETL partition in the Hive tables where the data is loaded.
instance_exec_id(PK) : This is an auto generated run id
process_id : The main process that ran
start_ts : Self explanatory
end_ts : Self explanatory
run_state : State indicator (RUNNING , FINISHED , ERRORED OUT etc)
Every time a process runs, it may consume batches through its sub-processes. Those sub-processes consume batches enqueued by their enqueuing process(es). A sub-process can have one or more enqueuing processes, though most have only one. This means a sub-process can consume delta batches from different tables that are independently populated with different batches (usually by different file-to-table interfaces). A row is added to this table for an eligible sub-process by its enqueuing process (always a parent process) when the enqueuing process completes successfully. If multiple sub-processes share the same enqueuing process, then a row is added for each of them with the same batch id but a different process id. The target batch id is initially empty and is populated with an auto-generated batch id when the process given by process_id accesses the corresponding batch. When all batches for the parent process have been processed successfully, the enqueued batches are moved to the ARCHIVE_CONSUMP_QUEUE table.
process_id : The sub-process for which its enqueuing process has enqueued the batches.
source_batch_id : The batch added by the enqueuing process. It is the output batch of the enqueuing process and the input batch of the current process.
target_batch_id : The output batch id for the parent aggregate process.
queue_id : An auto generated primary key.
insert_ts : The time stamp when the row was added.
source_process_id : The enqueuing process
start_ts : When the batch was first picked up. Gets updated with the new pick-up time on re-runs of the process (if the previous attempt failed).
end_ts : When the batch was successfully processed or failed.
batch_state : If the batch is a fresh batch, or previously accessed batch etc.
batch_marking : Metadata associated with each batch. It’s actually the comma separated date values in YYYY-MM-DD format if there is a date associated in the batch. Example 2014-06-15,2014-06-17.
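The queue-row lifecycle described above can be sketched in a few lines. This is an illustrative in-memory model, not the actual implementation; field names follow the column list, and the state values are assumptions.

```python
# Sketch of the BATCH_CONSUMP_QUEUE lifecycle: an enqueuing process adds a
# row with a source batch, the consuming process later fills in the target
# batch, and processed rows are moved to the archive on success.
batch_consump_queue = []
archive_consump_queue = []

def enqueue(process_id, source_process_id, source_batch_id, batch_marking):
    batch_consump_queue.append({
        "process_id": process_id,
        "source_process_id": source_process_id,
        "source_batch_id": source_batch_id,
        "target_batch_id": None,   # filled in when the batch is accessed
        "batch_marking": batch_marking,
        "batch_state": "NEW",
    })

def access(process_id, target_batch_id):
    # the consuming process picks up its pending batches
    for row in batch_consump_queue:
        if row["process_id"] == process_id and row["target_batch_id"] is None:
            row["target_batch_id"] = target_batch_id
            row["batch_state"] = "ACCESSED"

def archive(process_id):
    # on successful parent completion, move processed rows to the archive
    done = [r for r in batch_consump_queue if r["process_id"] == process_id]
    archive_consump_queue.extend(done)
    batch_consump_queue[:] = [r for r in batch_consump_queue if r not in done]

enqueue(125, 85, source_batch_id=11, batch_marking="2014-12-16")
access(125, target_batch_id=42)
archive(125)
print(len(archive_consump_queue))  # 1
```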
A replica of BATCH_CONSUMP_QUEUE. When batches are successfully consumed and the parent process exits successfully, the records are moved to this table.
This table is responsible for generating a batch id when a new batch has to be created. Each process, when executed, creates a batch upon completion. This batch id links the files of upstream and downstream processes, by being registered in the batch consump queue table.
batch_id : An auto-generated id assigned to each batch when it is registered in the batch table.
source_instance_exec_id : instance_exec_id of the parent process which generated this batch.
batch_type : a field associated to a batch that tells about what type of batch it is.
If a process is a file-to-table interface, the file table is queried for the file location. The file path and server id are registered in this table, along with other file details, once the upstream process completes successfully, so the downstream process can access the file by querying the file table using the batch id.
batch_id : Id of the batch that contains the file under consideration.
server_id : Id of the server that has the file.
path : file path.
file_size : Self-explanatory.
file_hash : Self-explanatory.
creation_ts : Timestamp at which the file was created.
This table registers the DDLs required to create the tables and views in Hive for Data Load operations. An auto-increment table id is generated, which can be used to link these DDLs to a Data Load process.
table_id : An auto-generated primary key.
comments : Self-explanatory.
location_type : Location type of Hive table, internal or external.
dbname : Name of database in which tables are created.
batch_id_partition_col : Batch id based on which partition is made.
table_name : Name of the table.
type : Type of Hive Table, i.e. 'Table' or 'View'.
ddl : Data Definition Language for the Hive table.
This table registers properties as key-value pairs, grouped by a configuration group, against a sub-process that requires them for execution. Processes like data generation and semantic use this table to define the parameters required for their execution.
process_id : The process id of the process whose properties are defined.
config_group : The configuration group, which may depend on the nature of the process (import, dq, etc.).
key : The property name.
value : The value of the property named in the key column.
description : Self-explanatory.
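Looking up properties for a process always involves the (process_id, config_group) pair. A minimal sketch of that lookup, with made-up table contents and group names:

```python
# Rows mimic the properties table: (process_id, config_group, key, value,
# description). Contents here are illustrative, not real platform config.
properties = [
    (101, "import", "source.dir",   "/data/in", "input directory"),
    (101, "import", "file.pattern", "*.csv",    "files to pick up"),
    (101, "dq",     "null.check",   "true",     "enable null checks"),
]

def get_properties(process_id, config_group):
    # return the key/value pairs for one process and configuration group
    return {k: v for pid, grp, k, v, _ in properties
            if pid == process_id and grp == config_group}

print(get_properties(101, "import"))
```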
This table is used by Data Load Process to link the DDLs with the process by registering the table Ids generated in the HIVE_TABLES.
etl_process_id : Parent process id; Also the primary key for this table.
raw_table_id : Id of the raw table created.
base_table_id : Id of the base table created.
insert_type : Self-explanatory.
drop_raw : Could be either 0 or 1; 0 by default.
raw_view_id : Id of the view created.
The servers table registers server details like name, login user, password, SSH private key, IP address, etc., against an auto-incremented server id, which can be referenced by a process to make an entry in the file table.
server_id : Auto generated primary key for the table.
server_type : Self-explanatory.
server_name : Self-explanatory.
server_metainfo : Additional server information that doesn't fit into the other table fields.
login_user : username for connecting to the server.
login_password : password for connecting to the server.
ssh_private_key : The key used to connect to the server.
server_ip : IP address of the server.
Multiple processes in a given subject area are grouped into a business domain. Example: a Security application can contain different security-related processes.
Specific logs are registered in the process log table with the respective process id and instance exec id (as instance_ref) after execution, for example, logs of the Data Quality process and the Import process. These logs can be referenced later for further analysis.
log_id : Unique id for each log to be registered.
add_ts : Self descriptive.
process_id : The process_id the log relates to.
log_category : The log category according to its nature, like DQ, Import, etc.
message_id : The specific keys to be recorded related to the process.
message : The value of the key specified in message_id.
instance_ref : The instance_exec_id of the process whose log is registered.
A table which contains a standard template of a process, from which more such processes can be created or existing ones edited.
process_template_id : Unique id for each process template. Also a foreign key in process table.
description : Self descriptive.
add_ts : Self descriptive.
process_name : Name of the template.
bus_domain_id : Business domain to which the template belongs.
process_type_id : Type of the process template.
parent_process_id : Parent process id of the template.
can_recover : Whether a process created from this template can recover from failure in the next run.
batch_cut_pattern : Batch cut pattern of the resultant process. Refer to the process table's definition for more explanation.
next_process_template_id : Next process to run from template in the workflow.
delete_flag : Whether a process template is deleted or not.
workflow_id : Workflow type, whether oozie or stand alone.
A table which contains properties for any associated process template.
process_template_id : The template id of the process template whose properties have to be defined.
config_group : The configuration group, which may depend on the nature of the process (import, dq, etc.).
key : The property name.
value : The value of the property named in the key column.
description : Self-explanatory.
This table contains information about the hive/pig query for which column level lineage is to be shown.
query_id : Primary key in the table for each query execution.
query_string : Query string executed.
query_type_id : Hive or pig.
create_ts : Self-explanatory.
process_id : Process id of the process that runs the query (always a sub-process).
instance_exec_id : Execution id of the process that runs the query. Different for multiple runs of same query.
This table contains information about the different nodes involved in a query (table, column, function, etc.).
node_id : Primary key in the table for each node.
node_type_id : Type of each node.
container_node_id : Node containing the current node (for example, a table is the container of its columns).
node_order : Order of nodes.
insert_ts : Self-explanatory.
update_ts : Self-explanatory.
dot_string : Dot string (for visual representation) of the node.
dot_label : Dot label.
display_name : Display name.
This table contains information about relations between different columns involved in a query.
relation_id : Primary key in the table for each relation.
src_node_id : Origin node of a relation.
target_node_id : Target node of a relation.
query_id : Corresponding query Id of the relation.
dot_string : Dot string (for visual representation) of the relation.
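How the dot_string fields could be assembled into a Graphviz DOT graph for column-level lineage can be sketched as follows. The node ids, labels, and exact DOT text are assumptions for illustration; the real lineage module's output may differ.

```python
# Build a DOT graph from lineage nodes and relations, mirroring the
# node/relation tables above. Example data is made up.
nodes = {1: "orders.amount", 2: "daily_totals.total"}
relations = [(1, 2)]  # (src_node_id, target_node_id)

def to_dot(nodes, relations):
    lines = ["digraph lineage {"]
    for node_id, label in nodes.items():
        lines.append(f'  n{node_id} [label="{label}"];')   # one node per row
    for src, dst in relations:
        lines.append(f"  n{src} -> n{dst};")               # one edge per relation
    lines.append("}")
    return "\n".join(lines)

print(to_dot(nodes, relations))
```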
Other master tables
This table is a master table which defines the status of batches generated by processes. Depending on the state of a batch, it can be moved from the batch_consump_queue table to the archive_consump_queue table.
batch_state_id : Primary key identifying the batch state: 1, 2, or 3.
description : Batch state description: new, accessed, or failed.
This is another master table, which defines the execution status of processes. The execution states 1, 2, 3, 4, 5, and 6 represent not running, running, success, paused, suspended, and failed. Depending on this status, the Init Job, Init Step, Halt Job, Halt Step, Term Job, and Term Step can be initiated.
exec_state_id : Primary key associated with an execution state: 1, 2, 3, 4, 5, or 6.
description : Execution state description: not running, running, success, paused, suspended, or failed.
This is also a master table, defining the process type, e.g. semantic, hive data generation, etc. Each type is assigned a process type id. These ids are referred to by the workflow generation module to generate the workflow XML file for Oozie and the dot file for workflow visualization.
process_type_id : Id of the process type, i.e. 1, 2, 3, or 4.
process_type_name : Process type name, like File load, Semantic, Export, Import.
parent_process_type : The process_type_id of the parent process; null for a parent process.
Another master table, which defines the workflow type and assigns a workflow type id to each. It has 3 entries: steps, stand alone, and oozie.
workflow_id : Id of the workflow type, i.e. 0, 1, or 2.
workflow_type_name : Workflow type name: 0 for steps only, 1 for stand alone, and 2 for oozie.
This table registers the lineage query types referenced by the tables of the Lineage module. The query type can be pig, hive, etc.
query_type_id : Id of the query type, i.e. 1 or 2.
query_type_name : Query type name: 1 for hive, 2 for pig.
This table defines the lineage node type, which can be a table, column, function, etc. It is used by the lineage module to generate the lineage graphic.
node_type_id : Id of the node type, i.e. 1, 2, 3, 4, or 5.
node_type_name : Node type name: 1 for table, 2 for column, 3 for function, 4 for temp_table, 5 for idle_column.
This table registers the users with their passwords for authentication and performing actions.
![users](https://gitlab.com/bdre/documentation/uploads/4f5a4b9e7294d2c96d508e7107fb661e/users.PNG)
username : Username for bdre UI, drools UI.
password : Password for bdre UI, drools UI.
enabled : True or False.
This table defines the user role for the different users registered in the USERS table.
![user_roles](https://gitlab.com/bdre/documentation/uploads/fbc6fd30d66d05acae345ede95356519/user_roles.PNG)
user_role_id : Primary key.
username : Username defined in users table.
ROLE : admin, developer or user.
Function : Enables adding a new application to the bus_domain table.
Inputs :
description : A brief description of the application being added, e.g. crawled Twitter, Facebook, and LinkedIn data for an application 'Social Media Content'.
bus_domain_name : Name of the application
bus_domain_owner : Name of the application owner
Execution :
On calling the procedure, the inputs get added as a new row entry in the table along with an auto-incremented application id.
Validation :
None
Outputs :
None
Inputs :
description : Description about the process.
process_name : Name of the process to be added to the table.
application_id : Id of the application to which the process belongs.
process_type_id : Id of the process type , example, ‘1’ for ‘File to table’ process type.
parent_process_id : Applicable to sub-processes; null otherwise.
can_recover : Applicable only to sub-processes. 'True' if the process can be restarted from this sub-process when the previous execution failed at the following process.
enqueuing_process_id : Applicable only for sub-processes.
Validation : None
Execution :
On calling the procedure, the inputs get added as a new row entry in the table along with an auto incremented process_id.
Outputs : None
Inputs :
source_batch_id : Id of the batch added by the enqueuing process. It is the output batch of the enqueuing process and input batch for the current process.
source_process_id : Id of the process from which the batch comes as output.
batch_marking : Metadata marking associated with the batch.
process_id : Id of the current process to which the batch goes as input.
Validation :
source_process_id has to be a valid process_id in the Process table
Execution :
A row gets added in the table for every batch coming as an output from an upstream process and is ready to get executed by a downstream process.
insert_ts gets default values as provided by CURRENT TIMESTAMP.
Outputs : None
Starts a parent process
Input
IN_PID : Process id.
IN_MAX_BATCH : Max number of batches to return (in our case the application, say an Oozie or shell-script workflow.xml, supplies this number; it is not stored in a table).
Validation :
Check if the process id is a valid parent process. Throw error otherwise.
Check that the process isn’t already running, i.e. there should not be any run id for this process in the running state.
Check none of the sub-processes are in running state.
If any of the sub processes with non-null enqueuing_process_id has batch_cut_pattern as null, then max batch cannot be null.
If none of the sub processes with non-null enqueuing_process_id has batch_cut_pattern as null, then at least one batch in the PROCESS_BATCH_QUEUE table must match with the batch-cut-pattern for each of the sub processes.
Execution
Check if there is any instance of failure of this process in PROCESS_BATCH_QUEUE table.
If there is no such instance, create a target batch id and assign it to its sub processes (with non-null enqueuing processes) as follows:
Add a row in BATCH table and auto generate a batch_id.
Update the PROCESS_BATCH_QUEUE table with the above batch_id as target_batch_id where parent_process_id = process id. Cut at the batch where batch_marking matches batch_cut_pattern for each sub-process. For sub-processes with no batch_cut_pattern, update the target_batch_id for 'max_batch' number of batches.
Set batch status to a value corresponding to ACCESSED state.
Else, if one of the sub processes has been detected as FAILED (meaning target batch id is non null), change the BATCH_STATUS to a value corresponding to accessed state. This will make sure the failed batches are run again by Init Job.
Output
The InitJob procedure returns a selection of rows, which have the data from the PROCESS_BATCH_QUEUE table and other data like last_recoverable_sp_id, instance_exec_id. This selection of rows is parsed in the corresponding Java API to obtain the following outputs.
Minimum and maximum batch ids for each sub process.
Minimum and maximum batch dates for each sub process.
Target batch marking, which will be used as an input parameter for HaltJob procedure to set as the batch marking of all the downstream processes having this parent process as an enqueuing process. A target batch marking is a collection of dates of all the batches involved in the present process.
Process exec id corresponding to the present execution of the process.
Target batch id. This is the batch resulting from the successful execution of the present process.
Last recoverable sub process id. In case of a previously failed instance of this parent process (due to failure of one of the sub processes), the InitJob starts its execution from the last recoverable sub process id, instead of running the successfully completed sub processes again.
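The batch-cut step of InitJob described above can be sketched in Python. This is a simplified in-memory model under stated assumptions (row shapes, state names, and the marking-match rule are illustrative), not the stored procedure itself.

```python
# Assign a target batch id to enqueued batches, cutting at the batch whose
# marking matches the sub-process's batch_cut_pattern, or taking at most
# max_batch batches when no pattern is defined.
def cut_batches(queue_rows, batch_cut_pattern, max_batch, target_batch_id):
    """queue_rows: list of dicts, each with at least a 'batch_marking' key."""
    taken = []
    for row in queue_rows:
        row["target_batch_id"] = target_batch_id
        row["batch_state"] = "ACCESSED"
        taken.append(row)
        marking = row.get("batch_marking") or ""
        if batch_cut_pattern and marking.startswith(batch_cut_pattern):
            break  # cut at the batch carrying e.g. EOD-IN@2014-04-25
        if batch_cut_pattern is None and len(taken) >= max_batch:
            break  # no pattern: take at most max_batch batches
    return taken

rows = [{"batch_marking": "2014-12-16"},
        {"batch_marking": "EOD-IN@2014-12-16"},
        {"batch_marking": "2014-12-17"}]
taken = cut_batches(rows, "EOD-IN", 20, target_batch_id=42)
print(len(taken))  # cuts at the EOD-IN batch, so 2 batches are taken
```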
Marks completion of a parent process.
Input
IN_PID
IN_BATCH_MARKING
Validation
Check if the process id is a valid semantic parent process. Throw error otherwise.
Throw error if the parent process is not in RUNNING state.
Throw error if any of the sub-processes are in RUNNING/FAILED state.
Execution
Mark batch_state to PROCESSED in PROCESS_BATCH_QUEUE.
Mark run_state as complete in INSTANCE_EXEC and also populate end_ts.
Enqueue one row for every process that has this process as its enqueuing process, and mark them per IN_BATCH_MARKING. Use the target_batch_id from PROCESS_BATCH_QUEUE as the source_batch_id, and the target batch marking from the InitJob procedure as the batch marking for these newly enqueued batches.
Move all processed records from PROCESS_BATCH_QUEUE to HIST_PROCESS_BATCH queue.
Output None identified
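The enqueue step of HaltJob can be sketched as follows. Process ids, field names, and the example marking are illustrative assumptions; the sketch only shows the fan-out to downstream processes whose enqueuing process is the completing parent.

```python
# Every process whose enqueuing process is the completing parent gets a
# fresh queue row, with the parent's target batch as the new source batch.
process_table = [
    {"process_id": 130, "enqueuing_process_id": 123},
    {"process_id": 131, "enqueuing_process_id": 123},
    {"process_id": 132, "enqueuing_process_id": 99},   # different upstream
]

def enqueue_downstream(parent_id, target_batch_id, batch_marking, queue):
    for p in process_table:
        if p["enqueuing_process_id"] == parent_id:
            queue.append({"process_id": p["process_id"],
                          "source_process_id": parent_id,
                          "source_batch_id": target_batch_id,
                          "batch_marking": batch_marking})

queue = []
enqueue_downstream(123, target_batch_id=42,
                   batch_marking="2014-12-16,EOD@2014-12-16", queue=queue)
print(len(queue))  # two downstream processes enqueued
```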
Records failure of a process
Input
IN_PID
Validation
Throw error if the process is not a valid semantic parent process.
Throw error if the process is not in RUNNING state.
Throw error if any of sub-process is in RUNNING state as per INSTANCE_EXEC.
Execution
Mark the row in the INSTANCE_EXEC for this sub-process as FAILED
Update the PROCESS_BATCH_QUEUE table for this process id and set batch status to a value corresponding to FAILURE. These batches are NOT moved to the ARCHIVE_CONSUMP_QUEUE.
Output None identified
Marks start of a sub-process
Input
IN_SUB_PID
Validation
Throw error if the process is not a valid sub-process.
Throw error if the parent process is not in RUNNING state.
Throw error if this sub-process is in RUNNING state as per INSTANCE_EXEC.
Execution
Add a row in the INSTANCE_EXEC for this sub-process
Update the PROCESS_BATCH_QUEUE table for this process id and set start_ts with current system TS.
Output
OUT_SUB_INSTANCE_EXEC_ID
Marks end of a sub-process
Input
IN_SUB_PID
Validation
Throw error if the process is not a valid sub-process.
Throw error if the parent process is not in RUNNING state.
Throw error if this sub-process is not in RUNNING state as per INSTANCE_EXEC.
Execution
Mark the row in the INSTANCE_EXEC for this sub-process as COMPLETED.
Mark with end_ts (as system TS) in PROCESS_BATCH_QUEUE.
Output None identified
Records failure of a sub-process
Input
IN_SUB_PID
Validation
Throw error if the process is not a valid sub-process.
Throw error if the parent process is not in RUNNING state.
Throw error if this sub-process is not in RUNNING state as per INSTANCE_EXEC.
Execution
Mark the row in the INSTANCE_EXEC for this sub-process as FAILED.
Output
None identified
Records registration of a new file and the associated file details
Input :
IN_BATCH_ID
IN_SERVER_ID
IN_PATH
IN_FILE_SIZE
IN_FILE_HASH
IN_CREATION_TS
Execution:
Add a row in the File table with the file details as provided through the parameters.
Output : None identified
Function : Enables adding a new batch status to the batch status table.
On calling the procedure, the inputs get added as a new row entry in the table along with batch_state_id.
Fetches file records from a join of the file and servers tables between the min and max batch ids passed as parameters.
Fetches records from the properties table for the process id and config group passed as parameters.
Fetches file records from the file table between the min and max batch ids passed as parameters.
Returns a string of file details separated by semicolons.
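The semicolon-joined file list can be sketched as below. The record layout (which fields appear, and in what order) is an assumption for illustration, not the procedure's exact output format.

```python
# Join file-table rows into one semicolon-separated string, one
# comma-separated record per file. Example rows are made up.
files = [
    {"server_id": 1, "path": "/data/in/a.csv", "file_size": 1024,
     "file_hash": "abc123", "batch_id": 11},
    {"server_id": 1, "path": "/data/in/b.csv", "file_size": 2048,
     "file_hash": "def456", "batch_id": 12},
]

def file_list_string(rows):
    # one "server,path,size,hash,batch" record per file, joined by ';'
    return ";".join(
        f'{r["server_id"]},{r["path"]},{r["file_size"]},{r["file_hash"]},{r["batch_id"]}'
        for r in rows)

print(file_list_string(files))
```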
Function : Enables adding a new process log to the process log table.
On calling the procedure, the inputs get added as a new row entry in the table along with an auto incremented log_id.
Function : Enables adding a batch in the batch table, then a file entry in the file table, and finally an entry in the batch consump queue.
Add a row in the Batch table, File table, and BatchConsumpQueue table, with the details provided through the parameters.
Function : Checks the presence of the required batches for initiation of the process passed as a parameter. This procedure is used in the Init Job.
Our metadata framework consists of applications, under which come processes, and under each process one or several sub-processes. These processes are registered in the process table. We test different cases for:
InitJob: Running a valid parent process registered in the process table.
InitStep: Running a valid sub-process registered in the process table.
HaltStep: Ending a valid sub-process registered in the process table.
HaltJob: Ending a valid parent process registered in the process table.
Case 1: This is a case where:
- The parent process id is a valid entry in the process table.
- The entered parent process is not already running.
- It does not have running sub-processes.
- All sub-processes of the entered parent process have their necessary batches enqueued.
Preparation
Insert into bus_domain table one row corresponding to a new application. The associated query is :
insert into bus_domain (description, bus_domain_name, bus_domain_owner) values ('CrawledSocialMediaContent','Social_Media','Raja');
Insert into Process table one row corresponding to the parent process with process id = 123. This is the parent process we are beginning now. The associated query is :
insert into `process` VALUES (123,'This is a hive query workflow','2014-11-25 08:06:27','Hive_Workflow',1,1,null,0,0,null,'124',0,2);
Insert into the Process table one row corresponding to the parent process with process id = 85. This is the parent process (85) which enqueues batches to the sub-processes of the present parent process (123). The associated query is :
insert into `process` VALUES (85,'This is a parent process','2014-11-25 08:06:27','transfer1',1,1,null,0,0,null,'85',0,2);
Insert 2 sub-processes for the parent process by :
insert into `process` VALUES (124,'This is a hive query 1','2014-11-25 09:06:27','hive query 1',1,1,123,1,0,null,'125',0,0);
insert into `process` VALUES (125,'This is a hive query 2','2014-11-25 10:06:27','hive query 2',1,1,123,1,85,'EOD','123',0,0);
We need a few batches to get started with the InitJob. Hence we add a few rows in the batch table and queue them in the process_batch_queue table. The associated queries are:
Steps
Call the procedure
call InitJob(123, 20);
Output
Returns the following outputs
processRunId: would be an auto generated id
lastRecoverableSpId: not applicable in this case
targetBatchId: auto-generated batch id.
min-batch-id-map.125=11
max-batch-id-map.125=33
min-batch-marking-map.125=2014-12-16
max-batch-marking-map.125=2014-12-16
target-batch-marking=2014-12-16,EOD@2014-12-16
Case 2:
This is a case where the entered parent process id is not a valid entry in the process table.
PREPARATION
The preparation described in Case 1 suffices
STEPS
Call the procedure
call InitJob(200,1);
Output: It’ll throw an error “Not a valid parent process!”
Case 1: When parent process is running, sub-process is a valid process entry in the process table and sub-process is not already running. This is a normal case, and is expected to pass all the validation conditions. Here, we are running InitStep for the sub process whose process id is 124 and parent process id = 123.
PREPARATION:
If InitJob not already called,
Follow the same steps of preparation as in InitJob. The parent process, i.e. 123 should be running. This can be done by call InitJob(123,20) as done in the previous test.
STEPS :
Call the procedure:
call InitStep(124,@a);
EXPECTED OUTPUT: Returns an auto-generated sub process run id for the sub process id = 124
Case 2 : When parent process is not in running state. This case fails to pass a validation condition.
Preparation :
The preparation described in Case 1 suffices. Leave out the part where we call the InitJob, to make sure the parent process is not running.
STEPS :
Call the procedure:
call InitStep(124,@a);
EXPECTED OUTPUT:
Gives error message ‘Parent process is not running!’
This procedure ends a sub-process using the sub-process-id that is entered.
Case 1:This is a normal case passing all validations, where the sub-process to be ended is already running and its parent process is running.
PREPARATION
If InitJob not already called,
Follow the steps as mentioned in Case 1 for InitJob stored procedure. Move to the next step if InitJob has already been called and HaltJob has not been called for the parent process.
If InitStep not already called,
Follow the steps as mentioned in Case 1 for the InitStep stored procedure. Move to the next step if InitStep has already been called and HaltStep has not been called for this sub-process.
Steps
call HaltStep(124);
Outputs None identified
Case 2: This is a case where parent process of sub process to be ended is running. But the sub process itself isn’t running.
Preparation
If InitJob not already called,
Follow the steps as mentioned in Case 1 for InitJob stored procedure.
Steps
call HaltStep(124);
Outputs: Gives error message ‘Sub process is not running’.
This stored procedure accepts a parent process id as input and attempts to end the same.
Case 1: A normal case where the process itself is running and its sub-processes have run and ended.
Preparation
If InitJob not already called,
Follow the steps as mentioned in Case 1 for InitJob stored procedure. Move to the next step if InitJob has already been called and HaltJob has not been called for the parent process.
If InitStep not already called,
Follow the steps as mentioned in Case 1 for InitStep stored procedure. Move to the next step if InitStep has already been called for its sub processes (124,125) and HaltStep has not been called for the same.
If HaltStep not already called,
Follow the steps as mentioned in Case 1 for HaltStep stored procedure. Move to the next step if HaltStep has already been called for all sub processes under the current parent process.
Steps
call HaltJob(123, 'sample_batch_marking');
Output: Process ends successfully with no identified output. All the processed batches of sub process 125 are moved to the archive_consump_queue table.
Error code | Error message |
---|---|
101 | Max batch cannot be null |
102 | Invalid parent process |
103 | Invalid sub-process |
104 | Parent Process is already running |
105 | The process is not running |
106 | This sub process is already running |
107 | This sub process is not running |
108 | The parent process is not running |
109 | Not all necessary batches are present for the sub process |
110 | Sub process in running state |
111 | Sub process in failed state |
112 | No batches present for one of the sub processes |
113 | Min batch and max batch do not belong to the same sub process |
114 | Batch not present in the process_batch_queue table |
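The error table above, expressed as a lookup a client might use when handling procedure failures (a sketch; the framework's actual error handling may differ):

```python
# Error code -> message mapping, copied from the table above.
BDRE_ERRORS = {
    101: "Max batch cannot be null",
    102: "Invalid parent process",
    103: "Invalid sub-process",
    104: "Parent Process is already running",
    105: "The process is not running",
    106: "This sub process is already running",
    107: "This sub process is not running",
    108: "The parent process is not running",
    109: "Not all necessary batches are present for the sub process",
    110: "Sub process in running state",
    111: "Sub process in failed state",
    112: "No batches present for one of the sub processes",
    113: "Min batch and max batch do not belong to the same sub process",
    114: "Batch not present in the process_batch_queue table",
}

def error_message(code):
    # fall back to a generic message for codes not in the table
    return BDRE_ERRORS.get(code, f"Unknown error code {code}")

print(error_message(102))  # Invalid parent process
```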