bdre_handbook

Bigdata Ready Enterprise

Goal of BDRE is to give Bigdata implementation a significant acceleration by supplying the essential frameworks which are most likely to be written anyway. It'll drastically eliminate hundreds of man hours of effort in operational framework development.

BDRE currently implemented:

Operational Metadata Management
Registry of all workflow processes/templates
Parameters/configuration(key/val) for processes
Dependency information (upstream/downstream)
Batch management/tracking. Batch concept in BDRE is for tracking the data flow between workflow processes.
Run control (for delta processing/dependency check)
Execution status for jobs(dynamic metadata - with step level granularity)
File registry - can be used to register e.g. ingested files or a raw file as an output of an upstream.
Execution statistics logging (key/value)
Executed hive queries and data lineage information.
Java APIs that ingrates with Bigdata with non-bigdata applications alike.
Job monitoring and proactive/reactive alerting
Data ingestion framework
Tabular data from RDBMS
Streaming data from 16 types of sources (including logs, message queues and Twitter)
Arbitrary file ingestion by directory monitoring
Web Crawler
Distributed Data Manufacturing framework
Generate billons of records based on patterns and ranges
Semantic Layer Building Framework
Build the sematic layer using visual workflow creator using the data you ingested.
Supports Hive, Pig, Mapreduce, Spark , R etc
Generates Oozie workflows
Data Quality Framework
Validates your data using your rules in a distributed way
Integrated with Drools rule engine
HTML5 User Interface
Create ingestion, data generation, Crawler jobs or create Oozie workflows graphically without writing any code
One click deploy and execute jobs without SSH into the edge node.

Architecture

Above diagram explains how BDRE can be implemented in a production environment.

A Hadoop Cluster
A supported RDBMS database(Oracle, MySQL, PostgreSQL) to store BDRE backend information and operational metadata.
BDRE UI and RestAPI run in any J2EE App server(e.g. Tomcat) and provides access the metadata stored in RDBMS database. It typically installs in the existing edge node of the cluster so additional server is not usually needed to run BDRE.
For authentication the UI uses standard JAAS which AS IS provides DB authenticates table but easily changed to use AD or LDAP instead.

BDRE Features

BDRE Core Features Suggested for Implementation

Operational Metadata Management

BDRE provides complete job/operational metadata management solution for Hadoop. At its core acts as a registry and tracker for different types of jobs running in different Hadoop clusters or as a standalone. It provides APIs to integrate with virtually any jobs.

BDRE uses RDBMS database to store all job related metadata. A set of stored procedures are there to interface will the tables which are exposed via Java APIs to manage/create/update the static and run time metadata information. Below is the data model for BDRE metadata operational database.

BDRE Metadata Tables

PROCESS

This table will act as a registry for all processes (jobs) that will run on the cluster. The processes could be a file to table process, semantic process or an export process. Each process will be registered using designated procedure API and given a unique process identifier.

INSTANCE_EXEC

Every time a process runs it’s assigned a new auto-generated process id, which gets inserted in a row with other information in INSTANCE_EXEC table. This run id is used as an ETL partition in Hive tables where the data is being loaded.

BATCH_CONSUMP_QUEUE

Every time a process runs it consumes batches through its sub-processes. Those sub-processes consume batches enqueued by its enqueuing process(es). A sub-process can have one, two or more enqueuing processes but majority of them will generally have only one enqueuing process. This means that a sub process can consume delta batches from different tables, which are independently populated with different batches (usually by different file to table interfaces). A row gets added in this table for an eligible sub-process by its enqueuing process(a parent process always) when the enqueuing process completes its execution successfully. If multiple processes (sub) have same enqueuing process then for each process a row gets added with same batch id but with different process ids. The target batch id initially remains empty and gets populated with an auto generated batch id when the process given by the process_id accesses the corresponding batch. When all batches for the parent process have been processed successfully then the enqueued batches are moved to the ARCHIVE_CONSUMP_QUEUE table.

ARCHIVE_CONSUMP_QUEUE

A replica of BATCH_CONSUMP_QUEUE. When batches are successfully consumed and parent process exits successfully, the records get moved to this table.

BATCH

This table is responsible for generating a batch id when a new batch has to be created.Each process when executes creates a batch after its completion.This batch id is used to link between the files for upstream and downstream processes by registering the batch Id in batch consump queue table.

FILE

If a process is a file to table interface then the file table would be queried for file location.The file path and server Id is registered in this table with other file details once the upstream process is successfully completed so the downstream process can access the file by querying the file table using the batch Id.

HIVE TABLES

This table registers the DDLs required to create the tables and views in the Hive for the Data Load operations. An auto increment table Id is generated which can be referred to link these DDLs to a Data Load process.

PROPERTIES

This table registers the properties as key-value pair with a configuration group against a sub-process required for the execution.The processes like data-generation,semantic utilizes this table to define the parameters required for their execution.

ETLDRIVER

This table is used by Data Load Process to link the DDLs with the process by registering the table Ids generated in the HIVE_TABLES.

SERVERS

The servers table registers the server details like name,login user,password,ssh private key, IP address,etc, against an auto incremented server Id which can be referred by the process to make an entry in file table.

BUS_DOMAIN

Multiple process in a given subject area are grouped into a given business domain. Example – Security application can contain different security related processes.

PROCESS_LOG

Specific logs are registered in the process log table with respective process id and instance exec Id as instance_ref after its execution. For example, logs of Data Quality process and Import process. This log can be referenced later for further analysis.

PROCESS_TEMPLATE

A table which contains a standard template of the form process, from which more such processes can be created or existing such processes can be edited.

PROPERTIES_TEMPLATE

A table which contains properties for any associated process template.

LINEAGE_QUERY

This table contains information about the hive/pig query for which column level lineage is to be shown.

LINEAGE_NODE

This table contains information about different nodes involved in a query(table, column, function etc).

LINEAGE_RELATION

This table contains information about relations between different columns involved in a query.

BATCH_STATUS

This table is a master table which defines the status of the batches generated by the processes. Depending on the state of the batch the batch can be moved from batch_consump_queue table to archive_consump_queue table.

EXEC_STATUS

This is another master table which defines the execution status of the processes.The execution state : 1, 2, 3, 4, 5 or 6 represents not running, running, success,paused, suspended or failed. Depending on these status the Init job,Init step,Halt job,Halt step,Term job and Term step can be initiated.

PROCESS_TYPE

This is also a master table which defines the process type. It can be semantic,hive data generation,etc.Each type is assigned a process type Id. These ids are referred in workflow generation module to generate the workflow xml file for oozie and the dot file for workflow visualization.

WORKFLOW_TYPE

Another master table which defines the workflow type and assigns a workflow type Id to each type. This has 3 entries : steps,stand alone and oozie.

LINEAGE_QUERY_TYPE

This table registers the lineage query type which would referenced by different tables of the Lineage module.The query type can be pig,hive etc.

LINEAGE_NODE_TYPE

This table is used to define the lineage node type which can be a table,column,function etc. This would be used by lineage module to generate the lineage graphic.

USERS

This table registers the users with their passwords for authentication and performing actions.

USER_ROLES

This table defines the user role for the different users registered in the USERS table.

Technologies used:

Oracle/MySQL(Database)

Metadata DB APIs

There are quite a lot of APIs that are already developed and working in BDRE. For above use case following important metadata APIs are being explained in more detail as they would be important, as they will be called by the workflows to update job status. The Oozie workflows generated by BDRE(explained below with more details)

InitJob API

Mark start of a workflow/process

The InitJob API returns a selection of rows, which have the data from the PROCESS_BATCH_QUEUE table and other data like last_recoverable_sp_id, instance_exec_id. - This selection of rows is parsed in the corresponding Java API to obtain the following outputs.
Minimum and maximum batch ids enqueued by upstreams for each sub process.
Minimum and maximum batch dates for each sub process.
Optional: Target batch marking, which will be used as an input parameter for HaltJob procedure to set as the batch marking of all the downstream processes having this parent process as an enqueuing process. A target batch marking is a collection of dates of all the batches involved in the present process.
Process execution id corresponding to the present execution of the process.
Target batch id. This is the batch resulting from the successful execution of the present process to be enqueued to its downstreams.
Last recoverable sub process id. In case of a previously failed instance of this parent process (due to failure of one of the sub processes), the InitJob starts its execution from the last recoverable sub process id, instead of running the successfully completed sub processes again.

HaltJob API

Marks completion of a workflow/process .

Mark batch_state to PROCESSED in PROCESS_BATCH_QUEUE.
Mark run_state as complete in INSTANCE_EXEC and also populate end timestamp.
Enqueue one row for all processes that have this process as enqueueing process. Also mark them per IN_BATCH_MARKING. Use the target_batch_id from PROCESS_BATCH_QUEUE as the source_batch_id and target batch marking from the InitJob procedure as the batch marking for these new enqueued batches.
Move all processed records from BATCH_CONSUMP_QUEUE to ARCHIVE_CONSUMP_QUEUE queue.

TermJob API

Records failure of a parent process

Mark the row in the INSTANCE_EXEC for this sub-process as FAILED
Update the BATCH_CONSUMP_QUEUE table for this process id and set batch status to a value corresponding to FAILURE. These batches are NOT moved to the ARCHIVE_CONSUMP_QUEUE.

InitStep API

Marks start of a sub-process/steps/routine

Add a row in the INSTANCE_EXEC for this sub-process
Update the BATCH_CONSUMP_QUEUE table for this process id and set start_ts with current system TS.

HaltStep API

Marks end of a sub-process/steps/routine

Mark the row in the INSTANCE_EXEC for this sub-process as COMPLETED.
Mark with end_ts (as system TS) in BATCH_CONSUMP_QUEUE.

TermStep API

Records failure of a sub-process/steps/routine

Mark the row in the INSTANCE_EXEC for this sub-process as FAILED.

GetFiles API

Fetches the records of files from join of file table and servers table between the min batch and max batch id passed through parameters.

GetProperties API

Fetches the records from properties table related to the process Id and config group passed as parameters.

GetETLDriver API

Fetches the records of files from file table between the min batch and max batch id passed through parameters.

Returns a string of file details separated by semicolon.

AddBusDomain

Function : Enables adding a new application to the bus_domain table.

On calling the procedure, the inputs get added as a new row entry in the table along with an auto incremented Application id.

AddProcess

Function : Enables adding a new process to the process table.

On calling the procedure, the inputs get added as a new row entry in the table along with an auto incremented process_id.

AddBatchStatus

Function : Enables adding a new batch status to the batch status table.

On calling the procedure, the inputs get added as a new row entry in the table along with batch_state_id.

AddBatchConsumpQueue

Function : Enables adding a new entry to the batch consump queue table.

A row gets added in the table for every batch coming as an output from an upstream process and is ready to get executed by a downstream process. insert_ts gets default values as provided by CURRENT TIMESTAMP.

RegisterFile:

Function : Records registration of a new file and the associated file details.

Add a row in the File table with the file details as provided through the parameters.

AddProcessLog

Function : Enables adding a new process log to the process log table.

On calling the procedure, the inputs get added as a new row entry in the table along with an auto incremented log_id.

BatchEnqueuer:

Function : Enables adding batch in batch table and then a file entry in file table and finally, an entry in batch consump queue.

Add a row in the Batch table,File table and BatchConsumpQueue table, with the details as provided through the parameters.

BatchCheck

Function : Checks the presence of the required batches for initiation of the Process which is passed as parameter.This proc is used in the Init Job.

Metadata DB REST APIs

The Rest APIs have been created to bridge the gap between the UI and the back end DB. These APIs have mainly five functionalities:
1.INSERT
2.GET
3.LIST
3.DELETE
5.UPDATE

Technologies used:

Java
Oracle/MySQL PL/SQL
MyBatis (for JDBC)

Metadata Driven Dependency Management

BDRE offers run control, incremental processing and batch management for Hadoop. User can define the upstream-downstream relationship in BDRE which enables following

Checks if all the upstreams had run and dependencies are met. So even if a job is mistakenly triggered BDRE will automatically fail it in the initialization before it could alter the data in a wrongful way.
Provides pointers to the incremental output produced by the upstreams so the routines can use only processed unprocessed data.
Allows to visualize process pipeline.

Typical Usage with partition pruning in Hive

Assume that there are two processes X and Y registered in BDRE.
Y is a file ingestion process that's loading new file batches into a base table T1. T1 is partitioned on batchid which is supplied by BDRE API every time it populates.
X is a semantic workflow(some workflow consisting of hive queries) that runs a query which selects from T1 and inserts to T2. T2 has two business partitions and in addition to that it has a third partition called 'runid'. Runid is used to populate incremental data coming from T1.
Y is registered as an upstream for one of the sub-processes of X.
Y ran three times successfully, enqueuing 3 batches to X.
X runs once after that and BDRE InitJob passes 3 enqueued batched to Y.
When Y runs BDRE InitJob passes following information from metadata. The numbers are from current values coming from metadata.

target_batchid=201 instanceexecid=3 min_src_batch=3 Max_src_batch=5

This way the query would be able to use Hive partition pruning avoiding full table scan of T1.
New records are appened into the runid partition which is essentially the instanceexecid of the current execution of X.

Technologies used:

JAVA
Oracle/MySQL PL/SQL
MyBatis (for JDBC)

BDRE UI

BDRE has an existing browser Based UI. The UI is supported by a web application and fed by RESTful APIs. The RestAPIs connect internally to the backend database using JDBC and bring the metadata data back and forth between the UI and RDBMS database. Note that BDRE does not have access hence cannot expose Partner's data from HDFS or from anywhere. BDRE has two distinct web applications that are to be deployed in Tomcat application server.

BDRE application is built using Spring3 libraries.

Spring3 MVC
Jquery, Angular (For Ajax/interactiveness front end)
Bootstrap (For look and feel)
Tomcat 6 (or any J2EE compliment app server for containing the web applications)
JAAS (for authentication to the UI)
MyBatis (for JDBC)

Navigation

BDRE provides a two level navigation, one in the left and one in the top for ease of use. All navigations are collapsible for gaining more real estate while viewing a wider page.

Operational Registry/Metadata View/update

Most metadata tables can be accessed via the user interface which uses REST APIs to interact with the backend. BDRE provided a jQuery based spreadsheet like interface to insert and manipulate the data. The REST API uses standard HTTP GET,POST,PUT and DELETE operations to read/write data. Each operation is protected with different user roles (e.g. READONLY role blocks POST,PUT and DELETE).

Workflow Creation with BDRE using templates

Core Workflow Creation
For countries that use core 'set of routines' (the 'set' is here after referred as core workflows) use a template to create the workflows.
Developers will design the workflow template once and create as many workflows for each core countries as required.
If the workflow template is modified then developers can propagate the changes to all workflows for the core counties.
For other countries that do not follow the standard template separate workflows with appropriate routines(Hive queries) can be created.
The location of the hive queries and the parameters passed to the hive queries (like db name for a country) would maintained by BDRE and while generating the workflow, it'll add them in the workflow.xml appropriately. This allows the Hive queries to be written once and reused multiple times.
If the workflow template is updated all workflows created under the template can be updated together in one click.
Developers can edit or view it in read only mode based on users role.
In production only view only mode is suggested for most users. BDRE will log the user names for each actions.

Countries without standard core
Separate workflow processes from the UI needs to be created without a template.

Visual Representation of Deployable Oozie Workflows

After creating the workflows from the UI, users can check the logical representation diagram of the workflows.

Visual Representation of process and interdependencies(process pipeline)

Job dependency information can be visually seen from BDRE

Oozie Workflow Generation

After defining a workflow with its steps from the UI the data is saved in BDRE metadata Oracle/MySQL DB. User can externalize different parameters in properties page for each process. These properties would be passed as ** Hive variables ** automatically when the Oozie workflow is generated. User can also externalize the path of the Hive routines so Oozie can pick it from a standard "bookshelf" folder thereby increasing the reusability.

Process Definition

Process Properties Definition

Generated Oozie xml code block

<action name="hive-10432-Bank_Core_Routine1">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hive-site.xml</job-xml>
            <configuration>
            <property>
                <name>hive.exec.post.hooks</name>
                <value>com.wipro.ats.bdre.hiveplugin.hook.LineageHook</value>
                </property>                <property>
                <name>bdre.lineage.processId</name>
                <value>10432</value>
                </property>
                <property>
                <name>bdre.lineage.instanceExecId</name>
                <value>${wf:actionData("init-job")["instance-exec-id"]}</value>
                </property>
                </configuration>            <script>/bank/rules/bank/core/core_routine1.hql</script>
            <param>exec-id=${wf:actionData("init-job")["instance-exec-id"]}</param>
            <param>target-batch-id=${wf:actionData("init-job")["target-batch-id"]}</param>
            <param>min-batch-id=${wf:actionData("init-job")["min-batch-id-map.10432"]}</param>
            <param>max-batch-id=${wf:actionData("init-job")["max-batch-id-map.10432"]}</param>
            <param>min-pri=${wf:actionData("init-job")["min-source-instance-exec-id-map.10432"]}</param>
            <param>max-pri=${wf:actionData("init-job")["max-source-instance-exec-id-map.10432"]}</param>
            <param>min-batch-marking=${wf:actionData("init-job")["min-batch-marking-map.10432"]}</param>
            <param>max-batch-marking=${wf:actionData("init-job")["max-batch-marking-map.10432"]}</param>
            <param>target-batch-marking=${wf:actionData("init-job")["target-batch-marking"]}</param>
            <param>last-recoverable-sp-id=${wf:actionData("init-job")["last-recoverable-sp-id"]}</param>
            <param>accountTableName=ACCOUNT</param>
            <param>dbname=france</param>
            <param>customerTableName=CUSTOMER</param>
        </hive>
        <ok to="join-10476"/>
        <error to="term-step-10432"/>
    </action>

BDRE workflow generator when launched fetches the workflow information from DB, generates and writes workflow.xml into a file.

Following things are done by BDRE automatically without any explicit instruction or development by the user
Each routine in this use case is defined a Hive action and appropriate xml code block is generated.
Depending on how the steps are defined actions are organized in parallel or in sequence.
BDRE automatically adds InitJob API java action xml in the beginning of the workflow(immediately after the start node) and HaltJob Java action before the end node.

<start to='init-job'></start>
<action name="init-job">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.wipro.ats.bdre.md.api.oozie.OozieInitJob</main-class>
            <arg>--environment-id</arg>
            <arg>${env}</arg>
            <arg>--max-batch</arg>
            <arg>1</arg>
            <arg>--process-id</arg>
            <arg>10431</arg>
            <capture-output />
        </java>
        <ok to="recovery-test"/>
        <error to="kill"/>
</action>

TermJob API is added before kill node.
Before and after a routine, workflow generator inserts InitStep and HaltStep API Automatically.
If two or more routines to execute in parallel, BDRE automatically inserts an Oozie fork node before the InitStep of parallel routines. After the routine execution completion BDRE adds a join node capture the success of all routines in the parallel stage and then sequentially adds the HaltStep API for each of the routines.
"error to" for all actions is directed to corresponding TermStep actions and ultimately to TermJob followed by a kill.

Following two images show how the job configuration translate to what Oozie workflow with metadata APIs embedded.

Workflow restart-ability

Because of the job metadata APIs added to the workflow, BDRE can log the status of each workflow and steps with in. If a workflow has failed at a certain step in previous execution, BDRE would know it. So if the restart-ability is turned on for a step it'll restart from the last failed step automatically. The InitJob API returns the last successful step id. Workflow generator automatically adds a block to read that parameter and redirects the flow to start from the last failed step.

<decision name="recovery-test">
    <switch>
      <case to="fork-10431">${wf:actionData('init-job')['last-recoverable-sp-id'] eq '10431'}</case>
      <case to="init-step-10476">${wf:actionData('init-job')['last-recoverable-sp-id'] eq '10476'}</case>
      <case to="init-step-10477">${wf:actionData('init-job')['last-recoverable-sp-id'] eq '10477'}</case>
      <default to="fork-10431"/>
    </switch>
</decision>

workflow Deployment

BDRE provides with set of shell scripts that deploy generated Oozie workflows to HDFS. It is recommended that we have a separation between BDRE and the cluster. That is instead of calling the deployment shell script directly from the BDRE web application hosted in Tomcat, install the script in the corporate CI tool. If the CI tool provides a REST API to trigger the jobs, BDRE is already capable of calling it. However depending on the need/restriction/processes BDRE shell scripts can be modified to perform the deployment differently. E.g. check-in the generated workflows into SVN and then follow the 'standard code deployment' procedure. Significant automation makes the development more efficient. However, in production environment the deployment should be per change control and access can be configured to limit the deployments.

Deployment Automation using Continuous integration and version control

Technologies used:

Java
Oracle/MySQL PL/SQL
MyBatis (for JDBC)
Oozie

Data Lineage

BDRE the Oozie workflow generated by BDRE includes a post execution hook that captures the Hive queries executed successfully as part of the workflow. BDRE logs the Hive queries executed via Oozie for the workflows generated by BDRE. The query then gets analyzed and lineage information is extracted and stored in BDRE metadata database (Oracle/MySQL). When user requests for lineage from the UI, it fetches the pre-analyzed information from the BDRE backend and returns a visual representation of the lineage.

insert overwrite table bank.tmp_table select account_id, 
city from (select city, (row_number() over (partition by account_id order by cts desc)) as ACCOUNT_ID 
from (  
select account_id, city, min(creation_timestamp) as cts from  bank.final 
where instanceexecid in (1,2) group by account_id, city) T1) T2  where T2.rn <= 2;

Technologies used:

Java
Oracle/MySQL PL/SQL
MyBatis (for JDBC)
Hive

Random Test Data Generation

During testing it's often required to have production like bulk data for proper testing. Copying data from production can be seen as a security risk and requires the data to be masked. Sometimes copying data from production to development cluster is a slow process and affects testing efficiency. BDRE has a capability to manufacture the data automatically based on schema/pattern/range and prevent any exposure to production data.

Map only MR job written in Java. Like everything else in BDRE, code is given and tweaking can be done if necessary.
Users can specify the range(for numbers and date) or regular expression for more complex ones and number of rows to be generated.
Complete configuration is done from BDRE UI and BDRE creates test data generation Oozie workflow.
The generated workflow will automatically include the invocation of the
BDRE manufactures the data randomly based on the range or the patterns.
Parallelism can be control and users can specify number of parallel generator tasks to be used.
Produces a flat CVS file with millions of records which can be loaded in Hive tables (or anywhere).
BDRE also comes with a Hive data loader program which when configured can load data automatically after generation into a Hive table and also converts the data in any Hive I/O format. Default is ORC.

Technologies used:

Java
Oracle/MySQL PL/SQL
MyBatis (for JDBC)
Hadoop Map/Reduce

Data Ingestion

Analysis on data can be done once the data is ingested to the Hadoop environment. The data can be ingested from different sources e.g. RDBMS,MQs to the Hadoop environment using the templates available in BDRE UI.

RDBMS Data Import

The above image shows the flow of data ingestion process for RDBMS as data source.

The database URL,user name,password and driver is supplied through UI to establish the connection with data source.
Next, the tables and columns can be selected which has to be imported. There is a provision of importing data as delimited file to HDFS along with loading the same file into the Hive by selecting ingest only option as False.Selecting True will imply ingestion of data as delimited file into HDFS only.
Also, the incremental type can be chosen as follows: - None : non-incremental import - Append Rows : ingestion of newly added records - Last Modified : updated records in the data source based on time stamp.

The mapping of columns from RDBMS to Hive is done automatically. The Hive database name is also provided.

The data load into Hive is a three step process.
1.File To Raw : Loads data from a file to a Raw table in Hive. The data is stored as TEXTFILE format in Hive from which a view is created in subsequent step.
2.Raw To Stage : Creates a temporary table Stage having the same schema as the Base table and inserts data from Raw table to the Stage table 3.Stage To Base : Finally moves the data from Stage table to Base table, thereby completing the file -> table process.
Next, the business domain Id is provided and the jobs are created.
The metadata entries are made automatically in the dependent tables for the execution.This includes the jobs and their required parameter entries.
The workflow is generated and deployed.Finally, the jobs are run.

Technologies used:

Sqoop
Hive
Java
Oracle/MySQL PL/SQL
MyBatis (for JDBC)

MQ Data Import

The above image shows the flow of data ingestion process for MQ as data source.

All the real time data can be stored in the messaging server.The Real Time data populating the messaging server can be ingested to HDFS as files.
As soon as the messaging server receives messages from heterogeneous data sources, the MQ import module creates a file in HDFS and starts writing into it. A new file is created once the file size limit is reached and the former one is registered in the BDRE metadata.
The following parameters are set through UI :
- broker URL, the URL of messaging server from where the messages can be fetched.
- number of spouts and bolts which gives the liberty to process the data in parallel. Spouts are the consumers which fetches the messages from the server while bolts process the data fetched i.e file rotation (creating new file once file size limit is reached) and register the file in metadata.
- queue name, the queue from where the messages has to be fetched.
The job is created and the workflow is generated automatically.
The job can be deployed through Apps & Processes page and then finally can be run through CI.

TOPOLOGY:

Technologies used:

Java
Storm
Oracle/MySQL PL/SQL
MyBatis (for JDBC)

Application success/failure alerting system

The Java APIs viz HaltJob and TermJob can currently push machine understandable success and failure events to a JMS server (e.g. WebSphear MQ or Active MQ etc).

This feature is embedded with all generated Oozie workflows by BDRE.
BDRE has this feature but it's optional to use. It helps with proactive alerting when an exception happens to the workflow(e.g Hive routine failure for some reason).
BDRE also has a JMS consumer code to read the messages from MQ and convert that to an email message.
Alternatively a consumer can be written easily to create a JIRA tickets upon workflow failure.
BDRE logs and persists success/failure information in it's database anyway, so all the success/failure information can be tracked on demand for all workflows created using BDRE.

Data Quality

BDRE comes with a DQ module. Like everything else in BDRE, DQ can be configured from BDRE UI.

Performs a row by row column level validation.
Currently works on delimited text file
The input file is picked from HDFS and never downloads to edge node or anywhere.
DQ validation rules are defined in Drools
DQ module spawns a map only job where each mapper pulls the rules from the rule engine and performs row by row validation in the cluster(nothing is performed in the run engine side)
It creates two files in HDFS one with good and other with all rejected records.

Input Data

Output Data

BDRE Installation

Software required to develop, build and run

JDK
A Version Control System (e.g. SVN or Git)
Maven 3 and a maven repository.
A Java Development IDE (e.g. Intellij Idea, RAD , Eclipse etc)
Oracle/MySQL Environment
A SSH Client(putty)
Chrome browser
A Sandbox running HDP 2.x.
A CI Tool (for automated building deployment of BDRE and other applications and automated workflow deployment in dev)
Tomcat6

There are three project workspace to build. Before building edit the bdre config files and update the config like DB credentials.

<environment id="development">
            <transactionManager type="JDBC"/>
            <dataSource type="POOLED">
                <property name="driver" value="ORACLE_JDBCDRIVER"/>
                <property name="url" value="ORACLE_JDBC_URL"/>
                <property name="username" value="DBUSER"/>
                <property name="password" value="DBPASS"/>
            </dataSource>
</environment>

They can be built using standard mvn clean install.

Installation

Assuming we are installing BDRE in edge node. Create one user in the edgenode for running Tomcat as. The default Tomcat user is fine too.
Copy md-ui.war and md-rest-api artifacts to tomcat webapp directory.
Create an Oracle/MySQL database and using following scripts create BDRE backend tables and procedures.

sh oracle/scripts/create-tables.sh $oracleuser $oraclepwd 
sh oracle/scripts/create-procs.sh $oracleuser $oraclepwd

Start Tomcat and access BDRE web console through a ssh tunnel like http://localhost:8080/md-ui/pages/content.page
Copy the BDRE metadata API jar artifacts in Oozie shared lib directory in HDFS.
Alternatively those jars can be deployed as uber(fat) jars to each workflow's lib directory.
After creating some workflow from the UI. Generate the workflow from the command line as

java -cp "workflow-generator.jar" com.wipro.ats.bdre.wgen.WorkflowGenerator 
--parent-process-id $processId 
--file-name workflow-$processId.xml 
--environment-id $environment

-This will generate workflow-$process-id.xml file containing the Oozie workflow

You have transfer this file in hdfs to run the workflow.
To run an oozie workflow you have to create a job.properties file. We have a set of shell scripts that generate the job.properties as well. this file, we provide path of workflow.xml in hdfs and lib path.
You can use oozie run command to run the workflow. Oozie client automatically exits with success if the job submission is successful. It does not wait for the job to be actually completed in the cluster.
BDRE provides a python script that take the process id as parameter and triggers the workflow for that process. This python scripts also polls Oozie till the workflow completes in the cluster. It exits with 0 success code when the workflow completes gracefully or a non-zero code when the workflow fails.
This script can use used to trigger the workflows from Control M. That way control-M will would be in running mode till the job is complete and show right status when the job fails.

Requirement for BDRE

Maven Repository (Need core BDRE to work)

If a private MVN repository is used all required Hadoop and standard open source jars(e.g. log4j and apache commons) should be available.

Network Requirement (Need core BDRE to work)

The edge node where BDRE web app is installed(in Tomcat) would need to connect to Oracle/MySQL for metadata retrieval.
Since the Oozie jobs run from GBSM (Babar) cluster the apps needs to push app metadata to Oracle/MySQL so a connectivity is required between the GBSM cluster and Oracle/MySQL.
Since MQ alert may not be used for this release. MQ server connectivity from Cluster is not needed.

BDRE SVN and CI Access (Specific feature requirement)

Needed for automated deployment of Oozie workflows.
BDRE would need a SVN id to checkin the Oozie workflows in version control.
BDRE would require some access mechanism to call CI tool that will deploy the workflows in HDFS and job.properties in the edge node.
Note the CI tool actually does the deployment of the file in HDFS. BDRE does not need access to the Hadoop file system.

Database Access (Need core BDRE to work)

BDRE needs a dedicated Oracle/MySQL database.
Table DDLs will be provided to DBA to create the tables.
BDRE also has around 170 stored procs, all scripts will be provided to DBA.
A default schema called 'bdre' can be created and all objects should be created under this schema.
The users need ids to access the DB.
Select, insert , delete on all tables
Should be able to execute all the stored procedure.
Credential is needed for BDRE application to access the database. BDRE application id should have following access
Select, insert , delete on all tables
Should be able to execute all the stored procedure.
Default schema for this id is BDRE.
During development and unit testing, stored procedured need to be modified and tested. So create and drop access to the developer ids are required for the stored produres in the test BDRE DB.

Tomcat server (Need core BDRE to work)

Two war files will be deployed in Tomcat
Some access through F5 is needed for users to access the BDRE URL from the browser.