witboost

Designed by Agile Lab, witboost is a versatile platform that addresses a wide range of sophisticated data engineering challenges. It enables businesses to discover, enhance, and productize their data, fostering the creation of automated data platforms that adhere to the highest standards of data governance. Want to know more about witboost? Check it out here or contact us!

This repository is part of our Starter Kit meant to showcase witboost's integration capabilities and provide a "batteries-included" product.

Snowflake Specific Provisioner

Overview

This project implements a simple Specific Provisioner for Snowflake. After deploying this microservice and configuring witboost to use it, the platform can deploy components of a Data Product that use Snowflake. Supported components are: Output Port, Storage Area.

Refer to the witboost Starter Kit repository for information on the templates that can be used with this Specific Provisioner.

What's a Specific Provisioner?

A Specific Provisioner is a microservice which is in charge of deploying components that use a specific technology. When the deployment of a Data Product is triggered, the platform generates its descriptor and orchestrates the deployment of every component contained in the Data Product. For every such component the platform knows which Specific Provisioner is responsible for its deployment, and can thus send it a provisioning request with the descriptor; the Specific Provisioner performs whatever operations are required to fulfill the request and reports the outcome back to the platform.

You can learn more about how the Specific Provisioners fit in the broader picture here.

Snowflake

Snowflake is a cloud-based data warehousing platform that provides scalable storage, easy data sharing, and on-demand, near-infinite computing power for processing complex, analytical queries on structured and semi-structured data. It separates compute and storage resources, allowing users to scale and pay for each independently, maximizing efficiency and cost-effectiveness. Learn more about it on the official website.

Software stack

This microservice is written in Scala 2.13, using Akka HTTP (pre-license change) for the HTTP layer. Communication with Snowflake is handled via the official JDBC driver. The project is built with SBT and supports packaging as a JAR, a fat JAR or a Docker image; the latter is ideal for Kubernetes deployments, which is the preferred option.

Building

Requirements:

  • Java 17 (11 works as well)
  • SBT
  • Node
  • Docker (for building images only)

Generating sources: this project uses OpenAPI as its standard API specification and the OpenAPI Generator CLI to generate server code from it.

In a terminal, install the OpenAPI Generator CLI and run the generateCode SBT task:

npm install @openapitools/openapi-generator-cli -g
sbt generateCode

Note: the generateCode SBT task needs to be run again if clean or similar tasks are executed.

Compiling: is handled by the standard task:

sbt compile

Tests: are handled by the standard task as well:

sbt test

Artifacts & Docker image: the project uses SBT Native Packager for packaging. Build artifacts with:

sbt package

The Docker image can be built with:

sbt docker:publishLocal

Note: the version for the project is automatically computed using information gathered from Git, based on branch name and tags. Unless you are on a release branch (e.g. 1.2.x) or a tag (e.g. v1.2.3), it will end up being 0.0.0. You can follow this branch/tag convention or update the version computation to match your preferred strategy.
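
Once published locally, you can smoke-test the image with a plain docker run. This is a hedged sketch: the image name and tag below are assumptions based on the project name and the version note above, so check the sbt output for the actual values.

docker run -p 8093:8093 \
  -e SNOWFLAKE_USER=<user> \
  -e SNOWFLAKE_PASSWORD=<password> \
  witboost-snowflake-specific-provisioner:0.0.0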

CI/CD: the pipeline is based on GitLab CI as that's what we use internally. It's configured by the .gitlab-ci.yaml file in the root of the repository. You can use that as a starting point for your customizations.

Running

To run the server locally, use:

sbt generateCode compile run

By default, the server binds to port 8093 on localhost. Once it's up and running, you can make provisioning requests to this address. You can also check the API documentation served here.
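
For a quick check, you can hit the provisioning endpoint with curl. The sketch below assumes the standard witboost Specific Provisioner API (POST /v1/provision with a descriptorKind and a YAML descriptor string); verify the exact paths and payload against the served OpenAPI documentation.

curl -X POST http://127.0.0.1:8093/v1/provision \
  -H 'Content-Type: application/json' \
  -d '{"descriptorKind": "COMPONENT_DESCRIPTOR", "descriptor": "<YAML descriptor of the component>"}'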

Configuring

Most application configurations are handled with the Typesafe Config library. You can find the default settings in the reference.conf and some application.conf examples in the Helm chart (see below). Customize them and use the config.file system property or the other options provided by Typesafe Config according to your needs.
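For example, assuming you packaged the fat JAR, you could start the service with a custom configuration file as follows (a sketch; the JAR name is hypothetical and depends on your build):

java -Dconfig.file=/path/to/application.conf -jar snowflake-specific-provisioner-assembly.jar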

Snowflake connection information can be passed in via environment variables or the configuration file; the default mapping resides in reference.conf and is the following:

Setting                        Environment Variable
snowflake.user                 SNOWFLAKE_USER
snowflake.password             SNOWFLAKE_PASSWORD
snowflake.role                 SNOWFLAKE_ROLE
snowflake.account              SNOWFLAKE_ACCOUNT
snowflake.warehouse            SNOWFLAKE_WAREHOUSE
snowflake.jdbc-url             JDBC_URL
snowflake.account-locator-url  SNOWFLAKE_ACCOUNT_LOCATOR_URL
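
As a sketch, this mapping in reference.conf typically relies on Typesafe Config's optional environment variable substitution, along these lines (the actual file may differ slightly):

snowflake {
  user = ${?SNOWFLAKE_USER}
  password = ${?SNOWFLAKE_PASSWORD}
  role = ${?SNOWFLAKE_ROLE}
  account = ${?SNOWFLAKE_ACCOUNT}
  warehouse = ${?SNOWFLAKE_WAREHOUSE}
  jdbc-url = ${?JDBC_URL}
  account-locator-url = ${?SNOWFLAKE_ACCOUNT_LOCATOR_URL}
}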

To clarify, below is a detailed explanation of how to configure your variables. In your Snowflake environment, you can retrieve this information from the lower-left part of the page:

[Image: Snowflake account parameters]

Based on this, you can configure your variables simply by following the placeholders in the previous image. The only exceptions are these two:

  • JDBC_URL: a string of the form jdbc:snowflake://[ORGANIZATION]-[ACCOUNT NAME].snowflakecomputing.com. Following the placeholders, you can easily build your JDBC string
  • SNOWFLAKE_ACCOUNT_LOCATOR_URL: you can retrieve this from the dedicated button, highlighted in green in the image
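
For example, with a hypothetical organization MYORG and account name MYACCOUNT, the two special variables could be set as follows (all values below are placeholders):

export JDBC_URL="jdbc:snowflake://MYORG-MYACCOUNT.snowflakecomputing.com"
export SNOWFLAKE_ACCOUNT_LOCATOR_URL="https://xy12345.eu-west-1.snowflakecomputing.com"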

Logging is handled with Logback; you can find an example logback.xml in the Helm chart. Customize it and pass it via the logback.configurationFile system property.

Account Setup

It's possible to run this microservice with a role other than the account administrator.

NOTE: if you want to use a custom role together with the tag support feature, pay attention to the dedicated paragraph below, because additional settings are needed.

To use a custom role, you need a Snowflake user with the following permissions:

  • create role on the account
  • manage grants on the account
  • create database on the account
  • usage on the warehouse

You can easily set up your own dedicated Snowflake user by following a few specific instructions.

For example, you could create a custom role 'WITBOOST' through which the needed privileges are granted to an existing user or to a new one. Below are the commands to execute:

CREATE ROLE WITBOOST; 
GRANT CREATE ROLE ON ACCOUNT TO ROLE WITBOOST;
GRANT MANAGE GRANTS ON ACCOUNT TO ROLE WITBOOST;
GRANT CREATE DATABASE ON ACCOUNT TO ROLE WITBOOST;
GRANT USAGE ON WAREHOUSE <YOUR_WAREHOUSE> TO ROLE WITBOOST;
GRANT ROLE WITBOOST TO USER WITBOOST;

In this way, you create a Snowflake role 'WITBOOST' and associate it with an existing user 'WITBOOST'. After this setup, the provisioner can use this user and role to perform its actions.

Tag feature setup

The Snowflake provisioner is also able to create and remove tags from columns and components.

This feature requires access to the SNOWFLAKE.ACCOUNT_USAGE schema. It's still possible to use a custom role, but some preliminary actions are needed.

These actions are mandatory if you use a custom role, because such a role cannot access that schema directly.

These are the steps to execute:

CREATE DATABASE IF NOT EXISTS WITBOOST;
CREATE SCHEMA IF NOT EXISTS WITBOOST.CONFIGURATIONS; 
CREATE VIEW IF NOT EXISTS WITBOOST.CONFIGURATIONS.TAG_REFERENCES AS SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES; 
GRANT USAGE ON DATABASE WITBOOST TO ROLE WITBOOST; 
GRANT USAGE ON SCHEMA WITBOOST.CONFIGURATIONS TO ROLE WITBOOST; 
GRANT SELECT ON VIEW WITBOOST.CONFIGURATIONS.TAG_REFERENCES TO ROLE WITBOOST;

If you want to use a different database, schema or view, you are free to modify the previous commands with other values. Just remember to also configure these settings:

Setting                            Environment Variable               Default value
snowflake.tag-references.database  SNOWFLAKE_TAG_REFERENCES_DATABASE  WITBOOST
snowflake.tag-references.schema    SNOWFLAKE_TAG_REFERENCES_SCHEMA    CONFIGURATIONS
snowflake.tag-references.view      SNOWFLAKE_TAG_REFERENCES_VIEW      TAG_REFERENCES
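
For instance, if you created the view in a different location, a configuration override could look like this (a sketch with hypothetical names):

snowflake.tag-references {
  database = "MY_DB"
  schema = "MY_SCHEMA"
  view = "MY_TAG_REFERENCES"
}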

Principals Mapper

When the updateAcl operation is invoked, it receives a list of identities as a parameter. This is the list of users/groups/roles that need access to the provisioned resource. These identities have a meaning in the Witboost context, but might not make sense in the context of your resources (in this case, Snowflake).

For this reason, we introduced the concept of a PrincipalsMapper strategy, which converts identities from the Witboost context to the one Snowflake uses.

Identity Mapper Strategy


In this strategy, the input is returned as output without any transformation, trimming only the Witboost prefixes user: and group: (e.g. user:john.doe becomes john.doe).

Setting                               Environment Variable                  Default value  Allowed values
snowflake.principals-mapper.strategy  SNOWFLAKE_PRINCIPALS_MAPPER_STRATEGY  identity       identity, table-based

The default configuration uses an identity mapping strategy.

Table Based Mapper Strategy


It is possible to use a table-based mapping if, for example, Snowflake is not configured to use SSO. In this case a mapping table is used; it is expected to already exist and to contain all the required mapping entries.

The following options allow you to customize the database, schema and name of the mapping table:

Setting                                           Environment Variable                              Default value
snowflake.principals-mapper.table-based.database  SNOWFLAKE_PRINCIPALS_MAPPER_TABLE_BASED_DATABASE  WITBOOST
snowflake.principals-mapper.table-based.schema    SNOWFLAKE_PRINCIPALS_MAPPER_TABLE_BASED_SCHEMA    CONFIGURATIONS
snowflake.principals-mapper.table-based.table     SNOWFLAKE_PRINCIPALS_MAPPER_TABLE_BASED_TABLE     PRINCIPALS_MAPPING
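
For example, to enable the table-based strategy with a custom table location, the configuration could look like this (a sketch with hypothetical names):

snowflake.principals-mapper {
  strategy = "table-based"
  table-based {
    database = "MY_DB"
    schema = "MY_SCHEMA"
    table = "MY_PRINCIPALS_MAPPING"
  }
}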

You can find a sample SQL script to create the required objects here. Update it before executing if you decide to use values different from the defaults for the database, schema or table name.

The role used by the provisioner must have SELECT privileges on this table.
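
As an illustration only, such a table and its grant might look like the following. This is a hypothetical sketch with an assumed two-column layout; refer to the linked script for the actual definition:

CREATE TABLE IF NOT EXISTS WITBOOST.CONFIGURATIONS.PRINCIPALS_MAPPING (
  WITBOOST_IDENTITY VARCHAR,   -- hypothetical column: identity as known to Witboost
  SNOWFLAKE_IDENTITY VARCHAR   -- hypothetical column: corresponding Snowflake user/role
);
GRANT SELECT ON TABLE WITBOOST.CONFIGURATIONS.PRINCIPALS_MAPPING TO ROLE WITBOOST;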

Reverse Provisioning

This Specific Provisioner supports the Reverse Provisioning operation, used in Witboost to import existing resources into the platform or update them. The Snowflake Specific Provisioner supports reverse provisioning for both output port and storage area components. Specifically, it supports:

  • Reverse provisioning of a list of tables (modifying the spec.mesh.specific.tables catalog-info field, containing table name and column metadata) for storage area components given the input:
    {
      "database": "databaseName",
      "schema": "schemaName",
      "tables": ["listOf", "tableNames"]
    }
  • Reverse provisioning for output port components, including data contract schema (modifying spec.mesh.dataContract.schema) and table/view tags (spec.mesh.tags) given the input:
    {
      "database": "databaseName",
      "schema": "schemaName",
      "viewName": "viewName"
    }

    Currently, reverse provisioning doesn't support importing column tags on the data contract schema.

The Specific Provisioner identifies the type of component to be imported based on the useCaseTemplateId sent as part of the request. For this purpose, two configuration fields are available in the Specific Provisioner configuration file to specify the expected useCaseTemplateIds:

Setting                                            Default value
specific-provisioner.storage-useCaseTemplateId     urn:dmb:utm:snowflake-storage-template:0.0.0
specific-provisioner.outputport-useCaseTemplateId  urn:dmb:utm:snowflake-outputport-template:0.0.0

Set these values based on the useCaseTemplateId used by your components.
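
For example, a configuration override could look like this (a sketch; replace the URNs with the ones used by your templates):

specific-provisioner {
  storage-useCaseTemplateId = "urn:dmb:utm:snowflake-storage-template:0.0.0"
  outputport-useCaseTemplateId = "urn:dmb:utm:snowflake-outputport-template:0.0.0"
}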

Deploying

This microservice is meant to be deployed to a Kubernetes cluster with the included Helm chart and the scripts that can be found in the k8s subdirectory.

License

This project is available under the Apache License, Version 2.0; see LICENSE for full details.

About Witboost

Witboost is a cutting-edge Data Experience platform that streamlines complex data projects across various platforms, enabling seamless data production and consumption. This unified approach empowers you to fully utilize your data without platform-specific hurdles, fostering smoother collaboration across teams.

It seamlessly blends business-relevant information, data governance processes, and IT delivery, ensuring technically sound data projects aligned with strategic objectives. Witboost facilitates data-driven decision-making while maintaining data security, ethics, and regulatory compliance.

Moreover, Witboost maximizes data potential through automation, freeing resources for strategic initiatives. Put your data to work for growth, innovation and competitive advantage.

Contact us or follow us on our social channels.