feat!: Implement operator #23

Merged
merged 31 commits into from
Oct 18, 2023
4cf56c4
feat!: Implement operator
jsbroks Aug 29, 2023
ed0a70d
terraform-docs: automated action
github-actions[bot] Aug 29, 2023
ef028e4
add operator values to chart
jsbroks Aug 30, 2023
de8e916
terraform-docs: automated action
github-actions[bot] Aug 30, 2023
60c38f5
fix host
jsbroks Aug 30, 2023
921422c
update controller version
jsbroks Aug 30, 2023
abb2a2d
use proper provider naming
jsbroks Aug 31, 2023
9712def
bump controller version
jsbroks Sep 1, 2023
e4864ba
forward license to operator
jsbroks Sep 1, 2023
e5cc115
bump controller
jsbroks Sep 23, 2023
922a40c
Update main.tf
jsbroks Sep 30, 2023
8d75f0c
Merge branch 'main' into operator
jsbroks Sep 30, 2023
3e0ec02
terraform-docs: automated action
github-actions[bot] Sep 30, 2023
f9e1f8a
Merge branch 'main' into operator
jsbroks Oct 4, 2023
78cd222
Merge branch 'main' into operator
jsbroks Oct 6, 2023
ef601a0
Merge branch 'main' into operator
jsbroks Oct 11, 2023
fd822f9
terraform-docs: automated action
github-actions[bot] Oct 11, 2023
7e0f0a5
Merge branch 'main' into operator
jsbroks Oct 13, 2023
0e08d0c
terraform-docs: automated action
github-actions[bot] Oct 13, 2023
de15e63
Support for external bucket
jsbroks Oct 16, 2023
f5a264a
terraform-docs: automated action
github-actions[bot] Oct 16, 2023
6062db2
cleanup external bucket
jsbroks Oct 16, 2023
af4541a
Add variables for datadog
Oct 17, 2023
9ea7aed
terraform-docs: automated action
github-actions[bot] Oct 17, 2023
ef35437
Amendments.
Oct 17, 2023
fbfc1e9
update to external bucket support
venky-wandb Oct 18, 2023
e1a91d1
terraform-docs: automated action
github-actions[bot] Oct 18, 2023
c8f0b86
bump operator version
jsbroks Oct 18, 2023
2052ca8
rollback dd changes
jsbroks Oct 18, 2023
f5c45e3
terraform-docs: automated action
github-actions[bot] Oct 18, 2023
65c82b7
remove more datadog
jsbroks Oct 18, 2023
10 changes: 2 additions & 8 deletions README.md
@@ -47,9 +47,7 @@ resources that lack official modules.

| Name | Source | Version |
|------|--------|---------|
-| <a name="module_aks_app"></a> [aks\_app](#module\_aks\_app) | wandb/wandb/kubernetes | 1.14.0 |
| <a name="module_app_aks"></a> [app\_aks](#module\_app\_aks) | ./modules/app_aks | n/a |
-| <a name="module_app_ingress"></a> [app\_ingress](#module\_app\_ingress) | ./modules/app_ingress | n/a |
| <a name="module_app_lb"></a> [app\_lb](#module\_app\_lb) | ./modules/app_lb | n/a |
| <a name="module_cert_manager"></a> [cert\_manager](#module\_cert\_manager) | ./modules/cert_manager | n/a |
| <a name="module_database"></a> [database](#module\_database) | ./modules/database | n/a |
@@ -58,12 +56,12 @@ resources that lack official modules.
| <a name="module_redis"></a> [redis](#module\_redis) | ./modules/redis | n/a |
| <a name="module_storage"></a> [storage](#module\_storage) | ./modules/storage | n/a |
| <a name="module_vault"></a> [vault](#module\_vault) | ./modules/vault | n/a |
+| <a name="module_wandb"></a> [wandb](#module\_wandb) | wandb/wandb/helm | 1.2.0 |

## Resources

| Name | Type |
|------|------|
-| [azurerm_client_config.current](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/data-sources/client_config) | data source |

## Inputs

@@ -75,8 +73,7 @@ resources that lack official modules.
| <a name="input_database_version"></a> [database\_version](#input\_database\_version) | Version for MySQL | `string` | `"5.7"` | no |
| <a name="input_deletion_protection"></a> [deletion\_protection](#input\_deletion\_protection) | If the instance should have deletion protection enabled. The database / Bucket can't be deleted when this value is set to `true`. | `bool` | `true` | no |
| <a name="input_domain_name"></a> [domain\_name](#input\_domain\_name) | Domain for accessing the Weights & Biases UI. | `string` | `null` | no |
-| <a name="input_external_bucket"></a> [external\_bucket](#input\_external\_bucket) | String to configure a non Azure bucket (s3://user:pass@bucket) | `string` | `""` | no |
-| <a name="input_external_bucket_region"></a> [external\_bucket\_region](#input\_external\_bucket\_region) | When using external bucket the Region is mandatory | `string` | `""` | no |
+| <a name="input_external_bucket"></a> [external\_bucket](#input\_external\_bucket) | config an external bucket | `any` | `null` | no |
| <a name="input_kubernetes_instance_type"></a> [kubernetes\_instance\_type](#input\_kubernetes\_instance\_type) | Use for the Kubernetes cluster. | `string` | `"Standard_D4a_v4"` | no |
| <a name="input_kubernetes_node_count"></a> [kubernetes\_node\_count](#input\_kubernetes\_node\_count) | n/a | `number` | `2` | no |
| <a name="input_license"></a> [license](#input\_license) | Your wandb/local license | `string` | n/a | yes |
@@ -87,8 +84,6 @@ resources that lack official modules.
| <a name="input_oidc_issuer"></a> [oidc\_issuer](#input\_oidc\_issuer) | A url to your Open ID Connect identity provider, i.e. https://cognito-idp.us-east-1.amazonaws.com/us-east-1_uiIFNdacd | `string` | `""` | no |
| <a name="input_oidc_secret"></a> [oidc\_secret](#input\_oidc\_secret) | The Client secret of application in your identity provider | `string` | `""` | no |
| <a name="input_other_wandb_env"></a> [other\_wandb\_env](#input\_other\_wandb\_env) | Extra environment variables for W&B | `map(any)` | `{}` | no |
-| <a name="input_resource_limits"></a> [resource\_limits](#input\_resource\_limits) | Specifies the resource limits for the wandb deployment | `map(string)` | <pre>{<br> "cpu": null,<br> "memory": null<br>}</pre> | no |
-| <a name="input_resource_requests"></a> [resource\_requests](#input\_resource\_requests) | Specifies the resource requests for the wandb deployment | `map(string)` | <pre>{<br> "cpu": "2000m",<br> "memory": "2G"<br>}</pre> | no |
| <a name="input_ssl"></a> [ssl](#input\_ssl) | Enable SSL certificate | `bool` | `true` | no |
| <a name="input_storage_account"></a> [storage\_account](#input\_storage\_account) | Azure storage account name | `string` | `""` | no |
| <a name="input_storage_key"></a> [storage\_key](#input\_storage\_key) | Azure primary storage access key | `string` | `""` | no |
@@ -107,7 +102,6 @@ resources that lack official modules.
| <a name="output_cluster_client_certificate"></a> [cluster\_client\_certificate](#output\_cluster\_client\_certificate) | n/a |
| <a name="output_cluster_client_key"></a> [cluster\_client\_key](#output\_cluster\_client\_key) | n/a |
| <a name="output_cluster_host"></a> [cluster\_host](#output\_cluster\_host) | n/a |
-| <a name="output_external_bucket"></a> [external\_bucket](#output\_external\_bucket) | n/a |
| <a name="output_fqdn"></a> [fqdn](#output\_fqdn) | The FQDN to the W&B application |
| <a name="output_storage_account"></a> [storage\_account](#output\_storage\_account) | n/a |
| <a name="output_storage_container"></a> [storage\_container](#output\_storage\_container) | n/a |
34 changes: 34 additions & 0 deletions downtime.py
@@ -0,0 +1,34 @@
+import requests
+import time
+
+def ping_server(server_url):
+
+    while True:
+        try:
+            response = requests.get(server_url)
+
+            if response.status_code == 200:
+                print(f'Server {server_url} is up!')
+            else:
+                print(f'Server {server_url} is down! Status code: {response.status_code}')
+                downtime_start = time.time()  # record when the server started going down
+
+                while True:
+                    # Check periodically if the server is back up
+                    response = requests.get(server_url)
+                    if response.status_code == 200:
+                        downtime_end = time.time()  # record when the server went back up
+                        downtime_duration = downtime_end - downtime_start
+                        print(f'Server {server_url} was down for {downtime_duration} seconds')
+                        break
+                    else:
+                        time.sleep(2)  # you can change the time to wait before trying again
+
+        except requests.exceptions.RequestException:
+            print(f'Could not connect to {server_url}')
+            time.sleep(2)  # wait before trying to connect again
+
+        time.sleep(2)  # wait before pinging the server again
+
+# Call the function
+ping_server('https://qa-azure.wandb.io')
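Note on downtime.py: the script loops forever and calls `requests.get` without a timeout, so a hung connection could stall the measurement. Below is a minimal sketch of the same downtime arithmetic with the probe, clock, and sleep injected, so it can be exercised without a network. The names `measure_downtime`, `probe`, `clock`, `sleep`, `interval`, and `max_checks` are hypothetical and are not part of this PR.

```python
import time
from typing import Callable, Optional


def measure_downtime(
    probe: Callable[[], bool],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 2.0,
    max_checks: int = 100,
) -> Optional[float]:
    """Return the seconds between the first failed probe and the next success.

    Returns None when no failure followed by a recovery is observed
    within max_checks probes.
    """
    downtime_start: Optional[float] = None
    for _ in range(max_checks):
        is_up = probe()
        now = clock()
        if not is_up and downtime_start is None:
            # First failed probe: remember when the outage was first seen.
            downtime_start = now
        elif is_up and downtime_start is not None:
            # First successful probe after an outage: report its length.
            return now - downtime_start
        sleep(interval)
    return None
```

A real probe could wrap `requests.get(url, timeout=5)`, return `True` on status code 200, and return `False` on `requests.exceptions.RequestException`. Bounding the loop with `max_checks` is a design choice of this sketch, not behaviour of the original script.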
154 changes: 85 additions & 69 deletions main.tf
@@ -65,7 +65,7 @@ module "vault" {
 }

 module "storage" {
-  count = (var.blob_container == "" && var.external_bucket == "") ? 1 : 0
+  count = (var.blob_container == "" && var.external_bucket == null) ? 1 : 0
   source = "./modules/storage"
   namespace = var.namespace
   resource_group_name = azurerm_resource_group.default.name
@@ -111,17 +111,17 @@ locals {
   account_name = try(module.storage[0].account.name, "")
   access_key = try(module.storage[0].account.primary_access_key, "")
   queue_name = try(module.storage[0].queue.name, "")
-  blob_container = coalesce(var.external_bucket, var.blob_container, local.container_name)
-  storage_account = var.external_bucket != "" ? "" : coalesce(var.storage_account, local.account_name, "")
-  storage_key = var.external_bucket != "" ? "" : coalesce(var.storage_key, local.access_key, "")
-  bucket = var.external_bucket != "" ? var.external_bucket : "az://${local.storage_account}/${local.blob_container}"
-  queue = (var.use_internal_queue || var.blob_container == "" || var.external_bucket == "") ? "internal://" : "az://${local.account_name}/${local.queue_name}"
+  blob_container = var.external_bucket == null ? coalesce(var.blob_container, local.container_name) : ""
+  storage_account = var.external_bucket == null ? coalesce(var.storage_account, local.account_name) : ""
+  storage_key = var.external_bucket == null ? coalesce(var.storage_key, local.access_key) : ""
+  bucket = "az://${local.storage_account}/${local.blob_container}"
+  queue = (var.use_internal_queue || var.blob_container == "" || var.external_bucket == null) ? "internal://" : "az://${local.account_name}/${local.queue_name}"

   redis_connection_string = "redis://:${module.redis.instance.primary_access_key}@${module.redis.instance.hostname}:${module.redis.instance.port}"
 }

 locals {
-  service_account_name = "wandb-serviceaccount"
+  service_account_name = "wandb-app"
 }

 resource "azurerm_federated_identity_credential" "app" {
@@ -133,62 +133,6 @@ resource "azurerm_federated_identity_credential" "app" {
   subject = "system:serviceaccount:default:${local.service_account_name}"
 }

-data "azurerm_client_config" "current" {}
-
-module "aks_app" {
-  source = "wandb/wandb/kubernetes"
-  version = "1.14.0"
-
-  service_account_name = local.service_account_name
-  service_account_annotations = {
-    "azure.workload.identity/client-id" = module.identity.identity.client_id
-  }
-
-  deployment_pod_labels = {
-    "azure.workload.identity/use" = "true"
-  }
-
-  service_account_labels = {
-    "azure.workload.identity/use" = "true"
-  }
-
-  license = var.license
-
-  host = local.url
-  bucket = local.bucket
-  bucket_queue = local.queue
-  bucket_aws_region = var.external_bucket_region
-  database_connection_string = "mysql://${module.database.connection_string}"
-  redis_connection_string = local.redis_connection_string
-
-  oidc_client_id = var.oidc_client_id
-  oidc_issuer = var.oidc_issuer
-  oidc_auth_method = var.oidc_auth_method
-  oidc_secret = var.oidc_secret
-
-  wandb_image = var.wandb_image
-  wandb_version = var.wandb_version
-
-  other_wandb_env = merge(var.other_wandb_env, {
-    "AZURE_STORAGE_KEY" = local.storage_key
-    "AZURE_STORAGE_ACCOUNT" = local.redis_connection_string,
-    "GORILLA_CUSTOMER_SECRET_STORE_AZ_CONFIG_VAULT_URI" = module.vault.vault.vault_uri,
-    "GORILLA_CUSTOMER_SECRET_STORE_SOURCE" = "az-secretmanager://wandb",
-  })
-
-  resource_limits = var.resource_limits
-  resource_requests = var.resource_requests
-
-  # If we dont wait, tf will start trying to deploy while the work group is
-  # still spinning up
-  depends_on = [
-    module.database,
-    # module.redis,
-    module.storage,
-    module.app_aks,
-  ]
-}
-
 module "cert_manager" {
   source = "./modules/cert_manager"
   namespace = var.namespace
@@ -201,14 +145,86 @@ module "cert_manager" {
   depends_on = [module.app_aks]
 }

-module "app_ingress" {
-  source = "./modules/app_ingress"
-  fqdn = local.fqdn
-  namespace = var.namespace
+module "wandb" {
+  source = "wandb/wandb/helm"
+  version = "1.2.0"

   depends_on = [
-    module.aks_app,
-    module.cert_manager,
+    module.app_aks,
+    module.cert_manager,
+    module.database,
+    module.storage,
   ]
+  operator_chart_version = "1.1.0"
+  controller_image_tag = "1.10.1"
+
+  spec = {
+    values = {
+      global = {
+        host = local.url
+        license = var.license
+
+        bucket = var.external_bucket == null ? {
+          provider = "az"
+          name = local.storage_account
+          path = local.blob_container
+          accessKey = local.storage_key
+        } : var.external_bucket
+
+        mysql = {
+          host = module.database.address
+          database = module.database.database_name
+          user = module.database.username
+          password = module.database.password
+          port = 3306
+        }
+
+        redis = {
+          host = module.redis.instance.hostname
+          password = module.redis.instance.primary_access_key
+          port = module.redis.instance.port
+        }
+      }
+
+      app = {
+        extraEnv = {
+          "GORILLA_CUSTOMER_SECRET_STORE_AZ_CONFIG_VAULT_URI" = module.vault.vault.vault_uri,
+          "GORILLA_CUSTOMER_SECRET_STORE_SOURCE" = "az-secretmanager://wandb",
+        }
+        pod = {
+          labels = { "azure.workload.identity/use" = "true" }
+        }
+        serviceAccount = {
+          name = local.service_account_name
+          annotations = { "azure.workload.identity/client-id" = module.identity.identity.client_id }
+          labels = { "azure.workload.identity/use" = "true" }
+        }
+      }
+
+      ingress = {
+        // TODO: For now we will use the existing issuer. We can move this into
+        // the operator after testing. Trying to reduce the diff.
+        issuer = { create = false }
+
+        annotations = {
+          "kubernetes.io/ingress.class" = "azure/application-gateway"
+          "cert-manager.io/cluster-issuer" = "cert-issuer"
+          "cert-manager.io/acme-challenge-type" = "http01"
+        }
+
+        tls = [
+          { hosts = [trimprefix(trimprefix(local.url, "https://"), "http://")], secretName = "wandb-ssl-cert" }
+        ]
+      }
+
+      weave = {
+        persistence = {
+          provider = "azurefile"
+        }
+      }
+
+      mysql = { install = false }
+      redis = { install = false }
+    }
+  }
 }
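Note on the `bucket` value in `module "wandb"`: when `var.external_bucket` is `null`, the module builds an Azure bucket object from its own storage locals; otherwise it forwards `var.external_bucket` to the chart values unchanged. The following Python sketch mirrors that conditional for illustration only. The function name `resolve_bucket` is hypothetical, and the diff does not show what keys an external bucket object needs, so the example passes it through untouched.

```python
from typing import Any, Dict, Optional


def resolve_bucket(
    external_bucket: Optional[Dict[str, Any]],
    storage_account: str,
    blob_container: str,
    storage_key: str,
) -> Dict[str, Any]:
    """Mirror the Terraform conditional used for spec.values.global.bucket."""
    if external_bucket is None:
        # No external bucket configured: describe the Azure storage container.
        return {
            "provider": "az",
            "name": storage_account,
            "path": blob_container,
            "accessKey": storage_key,
        }
    # An external bucket object is forwarded as-is, without validation.
    return external_bucket
```

Because `external_bucket` is now typed `any` with a `null` default, callers that previously passed a string such as `""` would no longer match the `== null` checks, which is consistent with the breaking-change marker (`feat!`) in the PR title.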
8 changes: 2 additions & 6 deletions outputs.tf
@@ -30,13 +30,9 @@ output "cluster_ca_certificate" {
 }

 output "storage_account" {
-  value = var.external_bucket != "" ? "" : coalesce(var.storage_account, local.account_name, "")
+  value = var.external_bucket != null ? "" : coalesce(var.storage_account, local.account_name, "")
 }

 output "storage_container" {
-  value = coalesce(var.external_bucket, var.blob_container, local.container_name)
-}
-
-output "external_bucket" {
-  value = var.external_bucket != "" ? var.external_bucket : ""
+  value = var.external_bucket != null ? "" : coalesce(var.blob_container, local.container_name)
 }
32 changes: 4 additions & 28 deletions variables.tf
@@ -156,15 +156,9 @@ variable "storage_key" {
 }

 variable "external_bucket" {
-  description = "String to configure a non Azure bucket (s3://user:pass@bucket)"
-  type = string
-  default = ""
-}
-
-variable "external_bucket_region" {
-  description = "When using external bucket the Region is mandatory"
-  type = string
-  default = ""
+  description = "config an external bucket"
+  type = any
+  default = null
 }

 ##########################################
@@ -179,22 +173,4 @@ variable "kubernetes_instance_type" {
 variable "kubernetes_node_count" {
   default = 2
   type = number
-}
-
-variable "resource_limits" {
-  description = "Specifies the resource limits for the wandb deployment"
-  type = map(string)
-  default = {
-    cpu = null
-    memory = null
-  }
-}
-
-variable "resource_requests" {
-  description = "Specifies the resource requests for the wandb deployment"
-  type = map(string)
-  default = {
-    cpu = "2000m"
-    memory = "2G"
-  }
-}
+}