Hawk

Hawk was published at the 16th IEEE International Conference on Cloud Computing (IEEE CLOUD 2023). The paper is available at: https://arxiv.org/abs/2306.02496

BibTeX citation:

@misc{grünewald2023hawk,
      title={Hawk: DevOps-driven Transparency and Accountability in Cloud Native Systems}, 
      author={Elias Grünewald and Jannis Kiesel and Siar-Remzi Akbayin and Frank Pallas},
      year={2023},
      eprint={2306.02496},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}

Overview

The Hawk Framework tracks the data flow between applications and allows GDPR-related tags to be added to the data references. It also features an analytical dashboard for the GDPR-related information and an integration that makes the ratio of GDPR-tagged data usable in, e.g., Flagger canary releases.

Concept and Goal

Concept

The concept for achieving this goal is to intercept the traffic between the individual applications / services. This idea is called Hawk Core. Interception can be done either inside the application using (A) Framework Integration or outside the application using (B) Service Mesh Integration, if available. The Framework Integration allows interacting with the Hawk API directly inside the service and makes it possible to intercept encrypted and external traffic, but the application itself must be modified. The Service Mesh solution can be installed without modifying any application. Both solutions can be active in parallel. Currently, the only integrations are the EnvoyProxy / Istio Service Mesh Integration and the Java Framework Integration, both for HTTP and JSON bodies only.

When a packet is intercepted, it is parsed according to the protocol used. The parser searches for possible custom data / personal data or, more concretely, for atomic data values of type string or number. A User Email is one example of such a value (the whole User object is not). The idea is to build and save a selector for each individual atomic data field. This selector includes:

  • the destination host and an endpoint abstraction (for HTTP, the method and the path)
  • the phase, which is either request or response
  • the namespace of the data, which for HTTP is header or body
  • the format, which describes whether the data was found in a key-value based format or in a more complex format like JSON
  • the path, which is protocol and format dependent and describes where the data lies inside the packet

When implemented correctly, these values should provide a protocol-independent and context-aware selector. Using the selector, it is also possible to find / track data in other packets with the same endpoint.

To reduce size, many of these selectors can be aggregated. One example is a list of users: we do not need a selector for each individual User Email; instead, we only need a reference to the array and the path of the value inside each array entry, e.g. $.users.[0].email, $.users.[1].email, ... -> $.users.[*].email. This aggregated selector is called a UsageField. Each parsed packet may yield a list of UsageFields. This list is tagged with some metadata and represents one Usage object.
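The selector construction and aggregation described above can be sketched in Python. This is only an illustration under stated assumptions, not the Hawk implementation: the names Selector, atomic_paths and usage_fields, the example host and the example body are all hypothetical, and JSON keys that themselves contain dots or bracketed indices are not handled.

```python
import re
from dataclasses import dataclass


# Hypothetical structure for illustration; not the Hawk Service schema.
@dataclass(frozen=True)
class Selector:
    host: str       # destination host
    method: str     # HTTP method (endpoint abstraction, part 1)
    endpoint: str   # HTTP path (endpoint abstraction, part 2)
    phase: str      # "request" or "response"
    namespace: str  # "header" or "body"
    format: str     # e.g. "json"
    path: str       # where the value lies inside the packet


def atomic_paths(value, path="$"):
    """Yield the path of every atomic string or number in a parsed JSON body."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from atomic_paths(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from atomic_paths(child, f"{path}.[{index}]")
    elif isinstance(value, (str, int, float)) and not isinstance(value, bool):
        yield path  # booleans and null are skipped: only strings and numbers count


_ARRAY_INDEX = re.compile(r"\.\[\d+\]")


def usage_fields(host, method, endpoint, phase, body):
    """Aggregate per-value selectors by replacing concrete array indices with [*]."""
    aggregated = {}
    for path in atomic_paths(body):
        wildcard = _ARRAY_INDEX.sub(".[*]", path)
        selector = Selector(host, method, endpoint, phase, "body", "json", wildcard)
        aggregated[selector] = aggregated.get(selector, 0) + 1
    return aggregated  # maps each aggregated selector to the number of values it covers


body = {"users": [{"email": "a@example.org"}, {"email": "b@example.org"}], "total": 2}
fields = usage_fields("user-service", "GET", "/users", "response", body)
paths = sorted(selector.path for selector in fields)
```

Here the two e-mail values collapse into one aggregated selector, so the list of selectors stays the same size no matter how many users the response contains.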

GDPR-relevant data is added using Fields and Mappings. A Field again represents one atomic data unit, like a User Email. We can also add a description, legal bases, whether it is personal data / special categories of personal data, and further descriptive information. The next component is the Mapping, which can be created at most once per endpoint. A Mapping specifies a list of MappingFields, where each individual MappingField represents a mapping between a Field and a UsageField. When every endpoint is mapped accordingly, it is possible, for example, to see from where and when a User Email is sent to which other application / service, and together with which other data.
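The relation between Fields, Mappings and UsageFields might be modelled as follows. This is a minimal sketch with hypothetical class names, attribute names and example values; the actual entities exposed by the Hawk Service REST API may look different.

```python
from dataclasses import dataclass, field


# All names below are assumptions made for illustration only.
@dataclass(frozen=True)
class Field:
    name: str
    description: str = ""
    personal_data: bool = False
    special_category: bool = False
    legal_bases: tuple = ()


@dataclass(frozen=True)
class MappingField:
    field_name: str  # references Field.name
    phase: str       # "request" or "response"
    namespace: str   # "header" or "body"
    path: str        # path of the UsageField, e.g. "$.users.[*].email"


@dataclass
class Mapping:
    host: str
    method: str
    endpoint: str
    mapping_fields: list = field(default_factory=list)


def fields_in_usage(mapping, usage):
    """Return the names of all Fields that the mapping finds in one Usage."""
    same_endpoint = (usage["host"], usage["method"], usage["endpoint"]) == (
        mapping.host, mapping.method, mapping.endpoint)
    if not same_endpoint:
        return []
    seen = {(u["phase"], u["namespace"], u["path"]) for u in usage["usage_fields"]}
    return [m.field_name for m in mapping.mapping_fields
            if (m.phase, m.namespace, m.path) in seen]


email = Field("user.email", "E-mail address of a user", personal_data=True)
mapping = Mapping("user-service", "GET", "/users", [
    MappingField(email.name, "response", "body", "$.users.[*].email"),
])
usage = {
    "host": "user-service", "method": "GET", "endpoint": "/users",
    "usage_fields": [
        {"phase": "response", "namespace": "body", "path": "$.users.[*].email"},
        {"phase": "response", "namespace": "body", "path": "$.total"},
    ],
}
found = fields_in_usage(mapping, usage)
```

In this example the unmapped UsageField $.total is ignored, while the e-mail path resolves to the user.email Field.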

The Hawk Service is the central component for all of these entities, as all integrations submit their Usages to it. Mappings and Fields can also be created here via the REST API. The Hawk Service is stateless and allows for horizontal scaling. PostgreSQL can be used as the database, but e.g. YugabyteDB or CockroachDB are also possible, which makes the whole Hawk Framework scalable. The Hawk Service also serves as the base for Hawk Release, which accesses its metrics. These metrics include, e.g., how many Usages were collected and how many of those endpoints have a Mapping. To visualize the collected data, we can use Hawk Core Monitor. It contains a UI for quickly creating Fields and Mappings and listing them clearly, as well as a Grafana dashboard that visualizes and summarizes the collected data. Both of these components use the Hawk Service as a backend.

The last component is Hawk Build, a GitHub Action that notifies you when the API of a service changes. These changes can then be updated in the Hawk Core Monitor interface. Hawk Release can continuously validate the coverage of mapped endpoints to prevent deploying unmapped endpoints.
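The coverage check that Hawk Release performs can be illustrated with a small calculation. The sketch below is an assumption about how such a gate could work, not the actual metric or Flagger configuration: the function names, the endpoint tuples and the threshold parameter are hypothetical.

```python
def mapping_coverage(observed_endpoints, mapped_endpoints):
    """Return the share of observed endpoints that have a Mapping (0.0 to 1.0)."""
    observed = set(observed_endpoints)
    if not observed:
        return 1.0  # nothing observed, so nothing is unmapped
    return len(observed & set(mapped_endpoints)) / len(observed)


def release_allowed(observed_endpoints, mapped_endpoints, threshold=1.0):
    """Allow a release only if the mapping coverage reaches the threshold."""
    return mapping_coverage(observed_endpoints, mapped_endpoints) >= threshold


# Hypothetical endpoints, identified by (method, path).
observed = [("GET", "/users"), ("POST", "/users"), ("GET", "/orders")]
mapped = [("GET", "/users"), ("GET", "/orders")]

coverage = mapping_coverage(observed, mapped)  # two of three endpoints are mapped
allowed = release_allowed(observed, mapped)    # blocked with the default threshold
```

With the default threshold of 1.0, a single unmapped endpoint is enough to block the release; a lower threshold would tolerate partial coverage.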

Goal

The Hawk Framework helps businesses comply with the GDPR and thereby avoid fines. The data protection officer can use this software to stay up to date on privacy-related information and adjust the privacy policy accordingly.

Quickstart

Deployment through Helm

  1. Add the helm chart repository:
    helm repo add hawk https://privacyengineering.github.io/hawk-helm-charts/
    
  2. Modify values in values.yaml to your needs.
  3. Install hawk core and all its services:
    helm dependency update
    helm install hawk hawk/hawk --namespace hawk --create-namespace
    
  4. Access the hawk-core-monitor and hawk-service via ingress:
    kubectl get ingress -n hawk
    
  5. Add an integration to the hawk framework (see Integrations for more information)

Deployment alternatives

It is also possible to install the applications in a non-Kubernetes environment, or to configure them individually, using their Docker images. The Istio / Envoy integration is only available in specific Kubernetes environments. The Java integration is available in every environment; it needs a connection to the Hawk Service. When possible, the Envoy integration is preferred, as it takes less effort to install. You must choose at least one integration.

Integrations

The Hawk Framework can be extended through integrations. Currently there are two integrations (for HTTP and JSON bodies only):

  • EnvoyProxy / Istio Service Mesh Integration
  • Java Framework Integration

Both integrations communicate with the hawk-service via the exposed REST API.

Hawk Core

The Helm chart (WIP) installs the Hawk Service, a default PostgreSQL database, Hawk Core Monitor (nginx + monitor + Grafana), and the Istio / Envoy integration if selected.

helm repo add hawk https://github.com/PrivacyEngineering/hawk/releases/download/1.0.1
helm install hawk hawk/hawk

Replace 1.0.1 with the newest version of the chart. Alternatively, you can download the hawk-VERSION.tgz archive of the release you want and execute:

helm install hawk ./hawk-VERSION.tgz

See the values.yaml for configuration options.

After installation, the notes generated by the Helm chart describe how to access the Hawk Core Monitor UI.

Docker

If you do not use Helm, the following Docker images are available:

| Name | Image | Description |
| --- | --- | --- |
| Hawk Service | p4skal/hawk-service | Required: Backend for Hawk Core & Hawk Release |
| Hawk Core Monitor | p4skal/hawk-core-monitor | Optional: UI for managing Mappings, Fields (can be imported via JSON directly in the Hawk Service) and visualizing data flow. |

The Hawk Service is simple to run: it uses a PostgreSQL database, and you only need to pass the required environment variables described in Hawk Service.

Hawk Core Monitor is a bit more complicated, as it consists of two components. The first is the Configuration UI. This component needs access to the Hawk Service. By default, it expects the Hawk Service API to be available reverse-proxied on the path. To change that, you can provide an environment variable. The second component is a Grafana instance with specific plugins, datasource and dashboards. See Grafana Deployment and Grafana Config for information on which environment variables and which files to provide. It is recommended to use a reverse proxy to seamlessly connect the two (or three) components. See Nginx Deployment and Nginx Config for information on which environment variables and which files to provide.

Hawk Release

To enable Hawk Release, you have to install Flux and Flagger. You can then configure Flagger to use the metrics via Prometheus; see Hawk Service for more information on which mappings to use. You also need to configure Prometheus to scrape the metrics.

Hawk Build

To enable Hawk Build, you have to install and configure the OpenAPI Privacy Changes Service. Then it is possible to use the OpenAPI Privacy Alert GitHub Action.

Example Deployment

An example using the WeaveWorks SockShop, integrated with some of the Hawk components, can be found here.

Hawk Grafana Dashboard Evaluation

We provide four Grafana dashboards:

  • Dashboard
  • Service Graph
  • Field Details
  • Endpoint Details

A detailed explanation of the dashboards can be found here.

Hawk Monitor Overview Dashboard

Dashboard overview with four panels