From 696e1faa20fe5ac58bdbb269dacdfa3cfde6d762 Mon Sep 17 00:00:00 2001 From: mhorst00 <36167515+mhorst00@users.noreply.github.com> Date: Fri, 3 Nov 2023 16:26:15 +0100 Subject: [PATCH 1/2] Add AMD ROCm support (#557) * Add rocm GPU topology * First rocmon implementation Basic implementation for montioring AMD GPUs with rocprofiler * Move rocm call from addEventSet to setupCounters * Include short_name in topology * Implement more rocmon functions This version does still not produce consistent results? * Implement functions for rocmon marker api * Add macros for rocmon marker api * Add test for rocmon marker api * Add ROCm SMI Backend to Rocmon ROCm SMI provides additional counters and information to rocprofiler. * Fix cut off event names * Add temporary build instructions * Fix markerfile format documentation * Fix device id device index mixup * Fix same variable name for rocmon and nvmon topology Change rocmon topology variable name to avoid conflicts with nvmon and allow builds with both nvmon and rocmon. * Integrate nvml library into nvmon The NVIDIA Management Library (NVML) allows measuring of more statistics like power usage. In addition to the existing events, events from the NVML library can now be measured with LIKWID. * Fix return types nvml_getResult and nvml_getLastResult incorrectly returned int instead of double. * Fix gpu markers for nvml Markerfile now contains average value for nvml events. * Refactor result update Put updating of result struct after measurement in dedicated function. * Fix wrong function call Called nvml_getResult instead of nvml_getLastResult in nvmon_getLastResult. * Simplify SMI event wrappers * Fix filter for rocmon in Makefile * Add timeline mode for GPUs using AppDaemon * Fix appDaemon linker errors * Add last value to output file * Fix marker API for SMI events Return accumulated values for ROCm SMI events, not accumulated difference. * Fix disparities between rocmon marker Let user calculate average * Adjust tests for benchmarking * Fix dllink issues in rocmon_init * Add macros for ROCM Debugging * Add function to resolve GPUstr for ROCM * Update ROCMon and ROCMon marker * Changes to the likwid header * Add ROCmon to likwid-perfctr * Add example groups * Rename symbol HSA_VEN_AMD_AQLPROFILE_LEGACY_PM4_PACKET_SIZE to avoid collision. 
* adjusted for ROCm 5.4 * fixed AMD multi gpu issues * Change rocm metrics.xml path to new directory spec * Include likwid libs in LD_LIBRARY_PATH at runtime * Add more groups for AMD GPUs * Fix AMD rocm performance group metrics * Adjust ROCm library path to new path structure * Enable appDaemon to print timeline measurements to stderr * Add GPU timeline support to likwid-perfctr * Leave event_string_list empty if cpu perf group is not defined * Fix typo in appDaemon environment variable * Handle permission error for Rocm Marker API file * Fix library environments for Rocm * Modify conditions to allow for Rocm timeline support * Fix make config to allow App Daemon Build for Rocm * appDaemon: search for libraries in build directory * Delete previous Smi events in rocmon_setupCounters * Add power group to AMD GPU * Set previous numSmiEvents to 0 in rocmon_setupCounters * Fix amd_gpu POWER group * Add likwid library to library path for nvidia GPUs * Fix perfworks API for cuda versions >=11.2 * likwid-perfctr: Fix list events and counters for Nvidia GPUs * Add backwards compatibility for ROCm metrics path * access-daemon Makefile: Only include liblikwid in appDaemon target * likwid-perfctr: remove version number from rocprofiler64 library * rocmon: implement workaround for rocprofiler_iterate_info bug in ROCm 5.4.0 * likwid-perfctr: Use INSTALLED_LIBPREFIX for library path * make more than 1 metric usable in timeline rocm * fix wrong time readings in timeline mode * update doxygen for AMD GPU support * fix import order for PciDeviceId errors while compiling --------- Co-authored-by: Marcel Marquardt Co-authored-by: Karlo Kraljic Co-authored-by: Thomas Roehl Co-authored-by: Sebastian Schnorbus Co-authored-by: Thomas Gruber --- Makefile | 13 + README_ROCM.md | 28 + config.mk | 14 + doc/applications/likwid-perfctr.md | 16 +- doc/likwid-doxygen.md | 7 +- doc/likwid-perfctr.1 | 23 +- doc/likwid-topology.1 | 8 +- groups/amd_gpu/GDS.txt | 13 + groups/amd_gpu/MEM.txt | 16 + groups/amd_gpu/PCI.txt | 18 + groups/amd_gpu/POWER.txt | 17 + groups/amd_gpu/SALU.txt | 13 + groups/amd_gpu/SFETCH.txt | 13 + groups/amd_gpu/STALLED.txt | 17 + groups/amd_gpu/UTIL.txt | 16 + groups/amd_gpu/VALU.txt | 13 + groups/amd_gpu/WAVE.txt | 13 + make/config_checks.mk | 5 + make/config_defines.mk | 6 +- src/access-daemon/Makefile | 10 +- src/access-daemon/appDaemon.c | 573 ++- src/applications/likwid-perfctr.lua | 726 ++- src/applications/likwid-topology.lua | 59 + src/applications/likwid.lua | 234 + src/cpustring.c | 62 +- src/includes/error.h | 11 + src/includes/likwid-marker.h | 67 + src/includes/likwid.h | 1733 ++++--- src/includes/nvmon_nvml.h | 33 +- src/includes/nvmon_perfworks.h | 3485 +++++++------- src/includes/nvmon_types.h | 12 + src/includes/rocmon_types.h | 143 + src/libnvctr.c | 921 ++-- src/luawid.c | 6326 +++++++++++++------------- src/nvmon.c | 444 +- src/nvmon_nvml.c | 1382 ++++++ src/rocmon.c | 2275 +++++++++ src/rocmon_marker.c | 1076 +++++ src/topology_gpu_rocm.c | 273 ++ test/Makefile | 12 +- test/test-rocmon-triad-marker.cpp | 161 + test/test-rocmon-triad.cpp | 182 + test/test-topology-gpu-rocm.c | 62 + test/triad.cu | 120 +- 44 files changed, 14527 insertions(+), 6124 deletions(-) create mode 100644 README_ROCM.md create mode 100644 groups/amd_gpu/GDS.txt create mode 100644 groups/amd_gpu/MEM.txt create mode 100644 groups/amd_gpu/PCI.txt create mode 100644 groups/amd_gpu/POWER.txt create mode 100644 groups/amd_gpu/SALU.txt create mode 100644 groups/amd_gpu/SFETCH.txt create mode 100644 
groups/amd_gpu/STALLED.txt create mode 100644 groups/amd_gpu/UTIL.txt create mode 100644 groups/amd_gpu/VALU.txt create mode 100644 groups/amd_gpu/WAVE.txt create mode 100644 src/includes/rocmon_types.h create mode 100644 src/nvmon_nvml.c create mode 100644 src/rocmon.c create mode 100644 src/rocmon_marker.c create mode 100644 src/topology_gpu_rocm.c create mode 100644 test/test-rocmon-triad-marker.cpp create mode 100644 test/test-rocmon-triad.cpp create mode 100644 test/test-topology-gpu-rocm.c diff --git a/Makefile b/Makefile index e133bdb7e..e7fed144e 100644 --- a/Makefile +++ b/Makefile @@ -122,9 +122,15 @@ OBJ := $(filter-out $(BUILD_DIR)/loadDataARM.o,$(OBJ)) endif ifneq ($(NVIDIA_INTERFACE), true) OBJ := $(filter-out $(BUILD_DIR)/nvmon.o,$(OBJ)) +OBJ := $(filter-out $(BUILD_DIR)/nvmon_nvml.o,$(OBJ)) OBJ := $(filter-out $(BUILD_DIR)/topology_gpu.o,$(OBJ)) OBJ := $(filter-out $(BUILD_DIR)/libnvctr.o,$(OBJ)) endif +ifneq ($(ROCM_INTERFACE), true) +OBJ := $(filter-out $(BUILD_DIR)/rocmon.o,$(OBJ)) +OBJ := $(filter-out $(BUILD_DIR)/rocmon-marker.o,$(OBJ)) +OBJ := $(filter-out $(BUILD_DIR)/topology_gpu_rocm.o,$(OBJ)) +endif ifeq ($(COMPILER),GCCPOWER) OBJ := $(filter-out $(BUILD_DIR)/topology_cpuid.o,$(OBJ)) OBJ := $(filter-out $(BUILD_DIR)/access_x86.o,$(OBJ)) @@ -195,6 +201,7 @@ $(L_APPS): $(addprefix $(SRC_DIR)/applications/,$(addsuffix .lua,$(L_APPS))) @echo "===> ADJUSTING $@" @if [ "$(ACCESSMODE)" = "direct" ]; then sed -i -e s/"access_mode = 1"/"access_mode = 0"/g $(SRC_DIR)/applications/$@.lua;fi @sed -e s/''/$(subst /,\\/,$(INSTALLED_BINPREFIX))/g \ + -e s/''/$(subst /,\\/,$(INSTALLED_LIBPREFIX))/g \ -e s/''/$(subst /,\\/,$(INSTALLED_PREFIX))/g \ -e s/''/$(VERSION).$(RELEASE).$(MINOR)/g \ -e s/''/$(DATE)/g \ @@ -236,6 +243,7 @@ $(DYNAMIC_TARGET_LIB): $(BUILD_DIR) $(PERFMONHEADERS) $(OBJ) $(TARGET_HWLOC_LIB) @ln -sf $(TARGET_LIB) $(TARGET_LIB).$(VERSION).$(RELEASE) @sed -e s+'@PREFIX@'+$(INSTALLED_PREFIX)+g \ -e s+'@NVIDIA_INTERFACE@'+$(NVIDIA_INTERFACE)+g \ + -e s+'@ROCM_INTERFACE@'+$(ROCM_INTERFACE)+g \ -e s+'@FORTRAN_INTERFACE@'+$(FORTRAN_INTERFACE)+g \ -e s+'@LIBPREFIX@'+$(INSTALLED_LIBPREFIX)+g \ -e s+'@BINPREFIX@'+$(INSTALLED_BINPREFIX)+g \ @@ -303,6 +311,11 @@ $(BUILD_DIR)/%.o: %.c $(Q)$(CC) -c $(DEBUG_FLAGS) $(CFLAGS) $(ANSI_CFLAGS) $(CPPFLAGS) $< -o $@ $(Q)$(CC) $(DEBUG_FLAGS) $(CPPFLAGS) -MT $(@:.d=.o) -MM $< > $(BUILD_DIR)/$*.d +$(BUILD_DIR)/rocmon_marker.o: rocmon_marker.c + @echo "===> COMPILE $@" + $(Q)$(CC) -c $(DEBUG_FLAGS) $(CFLAGS) $(ANSI_CFLAGS) $(CPPFLAGS) $< -o $@ + $(Q)objcopy --redefine-sym HSA_VEN_AMD_AQLPROFILE_LEGACY_PM4_PACKET_SIZE=HSA_VEN_AMD_AQLPROFILE_LEGACY_PM4_PACKET_SIZE2 $@ + $(BUILD_DIR)/%.o: %.cc @echo "===> COMPILE $@" $(Q)$(CXX) -c $(DEBUG_FLAGS) $(CXXFLAGS) $(CPPFLAGS) $< -o $@ diff --git a/README_ROCM.md b/README_ROCM.md new file mode 100644 index 000000000..3553f841e --- /dev/null +++ b/README_ROCM.md @@ -0,0 +1,28 @@ +## Build & Install + +```bash +export ROCM_HOME=/opt/rocm +make +make install +``` + +## Test + +Build + +```bash +cd test +# make clean +make test-topology-gpu-rocm +make test-rocmon-triad +make test-rocmon-triad-marker +``` + +Run + +```bash +export LD_LIBRARY_PATH=/home/users/kraljic/likwid-rocmon/install/lib:/opt/rocm/hip/lib:/opt/rocm/hsa/lib:/opt/rocm/rocprofiler/lib:$LD_LIBRARY_PATH +export ROCP_METRICS=/opt/rocm/rocprofiler/lib/metrics.xml # for rocmon test +export HSA_TOOLS_LIB=librocprofiler64.so.1 # allows rocmon to intercept hsa commands +./gpu-test-topology-gpu-rocm +``` diff --git a/config.mk b/config.mk index 
32633a9e1..dd3be85a9 100644 --- a/config.mk +++ b/config.mk @@ -30,6 +30,10 @@ INSTRUMENT_BENCH = true#NO SPACE # For configuring include paths, go to CUDA section NVIDIA_INTERFACE = false#NO SPACE +# Build LIKWID with AMD GPU interface (ROCm) +# For configuring include paths, go to ROCm section +ROCM_INTERFACE = false#NO SPACE + ################################################################# ################################################################# # Advanced configuration options # @@ -172,3 +176,13 @@ CUPTIINCLUDE = $(CUDA_HOME)/extras/CUPTI/include # In order to hook into the CUDA application, the appDaemon is required # If you just want the NvMarkerAPI, you can keep it false BUILDAPPDAEMON=false + +# ROCm build data +# LIKWID requires ROCm to be present only for compilation with +# ROCM_INTERFACE=true. At runtime, the ROCm library have +# to be in the LD_LIBRARY_PATH to dynamically load the libraries. +# Include directory for ROCm headers +HSAINCLUDE = $(ROCM_HOME)/include +ROCPROFILERINCLUDE = $(ROCM_HOME)/include/rocprofiler +HIPINCLUDE = $(ROCM_HOME)/include +RSMIINCLUDE = $(ROCM_HOME)/include diff --git a/doc/applications/likwid-perfctr.md b/doc/applications/likwid-perfctr.md index 25077c39a..3f3cdd244 100644 --- a/doc/applications/likwid-perfctr.md +++ b/doc/applications/likwid-perfctr.md @@ -56,7 +56,11 @@ custom event sets. The \ref Marker_API can measure mulitple named regions and th -W, --gpugroup <arg> - Specify which event string or performance group should be measured on the GPUs. Only if built with NVIDIA_INTERFACE=true. + Specify which event string or performance group should be measured on the Nvidia GPUs. Only if built with NVIDIA_INTERFACE=true. + + + -R <arg> + Specify which event string or performance group should be measured on the AMD GPUs. Only if built with ROCM_INTERFACE=true. -c <arg> @@ -68,7 +72,11 @@ custom event sets. The \ref Marker_API can measure mulitple named regions and th -G <arg> - Defines the GPUs that should be measured
You can use simple lists like 0,1,3 or ranges like 0-2. Only if built with NVIDIA_INTERFACE=true. + Defines the Nvidia GPUs that should be measured
You can use simple lists like 0,1,3 or ranges like 0-2. Only if built with NVIDIA_INTERFACE=true. + + + -I <arg> + Defines the AMD GPUs that should be measured
You can use simple lists like 0,1,3 or ranges like 0-2. Only if built with ROCM_INTERFACE=true. -H @@ -274,6 +282,8 @@ The LIKWID package contains an example code: see \ref F-markerAPI-code. Since the calls to the LIKWID library are executed by your application, the runtime will raise and in specific circumstances, there are some other problems like the time measurement. You can execute LIKWID_MARKER_THREADINIT and LIKWID_MARKER_START inside the same parallel region but put a barrier between the calls to ensure that there is no big timing difference between the threads. The common way is to init LIKWID and the participating threads inside of an initialization routine, use only START and STOP in your code and close the Marker API in a finalization routine. Be aware that at the first start of a region, the thread-local hash table gets a new entry to store the measured values. If your code inside the region is short or you are executing the region only once, the overhead of creating the hash table entry can be significant compared to the execution of the region code. The overhead of creating the hash tables can be done in prior by using the LIKWID_MARKER_REGISTER function. It must be called by each thread and one time for each compute region. It is completely optional, LIKWID_MARKER_START performs the same operations.
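To make the initialize/register/start-stop/close sequence described above concrete, here is a minimal sketch in C; it assumes an OpenMP build with -DLIKWID_PERFMON and -llikwid, and the work inside the region is only a placeholder loop:

```c
#include <likwid-marker.h>

int main(void)
{
    double sum = 0.0;

    LIKWID_MARKER_INIT;                     /* once, in the initialization routine */
    #pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;           /* each participating thread */
        LIKWID_MARKER_REGISTER("compute");  /* optional: pay the hash-table cost up front */
    }

    #pragma omp parallel reduction(+:sum)
    {
        LIKWID_MARKER_START("compute");
        for (long i = 0; i < 100000000L; i++)   /* placeholder work */
            sum += 0.5 * i;
        LIKWID_MARKER_STOP("compute");
    }

    LIKWID_MARKER_CLOSE;                    /* once, in the finalization routine */
    return sum > 0.0 ? 0 : 1;
}
```

Run it under likwid-perfctr in marker mode so the per-region results are collected, e.g. likwid-perfctr -C 0-3 -g FLOPS_DP -m ./a.out.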

CUDA code

-With LIKWID 5.0 CUDA kernels can be measured. There is a special NvMarkerAPI for Nvidia GPUs. The usage is similar to the CPU MarkerAPI, just replace LIKWID_MARKER_ with LIKWID_NVMARKER_. The two MarkerAPIs can be mixed. +With LIKWID 5.0 CUDA kernels can be measured. There is a special NvMarkerAPI for Nvidia GPUs. The usage is similar to the CPU MarkerAPI, just replace LIKWID_MARKER_ with LIKWID_NVMARKER_. All MarkerAPIs can be mixed. +
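As a rough illustration of the prefix substitution, a hedged sketch of the Nvidia variant is shown below; the ROCm variant in the next section follows the same pattern with the ROCMON_MARKER_ prefix. The launch_kernel() stub stands in for the user's CUDA (or HIP) kernel launch, and the activation define is assumed to be -DLIKWID_NVMON for Nvidia, with a ROCm equivalent mirroring it:

```c
#include <likwid-marker.h>

static void launch_kernel(void)
{
    /* stand-in for the user's CUDA (or HIP) kernel launch */
}

int main(void)
{
    LIKWID_NVMARKER_INIT;               /* ROCm: ROCMON_MARKER_INIT */

    LIKWID_NVMARKER_START("triad");     /* ROCm: ROCMON_MARKER_START("triad") */
    launch_kernel();
    LIKWID_NVMARKER_STOP("triad");      /* ROCm: ROCMON_MARKER_STOP("triad") */

    LIKWID_NVMARKER_CLOSE;              /* ROCm: ROCMON_MARKER_CLOSE */
    return 0;
}
```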

ROCm code

+ROCm kernels can be measured. There is a special RocmonMarkerAPI for AMD GPUs. The usage is similar to the CPU or Nvidia MarkerAPI, just replace LIKWID_MARKER_ with ROCMON_MARKER_. All MarkerAPIs can be mixed. */ diff --git a/doc/likwid-doxygen.md b/doc/likwid-doxygen.md index b7788df66..2a4f96305 100644 --- a/doc/likwid-doxygen.md +++ b/doc/likwid-doxygen.md @@ -1,7 +1,7 @@ /*! \mainpage LIKWID - Like I Knew What I Am Doing \section Introduction -This is an effort to develop easy to use but yet powerful performance tools for the GNU Linux operating system. While the focus of LIKWID was on x86 processors, it is now ported to ARM and POWER processors. A backend for Nvidia GPUs is part of LIKWID with version 5.0.
+This is an effort to develop easy to use yet powerful performance tools for the GNU Linux operating system. While the focus of LIKWID was on x86 processors, it has since been ported to ARM and POWER processors. A backend for Nvidia GPUs has been part of LIKWID since version 5.0. With the Rocmon backend, AMD GPUs can now be monitored as well.
LIKWID follows the philosophy: - Simple @@ -16,7 +16,7 @@ LIKWID follows the philosophy: \section Tools LIKWID Tools - \ref likwid-topology : A tool to display the thread and cache topology on multicore/multisocket computers. - \ref likwid-pin : A tool to pin your threaded application without changing your code. Works for pthreads and OpenMP. -- \ref likwid-perfctr : A tool to measure hardware performance counters on x86, ARM and POWER processors as well as Nvidia GPUs. It can be used as wrapper application without modifying the profiled code or with a marker API to measure only parts of the code. +- \ref likwid-perfctr : A tool to measure hardware performance counters on x86, ARM and POWER processors as well as Nvidia/AMD GPUs. It can be used as wrapper application without modifying the profiled code or with a marker API to measure only parts of the code. - \ref likwid-powermeter : A tool for accessing RAPL counters and query Turbo mode steps on Intel processor. RAPL counters are also available in \ref likwid-perfctr. - \ref likwid-setFrequencies : A tool to print and manage the clock frequency of CPU hardware threads and the Uncore (Intel only). - \ref likwid-memsweeper : A tool to cleanup ccNUMA domains and LLC caches to get a clean environment for benchmarks. @@ -133,6 +133,9 @@ Optionally, a global configuration file \ref likwid.cfg can be given to modify s - For compute capability < 7.0: support based on CUPTI Events API - For compute capability >= 7.0: support based on CUpti Profiling API +\subsection Architectures_AMD AMD GPU architectures +- ROCm 5.0 and higher capable GPUs + \section Examples Example Codes Using the Likwid API: - \ref C-likwidAPI-code diff --git a/doc/likwid-perfctr.1 b/doc/likwid-perfctr.1 index c5343322f..3312d6d74 100644 --- a/doc/likwid-perfctr.1 +++ b/doc/likwid-perfctr.1 @@ -1,6 +1,6 @@ .TH LIKWID-PERFCTR 1 likwid\- .SH NAME -likwid-perfctr \- configure and read out hardware performance counters on x86, ARM and POWER CPUs and Nvidia GPUs +likwid-perfctr \- configure and read out hardware performance counters on x86, ARM and POWER CPUs and Nvidia/AMD GPUs .SH SYNOPSIS .B likwid-perfctr .RB [\-vhHmaiefO] @@ -34,6 +34,12 @@ or .IR gpu_performance_group or .IR gpu_performance_event_string (*) ] +.RB [ \-I +.IR gpu_list (**) ] +.RB [ \-R +.IR gpu_performance_group +or +.IR gpu_performance_event_string (**) ] .RB [ \-\-stats ] .SH DESCRIPTION .B likwid-perfctr @@ -44,6 +50,7 @@ There are preconfigured performance groups with useful event sets and derived me events can be measured with custom event sets. The marker API can measure mulitple named regions and the results are accumulated over multiple region calls. .IR (*) Option only available if built with Nvidia GPU support +.IR (**) Option only available if built with AMD GPU support .SH OPTIONS .TP @@ -66,7 +73,7 @@ run in marker API mode print available performance groups for current processor, then exit. .TP .B \-\^e -print available counters and performance events of current processor and (if available) Nvidia GPUs. +print available counters and performance events of current processor and (if available) Nvidia or AMD GPUs. .TP .B \-\^o, \-\-\^output store all ouput to a file instead of stdout. For the filename the following placeholders are supported: @@ -116,7 +123,7 @@ Force writing of registers even if they are in use. Print only events and corresponding counters matching .TP .B \-\^G, \-\-\^gpus -specify a numerical list of GPU IDs. 
The list may contain multiple +specify a numerical list of Nvidia GPU IDs. The list may contain multiple items, separated by comma, and ranges. For example 0,3,9-11. .TP .B \-\^W, \-\-\^gpugroup or @@ -125,6 +132,16 @@ This can be one of the tags output with the -a flag in the GPU section. Also a custom event set can be specified by a comma separated list of events. Each event has the format eventId:GPUx (x=0,1,2,...). You can add as many events to the string until you hit an error. .TP +.B \-\^I, \-\-\^gpus +specify a numerical list of AMD GPU IDs. The list may contain multiple +items, separated by comma, and ranges. For example 0,3,9-11. +.TP +.B \-\^R, \-\-\^gpugroup or +specify which performance group to measure on the specified AMD GPUs. +This can be one of the tags output with the -a flag in the GPU section. +Also a custom event set can be specified by a comma separated list of events. Each event has the format +eventId:GPUx (x=0,1,2,...). You can add as many events to the string until you hit an error. +.TP .B \-\-\^stats Always print statistics table diff --git a/doc/likwid-topology.1 b/doc/likwid-topology.1 index 8ae22b6e2..b804a9bb2 100644 --- a/doc/likwid-topology.1 +++ b/doc/likwid-topology.1 @@ -1,6 +1,6 @@ .TH LIKWID-TOPOLOGY 1 likwid\- .SH NAME -likwid-topology \- print thread, cache, NUMA and Nvidia GPU topology +likwid-topology \- print thread, cache, NUMA and Nvidia/AMD GPU topology .SH SYNOPSIS .B likwid-topology .RB [\-hvgcCG] @@ -11,12 +11,12 @@ likwid-topology \- print thread, cache, NUMA and Nvidia GPU topology .SH DESCRIPTION .B likwid-topology is a command line application to print the thread and cache -topology on multicore x86, ARM and POWER processors and Nvidia GPUs. +topology on multicore x86, ARM and POWER processors and Nvidia/AMD GPUs. Used with mono spaced fonts it can draw the processor topology of a machine in ASCII art. Beyond topology likwid-topology determines the clock of a processor and prints detailed informations about the caches hierarchy. When compiled with NVIDIA_INTERFACE=true in config.mk and the CUDA/CUPTI library reachable -at runtime, likwid-topology prints information about the Nvidia GPUs in the system. +at runtime, likwid-topology prints information about the Nvidia GPUs in the system. The same is possible for AMD GPUs with ROCM_INTERFACE=TRUE and the required ROCm libraries. .SH OPTIONS .TP .B \-h, \-\-\^help @@ -38,7 +38,7 @@ prints detailed information about cache hierarchy measures and output the processor clock. This involves a longer run time of likwid-topology. .TP .B \-G, \-\-\^gpus -prints detailed information about the Nvidia GPUs in the system (if compiled with Nvidia support) +prints detailed information about the Nvidia/AMD GPUs in the system (if compiled with Nvidia or AMD support) .TP .B \-o, \-\-\^output write the output to file instead of stdout. diff --git a/groups/amd_gpu/GDS.txt b/groups/amd_gpu/GDS.txt new file mode 100644 index 000000000..f29639357 --- /dev/null +++ b/groups/amd_gpu/GDS.txt @@ -0,0 +1,13 @@ +SHORT GDS Instructions + +EVENTSET +ROCM0 ROCP_SQ_INSTS_GDS +ROCM1 ROCP_SQ_WAVES + +METRICS +GPU GDS rw insts per work-item ROCM0/ROCM1 + +LONG +-- +The average number of GDS read or GDS write instructions executed +per work item (affected by flow control). 
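The amd_gpu groups in this and the following files are consumed by the new rocmon backend. For orientation, here is a minimal sketch of driving that backend directly from C, mirroring the call sequence used in the appDaemon changes further below; the single GPU id 0, passing the group name "GDS" straight to rocmon_addEventSet, and the prototypes being available through likwid.h are assumptions:

```c
#include <stdio.h>
#include <likwid.h>

int main(void)
{
    int gpus[] = { 0 };   /* assumption: one AMD GPU with logical id 0 */
    int gid = -1;

    if (rocmon_init(1, gpus) < 0)
        return 1;
    /* assumption: a group name is accepted here; otherwise pass the raw event string */
    if (rocmon_addEventSet("GDS", &gid) < 0)
        return 1;

    rocmon_setupCounters(gid);
    rocmon_startCounters();
    /* ... launch the HIP kernels to be measured ... */
    rocmon_stopCounters();

    for (int i = 0; i < rocmon_getNumberOfEvents(gid); i++)
        printf("%s: %f\n", rocmon_getEventName(gid, i),
               rocmon_getResult(0, gid, i));   /* (gpu index, group id, event index) */

    rocmon_finalize();
    return 0;
}
```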
diff --git a/groups/amd_gpu/MEM.txt b/groups/amd_gpu/MEM.txt new file mode 100644 index 000000000..d5e6c5350 --- /dev/null +++ b/groups/amd_gpu/MEM.txt @@ -0,0 +1,16 @@ +SHORT Memory utilization + +EVENTSET +ROCM0 ROCP_TA_TA_BUSY +ROCM1 ROCP_GRBM_GUI_ACTIVE +ROCM2 ROCP_SE_NUM + +METRICS +GPU memory utilization 100*max(ROCM0,16)/ROCM1/ROCM2 + +LONG +-- +The percentage of GPUTime the memory unit is active. The result includes +the stall time (MemUnitStalled). This is measured with all extra fetches +and writes and any cache or memory effects taken into account. +Value range: 0% to 100% (fetch-bound). diff --git a/groups/amd_gpu/PCI.txt b/groups/amd_gpu/PCI.txt new file mode 100644 index 000000000..201f4ff89 --- /dev/null +++ b/groups/amd_gpu/PCI.txt @@ -0,0 +1,18 @@ +SHORT PCI Transfers + +EVENTSET +ROCM0 RSMI_PCI_THROUGHPUT_SENT +ROCM1 RSMI_PCI_THROUGHPUT_RECEIVED + + +METRICS +Runtime time +PCI sent ROCM0 +PCI received ROCM1 +PCI send bandwidth 1E-6*ROCM0/time +PCI recv bandwidth 1E-6*ROCM1/time + +LONG +-- +Currently not usable since the RSMI_PCI_THROUGHPUT_* events require +one second per call, so 2 seconds for both of them. diff --git a/groups/amd_gpu/POWER.txt b/groups/amd_gpu/POWER.txt new file mode 100644 index 000000000..e4ee0a7bb --- /dev/null +++ b/groups/amd_gpu/POWER.txt @@ -0,0 +1,17 @@ +SHORT Power, temperature and voltage + +EVENTSET +ROCM0 RSMI_POWER_AVE[0] +ROCM1 RSMI_TEMP_EDGE +ROCM2 RSMI_VOLT_VDDGFX + + +METRICS +Power average 1E-6*ROCM0 +Edge temperature 1E-3*ROCM1 +Voltage 1E-3*ROCM2 + +LONG +-- +Gets the current average power consumption in watts, the +temperature in celsius and the voltage in volts. diff --git a/groups/amd_gpu/SALU.txt b/groups/amd_gpu/SALU.txt new file mode 100644 index 000000000..b5259d793 --- /dev/null +++ b/groups/amd_gpu/SALU.txt @@ -0,0 +1,13 @@ +SHORT SALU Instructions + +EVENTSET +ROCM0 ROCP_SQ_INSTS_SALU +ROCM1 ROCP_SQ_WAVES + +METRICS +GPU SALU insts per work-item ROCM0/ROCM1 + +LONG +-- +The average number of scalar ALU instructions executed per work-item +(affected by flow control). diff --git a/groups/amd_gpu/SFETCH.txt b/groups/amd_gpu/SFETCH.txt new file mode 100644 index 000000000..e33930eba --- /dev/null +++ b/groups/amd_gpu/SFETCH.txt @@ -0,0 +1,13 @@ +SHORT SFetch Instructions + +EVENTSET +ROCM0 ROCP_SQ_INSTS_SMEM +ROCM1 ROCP_SQ_WAVES + +METRICS +GPU SFETCH insts per work-item ROCM0/ROCM1 + +LONG +-- +The average number of scalar fetch instructions from the video memory +executed per work-item (affected by flow control). diff --git a/groups/amd_gpu/STALLED.txt b/groups/amd_gpu/STALLED.txt new file mode 100644 index 000000000..bc6086022 --- /dev/null +++ b/groups/amd_gpu/STALLED.txt @@ -0,0 +1,17 @@ +SHORT ALU stalled by LDS + +EVENTSET +ROCM0 ROCP_SQ_WAIT_INST_LDS +ROCM1 ROCP_SQ_WAVES +ROCM2 ROCP_GRBM_GUI_ACTIVE + +METRICS +GPU ALD stalled 100*ROCM0*4/ROCM1/ROCM2 + +LONG +-- +The percentage of GPUTime ALU units are stalled by the LDS input queue +being full or the output queue being not ready. If there are LDS bank +conflicts, reduce them. Otherwise, try reducing the number of LDS +accesses if possible. +Value range: 0% (optimal) to 100% (bad). diff --git a/groups/amd_gpu/UTIL.txt b/groups/amd_gpu/UTIL.txt new file mode 100644 index 000000000..e831e3c16 --- /dev/null +++ b/groups/amd_gpu/UTIL.txt @@ -0,0 +1,16 @@ +SHORT GPU utilization + +EVENTSET +ROCM0 ROCP_GRBM_COUNT +ROCM1 ROCP_GRBM_GUI_ACTIVE + + +METRICS +GPU utilization 100*ROCM1/ROCM0 + + +LONG +-- +This group reassembles the 'GPUBusy' metric provided by RocProfiler. 
+We should add, that we can select the GPUBusy metric directly and the +calculations are done internally in case the metric formula changes. diff --git a/groups/amd_gpu/VALU.txt b/groups/amd_gpu/VALU.txt new file mode 100644 index 000000000..e26a3b690 --- /dev/null +++ b/groups/amd_gpu/VALU.txt @@ -0,0 +1,13 @@ +SHORT VALU Instructions + +EVENTSET +ROCM0 ROCP_SQ_INSTS_VALU +ROCM1 ROCP_SQ_WAVES + +METRICS +GPU VALU insts per work-item ROCM0/ROCM1 + +LONG +-- +The average number of vector ALU instructions executed per work-item +(affected by flow control). diff --git a/groups/amd_gpu/WAVE.txt b/groups/amd_gpu/WAVE.txt new file mode 100644 index 000000000..eb9aec9fe --- /dev/null +++ b/groups/amd_gpu/WAVE.txt @@ -0,0 +1,13 @@ +SHORT Wavefronts + +EVENTSET +ROCM0 ROCP_SQ_WAVES + + +METRICS +GPU wavefronts ROCM0 + + +LONG +-- +Total Wavefronts diff --git a/make/config_checks.mk b/make/config_checks.mk index 4d23b3607..214a83e5c 100644 --- a/make/config_checks.mk +++ b/make/config_checks.mk @@ -82,3 +82,8 @@ ifeq ($(strip $(NVIDIA_INTERFACE)), true) INCLUDES += -I$(CUDAINCLUDE) -I$(CUPTIINCLUDE) #CPPFLAGS += -L$(CUDALIBDIR) -L$(CUPTILIBDIR) endif + +ifeq ($(strip $(ROCM_INTERFACE)), true) +# HSA includes 'hsa/xxx.h' and rocprofiler 'xxx.h' +INCLUDES += -I$(HIPINCLUDE) -I$(HSAINCLUDE) -I$(HSAINCLUDE)/hsa -I$(ROCPROFILERINCLUDE) -I$(RSMIINCLUDE) +endif diff --git a/make/config_defines.mk b/make/config_defines.mk index 990185e1f..92c4b9e3b 100644 --- a/make/config_defines.mk +++ b/make/config_defines.mk @@ -294,8 +294,10 @@ endif ifeq ($(strip $(NVIDIA_INTERFACE)),true) DEFINES += -DLIKWID_WITH_NVMON -else -BUILDAPPDAEMON := false +endif + +ifeq ($(strip $(ROCM_INTERFACE)),true) +DEFINES += -DLIKWID_WITH_ROCMON -D__HIP_PLATFORM_HCC__ endif ifeq ($(strip $(BUILDDAEMON)),true) diff --git a/src/access-daemon/Makefile b/src/access-daemon/Makefile index 8e272d09f..ecd500c1a 100644 --- a/src/access-daemon/Makefile +++ b/src/access-daemon/Makefile @@ -39,12 +39,18 @@ DEFINES += -D_GNU_SOURCE -DMAX_NUM_THREADS=$(MAX_NUM_THREADS) -DMAX_NUM_NODES= ifeq ($(DEBUG),true) DEFINES += -DDEBUG_LIKWID endif +ifeq ($(NVIDIA_INTERFACE), true) +DEFINES += -DLIKWID_NVMON +endif +ifeq ($(ROCM_INTERFACE), true) +DEFINES += -DLIKWID_ROCMON +endif INCLUDES = -I../includes CFLAGS += -std=c99 -fPIC -pie -fPIE -fstack-protector ifeq ($(COMPILER),GCCX86) CFLAGS += -m32 endif -CPPFLAGS := $(DEFINES) $(INCLUDES) +CPPFLAGS := $(DEFINES) $(INCLUDES) -L$(PREFIX)/lib ifeq ($(COMPILER),GCCARMv8) all: @@ -59,4 +65,4 @@ $(SETFREQ_TARGET): setFreqDaemon.c $(Q)$(CC) $(CFLAGS) $(CPPFLAGS) -o ../../$(SETFREQ_TARGET) setFreqDaemon.c $(APPDAEMON_TARGET): $(GOTCHA_TARGET) appDaemon.c - $(Q)$(CC) -shared -fPIC $(CPPFLAGS) -Wl,-soname,$(APPDAEMON_TARGET).$(VERSION).$(RELEASE) -fstack-protector -I. -I$(GOTCHA_FOLDER)/include -L$(GOTCHA_FOLDER) appDaemon.c -o ../../$(APPDAEMON_TARGET) -llikwid-gotcha + $(Q)$(CC) -shared -fPIC $(CPPFLAGS) -Wl,-soname,$(APPDAEMON_TARGET).$(VERSION).$(RELEASE) -fstack-protector -I. 
../bstrlib.c appDaemon.c -o ../../$(APPDAEMON_TARGET) -llikwid -L../../ diff --git a/src/access-daemon/appDaemon.c b/src/access-daemon/appDaemon.c index cbf33cc43..1f82f6757 100644 --- a/src/access-daemon/appDaemon.c +++ b/src/access-daemon/appDaemon.c @@ -32,48 +32,581 @@ #include #include -#include +#include +#include +#include +#include +#include +#include -gotcha_wrappee_handle_t orig_main_handle; +#include +#include -static int appDaemon_initialized = 0; +typedef void(*appdaemon_exit_func)(void); +#define APPDAEMON_MAX_EXIT_FUNCS 2 +static appdaemon_exit_func appdaemon_exit_funcs[APPDAEMON_MAX_EXIT_FUNCS]; +static int appdaemon_num_exit_funcs = 0; -int likwid_appDaemon_main(int argc, char** argv) +static struct tagbstring daemon_name = bsStatic("likwid-appDaemon.so"); +static FILE* output_file = NULL; + +// Timeline mode +static int stopIssued = 0; +static pthread_mutex_t stopMutex; + +int appdaemon_register_exit(appdaemon_exit_func f) { - int return_code = 0; - typeof(&likwid_appDaemon_main) orig_main = (int (*)(int, char**))gotcha_get_wrappee(orig_main_handle); - char* nvEventStr = getenv("NVMON_EVENTS"); - char* nvGpuStr = getenv("NVMON_GPUS"); + if (appdaemon_num_exit_funcs < APPDAEMON_MAX_EXIT_FUNCS) + { + appdaemon_exit_funcs[appdaemon_num_exit_funcs] = f; + appdaemon_num_exit_funcs++; + } +} - if (appDaemon_initialized) +static void after_main() +{ + // Stop timeline thread (if running) + pthread_mutex_lock(&stopMutex); + stopIssued = 1; + pthread_mutex_unlock(&stopMutex); + + for (int i = 0; i < appdaemon_num_exit_funcs; i++) { - return_code = orig_main(argc, argv); + appdaemon_exit_funcs[i](); } - else + + if (output_file) { + fclose(output_file); + } +} - appDaemon_initialized = 1; +static void prepare_ldpreload() +{ + int (*mysetenv)(const char *name, const char *value, int overwrite) = setenv; + char* ldpreload = getenv("LD_PRELOAD"); + if (ldpreload) + { + printf("Old LD_PRELOAD=%s\n", ldpreload); + bstring bldpre = bfromcstr(ldpreload); + bstring new_bldpre = bfromcstr(""); + struct bstrList *liblist = bsplit(bldpre, ':'); + for (int i = 0; i < liblist->qty; i++) + { + if (binstr(liblist->entry[i], 0, &daemon_name) == BSTR_ERR) + { + bconcat(new_bldpre, liblist->entry[i]); + bconchar(new_bldpre, ':'); + } + } + printf("New LD_PRELOAD=%s\n", bdata(new_bldpre)); + mysetenv("LD_PRELOAD", bdata(new_bldpre), 1); + bstrListDestroy(liblist); + bdestroy(new_bldpre); + bdestroy(bldpre); + } +} +static int parse_gpustr(char* gpuStr, int* numGpus, int** gpuIds) +{ + // Create bstring + bstring bGpuStr = bfromcstr(gpuStr); + + // Parse list + struct bstrList* gpuTokens = bsplit(bGpuStr,','); + int tmpNumGpus = gpuTokens->qty; - return_code = orig_main(argc, argv); + // Allocate gpuId list + int* tmpGpuIds = malloc(tmpNumGpus * sizeof(int)); + if (!tmpGpuIds) + { + fprintf(stderr,"Cannot allocate space for GPU list.\n"); + bdestroy(bGpuStr); + bstrListDestroy(gpuTokens); + return -EXIT_FAILURE; } + // Parse ids to int + for (int i = 0; i < tmpNumGpus; i++) + { + tmpGpuIds[i] = atoi(bdata(gpuTokens->entry[i])); + } + // Copy data + *numGpus = tmpNumGpus; + *gpuIds = tmpGpuIds; + // Destroy bstring + bdestroy(bGpuStr); + bstrListDestroy(gpuTokens); + return 0; +} +/* +Nvmon +*/ +#ifdef LIKWID_NVMON +static int nvmon_initialized = 0; +static int* nvmon_gpulist = NULL; +static int nvmon_numgpus = 0; +static int* nvmon_gids = NULL; +static int nvmon_numgids = 0; - appDaemon_initialized = 0; - return return_code; +static int appdaemon_setup_nvmon(char* gpuStr, char* eventStr) +{ + int ret 
= 0; + printf("Nvmon GPU string: %s\n", gpuStr); + printf("Nvmon Event string: %s\n", eventStr); + + // Parse gpu string + ret = parse_gpustr(gpuStr, &nvmon_numgpus, &nvmon_gpulist); + if (ret < 0) + { + ERROR_PRINT(Failed to get nvmon gpulist from '%s', gpuStr); + goto appdaemon_setup_nvmon_cleanup; + } + + // Parse event string + bstring bev = bfromcstr(eventStr); + struct bstrList* nvmon_eventlist = bsplit(bev, '|'); + bdestroy(bev); + nvmon_gids = malloc(nvmon_eventlist->qty * sizeof(int)); + if (!nvmon_gids) + { + ERROR_PRINT(Failed to allocate space for nvmon group IDs); + goto appdaemon_setup_nvmon_cleanup; + } + + // Init nvmon + ret = nvmon_init(nvmon_numgpus, nvmon_gpulist); + if (ret < 0) + { + ERROR_PRINT(Failed to initialize nvmon); + goto appdaemon_setup_nvmon_cleanup; + } + nvmon_initialized = 1; + + // Add event sets + for (int i = 0; i < nvmon_eventlist->qty; i++) + { + ret = nvmon_addEventSet(bdata(nvmon_eventlist->entry[i])); + if (ret < 0) + { + ERROR_PRINT(Failed to add nvmon group: %s, bdata(nvmon_eventlist->entry[i])); + continue; + } + nvmon_gids[nvmon_numgids++] = ret; + } + if (nvmon_numgids == 0) + { + ERROR_PRINT(Failed to add any events to nvmon); + goto appdaemon_setup_nvmon_cleanup; + } + + // Setup counters + ret = nvmon_setupCounters(nvmon_gids[0]); + if (ret < 0) + { + ERROR_PRINT(Failed to setup nvmon); + goto appdaemon_setup_nvmon_cleanup; + } + + // Start counters + ret = nvmon_startCounters(); + if (ret < 0) + { + ERROR_PRINT(Failed to start nvmon); + goto appdaemon_setup_nvmon_cleanup; + } + return 0; +appdaemon_setup_nvmon_cleanup: + if (nvmon_initialized) + { + nvmon_finalize(); + nvmon_initialized = 0; + } + if (nvmon_gids) + { + free(nvmon_gids); + nvmon_gids = NULL; + nvmon_numgids = 0; + } + if (nvmon_eventlist) + { + bstrListDestroy(nvmon_eventlist); + nvmon_eventlist = NULL; + } + if (nvmon_gpulist) + { + free(nvmon_gpulist); + nvmon_gpulist = NULL; + nvmon_numgpus = 0; + } + return ret; } +static void appdaemon_close_nvmon(void) +{ + // Stop counters + int ret = nvmon_stopCounters(); + if (ret < 0) + { + ERROR_PRINT(Failed to stop nvmon); + } -struct gotcha_binding_t likwid_appDaemon_overwrites[] = { - {"main", likwid_appDaemon_main, (void*)&orig_main_handle}, -}; + // Print results + for (int g = 0; g < nvmon_numgids; g++) + { + int gid = nvmon_gids[g]; + for (int i = 0; i < nvmon_getNumberOfEvents(gid); i++) + { + for (int j = 0; j < nvmon_numgpus; j++) + { + fprintf(output_file, "Nvmon, %d, %f, %s, %f, %f\n", nvmon_gpulist[j], nvmon_getTimeOfGroup(nvmon_gpulist[j]), nvmon_getEventName(gid, i), nvmon_getResult(gid, i, j), nvmon_getLastResult(gid, i, j)); + } + } + } + fflush(output_file); + // Cleanup + if (nvmon_initialized) + { + nvmon_finalize(); + nvmon_initialized = 0; + } + if (nvmon_gids) + { + free(nvmon_gids); + nvmon_gids = NULL; + nvmon_numgids = 0; + } + if (nvmon_gpulist) + { + free(nvmon_gpulist); + nvmon_gpulist = NULL; + nvmon_numgpus = 0; + } +} -void __attribute__((constructor)) likwid_appDaemon_constructor() +static void appdaemon_read_nvmon(void) { - gotcha_wrap(likwid_appDaemon_overwrites, 1 ,"likwid_appDaemon"); + // Read counters + int ret = nvmon_readCounters(); + if (ret < 0) + { + fprintf(stderr, "Failed to read Nvmon counters\n"); + return; + } + + // Print results + for (int g = 0; g < nvmon_numgids; g++) + { + int gid = nvmon_gids[g]; + for (int i = 0; i < nvmon_getNumberOfEvents(gid); i++) + { + for (int j = 0; j < nvmon_numgpus; j++) + { + fprintf(output_file, "Nvmon, %d, %f, %s, %f, %f\n", nvmon_gpulist[j], 
nvmon_getTimeToLastReadOfGroup(nvmon_gpulist[j]), nvmon_getEventName(gid, i), nvmon_getResult(gid, i, j), nvmon_getLastResult(gid, i, j)); + } + } + } } +#endif + +/* +Rocmon +*/ +#ifdef LIKWID_ROCMON +static int rocmon_initialized = 0; +static int* rocmon_gpulist = NULL; +static int rocmon_numgpus = 0; +static int* rocmon_gids = NULL; +static int rocmon_numgids = 0; + +static int appdaemon_setup_rocmon(char* gpuStr, char* eventStr) +{ + int ret = 0; + printf("Rocmon GPU string: %s\n", gpuStr); + printf("Rocmon Event string: %s\n", eventStr); + + // Parse gpu string + ret = parse_gpustr(gpuStr, &rocmon_numgpus, &rocmon_gpulist); + if (ret < 0) + { + ERROR_PRINT(Failed to get rocmon gpulist from '%s', gpuStr); + goto appdaemon_setup_rocmon_cleanup; + } + + // Parse event string + bstring bev = bfromcstr(eventStr); + struct bstrList* rocmon_eventlist = bsplit(bev, '|'); // TODO: multiple event sets not supported + bdestroy(bev); + rocmon_gids = malloc(rocmon_eventlist->qty * sizeof(int)); + if (!rocmon_gids) + { + ERROR_PRINT(Failed to allocate space for rocmon group IDs); + goto appdaemon_setup_rocmon_cleanup; + } + + // Init rocmon + ret = rocmon_init(rocmon_numgpus, rocmon_gpulist); + if (ret < 0) + { + ERROR_PRINT(Failed to initialize rocmon); + goto appdaemon_setup_rocmon_cleanup; + } + rocmon_initialized = 1; + + // Add event sets + for (int i = 0; i < rocmon_eventlist->qty; i++) + { + ret = rocmon_addEventSet(bdata(rocmon_eventlist->entry[i]), &rocmon_gids[rocmon_numgids++]); + if (ret < 0) + { + ERROR_PRINT(Failed to add rocmon group: %s, bdata(rocmon_eventlist->entry[i])); + } + } + if (rocmon_numgids == 0) + { + ERROR_PRINT(Failed to add any events to rocmon); + goto appdaemon_setup_rocmon_cleanup; + } + + // Setup counters + ret = rocmon_setupCounters(rocmon_gids[0]); + if (ret < 0) + { + ERROR_PRINT(Failed to setup rocmon); + goto appdaemon_setup_rocmon_cleanup; + } + + // Start counters + ret = rocmon_startCounters(); + if (ret < 0) + { + ERROR_PRINT(Failed to start rocmon); + goto appdaemon_setup_rocmon_cleanup; + } + return 0; +appdaemon_setup_rocmon_cleanup: + if (rocmon_initialized) + { + rocmon_finalize(); + rocmon_initialized = 0; + } + if (rocmon_gids) + { + free(rocmon_gids); + rocmon_gids = NULL; + rocmon_numgids = 0; + } + if (rocmon_eventlist) + { + bstrListDestroy(rocmon_eventlist); + rocmon_eventlist = NULL; + } + if (rocmon_gpulist) + { + free(rocmon_gpulist); + rocmon_gpulist = NULL; + rocmon_numgpus = 0; + } + return ret; +} + +static void appdaemon_close_rocmon(void) +{ + // Stop counters + int ret = rocmon_stopCounters(); + if (ret < 0) + { + ERROR_PRINT(Failed to stop rocmon); + } + + // Print results + for (int g = 0; g < rocmon_numgids; g++) + { + int gid = rocmon_gids[g]; + for (int i = 0; i < rocmon_getNumberOfEvents(gid); i++) + { + for (int j = 0; j < rocmon_numgpus; j++) + { + fprintf(output_file, "Rocmon, %d, %f, %s, %f, %f\n", rocmon_gpulist[j], rocmon_getTimeOfGroup(rocmon_gpulist[j]), rocmon_getEventName(gid, i), rocmon_getResult(j, gid, i), rocmon_getLastResult(j, gid, i)); + } + } + } + + // Cleanup + if (rocmon_initialized) + { + rocmon_finalize(); + rocmon_initialized = 0; + } + if (rocmon_gids) + { + free(rocmon_gids); + rocmon_gids = NULL; + rocmon_numgids = 0; + } + if (rocmon_gpulist) + { + free(rocmon_gpulist); + rocmon_gpulist = NULL; + rocmon_numgpus = 0; + } +} + +static void appdaemon_read_rocmon(void) +{ + // Read counters + int ret = rocmon_readCounters(); + if (ret < 0) + { + fprintf(stderr, "Failed to read Rocmon counters\n"); + 
return; + } + + // Print results + for (int g = 0; g < rocmon_numgids; g++) + { + int gid = rocmon_gids[g]; + for (int i = 0; i < rocmon_getNumberOfEvents(gid); i++) + { + for (int j = 0; j < rocmon_numgpus; j++) + { + fprintf(output_file, "Rocmon, %d, %f, %s, %f, %f\n", rocmon_gpulist[j], rocmon_getTimeToLastReadOfGroup(rocmon_gpulist[j]), rocmon_getEventName(gid, i), rocmon_getResult(j, gid, i), rocmon_getLastResult(j, gid, i)); + } + } + } +} +#endif + + +/* +Timeline mode +*/ +static void* appdaemon_timeline_main(void* arg) +{ + int stop = 0; + int target_delay_ms = *((int*)arg); + ; + + while (1) + { + usleep(target_delay_ms * 1E3); + + // Check stop status + pthread_mutex_lock(&stopMutex); + stop = stopIssued; + pthread_mutex_unlock(&stopMutex); + if (stop > 0) break; + +#ifdef LIKWID_NVMON + appdaemon_read_nvmon(); +#endif +#ifdef LIKWID_ROCMON + appdaemon_read_rocmon(); +#endif + } +} + + +/* +Main +*/ +int __libc_start_main(int (*main) (int,char **,char **), + int argc,char **ubp_av, + void (*init) (void), + void (*fini)(void), + void (*rtld_fini)(void), + void (*stack_end)) { + int ret = 0; + int (*original__libc_start_main)(int (*main) (int,char **,char **), + int argc,char **ubp_av, + void (*init) (void), + void (*fini)(void), + void (*rtld_fini)(void), + void (*stack_end)); + + mlockall(MCL_CURRENT); + munlockall(); + atexit(after_main); + + + original__libc_start_main = dlsym(RTLD_NEXT, "__libc_start_main"); + + prepare_ldpreload(); + + // Get timeline mode info + char* timelineStr = getenv("LIKWID_INTERVAL"); + int timelineInterval = -1; // in ms + if (timelineStr != NULL) + { + timelineInterval = atoi(timelineStr); + } + if (timelineInterval == 0) + { + fprintf(stderr, "Invalid timeline interval\n"); + return -1; + } + + // Open output file + char* outputFilename = getenv("LIKWID_OUTPUTFILE"); + if (outputFilename == NULL) + { + output_file = stderr; + } else { + output_file = fopen(outputFilename,"w"); + } + + if (output_file == NULL) + { + fprintf(stderr, "Cannot open file %s\n", outputFilename); + fprintf(stderr, "%s", strerror(errno)); + return -1; + } + fprintf(output_file, "Backend, GPU, Time, Event, Full Value, Last Value\n"); + +#ifdef LIKWID_NVMON + char* nvEventStr = getenv("LIKWID_NVMON_EVENTS"); + char* nvGpuStr = getenv("LIKWID_NVMON_GPUS"); + if (nvEventStr && nvGpuStr) + { + ret = appdaemon_setup_nvmon(nvGpuStr, nvEventStr); + if (!ret) + { + appdaemon_register_exit(appdaemon_close_nvmon); + } + } +#endif + +#ifdef LIKWID_ROCMON + char* rocmonEventStr = getenv("LIKWID_ROCMON_EVENTS"); + char* rocmonGpuStr = getenv("LIKWID_ROCMON_GPUS"); + if (rocmonEventStr && rocmonGpuStr) + { + ret = appdaemon_setup_rocmon(rocmonGpuStr, rocmonEventStr); + if (!ret) + { + appdaemon_register_exit(appdaemon_close_rocmon); + } + } +#endif + + // Start timeline thread + if (timelineInterval >= 0) + { + pthread_t tid; + ret = pthread_create(&tid, NULL, &appdaemon_timeline_main, &timelineInterval); + if (ret < 0) + { + fprintf(stderr, "Failed to create timeline thread\n"); + return -1; + } + } + + return original__libc_start_main(main,argc,ubp_av, + init,fini,rtld_fini,stack_end); +} + diff --git a/src/applications/likwid-perfctr.lua b/src/applications/likwid-perfctr.lua index 7759f0f45..b58536143 100644 --- a/src/applications/likwid-perfctr.lua +++ b/src/applications/likwid-perfctr.lua @@ -35,10 +35,14 @@ package.path = '/share/lua/?.lua;' .. package.path local likwid = require("likwid") print_stdout = print -print_stderr = function(...) 
for k,v in pairs({...}) do io.stderr:write(v .. "\n") end io.stderr:flush() end +print_stderr = function(...) + for k, v in pairs({ ... }) do io.stderr:write(v .. "\n") end + io.stderr:flush() +end local function version() - print_stdout(string.format("likwid-perfctr -- Version %d.%d.%d (commit: %s)",likwid.version,likwid.release,likwid.minor,likwid.commit)) + print_stdout(string.format("likwid-perfctr -- Version %d.%d.%d (commit: %s)", likwid.version, likwid.release, + likwid.minor, likwid.commit)) end local function examples() @@ -57,6 +61,12 @@ local function examples() io.stdout:write("It is possible to combine CPU and GPU measurements (with MarkerAPI and NVMarkerAPI):\n") io.stdout:write("likwid-perfctr -C 2 -g CLOCK -G 1 -W FLOPS_DP -m ./a.out\n") end + if likwid.rocmSupported() then + io.stdout:write("Run command and measure on GPU 1 the performance group PCI (Only with ROCmMarkerAPI):\n") + io.stdout:write("likwid-perfctr -I 1 -R PCI -m ./a.out\n") + io.stdout:write("It is possible to combine CPU and GPU measurements (with MarkerAPI and ROCmMarkerAPI):\n") + io.stdout:write("likwid-perfctr -C 2 -g CLOCK -I 1 -R PCI -m ./a.out\n") + end end local function usage(config) @@ -70,12 +80,18 @@ local function usage(config) io.stdout:write("-C \t\t Processor ids to pin threads and measure, e.g. 1,2-4,8\n") io.stdout:write("\t\t\t For information about the syntax, see likwid-pin\n") if likwid.nvSupported() then - io.stdout:write("-G, --gpus \t List of GPUs to monitor\n") + io.stdout:write("-G, --gpus \t List of CUDA GPUs to monitor\n") + end + if likwid.rocmSupported() then + io.stdout:write("-I \t\t List of ROCm GPUs to monitor\n") end io.stdout:write("-g, --group \t Performance group or custom event set string for CPU monitoring\n") if likwid.nvSupported() then io.stdout:write("-W, --gpugroup \t Performance group or custom event set string for GPU monitoring\n") end + if likwid.rocmSupported() then + io.stdout:write("-R \t\t Performance group or custom event set string for ROCm GPU monitoring\n") + end io.stdout:write("-H\t\t\t Get group help (together with -g switch)\n") io.stdout:write("-s, --skip \t Bitmask with threads to skip\n") io.stdout:write("-M <0|1>\t\t Set how MSR registers are accessed, 0=direct, 1=accessDaemon\n") @@ -89,12 +105,15 @@ local function usage(config) io.stdout:write("-S