Merge pull request #15 from UCA-Datalab/develop

Develop
UCA-Datalab · Apr 7, 2022 · a2a6fb7
2 parents 317e362 + 0c72a0a
commit a2a6fb7

Showing 14 changed files with 905 additions and 96 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,6 +2,7 @@ notebooks/
 plots/
 output*/
 *.csv
+*.ipynb
 
 # Log
 log.out

diff --git a/README.md b/README.md
@@ -1,4 +1,67 @@
-# NILM: classification VS regression
+<!-- README template: https://github.com/othneildrew/Best-README-Template -->
+
+<!-- PROJECT SHIELDS -->
+<!--
+*** I'm using markdown "reference style" links for readability.
+*** Reference links are enclosed in brackets [ ] instead of parentheses ( ).
+*** See the bottom of this document for the declaration of the reference variables
+*** for contributors-url, forks-url, etc. This is an optional, concise syntax you may use.
+*** https://www.markdownguide.org/basic-syntax/#reference-style-links
+-->
+[![Contributors][contributors-shield]][contributors-url]
+[![Forks][forks-shield]][forks-url]
+[![Stargazers][stars-shield]][stars-url]
+[![Issues][issues-shield]][issues-url]
+[![LinkedIn][linkedin-shield]][linkedin-url]
+
+<!-- PROJECT LOGO -->
+<br />
+<p align="center">
+  <a href="https://github.com/UCA-Datalab">
+    <img src="images/logo.png" alt="Logo" width="400" height="80">
+  </a>
+
+  <h3 align="center">NILM: classification VS regression</h3>
+</p>
+
+
+<!-- TABLE OF CONTENTS -->
+<details open="open">
+  <summary>Table of Contents</summary>
+  <ol>
+    <li>
+      <a href="#about-the-project">About The Project</a>
+    </li>
+    <li>
+      <a href="#getting-started">Getting Started</a>
+      <ul>
+        <li><a href="#create-the-environment">Create the Environment</a></li>
+      </ul>
+    </li>
+    <li>
+      <a href="#datasets">Datasets</a>
+      <ul>
+        <li><a href="#uk-dale">UK-DALE</a></li>
+      </ul>
+      <ul>
+        <li><a href="#pecan-street-dataport">Pecan Street Dataport</a></li>
+      </ul>
+    </li>
+    <li><a href="#preprocess-the-data">Preprocess the Data</a></li>
+    <li>
+      <a href="#train">Train</a>
+      <ul>
+         <li><a href="#reproduce-the-paper">Reproduce the Paper</a></li>
+         <li><a href="#thresholding-methods">Thresholding Methods</a></li>
+      </ul>
+    </li>
+    <li><a href="#publications">Publications</a></li>
+    <li><a href="#contact">Contact</a></li>
+    <li><a href="#acknowledgements">Acknowledgements</a></li>
+  </ol>
+</details>
+
+## About the project
 
 Non-Intrusive Load Monitoring (NILM)  aims to predict the status
 or consumption of  domestic appliances in a household only by knowing
@@ -14,10 +77,10 @@ deep learning state-of-the-art architectures on both the regression and
 classification problems, introducing criteria to select the most convenient
 thresholding method.
 
-Source: [see publications](#publications)
+## Getting started
+### Create the Environment
 
-## Set up
-### Create the environment using Conda
+To create the environment using Conda:
 
   1. Install miniconda
 
@@ -43,12 +106,10 @@ Source: [see publications](#publications)
        conda activate nilm-thresholding
        ```
  
-## Data
+## Datasets
 
 ### UK-DALE
 
-#### Download UK-DALE
-
 The UK-DALE dataset is hosted at the following link:
 [https://data.ukedc.rl.ac.uk/browse/edc/efficiency/residential
 /EnergyConsumption/Domestic/UK-DALE-2017/UK-DALE-FULL-disaggregated](https://data.ukedc.rl.ac.uk/browse/edc/efficiency/residential/EnergyConsumption/Domestic/UK-DALE-2017/UK-DALE-FULL-disaggregated)
@@ -69,7 +130,15 @@ nilm-thresholding
 
 Credit: [Jack Kelly](https://jack-kelly.com/data/)
 
-### Preprocess
+### Pecan Street Dataport
+
+We are aiming to include this dataset in a future release. You can check the issue here: [https://github.com/UCA-Datalab/nilm-thresholding/issues/8](https://github.com/UCA-Datalab/nilm-thresholding/issues/8)
+
+Any help and suggestions are welcome!
+
+Credit: [Pecan Street](https://dataport.pecanstreet.org/)
+
+## Preprocess the Data
 
 Once you have downloaded the raw data from any of the sources above,
 you must preprocess it.
@@ -106,23 +175,23 @@ If you want to use your own set of parameters, duplicate the aforementioned
  configuration file and modify the parameters you want to change (without deleting any
   parameter). You can then use that config file with the following command:
  
- ```
+```
 python nilmth/train.py  --path_config <path to your config file>
 ```
 
 For more information about the script, run:
 
- ```
+```
 python nilmth/train.py  --help
 ```
 
 Once the models are trained, test them with:
 
- ```
+```
 python nilmth/test.py  --path_config <path to your config file>
 ```
 
-#### Reproduce paper
+### Reproduce the Paper
 
 To reproduce the results shown in [our paper](#publications), activate the
  environment and then run:
@@ -136,11 +205,11 @@ models are stored. Then, the script `train.py` will be called, using each
  configuration in turn. This will store the model weights, which will be used
  again during the test phase:
  
- ```
+```
 nohup sh test_sequential.sh > log.out & 
 ```
 
-### Thresholding methods
+### Thresholding Methods
 
 There are three thresholding methods available. Read [our paper](#publications)
 to understand how each threshold works.
@@ -151,13 +220,32 @@ to understand how each threshold works.
 
 ## Publications
 
-[NILM as a regression versus classification problem:
+* [NILM as a regression versus classification problem:
 the importance of thresholding](https://www.researchgate.net/project/Non-Intrusive-Load-Monitoring-6)
 
-## Contact information
+## Contact
+
+Daniel Precioso - [daniprec](https://github.com/daniprec) -  [email protected]
+
+Project link: [https://github.com/UCA-Datalab/nilm-thresholding](https://github.com/UCA-Datalab/nilm-thresholding)
+
+ResearchGate link: [https://www.researchgate.net/project/NILM-classification-VS-regression](https://www.researchgate.net/project/NILM-classification-VS-regression)
+
+## Acknowledgements
+
+* [UCA DataLab](http://datalab.uca.es/)
+* [David Gómez-Ullate](https://www.linkedin.com/in/david-g%C3%B3mez-ullate-oteiza-87a820b/?originalSubdomain=en)
+
 
-Author: Daniel Precioso, PhD student at Universidad de Cádiz
-- Email: [email protected]
-- [Github](https://github.com/daniprec)
-- [LinkedIn](https://www.linkedin.com/in/daniel-precioso-garcelan/)
-- [ResearchGate](https://www.researchgate.net/profile/Daniel_Precioso_Garcelan)
+<!-- MARKDOWN LINKS & IMAGES -->
+<!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->
+[contributors-shield]: https://img.shields.io/github/contributors/UCA-Datalab/nilm-thresholding.svg?style=for-the-badge
+[contributors-url]: https://github.com/UCA-Datalab/nilm-thresholding/graphs/contributors
+[forks-shield]: https://img.shields.io/github/forks/UCA-Datalab/nilm-thresholding.svg?style=for-the-badge
+[forks-url]: https://github.com/UCA-Datalab/nilm-thresholding/network/members
+[stars-shield]: https://img.shields.io/github/stars/UCA-Datalab/nilm-thresholding.svg?style=for-the-badge
+[stars-url]: https://github.com/UCA-Datalab/nilm-thresholding/stargazers
+[issues-shield]: https://img.shields.io/github/issues/UCA-Datalab/nilm-thresholding.svg?style=for-the-badge
+[issues-url]: https://github.com/UCA-Datalab/nilm-thresholding/issues
+[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555
+[linkedin-url]: https://www.linkedin.com/in/daniel-precioso-garcelan/
diff --git a/images/logo.png b/images/logo.png
diff --git a/nilmth/data/clustering.py b/nilmth/data/clustering.py
@@ -0,0 +1,160 @@
+import itertools
+from typing import Optional, Tuple
+
+import matplotlib.pyplot as plt
+import numpy as np
+from scipy.cluster.hierarchy import cophenet, dendrogram, fcluster, linkage
+from scipy.spatial.distance import pdist
+
+
+class HierarchicalClustering:
+    def __init__(
+        self, distance: str = "average", n_cluster: int = 2, criterion: str = "maxclust"
+    ):
+        """This object is able to perform Hierarchical Clustering on a given set of points
+
+        Parameters
+        ----------
+        distance : str, optional
+            Clustering distance criteria, by default "average"
+        n_cluster : int, optional
+            Number of clusters to form, by default 2
+        criterion : str, optional
+            Criterion used to compute the clusters, by default "maxclust"
+        """
+        self.distance = distance
+        self.n_cluster = n_cluster
+        self.criterion = criterion
+
+        # Attributes filled with `perform_clustering`
+        self.x = np.empty(0)  # Set of data points
+        self.z = np.empty(0)  # The hierarchical clustering encoded as a linkage matrix
+        # z[i] will tell us which clusters were merged in the i-th iteration
+
+        # Attributes filled with `plot_dendrogram`
+        self.dendrogram = {}
+        # A dictionary of data structures computed to render the dendrogram
+
+        # Attributes filled with `compute_thresholds_and_centroids`
+        self.thresh = np.empty(0)
+        self.centroids = np.empty(0)
+
+    def perform_clustering(
+        self, ser: np.ndarray, distance: Optional[str] = None
+    ) -> None:
+        """Performs the actual clustering, using the linkage function
+
+        Parameters
+        ----------
+        ser : np.ndarray
+            Series of points to group in clusters
+        distance : str, optional
+            Clustering distance criteria, by default None (takes the one from the class)
+        """
+        self.distance = distance if distance is not None else self.distance
+        # The shape of our X matrix must be (n, m)
+        # n = samples, m = features
+        self.x = np.expand_dims(ser, axis=1)
+        self.z = linkage(self.x, method=self.distance)
+
+    @property
+    def cophenet(self):
+        # Cophenet correlation coefficient
+        c, coph_dists = cophenet(self.z, pdist(self.x))
+        return c
+
+    def plot_dendrogram(
+        self, p: int = 6, max_d: Optional[float] = None, figsize: Tuple[int] = (3, 3)
+    ):
+        """Plots the dendrogram
+
+        Parameters
+        ----------
+        p : int, optional
+            Number of last merged clusters to show (``truncate_mode="lastp"``), by default 6
+        max_d : Optional[float], optional
+            Distance at which to draw a vertical cut-off line, by default None (no line)
+        figsize : Tuple[int], optional
+            Figure size, by default (3, 3)
+        """
+        fig, ax = plt.subplots(figsize=figsize)
+        self.dendrogram = dendrogram(
+            self.z,
+            p=p,
+            orientation="right",
+            truncate_mode="lastp",
+            labels=self.x[:, 0],
+            ax=ax,
+        )
+        if max_d is not None:
+            ax.axvline(x=max_d, c="k")
+        return fig, ax
+
+    @property
+    def dendrogram_distance(self):
+        return sorted(set(itertools.chain(*self.dendrogram["dcoord"])), reverse=True)
+
+    def plot_dendrogram_distance(self, figsize: Tuple[int] = (10, 3)):
+        """Plots the dendrogram distances
+
+        Parameters
+        ----------
+        figsize : Tuple[int], optional
+            Size of the figure, by default (10, 3)
+        """
+        # Initialize plots
+        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
+        # Dendrogram distance
+        ax1.scatter(
+            range(2, len(self.dendrogram_distance) + 1), self.dendrogram_distance[:-1]
+        )
+        ax1.set_ylabel("Distance")
+        ax1.set_xlabel("Number of clusters")
+        ax1.grid()
+        # Dendrogram distance difference
+        diff = np.divide(
+            -np.diff(self.dendrogram_distance), self.dendrogram_distance[:-1]
+        )
+        ax2.scatter(range(3, len(self.dendrogram_distance) + 1), diff[:-1])
+        ax2.set_ylabel("Gradient")
+        ax2.set_xlabel("Number of clusters")
+        ax2.grid()
+        return fig, (ax1, ax2)
+
+    def compute_thresholds_and_centroids(
+        self,
+        n_cluster: Optional[int] = None,
+        criterion: Optional[str] = None,
+        centroid: str = "median",
+    ):
+        """Computes the thresholds and centroids of each group
+
+        Parameters
+        ----------
+        n_cluster : Optional[int], optional
+            Number of clusters, by default None
+        criterion : Optional[str], optional
+            Criterion used to compute the clusters, by default None
+        centroid : str, optional
+            Method to compute the centroids (median or mean), by default "median"
+        """
+        self.n_cluster = n_cluster if n_cluster is not None else self.n_cluster
+        self.criterion = criterion if criterion is not None else self.criterion
+        clusters = fcluster(self.z, self.n_cluster, self.criterion)
+        # Get centroids
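+        if centroid not in ("median", "mean"):
+            raise ValueError(f"Unknown centroid method: {centroid}")
+        # Map the method name to the NumPy aggregation function
+        fun = np.median if centroid == "median" else np.mean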
+        self.centroids = np.array(
+            sorted([fun(self.x[clusters == (c + 1)]) for c in range(self.n_cluster)])
+        )
+        # Sort clusters by power
+        x_max = sorted(
+            [np.max(self.x[clusters == (c + 1)]) for c in range(self.n_cluster)]
+        )
+        x_min = sorted(
+            [np.min(self.x[clusters == (c + 1)]) for c in range(self.n_cluster)]
+        )
+        thresh = np.divide(np.array(x_min[1:]) + np.array(x_max[:-1]), 2)
+        self.thresh = np.insert(thresh, 0, 0, axis=0)
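
The clustering step added in `nilmth/data/clustering.py` can be illustrated with a standalone sketch that calls SciPy directly. The helper name `thresholds_and_centroids` and the toy power readings are assumptions made for illustration and are not part of the repository. The logic follows `perform_clustering` and `compute_thresholds_and_centroids` (average linkage, `maxclust` criterion, median centroids, thresholds at the midpoint between neighbouring clusters), except that it sorts whole clusters by their minimum instead of sorting the minima and maxima separately.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def thresholds_and_centroids(power, n_cluster=2, method="average"):
    """Cluster a 1-D power series and return (thresholds, centroids).

    Hypothetical helper: a compact restatement of the class in the diff,
    not the repository's API.
    """
    # linkage expects shape (n_samples, n_features)
    x = np.asarray(power, dtype=float).reshape(-1, 1)
    z = linkage(x, method=method)
    labels = fcluster(z, n_cluster, criterion="maxclust")
    # Collect the points of each cluster that actually occurs
    groups = [x[labels == c, 0] for c in np.unique(labels)]
    # Order clusters from lowest to highest power
    groups.sort(key=np.min)
    centroids = np.array([np.median(g) for g in groups])
    lows = np.array([g.min() for g in groups])
    highs = np.array([g.max() for g in groups])
    # Midpoint between the top of one cluster and the bottom of the next
    thresh = (lows[1:] + highs[:-1]) / 2
    # The lowest state starts at zero power, as in the diff
    return np.insert(thresh, 0, 0.0), centroids


# Toy readings (assumed values): an appliance that is either off or on
power = np.array([0.0, 1.0, 2.0, 98.0, 100.0, 102.0])
thresh, centroids = thresholds_and_centroids(power)
# One possible way to turn readings into state indices
states = np.digitize(power, thresh) - 1
print(thresh, centroids, states)
```

With two well-separated groups, the single non-zero threshold falls halfway between 2 and 98. Using `np.digitize` to map readings to states is only one way to apply the thresholds and is not necessarily how the repository does it.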