Commit 57c65bf: docs improvements

simonmandlik committed Jun 6, 2024
1 parent 8c04db2 commit 57c65bf
Showing 18 changed files with 118 additions and 60 deletions.
13 changes: 10 additions & 3 deletions docs/src/examples/gnn.md
@@ -31,7 +31,10 @@ Furthermore, let's assume that each vertex is described by three features stored
X = ArrayNode(randn(Float32, 3, 10))
```

We use [`ScatteredBags`](@ref) from [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) to encode the neighbors of each vertex. In other words, each vertex is described by a bag of its neighbors. This information is conveniently stored in the `fadjlist` field of `g`, so the bags can be constructed as:

```@repl gnn
b = ScatteredBags(g.fadjlist)
@@ -83,7 +86,9 @@ end
nothing # hide
```

As is the case with the whole of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), even this graph neural network is properly integrated with the [`Flux.jl`](https://fluxml.ai) ecosystem and supports automatic differentiation:

```@example gnn
zd = 4
@@ -100,6 +105,8 @@ gnn(g, X, 5)
gradient(m -> m(g, X, 5) |> sum, gnn)
```

The above implementation is surprisingly general, as it supports an arbitrarily rich description of vertices. For simplicity, we used only vectors in `X`; however, any [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) hierarchy is applicable.

To put different weights on edges, one can use [Weighted aggregation](@ref).
8 changes: 7 additions & 1 deletion docs/src/examples/jsons.md
@@ -11,5 +11,11 @@

# Processing JSONs

Processing JSONs is actually one of the main motivations for building [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). As a matter of fact, with [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) one can process a set of valid JSON documents that follow the same meta schema. [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) is a library that helps with inferring the schema and with other steps in the pipeline. For some examples, please refer to its [documentation](https://CTUAvastLab.github.io/JsonGrinder.jl/stable).
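A rough sketch of the typical pipeline might look as follows (a hypothetical outline based on [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) examples; the sample documents and model sizes are made up, and the exact API may differ between versions):

```julia
using JSON, JsonGrinder, Mill, Flux

documents = JSON.parse.([
    """{"name": "foo", "ports": [80, 443]}""",
    """{"name": "bar", "ports": [22]}""",
])

sch = schema(documents)                     # infer a schema from the parsed documents
extractor = suggestextractor(sch)           # build an extractor from the schema
ds = reduce(catobs, extractor.(documents))  # extract documents into a Mill.jl structure
model = reflectinmodel(ds, d -> Dense(d, 10))
model(ds)
```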

6 changes: 3 additions & 3 deletions docs/src/examples/musk/musk.md
@@ -26,7 +26,7 @@ nothing #hide

### Loading the data

Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
* a matrix of features, where each column is one instance:

````@example musk
@@ -64,7 +64,7 @@ y_oh = onehotbatch(y, 1:2)

### Model construction

Once the data are in the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:

````@example musk
model = BagModel(
@@ -84,7 +84,7 @@ model(ds)

### Training

Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:

````@example musk
opt_state = Flux.setup(Adam(), model);
6 changes: 3 additions & 3 deletions docs/src/examples/musk/musk_literate.jl
@@ -21,7 +21,7 @@ using Random; Random.seed!(42);

# ### Loading the data

# Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
# * a matrix of features, where each column is one instance:
fMat = load("musk.jld2", "fMat")
# * the ids of samples (*bags* in MIL terminology) specifying the bag to which each instance (column in `fMat`) belongs:
@@ -42,7 +42,7 @@ y_oh = onehotbatch(y, 1:2)

# ### Model construction

# Once the data are in the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
model = BagModel(
Dense(166, 50, Flux.tanh),
SegmentedMeanMax(50),
@@ -56,7 +56,7 @@ model(ds)

# ### Training

# Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:

opt_state = Flux.setup(Adam(), model);

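# A hedged sketch of a simple training loop (not part of the original file); the loss
# function and the number of epochs below are illustrative assumptions only:
loss(m, x, y) = Flux.logitcrossentropy(m(x), y)
for epoch in 1:10
    Flux.train!(loss, model, [(ds, y_oh)], opt_state)
end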
7 changes: 4 additions & 3 deletions docs/src/index.md
@@ -26,9 +26,10 @@ Julia v1.9 or later is required.

For the quickest start, see the [Musk](@ref) example.

* [Motivation](@ref): a brief introduction to the philosophy of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
* [Manual](@ref Nodes): a brief tutorial on [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
* [Examples](@ref Musk): some examples of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) use
* [External tools](@ref HierarchicalUtils.jl): examples of integration with other packages
* [Public API](@ref Aggregation): extensive API reference
* [References](@ref): related literature
21 changes: 15 additions & 6 deletions docs/src/manual/aggregation.md
@@ -26,7 +26,8 @@ Different choice of operator, or their combinations, are suitable for different
a_{\max}(\{x_1, \ldots, x_k\}) = \max_{i = 1, \ldots, k} x_i
```

where ``\{x_1, \ldots, x_k\}`` are all instances of the given bag. In [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), the operator is constructed this way:

```@repl aggregation
a_max = SegmentedMax(d)
@@ -86,7 +87,11 @@ Whereas non-parametric aggregations do not use any parameter, parametric aggrega
a_{\operatorname{lse}}(\{x_1, \ldots, x_k\}; r) = \frac{1}{r}\log \left(\frac{1}{k} \sum_{i = 1}^{k} \exp({r\cdot x_i})\right)
```

With different values of ``r``, LSE behaves differently, and in fact both the max and the mean operators are limiting cases of LSE. If ``r`` is very small, the output approaches a simple mean; on the other hand, if ``r`` is a large number, LSE becomes a smooth approximation of the max function. Naively implementing the definition above may lead to numerical instabilities; the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation, however, is numerically stable.
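As a purely illustrative sketch (not the actual [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) code), the usual way to keep the computation stable is the max-shift trick, which ensures that `exp` is never applied to a large argument:

```julia
# Stable LSE of one bag {x_1, …, x_k} for a given r, following the definition above.
function lse_stable(x::AbstractVector, r::Real)
    m = maximum(r .* x)
    (m + log(sum(exp.(r .* x .- m)) / length(x))) / r
end
```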

```@repl aggregation
a_lse = SegmentedLSE(d)
@@ -101,7 +106,7 @@ a_lse(X, bags)
a_{\operatorname{pnorm}}(\{x_1, \ldots, x_k\}; p, c) = \left(\frac{1}{k} \sum_{i = 1}^{k} \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
```

Again, the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation is numerically stable.

```@repl aggregation
a_pnorm = SegmentedPNorm(d)
@@ -119,7 +124,8 @@ a = AggregationStack(a_mean, a_max)
a(X, bags)
```

For the most common combinations, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides some convenience definitions:

```@repl aggregation
SegmentedMeanMax(d)
@@ -138,7 +144,8 @@ a_{\operatorname{mean}}(\{(x_i, w_i)\}_{i=1}^k) = \frac{1}{\sum_{i=1}^k w_i} \su
a_{\operatorname{pnorm}}(\{x_i, w_i\}_{i=1}^k; p, c) = \left(\frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i\cdot\vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
```

This is done in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) by passing an additional parameter:

```@repl aggregation
w = Float32.([1.0, 0.2, 0.8, 0.5])
@@ -173,7 +180,9 @@ Otherwise, [`WeightedBagNode`](@ref) behaves exactly like the standard [`BagNode

For some problems, it may be beneficial to use the size of the bag directly and feed it to subsequent layers. To do this, wrap an instance of [`AbstractAggregation`](@ref) or [`AggregationStack`](@ref) in the [`BagCount`](@ref) type.

In the aggregation phase, bag count appends one more element, which stores the bag size, to the output after all operators are applied. Furthermore, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) performs a mapping ``x \mapsto \log(x) + 1`` on top of that (so a bag with, e.g., three instances contributes ``\log(3) + 1 \approx 2.1``):

```@repl aggregation
a_mean_bc = BagCount(a_mean)
30 changes: 20 additions & 10 deletions docs/src/manual/custom.md
@@ -5,12 +5,18 @@ using Flux

## Custom nodes

[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) data nodes are lightweight wrappers around data, such as `Array`, `DataFrame`, and others. It is of course possible to define custom data (and model) nodes. A useful abstraction for implementing custom data nodes suitable for most cases is [`LazyNode`](@ref), which you can easily use to extend the functionality of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl).

### Unix path example

Let's define a custom node type for representing path names in Unix and one custom model type for processing it. [`LazyNode`](@ref) serves as a boilerplate for simple extension of the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) ecosystem. We start by defining an example of such a node:

```@repl custom
ds = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
@@ -20,10 +26,11 @@ Entirely new type is not needed, because we can dispatch on the first type param
`:Path` "tag" in this case defines a special kind of [`LazyNode`](@ref). Consequently, we can define
multiple variations of custom [`LazyNode`](@ref) without any conflicts in dispatch.

As a next step, we extend the [`Mill.unpack2mill`](@ref) function, which always takes one [`LazyNode`](@ref) and produces an arbitrary [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. We will represent individual file and directory names (as obtained by `splitpath`) using an [`NGramMatrix`](@ref) representation and, for simplicity, the whole path as a bag of individual names:

```@example custom
function Mill.unpack2mill(ds::LazyNode{:Path})
@@ -69,10 +76,13 @@ pm(ds)
The solution using [`LazyNode`](@ref) is sufficient in most scenarios. For other cases, it is recommended to equip custom nodes with the following functionality:

* allow nesting (if needed)
* implement [`Mill.subset`](@ref) and optionally `Base.getindex` to obtain subsets of observations. [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) already defines [`Mill.subset`](@ref) for common datatypes, which can be used.
* allow concatenation of nodes with [`catobs`](@ref). Optionally, implement `reduce(catobs, ...)` as well to avoid excessive compilations if the number of arguments varies a lot
* define a specialized method for `MLUtils.numobs`, which we can however import directly from [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl).
* register the custom node with [`HierarchicalUtils.jl`](@ref) to obtain pretty printing, iterators and other functionality

Here is an example of a custom node with the same functionality as in the [Unix path example](@ref)
section:
10 changes: 8 additions & 2 deletions docs/src/manual/leaf_data.md
@@ -59,7 +59,9 @@ hosts = [
]
```

[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) offers an `n`-gram histogram-based representation for strings. To get started, we pass the vector of strings into the constructor of [`NGramMatrix`](@ref):

```@repl leafs
hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
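# assumed meaning of the arguments above: n-gram length n = 3, alphabet size b = 256,
# and the resulting histogram is hashed into m = 7 buckets (the feature dimension)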
@@ -139,4 +141,8 @@ gradient(m -> sum(m(ds)), m)
!!! ukn "Numerical features"
    To put all numerical features into one [`ArrayNode`](@ref) is a design choice. We could as well introduce more keys in the final [`ProductNode`](@ref). The model treats these two cases slightly differently (see the [Nodes](@ref) section).

This dummy example illustrates the versatility of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). With little to no preprocessing, we are able to process complex hierarchical structures and avoid manually designing feature extraction procedures. For a more involved study on processing Internet traffic with [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), see for example [Pevny2020](@cite).
11 changes: 8 additions & 3 deletions docs/src/manual/missing.md
@@ -17,7 +17,10 @@ and many other possible reasons. At the same time, it is wasteful to throw away
2. Empty bags with no instances in a [`BagNode`](@ref)
3. An entire key missing in a [`ProductNode`](@ref)

At the moment, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is capable of handling the first two cases. The solution always involves an additional vector of parameters (always denoted by `ψ`) that are used during model evaluation to substitute the missing values. Parameters `ψ` can be either fixed or learned during training. Everything is done automatically.
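As a quick, hedged preview of the first case (the bag structure and model below are made up; empty bags are treated in detail in the next section):

```julia
# The second bag (0:-1) is empty; during evaluation the aggregation substitutes its
# learnable vector ψ for the missing instance-level outputs.
ds = BagNode(ArrayNode(randn(Float32, 2, 3)), [1:2, 0:-1, 3:3])
m = reflectinmodel(ds, d -> Dense(d, 4), SegmentedMean)
m(ds)  # one column per bag; the second one is derived from ψ
```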

## Empty bags

@@ -99,7 +102,8 @@ Storing missing strings in [`NGramMatrix`](@ref) is straightforward:
missing_ngrams = NGramMatrix(["foo", missing, "bar"], 3, 256, 5)
```

When some values of categorical variables are missing, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) defines a new type for their representation:

```@repl missing
missing_categorical = maybehotbatch([missing, 2, missing], 1:5)
@@ -187,7 +191,8 @@ Here, `[pre_imputing]Dense` and `[post_imputing]Dense` are standard dense layers
dense = m.ms[1].m; typeof(dense.weight)
```

Inside [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), we add a special definition of `Base.show` for these types for compact printing.

The [`reflectinmodel`](@ref) method uses types to determine whether imputing is needed or not. Compare the following:

13 changes: 8 additions & 5 deletions docs/src/manual/more_on_nodes.md
@@ -56,14 +56,15 @@ ds = BagNode(ProductNode((BagNode(randn(Float32, 4, 10),
[1:1, 2:3, 4:5])
```

When data and model trees become complex, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) limits the printing. To inspect the whole tree, use `printtree`:

```@repl more_on_nodes
printtree(ds)
```

Instead of defining a model manually, we can also make use of [Model reflection](@ref), another [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) functionality, which simplifies model creation:

```@repl more_on_nodes
m = reflectinmodel(ds, d -> Dense(d, 2), SegmentedMean)
@@ -72,7 +73,8 @@ m(ds)

## Node conveniences

To make the handling of data and model hierarchies easier, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides several tools. Let's set up some data:

```@repl more_on_nodes
AN = ArrayNode(Float32.([1 2 3 4; 5 6 7 8]))
@@ -95,7 +97,8 @@ numobs(PN)

### Indexing and Slicing

Indexing in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) operates **on the level of observations**:

```@repl more_on_nodes
AN[1]