Commit 57c65bf: docs improvements

simonmandlik committed Jun 6, 2024
1 parent 8c04db2 commit 57c65bf
Showing 18 changed files with 118 additions and 60 deletions.
13 changes: 10 additions & 3 deletions docs/src/examples/gnn.md
@@ -31,7 +31,10 @@ Furthermore, let's assume that each vertex is described by three features stored
X = ArrayNode(randn(Float32, 3, 10))
```

We use [`ScatteredBags`](@ref) from [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) to encode the neighbors of each vertex. In other words, each vertex is described by a bag of its neighbors. This information is conveniently stored in the `fadjlist` field of `g`, so the bags can be constructed as:

```@repl gnn
b = ScatteredBags(g.fadjlist)
@@ -83,7 +86,9 @@ end
nothing # hide
```

As is the case with the whole of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), even this graph neural network is properly integrated with the [`Flux.jl`](https://fluxml.ai) ecosystem and supports automatic differentiation:

```@example gnn
zd = 4
@@ -100,6 +105,8 @@ gnn(g, X, 5)
gradient(m -> m(g, X, 5) |> sum, gnn)
```

The above implementation is surprisingly general, as it supports an arbitrarily rich description of vertices. For simplicity, we used only vectors in `X`; however, any [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) hierarchy is applicable.

To put different weights on edges, one can use [Weighted aggregation](@ref).
8 changes: 7 additions & 1 deletion docs/src/examples/jsons.md
@@ -11,5 +11,11 @@

# Processing JSONs

Processing JSONs is actually one of the main motivations for building [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). As a matter of fact, with [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) one can process a set of valid JSON documents that follow the same meta schema. [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) is a library that helps with inferring the schema and with other steps in the pipeline. For some examples, please refer to its [documentation](https://CTUAvastLab.github.io/JsonGrinder.jl/stable).
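A rough sketch of the typical pipeline might look as follows (a hypothetical outline based on [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) examples; the sample documents and model sizes are made up, and the exact API may differ between versions):

```julia
using JSON, JsonGrinder, Mill, Flux

documents = JSON.parse.([
    """{"name": "foo", "ports": [80, 443]}""",
    """{"name": "bar", "ports": [22]}""",
])

sch = schema(documents)                     # infer a schema from the parsed documents
extractor = suggestextractor(sch)           # build an extractor from the schema
ds = reduce(catobs, extractor.(documents))  # extract documents into a Mill.jl structure
model = reflectinmodel(ds, d -> Dense(d, 10))
model(ds)
```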

6 changes: 3 additions & 3 deletions docs/src/examples/musk/musk.md
@@ -26,7 +26,7 @@ nothing #hide

### Loading the data

Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
* a matrix of features, where each column is one instance:

````@example musk
@@ -64,7 +64,7 @@ y_oh = onehotbatch(y, 1:2)

### Model construction

Once the data are in the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:

````@example musk
model = BagModel(
@@ -84,7 +84,7 @@ model(ds)

### Training

Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:

````@example musk
opt_state = Flux.setup(Adam(), model);
6 changes: 3 additions & 3 deletions docs/src/examples/musk/musk_literate.jl
@@ -21,7 +21,7 @@ using Random; Random.seed!(42);

# ### Loading the data

# Now we load the dataset and transform it into a [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. The `musk.jld2` file contains...
# * a matrix of features, where each column is one instance:
fMat = load("musk.jld2", "fMat")
# * the ids of samples (*bags* in MIL terminology) specifying the bag to which each instance (column in `fMat`) belongs:
@@ -42,7 +42,7 @@ y_oh = onehotbatch(y, 1:2)

# ### Model construction

# Once the data are in the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) internal format, we will manually create a model. [`BagModel`](@ref) is designed to implement a basic multi-instance learning model utilizing two feed-forward networks with an aggregation operator in between:
model = BagModel(
Dense(166, 50, Flux.tanh),
SegmentedMeanMax(50),
@@ -56,7 +56,7 @@ model(ds)

# ### Training

# Since [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is entirely compatible with [`Flux.jl`](https://fluxml.ai), we can use its `Adam` optimizer:

opt_state = Flux.setup(Adam(), model);

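# A hedged sketch of a simple training loop (not part of the original file); the loss
# function and the number of epochs below are illustrative assumptions only:
loss(m, x, y) = Flux.logitcrossentropy(m(x), y)
for epoch in 1:10
    Flux.train!(loss, model, [(ds, y_oh)], opt_state)
end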
7 changes: 4 additions & 3 deletions docs/src/index.md
@@ -26,9 +26,10 @@ Julia v1.9 or later is required.

For the quickest start, see the [Musk](@ref) example.

* [Motivation](@ref): a brief introduction to the philosophy of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
* [Manual](@ref Nodes): a brief tutorial on [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl)
* [Examples](@ref Musk): some examples of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) use
* [External tools](@ref HierarchicalUtils.jl): examples of integration with other packages
* [Public API](@ref Aggregation): extensive API reference
* [References](@ref): related literature
21 changes: 15 additions & 6 deletions docs/src/manual/aggregation.md
@@ -26,7 +26,8 @@ Different choice of operator, or their combinations, are suitable for different
a_{\max}(\{x_1, \ldots, x_k\}) = \max_{i = 1, \ldots, k} x_i
```

where ``\{x_1, \ldots, x_k\}`` are all instances of the given bag. In [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), the operator is constructed this way:

```@repl aggregation
a_max = SegmentedMax(d)
@@ -86,7 +87,11 @@ Whereas non-parametric aggregations do not use any parameter, parametric aggrega
a_{\operatorname{lse}}(\{x_1, \ldots, x_k\}; r) = \frac{1}{r}\log \left(\frac{1}{k} \sum_{i = 1}^{k} \exp({r\cdot x_i})\right)
```

With different values of ``r``, LSE behaves differently, and in fact both the max and the mean operators are limiting cases of LSE. If ``r`` is very small, the output approaches a simple mean; on the other hand, if ``r`` is a large number, LSE becomes a smooth approximation of the max function. Naively implementing the definition above may lead to numerical instabilities; the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation, however, is numerically stable.
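As a purely illustrative sketch (not the actual [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) code), the usual way to keep the computation stable is the max-shift trick, which ensures that `exp` is never applied to a large argument:

```julia
# Stable LSE of one bag {x_1, …, x_k} for a given r, following the definition above.
function lse_stable(x::AbstractVector, r::Real)
    m = maximum(r .* x)
    (m + log(sum(exp.(r .* x .- m)) / length(x))) / r
end
```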

```@repl aggregation
a_lse = SegmentedLSE(d)
@@ -101,7 +106,7 @@ a_lse(X, bags)
a_{\operatorname{pnorm}}(\{x_1, \ldots, x_k\}; p, c) = \left(\frac{1}{k} \sum_{i = 1}^{k} \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
```

Again, the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) implementation is numerically stable.

```@repl aggregation
a_pnorm = SegmentedPNorm(d)
@@ -119,7 +124,8 @@ a = AggregationStack(a_mean, a_max)
a(X, bags)
```

For the most common combinations, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides some convenience definitions:

```@repl aggregation
SegmentedMeanMax(d)
@@ -138,7 +144,8 @@ a_{\operatorname{mean}}(\{(x_i, w_i)\}_{i=1}^k) = \frac{1}{\sum_{i=1}^k w_i} \su
a_{\operatorname{pnorm}}(\{x_i, w_i\}_{i=1}^k; p, c) = \left(\frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i\cdot\vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}
```

This is done in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) by passing an additional parameter:

```@repl aggregation
w = Float32.([1.0, 0.2, 0.8, 0.5])
@@ -173,7 +180,9 @@ Otherwise, [`WeightedBagNode`](@ref) behaves exactly like the standard [`BagNode

For some problems, it may be beneficial to use the size of the bag directly and feed it to subsequent layers. To do this, wrap an instance of [`AbstractAggregation`](@ref) or [`AggregationStack`](@ref) in the [`BagCount`](@ref) type.

In the aggregation phase, bag count appends one more element, which stores the bag size, to the output after all operators are applied. Furthermore, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) performs a mapping ``x \mapsto \log(x) + 1`` on top of that (so a bag with, e.g., three instances contributes ``\log(3) + 1 \approx 2.1``):

```@repl aggregation
a_mean_bc = BagCount(a_mean)
30 changes: 20 additions & 10 deletions docs/src/manual/custom.md
@@ -5,12 +5,18 @@ using Flux

## Custom nodes

[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) data nodes are lightweight wrappers around data, such as `Array`, `DataFrame`, and others. It is of course possible to define custom data (and model) nodes. A useful abstraction for implementing custom data nodes suitable for most cases is [`LazyNode`](@ref), which you can easily use to extend the functionality of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl).

### Unix path example

Let's define a custom node type for representing path names in Unix and one custom model type for processing it. [`LazyNode`](@ref) serves as a boilerplate for simple extension of the [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) ecosystem. We start by defining an example of such a node:

```@repl custom
ds = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
@@ -20,10 +26,11 @@ Entirely new type is not needed, because we can dispatch on the first type param
`:Path` "tag" in this case defines a special kind of [`LazyNode`](@ref). Consequently, we can define
multiple variations of custom [`LazyNode`](@ref) without any conflicts in dispatch.

As a next step, we extend the [`Mill.unpack2mill`](@ref) function, which always takes one [`LazyNode`](@ref) and produces an arbitrary [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) structure. We will represent individual file and directory names (as obtained by `splitpath`) using an [`NGramMatrix`](@ref) representation and, for simplicity, the whole path as a bag of individual names:

```@example custom
function Mill.unpack2mill(ds::LazyNode{:Path})
@@ -69,10 +76,13 @@ pm(ds)
The solution using [`LazyNode`](@ref) is sufficient in most scenarios. For other cases, it is recommended to equip custom nodes with the following functionality:

* allow nesting (if needed)
* implement [`Mill.subset`](@ref) and optionally `Base.getindex` to obtain subsets of observations. [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) already defines [`Mill.subset`](@ref) for common datatypes, which can be used.
* allow concatenation of nodes with [`catobs`](@ref). Optionally, implement `reduce(catobs, ...)` as well to avoid excessive compilations if the number of arguments varies a lot
* define a specialized method for `MLUtils.numobs`, which we can however import directly from [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl).
* register the custom node with [`HierarchicalUtils.jl`](@ref) to obtain pretty printing, iterators and other functionality

Here is an example of a custom node with the same functionality as in the [Unix path example](@ref)
section:
10 changes: 8 additions & 2 deletions docs/src/manual/leaf_data.md
@@ -59,7 +59,9 @@ hosts = [
]
```

[`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) offers an `n`-gram histogram-based representation for strings. To get started, we pass the vector of strings into the constructor of [`NGramMatrix`](@ref):

```@repl leafs
hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
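# assumed meaning of the arguments above: n-gram length n = 3, alphabet size b = 256,
# and the resulting histogram is hashed into m = 7 buckets (the feature dimension)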
@@ -139,4 +141,8 @@ gradient(m -> sum(m(ds)), m)
!!! ukn "Numerical features"
    To put all numerical features into one [`ArrayNode`](@ref) is a design choice. We could as well introduce more keys in the final [`ProductNode`](@ref). The model treats these two cases slightly differently (see the [Nodes](@ref) section).

This dummy example illustrates the versatility of [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl). With little to no preprocessing, we are able to process complex hierarchical structures and avoid manually designing feature extraction procedures. For a more involved study on processing Internet traffic with [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), see for example [Pevny2020](@cite).
11 changes: 8 additions & 3 deletions docs/src/manual/missing.md
@@ -17,7 +17,10 @@ and many other possible reasons. At the same time, it is wasteful to throw away
2. Empty bags with no instances in a [`BagNode`](@ref)
3. An entire key missing in a [`ProductNode`](@ref)

At the moment, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) is capable of handling the first two cases. The solution always involves an additional vector of parameters (always denoted by `ψ`) that are used during model evaluation to substitute the missing values. Parameters `ψ` can be either fixed or learned during training. Everything is done automatically.
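As a quick, hedged preview of the first case (the bag structure and model below are made up; empty bags are treated in detail in the next section):

```julia
# The second bag (0:-1) is empty; during evaluation the aggregation substitutes its
# learnable vector ψ for the missing instance-level outputs.
ds = BagNode(ArrayNode(randn(Float32, 2, 3)), [1:2, 0:-1, 3:3])
m = reflectinmodel(ds, d -> Dense(d, 4), SegmentedMean)
m(ds)  # one column per bag; the second one is derived from ψ
```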

## Empty bags

@@ -99,7 +102,8 @@ Storing missing strings in [`NGramMatrix`](@ref) is straightforward:
missing_ngrams = NGramMatrix(["foo", missing, "bar"], 3, 256, 5)
```

When some values of categorical variables are missing, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) defines a new type for their representation:

```@repl missing
missing_categorical = maybehotbatch([missing, 2, missing], 1:5)
@@ -187,7 +191,8 @@ Here, `[pre_imputing]Dense` and `[post_imputing]Dense` are standard dense layers
dense = m.ms[1].m; typeof(dense.weight)
```

Inside [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl), we add a special definition of `Base.show` for these types for compact printing.

The [`reflectinmodel`](@ref) method uses types to determine whether imputing is needed or not. Compare the following:

13 changes: 8 additions & 5 deletions docs/src/manual/more_on_nodes.md
@@ -56,14 +56,15 @@ ds = BagNode(ProductNode((BagNode(randn(Float32, 4, 10),
[1:1, 2:3, 4:5])
```

When data and model trees become complex, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) limits the printing. To inspect the whole tree, use `printtree`:

```@repl more_on_nodes
printtree(ds)
```

Instead of defining a model manually, we can also make use of [Model reflection](@ref), another [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) functionality, which simplifies model creation:

```@repl more_on_nodes
m = reflectinmodel(ds, d -> Dense(d, 2), SegmentedMean)
@@ -72,7 +73,8 @@ m(ds)

## Node conveniences

To make the handling of data and model hierarchies easier, [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) provides several tools. Let's set up some data:

```@repl more_on_nodes
AN = ArrayNode(Float32.([1 2 3 4; 5 6 7 8]))
@@ -95,7 +97,8 @@ numobs(PN)

### Indexing and Slicing

Indexing in [`Mill.jl`](https://github.com/CTUAvastLab/Mill.jl) operates **on the level of observations**:

```@repl more_on_nodes
AN[1]