-
Notifications
You must be signed in to change notification settings - Fork 3
/
OPM_application_nominal.R
245 lines (153 loc) · 15.6 KB
/
OPM_application_nominal.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
---
jupyter: ir
---
{{< include macros.qmd >}}
{{< include macros_populations_variates.qmd >}}
[First let's load some libraries and functions that we shall use in this example: the [`tplotfunctions.R`](code/tplotfunctions.R) and the [`optimal_predictor_machine_nominal.R`](code/optimal_predictor_machine_nominal.R) scripts, which should be `source`d from an R console:]{.small}
```{r}
source('code/tplotfunctions.R')
source('code/optimal_predictor_machine_nominal.R')
```
```{r}
#| echo: false
options(repr.plot.width=6*sqrt(2), repr.plot.height=6)
```
# [The optimal predictor machine in action (for nominal quantities)]{.red}
## Example population and data
Consider the following population, which we consider to be exchangeable:
- [*Units*: glass fragments collected at particularly defined crime scenes.]{.small}
- [*Variates*:]{.small}
+ [$\vRI$, ordinal, domain $\set{1,\dotsc,4}$: Refractive Index of the fragment]{.small}
+ [$\vNa$, ordinal, domain $\set{1,\dotsc,4}$: Natrium content of the fragment]{.small}
+ [$\vMg$, ordinal, domain $\set{1,\dotsc,4}$: Magnesium content of the fragment]{.small}
+ [$\vAl$, ordinal, domain $\set{1,\dotsc,4}$: Aluminium content of the fragment]{.small}
+ [$\vSi$, ordinal, domain $\set{1,\dotsc,4}$: Silicon content of the fragment]{.small}
+ [$\vK$, ordinal, domain $\set{1,\dotsc,4}$: Potassium content of the fragment]{.small}
+ [$\vCa$, ordinal, domain $\set{1,\dotsc,4}$: Calcium content of the fragment]{.small}
+ [$\vBa$, ordinal, domain $\set{1,\dotsc,4}$: Barium content of the fragment]{.small}
+ [$\vFe$, ordinal, domain $\set{1,\dotsc,4}$: Iron content of the fragment]{.small}
+ [$\vType$, nominal, domain $\set{\cat{T1},\dotsc,\cat{T7}}$: Type or origin of the glass fragment]{.small}
[The values for the $\vRI$ and content variates represent ranges of numeric values or percentages, which you can find in the metadata file [`glass_metadata-4_lev.csv`](datasets/glass_metadata-4_lev.csv). In the same file you also find the description of the glass types.]{.small}
We have a sample of 214 units from this population; their variate values are stored in the file [`glass_data-4_lev.csv`](datasets/glass_data-4_lev.csv). Here are the first five units:
```{r}
#| fig-cap: "Glimpse of the glass-fragment dataset"
print(head(fread('datasets/glass_data-4_lev.csv'), 5))
```
\
Now we imagine to prepare an agent for drawing inferences about the full population of glass fragments -- which also means fragments from *future* or *unsolved* crime scenes. The agent uses -- or is the embodiment of -- the universal exchangeable-inference machine.
## The agent's initial state of knowledge
### Load the machine's apparatus
The calls
```
source('code/tplotfunctions.R')
source('code/optimal_predictor_machine_nominal.R')
```
at the beginning of this chapter loaded several `R` functions that implement the universal machine: they draw inferences, calculate marginal and conditional probabilities, and plot probability distributions.
### Learning the background information
Our agent at the moment doesn't know anything at all, not even about the existence of the population above. If we were to ask it anything, we would just get a blank stare back.
Let us give it the basic background information about the population: the variates' names and domains. We do this through the function `finfo()`: it has a `data` argument, which we omit for the moment, and a `metadata` argument. The latter can simply be the name of the file containing the metadata (NB: this file must have a specific format):
```{r}
priorknowledge <- finfo(metadata='datasets/glass_metadata-4_lev.csv')
```
The agent now possesses this basic background knowlege, encoded in the `priorknowledge` object. The encoding uses a particular mathematical representation which, however, is of no interest to us^[If you're curious you can have a glimpse at it with the command `str(priorknowledge)`, which displays structural information about an object.]. Other representations could also be used, but the knowledge would be the same. Think of this as encoding an image into a `png` or other lossless format: the representation of the image would be different, but the image would be the same.
### Preliminary inferences about the population
Now the agent knows about the population, variates, and domains. But it has not seen any data, that is, the variate values for some units. Yet we can ask it some questions and to draw some inferences. Remember that *the answer to a question* is not just a value: it *is the collection of all possible values, with a probability assigned to each*. If the actual value is known, then it will have probability `1`, and all others probability `0`.
Let's ask the agent: what is the marginal frequency distribution for the variate $\vType$, in the *full* (infinite!) population? Obviously the agent doesn't know what the actual distribution is, nor do we. It will calculate a probability distribution over all possible marginal frequency distributions.
This probability distribution for the $\vType$ variate is calculated by the function `fmarginal()`. It has arguments `finfo`: the agent's information; and `variates`: the names of the variates of which we want the marginal frequencies:
```{r}
priorknowledge_type <- fmarginal(finfo=priorknowledge, variates='Type')
```
The answer is stored in the object `priorknowledge_type`, which now contains only information pertinent to the $\vType$ variate.
We would like to visualize this probability distribution over marginal frequency distributions. A complication is that we would need infinite dimensions to visualize this faithfully. One approximate way to represent this probability distribution is by showing, say, 100 representative samples from it. The idea is the same as for a scatter plot ([§ @sec-represent-dens]). In this case we would then have 100 different frequency distributions for the variate $\vType$.
The function `plotsamples1D()` does this kind of visual representation. It has arguments `finfo`: the object encoding the probability distribution; `n` (default 100): the number of samples to show; and `predict`, which for the moment we set to `FALSE` and discuss in a moment.
How do you think this probability distribution will look like? what kind of marginal frequencies we do expect in the full population?
```{r}
plotsamples1D(finfo=priorknowledge_type, n=100, predict=FALSE)
```
You see that anything goes: Some frequency distributions give frequency almost `1` to a specific value, and almost `0` to the others. Other frequency distributions spread out the frequencies more evenly, with some peaks here or there.
This is a meaningful answer, because the agent hasn't seen any data. From its point of view, everything is possible in this population.
[@@ TODO: add representation as quantiles]{.small .grey}
### Preliminary inferences about units
Up to now the agent has drawn an inference regarding the full population, not regarding any specific unit. Now let's ask it: what will be the value of the $\vType$ variate in the next unit, or glass fragment, we observe? As usual, an answer consists in a probability distribution over all possible values.
:::{.callout-caution}
## {{< fa user-edit >}} Exercise
Before continuing, ask yourself the same question: which probabilities would you give to the $\vType$ values for the next unit, given that you only know that this quantity has seven possible values?
:::
The agent's answer this time is a probability distribution over seven values, which we can draw faithfully. The function `plotsamples1D()` can draw this probability as well, if we give the argument `predict=TRUE` (default):
```{r}
plotsamples1D(finfo=priorknowledge_type)
```
This plot shows the [probability distribution]{.blue} for the next unit in [blue]{.blue}, together with a sample of 100 possible frequency distributions for the $\vType$ variate over the full population. Note that samples are drawn anew every time, so they can look somewhat differently from time to time.^[To have reproducible plots, use `set.seed(314)` (or any integer you like) before calling the plot function.]
The agent's answer is that in the next unit we can observe any $\vType$ value with equal probability. Do you think this is a reasonable answer?
The plot above and the information it represents are very useful for inference purposes: not only they give the probability for the next observation, but also an idea of how such a probability could change in the future, as more data are collected and knowledge of the population's frequency distribution becomes more precise.
:::{.callout-caution}
## {{< fa user-edit >}} Exercise
Inspect the agent's inferences for other variates.
:::
## The agent learns from data
### Learning from the sample data
Now let's give the agent the data from the sample of 214 glass fragments. This is done again with the `finfo()` function, but providing the `data` argument, which can be the name of the data file:
```{r}
postknowledge <- finfo(data='datasets/glass_data-4_lev.csv', metadata='datasets/glass_metadata-4_lev.csv')
```
The `postknowledge` object contains the agent's knowledge from the metadata and the sample data. This object can be used in the same way as the object representing the agent's background knowledge.
### Inferences about the population
Now that the agent has learned from the data, we can ask it again what is the marginal frequency distribution for the variate $\vType$, in the full population.
We calculate the probability for the possible marginal frequency distributions, and then plot it as a set of 100 representative samples:
```{r}
postknowledge_type <- fmarginal(finfo=postknowledge, variates='Type')
plotsamples1D(finfo=postknowledge_type, predict=FALSE)
```
This plot shows two important aspects of this probability distribution and of the agent's current state of knowledge:
- Not anything goes anymore. Some frequency distributions are clearly "excluded", or more precisely they are *extremely* improbable. The most probable frequency distributions have a maximum at the $\cat{T2}$ value, another lower peak for the value $\cat{T4}$, and several other qualitative features that can be glimpsed from the plot.
- Yet, the agent still has a degree of uncertainty, qualitatively shown by the width of the "bands" of frequency distributions. For example, the frequency for the $\cat{T2}$ value could be $0.40$ as well as $0.30$
[@@ TODO: add code with function that reports the exact probabilities of the frequencies.]{.small .grey}
And there are other more specific aspects that can be found by visual inspection. For instance:
- Some frequency distributions have their absolute maximum at $\cat{T1}$, and a lower value at $\cat{T2}$. Vice versa, others have their absolute maximum at $\cat{T2}$ and a lower value at $\cat{T1}$. So there's still uncertainty as to which value is the most frequent in the full population.
- The agent gives a very small but non-zero frequency to the value $\cat{T7}$. Yet, *the data have no units at all with the $\cat{T7}$ value*. Even if the agent has never seen this value in the data it was given, it knows nevertheless, from the metadata, that this is a possible value. So the agent doesn't dogmatically say that its frequency in the full population should be zero (as in the sample); only that it should be extremely low.
:::{.callout-caution}
## {{< fa user-edit >}} Exercise
Given that the value $\cat{T7}$ has not been observed in 214 units, what approximate upper bound to its frequency would you give? Does the agent's inference agree with your intuition?
:::
### Inferences about units
Finally we ask the agent what $\vType$ value we should observe in the next glass fragment. The probability distribution answering this question is plotted by the same function with the argument `predict=TRUE` (default), as before:
```{r}
#| label: fig-unconditional-glass
#| fig-cap: "[Frequency distributions for full population]{.grey}, and [probability distribution for next unit]{.blue}"
plotsamples1D(finfo=postknowledge_type)
```
## Conditional, "discriminative" or "supervised-learning" inferences
The inferences about a new units that the agent has made so far were of an "unsupervised-learning" or "generative" kind ([§ sec-3-connection-ML]): the agent did not receive or use any partial information about a new unit. Let's now try a "supervised-learning" or "discriminative" kind of inference.
Imagine that we are at a new crime scene, a glass fragment is recovered, and tests are made about its refractive index and chemical composition. The following values are found, referred to the levels of our variates:
:::{.column-margin}
![](glass_crimescene.jpg){width=100%}
:::
```{r}
newfragment <- c(RI=2, Na=2, Mg=3, Al=2, Si=3, K=1, Ca=2, Ba=1, Fe=1)
```
<!-- 4 lvl -->
<!-- newfragment <- c(RI=2, Na=2, Mg=4, Al=2, Si=2, K=1, Ca=2, Ba=1, Fe=1) -->
<!-- 5 lvl -->
<!-- newfragment <- c(RI=2, Na=2, Mg=4, Al=3, Si=3, K=1, Ca=2, Ba=1, Fe=1) 5lvl-->
The detectives would like to know what's the possible origin of this fragment, that is, its $\vType$. Our agent can draw this inference.
First, the agent can calculate the probability distribution over the *conditional frequencies* ([§ @sec-conditional-freqs]) of the $\vType$ values for the subpopulation ([§ @sec-subpopulations]) of units having the specific variate values above. This calculation is done with the function `fconditional()`, with arguments `finfo`: the agent's current knowledge, and `unitdata`: the partial data obtained from the unit.
```{r}
condknowledge <- fconditional(finfo=postknowledge, unitdata=newfragment)
condknowledge_type <- fmarginal(finfo=condknowledge, variates='Type')
```
The `condknowledge` object contains the agent's knowledge conditional on the variates given; this knowledge is about the remaining variates, which in this case are the single variate $\vType$ (so the `fmarginal()` calculation is actually redundant in this case).
Second, the agent can calculate the probability distribution for the $\vType$ values of this particular glass fragment, given the above information.
Both inferences can be visualized in the usual way:
```{r}
#| label: fig-conditional-glass
#| fig-cap: "[Conditional frequency distributions for full population]{.grey}, and [conditional probability distribution for next unit]{.blue}"
plotsamples1D(finfo=condknowledge_type)
```
The agent thus gives a probability around $80\%$ to the fragment's being of $\cat{T1}$ type, around $10\%$ of being $\cat{T2}$ type, and around $5\%$ of being $\cat{T5}$ type. It also shows that further training data could change these probabilities by even $\pm 10\%$ or even $\pm 15\%$.
Note how the possible conditional frequency distributions for $\vType$ and the probability distribution differ from the unconditional ones shown in [fig. @fig-unconditional-glass]. The global maximum, $80\%$, is now sharper than the one for the unconditional case, $35\%$. This means that knowledge of the other variates for the present fragment has decreased the agent's uncertainty as regards its type.
Should the crime investigation proceed on the assumption that the fragment is of $\cat{T1}$ type? No, not necessarily. The best decision depends on the gains and costs involved in making correct or wrong assumptions. To this last decision-making problem we turn next.
:::{.callout-caution}
## {{< fa user-edit >}} Exercise
- Perform the "discriminative" inferences above, but omitting from the `newfragment` data *one* variate in turn; so first omit only $\vRI$ and do the inferences, then omit only $\vNa$ and do the inferences, and so on.
- In each of these inferences you will find that the agent becomes more ore less "confident" about the fragment type. Which of the variates above seem to be most important for making a confident inference?
:::