# [Beyond machine learning]{.green} {#sec-beyond-ML}
{{< include macros.qmd >}}
{{< include macros_prob_inference.qmd >}}
{{< include macros_connection-3.qmd >}}
## Machine learning from a bird's-eye view {#sec-ML-birds-eye}
The last few chapters gave a brief introduction to and overview of popular machine-learning methods, their terminology, and the points of view that they typically adopt. Now let's try to look at them keeping in mind our main goal in this course: [exploring new inference methods, understanding their foundations, and thinking out of the box](index.html).
In this and the next few chapters we shall focus on the following question: [*to what purpose do we use machine-learning algorithms?*]{.yellow} After answering this question and clarifying what the purpose is, we shall try to achieve it in an *optimal way*, according to the methods and concepts we studied in the initial part of the course. Remember that they are guaranteed to give the optimal solution ([chapter @sec-framework]). But we shall keep an eye open to see where our optimal methods seem to be similar or dissimilar to machine-learning methods.
Thereafter, in the last chapters, we shall examine where the optimal solution and machine-learning methods converge and diverge, try to understand what machine-learning methods do from the point of view of our optimal solution, and think of ways to improve them.
## A task-oriented categorization of some machine-learning problems {#sec-cat-problems}
For our goal, the common machine-learning categorization and terminology discussed in [chapter @sec-ml-introduction] are somewhat inadequate. Distinctions such as "supervised learning" vs "unsupervised learning" are of secondary importance to a data engineer (as opposed to a ["data mechanic"](preface.html)) for several reasons:
- {{< fa shuffle >}}\ \ They group together some types of tasks that are actually quite different from an inferential or decision-making viewpoint; and conversely they separate types of tasks that are quite similar.
- {{< fa bullseye >}}\ \ They focus on procedures rather than on purposes.
The important questions for us, in fact, are: [*What do we wish to infer or choose?*]{.blue} and [*From which kind of information?*]{.green} These questions define the problem we want to solve. <!-- The procedure may then be chosen depending on the theory, resources and technologies, other contingent factors, and so on. -->
<!-- It's somewhat like saying that the difference between car and aeroplane is that the latter has wings. Sure -- but *why?* The focus on wings misses the essential difference between these two means of transportation: they operate through different material media and exploit different kinds of physics; that's why the second has wings. -->
Let's introduce a different categorization of the kind of tasks that we want to accomplish; a categorization that tries to focus on the purpose or task, and on the types of desired information and of available information, rather than on the procedure.
The categorization below <!-- , of the types of *task* that machine-learning algorithms try to solve, --> is informal. It only provides a starting point from which to examine a new kind of task we may face. Many tasks will fall in between categories: every data-engineering or data-science problem is unique.
\
We *exclude* from the start all tasks that require an agent to continuously and actively interact with its environment for acquiring information, making choices, getting feedback, and so on. Clearly these tasks are the domain of Decision Theory in its most complex form, with ramified decisions, *strategies*, and possibly the interaction with other decision-making agents. To explore and analyse this kind of task is beyond the purpose of this course.
::::{.column-margin}
::: {.callout-tip}
## {{< fa rocket >}} For the extra curious
- [*Decision Analysis*](https://hvl.instructure.com/courses/28605/modules)
- Chapters 16--18 in [*Artificial Intelligence*](https://hvl.instructure.com/courses/28605/modules)
- [*Games and Decisions*](https://hvl.instructure.com/courses/28605/modules)
:::
::::
\
We focus on tasks where multiple "instances" with similar characteristics are involved, and the agent has some question related to a "new instance".<!-- , possibly to be repeated an indefinite number of times. --> According to the conceptual framework developed in part [Data II]{.yellow}, we can view these "instances" as *units* of a practically infinite population. The "characteristics" that the agent has observed or must guess are *variates* common to all these units.
:::{.column-margin}
Remember that you can adopt any terminology you like. If you prefer "instance" and "characteristics" or some other words to "unit" and "variate", then use them. What's important is that you understand the ideas and methods behind these words.
:::
### New unit: given vs generated
A first distinction can be made between
- [{{< fa sign-out-alt >}} {{< fa cube >}}\ \ Tasks where an agent must itself *generate* a new unit]{.yellow}
- [{{< fa cube >}} {{< fa question >}}\ \ Tasks where a new unit is given to an agent, who must *guess* some of its variates]{.green}
An example of the first type of task is image generation: an algorithm is given a collection of images and is asked to generate a new image based on them.
We shall see that these two types of task are actually quite close to each other, from the point of view of Decision Theory and Probability Theory.
:::{.small .midgrey}
The terms "discriminative" and "generative" are sometimes associated in machine learning with the two types of task. This association, however, is quite loose, because some tasks typically called "generative" actually belong to the second type. We shall therefore avoid these terms. It's enough to keep in mind the distinction between the two types of task above.
:::
### Guessing variates: all or some
Focusing on the second type of task (a new unit is given to the agent), we can further divide it into two subtypes:
- [{{< fa regular star >}}\ \ The agent must guess *all* variates of the new unit]{.purple}
- [{{< fa star-half-alt >}}\ \ The agent must guess *some* variates of the new unit, but can observe other variates of the new unit]{.red}
An example of the first subtype of task is the "urgent vs non-urgent" problem of [§@sec-conditional-joint-sim]: having observed incoming patients, some of whom were urgent and some non-urgent, the agent must guess whether the next incoming patient will be urgent or not. No other kinds of information (transport, patient characteristics, or others) are available.
We shall call [**predictands**]{.blue}^[literally "what has to be predicted"] the variates that the agent must guess in a new unit, and [**predictors**]{.blue} those that the agent can observe.^[In machine learning and other fields, the terms "dependent variable", "class" or "label" (for nominal variates) are often used for "predictand"; and the terms "independent variable" or "features" are often used for "predictor".] The first subtype above can be viewed as a special case of the second where all variates are predictands, and there are no predictors.
:::{.small .midgrey}
The terms "unsupervised learning" and "supervised learning" are sometimes associated in machine learning with these two subtypes of task. But the association is loose and can be misleading. "Clustering" tasks, for example, are usually called "unsupervised" but they are examples of the second subtype above, where the agent has some predictors.
:::
### Information available in previous units
Finally we can further divide the second subtype above into two or three subsubtypes, depending on the information available to the agent about *previous units*:
- [{{< fa star-half-alt >}} {{< fa star-half-alt >}}\ \ All predictors and predictands of previous units are known to the agent]{.blue}
- [{{< fa star-half >}} {{< fa star-half >}}\ \ All predictors of previous units, but not the predictands, are known to the agent]{.lightblue}
- [{{< fa regular star-half >}} {{< fa regular star-half >}}\ \ All predictands of previous units, but not the predictors, are known to the agent]{.midgrey}
\
An example of the first subsubtype of task is image classification. The agent is for example given the following 128 × 128-pixel images and character-labels from the [One Punch Man](https://onepunchman.fandom.com) series:
![](saitama_images.png){width=100%}
and is then given one new 128 × 128-pixel image:
![](saitama_new.png){width=128 fig-align="center"}
of which it must guess the character-label.
In the example just given, the image is the predictor, the character-label is the predictand.
\
A slight modification of the example above gives us a task of the second subsubtype. A different agent is given the images above, *but without labels*:
![](saitama_images_nolabels.png){width=100%}
and must then guess some kind of "label" or "group" for the new image above; and possibly also for the images already given. The kind of "group" requested depends on the specific problem.
In this example the image is the predictor, and the label or group is the predictand. The difference from the previous example is that the agent doesn't have the predictand values of previous units.
:::{.small .midgrey}
The term "supervised learning" typically refers to the first subsubtype above.
The term "unsupervised learning" can refer to the second subsubtype, for instance in "clustering" tasks. In a clustering task, the agent tries to guess which group or "cluster" a unit belongs to, given a collection of similar units, whose groups are not known either. The cluster effectively is the *predictand* variate. In some cases the agent may want to guess the cluster not only of a new unit, but also of all previous units.
The third subsubtype is very rarely considered in machine learning, yet it is not an unrealistic task.
:::
The types, subtypes, and subsubtypes above are obviously neither mutually exclusive nor exhaustive. We can easily imagine scenarios where an agent has some predictors & predictands available about *some* previous units, but only predictors or only predictands available for other previous units. This scenario falls in between the three subsubtypes above. In machine learning, hybrid situations like these are categorized as "missing data" or "imputation".
## Flexible categorization using probability theory {#sec-categ-probtheory}
We have been speaking about the agent's guessing the values of some variates. "Guessing" means that there's a state of *uncertainty*: the agent can't simply say something like "the value of the label is `Saitama`", because that could be false. Uncertainty means that the most honest thing that the agent can do is to express *degrees of belief* about each of the possible values. Probability theory enters the scene.
In fact it turns out that the categorization above into subtypes and subsubtypes of tasks can be presented in a more straightforward and flexible way using probability-theory notation.
### Notation
First let's introduce some symbol conventions to be used in the next chapters.
- We shall denote with $\bZ$ all variates that are of interest to the agent: those to be guessed as well as those that may be already known.
- The variates to be guessed in a new unit (the predictands) will be collectively denoted with $\bY$.
- The variates that can be observed in a new unit (the predictors) will be collectively denoted with $\bX$. In cases where there are no predictors, $\bX$ is empty.
Therefore we have $\bZ = (\bY \and \bX)$. In cases where there are no predictors we have $\bZ = \bY$.
- $\bZ_i$\ \ denote all variates for unit [#$i$.]{.m}
- $\bY_i$\ \ denote all predictands for unit [#$i$.]{.m}
- $\bX_i$\ \ denote all predictors for unit [#$i$.]{.m}
As usual we number from\ \ $i=1$\ \ to\ \ $i=N$\ \ the units that serve for learning, and\ \ $i=N+1$\ \ is the *new* unit of interest to the agent.
Recall ([§@sec-basic-elements-inference]) that in probability notation
$$\P(\text{\red\small[proposal]}\|\text{\green\small[conditional]} \and \yI)$$
the [proposal]{.red} contains what the agent's belief is about, and the [conditional]{.green} contains what's supposed to be known to the agent, together with the background [information $\yI$.]{.m}
\
Finally let's see how to express different typologies of tasks in probability notation.
\
### [{{< fa regular star >}}]{.purple}\ \ The agent must guess *all* variates of the new unit
This kind of guess is represented by the probability distribution
$$
\P(\red
Z_{N+1}\mo z
\black\|
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \and \yI)
$$
for all possible values $\red z$ in the domain of $\bZ$. The specific values $\green z_N, \dotsc, z_1$ of the variate $\bZ$ for the previous units are known to the agent.
\
### [{{< fa star-half-alt >}}]{.red}\ \ The agent must guess *some* variates of the new unit, having observed other variates of the new unit
This kind of guess is represented by the probability distribution
$$
\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
\dotsb \,
\black \and \yI)
$$
for all possible values $\red y$ in the domain of the predictands $\bY$. The value $\green x$ of the predictors $\bX$ for the new unit is known to the agent.
The remaining information "$\dotsb$" contained in the conditional depends on the subsubtype of task:
\
#### [{{< fa star-half-alt >}} {{< fa star-half-alt >}}]{.blue}\ \ All predictors and predictands of previous units are known to the agent
This corresponds to the probability distribution
$$
\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \and \yI)
$$
for all possible $\red y$. All information about predictands $\bY$ and predictors $\bX$ for previous units appears in the conditional.
In the example with image classification, a pictorial representation of this probability would be
![](saitama_example2.png){width=100%}
where ${\red y} \in \set{\red\cat{Saitama}, \cat{Fubuki}, \cat{Genos}, \cat{MetalBat}, \dotsc \black}$.
\
#### [{{< fa star-half >}} {{< fa star-half >}}]{.lightblue}\ \ All predictors of previous units, but not their predictands, are known to the agent
This corresponds to the probability distribution
$$
\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
X_{N}\mo x_{N}
\and \dotsb\and
X_{1}\mo x_{1}
\black \and \yI)
$$
for all possible $\red y$. All information about predictors $\bX$ for the previous units, but *not* that about their predictands $\bY$, appears in the conditional.
\
### More general and hybrid tasks
Consider a task that doesn't fit into any of the types discussed above: The agent wants to guess the predictands for a new unit, say #3, after observing that its predictors have value $\green x$. Of two previous units, the agent knows the predictor value $\green x_1$ of the first, and the predictand value $\green y_2$ of the second. This task is expressed by the probability
$$
\P(\red
Y_{3}\mo y
\black\|
\green
X_{3}\mo x
\, \and\,
Y_{2}\mo y_{2}
\and
X_{1}\mo x_{1}
\black \and \yI)
$$
\
::::{.column-page-inset-right}
:::{.callout-caution}
## {{< fa user-edit >}} Exercises
- Write down the general probability expression for the task of subsubtype "[all predictands of previous units, but not their predictors, are known to the agent]{.midgrey}".
- What kind of task does the following probability express?
$$
\P(\red
Y_{N+1}\mo y_{N+1}
\and Y_{N}\mo y_{N}
\and \dotsb \and
Y_{2}\mo y_{2}\and
Y_{1}\mo y_{1}
\black\|
\green
X_{N+1}\mo x_{N+1}
\and X_{N}\mo x_{N}
\and \dotsb\and
X_{2}\mo x_{2}
\and
X_{1}\mo x_{1}
\black \and \yI)
$$
What kind of task could it represent in machine-learning terminology?
:::
::::
### [{{< fa sign-out-alt >}} {{< fa cube >}}]{.yellow}\ \ Tasks where an agent must itself *generate* a new unit
Our very first categorization included the task of generating a new unit, given previous examples. In this kind of task there are possible alternatives that the agent could generate. How should one alternative be chosen? A moment's thought shows that the *probabilities* for the possible alternatives should enter the choice.
Suppose, as a very simple example, that a generative agent has been shown, in an unsystematic order, 30 copies of the symbol [{{< fa regular circle-up >}}]{.green} and 10 copies of the symbol [{{< fa regular circle-down >}}]{.yellow}, and is asked to generate a new symbol out of these examples. Intuitively we expect that it will generate [{{< fa regular circle-up >}}]{.green}, but we cannot and don't want to exclude the possibility that it will generate [{{< fa regular circle-down >}}]{.yellow}. These two generation possibilities should simply have different probabilities and, in the long run, appear with different frequencies.
Also in this kind of task, therefore, we have the probability distribution
$$
\P(\red
Z_{N+1}\mo z
\black\|
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \and \yI)
$$
the difference from before is that the sentence $\red Z_{N+1}\mo z$ represents not the hypothesis that a *given* new unit has value $\red z$, but the possibility of *generating* a new unit with that value. In other words, the symbol "$\mo$" here means "*must be set to...*" rather than "*would be observed to be...*". Remember the discussion and warnings in [§@sec-sentence-notation]?
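
The symbol example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probability values are hypothetical (we simply *assume* that the agent's degrees of belief equal the observed relative frequencies, 30/40 and 10/40, which the text does not prescribe), and the names `probabilities` and `generate_unit` are made up for this sketch.

```python
import random

# Hypothetical degrees of belief about the new unit, ASSUMED here to equal
# the observed relative frequencies (30 "up" symbols, 10 "down" symbols).
probabilities = {"circle-up": 30 / 40, "circle-down": 10 / 40}

def generate_unit(probs, rng):
    """Generate one new unit by drawing a value according to the probabilities."""
    values = list(probs)
    weights = [probs[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
generated = [generate_unit(probabilities, rng) for _ in range(10_000)]

# In the long run both symbols appear, with different frequencies.
frequency_up = generated.count("circle-up") / len(generated)
print(frequency_up)
```

The agent does not always generate the more probable symbol: the less probable one still appears, roughly a quarter of the time under the assumed probabilities.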
\
Our general conclusion is this:
:::{.callout-note style="font-size:120%"}
##
::::{style="font-size:120%"}
Probability distributions such as those discussed above should intrinsically enter all types of machine-learning algorithms.
::::
:::
This is the condition for machine-learning algorithms to be optimal and self-consistent. The less an algorithm satisfies that condition, the less optimal and less consistent it is.
\
## The underlying distribution {#sec-underlying-distribution}
A remarkable feature of all the probabilities discussed in the above task categorization is that they can all be calculated from *one* and the same probability distribution. We briefly discussed and used this feature in [chapter @sec-learning].
A conditional probability such as $\P(\se{\red A}\|\se{\green B} \and \yI)$ can always be written, by the `and`-rule, as the ratio of two probabilities:
$$
\P(\se{\red A}\|\se{\green B} \and \yI)
=
\frac{
\P(\se{\red A}\and \se{\green B} \| \yI)
}{
\P(\se{\green B} \| \yI)
}
$$
Therefore we have, for the probabilities of some of the tasks above,
:::{.column-page-inset-right}
$$
\begin{aligned}
&\P(\red
Z_{N+1}\mo z
\black\|
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \and \yI)
=
\frac{
\P(\red
Z_{N+1}\mo z
\black\and
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \| \yI)
}{
\P(
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \| \yI)
}
\\[2em]
&\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \and \yI)
\\[2ex]
&\qquad{}=
\frac{
\P(\red
Y_{N+1}\mo y
\black\and
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \| \yI)
}{
\P(
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \| \yI)
}
\\[2em]
&\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
X_{N}\mo x_{N}
\and \dotsb\and
X_{1}\mo x_{1}
\black \and \yI)
\\[2ex]
&\qquad{}=
\frac{
\P(\red
Y_{N+1}\mo y
\black\and
\green
X_{N+1}\mo x
\, \and\,
X_{N}\mo x_{N}
\and \dotsb\and
X_{1}\mo x_{1}
\black \| \yI)
}{
\P(
\green
X_{N+1}\mo x
\, \and\,
X_{N}\mo x_{N}
\and \dotsb\and
X_{1}\mo x_{1}
\black \| \yI)
}
\end{aligned}
$$
:::
\
We also know the marginalization rule ([chapter @sec-marginal-probs]): any quantity $\yellow C$ with values $\yellow c$ can be introduced into the proposal of a probability via the `or`-rule:
$$
\P( {\green\boldsymbol{\dotsb}} \| \yI) =
\sum_{\yellow c}\P({\yellow C\mo c} \and {\green\boldsymbol{\dotsb}} \| \yI)
$$
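
As a small numerical check of the two rules just recalled, here is a sketch in Python. The joint probabilities are made-up numbers for two hypothetical quantities $A$ and $B$ with two values each; the function names are likewise only illustrative. Exact fractions are used so that no rounding enters.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities P(A=a and B=b | I), made up for illustration.
joint = {
    ("a1", "b1"): F(1, 10), ("a1", "b2"): F(3, 10),
    ("a2", "b1"): F(2, 10), ("a2", "b2"): F(4, 10),
}
assert sum(joint.values()) == 1  # a probability distribution must sum to 1

def marginal_B(b):
    """Marginalization: P(B=b | I) = sum over a of P(A=a and B=b | I)."""
    return sum(p for (_, b_), p in joint.items() if b_ == b)

def conditional_A_given_B(a, b):
    """`and`-rule: P(A=a | B=b and I) = P(A=a and B=b | I) / P(B=b | I)."""
    return joint[(a, b)] / marginal_B(b)

print(marginal_B("b1"))                   # 3/10
print(conditional_A_given_B("a1", "b1"))  # 1/3
```

Both results are obtained from the joint distribution alone, by a sum and a ratio.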
Using the marginalization rule we find these final expressions for the probabilities of some machine-learning tasks discussed so far:
::::{.column-page-right}
:::{.callout-note}
##
- [{{< fa regular star >}}]{.purple} Guess all variates:
$$
\P(\red
Z_{N+1}\mo z
\black\|
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \and \yI)
=
\frac{
\P(\red
Z_{N+1}\mo z
\black\and
\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \| \yI)
}{
\sum_{\purple z}
\P(
\red
Z_{N+1}\mo {\purple z}
\black\and\green
Z_{N}\mo z_{N}
\and \dotsb \and
Z_{1}\mo z_{1}
\black \| \yI)
}
$$
\
- [{{< fa star-half-alt >}} {{< fa star-half-alt >}}]{.blue} All previous predictors and predictands known:
$$
\begin{aligned}
&\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \and \yI)
\\[2ex]
&\qquad{}=
\frac{
\P(\red
Y_{N+1}\mo y
\black\and
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \| \yI)
}{
\sum_{\purple y}
\P(\red
Y_{N+1}\mo {\purple y}
\black\and
\green
X_{N+1}\mo x
\, \and\,
Y_{N}\mo y_{N}
\and
X_{N}\mo x_{N}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \| \yI)
}
\end{aligned}
$$
\
- [{{< fa star-half >}} {{< fa star-half >}}]{.lightblue} Previous predictors known, previous predictands unknown:
$$
\begin{aligned}
&\P(\red
Y_{N+1}\mo y
\black\|
\green
X_{N+1}\mo x
\, \and\,
X_{N}\mo x_{N}
\and \dotsb\and
X_{1}\mo x_{1}
\black \and \yI)
\\[2ex]
&\quad{}=
\frac{
\sum_{\yellow y_{N}, \dotsc, y_{1}}
\P(\red
Y_{N+1}\mo y
\black\and
\green
X_{N+1}\mo x
\black\, \and\,
\yellow
Y_{N}\mo y_{N}
\black\and\green
X_{N}\mo x_{N}
\black\and \dotsb\and
\yellow
Y_{1}\mo y_{1}
\black\and\green
X_{1}\mo x_{1}
\black \| \yI)
}{
\sum_{{\purple y}, \yellow y_{N}, \dotsc, y_{1}}
\P(\red
Y_{N+1}\mo {\purple y}
\black\and
\green
X_{N+1}\mo x
\black\, \and\,
\yellow
Y_{N}\mo y_{N}
\black\and\green
X_{N}\mo x_{N}
\black\and \dotsb\and
\yellow
Y_{1}\mo y_{1}
\black\and\green
X_{1}\mo x_{1}
\black \| \yI)
}
\end{aligned}
$$
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- &\P(\red -->
<!-- Y_{4}\mo y -->
<!-- \black\| -->
<!-- \green -->
<!-- X_{4}\mo x -->
<!-- \, \and\, -->
<!-- Y_{3}\mo y_{3} -->
<!-- \and X_{3}\mo x_{3} -->
<!-- \and \ -->
<!-- X_{2}\mo x_{2} -->
<!-- \and -->
<!-- Y_{1}\mo y_{1} -->
<!-- \and X_{1}\mo x_{1} -->
<!-- \black \and \yI) -->
<!-- \\[2ex] -->
<!-- &\quad{}= -->
<!-- \frac{ -->
<!-- \sum_{\yellow y_{2}} -->
<!-- \P(\red -->
<!-- Y_{4}\mo y -->
<!-- \black\and -->
<!-- \green -->
<!-- X_{4}\mo x -->
<!-- \, \and\, -->
<!-- Y_{3}\mo y_{3} -->
<!-- \and X_{3}\mo x_{3} -->
<!-- \and -->
<!-- {\yellow Y_{2}\mo y_{2}} -->
<!-- \and -->
<!-- X_{2}\mo x_{2} -->
<!-- \and -->
<!-- Y_{1}\mo y_{1} -->
<!-- \and X_{1}\mo x_{1} -->
<!-- \black \| \yI) -->
<!-- }{ -->
<!-- \sum_{\purple y} -->
<!-- \sum_{\yellow y_{2}} -->
<!-- \P(\red -->
<!-- Y_{4}\mo {\purple y} -->
<!-- \black\and -->
<!-- \green -->
<!-- X_{4}\mo x -->
<!-- \, \and\, -->
<!-- Y_{3}\mo y_{3} -->
<!-- \and X_{3}\mo x_{3} -->
<!-- \and -->
<!-- {\yellow Y_{2}\mo y_{2}} -->
<!-- \and -->
<!-- X_{2}\mo x_{2} -->
<!-- \and -->
<!-- Y_{1}\mo y_{1} -->
<!-- \and X_{1}\mo x_{1} -->
<!-- \black \| \yI) -->
<!-- } -->
<!-- \end{aligned} -->
<!-- $$ -->
:::
::::
**All these formulae, even for hybrid tasks, involve sums and ratios of only one distribution:**
$$\boldsymbol{
\P(\blue
Y_{N+1}\mo y_{N+1}
\and
X_{N+1}\mo x_{N+1}
\and \dotsb \and
Y_{1}\mo y_{1}
\and
X_{1}\mo x_{1}
\black \| \yI)
}
$$
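
To make this concrete, here is a toy sketch in Python with one binary predictand, one binary predictor, and $N=2$ previous units. The joint distribution is an assumption made only for illustration: an equal-weight mixture of two hypothetical populations (`POPULATIONS`). All function names are made up for this sketch. What matters is that the three task probabilities are computed from the single function `joint`, by sums and ratios only.

```python
from fractions import Fraction as F
from itertools import product

# Possible values z = (y, x) of one unit: binary predictand y, binary predictor x.
VALUES = [(y, x) for y in (0, 1) for x in (0, 1)]

# ASSUMPTION (illustration only): two hypothetical populations, each assigning
# probabilities to the values (y, x); the agent gives them equal weight.
POPULATIONS = [
    {(0, 0): F(4, 10), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(4, 10)},
    {(0, 0): F(1, 10), (0, 1): F(4, 10), (1, 0): F(4, 10), (1, 1): F(1, 10)},
]

def joint(z3, z2, z1):
    """P(Z3=z3 and Z2=z2 and Z1=z1 | I): the single underlying distribution."""
    return sum(F(1, 2) * f[z3] * f[z2] * f[z1] for f in POPULATIONS)

def guess_all(z, z2, z1):
    """P(Z3=z | Z2=z2 and Z1=z1 and I)."""
    return joint(z, z2, z1) / sum(joint(zz, z2, z1) for zz in VALUES)

def guess_predictands_known(y, x, z2, z1):
    """P(Y3=y | X3=x and Y2, X2, Y1, X1 and I): previous units fully known."""
    return joint((y, x), z2, z1) / sum(joint((yy, x), z2, z1) for yy in (0, 1))

def guess_predictands_unknown(y, x, x2, x1):
    """P(Y3=y | X3=x and X2=x2 and X1=x1 and I): previous predictands summed over."""
    def total(y3):
        return sum(joint((y3, x), (y2, x2), (y1, x1))
                   for y2, y1 in product((0, 1), repeat=2))
    return total(y) / sum(total(yy) for yy in (0, 1))

print(guess_all((1, 1), (1, 1), (1, 1)))                # 13/34
print(guess_predictands_known(1, 1, (1, 1), (1, 1)))    # 13/17
print(guess_predictands_unknown(1, 1, 1, 1))            # 1/2
```

With only three units the sums are trivial. The number of terms grows exponentially with the number of units whose predictands must be summed over, which is one reason why these computations can become expensive.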
Stop for a moment and contemplate some of the consequences of this remarkable fact:
- [{{< fa arrows-spin >}}\ \ *An agent that can perform one of the tasks above can, in principle, also perform all other tasks.*]{.blue}
This is why a perfect agent, working with probability, in principle does not have to worry about "supervised", "unsupervised", "missing data", "imputation", and similar situations. This also shows what was briefly mentioned before: all these task typologies are much closer to one another than they might appear from the perspective of current machine-learning methods.
:::{.column-margin}
The acronym [*OPM*]{.green} ![](opm_fist2.png){height=2em} can stand for [*Optimal Predictor Machine*]{.green} or [*Omni-Predictor Machine*]{.green}
:::
- [{{< fa microchip >}}\ \ *The probability distribution above encodes the agent's background knowledge and assumptions; different agents differ only in the values of that distribution.*]{.blue}
If two agents yield different probability values in the same task, with the same variates and same training data, the difference must come from the joint probability distribution above. And, since the data given to the two agents are exactly the same, the difference must lie in the agents' background [information $\yI$.]{.m}
- [{{< fa user-secret >}}\ \ *Data cannot "speak for themselves"*]{.blue}
Given some data, we can choose two different joint distributions for these data, and therefore get different results in our inferences and tasks. This means that the data alone cannot determine the result: specific background information and assumptions, whether acknowledged or not, always affect the result.
The qualification "in principle" in the first consequence is important. Some of the sums that enter the formulae above are computationally extremely expensive and, with current technologies and maths techniques, cannot be performed within a reasonable time. But *new technologies and new maths discoveries could make these calculations possible*. This is why a data engineer cannot simply brush them aside and forget them.
As regards the third consequence, we shall see that there are different states of knowledge which can converge to the same results, as the number of training data increases.
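
The second and third consequences, and the remark about convergence, can be illustrated with a rough Python sketch. Everything specific in it is an assumption made for illustration: a single binary variate with values "up" and "down", a grid of hypothetical long-run frequencies (`GRID`), and two agents whose different background information is expressed as different initial weights over that grid. The text does not prescribe this particular form of joint distribution.

```python
from fractions import Fraction as F

# Hypothetical candidate long-run frequencies of the value "up".
GRID = [F(k, 20) for k in range(1, 20)]

# Two agents with different background information, ASSUMED to be expressed as
# different (unnormalized) initial weights over the candidate frequencies.
weights_agent1 = {f: F(1) for f in GRID}    # no frequency favoured
weights_agent2 = {f: 1 - f for f in GRID}   # low frequencies favoured

def prob_next_up(weights, n_up, n_down):
    """P(next unit is "up" | n_up ups and n_down downs observed, and I).

    Computed as a ratio of two values of the same joint distribution:
    P(up and data | I) / P(data | I). The normalization of the weights cancels.
    """
    joint_data = sum(w * f**n_up * (1 - f)**n_down for f, w in weights.items())
    joint_up_and_data = sum(w * f**(n_up + 1) * (1 - f)**n_down
                            for f, w in weights.items())
    return joint_up_and_data / joint_data

# Same data for both agents: first few units, then many units.
for n_up, n_down in [(3, 1), (3000, 1000)]:
    p1 = float(prob_next_up(weights_agent1, n_up, n_down))
    p2 = float(prob_next_up(weights_agent2, n_up, n_down))
    print(n_up, n_down, round(p1, 4), round(p2, 4))
```

With four units the two agents give visibly different probabilities from identical data; with four thousand units their probabilities practically coincide.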
:::{.callout-caution}
## {{< fa user-edit >}} Exercise
In a previous example of "hybrid" task we had the probability distribution
$$
\P(\red
Y_{3}\mo y
\black\|
\green
X_{3}\mo x
\, \and\,
Y_{2}\mo y_{2}
\and
X_{1}\mo x_{1}
\black \and \yI)
$$
Rewrite it in terms of the underlying joint distribution.
:::
## Plan for the next few chapters
Our goal in building an "Optimal Predictor Machine" is now clear: we must find a way to
:::{.column-margin}
![](optimal_predictor_machine.png){width=100%}
:::
- *assign* the joint probability distribution above, in such a way that it reflects some reasonable background information
- *encode* the distribution in a computationally useful way
The "encode" goal sounds quite challenging, because the number $N$ of units can in principle be infinite: the joint distribution would then involve infinitely many variates.
In the next [Inference III]{.green} part we shall see that partially solving the "assign" goal actually makes the "encode" goal feasible.
\
One question arises if we now look at machine-learning methods from our Probability Theory perspective. Some machine-learning methods, including many popular ones, don't give us probabilities about values. They return *one* definite value. How do we reconcile this with the probabilistic point of view above? We shall answer this question in full in the last chapters; but a short, intuitive answer can already be given now.
If there are several possible correct answers to a given guess, but a machine-learning algorithm gives us only one answer, then the algorithm must have internally *chosen* one of them. In other words, the machine-learning algorithm is internally doing decision-making. We know from chapters [@sec-framework] and [@sec-basic-decisions] that this process should obey Decision Theory and therefore *must* involve:
- [{{< fa scale-unbalanced-flip >}}\ \ the probabilities of the possible correct answers]{.green}
- [{{< fa sack-dollar >}}\ \ the utilities of the possible answer choices]{.blue}
Non-probabilistic machine-learning algorithms must therefore be approximations of an Optimal Predictor Machine that, after computing probabilities, selects one particular answer by using utilities.
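
A minimal Python sketch of this last idea follows. The probabilities and the two utility functions are hypothetical numbers chosen only for illustration, and the selection rule used is maximization of expected utility; the names `best_choice`, `utility_accuracy`, and `utility_asymmetric` are made up for this sketch.

```python
from fractions import Fraction as F

# Hypothetical probabilities of the possible correct answers for a new image.
probabilities = {"Saitama": F(5, 10), "Genos": F(3, 10), "Fubuki": F(2, 10)}

def utility_accuracy(choice, truth):
    """Hypothetical utilities: 1 for a correct answer, 0 otherwise."""
    return 1 if choice == truth else 0

def utility_asymmetric(choice, truth):
    """Hypothetical utilities: failing to recognize Genos is heavily penalized."""
    if choice == truth:
        return 1
    return -5 if truth == "Genos" else 0

def best_choice(probs, utility):
    """Return the answer with maximal expected utility, and all expected utilities."""
    expected = {c: sum(p * utility(c, t) for t, p in probs.items()) for c in probs}
    return max(expected, key=expected.get), expected

choice_a, expected_a = best_choice(probabilities, utility_accuracy)
choice_b, expected_b = best_choice(probabilities, utility_asymmetric)
print(choice_a)  # Saitama
print(choice_b)  # Genos
```

The probabilities are the same in both cases; only the utilities differ, and so does the single answer that is returned. An algorithm that outputs just one value has such a choice built into it, whether or not the utilities are stated.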