diff --git a/_freeze/example_opm1/execute-results/html.json b/_freeze/example_opm1/execute-results/html.json index addd2ea..be3d236 100644 --- a/_freeze/example_opm1/execute-results/html.json +++ b/_freeze/example_opm1/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "0cfc28fcca0bd6b17f0d209281ce85bc", + "hash": "413ca53cfc8d9ce1468cb2f7bd079e01", "result": { "engine": "knitr", - "markdown": "# [Example application: adult-income task]{.red} {#sec-example-opm1}\n::: {.hidden}\n\n\n\\providecommand{\\ul}{\\uline}\n\\providecommand{\\and}{\\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}}\n\\renewcommand*{\\|}[1][]{\\nonscript\\:#1\\vert\\nonscript\\:\\mathopen{}}\n\\providecommand*{\\pr}[1]{\\textsf{\\small`#1'}}\n\\renewcommand*{\\pr}[1]{\\textsf{\\small`#1'}}\n\\providecommand*{\\prq}[1]{\\textsf{\\small #1}}\n\\providecommand*{\\se}[1]{\\mathsfit{#1}}\n\\renewcommand{\\se}[1]{\\mathsfit{#1}}\n\\providecommand*{\\sei}[1]{\\mathsfit{\\small #1}}\n\n\\providecommand{\\cat}[1]{{\\small\\verb;#1;}}\n\\providecommand{\\vec}[1]{\\boldsymbol{#1}}\n\\providecommand{\\p}{\\mathrm{p}}\n\\renewcommand{\\p}{\\mathrm{p}}\n\\renewcommand{\\P}{\\mathrm{P}}\n\\definecolor{quarto-callout-note-color}{HTML}{4477AA}\n\\definecolor{quarto-callout-note-color-frame}{HTML}{4477AA}\n\\definecolor{quarto-callout-important-color}{HTML}{AA3377}\n\\definecolor{quarto-callout-important-color-frame}{HTML}{AA3377}\n\\definecolor{quarto-callout-warning-color}{HTML}{EE6677}\n\\definecolor{quarto-callout-warning-color-frame}{HTML}{EE6677}\n\\definecolor{quarto-callout-tip-color}{HTML}{228833}\n\\definecolor{quarto-callout-tip-color-frame}{HTML}{228833}\n\\definecolor{quarto-callout-caution-color}{HTML}{CCBB44}\n\\definecolor{quarto-callout-caution-color-frame}{HTML}{CCBB44}\n\n\\providecommand*{\\mo}[1][=]{\\mathclose{}\\mathord{\\nonscript\\mkern0mu\\textrm{\\small#1}\\nonscript\\mkern0mu}\\mathopen{}}\n\\providecommand*{\\yX}{\\se{X}}\n\\providecommand*{\\yY}{\\se{Y}}\n\\providecommand*{\\yI}{\\se{I}}\n\\providecommand*{\\yi}[1][]{\\se{I}_{\\text{#1}}}\n\\providecommand{\\di}{\\mathrm{d}}\n\\providecommand{\\defd}{\\coloneqq}\n\\providecommand{\\blue}{\\color[RGB]{68,119,170}}\n\\providecommand{\\red}{\\color[RGB]{238,102,119}}\n\\providecommand{\\purple}{\\color[RGB]{170,51,119}}\n\\providecommand{\\green}{\\color[RGB]{34,136,51}}\n\\providecommand{\\yellow}{\\color[RGB]{204,187,68}}\n\\providecommand{\\lblue}{\\color[RGB]{102,204,238}}\n\\providecommand{\\grey}{\\color[RGB]{187,187,187}}\n\\providecommand{\\midgrey}{\\color[RGB]{119,119,119}}\n\\providecommand{\\black}{\\color[RGB]{0,0,0}}\n\\newcommand*{\\e}{\\mathrm{e}}\n\\newcommand*{\\pu}{\\text{π}}\n\\newcommand*{\\RR}{\\mathbf{R}}\n\n$$\n\\DeclarePairedDelimiters{\\set}{\\{}{\\}}\n\\DeclareMathOperator*{\\argmax}{argmax}\n$$\n\n\n\n\n\n\n\n\n:::\n\n::: {.hidden}\n\\providecommand*{\\yon}{{\\green\\cat{on}}}\n\\providecommand*{\\yof}{{\\red\\cat{off}}}\n\\providecommand*{\\yy}{{\\lblue\\cat{Y}}}\n\\providecommand*{\\yn}{{\\yellow\\cat{N}}}\n\\providecommand{\\ypl}{{\\green\\cat{+}}}\n\\providecommand{\\ymi}{{\\red\\cat{-}}}\n\\providecommand{\\ypa}{{\\green\\cat{pass}}}\n\\providecommand{\\yfa}{{\\red\\cat{fail}}}\n\n\n\\providecommand{\\hi}{{\\green\\cat{high}}}\n\\providecommand{\\me}{{\\yellow\\cat{medium}}}\n\\providecommand{\\lo}{{\\red\\cat{low}}}\n\\providecommand*{\\yJ}{\\se{J}}\n\\providecommand{\\yva}{{\\lblue-1}}\n\\providecommand{\\yvb}{{\\midgrey0}}\n\\providecommand{\\yvc}{{\\yellow1}}\n\\providecommand*{\\yK}{\\se{K}}\n\\providecommand*{\\yL}{\\se{L}}\n\n\\providecommand*{\\yR}{R}\n\n\\providecommand*{\\bZ}{{\\blue Z}}\n\\providecommand*{\\bz}{{\\blue z}}\n\\providecommand*{\\rY}{{\\red Y}}\n\\providecommand*{\\bY}{{\\blue Y}}\n\\providecommand*{\\ry}{{\\red y}}\n\\providecommand*{\\gX}{{\\green X}}\n\\providecommand*{\\bX}{{\\blue X}}\n\\providecommand*{\\gx}{{\\green x}}\n\\providecommand*{\\vf}{\\vec{f}}\n\n\\providecommand*{\\yut}{\\se{K}_{\\textsf{3}}}\n\\providecommand*{\\yul}{\\se{K}}\n\n\\providecommand*{\\bA}{{\\blue A}}\n\\providecommand*{\\bB}{{\\blue B}}\n\\providecommand*{\\bC}{{\\blue C}}\n\n\n\\providecommand*{\\vfa}{\\vf'}\n\\providecommand*{\\vfb}{\\vf''}\n\n:::\n\n\n::: {.hidden}\n\n\\providecommand*{\\data}{\\se{\\green data}}\n\\providecommand*{\\yD}{\\se{I}_{\\textrm{d}}}\n\\providecommand*{\\ya}{k}\n\\providecommand*{\\amin}{\\ya_{\\text{mi}}}\n\\providecommand*{\\amax}{\\ya_{\\text{ma}}}\n\n:::\n\n\n\n\n\n\n\n\n\n::: {.cell}\n\n:::\n\n\nLet's illustrate the example workflow described in [§@sec-opm-workflow] with a toy, but not too simplistic, example, based on the [adult-income dataset](https://archive.ics.uci.edu/dataset/2/adult).\n\n\nAll code functions and data files are in the directory\\\n[https://github.com/pglpm/ADA511/tree/master/code/OPM-nominal](https://github.com/pglpm/ADA511/tree/master/code/OPM-nominal)\n\nWe start loading the R libraries and functions needed at several stages. You need to have installed^[This is done with the [`install.packages()` function](https://rdrr.io/r/utils/install.packages.html).] the packages [*extraDistr*](https://cran.r-project.org/package=extraDistr) and [*foreach*](https://cran.r-project.org/package=foreach). Make sure you have saved all source files and data files in the same directory.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary('extraDistr')\nlibrary('foreach')\n\nsource('tplotfunctions.R')\nsource('guessmetadata.R')\nsource('buildagent.R')\nsource('infer.R')\nsource('decide.R')\nsource('mutualinfo.R')\nsource('rF.R')\nsource('plotFsamples1D.R')\n\noptions(repr.plot.width = 6 * sqrt(2), repr.plot.height = 6)\n```\n:::\n\n\n\n\\\n\n## Define the task\n\nThe main task is to infer whether a USA citizen earns less (≤) or more (>) than USD 50 000/year, given a set of characteristics of that citizen. In view of later workflow stages, let's note a couple of known and unknown facts to delimit this task in a more precise manner:\n\n- Given the flexibility of the agent we shall use, we can generalize the task: to infer any subset of the set of characteristics, given any other subset. In other words, we can choose the predictand and predictor variates for any new citizen. Later on we shall also extend the task to making a concrete decision, based on utilities relevant to that citizen.\n \n\tThis flexibility is also convenient because no explanation is given as to *what purpose* the income should be guessed.\n\n- The training data come from a 1994 census, and our agent will use an exchangeable belief distribution about the population. The value of the USD and the economic situation of the country changes from year to year, as well as the informational relationships between economic and demographic factors. For this reason the agent should be used to draw inferences about at most one or two years around 1994. Beyond such time range the exchangeability assumption is too dubious and risky.\n\n- The [USA population in 1994 was around 260 000 000](https://www.macrotrends.net/countries/USA/united-states/population), and we shall use around 11 000 training data. The population size can therefore be considered approximately infinite.\n\n\\\n\n## Collect & prepare background info\n\n### Variates and domains\n\nThe variates to be used must be of nominal type, because our agent's background beliefs (represented by the Dirichlet-mixture distribution) are only appropriate for nominal variates. In this toy example we simply discard all original non-nominal variates. These included some, such as age, that would surely be relevant for this task. As a different approach, we could have coarsened each non-nominal variate into three or four range values, so that treating it as nominal would have been an acceptable approximation.\n\nFirst, create a preliminary metadata file by running the function `guessmetadata()` on the training data [`train-income_data_example.csv`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/train-income_data_example.csv):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nguessmetadata(data = 'train-income_data_example.csv',\n file = 'preliminary.csv')\n```\n:::\n\n\nInspect the resulting file `preliminary.csv` and check whether you can alter it to add additional background information.\n\nAs an example, note that domain of the $\\mathit{native\\_country}$ variate does not include $\\cat{Norway}$ or $\\cat{Sweden}$. Yet it's extremely likely that there were some native Norwegian or Swedish people in the USA in 1994; maybe too few to have been sampled into the training data. Let's add these two values to the list of domain values, and increase the domain size of $\\mathit{native\\_country}$ from 40 to 42. The resulting, updated metadata file has already been saved as [`meta_income_data_example.csv`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/meta_income_data_example.csv).\n\n### Agent's parameters $\\amin, \\amax$\n\nHow many data should the agent learn in order to appreciably change its initial beliefs about the variates above, for the USA 1994 population? Let's put an upper bound at around 1 000 000 (that's roughly 0.5% of the whole population) with $\\amax = 20$, and a lower bound at 1 with $\\amin = 0$; these are the default values. We shall see later what the agent suggests might be a reasonable amount of training data.\n\n\\\n\n## Collect & prepare training data\n\nThe 11 306 training data have been prepared by including only nominal variates, and discarding datapoints with partially missing data (although the function `buildagent()` discards such incomplete datapoints automatically). The resulting file is [`test-income_data_example.csv`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/test-income_data_example.csv).\n\n\\\n\n## Prepare OPM agent\n\nFor the sake of this example we shall prepare two agents with the same background information:\n\n- `opm10`, trained with 10 training datapoints\n\n- `opmall`, trained with all 11 306 training datapoints\n\nPrepare and train each with the `buildagent()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## temporarily load all training data\ntraindata <- read.csv('train-income_data_example.csv', header = TRUE,\n na.strings = '', stringsAsFactors = FALSE, tryLogical = FALSE)\n\n## feed first 10 datapoints to an agent\nopm10 <- buildagent(metadata = 'meta_income_data_example.csv',\n data = traindata[1:10, ])\n\n## delete training data for memory efficiency\nrm(traindata)\n\n\nopmall <- buildagent(metadata = 'meta_income_data_example.csv',\n data = 'train-income_data_example.csv')\n```\n:::\n\n\n\\\n\nWe can peek into the internal structure of these \"agent objects\" with `str()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(opmall)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nList of 4\n $ counts : num [1:7, 1:16, 1:7, 1:14, 1:6, 1:5, 1:2, 1:42, 1:2] 0 0 0 0 0 0 0 0 0 0 ...\n ..- attr(*, \"dimnames\")=List of 9\n .. ..$ workclass : chr [1:7] \"Federal-gov\" \"Local-gov\" \"Private\" \"Self-emp-inc\" ...\n .. ..$ education : chr [1:16] \"10th\" \"11th\" \"12th\" \"1st-4th\" ...\n .. ..$ marital_status: chr [1:7] \"Divorced\" \"Married-AF-spouse\" \"Married-civ-spouse\" \"Married-spouse-absent\" ...\n .. ..$ occupation : chr [1:14] \"Adm-clerical\" \"Armed-Forces\" \"Craft-repair\" \"Exec-managerial\" ...\n .. ..$ relationship : chr [1:6] \"Husband\" \"Not-in-family\" \"Other-relative\" \"Own-child\" ...\n .. ..$ race : chr [1:5] \"Amer-Indian-Eskimo\" \"Asian-Pac-Islander\" \"Black\" \"Other\" ...\n .. ..$ sex : chr [1:2] \"Female\" \"Male\"\n .. ..$ native_country: chr [1:42] \"Cambodia\" \"Canada\" \"China\" \"Columbia\" ...\n .. ..$ income : chr [1:2] \"<=50K\" \">50K\"\n $ alphas : num [1:21] 1 2 4 8 16 32 64 128 256 512 ...\n $ auxalphas: num [1:21] -160706 -157643 -154588 -151547 -148530 ...\n $ palphas : num [1:21] 0 0 0 0 0 0 0 0 0 0 ...\n - attr(*, \"class\")= chr [1:2] \"agent\" \"list\"\n```\n\n\n:::\n:::\n\n\nthis shows that each agent is encoded as a list of four objects:\n\n- the array `counts`, containing the counts $\\green\\# z$\n- the vector `alphas`, containing the values of $2^k$\n- the vector `auxalphas`, containing the (logarithm of) the multiplicative factors ([§@sec-code-computations])\n- the vector `palphas`, containing the updated probabilities about the required amount of training data\n\nThe agent has internally guessed how many training data should be necessary to affect its prior beliefs. We can peek at its guess by plotting the `alphas` parameters against the `palphas` probabilities:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmytplot(x = opmall$alphas, y = opmall$palphas, type = 'b',\n xlim = c(0, 10000), ylim = c(0, NA),\n xlab = 'required number of training data', ylab = 'probability')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nThe most probable amount seems to be of the order of magnitude of 2000 units.\n\nNote that you can see the complete list of variates and their domains by simply calling `dimnames(opmall$counts)` (or any relevant agent-object name instead of `opmall`). Here is the beginning of the list:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(dimnames(opmall$counts))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$workclass\n[1] \"Federal-gov\" \"Local-gov\" \"Private\" \"Self-emp-inc\" \n[5] \"Self-emp-not-inc\" \"State-gov\" \"Without-pay\" \n\n$education\n [1] \"10th\" \"11th\" \"12th\" \"1st-4th\" \"5th-6th\" \n [6] \"7th-8th\" \"9th\" \"Assoc-acdm\" \"Assoc-voc\" \"Bachelors\" \n[11] \"Doctorate\" \"HS-grad\" \"Masters\" \"Preschool\" \"Prof-school\" \n[16] \"Some-college\"\n\n$marital_status\n[1] \"Divorced\" \"Married-AF-spouse\" \"Married-civ-spouse\" \n[4] \"Married-spouse-absent\" \"Never-married\" \"Separated\" \n[7] \"Widowed\" \n\n$occupation\n [1] \"Adm-clerical\" \"Armed-Forces\" \"Craft-repair\" \"Exec-managerial\" \n [5] \"Farming-fishing\" \"Handlers-cleaners\" \"Machine-op-inspct\" \"Other-service\" \n [9] \"Priv-house-serv\" \"Prof-specialty\" \"Protective-serv\" \"Sales\" \n[13] \"Tech-support\" \"Transport-moving\" \n\n$relationship\n[1] \"Husband\" \"Not-in-family\" \"Other-relative\" \"Own-child\" \n[5] \"Unmarried\" \"Wife\" \n\n$race\n[1] \"Amer-Indian-Eskimo\" \"Asian-Pac-Islander\" \"Black\" \"Other\" \n[5] \"White\" \n```\n\n\n:::\n:::\n\n\n\\\n\n## Application and exploration\n\n### Application: only predictands\n\nOur two agents are ready to be applied to new instances.\n\nBefore applying them, let's check some of their inferences, and see if we find anything unconvincing about them. If we find something unconvincing, it means that the background information we provided to the agent doesn't match the one in our intuition. Then there are two or three possibilities: our intuition is misleading us and need correcting; or we need to go back to stage [*Collect & prepare background info*]{.blue} and correct the background information given to the agent; or a combination of these two possibilities.\n\nWe ask the `opm10` agent to forecast the $\\mathit{income}$ of the next unit, using the `infer()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninfer(agent = opm10, predictand = 'income')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nincome\n <=50K >50K \n0.506288 0.493712 \n```\n\n\n:::\n:::\n\n\nThis agent gives a slightly larger probability to the $\\cat{<=50K}$ case. Using the function `plotFsamples1D()` we can also inspect the `opm10`-agent's belief about the frequency distribution of $\\mathit{income}$ for the full population. This belief is represented by a generalized scatter plot of 200 representative frequency distributions, represented as the [light-blue lines]{.lightblue}:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opm10,\n n = 200, # number of example frequency distributions\n predictand = 'income',\n ylim = c(0,1), # y-axis range\n main = 'opm10') # plot title\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nwhere the **black line** is the probability distribution previously calculated with the `infer()` function.\n\nThis plot expresses the `opm10`-agent's belief that future training data might lead to even higher probabilities for $\\cat{<=50K}$. But note that the agent is not excluding the possibility of lower probabilities.\n\n\nLet's visualize the beliefs of the `opmall`-agent, trained with the full training dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opmall, n = 200, predictand = 'income',\n ylim = c(0,1), main = 'opmall')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nThe probability that the next unit has $\\mathit{income}\\mo\\cat{<=50}$ is now above 70%. Also note that the `opmall`-agent doesn't believe that this probability would change appreciably if more training data were provided.\n\n\\\n\nWe can perform a similar exploration for any other variate. Let's take the $\\mathit{race}$ variate for example:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opm10, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opm10',\n cex.axis = 0.75) # smaller axis-font size\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\nNote again how the little-trained `opm10`-agent has practically uniform beliefs. But it's also expressing the fact that future training data will probably increase the probability of $\\mathit{race}\\mo\\cat{White}$.\n\nThis is corroborated by the fully-trained agent:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opmall, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opmall', cex.axis = 0.75)\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n\\\n\nThese checks are satisfying, but it's good to examine their agreement or disagreement with our intuition. Examine the last plot for example. The `opmall` agent has very firm beliefs (no spread in the [light-blue lines]{.lightblue}) about the full-population distribution of $\\mathit{race}$. Do you think its beliefs are too firm, after 11 000 datapoints? would you like the agent to be more \"open-minded\"? In that case you should go back to the [*Collect & prepare background info*]{.blue} stage, and for example modify the parameters $\\amin, \\amax$, then re-check. Or you could even try an agent with a different initial belief distribution.\n\nIn making this kind of considerations it's important to keep in mind what we learned and observed in previous chapters:\n\n:::{.callout-note}\n## \n\n[**Our goal: optimality, not \"success\"**]{.blue}\n\nRemember ([§@sec-optimality] and [§@sec-probability-def]) that a probability represents the *rational* degree of belief that an agent should have *given the particular information available*. We can't judge a probability from the value it assigns to something we later learn to be true -- because according to the information available it could be more rational (and optimal) to consider that something implausible (recall the example in [§@sec-probability-def] of an object falling from the sky as we cross the street).\n\nFrom this point of view we should be wary of comparing the probability of something with our a-posteriori knowledge about it.\n\n\\\n\n[**The data cannot speak for themselves**]{.blue}\n\nWe could build an agent that remains more \"open-minded\" (more spread in the [light-blue lines]{.lightblue}), having received *exactly the same* training data. This \"open-mindedness\" therefore cannot be determined by the training data. Once more this shows that data *cannot* \"speak for themselves\" ([§@sec-underlying-distribution]).\n\n:::\n\n:::{.column-margin}\n![](neverknowsbest.jpg){width=100%}\n:::\n\n\n\\\n\n### Application: specifying predictors\n\nLet's now draw inferences by specifying some predictors.\n\nWe ask the `opm10` agent to forecast the $\\mathit{income}$ of a new unit, given that the unit is known to have $\\mathit{occupation}\\mo\\cat{Exec-managerial}$ and $\\mathit{sex}\\mo\\cat{Male}$ (two predictor variates). What would you expect?\n\nThe `opm10`-agent's belief about the unit -- as well as about the *full subpopulation* ([§@sec-subpopulations]) of units having those predictors -- is shown in the following plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opm10, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opm10')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nNote how the `opm10`-agent still slightly higher probability to $\\mathit{income}\\mo\\cat{<=50}$, but at the same time it is quite uncertain about the subpopulation frequencies; more than if the predictor had not been specified. That is, according to this little-trained agent there could be large variety of possibilities *within this specific subpopulation*.\n\nThe `opmall`-agent's beliefs are shown below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opmall, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opmall')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nit believes with around 55% probability that such a unit would have higher, $\\cat{>50K}$ income. The representative subpopulation-frequency distributions in [light-blue]{.lightblue} indicate that this belief is unlikely to be changed by new training data.\n\n\\\n\nLet's now see an example of our agent's versatility by switching predictands and predictors. We tell the `opmall`-agent that the new unit has $\\mathit{income}\\mo\\cat{>50}$, and ask it to infer the joint variate $(\\mathit{occupation} \\and \\mathit{sex})$; let's present the results in rounded percentages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresult <- infer(agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nround(result * 100, 1) # round to one decimal\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n sex\noccupation Female Male\n Adm-clerical 3.1 4.1\n Armed-Forces 1.0 1.0\n Craft-repair 1.1 9.4\n Exec-managerial 3.6 16.0\n Farming-fishing 1.0 2.1\n Handlers-cleaners 1.0 1.6\n Machine-op-inspct 1.1 3.1\n Other-service 1.6 1.8\n Priv-house-serv 1.0 1.0\n Prof-specialty 4.8 14.7\n Protective-serv 1.0 3.2\n Sales 1.8 10.1\n Tech-support 1.5 3.4\n Transport-moving 1.1 3.7\n```\n\n\n:::\n:::\n\n\nIt returns a 14 × 2 table of joint probabilities. The most probable combinations are $(\\cat{Exec-managerial}, \\cat{Male})$ and $(\\cat{Prof-specialty}, \\cat{Male})$.\n\n### The `rF()` function\n\nThis function generates **full-population** frequency distributions (even for subpopulations) that are probable according to the data. It is used internally by `plotFsamples1D()`, which plots the generated frequency distributions as [light-blue lines]{.lightblue}.\n\nLet's see, as an example, three samples of how the full-population frequency distribution for $\\mathit{sex} \\and \\mathit{income}$ (jointly) could be:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresult <- rF(n = 3, # number of samples\n agent = opmall, \n predictand = c('sex', 'income'))\n\n## name the samples\ndimnames(result)[1] <- list(samples = paste0('#',1:3))\n\n## permute & print so that samples are the last array dimension\nprint(aperm(result) * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n, , sample = #1\n\n sex\nincome Female Male\n <=50K 28.17199 43.5344\n >50K 6.83616 21.4575\n\n, , sample = #2\n\n sex\nincome Female Male\n <=50K 28.18455 43.4979\n >50K 7.24585 21.0717\n\n, , sample = #3\n\n sex\nincome Female Male\n <=50K 28.13035 43.6037\n >50K 7.00727 21.2587\n```\n\n\n:::\n:::\n\n\n\nThese possible full-population frequency distributions can be used to assess how much the probabilities we find could change, if we collected a much, much larger amount of training data. Here is an example:\n\nWe generate 1000 frequency distributions for $(\\mathit{occupation} \\and \\mathit{sex})$ given $\\mathit{income}\\mo\\cat{>50K}$, and then take the standard deviations of the samples as a rough measure of how much the probabilities we calculated a couple of cells above could change:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreqsamples <- rF(n = 1000,\n agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nvariability <- apply(freqsamples,\n c('occupation','sex'), # which dimensions to apply\n sd) # function to apply to those dimensions\n\nround(variability * 100, 2) # round to two decimals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n sex\noccupation Female Male\n Adm-clerical 0.29 0.32\n Armed-Forces 0.16 0.17\n Craft-repair 0.17 0.47\n Exec-managerial 0.31 0.62\n Farming-fishing 0.16 0.23\n Handlers-cleaners 0.17 0.19\n Machine-op-inspct 0.17 0.29\n Other-service 0.20 0.22\n Priv-house-serv 0.16 0.16\n Prof-specialty 0.36 0.56\n Protective-serv 0.16 0.28\n Sales 0.22 0.49\n Tech-support 0.21 0.30\n Transport-moving 0.16 0.31\n```\n\n\n:::\n:::\n\n\nthe agent believes (at around 68%) that the current probability wouldn't change more than about ±0.5%.\n\n\n\\\n\nThe inferences above were partially \nmeant as checks, but we see that we can actually ask our agent a wide variety of questions about the full population, and do all sorts of association studies.\n\n\n:::{.callout-important}\n## {{< fa exclamation-triangle >}} No \"test\" or \"validation\" datasets used or needed\n\nThe tests and explorations above were done without any \"validation\" or \"test\" datasets. This is because our agent is capable of calculating and showing its beliefs about the *full population* -- and therefore about future data.\n\nThe need for validation or test datasets with common machine-learning algorithms arise from the fact that full-population beliefs are hidden or, more commonly, not computed at all, in order to gain speed. The application of the trained machine-learning algorithm to a validation dataset is an approximate way of extracting such beliefs.\n\n:::\n\n\\\n\n### Exploring the population properties: mutual information\n\nIn [§@sec-entropy-mutualinfo] we introduced [*mutual information*]{.blue} as the information-theoretic measure of mutual relevance and association of two quantities or variates. For the present task, the `opmall`-agent can tell us the mutual information between any two sets of variates of our choice, with the function `mutualinfo()`.\n\nFor instance, let's calculate the mutual information between $\\mathit{occupation}$ and $\\mathit{marital\\_status}$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutualinfo(agent = opmall,\n A = 'occupation', B = 'marital_status')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0827823\n```\n\n\n:::\n:::\n\n\nIt is a very low association: knowing either variate decreases the effective number of possible values of the other only $2^{0.0827823\\,\\mathit{Sh}} \\approx 1.06$ times.\n\nNow let's consider a scenario where, in order to save resources, we can use *only one* variate to infer the income. Which of the other variates should we prefer? We can calculate the mutual information between each of them, in turn, and $\\mathit{income}$:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## list of all variates\nvariates <- names(dimnames(opmall$counts))\n\n## list of all variates except 'income'\npredictors <- variates[variates != 'income']\n\n## prepare vector to contain the mutual information\nrelevances <- numeric(length(predictors))\nnames(relevances) <- predictors\n\n## calculate, for each variate,\n## the joint probability distribution 'probs'\n## and then mutual information 'relevance' (in shannons)\n## between 'income' and that variate\nfor(var in predictors){\n relevances[var] <- mutualinfo(agent = opmall, A = 'income', B = var)\n}\n\n## output the mutual informations in decreasing order\nsort(relevances, decreasing = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nmarital_status relationship education occupation workclass \n 0.10074130 0.09046621 0.06332052 0.05506897 0.03002995 \nnative_country sex race \n 0.01925227 0.01456655 0.00870089 \n```\n\n\n:::\n:::\n\n\nIf we had to choose *only one* variate to infer the outcome, on average it would be best to use $\\mathit{marital\\_status}$. Our last choice should be $\\mathit{race}$.\n\n:::{.callout-caution}\n## {{< fa user-edit >}} Exercise\n\nNow consider the scenario where we must *exclude one variate* from the eight predictors, or, equivalently, we can only use seven variates as predictors. Which variate should we exclude?\n\nPrepare a script similar to the one above: it calculates the mutual information between $\\mathit{income}$ and the other predictors but with one omitted, omitting each of the eight in turn.\n\n[*Warning: this computation might require 10 or more minutes to complete.*]{.small .midgrey}\n\n- Which single variate should not be omitted from the predictors? which single variate could be dropped?\n\n- Do you obtain the same relevance ranking as in the \"use-one-variate-only\" scenario above?\n\n:::\n\n\n## Example application to new data\n\nLet's apply the `opmall`-agent to a test dataset with 33 914 new units. For each new unit, the agent:\n\n- calculates the probability of $\\mathit{income}\\mo\\cat{<=50}$, via the function `infer()`, using as predictors all variates except $\\mathit{income}$\n- chooses one of the two values $\\set{\\cat{<=50K}, \\cat{>50K}}$, via the function `decide()` trying to maximizing utilities corresponding to the accuracy metric\n\nThe function `decide()` will be described in more detail in chapters [-@sec-make-decision] and [-@sec-example-opm2].\n\nAt the end we plot a histogram of the probabilities calculated for the new units, to check for instance for how many of the agent was sure (beliefs around 0% or 100%) or unsure (beliefs around 50%). We also report the final utility/accuracy per unit, and the time needed for the computation:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Load test data\ntestdata <- read.csv('test-income_data_example.csv', header = TRUE,\n na.strings = '', stringsAsFactors = FALSE, tryLogical = FALSE)\n\nntest <- nrow(testdata) # size of test dataset\n\n## Let's time the calculation\nstopwatch <- Sys.time()\n\ntestprobs <- numeric(ntest) # prepare vector of probabilities\ntesthits <- numeric(ntest) # prepare vector of hits\nfor(i in 1:ntest){\n \n ## calculate probabilities given all variates except 'income'\n probs <- infer(agent = opmall,\n predictor = testdata[i, colnames(testdata) != 'income'])\n\n ## store the probability for <=50K\n testprobs[i] <- probs['<=50K']\n\n ## decide on one value\n chosenvalue <- decide(probs = probs)$optimal\n\n ## check if decision == true_value, and store result\n testhits[i] <- (chosenvalue == testdata[i, 'income'])\n}\n\n## Print total time required\nprint(Sys.time() - stopwatch)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTime difference of 4.23584 secs\n```\n\n\n:::\n\n```{.r .cell-code}\n## Histogram and average accuracy (rounded to one decimal)\nmyhist(testprobs, n = seq(0,1,length.out = 10), plot = TRUE,\n xlab = 'P(income = \"<=50K\")',\n ylab = 'frequency density in test set',\n main = paste0('accuracy: ', round(100*mean(testhits), 1), '%'))\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n\\\n\n## Comparison {#sec-compare-opm}\n\n:::{.callout-caution}\n## {{< fa user-edit >}} Exercise\n\nNow try to use a popular machine-learning algorithm for the same task, using the same training data, and compare it with the prototype optimal predictor machine. Examine the differences. For example:\n\n- Can you inform the algorithm that $\\mathit{native\\_country}$ has two additional values $\\cat{Norway}$, $\\cat{Netherlands}$ not present in the training data? How?\n\n- Can you flexibly use the algorithm to specify any predictors and any predictands on the fly?\n\n- Does the algorithm inform you of how the inferences could change if more training data were available?\n\n- Which accuracy does the algorithm achieve on the test set?\n\n:::\n\n\n", + "markdown": "# [Example application: adult-income task]{.red} {#sec-example-opm1}\n::: {.hidden}\n\n\n\\providecommand{\\ul}{\\uline}\n\\providecommand{\\and}{\\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}}\n\\renewcommand*{\\|}[1][]{\\nonscript\\:#1\\vert\\nonscript\\:\\mathopen{}}\n\\providecommand*{\\pr}[1]{\\textsf{\\small`#1'}}\n\\renewcommand*{\\pr}[1]{\\textsf{\\small`#1'}}\n\\providecommand*{\\prq}[1]{\\textsf{\\small #1}}\n\\providecommand*{\\se}[1]{\\mathsfit{#1}}\n\\renewcommand{\\se}[1]{\\mathsfit{#1}}\n\\providecommand*{\\sei}[1]{\\mathsfit{\\small #1}}\n\n\\providecommand{\\cat}[1]{{\\small\\verb;#1;}}\n\\providecommand{\\vec}[1]{\\boldsymbol{#1}}\n\\providecommand{\\p}{\\mathrm{p}}\n\\renewcommand{\\p}{\\mathrm{p}}\n\\renewcommand{\\P}{\\mathrm{P}}\n\\definecolor{quarto-callout-note-color}{HTML}{4477AA}\n\\definecolor{quarto-callout-note-color-frame}{HTML}{4477AA}\n\\definecolor{quarto-callout-important-color}{HTML}{AA3377}\n\\definecolor{quarto-callout-important-color-frame}{HTML}{AA3377}\n\\definecolor{quarto-callout-warning-color}{HTML}{EE6677}\n\\definecolor{quarto-callout-warning-color-frame}{HTML}{EE6677}\n\\definecolor{quarto-callout-tip-color}{HTML}{228833}\n\\definecolor{quarto-callout-tip-color-frame}{HTML}{228833}\n\\definecolor{quarto-callout-caution-color}{HTML}{CCBB44}\n\\definecolor{quarto-callout-caution-color-frame}{HTML}{CCBB44}\n\n\\providecommand*{\\mo}[1][=]{\\mathclose{}\\mathord{\\nonscript\\mkern0mu\\textrm{\\small#1}\\nonscript\\mkern0mu}\\mathopen{}}\n\\providecommand*{\\yX}{\\se{X}}\n\\providecommand*{\\yY}{\\se{Y}}\n\\providecommand*{\\yI}{\\se{I}}\n\\providecommand*{\\yi}[1][]{\\se{I}_{\\text{#1}}}\n\\providecommand{\\di}{\\mathrm{d}}\n\\providecommand{\\defd}{\\coloneqq}\n\\providecommand{\\blue}{\\color[RGB]{68,119,170}}\n\\providecommand{\\red}{\\color[RGB]{238,102,119}}\n\\providecommand{\\purple}{\\color[RGB]{170,51,119}}\n\\providecommand{\\green}{\\color[RGB]{34,136,51}}\n\\providecommand{\\yellow}{\\color[RGB]{204,187,68}}\n\\providecommand{\\lblue}{\\color[RGB]{102,204,238}}\n\\providecommand{\\grey}{\\color[RGB]{187,187,187}}\n\\providecommand{\\midgrey}{\\color[RGB]{119,119,119}}\n\\providecommand{\\black}{\\color[RGB]{0,0,0}}\n\\newcommand*{\\e}{\\mathrm{e}}\n\\newcommand*{\\pu}{\\text{π}}\n\\newcommand*{\\RR}{\\mathbf{R}}\n\n$$\n\\DeclarePairedDelimiters{\\set}{\\{}{\\}}\n\\DeclareMathOperator*{\\argmax}{argmax}\n$$\n\n\n\n\n\n\n\n\n:::\n\n::: {.hidden}\n\\providecommand*{\\yon}{{\\green\\cat{on}}}\n\\providecommand*{\\yof}{{\\red\\cat{off}}}\n\\providecommand*{\\yy}{{\\lblue\\cat{Y}}}\n\\providecommand*{\\yn}{{\\yellow\\cat{N}}}\n\\providecommand{\\ypl}{{\\green\\cat{+}}}\n\\providecommand{\\ymi}{{\\red\\cat{-}}}\n\\providecommand{\\ypa}{{\\green\\cat{pass}}}\n\\providecommand{\\yfa}{{\\red\\cat{fail}}}\n\n\n\\providecommand{\\hi}{{\\green\\cat{high}}}\n\\providecommand{\\me}{{\\yellow\\cat{medium}}}\n\\providecommand{\\lo}{{\\red\\cat{low}}}\n\\providecommand*{\\yJ}{\\se{J}}\n\\providecommand{\\yva}{{\\lblue-1}}\n\\providecommand{\\yvb}{{\\midgrey0}}\n\\providecommand{\\yvc}{{\\yellow1}}\n\\providecommand*{\\yK}{\\se{K}}\n\\providecommand*{\\yL}{\\se{L}}\n\n\\providecommand*{\\yR}{R}\n\n\\providecommand*{\\bZ}{{\\blue Z}}\n\\providecommand*{\\bz}{{\\blue z}}\n\\providecommand*{\\rY}{{\\red Y}}\n\\providecommand*{\\bY}{{\\blue Y}}\n\\providecommand*{\\ry}{{\\red y}}\n\\providecommand*{\\gX}{{\\green X}}\n\\providecommand*{\\bX}{{\\blue X}}\n\\providecommand*{\\gx}{{\\green x}}\n\\providecommand*{\\vf}{\\vec{f}}\n\n\\providecommand*{\\yut}{\\se{K}_{\\textsf{3}}}\n\\providecommand*{\\yul}{\\se{K}}\n\n\\providecommand*{\\bA}{{\\blue A}}\n\\providecommand*{\\bB}{{\\blue B}}\n\\providecommand*{\\bC}{{\\blue C}}\n\n\n\\providecommand*{\\vfa}{\\vf'}\n\\providecommand*{\\vfb}{\\vf''}\n\n:::\n\n\n::: {.hidden}\n\n\\providecommand*{\\data}{\\se{\\green data}}\n\\providecommand*{\\yD}{\\se{I}_{\\textrm{d}}}\n\\providecommand*{\\ya}{k}\n\\providecommand*{\\amin}{\\ya_{\\text{mi}}}\n\\providecommand*{\\amax}{\\ya_{\\text{ma}}}\n\n:::\n\n\n\n\n\n\n\n\n\n::: {.cell}\n\n:::\n\n\nLet's illustrate the example workflow described in [§@sec-opm-workflow] with a toy, but not too simplistic, example, based on the [adult-income dataset](https://archive.ics.uci.edu/dataset/2/adult).\n\n\nAll code functions and data files are in the directory\\\n[https://github.com/pglpm/ADA511/tree/master/code/OPM-nominal](https://github.com/pglpm/ADA511/tree/master/code/OPM-nominal)\n\nWe start loading the R libraries and functions needed at several stages. You need to have installed^[This is done with the [`install.packages()` function](https://rdrr.io/r/utils/install.packages.html).] the packages [*extraDistr*](https://cran.r-project.org/package=extraDistr) and [*foreach*](https://cran.r-project.org/package=foreach). Make sure you have saved all source files and data files in the same directory.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary('extraDistr')\nlibrary('foreach')\n\nsource('tplotfunctions.R')\nsource('guessmetadata.R')\nsource('buildagent.R')\nsource('infer.R')\nsource('decide.R')\nsource('mutualinfo.R')\nsource('rF.R')\nsource('plotFsamples1D.R')\n\noptions(repr.plot.width = 6 * sqrt(2), repr.plot.height = 6)\n```\n:::\n\n\n\n\\\n\n## Define the task\n\nThe main task is to infer whether a USA citizen earns less (≤) or more (>) than USD 50 000/year, given a set of characteristics of that citizen. In view of later workflow stages, let's note a couple of known and unknown facts to delimit this task in a more precise manner:\n\n- Given the flexibility of the agent we shall use, we can generalize the task: to infer any subset of the set of characteristics, given any other subset. In other words, we can choose the predictand and predictor variates for any new citizen. Later on we shall also extend the task to making a concrete decision, based on utilities relevant to that citizen.\n \n\tThis flexibility is also convenient because no explanation is given as to *what purpose* the income should be guessed.\n\n- The training data come from a 1994 census, and our agent will use an exchangeable belief distribution about the population. The value of the USD and the economic situation of the country changes from year to year, as well as the informational relationships between economic and demographic factors. For this reason the agent should be used to draw inferences about at most one or two years around 1994. Beyond such time range the exchangeability assumption is too dubious and risky.\n\n- The [USA population in 1994 was around 260 000 000](https://www.macrotrends.net/countries/USA/united-states/population), and we shall use around 11 000 training data. The population size can therefore be considered approximately infinite.\n\n\\\n\n## Collect & prepare background info\n\n### Variates and domains\n\nThe variates to be used must be of nominal type, because our agent's background beliefs (represented by the Dirichlet-mixture distribution) are only appropriate for nominal variates. In this toy example we simply discard all original non-nominal variates. These included some, such as age, that would surely be relevant for this task. As a different approach, we could have coarsened each non-nominal variate into three or four range values, so that treating it as nominal would have been an acceptable approximation.\n\nFirst, create a preliminary metadata file by running the function `guessmetadata()` on the training data [`train-income_data_example.csv`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/train-income_data_example.csv):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nguessmetadata(data = 'train-income_data_example.csv',\n file = 'preliminary.csv')\n```\n:::\n\n\nInspect the resulting file `preliminary.csv` and check whether you can alter it to add additional background information.\n\nAs an example, note that domain of the $\\mathit{native\\_country}$ variate does not include $\\cat{Norway}$ or $\\cat{Sweden}$. Yet it's extremely likely that there were some native Norwegian or Swedish people in the USA in 1994; maybe too few to have been sampled into the training data. Let's add these two values to the list of domain values, and increase the domain size of $\\mathit{native\\_country}$ from 40 to 42. The resulting, updated metadata file has already been saved as [`meta_income_data_example.csv`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/meta_income_data_example.csv).\n\n### Agent's parameters $\\amin, \\amax$\n\nHow many data should the agent learn in order to appreciably change its initial beliefs about the variates above, for the USA 1994 population? Let's put an upper bound at around 1 000 000 (that's roughly 0.5% of the whole population) with $\\amax = 20$, and a lower bound at 1 with $\\amin = 0$; these are the default values. We shall see later what the agent suggests might be a reasonable amount of training data.\n\n\\\n\n## Collect & prepare training data\n\nThe 11 306 training data have been prepared by including only nominal variates, and discarding datapoints with partially missing data (although the function `buildagent()` discards such incomplete datapoints automatically). The resulting file is [`test-income_data_example.csv`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/test-income_data_example.csv).\n\n\\\n\n## Prepare OPM agent\n\nFor the sake of this example we shall prepare two agents with the same background information:\n\n- `opm10`, trained with 10 training datapoints\n\n- `opmall`, trained with all 11 306 training datapoints\n\nPrepare and train each with the `buildagent()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## temporarily load all training data\ntraindata <- read.csv('train-income_data_example.csv', header = TRUE,\n na.strings = '', stringsAsFactors = FALSE, tryLogical = FALSE)\n\n## feed first 10 datapoints to an agent\nopm10 <- buildagent(metadata = 'meta_income_data_example.csv',\n data = traindata[1:10, ])\n\n## delete training data for memory efficiency\nrm(traindata)\n\n\nopmall <- buildagent(metadata = 'meta_income_data_example.csv',\n data = 'train-income_data_example.csv')\n```\n:::\n\n\n\\\n\nWe can peek into the internal structure of these \"agent objects\" with `str()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(opmall)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nList of 4\n $ counts : num [1:7, 1:16, 1:7, 1:14, 1:6, 1:5, 1:2, 1:42, 1:2] 0 0 0 0 0 0 0 0 0 0 ...\n ..- attr(*, \"dimnames\")=List of 9\n .. ..$ workclass : chr [1:7] \"Federal-gov\" \"Local-gov\" \"Private\" \"Self-emp-inc\" ...\n .. ..$ education : chr [1:16] \"10th\" \"11th\" \"12th\" \"1st-4th\" ...\n .. ..$ marital_status: chr [1:7] \"Divorced\" \"Married-AF-spouse\" \"Married-civ-spouse\" \"Married-spouse-absent\" ...\n .. ..$ occupation : chr [1:14] \"Adm-clerical\" \"Armed-Forces\" \"Craft-repair\" \"Exec-managerial\" ...\n .. ..$ relationship : chr [1:6] \"Husband\" \"Not-in-family\" \"Other-relative\" \"Own-child\" ...\n .. ..$ race : chr [1:5] \"Amer-Indian-Eskimo\" \"Asian-Pac-Islander\" \"Black\" \"Other\" ...\n .. ..$ sex : chr [1:2] \"Female\" \"Male\"\n .. ..$ native_country: chr [1:42] \"Cambodia\" \"Canada\" \"China\" \"Columbia\" ...\n .. ..$ income : chr [1:2] \"<=50K\" \">50K\"\n $ alphas : num [1:21] 1 2 4 8 16 32 64 128 256 512 ...\n $ auxalphas: num [1:21] -160706 -157643 -154588 -151547 -148530 ...\n $ palphas : num [1:21] 0 0 0 0 0 0 0 0 0 0 ...\n - attr(*, \"class\")= chr [1:2] \"agent\" \"list\"\n```\n\n\n:::\n:::\n\n\nthis shows that each agent is encoded as a list of four objects:\n\n- the array `counts`, containing the counts $\\green\\# z$\n- the vector `alphas`, containing the values of $2^k$\n- the vector `auxalphas`, containing the (logarithm of) the multiplicative factors ([§@sec-code-computations])\n- the vector `palphas`, containing the updated probabilities about the required amount of training data\n\nThe agent has internally guessed how many training data should be necessary to affect its prior beliefs. We can peek at its guess by plotting the `alphas` parameters against the `palphas` probabilities:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmytplot(x = opmall$alphas, y = opmall$palphas, type = 'b',\n xlim = c(0, 10000), ylim = c(0, NA),\n xlab = 'required number of training data', ylab = 'probability')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nThe most probable amount seems to be of the order of magnitude of 2000 units.\n\nNote that you can see the complete list of variates and their domains by simply calling `dimnames(opmall$counts)` (or any relevant agent-object name instead of `opmall`). Here is the beginning of the list:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(dimnames(opmall$counts))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$workclass\n[1] \"Federal-gov\" \"Local-gov\" \"Private\" \"Self-emp-inc\" \n[5] \"Self-emp-not-inc\" \"State-gov\" \"Without-pay\" \n\n$education\n [1] \"10th\" \"11th\" \"12th\" \"1st-4th\" \"5th-6th\" \n [6] \"7th-8th\" \"9th\" \"Assoc-acdm\" \"Assoc-voc\" \"Bachelors\" \n[11] \"Doctorate\" \"HS-grad\" \"Masters\" \"Preschool\" \"Prof-school\" \n[16] \"Some-college\"\n\n$marital_status\n[1] \"Divorced\" \"Married-AF-spouse\" \"Married-civ-spouse\" \n[4] \"Married-spouse-absent\" \"Never-married\" \"Separated\" \n[7] \"Widowed\" \n\n$occupation\n [1] \"Adm-clerical\" \"Armed-Forces\" \"Craft-repair\" \"Exec-managerial\" \n [5] \"Farming-fishing\" \"Handlers-cleaners\" \"Machine-op-inspct\" \"Other-service\" \n [9] \"Priv-house-serv\" \"Prof-specialty\" \"Protective-serv\" \"Sales\" \n[13] \"Tech-support\" \"Transport-moving\" \n\n$relationship\n[1] \"Husband\" \"Not-in-family\" \"Other-relative\" \"Own-child\" \n[5] \"Unmarried\" \"Wife\" \n\n$race\n[1] \"Amer-Indian-Eskimo\" \"Asian-Pac-Islander\" \"Black\" \"Other\" \n[5] \"White\" \n```\n\n\n:::\n:::\n\n\n\\\n\n## Application and exploration\n\n### Application: only predictands\n\nOur two agents are ready to be applied to new instances.\n\nBefore applying them, let's check some of their inferences, and see if we find anything unconvincing about them. If we find something unconvincing, it means that the background information we provided to the agent doesn't match the one in our intuition. Then there are two or three possibilities: our intuition is misleading us and need correcting; or we need to go back to stage [*Collect & prepare background info*]{.blue} and correct the background information given to the agent; or a combination of these two possibilities.\n\nWe ask the `opm10` agent to forecast the $\\mathit{income}$ of the next unit, using the `infer()` function:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninfer(agent = opm10, predictand = 'income')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nincome\n <=50K >50K \n0.506288 0.493712 \n```\n\n\n:::\n:::\n\n\nThis agent gives a slightly larger probability to the $\\cat{<=50K}$ case. Using the function `plotFsamples1D()` we can also inspect the `opm10`-agent's belief about the frequency distribution of $\\mathit{income}$ for the full population. This belief is represented by a generalized scatter plot of 200 representative frequency distributions, represented as the [light-blue lines]{.lightblue}:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opm10,\n n = 200, # number of example frequency distributions\n predictand = 'income',\n ylim = c(0,1), # y-axis range\n main = 'opm10') # plot title\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nwhere the **black line** is the probability distribution previously calculated with the `infer()` function.\n\nThis plot expresses the `opm10`-agent's belief that future training data might lead to even higher probabilities for $\\cat{<=50K}$. But note that the agent is not excluding the possibility of lower probabilities.\n\n\nLet's visualize the beliefs of the `opmall`-agent, trained with the full training dataset:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opmall, n = 200, predictand = 'income',\n ylim = c(0,1), main = 'opmall')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nThe probability that the next unit has $\\mathit{income}\\mo\\cat{<=50}$ is now above 70%. Also note that the `opmall`-agent doesn't believe that this probability would change appreciably if more training data were provided.\n\n\\\n\nWe can perform a similar exploration for any other variate. Let's take the $\\mathit{race}$ variate for example:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opm10, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opm10',\n cex.axis = 0.75) # smaller axis-font size\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\nNote again how the little-trained `opm10`-agent has practically uniform beliefs. But it's also expressing the fact that future training data will probably increase the probability of $\\mathit{race}\\mo\\cat{White}$.\n\nThis is corroborated by the fully-trained agent:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opmall, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opmall', cex.axis = 0.75)\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n\\\n\nThese checks are satisfying, but it's good to examine their agreement or disagreement with our intuition. Examine the last plot for example. The `opmall` agent has very firm beliefs (no spread in the [light-blue lines]{.lightblue}) about the full-population distribution of $\\mathit{race}$. Do you think its beliefs are too firm, after 11 000 datapoints? would you like the agent to be more \"open-minded\"? In that case you should go back to the [*Collect & prepare background info*]{.blue} stage, and for example modify the parameters $\\amin, \\amax$, then re-check. Or you could even try an agent with a different initial belief distribution.\n\nIn making this kind of considerations it's important to keep in mind what we learned and observed in previous chapters:\n\n:::{.callout-note}\n## \n\n[**Our goal: optimality, not \"success\"**]{.blue}\n\nRemember ([§@sec-optimality] and [§@sec-probability-def]) that a probability represents the *rational* degree of belief that an agent should have *given the particular information available*. We can't judge a probability from the value it assigns to something we later learn to be true -- because according to the information available it could be more rational (and optimal) to consider that something implausible (recall the example in [§@sec-probability-def] of an object falling from the sky as we cross the street).\n\nFrom this point of view we should be wary of comparing the probability of something with our a-posteriori knowledge about it.\n\n\\\n\n[**The data cannot speak for themselves**]{.blue}\n\nWe could build an agent that remains more \"open-minded\" (more spread in the [light-blue lines]{.lightblue}), having received *exactly the same* training data. This \"open-mindedness\" therefore cannot be determined by the training data. Once more this shows that data *cannot* \"speak for themselves\" ([§@sec-underlying-distribution]).\n\n:::\n\n:::{.column-margin}\n![](neverknowsbest.jpg){width=100%}\n:::\n\n\n\\\n\n### Application: specifying predictors\n\nLet's now draw inferences by specifying some predictors.\n\nWe ask the `opm10` agent to forecast the $\\mathit{income}$ of a new unit, given that the unit is known to have $\\mathit{occupation}\\mo\\cat{Exec-managerial}$ and $\\mathit{sex}\\mo\\cat{Male}$ (two predictor variates). What would you expect?\n\nThe `opm10`-agent's belief about the unit -- as well as about the *full subpopulation* ([§@sec-subpopulations]) of units having those predictors -- is shown in the following plot:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opm10, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opm10')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\nNote how the `opm10`-agent still slightly higher probability to $\\mathit{income}\\mo\\cat{<=50}$, but at the same time it is quite uncertain about the subpopulation frequencies; more than if the predictor had not been specified. That is, according to this little-trained agent there could be large variety of possibilities *within this specific subpopulation*.\n\nThe `opmall`-agent's beliefs are shown below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplotFsamples1D(agent = opmall, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opmall')\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n\nit believes with around 55% probability that such a unit would have higher, $\\cat{>50K}$ income. The representative subpopulation-frequency distributions in [light-blue]{.lightblue} indicate that this belief is unlikely to be changed by new training data.\n\n\\\n\nLet's now see an example of our agent's versatility by switching predictands and predictors. We tell the `opmall`-agent that the new unit has $\\mathit{income}\\mo\\cat{>50}$, and ask it to infer the joint variate $(\\mathit{occupation} \\and \\mathit{sex})$; let's present the results in rounded percentages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresult <- infer(agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nround(result * 100, 1) # round to one decimal\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n sex\noccupation Female Male\n Adm-clerical 3.1 4.1\n Armed-Forces 1.0 1.0\n Craft-repair 1.1 9.4\n Exec-managerial 3.6 16.0\n Farming-fishing 1.0 2.1\n Handlers-cleaners 1.0 1.6\n Machine-op-inspct 1.1 3.1\n Other-service 1.6 1.8\n Priv-house-serv 1.0 1.0\n Prof-specialty 4.8 14.7\n Protective-serv 1.0 3.2\n Sales 1.8 10.1\n Tech-support 1.5 3.4\n Transport-moving 1.1 3.7\n```\n\n\n:::\n:::\n\n\nIt returns a 14 × 2 table of joint probabilities. The most probable combinations are $(\\cat{Exec-managerial}, \\cat{Male})$ and $(\\cat{Prof-specialty}, \\cat{Male})$.\n\n### The `rF()` function\n\nThis function generates **full-population** frequency distributions (even for subpopulations) that are probable according to the data. It is used internally by `plotFsamples1D()`, which plots the generated frequency distributions as [light-blue lines]{.lightblue}.\n\nLet's see, as an example, three samples of how the full-population frequency distribution for $\\mathit{sex} \\and \\mathit{income}$ (jointly) could be:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nresult <- rF(n = 3, # number of samples\n agent = opmall, \n predictand = c('sex', 'income'))\n\n## name the samples\ndimnames(result)[1] <- list(samples = paste0('#',1:3))\n\n## permute & print so that samples are the last array dimension\nprint(aperm(result) * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n, , sample = #1\n\n sex\nincome Female Male\n <=50K 27.6411 43.5611\n >50K 6.9703 21.8275\n\n, , sample = #2\n\n sex\nincome Female Male\n <=50K 28.31471 43.1568\n >50K 7.21764 21.3108\n\n, , sample = #3\n\n sex\nincome Female Male\n <=50K 28.44690 42.0664\n >50K 7.28531 22.2014\n```\n\n\n:::\n:::\n\n\n\nThese possible full-population frequency distributions can be used to assess how much the probabilities we find could change, if we collected a much, much larger amount of training data. Here is an example:\n\nWe generate 1000 frequency distributions for $(\\mathit{occupation} \\and \\mathit{sex})$ given $\\mathit{income}\\mo\\cat{>50K}$, and then take the standard deviations of the samples as a rough measure of how much the probabilities we calculated a couple of cells above could change:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreqsamples <- rF(n = 1000,\n agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nvariability <- apply(freqsamples,\n c('occupation','sex'), # which dimensions to apply\n sd) # function to apply to those dimensions\n\nround(variability * 100, 2) # round to two decimals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n sex\noccupation Female Male\n Adm-clerical 0.28 0.33\n Armed-Forces 0.15 0.16\n Craft-repair 0.17 0.48\n Exec-managerial 0.29 0.59\n Farming-fishing 0.17 0.23\n Handlers-cleaners 0.16 0.20\n Machine-op-inspct 0.17 0.28\n Other-service 0.21 0.23\n Priv-house-serv 0.17 0.16\n Prof-specialty 0.35 0.59\n Protective-serv 0.17 0.29\n Sales 0.21 0.50\n Tech-support 0.19 0.29\n Transport-moving 0.17 0.30\n```\n\n\n:::\n:::\n\n\nthe agent believes (at around 68%) that the current probability wouldn't change more than about ±0.5%.\n\n\n\\\n\nThe inferences above were partially \nmeant as checks, but we see that we can actually ask our agent a wide variety of questions about the full population, and do all sorts of association studies.\n\n\n:::{.callout-important}\n## {{< fa exclamation-triangle >}} No \"test\" or \"validation\" datasets used or needed\n\nThe tests and explorations above were done without any \"validation\" or \"test\" datasets. This is because our agent is capable of calculating and showing its beliefs about the *full population* -- and therefore about future data.\n\nThe need for validation or test datasets with common machine-learning algorithms arise from the fact that full-population beliefs are hidden or, more commonly, not computed at all, in order to gain speed. The application of the trained machine-learning algorithm to a validation dataset is an approximate way of extracting such beliefs.\n\n:::\n\n\\\n\n### Exploring the population properties: mutual information\n\nIn [§@sec-entropy-mutualinfo] we introduced [*mutual information*]{.blue} as the information-theoretic measure of mutual relevance and association of two quantities or variates. For the present task, the `opmall`-agent can tell us the mutual information between any two sets of variates of our choice, with the function `mutualinfo()`.\n\nFor instance, let's calculate the mutual information between $\\mathit{occupation}$ and $\\mathit{marital\\_status}$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmutualinfo(agent = opmall,\n A = 'occupation', B = 'marital_status')\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0827823\n```\n\n\n:::\n:::\n\n\nIt is a very low association: knowing either variate decreases the effective number of possible values of the other only $2^{0.0827823\\,\\mathit{Sh}} \\approx 1.06$ times.\n\nNow let's consider a scenario where, in order to save resources, we can use *only one* variate to infer the income. Which of the other variates should we prefer? We can calculate the mutual information between each of them, in turn, and $\\mathit{income}$:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## list of all variates\nvariates <- names(dimnames(opmall$counts))\n\n## list of all variates except 'income'\npredictors <- variates[variates != 'income']\n\n## prepare vector to contain the mutual information\nrelevances <- numeric(length(predictors))\nnames(relevances) <- predictors\n\n## calculate, for each variate,\n## and the mutual information 'relevance' (in shannons)\n## between 'income' and that variate\nfor(var in predictors){\n relevances[var] <- mutualinfo(agent = opmall, A = 'income', B = var)\n}\n\n## output the mutual informations in decreasing order\nsort(relevances, decreasing = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nmarital_status relationship education occupation workclass \n 0.10074130 0.09046621 0.06332052 0.05506897 0.03002995 \nnative_country sex race \n 0.01925227 0.01456655 0.00870089 \n```\n\n\n:::\n:::\n\n\nIf we had to choose *only one* variate to infer the outcome, on average it would be best to use $\\mathit{marital\\_status}$. Our last choice should be $\\mathit{race}$.\n\n:::{.callout-caution}\n## {{< fa user-edit >}} Exercise\n\nNow consider the scenario where we must *exclude one variate* from the eight predictors, or, equivalently, we can only use seven variates as predictors. Which variate should we exclude?\n\nPrepare a script similar to the one above: it calculates the mutual information between $\\mathit{income}$ and the other predictors but with one omitted, omitting each of the eight in turn.\n\n[*Warning: this computation might require 10 or more minutes to complete.*]{.small .midgrey}\n\n- Which single variate should not be omitted from the predictors? which single variate could be dropped?\n\n- Do you obtain the same relevance ranking as in the \"use-one-variate-only\" scenario above?\n\n:::\n\n\n## Example application to new data\n\nLet's apply the `opmall`-agent to a test dataset with 33 914 new units. For each new unit, the agent:\n\n- calculates the probability of $\\mathit{income}\\mo\\cat{<=50}$, via the function `infer()`, using as predictors all variates except $\\mathit{income}$\n- chooses one of the two values $\\set{\\cat{<=50K}, \\cat{>50K}}$, via the function `decide()` trying to maximizing utilities corresponding to the accuracy metric\n\nThe function `decide()` will be described in more detail in chapters [-@sec-make-decision] and [-@sec-example-opm2].\n\nAt the end we plot a histogram of the probabilities calculated for the new units, to check for instance for how many of the agent was sure (beliefs around 0% or 100%) or unsure (beliefs around 50%). We also report the final utility/accuracy per unit, and the time needed for the computation:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Load test data\ntestdata <- read.csv('test-income_data_example.csv', header = TRUE,\n na.strings = '', stringsAsFactors = FALSE, tryLogical = FALSE)\n\nntest <- nrow(testdata) # size of test dataset\n\n## Let's time the calculation\nstopwatch <- Sys.time()\n\ntestprobs <- numeric(ntest) # prepare vector of probabilities\ntesthits <- numeric(ntest) # prepare vector of hits\nfor(i in 1:ntest){\n \n ## calculate probabilities given all variates except 'income'\n probs <- infer(agent = opmall,\n predictor = testdata[i, colnames(testdata) != 'income'])\n\n ## store the probability for <=50K\n testprobs[i] <- probs['<=50K']\n\n ## decide on one value\n chosenvalue <- decide(probs = probs)$optimal\n\n ## check if decision == true_value, and store result\n testhits[i] <- (chosenvalue == testdata[i, 'income'])\n}\n\n## Print total time required\nprint(Sys.time() - stopwatch)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTime difference of 4.21148 secs\n```\n\n\n:::\n\n```{.r .cell-code}\n## Histogram and average accuracy (rounded to one decimal)\nmyhist(testprobs, n = seq(0,1,length.out = 10), plot = TRUE,\n xlab = 'P(income = \"<=50K\")',\n ylab = 'frequency density in test set',\n main = paste0('accuracy: ', round(100*mean(testhits), 1), '%'))\n```\n\n::: {.cell-output-display}\n![](example_opm1_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n\\\n\n## Comparison {#sec-compare-opm}\n\n:::{.callout-caution}\n## {{< fa user-edit >}} Exercise\n\nNow try to use a popular machine-learning algorithm for the same task, using the same training data, and compare it with the prototype optimal predictor machine. Examine the differences. For example:\n\n- Can you inform the algorithm that $\\mathit{native\\_country}$ has two additional values $\\cat{Norway}$, $\\cat{Netherlands}$ not present in the training data? How?\n\n- Can you flexibly use the algorithm to specify any predictors and any predictands on the fly?\n\n- Does the algorithm inform you of how the inferences could change if more training data were available?\n\n- Which accuracy does the algorithm achieve on the test set?\n\n:::\n\n\n", "supporting": [ "example_opm1_files" ], diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-10-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-10-1.png index 4c7f29a..6b4394e 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-10-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-10-1.png differ diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-11-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-11-1.png index ea3b722..3a3eae2 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-11-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-11-1.png differ diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-12-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-12-1.png index 412695f..9327988 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-12-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-12-1.png differ diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-13-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-13-1.png index 7bd149c..cbfc950 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-13-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-13-1.png differ diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-14-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-14-1.png index bb01ee1..b811b7d 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-14-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-14-1.png differ diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-20-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-20-1.png index 5191d75..a37ff5a 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-20-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-20-1.png differ diff --git a/_freeze/example_opm1/figure-html/unnamed-chunk-9-1.png b/_freeze/example_opm1/figure-html/unnamed-chunk-9-1.png index c61bb6a..196334a 100644 Binary files a/_freeze/example_opm1/figure-html/unnamed-chunk-9-1.png and b/_freeze/example_opm1/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/example_opm1.html b/docs/example_opm1.html index 9ba2ad9..dd3d09b 100644 --- a/docs/example_opm1.html +++ b/docs/example_opm1.html @@ -1002,23 +1002,23 @@
rF()
func
, , sample = #1
sex
-income Female Male
- <=50K 28.17199 43.5344
- >50K 6.83616 21.4575
+income Female Male
+ <=50K 27.6411 43.5611
+ >50K 6.9703 21.8275
, , sample = #2
sex
income Female Male
- <=50K 28.18455 43.4979
- >50K 7.24585 21.0717
+ <=50K 28.31471 43.1568
+ >50K 7.21764 21.3108
, , sample = #3
sex
income Female Male
- <=50K 28.13035 43.6037
- >50K 7.00727 21.2587
+ <=50K 28.44690 42.0664
+ >50K 7.28531 22.2014
These possible full-population frequency distributions can be used to assess how much the probabilities we find could change, if we collected a much, much larger amount of training data. Here is an example:
@@ -1037,20 +1037,20 @@rF()
func
sex
occupation Female Male
- Adm-clerical 0.29 0.32
- Armed-Forces 0.16 0.17
- Craft-repair 0.17 0.47
- Exec-managerial 0.31 0.62
- Farming-fishing 0.16 0.23
- Handlers-cleaners 0.17 0.19
- Machine-op-inspct 0.17 0.29
- Other-service 0.20 0.22
- Priv-house-serv 0.16 0.16
- Prof-specialty 0.36 0.56
- Protective-serv 0.16 0.28
- Sales 0.22 0.49
- Tech-support 0.21 0.30
- Transport-moving 0.16 0.31
+ Adm-clerical 0.28 0.33
+ Armed-Forces 0.15 0.16
+ Craft-repair 0.17 0.48
+ Exec-managerial 0.29 0.59
+ Farming-fishing 0.17 0.23
+ Handlers-cleaners 0.16 0.20
+ Machine-op-inspct 0.17 0.28
+ Other-service 0.21 0.23
+ Priv-house-serv 0.17 0.16
+ Prof-specialty 0.35 0.59
+ Protective-serv 0.17 0.29
+ Sales 0.21 0.50
+ Tech-support 0.19 0.29
+ Transport-moving 0.17 0.30
the agent believes (at around 68%) that the current probability wouldn’t change more than about ±0.5%.
@@ -1099,15 +1099,14 @@marital_status relationship education occupation workclass
0.10074130 0.09046621 0.06332052 0.05506897 0.03002995
@@ -1177,7 +1176,7 @@ ## Print total time required
print(Sys.time() - stopwatch)
Time difference of 4.23584 secs
+Time difference of 4.21148 secs
## Histogram and average accuracy (rounded to one decimal)
myhist(testprobs, n = seq(0,1,length.out = 10), plot = TRUE,
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-10-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-10-1.png
index 4c7f29a..6b4394e 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-10-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-10-1.png differ
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-11-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-11-1.png
index ea3b722..3a3eae2 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-11-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-11-1.png differ
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-12-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-12-1.png
index 412695f..9327988 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-12-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-12-1.png differ
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-13-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-13-1.png
index 7bd149c..cbfc950 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-13-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-13-1.png differ
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-14-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-14-1.png
index bb01ee1..b811b7d 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-14-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-14-1.png differ
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-20-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-20-1.png
index 5191d75..a37ff5a 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-20-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-20-1.png differ
diff --git a/docs/example_opm1_files/figure-html/unnamed-chunk-9-1.png b/docs/example_opm1_files/figure-html/unnamed-chunk-9-1.png
index c61bb6a..196334a 100644
Binary files a/docs/example_opm1_files/figure-html/unnamed-chunk-9-1.png and b/docs/example_opm1_files/figure-html/unnamed-chunk-9-1.png differ
diff --git a/docs/search.json b/docs/search.json
index b9be1f6..5362b73 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -1978,7 +1978,7 @@
"href": "example_opm1.html#application-and-exploration",
"title": "35 Example application: adult-income task",
"section": "35.5 Application and exploration",
- "text": "35.5 Application and exploration\n\nApplication: only predictands\nOur two agents are ready to be applied to new instances.\nBefore applying them, let’s check some of their inferences, and see if we find anything unconvincing about them. If we find something unconvincing, it means that the background information we provided to the agent doesn’t match the one in our intuition. Then there are two or three possibilities: our intuition is misleading us and need correcting; or we need to go back to stage Collect & prepare background info and correct the background information given to the agent; or a combination of these two possibilities.\nWe ask the opm10 agent to forecast the \\(\\mathit{income}\\) of the next unit, using the infer() function:\n\ninfer(agent = opm10, predictand = 'income')\n\nincome\n <=50K >50K \n0.506288 0.493712 \n\n\nThis agent gives a slightly larger probability to the \\({\\small\\verb;<=50K;}\\) case. Using the function plotFsamples1D() we can also inspect the opm10-agent’s belief about the frequency distribution of \\(\\mathit{income}\\) for the full population. This belief is represented by a generalized scatter plot of 200 representative frequency distributions, represented as the light-blue lines:\n\nplotFsamples1D(agent = opm10,\n n = 200, # number of example frequency distributions\n predictand = 'income',\n ylim = c(0,1), # y-axis range\n main = 'opm10') # plot title\n\n\n\n\n\n\n\n\nwhere the black line is the probability distribution previously calculated with the infer() function.\nThis plot expresses the opm10-agent’s belief that future training data might lead to even higher probabilities for \\({\\small\\verb;<=50K;}\\). But note that the agent is not excluding the possibility of lower probabilities.\nLet’s visualize the beliefs of the opmall-agent, trained with the full training dataset:\n\nplotFsamples1D(agent = opmall, n = 200, predictand = 'income',\n ylim = c(0,1), main = 'opmall')\n\n\n\n\n\n\n\n\nThe probability that the next unit has \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;<=50;}\\) is now above 70%. Also note that the opmall-agent doesn’t believe that this probability would change appreciably if more training data were provided.\n\n\nWe can perform a similar exploration for any other variate. Let’s take the \\(\\mathit{race}\\) variate for example:\n\nplotFsamples1D(agent = opm10, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opm10',\n cex.axis = 0.75) # smaller axis-font size\n\n\n\n\n\n\n\n\nNote again how the little-trained opm10-agent has practically uniform beliefs. But it’s also expressing the fact that future training data will probably increase the probability of \\(\\mathit{race}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;White;}\\).\nThis is corroborated by the fully-trained agent:\n\nplotFsamples1D(agent = opmall, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opmall', cex.axis = 0.75)\n\n\n\n\n\n\n\n\n\n\nThese checks are satisfying, but it’s good to examine their agreement or disagreement with our intuition. Examine the last plot for example. The opmall agent has very firm beliefs (no spread in the light-blue lines) about the full-population distribution of \\(\\mathit{race}\\). Do you think its beliefs are too firm, after 11 000 datapoints? would you like the agent to be more “open-minded”? In that case you should go back to the Collect & prepare background info stage, and for example modify the parameters \\(k_{\\text{mi}}, k_{\\text{ma}}\\), then re-check. Or you could even try an agent with a different initial belief distribution.\nIn making this kind of considerations it’s important to keep in mind what we learned and observed in previous chapters:\n\n\n\n\n\n\nNote\n\n\n\nOur goal: optimality, not “success”\nRemember (§ 2.3 and § 8.1) that a probability represents the rational degree of belief that an agent should have given the particular information available. We can’t judge a probability from the value it assigns to something we later learn to be true – because according to the information available it could be more rational (and optimal) to consider that something implausible (recall the example in § 8.1 of an object falling from the sky as we cross the street).\nFrom this point of view we should be wary of comparing the probability of something with our a-posteriori knowledge about it.\n\n\nThe data cannot speak for themselves\nWe could build an agent that remains more “open-minded” (more spread in the light-blue lines), having received exactly the same training data. This “open-mindedness” therefore cannot be determined by the training data. Once more this shows that data cannot “speak for themselves” (§ 27.4).\n\n\n\n\n\n\n\n\n\nApplication: specifying predictors\nLet’s now draw inferences by specifying some predictors.\nWe ask the opm10 agent to forecast the \\(\\mathit{income}\\) of a new unit, given that the unit is known to have \\(\\mathit{occupation}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;Exec-managerial;}\\) and \\(\\mathit{sex}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;Male;}\\) (two predictor variates). What would you expect?\nThe opm10-agent’s belief about the unit – as well as about the full subpopulation (§ 22.1) of units having those predictors – is shown in the following plot:\n\nplotFsamples1D(agent = opm10, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opm10')\n\n\n\n\n\n\n\n\nNote how the opm10-agent still slightly higher probability to \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;<=50;}\\), but at the same time it is quite uncertain about the subpopulation frequencies; more than if the predictor had not been specified. That is, according to this little-trained agent there could be large variety of possibilities within this specific subpopulation.\nThe opmall-agent’s beliefs are shown below:\n\nplotFsamples1D(agent = opmall, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opmall')\n\n\n\n\n\n\n\n\nit believes with around 55% probability that such a unit would have higher, \\({\\small\\verb;>50K;}\\) income. The representative subpopulation-frequency distributions in light-blue indicate that this belief is unlikely to be changed by new training data.\n\n\nLet’s now see an example of our agent’s versatility by switching predictands and predictors. We tell the opmall-agent that the new unit has \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;>50;}\\), and ask it to infer the joint variate \\((\\mathit{occupation} \\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}\\mathit{sex})\\); let’s present the results in rounded percentages:\n\nresult <- infer(agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nround(result * 100, 1) # round to one decimal\n\n sex\noccupation Female Male\n Adm-clerical 3.1 4.1\n Armed-Forces 1.0 1.0\n Craft-repair 1.1 9.4\n Exec-managerial 3.6 16.0\n Farming-fishing 1.0 2.1\n Handlers-cleaners 1.0 1.6\n Machine-op-inspct 1.1 3.1\n Other-service 1.6 1.8\n Priv-house-serv 1.0 1.0\n Prof-specialty 4.8 14.7\n Protective-serv 1.0 3.2\n Sales 1.8 10.1\n Tech-support 1.5 3.4\n Transport-moving 1.1 3.7\n\n\nIt returns a 14 × 2 table of joint probabilities. The most probable combinations are \\(({\\small\\verb;Exec-managerial;}, {\\small\\verb;Male;})\\) and \\(({\\small\\verb;Prof-specialty;}, {\\small\\verb;Male;})\\).\n\n\nThe rF() function\nThis function generates full-population frequency distributions (even for subpopulations) that are probable according to the data. It is used internally by plotFsamples1D(), which plots the generated frequency distributions as light-blue lines.\nLet’s see, as an example, three samples of how the full-population frequency distribution for \\(\\mathit{sex} \\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}\\mathit{income}\\) (jointly) could be:\n\nresult <- rF(n = 3, # number of samples\n agent = opmall, \n predictand = c('sex', 'income'))\n\n## name the samples\ndimnames(result)[1] <- list(samples = paste0('#',1:3))\n\n## permute & print so that samples are the last array dimension\nprint(aperm(result) * 100)\n\n, , sample = #1\n\n sex\nincome Female Male\n <=50K 28.17199 43.5344\n >50K 6.83616 21.4575\n\n, , sample = #2\n\n sex\nincome Female Male\n <=50K 28.18455 43.4979\n >50K 7.24585 21.0717\n\n, , sample = #3\n\n sex\nincome Female Male\n <=50K 28.13035 43.6037\n >50K 7.00727 21.2587\n\n\nThese possible full-population frequency distributions can be used to assess how much the probabilities we find could change, if we collected a much, much larger amount of training data. Here is an example:\nWe generate 1000 frequency distributions for \\((\\mathit{occupation} \\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}\\mathit{sex})\\) given \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;>50K;}\\), and then take the standard deviations of the samples as a rough measure of how much the probabilities we calculated a couple of cells above could change:\n\nfreqsamples <- rF(n = 1000,\n agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nvariability <- apply(freqsamples,\n c('occupation','sex'), # which dimensions to apply\n sd) # function to apply to those dimensions\n\nround(variability * 100, 2) # round to two decimals\n\n sex\noccupation Female Male\n Adm-clerical 0.29 0.32\n Armed-Forces 0.16 0.17\n Craft-repair 0.17 0.47\n Exec-managerial 0.31 0.62\n Farming-fishing 0.16 0.23\n Handlers-cleaners 0.17 0.19\n Machine-op-inspct 0.17 0.29\n Other-service 0.20 0.22\n Priv-house-serv 0.16 0.16\n Prof-specialty 0.36 0.56\n Protective-serv 0.16 0.28\n Sales 0.22 0.49\n Tech-support 0.21 0.30\n Transport-moving 0.16 0.31\n\n\nthe agent believes (at around 68%) that the current probability wouldn’t change more than about ±0.5%.\n\n\nThe inferences above were partially meant as checks, but we see that we can actually ask our agent a wide variety of questions about the full population, and do all sorts of association studies.\n\n\n\n\n\n\n No “test” or “validation” datasets used or needed\n\n\n\nThe tests and explorations above were done without any “validation” or “test” datasets. This is because our agent is capable of calculating and showing its beliefs about the full population – and therefore about future data.\nThe need for validation or test datasets with common machine-learning algorithms arise from the fact that full-population beliefs are hidden or, more commonly, not computed at all, in order to gain speed. The application of the trained machine-learning algorithm to a validation dataset is an approximate way of extracting such beliefs.\n\n\n\n\n\n\nExploring the population properties: mutual information\nIn § 18.5 we introduced mutual information as the information-theoretic measure of mutual relevance and association of two quantities or variates. For the present task, the opmall-agent can tell us the mutual information between any two sets of variates of our choice, with the function mutualinfo().\nFor instance, let’s calculate the mutual information between \\(\\mathit{occupation}\\) and \\(\\mathit{marital\\_status}\\).\n\nmutualinfo(agent = opmall,\n A = 'occupation', B = 'marital_status')\n\n[1] 0.0827823\n\n\nIt is a very low association: knowing either variate decreases the effective number of possible values of the other only \\(2^{0.0827823\\,\\mathit{Sh}} \\approx 1.06\\) times.\nNow let’s consider a scenario where, in order to save resources, we can use only one variate to infer the income. Which of the other variates should we prefer? We can calculate the mutual information between each of them, in turn, and \\(\\mathit{income}\\):\n\n## list of all variates\nvariates <- names(dimnames(opmall$counts))\n\n## list of all variates except 'income'\npredictors <- variates[variates != 'income']\n\n## prepare vector to contain the mutual information\nrelevances <- numeric(length(predictors))\nnames(relevances) <- predictors\n\n## calculate, for each variate,\n## the joint probability distribution 'probs'\n## and then mutual information 'relevance' (in shannons)\n## between 'income' and that variate\nfor(var in predictors){\n relevances[var] <- mutualinfo(agent = opmall, A = 'income', B = var)\n}\n\n## output the mutual informations in decreasing order\nsort(relevances, decreasing = TRUE)\n\nmarital_status relationship education occupation workclass \n 0.10074130 0.09046621 0.06332052 0.05506897 0.03002995 \nnative_country sex race \n 0.01925227 0.01456655 0.00870089 \n\n\nIf we had to choose only one variate to infer the outcome, on average it would be best to use \\(\\mathit{marital\\_status}\\). Our last choice should be \\(\\mathit{race}\\).\n\n\n\n\n\n\n Exercise\n\n\n\nNow consider the scenario where we must exclude one variate from the eight predictors, or, equivalently, we can only use seven variates as predictors. Which variate should we exclude?\nPrepare a script similar to the one above: it calculates the mutual information between \\(\\mathit{income}\\) and the other predictors but with one omitted, omitting each of the eight in turn.\nWarning: this computation might require 10 or more minutes to complete.\n\nWhich single variate should not be omitted from the predictors? which single variate could be dropped?\nDo you obtain the same relevance ranking as in the “use-one-variate-only” scenario above?",
+ "text": "35.5 Application and exploration\n\nApplication: only predictands\nOur two agents are ready to be applied to new instances.\nBefore applying them, let’s check some of their inferences, and see if we find anything unconvincing about them. If we find something unconvincing, it means that the background information we provided to the agent doesn’t match the one in our intuition. Then there are two or three possibilities: our intuition is misleading us and need correcting; or we need to go back to stage Collect & prepare background info and correct the background information given to the agent; or a combination of these two possibilities.\nWe ask the opm10 agent to forecast the \\(\\mathit{income}\\) of the next unit, using the infer() function:\n\ninfer(agent = opm10, predictand = 'income')\n\nincome\n <=50K >50K \n0.506288 0.493712 \n\n\nThis agent gives a slightly larger probability to the \\({\\small\\verb;<=50K;}\\) case. Using the function plotFsamples1D() we can also inspect the opm10-agent’s belief about the frequency distribution of \\(\\mathit{income}\\) for the full population. This belief is represented by a generalized scatter plot of 200 representative frequency distributions, represented as the light-blue lines:\n\nplotFsamples1D(agent = opm10,\n n = 200, # number of example frequency distributions\n predictand = 'income',\n ylim = c(0,1), # y-axis range\n main = 'opm10') # plot title\n\n\n\n\n\n\n\n\nwhere the black line is the probability distribution previously calculated with the infer() function.\nThis plot expresses the opm10-agent’s belief that future training data might lead to even higher probabilities for \\({\\small\\verb;<=50K;}\\). But note that the agent is not excluding the possibility of lower probabilities.\nLet’s visualize the beliefs of the opmall-agent, trained with the full training dataset:\n\nplotFsamples1D(agent = opmall, n = 200, predictand = 'income',\n ylim = c(0,1), main = 'opmall')\n\n\n\n\n\n\n\n\nThe probability that the next unit has \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;<=50;}\\) is now above 70%. Also note that the opmall-agent doesn’t believe that this probability would change appreciably if more training data were provided.\n\n\nWe can perform a similar exploration for any other variate. Let’s take the \\(\\mathit{race}\\) variate for example:\n\nplotFsamples1D(agent = opm10, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opm10',\n cex.axis = 0.75) # smaller axis-font size\n\n\n\n\n\n\n\n\nNote again how the little-trained opm10-agent has practically uniform beliefs. But it’s also expressing the fact that future training data will probably increase the probability of \\(\\mathit{race}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;White;}\\).\nThis is corroborated by the fully-trained agent:\n\nplotFsamples1D(agent = opmall, n = 200, predictand = 'race',\n ylim = c(0,1), main = 'opmall', cex.axis = 0.75)\n\n\n\n\n\n\n\n\n\n\nThese checks are satisfying, but it’s good to examine their agreement or disagreement with our intuition. Examine the last plot for example. The opmall agent has very firm beliefs (no spread in the light-blue lines) about the full-population distribution of \\(\\mathit{race}\\). Do you think its beliefs are too firm, after 11 000 datapoints? would you like the agent to be more “open-minded”? In that case you should go back to the Collect & prepare background info stage, and for example modify the parameters \\(k_{\\text{mi}}, k_{\\text{ma}}\\), then re-check. Or you could even try an agent with a different initial belief distribution.\nIn making this kind of considerations it’s important to keep in mind what we learned and observed in previous chapters:\n\n\n\n\n\n\nNote\n\n\n\nOur goal: optimality, not “success”\nRemember (§ 2.3 and § 8.1) that a probability represents the rational degree of belief that an agent should have given the particular information available. We can’t judge a probability from the value it assigns to something we later learn to be true – because according to the information available it could be more rational (and optimal) to consider that something implausible (recall the example in § 8.1 of an object falling from the sky as we cross the street).\nFrom this point of view we should be wary of comparing the probability of something with our a-posteriori knowledge about it.\n\n\nThe data cannot speak for themselves\nWe could build an agent that remains more “open-minded” (more spread in the light-blue lines), having received exactly the same training data. This “open-mindedness” therefore cannot be determined by the training data. Once more this shows that data cannot “speak for themselves” (§ 27.4).\n\n\n\n\n\n\n\n\n\nApplication: specifying predictors\nLet’s now draw inferences by specifying some predictors.\nWe ask the opm10 agent to forecast the \\(\\mathit{income}\\) of a new unit, given that the unit is known to have \\(\\mathit{occupation}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;Exec-managerial;}\\) and \\(\\mathit{sex}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;Male;}\\) (two predictor variates). What would you expect?\nThe opm10-agent’s belief about the unit – as well as about the full subpopulation (§ 22.1) of units having those predictors – is shown in the following plot:\n\nplotFsamples1D(agent = opm10, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opm10')\n\n\n\n\n\n\n\n\nNote how the opm10-agent still slightly higher probability to \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;<=50;}\\), but at the same time it is quite uncertain about the subpopulation frequencies; more than if the predictor had not been specified. That is, according to this little-trained agent there could be large variety of possibilities within this specific subpopulation.\nThe opmall-agent’s beliefs are shown below:\n\nplotFsamples1D(agent = opmall, n = 200,\n predictand = 'income',\n predictor = list(occupation = 'Exec-managerial',\n sex = 'Male'),\n ylim = c(0,1), main = 'opmall')\n\n\n\n\n\n\n\n\nit believes with around 55% probability that such a unit would have higher, \\({\\small\\verb;>50K;}\\) income. The representative subpopulation-frequency distributions in light-blue indicate that this belief is unlikely to be changed by new training data.\n\n\nLet’s now see an example of our agent’s versatility by switching predictands and predictors. We tell the opmall-agent that the new unit has \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;>50;}\\), and ask it to infer the joint variate \\((\\mathit{occupation} \\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}\\mathit{sex})\\); let’s present the results in rounded percentages:\n\nresult <- infer(agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nround(result * 100, 1) # round to one decimal\n\n sex\noccupation Female Male\n Adm-clerical 3.1 4.1\n Armed-Forces 1.0 1.0\n Craft-repair 1.1 9.4\n Exec-managerial 3.6 16.0\n Farming-fishing 1.0 2.1\n Handlers-cleaners 1.0 1.6\n Machine-op-inspct 1.1 3.1\n Other-service 1.6 1.8\n Priv-house-serv 1.0 1.0\n Prof-specialty 4.8 14.7\n Protective-serv 1.0 3.2\n Sales 1.8 10.1\n Tech-support 1.5 3.4\n Transport-moving 1.1 3.7\n\n\nIt returns a 14 × 2 table of joint probabilities. The most probable combinations are \\(({\\small\\verb;Exec-managerial;}, {\\small\\verb;Male;})\\) and \\(({\\small\\verb;Prof-specialty;}, {\\small\\verb;Male;})\\).\n\n\nThe rF() function\nThis function generates full-population frequency distributions (even for subpopulations) that are probable according to the data. It is used internally by plotFsamples1D(), which plots the generated frequency distributions as light-blue lines.\nLet’s see, as an example, three samples of how the full-population frequency distribution for \\(\\mathit{sex} \\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}\\mathit{income}\\) (jointly) could be:\n\nresult <- rF(n = 3, # number of samples\n agent = opmall, \n predictand = c('sex', 'income'))\n\n## name the samples\ndimnames(result)[1] <- list(samples = paste0('#',1:3))\n\n## permute & print so that samples are the last array dimension\nprint(aperm(result) * 100)\n\n, , sample = #1\n\n sex\nincome Female Male\n <=50K 27.6411 43.5611\n >50K 6.9703 21.8275\n\n, , sample = #2\n\n sex\nincome Female Male\n <=50K 28.31471 43.1568\n >50K 7.21764 21.3108\n\n, , sample = #3\n\n sex\nincome Female Male\n <=50K 28.44690 42.0664\n >50K 7.28531 22.2014\n\n\nThese possible full-population frequency distributions can be used to assess how much the probabilities we find could change, if we collected a much, much larger amount of training data. Here is an example:\nWe generate 1000 frequency distributions for \\((\\mathit{occupation} \\mathbin{\\mkern-0.5mu,\\mkern-0.5mu}\\mathit{sex})\\) given \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;>50K;}\\), and then take the standard deviations of the samples as a rough measure of how much the probabilities we calculated a couple of cells above could change:\n\nfreqsamples <- rF(n = 1000,\n agent = opmall,\n predictand = c('occupation', 'sex'),\n predictor = list(income = '>50K'))\n\nvariability <- apply(freqsamples,\n c('occupation','sex'), # which dimensions to apply\n sd) # function to apply to those dimensions\n\nround(variability * 100, 2) # round to two decimals\n\n sex\noccupation Female Male\n Adm-clerical 0.28 0.33\n Armed-Forces 0.15 0.16\n Craft-repair 0.17 0.48\n Exec-managerial 0.29 0.59\n Farming-fishing 0.17 0.23\n Handlers-cleaners 0.16 0.20\n Machine-op-inspct 0.17 0.28\n Other-service 0.21 0.23\n Priv-house-serv 0.17 0.16\n Prof-specialty 0.35 0.59\n Protective-serv 0.17 0.29\n Sales 0.21 0.50\n Tech-support 0.19 0.29\n Transport-moving 0.17 0.30\n\n\nthe agent believes (at around 68%) that the current probability wouldn’t change more than about ±0.5%.\n\n\nThe inferences above were partially meant as checks, but we see that we can actually ask our agent a wide variety of questions about the full population, and do all sorts of association studies.\n\n\n\n\n\n\n No “test” or “validation” datasets used or needed\n\n\n\nThe tests and explorations above were done without any “validation” or “test” datasets. This is because our agent is capable of calculating and showing its beliefs about the full population – and therefore about future data.\nThe need for validation or test datasets with common machine-learning algorithms arise from the fact that full-population beliefs are hidden or, more commonly, not computed at all, in order to gain speed. The application of the trained machine-learning algorithm to a validation dataset is an approximate way of extracting such beliefs.\n\n\n\n\n\n\nExploring the population properties: mutual information\nIn § 18.5 we introduced mutual information as the information-theoretic measure of mutual relevance and association of two quantities or variates. For the present task, the opmall-agent can tell us the mutual information between any two sets of variates of our choice, with the function mutualinfo().\nFor instance, let’s calculate the mutual information between \\(\\mathit{occupation}\\) and \\(\\mathit{marital\\_status}\\).\n\nmutualinfo(agent = opmall,\n A = 'occupation', B = 'marital_status')\n\n[1] 0.0827823\n\n\nIt is a very low association: knowing either variate decreases the effective number of possible values of the other only \\(2^{0.0827823\\,\\mathit{Sh}} \\approx 1.06\\) times.\nNow let’s consider a scenario where, in order to save resources, we can use only one variate to infer the income. Which of the other variates should we prefer? We can calculate the mutual information between each of them, in turn, and \\(\\mathit{income}\\):\n\n## list of all variates\nvariates <- names(dimnames(opmall$counts))\n\n## list of all variates except 'income'\npredictors <- variates[variates != 'income']\n\n## prepare vector to contain the mutual information\nrelevances <- numeric(length(predictors))\nnames(relevances) <- predictors\n\n## calculate, for each variate,\n## and the mutual information 'relevance' (in shannons)\n## between 'income' and that variate\nfor(var in predictors){\n relevances[var] <- mutualinfo(agent = opmall, A = 'income', B = var)\n}\n\n## output the mutual informations in decreasing order\nsort(relevances, decreasing = TRUE)\n\nmarital_status relationship education occupation workclass \n 0.10074130 0.09046621 0.06332052 0.05506897 0.03002995 \nnative_country sex race \n 0.01925227 0.01456655 0.00870089 \n\n\nIf we had to choose only one variate to infer the outcome, on average it would be best to use \\(\\mathit{marital\\_status}\\). Our last choice should be \\(\\mathit{race}\\).\n\n\n\n\n\n\n Exercise\n\n\n\nNow consider the scenario where we must exclude one variate from the eight predictors, or, equivalently, we can only use seven variates as predictors. Which variate should we exclude?\nPrepare a script similar to the one above: it calculates the mutual information between \\(\\mathit{income}\\) and the other predictors but with one omitted, omitting each of the eight in turn.\nWarning: this computation might require 10 or more minutes to complete.\n\nWhich single variate should not be omitted from the predictors? which single variate could be dropped?\nDo you obtain the same relevance ranking as in the “use-one-variate-only” scenario above?",
"crumbs": [
"[**A prototype Optimal Predictor Machine**]{.red}",
"35 [Example application: adult-income task]{.red}"
@@ -1989,7 +1989,7 @@
"href": "example_opm1.html#example-application-to-new-data",
"title": "35 Example application: adult-income task",
"section": "35.6 Example application to new data",
- "text": "35.6 Example application to new data\nLet’s apply the opmall-agent to a test dataset with 33 914 new units. For each new unit, the agent:\n\ncalculates the probability of \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;<=50;}\\), via the function infer(), using as predictors all variates except \\(\\mathit{income}\\)\nchooses one of the two values \\(\\set{{\\small\\verb;<=50K;}, {\\small\\verb;>50K;}}\\), via the function decide() trying to maximizing utilities corresponding to the accuracy metric\n\nThe function decide() will be described in more detail in chapters 37 and 38.\nAt the end we plot a histogram of the probabilities calculated for the new units, to check for instance for how many of the agent was sure (beliefs around 0% or 100%) or unsure (beliefs around 50%). We also report the final utility/accuracy per unit, and the time needed for the computation:\n\n## Load test data\ntestdata <- read.csv('test-income_data_example.csv', header = TRUE,\n na.strings = '', stringsAsFactors = FALSE, tryLogical = FALSE)\n\nntest <- nrow(testdata) # size of test dataset\n\n## Let's time the calculation\nstopwatch <- Sys.time()\n\ntestprobs <- numeric(ntest) # prepare vector of probabilities\ntesthits <- numeric(ntest) # prepare vector of hits\nfor(i in 1:ntest){\n \n ## calculate probabilities given all variates except 'income'\n probs <- infer(agent = opmall,\n predictor = testdata[i, colnames(testdata) != 'income'])\n\n ## store the probability for <=50K\n testprobs[i] <- probs['<=50K']\n\n ## decide on one value\n chosenvalue <- decide(probs = probs)$optimal\n\n ## check if decision == true_value, and store result\n testhits[i] <- (chosenvalue == testdata[i, 'income'])\n}\n\n## Print total time required\nprint(Sys.time() - stopwatch)\n\nTime difference of 4.23584 secs\n\n## Histogram and average accuracy (rounded to one decimal)\nmyhist(testprobs, n = seq(0,1,length.out = 10), plot = TRUE,\n xlab = 'P(income = \"<=50K\")',\n ylab = 'frequency density in test set',\n main = paste0('accuracy: ', round(100*mean(testhits), 1), '%'))",
+ "text": "35.6 Example application to new data\nLet’s apply the opmall-agent to a test dataset with 33 914 new units. For each new unit, the agent:\n\ncalculates the probability of \\(\\mathit{income}\\mathclose{}\\mathord{\\nonscript\\mkern 0mu\\textrm{\\small=}\\nonscript\\mkern 0mu}\\mathopen{}{\\small\\verb;<=50;}\\), via the function infer(), using as predictors all variates except \\(\\mathit{income}\\)\nchooses one of the two values \\(\\set{{\\small\\verb;<=50K;}, {\\small\\verb;>50K;}}\\), via the function decide() trying to maximizing utilities corresponding to the accuracy metric\n\nThe function decide() will be described in more detail in chapters 37 and 38.\nAt the end we plot a histogram of the probabilities calculated for the new units, to check for instance for how many of the agent was sure (beliefs around 0% or 100%) or unsure (beliefs around 50%). We also report the final utility/accuracy per unit, and the time needed for the computation:\n\n## Load test data\ntestdata <- read.csv('test-income_data_example.csv', header = TRUE,\n na.strings = '', stringsAsFactors = FALSE, tryLogical = FALSE)\n\nntest <- nrow(testdata) # size of test dataset\n\n## Let's time the calculation\nstopwatch <- Sys.time()\n\ntestprobs <- numeric(ntest) # prepare vector of probabilities\ntesthits <- numeric(ntest) # prepare vector of hits\nfor(i in 1:ntest){\n \n ## calculate probabilities given all variates except 'income'\n probs <- infer(agent = opmall,\n predictor = testdata[i, colnames(testdata) != 'income'])\n\n ## store the probability for <=50K\n testprobs[i] <- probs['<=50K']\n\n ## decide on one value\n chosenvalue <- decide(probs = probs)$optimal\n\n ## check if decision == true_value, and store result\n testhits[i] <- (chosenvalue == testdata[i, 'income'])\n}\n\n## Print total time required\nprint(Sys.time() - stopwatch)\n\nTime difference of 4.21148 secs\n\n## Histogram and average accuracy (rounded to one decimal)\nmyhist(testprobs, n = seq(0,1,length.out = 10), plot = TRUE,\n xlab = 'P(income = \"<=50K\")',\n ylab = 'frequency density in test set',\n main = paste0('accuracy: ', round(100*mean(testhits), 1), '%'))",
"crumbs": [
"[**A prototype Optimal Predictor Machine**]{.red}",
"35 [Example application: adult-income task]{.red}"
diff --git a/example_opm1.qmd b/example_opm1.qmd
index 48413eb..2014e66 100644
--- a/example_opm1.qmd
+++ b/example_opm1.qmd
@@ -359,8 +359,7 @@ relevances <- numeric(length(predictors))
names(relevances) <- predictors
## calculate, for each variate,
-## the joint probability distribution 'probs'
-## and then mutual information 'relevance' (in shannons)
+## and the mutual information 'relevance' (in shannons)
## between 'income' and that variate
for(var in predictors){
relevances[var] <- mutualinfo(agent = opmall, A = 'income', B = var)