tryocoqa_v1.json

[{"conversation_id": "0", "context_id": "0", "story": "A guide to optimizing Transformer-based models for faster inference\n\nHave you ever suffered from high inference time when working with Transformers? In this blog post, we will show you how to optimize and deploy your model to improve speed up to x10!\n\nIf you have been keeping up with the fast-moving world of AI, you surely know that in recent years Transformer models have taken over the state-of-the-art in many vital tasks on NLP, Computer Vision, time series analysis, and much more.\n\nAlthough ubiquitous, many of these models are very big and slow at inference time, requiring billions of operations to process inputs. Ultimately, this affects user experience and increases infrastructure costs as heaps of memory and computing power are needed to run these models. Even our planet is affected by this, as more computing power demands more energy, which could translate to more pollution.\n\nOne solution for accelerating these models can be buying more powerful hardware. However, this isn\u2019t a very scalable solution. Also, it may be the case that we can\u2019t just use bigger hardware as we may want to run these models on the edge: mobile phones, security cameras, vacuum cleaners, or self-driving vehicles. By nature, these embedded devices are resource-constrained, with limited memory, computing power, and battery life. So, how can we make the most out of our hardware, whether a mobile phone or a huge data center? How can we fit our model in less memory, maintaining accuracy? And how can we make it faster or even run in real time?\n\nThe answer to these questions is optimization. This blog post presents a step-by-step guide with overviews of various techniques, plus a notebook hosted on Google Colab so you can easily reproduce it on your own!\n\nYou can apply many optimization techniques to make your models much more efficient: graph optimization, downcasting and quantization, using specialized runtimes, and more. Unfortunately, we couldn\u2019t find many comprehensive guides on how to apply these techniques to Transformer models.\n\nWe take a fine-tuned version of the well-known BERT model architecture as a case study. You will see how to perform model optimization by loading the transformer model, pruning the model with sparsity and block movement, converting the model to ONNX, and applying graph optimization and downcasting. Moreover, we will show you how to deploy your optimized model on an inference server.\n\nTry it out\nWe prepared and included the following live frontend on a \ud83e\udd17 Hugging Face space so you can test and compare the baseline model, which is a large BERT fine-tuned on the squadv2 dataset, with an already available pruned version of this baseline, and finally, our \u2018pruned + optimized\u2019 version!\n\nIn the interactive demo below, you will see three fields:\n\nthe Model field, where you can select any of the three models mentioned before (by default, our optimized model).\nthe Text field, where you should provide context about a topic.\nthe Question field, where you can ask the AI about something in the text.\n\nQuestion Answering (QA) models solve a very simple task: retrieve the answer to questions about a given text.\n\nLet\u2019s take the following text as an example:\n\nquotes\n\u201cThe first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day.\u201d\n\nQuestion: \u201cWho worked a little bit harder?\u201d\n\nExpected answer: \u201cThe second little pig.\u201d\n\nYou can try this example on our live demo and see if the model gives the right answer!\n\nNote that state-of-the-art QA models can only correctly answer questions that can be answered as a specific portion of the input text. For example, if the question was \u201cWhich pigs are lazy?\u201d the correct answer should be something like \u201cthe first little pig and the second little pig\u201d or \u201cthe first and second little pigs\u201d. However, a QA model won\u2019t give any of these answers as they do not appear as such in the text.\n\nIf you select the prunned version of the model or our prunned + optimized version you should see shorter inference times. Also, you might see differences in the answers given by each model. For the example text provided above, try asking, \u201cWhat did the pigs do after building their house?\u201d.\n\nThe optimization journey\nNow let's walk together through the optimization path to get to a blazing fast model! We need to take care of two main tasks: optimizing the model and serving the model. The following diagram gives a broad overview of the steps needed to walk this road. And we know there are a few, but a journey that only takes two steps to complete isn\u2019t really that memorable, is it?\n\n\nModel\n1. Model Optimization\nStep #1: Loading the base model\nLoad a Pre-Trained Model\n\nThanks to \ud83e\udd17 Hugging Face\u2019s model zoo and the transformers library, we can load numerous pre-trained models with just a few lines of code!\n\nCheck out the from_pretrained() function on the transformers documentation!\n\nAs a starting point, we will need to select which model to use. In our case, we will be working with madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2, a BERT Large model fine-tuned on the dataset squad_v2. The SQuAD 2.0 dataset combines 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions to look similar to answerable ones. This dataset and the trained model aim to resolve the problem of question answering while abstaining from answering questions that cannot be answered.\n\nWe can load this model in Python with the help of \ud83e\udd17 Hugging Face\u2019s transformers library by using the AutoModelForQuestionAnswering and AutoTokenizer classes in the following way:\n\nThis code will load the tokenizer and model and save this last one inside models/bert.\n\nThis model will be our starting point of comparison. We will check different metrics to evaluate how well we are optimizing in each step.\n\nWe will take a special look at the following:\n\nModel size: the size of the exported model file. It is important as it will impact how much space the model will take on the disk and loaded in the GPU/RAM.\nModel performance: how well it performs after each optimization with respect to the F1 score over the SQUADv2 validation set. Some steps of the optimization may affect the performance as they change the structure of the network.\nModel throughput: this is the metric we want to improve in each step (for the context of this blog post) and we will measure it by looking at the model\u2019s inference speed. To do this, we will use Google Colab in GPU mode. In our case the assigned GPU was an NVIDIA Tesla T4, and we will test it with the maximum sequence supported by the model (512 characters).\nTo measure inference speed, we will be using the following function:\n\nYou can find the definition of the benchmark function inside the Google Colab.\n\nThe metrics for the baseline model are:\n\nModel size\t1275M\nModel performance (Best F1)\t86.08%\nModel throughput (Tesla T4)\t140.529 ms\n\nStep #2: Pruning\nReduce model size via Pruning\n\nIn this step, we will prune the model. Pruning consists of various techniques to reduce the size of the model by modifying the architecture.\n\nIt works by removing weights from the model architecture, which removes connections between nodes in the graph. This directly reduces model size and helps reduce the necessary calculations for inference with a downside of losing performance as the model is less complex.\n\nThere are different approaches to pruning a neural network. We will be using Neural Networks Block Movement Pruning which uses a technique called sparsity. This work is backed by \ud83e\udd17 Hugging Face, thanks to the library nn_pruning. You can learn more about how this works internally on the blogs [1, 2] posted by Fran\u00e7ois Lagunas.\n\nAs a first step in optimizing the model, we will use pruning. Thankfully a pruned version of our base model already exists for us to use, post-training tuning included (sometimes called fine-pruning or fine-tuning a pruned model). This step is advised as the model changes, and performance will be affected. When we train the new model, we adapt the new weights to prevent losing performance.\n\nThis pruned model is available in the \ud83e\udd17 Hugging Face model zoo. If you want to prune a custom model, you can follow this notebook tutorial on how to do so in a few simple steps.\n\nSimilar to the previous step, let\u2019s load this pruned model:\n\nThis code will load the tokenizer and model and save it to models/bert_pruned.\n\nLet\u2019s compare this model to the baseline one. We can see an improvement in reducing the size by 14.90% and increasing the throughput by x1.55 while the performance decreased by 3.41 points.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t90.801 ms\n\nStep #3: Convert to ONNX\nThe ONNX format\n\nONNX (Open Neural Network Exchange) is an open format to represent machine learning models by defining a common set of operators for the model architectures.\n\nMany frameworks for training models (PyTorch, TensorFlow, Caffe2, etc.) allow you to export their models to ONNX and also import models in ONNX format. This enables you to move models from one framework to another.\n\nTraining frameworks, like the ones mentioned above, are usually huge libraries requiring significant space and time to install and run. An advantage of converting our models to ONNX is that we can run them in small footprint libraries such as onnxruntime, giving us more space and a speed boost in our production environment.\n\nThe next step is to convert our model to ONNX. To do this, we can take advantage of the optimum library from \ud83e\udd17 Hugging Face, which helps us make our models more efficient in training and inference. In our case, we will take advantage of the implementations for ORT (ONNX Runtime).\n\nOptimum provides us the ability to convert \ud83e\udd17 Hugging Face models to ORT. In this case, we will use ORTModelForQuestionAnswering but keep in mind that there are options for the model of your choice.\n\nThis class takes a \ud83e\udd17 Hugging Face model and returns the ORT version of it. Let\u2019s use it to transform our pruned model to ORT.\n\nThis code will load the tokenizer and model and save it to models/bert_onnx_pruned/model.onnx.\n\nIn this step, the model size and the model performance stay still, as we do not generate a change in the architecture. However, thanks to the new execution environment, we can see an improvement in the model throughput of x1.18 compared to the previous step and x1.83 compared to the baseline model.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t76.723 ms\n\nStep #4: Applying Graph optimization\nGraph Optimization\n\nConverting our model to ONNX means having a new set of tools at our disposal to optimize the models. An ONNX graph can be optimized through different methods.\n\nOne of them is node fusion, which consists of combining, when possible, multiple nodes of the graph into a single one and by doing so making the whole graph smaller.\n\nAnother method is constant folding, which works in the same way as in compilers. It pre-computes calculations that can be done before running the inference on the model. This removes the need to compute them on inference.\n\nRedundant node elimination is another technique for removing nodes that do not affect the inference of a model, such as identity nodes, dropout nodes, etc. Many more optimizations can be done in a graph; here is the full documentation on the implemented optimizations on ORT.\n\nIn this step, we will optimize the ONNX graph with the help of ORT, which provides a set of tools for this purpose. In this case, we will be using the default optimization level for BERT-like models, which is currently set to only basic optimizations by default.\n\nThis will enable constant folding, redundant node elimination and node fusion. To do this, we only need to write the following code:\n\nThis code will optimize the model using ORT and save it to models/bert_onnx_pruned/optimized.onnx.\n\nIn this step, the model\u2019s size and performance also stay the same as the basic optimizations do not interfere with the result of the model itself. However, they impact the throughput due to reducing the number of calculations taking place. We can see an improvement of x1.04 compared to the previous step and x1.9 compared to the baseline model.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t73.859 ms\n\nStep #5: Downcast to FP16\nDowncasting\n\nDowncasting a neural network involves changing the precision of the weights from 32-bit (FP32) to 16-bit floating point (half-precision floating-point format or FP16).\n\nThis has different effects on the model itself. As it stores numbers that take up fewer bits, the model reduces its size (hopefully, as we are going from 32 to 16 bits, it will halve its size).\n\nAt the same time, we are forcing the model to do operations with less information, as it was trained with 32 bits. When the model does the inference with 16 bits, it will be less precise. This might affect the performance of the model.\n\nIn this step, we will reduce the precision of the model from 32 bits to 16. Thankfully onnxruntime has a utility to change our model precision to FP16. The model we have already loaded from the previous step has a method called convert_float_to_float16, which directly does what we need! We will be using deepcopy to create a copy of the model object to ensure we do not modify the state of the previous model.\n\nThis code will load the tokenizer and model and save it to models/bert_onnx_pruned/optimized_fp16.onnx.\n\nAfter this step, the size of the model was reduced by 49.03% compared to the previous one. This is almost half the size of the model. When compared to the baseline model, it decreased by 56.63%.\n\nSurprisingly, the performance was affected positively, increasing by 0.01%. This means that using the half-precision model, in this case, is better than using the FP32 one.\n\nFinally, we see a huge change in the throughput metric, improving x4.06 compared to the previous step. If we compare this against the baseline model, it is x7.73 faster.\n\nModel size\t553M\nModel performance (Best F1)\t82.68%\nModel throughput (Tesla T4)\t18.184 ms\n\nWhen we applied Downcasting, we converted the weights and biases of our model from a FP32 representation to a FP16 representation. An additional optimization step would be to apply Quantization, which takes it one step further and converts the weights and biases of our model to an 8-bit integer (INT8) representation. This representation takes four times less storage space, making your model even smaller! Also, operating with INT8 is usually faster and more portable, as almost every device supports integer operations, and not all devices support floating-point operations. We didn\u2019t apply quantization during this model optimization concretely. However, it is a good technique to consider if trying to optimize arithmetic processing mainly on the CPU or reduce the model\u2019s size.\n\nIntermediate summary\nAs the final result, we have a model that is almost half the size, loses only 3.24 performance points, and runs x7.73 times faster when compared to the original one. We can see a comparison of each step in the graph below, where we seek to have a high number in F1 and a low number in latency.\n\n\n2. Serving\nNow we have a super-fast and lightweight version of our model, so what\u2019s next? Well, you will surely want to make your model accessible to users, so it\u2019s time to take it to production! This section shows you how to easily and efficiently serve your model while testing several strategies. You can also check this work done by Micha\u00ebl Benesty on a different BERT model.\n\nIf you have ever tried to deploy a model, you have probably heard of FastAPI. FastAPI is a well-known open-source framework for developing Restful APIs in Python, developed by Sebasti\u00e1n Ramirez. It is user-friendly, performant, and can help you get your model up in a matter of minutes. Without a doubt, it is one of the first choices for every ML/DS engineer to get the first version of their model up and running!\n\nWe benchmark the different deployment strategies with different sequence lengths on the input. For each sequence length, we send 50 sequential, random requests and report the server's mean and standard deviation response time.\n\nHere is the benchmark of the base model deployed on FastAPI.\n\nHere is the code for the FastAPI app that serves our optimized Question and Answer model:\n\nHere are the results for the benchmark:\n\nWith just a few lines of code, we deployed an API that processes input and returns the output of the model in just a few milliseconds! Pretty cool, right?\n\nUsing our optimized model and the ONNX Runtime, we can enjoy a nice speed-up of x2.6 for the longest sequence length! But now that we have already optimized the model, what else can we do to accelerate inference time even more? Let\u2019s check it out!\n\nDeploy the Optimized Model on Triton\nAlthough FastAPI is a great tool and an enormous contribution to the open-source community, it wasn\u2019t explicitly designed for deploying ML models. Its scope is much broader and, thus, lacks some functionalities that are much needed for deploying models: dynamic batching, auto-scaling, GPU management, model version control, and more.\n\nIf you are looking for these features, there\u2019s another option. We are talking about NVIDIA\u2019s inference server, Triton. Triton is an open-source solution for fast and scalable inferencing. With minimal effort, you can have your model deployed and enjoy the great features Triton has to offer!\n\nAll you need to get your model up and running in Triton is a simple configuration file specifying the inputs and outputs of the model, the chosen runtime, and the hardware resources needed. You can even leave the inputs and outputs unspecified, and Triton will try to detect them automatically from the model file! Here is the config file for our example:\n\nWe include some explanations of the variables and their values below:\n\nmax_batch_size: 0 indicates that we don\u2019t want to perform dynamic batching.\nThe -1 in the dimensions of the inputs and outputs means that those dimensions are dynamic and could change for each inference request. For example, if we receive a longer sequence, we could expect the second dimension of input_ids to be bigger.\nThe instance_group property lets us specify how many copies of our model we want to deploy (for load balancing) and which hardware they should use. Here we are just simply deploying one instance running on GPU.\nHere is the benchmark for the optimized model deployed on Triton:\n\nAnd here is the comparison with the baseline:\n\nAs we can see, with little effort, Triton out-of-the-box achieves a speed-up of x3.67! Also, we didn\u2019t have to code anything! Triton processes the inputs, calls the model, and returns the outputs all by itself, which will surely save you some time.\n\nAlthough vanilla Triton achieved very good results, the tool has much more to offer than what we include in this blog post. Activating dynamic batching could save huge processing times by batching multiple requests, without significantly delaying response time. Model version management could help us test new versions of the model before taking them to production or allow us to have multiple versions of the model deployed simultaneously. Instance management enables us to perform load balancing or have multiple instances of the model for redundancy. With all these features and much more, Triton should definitely be amongst your first options for deploying models at scale!\n\nDeploy the Optimized Model on FastAPI with IOBinding\nWe decreased inference time by a factor of 3.67. However, looking at the results of the benchmarks, you might have noticed something is off\u2014the inference time appears to be increasing with the sequence length. This is due to poor management of GPU memory allocation and data transfer on our side. We will take one more step in optimizing inference time, applying a very cool technique called IOBinding.\n\nIOBinding allows the user to copy the model's inputs to GPU (or other non-CPU devices) and allocate memory for the outputs on the target device without having to do so while running the computational graph of the model. This saves huge processing times during inference, especially if inputs and outputs are very large!\n\nHere is the new code for the FastAPI app that serves our model, now with IOBinding activated.\n\nApplying this optimization is not straightforward and requires a bit more code, but it sure is worth it. Check the results of the benchmark!\n\nFastAPI with IOBinding achieves almost constant execution time, even for higher sequence lengths! This allows us to achieve a x10 speed-up!\n\n\nMean time\nClosing Thoughts\nThroughout this blog post, we saw various techniques for reducing the inference time and size of the model without losing much accuracy. We managed to decrease the model size by more than half its size and accelerate inference time by x7.73! We also presented two different tools for deploying your model and showed you how you can optimize serving in order to achieve a x10 speed-up!\n\nTransformer models are the future. With the tools we just provided, you will be able to run state-of-the-art models much faster and with fewer resources, saving you time, effort and money.", "questions": ["What is a QA model?", "What base model was used?", "What dataset was used to train it?", "How big is the dataset\u2019s training split?", "What metrics are used for evaluation?", "What metrics can be affected negatively during optimization?", "Why?"], "answers": {"input_text": ["QA models try to retrieve the answer to questions about a given text.", "madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2", "squad_v2", "I can't answer that.", "Model size, model performance and model throughput.", "Model performance", "Some steps change the structure of the network, which may affect the performance."]}}, {"conversation_id": "1", "context_id": "0", "story": "A guide to optimizing Transformer-based models for faster inference\n\nHave you ever suffered from high inference time when working with Transformers? In this blog post, we will show you how to optimize and deploy your model to improve speed up to x10!\n\nIf you have been keeping up with the fast-moving world of AI, you surely know that in recent years Transformer models have taken over the state-of-the-art in many vital tasks on NLP, Computer Vision, time series analysis, and much more.\n\nAlthough ubiquitous, many of these models are very big and slow at inference time, requiring billions of operations to process inputs. Ultimately, this affects user experience and increases infrastructure costs as heaps of memory and computing power are needed to run these models. Even our planet is affected by this, as more computing power demands more energy, which could translate to more pollution.\n\nOne solution for accelerating these models can be buying more powerful hardware. However, this isn\u2019t a very scalable solution. Also, it may be the case that we can\u2019t just use bigger hardware as we may want to run these models on the edge: mobile phones, security cameras, vacuum cleaners, or self-driving vehicles. By nature, these embedded devices are resource-constrained, with limited memory, computing power, and battery life. So, how can we make the most out of our hardware, whether a mobile phone or a huge data center? How can we fit our model in less memory, maintaining accuracy? And how can we make it faster or even run in real time?\n\nThe answer to these questions is optimization. This blog post presents a step-by-step guide with overviews of various techniques, plus a notebook hosted on Google Colab so you can easily reproduce it on your own!\n\nYou can apply many optimization techniques to make your models much more efficient: graph optimization, downcasting and quantization, using specialized runtimes, and more. Unfortunately, we couldn\u2019t find many comprehensive guides on how to apply these techniques to Transformer models.\n\nWe take a fine-tuned version of the well-known BERT model architecture as a case study. You will see how to perform model optimization by loading the transformer model, pruning the model with sparsity and block movement, converting the model to ONNX, and applying graph optimization and downcasting. Moreover, we will show you how to deploy your optimized model on an inference server.\n\nTry it out\nWe prepared and included the following live frontend on a \ud83e\udd17 Hugging Face space so you can test and compare the baseline model, which is a large BERT fine-tuned on the squadv2 dataset, with an already available pruned version of this baseline, and finally, our \u2018pruned + optimized\u2019 version!\n\nIn the interactive demo below, you will see three fields:\n\nthe Model field, where you can select any of the three models mentioned before (by default, our optimized model).\nthe Text field, where you should provide context about a topic.\nthe Question field, where you can ask the AI about something in the text.\n\nQuestion Answering (QA) models solve a very simple task: retrieve the answer to questions about a given text.\n\nLet\u2019s take the following text as an example:\n\nquotes\n\u201cThe first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day.\u201d\n\nQuestion: \u201cWho worked a little bit harder?\u201d\n\nExpected answer: \u201cThe second little pig.\u201d\n\nYou can try this example on our live demo and see if the model gives the right answer!\n\nNote that state-of-the-art QA models can only correctly answer questions that can be answered as a specific portion of the input text. For example, if the question was \u201cWhich pigs are lazy?\u201d the correct answer should be something like \u201cthe first little pig and the second little pig\u201d or \u201cthe first and second little pigs\u201d. However, a QA model won\u2019t give any of these answers as they do not appear as such in the text.\n\nIf you select the prunned version of the model or our prunned + optimized version you should see shorter inference times. Also, you might see differences in the answers given by each model. For the example text provided above, try asking, \u201cWhat did the pigs do after building their house?\u201d.\n\nThe optimization journey\nNow let's walk together through the optimization path to get to a blazing fast model! We need to take care of two main tasks: optimizing the model and serving the model. The following diagram gives a broad overview of the steps needed to walk this road. And we know there are a few, but a journey that only takes two steps to complete isn\u2019t really that memorable, is it?\n\n\nModel\n1. Model Optimization\nStep #1: Loading the base model\nLoad a Pre-Trained Model\n\nThanks to \ud83e\udd17 Hugging Face\u2019s model zoo and the transformers library, we can load numerous pre-trained models with just a few lines of code!\n\nCheck out the from_pretrained() function on the transformers documentation!\n\nAs a starting point, we will need to select which model to use. In our case, we will be working with madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2, a BERT Large model fine-tuned on the dataset squad_v2. The SQuAD 2.0 dataset combines 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions to look similar to answerable ones. This dataset and the trained model aim to resolve the problem of question answering while abstaining from answering questions that cannot be answered.\n\nWe can load this model in Python with the help of \ud83e\udd17 Hugging Face\u2019s transformers library by using the AutoModelForQuestionAnswering and AutoTokenizer classes in the following way:\n\nThis code will load the tokenizer and model and save this last one inside models/bert.\n\nThis model will be our starting point of comparison. We will check different metrics to evaluate how well we are optimizing in each step.\n\nWe will take a special look at the following:\n\nModel size: the size of the exported model file. It is important as it will impact how much space the model will take on the disk and loaded in the GPU/RAM.\nModel performance: how well it performs after each optimization with respect to the F1 score over the SQUADv2 validation set. Some steps of the optimization may affect the performance as they change the structure of the network.\nModel throughput: this is the metric we want to improve in each step (for the context of this blog post) and we will measure it by looking at the model\u2019s inference speed. To do this, we will use Google Colab in GPU mode. In our case the assigned GPU was an NVIDIA Tesla T4, and we will test it with the maximum sequence supported by the model (512 characters).\nTo measure inference speed, we will be using the following function:\n\nYou can find the definition of the benchmark function inside the Google Colab.\n\nThe metrics for the baseline model are:\n\nModel size\t1275M\nModel performance (Best F1)\t86.08%\nModel throughput (Tesla T4)\t140.529 ms\n\nStep #2: Pruning\nReduce model size via Pruning\n\nIn this step, we will prune the model. Pruning consists of various techniques to reduce the size of the model by modifying the architecture.\n\nIt works by removing weights from the model architecture, which removes connections between nodes in the graph. This directly reduces model size and helps reduce the necessary calculations for inference with a downside of losing performance as the model is less complex.\n\nThere are different approaches to pruning a neural network. We will be using Neural Networks Block Movement Pruning which uses a technique called sparsity. This work is backed by \ud83e\udd17 Hugging Face, thanks to the library nn_pruning. You can learn more about how this works internally on the blogs [1, 2] posted by Fran\u00e7ois Lagunas.\n\nAs a first step in optimizing the model, we will use pruning. Thankfully a pruned version of our base model already exists for us to use, post-training tuning included (sometimes called fine-pruning or fine-tuning a pruned model). This step is advised as the model changes, and performance will be affected. When we train the new model, we adapt the new weights to prevent losing performance.\n\nThis pruned model is available in the \ud83e\udd17 Hugging Face model zoo. If you want to prune a custom model, you can follow this notebook tutorial on how to do so in a few simple steps.\n\nSimilar to the previous step, let\u2019s load this pruned model:\n\nThis code will load the tokenizer and model and save it to models/bert_pruned.\n\nLet\u2019s compare this model to the baseline one. We can see an improvement in reducing the size by 14.90% and increasing the throughput by x1.55 while the performance decreased by 3.41 points.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t90.801 ms\n\nStep #3: Convert to ONNX\nThe ONNX format\n\nONNX (Open Neural Network Exchange) is an open format to represent machine learning models by defining a common set of operators for the model architectures.\n\nMany frameworks for training models (PyTorch, TensorFlow, Caffe2, etc.) allow you to export their models to ONNX and also import models in ONNX format. This enables you to move models from one framework to another.\n\nTraining frameworks, like the ones mentioned above, are usually huge libraries requiring significant space and time to install and run. An advantage of converting our models to ONNX is that we can run them in small footprint libraries such as onnxruntime, giving us more space and a speed boost in our production environment.\n\nThe next step is to convert our model to ONNX. To do this, we can take advantage of the optimum library from \ud83e\udd17 Hugging Face, which helps us make our models more efficient in training and inference. In our case, we will take advantage of the implementations for ORT (ONNX Runtime).\n\nOptimum provides us the ability to convert \ud83e\udd17 Hugging Face models to ORT. In this case, we will use ORTModelForQuestionAnswering but keep in mind that there are options for the model of your choice.\n\nThis class takes a \ud83e\udd17 Hugging Face model and returns the ORT version of it. Let\u2019s use it to transform our pruned model to ORT.\n\nThis code will load the tokenizer and model and save it to models/bert_onnx_pruned/model.onnx.\n\nIn this step, the model size and the model performance stay still, as we do not generate a change in the architecture. However, thanks to the new execution environment, we can see an improvement in the model throughput of x1.18 compared to the previous step and x1.83 compared to the baseline model.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t76.723 ms\n\nStep #4: Applying Graph optimization\nGraph Optimization\n\nConverting our model to ONNX means having a new set of tools at our disposal to optimize the models. An ONNX graph can be optimized through different methods.\n\nOne of them is node fusion, which consists of combining, when possible, multiple nodes of the graph into a single one and by doing so making the whole graph smaller.\n\nAnother method is constant folding, which works in the same way as in compilers. It pre-computes calculations that can be done before running the inference on the model. This removes the need to compute them on inference.\n\nRedundant node elimination is another technique for removing nodes that do not affect the inference of a model, such as identity nodes, dropout nodes, etc. Many more optimizations can be done in a graph; here is the full documentation on the implemented optimizations on ORT.\n\nIn this step, we will optimize the ONNX graph with the help of ORT, which provides a set of tools for this purpose. In this case, we will be using the default optimization level for BERT-like models, which is currently set to only basic optimizations by default.\n\nThis will enable constant folding, redundant node elimination and node fusion. To do this, we only need to write the following code:\n\nThis code will optimize the model using ORT and save it to models/bert_onnx_pruned/optimized.onnx.\n\nIn this step, the model\u2019s size and performance also stay the same as the basic optimizations do not interfere with the result of the model itself. However, they impact the throughput due to reducing the number of calculations taking place. We can see an improvement of x1.04 compared to the previous step and x1.9 compared to the baseline model.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t73.859 ms\n\nStep #5: Downcast to FP16\nDowncasting\n\nDowncasting a neural network involves changing the precision of the weights from 32-bit (FP32) to 16-bit floating point (half-precision floating-point format or FP16).\n\nThis has different effects on the model itself. As it stores numbers that take up fewer bits, the model reduces its size (hopefully, as we are going from 32 to 16 bits, it will halve its size).\n\nAt the same time, we are forcing the model to do operations with less information, as it was trained with 32 bits. When the model does the inference with 16 bits, it will be less precise. This might affect the performance of the model.\n\nIn this step, we will reduce the precision of the model from 32 bits to 16. Thankfully onnxruntime has a utility to change our model precision to FP16. The model we have already loaded from the previous step has a method called convert_float_to_float16, which directly does what we need! We will be using deepcopy to create a copy of the model object to ensure we do not modify the state of the previous model.\n\nThis code will load the tokenizer and model and save it to models/bert_onnx_pruned/optimized_fp16.onnx.\n\nAfter this step, the size of the model was reduced by 49.03% compared to the previous one. This is almost half the size of the model. When compared to the baseline model, it decreased by 56.63%.\n\nSurprisingly, the performance was affected positively, increasing by 0.01%. This means that using the half-precision model, in this case, is better than using the FP32 one.\n\nFinally, we see a huge change in the throughput metric, improving x4.06 compared to the previous step. If we compare this against the baseline model, it is x7.73 faster.\n\nModel size\t553M\nModel performance (Best F1)\t82.68%\nModel throughput (Tesla T4)\t18.184 ms\n\nWhen we applied Downcasting, we converted the weights and biases of our model from a FP32 representation to a FP16 representation. An additional optimization step would be to apply Quantization, which takes it one step further and converts the weights and biases of our model to an 8-bit integer (INT8) representation. This representation takes four times less storage space, making your model even smaller! Also, operating with INT8 is usually faster and more portable, as almost every device supports integer operations, and not all devices support floating-point operations. We didn\u2019t apply quantization during this model optimization concretely. However, it is a good technique to consider if trying to optimize arithmetic processing mainly on the CPU or reduce the model\u2019s size.\n\nIntermediate summary\nAs the final result, we have a model that is almost half the size, loses only 3.24 performance points, and runs x7.73 times faster when compared to the original one. We can see a comparison of each step in the graph below, where we seek to have a high number in F1 and a low number in latency.\n\n\n2. Serving\nNow we have a super-fast and lightweight version of our model, so what\u2019s next? Well, you will surely want to make your model accessible to users, so it\u2019s time to take it to production! This section shows you how to easily and efficiently serve your model while testing several strategies. You can also check this work done by Micha\u00ebl Benesty on a different BERT model.\n\nIf you have ever tried to deploy a model, you have probably heard of FastAPI. FastAPI is a well-known open-source framework for developing Restful APIs in Python, developed by Sebasti\u00e1n Ramirez. It is user-friendly, performant, and can help you get your model up in a matter of minutes. Without a doubt, it is one of the first choices for every ML/DS engineer to get the first version of their model up and running!\n\nWe benchmark the different deployment strategies with different sequence lengths on the input. For each sequence length, we send 50 sequential, random requests and report the server's mean and standard deviation response time.\n\nHere is the benchmark of the base model deployed on FastAPI.\n\nHere is the code for the FastAPI app that serves our optimized Question and Answer model:\n\nHere are the results for the benchmark:\n\nWith just a few lines of code, we deployed an API that processes input and returns the output of the model in just a few milliseconds! Pretty cool, right?\n\nUsing our optimized model and the ONNX Runtime, we can enjoy a nice speed-up of x2.6 for the longest sequence length! But now that we have already optimized the model, what else can we do to accelerate inference time even more? Let\u2019s check it out!\n\nDeploy the Optimized Model on Triton\nAlthough FastAPI is a great tool and an enormous contribution to the open-source community, it wasn\u2019t explicitly designed for deploying ML models. Its scope is much broader and, thus, lacks some functionalities that are much needed for deploying models: dynamic batching, auto-scaling, GPU management, model version control, and more.\n\nIf you are looking for these features, there\u2019s another option. We are talking about NVIDIA\u2019s inference server, Triton. Triton is an open-source solution for fast and scalable inferencing. With minimal effort, you can have your model deployed and enjoy the great features Triton has to offer!\n\nAll you need to get your model up and running in Triton is a simple configuration file specifying the inputs and outputs of the model, the chosen runtime, and the hardware resources needed. You can even leave the inputs and outputs unspecified, and Triton will try to detect them automatically from the model file! Here is the config file for our example:\n\nWe include some explanations of the variables and their values below:\n\nmax_batch_size: 0 indicates that we don\u2019t want to perform dynamic batching.\nThe -1 in the dimensions of the inputs and outputs means that those dimensions are dynamic and could change for each inference request. For example, if we receive a longer sequence, we could expect the second dimension of input_ids to be bigger.\nThe instance_group property lets us specify how many copies of our model we want to deploy (for load balancing) and which hardware they should use. Here we are just simply deploying one instance running on GPU.\nHere is the benchmark for the optimized model deployed on Triton:\n\nAnd here is the comparison with the baseline:\n\nAs we can see, with little effort, Triton out-of-the-box achieves a speed-up of x3.67! Also, we didn\u2019t have to code anything! Triton processes the inputs, calls the model, and returns the outputs all by itself, which will surely save you some time.\n\nAlthough vanilla Triton achieved very good results, the tool has much more to offer than what we include in this blog post. Activating dynamic batching could save huge processing times by batching multiple requests, without significantly delaying response time. Model version management could help us test new versions of the model before taking them to production or allow us to have multiple versions of the model deployed simultaneously. Instance management enables us to perform load balancing or have multiple instances of the model for redundancy. With all these features and much more, Triton should definitely be amongst your first options for deploying models at scale!\n\nDeploy the Optimized Model on FastAPI with IOBinding\nWe decreased inference time by a factor of 3.67. However, looking at the results of the benchmarks, you might have noticed something is off\u2014the inference time appears to be increasing with the sequence length. This is due to poor management of GPU memory allocation and data transfer on our side. We will take one more step in optimizing inference time, applying a very cool technique called IOBinding.\n\nIOBinding allows the user to copy the model's inputs to GPU (or other non-CPU devices) and allocate memory for the outputs on the target device without having to do so while running the computational graph of the model. This saves huge processing times during inference, especially if inputs and outputs are very large!\n\nHere is the new code for the FastAPI app that serves our model, now with IOBinding activated.\n\nApplying this optimization is not straightforward and requires a bit more code, but it sure is worth it. Check the results of the benchmark!\n\nFastAPI with IOBinding achieves almost constant execution time, even for higher sequence lengths! This allows us to achieve a x10 speed-up!\n\n\nMean time\nClosing Thoughts\nThroughout this blog post, we saw various techniques for reducing the inference time and size of the model without losing much accuracy. We managed to decrease the model size by more than half its size and accelerate inference time by x7.73! We also presented two different tools for deploying your model and showed you how you can optimize serving in order to achieve a x10 speed-up!\n\nTransformer models are the future. With the tools we just provided, you will be able to run state-of-the-art models much faster and with fewer resources, saving you time, effort and money.", "questions": ["What are the main steps for optimizing a transformer model? ", "What can you tell me about the second one?", "What can you tell me about the third step?", "Tell me the benefits of running a model on onnxruntime", "What comes after model optimization?", "What is the difference between downcasting and quantization?", "How do they contribute to optimization?", "Which Triton features were not included on the blogpost?", "Befor activating IOBinding, what happens with the inference time?", "Why?", "Give a short explanation of what IOBinding is", "Why is this good?"], "answers": {"input_text": ["Loading the model, pruning, converting to ONNX, applying graph optimization and downcasting", "Pruning consists of various techniques to reduce the size of the model", "It involves converting our models to ONNX.", "It gives more space and a speed boost in the production environment.", "Make your model accesible to users", "Quantization converts the weights and biases of the model to an 8-bit integer representation while Downcasting converts them to a FP16 representation.", "Reducing the model size", "Dynamic batching, model version management and instance management", "It appears to be increasing with sequence length", "Poor management of GPU memory allocation and data transfer", "IOBinding allows the user to copy the model's inputs to the devices and allocate memory for the outputs without having to do so while running the computational graph of the model.", "It saves huge processing times during inference"]}}, {"conversation_id": "2", "context_id": "0", "story": "A guide to optimizing Transformer-based models for faster inference\n\nHave you ever suffered from high inference time when working with Transformers? In this blog post, we will show you how to optimize and deploy your model to improve speed up to x10!\n\nIf you have been keeping up with the fast-moving world of AI, you surely know that in recent years Transformer models have taken over the state-of-the-art in many vital tasks on NLP, Computer Vision, time series analysis, and much more.\n\nAlthough ubiquitous, many of these models are very big and slow at inference time, requiring billions of operations to process inputs. Ultimately, this affects user experience and increases infrastructure costs as heaps of memory and computing power are needed to run these models. Even our planet is affected by this, as more computing power demands more energy, which could translate to more pollution.\n\nOne solution for accelerating these models can be buying more powerful hardware. However, this isn\u2019t a very scalable solution. Also, it may be the case that we can\u2019t just use bigger hardware as we may want to run these models on the edge: mobile phones, security cameras, vacuum cleaners, or self-driving vehicles. By nature, these embedded devices are resource-constrained, with limited memory, computing power, and battery life. So, how can we make the most out of our hardware, whether a mobile phone or a huge data center? How can we fit our model in less memory, maintaining accuracy? And how can we make it faster or even run in real time?\n\nThe answer to these questions is optimization. This blog post presents a step-by-step guide with overviews of various techniques, plus a notebook hosted on Google Colab so you can easily reproduce it on your own!\n\nYou can apply many optimization techniques to make your models much more efficient: graph optimization, downcasting and quantization, using specialized runtimes, and more. Unfortunately, we couldn\u2019t find many comprehensive guides on how to apply these techniques to Transformer models.\n\nWe take a fine-tuned version of the well-known BERT model architecture as a case study. You will see how to perform model optimization by loading the transformer model, pruning the model with sparsity and block movement, converting the model to ONNX, and applying graph optimization and downcasting. Moreover, we will show you how to deploy your optimized model on an inference server.\n\nTry it out\nWe prepared and included the following live frontend on a \ud83e\udd17 Hugging Face space so you can test and compare the baseline model, which is a large BERT fine-tuned on the squadv2 dataset, with an already available pruned version of this baseline, and finally, our \u2018pruned + optimized\u2019 version!\n\nIn the interactive demo below, you will see three fields:\n\nthe Model field, where you can select any of the three models mentioned before (by default, our optimized model).\nthe Text field, where you should provide context about a topic.\nthe Question field, where you can ask the AI about something in the text.\n\nQuestion Answering (QA) models solve a very simple task: retrieve the answer to questions about a given text.\n\nLet\u2019s take the following text as an example:\n\nquotes\n\u201cThe first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day.\u201d\n\nQuestion: \u201cWho worked a little bit harder?\u201d\n\nExpected answer: \u201cThe second little pig.\u201d\n\nYou can try this example on our live demo and see if the model gives the right answer!\n\nNote that state-of-the-art QA models can only correctly answer questions that can be answered as a specific portion of the input text. For example, if the question was \u201cWhich pigs are lazy?\u201d the correct answer should be something like \u201cthe first little pig and the second little pig\u201d or \u201cthe first and second little pigs\u201d. However, a QA model won\u2019t give any of these answers as they do not appear as such in the text.\n\nIf you select the prunned version of the model or our prunned + optimized version you should see shorter inference times. Also, you might see differences in the answers given by each model. For the example text provided above, try asking, \u201cWhat did the pigs do after building their house?\u201d.\n\nThe optimization journey\nNow let's walk together through the optimization path to get to a blazing fast model! We need to take care of two main tasks: optimizing the model and serving the model. The following diagram gives a broad overview of the steps needed to walk this road. And we know there are a few, but a journey that only takes two steps to complete isn\u2019t really that memorable, is it?\n\n\nModel\n1. Model Optimization\nStep #1: Loading the base model\nLoad a Pre-Trained Model\n\nThanks to \ud83e\udd17 Hugging Face\u2019s model zoo and the transformers library, we can load numerous pre-trained models with just a few lines of code!\n\nCheck out the from_pretrained() function on the transformers documentation!\n\nAs a starting point, we will need to select which model to use. In our case, we will be working with madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2, a BERT Large model fine-tuned on the dataset squad_v2. The SQuAD 2.0 dataset combines 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions to look similar to answerable ones. This dataset and the trained model aim to resolve the problem of question answering while abstaining from answering questions that cannot be answered.\n\nWe can load this model in Python with the help of \ud83e\udd17 Hugging Face\u2019s transformers library by using the AutoModelForQuestionAnswering and AutoTokenizer classes in the following way:\n\nThis code will load the tokenizer and model and save this last one inside models/bert.\n\nThis model will be our starting point of comparison. We will check different metrics to evaluate how well we are optimizing in each step.\n\nWe will take a special look at the following:\n\nModel size: the size of the exported model file. It is important as it will impact how much space the model will take on the disk and loaded in the GPU/RAM.\nModel performance: how well it performs after each optimization with respect to the F1 score over the SQUADv2 validation set. Some steps of the optimization may affect the performance as they change the structure of the network.\nModel throughput: this is the metric we want to improve in each step (for the context of this blog post) and we will measure it by looking at the model\u2019s inference speed. To do this, we will use Google Colab in GPU mode. In our case the assigned GPU was an NVIDIA Tesla T4, and we will test it with the maximum sequence supported by the model (512 characters).\nTo measure inference speed, we will be using the following function:\n\nYou can find the definition of the benchmark function inside the Google Colab.\n\nThe metrics for the baseline model are:\n\nModel size\t1275M\nModel performance (Best F1)\t86.08%\nModel throughput (Tesla T4)\t140.529 ms\n\nStep #2: Pruning\nReduce model size via Pruning\n\nIn this step, we will prune the model. Pruning consists of various techniques to reduce the size of the model by modifying the architecture.\n\nIt works by removing weights from the model architecture, which removes connections between nodes in the graph. This directly reduces model size and helps reduce the necessary calculations for inference with a downside of losing performance as the model is less complex.\n\nThere are different approaches to pruning a neural network. We will be using Neural Networks Block Movement Pruning which uses a technique called sparsity. This work is backed by \ud83e\udd17 Hugging Face, thanks to the library nn_pruning. You can learn more about how this works internally on the blogs [1, 2] posted by Fran\u00e7ois Lagunas.\n\nAs a first step in optimizing the model, we will use pruning. Thankfully a pruned version of our base model already exists for us to use, post-training tuning included (sometimes called fine-pruning or fine-tuning a pruned model). This step is advised as the model changes, and performance will be affected. When we train the new model, we adapt the new weights to prevent losing performance.\n\nThis pruned model is available in the \ud83e\udd17 Hugging Face model zoo. If you want to prune a custom model, you can follow this notebook tutorial on how to do so in a few simple steps.\n\nSimilar to the previous step, let\u2019s load this pruned model:\n\nThis code will load the tokenizer and model and save it to models/bert_pruned.\n\nLet\u2019s compare this model to the baseline one. We can see an improvement in reducing the size by 14.90% and increasing the throughput by x1.55 while the performance decreased by 3.41 points.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t90.801 ms\n\nStep #3: Convert to ONNX\nThe ONNX format\n\nONNX (Open Neural Network Exchange) is an open format to represent machine learning models by defining a common set of operators for the model architectures.\n\nMany frameworks for training models (PyTorch, TensorFlow, Caffe2, etc.) allow you to export their models to ONNX and also import models in ONNX format. This enables you to move models from one framework to another.\n\nTraining frameworks, like the ones mentioned above, are usually huge libraries requiring significant space and time to install and run. An advantage of converting our models to ONNX is that we can run them in small footprint libraries such as onnxruntime, giving us more space and a speed boost in our production environment.\n\nThe next step is to convert our model to ONNX. To do this, we can take advantage of the optimum library from \ud83e\udd17 Hugging Face, which helps us make our models more efficient in training and inference. In our case, we will take advantage of the implementations for ORT (ONNX Runtime).\n\nOptimum provides us the ability to convert \ud83e\udd17 Hugging Face models to ORT. In this case, we will use ORTModelForQuestionAnswering but keep in mind that there are options for the model of your choice.\n\nThis class takes a \ud83e\udd17 Hugging Face model and returns the ORT version of it. Let\u2019s use it to transform our pruned model to ORT.\n\nThis code will load the tokenizer and model and save it to models/bert_onnx_pruned/model.onnx.\n\nIn this step, the model size and the model performance stay still, as we do not generate a change in the architecture. However, thanks to the new execution environment, we can see an improvement in the model throughput of x1.18 compared to the previous step and x1.83 compared to the baseline model.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t76.723 ms\n\nStep #4: Applying Graph optimization\nGraph Optimization\n\nConverting our model to ONNX means having a new set of tools at our disposal to optimize the models. An ONNX graph can be optimized through different methods.\n\nOne of them is node fusion, which consists of combining, when possible, multiple nodes of the graph into a single one and by doing so making the whole graph smaller.\n\nAnother method is constant folding, which works in the same way as in compilers. It pre-computes calculations that can be done before running the inference on the model. This removes the need to compute them on inference.\n\nRedundant node elimination is another technique for removing nodes that do not affect the inference of a model, such as identity nodes, dropout nodes, etc. Many more optimizations can be done in a graph; here is the full documentation on the implemented optimizations on ORT.\n\nIn this step, we will optimize the ONNX graph with the help of ORT, which provides a set of tools for this purpose. In this case, we will be using the default optimization level for BERT-like models, which is currently set to only basic optimizations by default.\n\nThis will enable constant folding, redundant node elimination and node fusion. To do this, we only need to write the following code:\n\nThis code will optimize the model using ORT and save it to models/bert_onnx_pruned/optimized.onnx.\n\nIn this step, the model\u2019s size and performance also stay the same as the basic optimizations do not interfere with the result of the model itself. However, they impact the throughput due to reducing the number of calculations taking place. We can see an improvement of x1.04 compared to the previous step and x1.9 compared to the baseline model.\n\nModel size\t1085M\nModel performance (Best F1)\t82.67%\nModel throughput (Tesla T4)\t73.859 ms\n\nStep #5: Downcast to FP16\nDowncasting\n\nDowncasting a neural network involves changing the precision of the weights from 32-bit (FP32) to 16-bit floating point (half-precision floating-point format or FP16).\n\nThis has different effects on the model itself. As it stores numbers that take up fewer bits, the model reduces its size (hopefully, as we are going from 32 to 16 bits, it will halve its size).\n\nAt the same time, we are forcing the model to do operations with less information, as it was trained with 32 bits. When the model does the inference with 16 bits, it will be less precise. This might affect the performance of the model.\n\nIn this step, we will reduce the precision of the model from 32 bits to 16. Thankfully onnxruntime has a utility to change our model precision to FP16. The model we have already loaded from the previous step has a method called convert_float_to_float16, which directly does what we need! We will be using deepcopy to create a copy of the model object to ensure we do not modify the state of the previous model.\n\nThis code will load the tokenizer and model and save it to models/bert_onnx_pruned/optimized_fp16.onnx.\n\nAfter this step, the size of the model was reduced by 49.03% compared to the previous one. This is almost half the size of the model. When compared to the baseline model, it decreased by 56.63%.\n\nSurprisingly, the performance was affected positively, increasing by 0.01%. This means that using the half-precision model, in this case, is better than using the FP32 one.\n\nFinally, we see a huge change in the throughput metric, improving x4.06 compared to the previous step. If we compare this against the baseline model, it is x7.73 faster.\n\nModel size\t553M\nModel performance (Best F1)\t82.68%\nModel throughput (Tesla T4)\t18.184 ms\n\nWhen we applied Downcasting, we converted the weights and biases of our model from a FP32 representation to a FP16 representation. An additional optimization step would be to apply Quantization, which takes it one step further and converts the weights and biases of our model to an 8-bit integer (INT8) representation. This representation takes four times less storage space, making your model even smaller! Also, operating with INT8 is usually faster and more portable, as almost every device supports integer operations, and not all devices support floating-point operations. We didn\u2019t apply quantization during this model optimization concretely. However, it is a good technique to consider if trying to optimize arithmetic processing mainly on the CPU or reduce the model\u2019s size.\n\nIntermediate summary\nAs the final result, we have a model that is almost half the size, loses only 3.24 performance points, and runs x7.73 times faster when compared to the original one. We can see a comparison of each step in the graph below, where we seek to have a high number in F1 and a low number in latency.\n\n\n2. Serving\nNow we have a super-fast and lightweight version of our model, so what\u2019s next? Well, you will surely want to make your model accessible to users, so it\u2019s time to take it to production! This section shows you how to easily and efficiently serve your model while testing several strategies. You can also check this work done by Micha\u00ebl Benesty on a different BERT model.\n\nIf you have ever tried to deploy a model, you have probably heard of FastAPI. FastAPI is a well-known open-source framework for developing Restful APIs in Python, developed by Sebasti\u00e1n Ramirez. It is user-friendly, performant, and can help you get your model up in a matter of minutes. Without a doubt, it is one of the first choices for every ML/DS engineer to get the first version of their model up and running!\n\nWe benchmark the different deployment strategies with different sequence lengths on the input. For each sequence length, we send 50 sequential, random requests and report the server's mean and standard deviation response time.\n\nHere is the benchmark of the base model deployed on FastAPI.\n\nHere is the code for the FastAPI app that serves our optimized Question and Answer model:\n\nHere are the results for the benchmark:\n\nWith just a few lines of code, we deployed an API that processes input and returns the output of the model in just a few milliseconds! Pretty cool, right?\n\nUsing our optimized model and the ONNX Runtime, we can enjoy a nice speed-up of x2.6 for the longest sequence length! But now that we have already optimized the model, what else can we do to accelerate inference time even more? Let\u2019s check it out!\n\nDeploy the Optimized Model on Triton\nAlthough FastAPI is a great tool and an enormous contribution to the open-source community, it wasn\u2019t explicitly designed for deploying ML models. Its scope is much broader and, thus, lacks some functionalities that are much needed for deploying models: dynamic batching, auto-scaling, GPU management, model version control, and more.\n\nIf you are looking for these features, there\u2019s another option. We are talking about NVIDIA\u2019s inference server, Triton. Triton is an open-source solution for fast and scalable inferencing. With minimal effort, you can have your model deployed and enjoy the great features Triton has to offer!\n\nAll you need to get your model up and running in Triton is a simple configuration file specifying the inputs and outputs of the model, the chosen runtime, and the hardware resources needed. You can even leave the inputs and outputs unspecified, and Triton will try to detect them automatically from the model file! Here is the config file for our example:\n\nWe include some explanations of the variables and their values below:\n\nmax_batch_size: 0 indicates that we don\u2019t want to perform dynamic batching.\nThe -1 in the dimensions of the inputs and outputs means that those dimensions are dynamic and could change for each inference request. For example, if we receive a longer sequence, we could expect the second dimension of input_ids to be bigger.\nThe instance_group property lets us specify how many copies of our model we want to deploy (for load balancing) and which hardware they should use. Here we are just simply deploying one instance running on GPU.\nHere is the benchmark for the optimized model deployed on Triton:\n\nAnd here is the comparison with the baseline:\n\nAs we can see, with little effort, Triton out-of-the-box achieves a speed-up of x3.67! Also, we didn\u2019t have to code anything! Triton processes the inputs, calls the model, and returns the outputs all by itself, which will surely save you some time.\n\nAlthough vanilla Triton achieved very good results, the tool has much more to offer than what we include in this blog post. Activating dynamic batching could save huge processing times by batching multiple requests, without significantly delaying response time. Model version management could help us test new versions of the model before taking them to production or allow us to have multiple versions of the model deployed simultaneously. Instance management enables us to perform load balancing or have multiple instances of the model for redundancy. With all these features and much more, Triton should definitely be amongst your first options for deploying models at scale!\n\nDeploy the Optimized Model on FastAPI with IOBinding\nWe decreased inference time by a factor of 3.67. However, looking at the results of the benchmarks, you might have noticed something is off\u2014the inference time appears to be increasing with the sequence length. This is due to poor management of GPU memory allocation and data transfer on our side. We will take one more step in optimizing inference time, applying a very cool technique called IOBinding.\n\nIOBinding allows the user to copy the model's inputs to GPU (or other non-CPU devices) and allocate memory for the outputs on the target device without having to do so while running the computational graph of the model. This saves huge processing times during inference, especially if inputs and outputs are very large!\n\nHere is the new code for the FastAPI app that serves our model, now with IOBinding activated.\n\nApplying this optimization is not straightforward and requires a bit more code, but it sure is worth it. Check the results of the benchmark!\n\nFastAPI with IOBinding achieves almost constant execution time, even for higher sequence lengths! This allows us to achieve a x10 speed-up!\n\n\nMean time\nClosing Thoughts\nThroughout this blog post, we saw various techniques for reducing the inference time and size of the model without losing much accuracy. We managed to decrease the model size by more than half its size and accelerate inference time by x7.73! We also presented two different tools for deploying your model and showed you how you can optimize serving in order to achieve a x10 speed-up!\n\nTransformer models are the future. With the tools we just provided, you will be able to run state-of-the-art models much faster and with fewer resources, saving you time, effort and money.", "questions": ["What's the problem with modern Transformer models?", "Why?", "And why is this a problem?", "Why do infrastructure costs go up?", "How much does the throughput improve compared to the baseline model after the graph optimization step?", "And compared to the previous step?", "What are the improvements of the model mentioned in the intermediate summary?", "What is a disadvantage of this model?", "Who is Sebastian Ramirez?", "What is FastAPI?", "What features does Triton have that FastAPI doesn't?", "Can I use Triton for free?", "Who developed it?", "What improvement do we get by using Triton?"], "answers": {"input_text": ["Many of these models are very big and slow at inference time", "They require billions of operations to process inputs", "It affects user experience, increases infrastructure costs and affects the planet", "Heaps of memory and computing power are needed", "x1.9", "x1.04", "The model is half the size and runs x7.73 times faster", "It loses 3.24 performance points", "The developer of FastAPI", "FastAPI is a well-known open-source framework for developing Restful APIs", "Dynamic batching, auto-scaling, GPU management, model version control, and more.", "Yes", "NVIDIA", "A speed-up of x3.67"]}}, {"conversation_id": "5", "context_id": "1", "story": "Machine Learning 101: Build, Train, Test, Rinse & Repeat\n\nIn the previous episode of our Road to Data Science series, we went through all the general skills needed to become a proficient data scientist. In this post, we start to dig deeper into the specifics of Machine Learning. We start by posing the simple questions: What is Machine Learning? What kind of problems can Machine Learning solve?\n\nWhat is Machine Learning?\nA great way to start understanding the idea behind Machine Learning is to think about what\u2019s new in Machine Learning compared to traditional programming.\n\nSo, essentially in traditional programming, we build a machine that outputs a result based on an input and a set of rules. Our job as programmers is to construct the correct set of rules for our system to output the desired result. Thanks to the computer\u2019s ability to follow complex instructions incredibly fast, this approach works very well for a vast set of tasks, from calculating taxes to landing a rocket on the moon.\n\nSo, why do we need Machine Learning? Well, machines are great at solving problems that look difficult or tedious to us. For example, imagine computing all the prime numbers up to 10,000,000. On the other hand, another set of problems, including some that seem trivial to us, are very hard to decompose into a set of deterministic rules.\n\nTake, for example, the problem of observing a picture and determining if there\u2019s a cat in it. The problem is trivial for any human, but not for a computer, especially considering all the variations between breeds and all the variations in angle, lighting, positioning, and occlusion. What can we do?\n\n\nCat photos showing occlusion, diversity, deformation, and lighting variations through the different photographs.\nMachine Learning to the rescue! Instead of building an explicit set of rules, Machine Learning systems learn the patterns that go from the input to the output based on observing a massive amount of data. This approach can solve many tasks, from identifying cats in pictures to modeling the consumer purchase decision.\n\nThe details of how our Machine Learning system learns these patterns depend on the specific algorithm we are using. These algorithms come in many shapes and flavors. There are whole families of them, but we can generally group them based on a couple of criteria.\n\nIf we look at how information is fed to the model, we can broadly classify Machine Learning techniques as supervised, unsupervised, or reinforcement. When labels are fed to the algorithm during training, the method is supervised. When no label is fed, the method is unsupervised. And when the system is allowed to make decisions and gets feedback on them, the method is Reinforcement Learning.\n\nOn the other hand, if we look at the kind of task they perform, Machine Learning techniques can be classified into several groups such as regression, clustering, single or multi-class classification, and so on.\n\nExtra readings\nCheck this post to learn more about what kind of tasks can be solved using Machine Learning.\nMicrosoft has a very comprehensive and extensive set of courses for beginners, one called ML for beginners and the other one called Data Science for beginners, be sure to check them out!\n\nA Shopping List to Machine Learning\nIf you had asked us 15 years ago what skills you needed to apply Machine Learning, our list would have included an endless list of mathematical, computer science, and statistical knowledge that could have discouraged even the most audacious.\n\nToday, thanks to the cumulative contributions of many in the Machine Learning community, the field has become much more approachable. Anyone who wants to do Machine Learning today stands on the shoulders of giants.\n\nIn fact, you might have heard that doing Machine Learning these days is as simple as writing a few lines of code. Well, the rumors are true, but that's only a fraction. Let's see that for ourselves.\n\nSay you were walking the countryside when suddenly you felt the unrelenting urge to differentiate the subspecies of a particular type of plant: the iris.\n\n\nIris flowers images.\nIris flower subspecies a to the left and subspecies b to the right. Images extrated from Unsplash.\nYou could take a picture of every iris flower you encounter and send it to your taxonomist friend or try to learn the pattern that goes from the dimensions of the flowers to their species. In Python, this looks like this:\n\nBoom! Problem solved with 96.7% accuracy in unseen data! Next time you see an iris, you can measure its flower and ask your model to predict the plant\u2019s species. Now, you might be looking at this unfamiliar code and wonder: Is it possible to learn this power? Or, if you\u2019ve been following along our blogposts and are already familiar with Python, you might wonder: What if I want to do something slightly more complex? What tools do I need? Where do I find the resources? Don\u2019t worry; we\u2019ve got you covered. We\u2019ll walk you through the Machine Learning pipeline step by step, referring you in each step to all the relevant material you need.\n\n1. Get Raw Data\nUnless you\u2019ve been living under a rock during the last decade, \u201cData is the New Oil\u201d is a headline you have seen several times. No wonder companies are so protective and private about their data - a lot of potential is there waiting to be exploited!\n\nDoes this mean that you can't access any interesting datasets? Well, if you are solving a novel or a very niche problem or maybe need a specialist like a doctor to tag medical images, you will likely need custom data and a custom process. But for simpler cases or educational purposes, the community has put an enormous effort into having publicly available and preprocessed data. What a time to be alive!\n\nSuppose you want to take a look at some curated datasets. To name just a few, you can check the COCO dataset for image segmentation, the ImageNet dataset for image classification, and Kaggle's Datasets for an extra wide variety of problems, from used car prices to a brain stroke prediction dataset.\n\nOnce you get your hands on your precious data, what's next? It depends on what kind of data we\u2019re talking about and its degree of \u201crawness.\u201d Are you dealing with images? Text? Is there a database (lucky you!)? Do you have a CSV (or JSON, or, God forbid!, XML)?\n\nIn any case, you need to be able to handle all these kinds of data and clean, tag, and transform it to your needs. You need to cook that raw data that you just got. For that purpose, let's jump to the next section called \u201cCook Transform Data.\u201d\n\nExtra readings\nLuminousmens article on data file formats and their comparative advantages.\n\n2. Transform Data\nOnce we have the data, we need to move it, change it, stretch it, and bend it to our will. Unless the dataset is very small, this requires some sort of programming.\n\nThere are many kinds of data, each requiring a different set of skills. We could mention images, videos, and text as popular data sources. However, for this blog, we will focus on the data type that (probably) rules them all: tabular data.\n\nA tabular dataset is basically a spreadsheet where each row is a new sample, and each column represents some information about those samples.\n\nManually changing things in spreadsheets is easy, albeit time-consuming, and usually unfeasible for large datasets. The equivalent way to manipulate tabular data in Python is with Pandas, the most widely adopted library for table manipulation in Python and one of the most popular libraries in the language.\n\nIf you are serious about doing Machine Learning in Python, you must learn Pandas. Luckily, there are many great resources available to help you learn.\n\nEssentials\nAs a starting resource, this Kaggle tutorial is a must.\nFurther, we recommend Towards Data Science\u2019s blog post on writing simpler and more consistent Pandas code.\nFor a hands-on guide from Pandas basics to its most important features, you can check our blog post on the subject.\n\n3. Pick a Machine Learning model\nThere are several ways to think about Machine Learning models. In the vignette below, XKCD summarizes quite well one of the most common ways people see Machine Learning models.\n\n\nVignet summaring one of the most common ways people see Machine Learning models.\nSource: XKCD.\nBut, we prefer a different way to think about them.\n\nWe think of a Machine Learning model as a coffee machine where you put coffee beans and water in and get magic on the other side. Even though making coffee is fairly easy, each coffee machine has its own way of working. If you are brewing coffee with a French press, you need a coarser grind, whereas if you use pour-over brewers, the grind needs to be much finer, almost finer than sand.\n\nSimilarly, we can think of Machine Learning models as machines. When given data (coffee beans), they output predictions (coffee/magic). We must be careful, though, because different models require different inputs. Some models require all data to be \u201cvery granular\u201d (normalized between [0, 1]), while other models accept arbitrary numbers. Almost all models require the input to be in a numeric format, neither text nor images.\n\nUnder this way of viewing Machine Learning, the most important task for working with a model is reading and understanding its instruction manual: dust or coarse, normalized or not, linearly independent or not, and so on.\n\nBut before turning the grinder on, it is wise to check if the coffee beans come with stones or dirt mixed in the same bag because you don't want to include them in your cup. That's why you will have to spend time separating good beans from dirt. The same is true for data: not every attribute will be useful, and maybe most will not. Looking at the data and deciding what will be useful for the model and what will not is often referred to as data cleaning and is a crucial part of Machine Learning.\n\nOnce you have your coffee ground, you may want to make a special blend choosing different origins of the beans. The reasons for blending can be many. You may want to bring more complex flavors or reduce the price of the package while maintaining the quality. The same goes for data. Depending on your model, you may want to transform a variable to have a normal distribution or help the image classification by tweaking the color balance of the images. This process is called feature engineering and is arguably one of the hardest parts of Machine Learning. As shown above, getting predictions from a Machine Learning model takes only one line of code, while a data processing pipeline can be several hundred \u2014if not thousands\u2014 of lines long.\n\nThere is a myriad of resources out there on how to become a great barista data scientist. If we need to name a few, those would be:\n\nTheory\nStatquest is an amazing YouTube channel to learn about the main ideas behind both specific models and overall Machine Learning concepts.\nA great resource on the theory behind Machine Learning can be found in Stanford's CS299 course. If you plan on taking this course, Garrett Thomas\u2019s math summary might also be useful.\n\nConcepts & exercises\nDataCamp offers a Data Scientist career track with almost 100 hours of hands-on courses on data manipulation and classic Machine Learning algorithms.\n\n4. Get the model to learn what you want, not what it wants\nSay you wanted to describe the concept of chairs to a friend, so you would start with the chairs you have at home as examples, and you start saying something like \"a chair is an item with 4 legs that you can sit on\".\n\nAt first, you are happy because it describes all the chairs at your place, but also your bed, your sofa, and a stool. Then you try a more complex description of what a chair is to exclude these other pieces of furniture. So you add more variables to describe your chairs, including color, shapes, or materials; and you come up with something like \"has 4 legs, you can sit on it, has a back, it\u2019s made of wood, the legs are brown, the seat is green with this specific pattern\". Not a great solution either, is it? Now you can perfectly classify the chairs back at home, but you won't be able to describe your friends\u2019 chairs. You\u2019ve overfitted your definition to the chairs you know.\n\nML models face the same problem as the person classifying chairs above. Go too simple, and you end up with a bad definition, go too complex, and you might end up just memorizing the data you see. This problem is known in machine learning as the bias-variance trade-off.\n\nOur job as developers is to design our model and its training process so that memorization is not a viable option and the only option left is to learn the relevant patterns. There are several tools that we use for this, and if you\u2019ve been following the courses above, you should be familiar with most of them. The following are some you should definitely know about.\n\nBasic concepts\nRegularization\n\nTrain - validation - test splits\n\nCross-validation\n\nEarly stopping\n\n5. Measure results\nBy now, you should have managed to have your model training for a couple of epochs and are probably wondering if it is doing any good.\n\nTo answer that, we need to introduce a new concept: metrics. Let's dig into this topic with an example. Imagine that you are training a model to predict the selling price of a house. You have 500,000 houses. For each of them, you have the real price and the predicted price. How do you check if the predictions are correct?\n\nSince evaluating a model\u2019s performance by looking at its prediction for every single data point is not feasible, we need a way to summarize the performance across all samples to get a single number that we can use to evaluate the model. When using metrics, you have to consider that it comes with a compromise: you will start evaluating things on average.\n\nThere are many, many metrics out there. You can even have custom metrics for your needs because not all metrics are well-posed to solve different problems with the same techniques. For regression problems, we have metrics such as MAE, RMSE, and MAPE, while for classification problems, we can use accuracy, f1-score, AUC, confusion matrix, etc. You should definitely get acquainted with their uses, strengths, and weaknesses.\n\n\nCommon metrics for regression and clasification graphic.\nCommon metrics for regression problems\n\nCommon metrics for regression and clasification graphic.\nCommon metrics for classification problems\nVery specific problems might even require custom metrics for the model to learn that something is meaningful. For example, when estimating the price of a house, it might be worse to underestimate the price of a house and lose money in the transaction rather than posting a very high price and not selling it. For such a case, a custom metric might be desirable, one that places a larger penalty on the model when underestimating the target and a smaller one when overestimating.\n\nA final remark regarding this topic: metrics can take more than one meaning in a Machine Learning project. One of them is, as we just described, that the metric can be used to evaluate the model after it\u2019s trained. Another meaning for metric might be the business metric that\u2019s expected to improve thanks to your machine learning model. You should be mindful of these metrics going into any project.\n\nExtra readings\nYou can find a review of the most used metrics for Machine Learning in this article.\n\n6. Visually inspect the results\nAlright! We have collected data, cleaned it, trained a model, and checked that the error is small. Are we done? Ye...Almost!\n\nWe, humans, are very visual creatures. It is often quite challenging to know if a problem has been correctly solved simply by looking at a metric. Furthermore, as discussed in the previous section, metrics can be deceiving. They often hide flaws in the way that our model has learned (remember the bad student that memorized all the answers).\n\nA great way to solve all these issues at once is to create charts showing the model\u2019s performance under several circumstances.\n\nThere are several types of charts that make it easier to understand the outcome of a model:\n\nShow the predicted value as a function of some of the features. For example, house selling price as a function of the house's area.\nA bar plot comparing the evaluation metric for different groups within the data. For instance, how does the error in predicting house prices change across different neighborhoods?\nIf the data contains images, show them, or if it contains geographic locations, plot a map.\nIf the data contains temporal data, show how the error evolves over time!\n\nCommon metrics for regression and clasification graphic.\nPlots make it easier to share the model\u2019s performance with others and help debug a model that is not working well.\n\nConsider, for example, a model trying to predict who will win the lottery: the model might predict that no one will win and have very high accuracy when, in fact, it is not learning anything (not that it could, but anyway). We might only learn about this behavior when plotting the model\u2019s predicted positive results and finding that there are none!\n\nBack to programming, to produce these visualizations in Python, you\u2019ll need to get a handle on two very important libraries: Matplotlib and Seaborn.\n\nMatplotlib and plotting in Python are synonymous. The library allows users to be as specific as they desire. It is a must-have in your tool belt. Most of the other plotting libraries in Python are implemented on top of Matplotlib, meaning that you will write code using this library sooner or later.\n\nSeaborn is almost a wrapper around Matplotlib that allows users to create plots that look nicer, faster. Since it is built on top of Matplotlib, you can always resort to the former library for very specific customizations.\n\nA word of advice: Plotting in Python is (a bit) cumbersome and takes time to reach a point where you can produce beautiful plots. The best way is to get started now! We recommend the following resources to do so:\n\nCourses\nHere again, first and foremost you should go through Kaggle's Visualization course\nDataCamp has some great hands-on courses on visualization\n\nDocumentation\nWhen in doubt, you can always check the documentation for Matplotlib and Seaborn or their respective tutorials (Matplotlib, Seaborn).\n\nBeyond the essentials\nIf you want to go beyond the basic visualizations, there\u2019s a lot of research with regards to optimizing plot format to deliver effective messaging. Check out data-viz resources.\n\nTakeaways\nOoooff, that was a long post. So where should you go from here?\n\nFirst, we recommend going through all the linked materials. Those are battle-tested and will kickstart your Data Science/Machine Learning Engineer journey.\n\nOnce you feel confident, you should get your hands dirty. Although learning from textbooks is a great way to learn the theory, it is when you put those learnings into practice that you get to master these concepts. If you don't know where to start, give some of these projects a try and let us know how you did!", "questions": ["What is the difference between traditional programming and machine learning?", "Please give me a scenario where Machine Learning is helpful", "Why are metrics needed?", "What is the main takeaway of section \"6. Visually inspect the results\"?", "Which one is better, Matplotlib or Seaborn?", "What are the differences between them?", "In what case would I need to build a custom dataset?"], "answers": {"input_text": ["Traditional programming consists of building a set of rules to output the desired result, while Machine Learning learns the patterns that go from the input to the output based on observing a massive amount of data.", "Identifying cats in pictures", "We need a way to summarize the performance of the model across all samples in a single number.", "Plots and charts are a great way of showing the model's performance and debugging a model that is not working well.", "I can't answer that.", "Seaborn is a wrapper around Matplotlib that allows users to create plots that look nicer, faster", "If you are solving a novel or very niche problem"]}}, {"conversation_id": "6", "context_id": "1", "story": "Machine Learning 101: Build, Train, Test, Rinse & Repeat\n\nIn the previous episode of our Road to Data Science series, we went through all the general skills needed to become a proficient data scientist. In this post, we start to dig deeper into the specifics of Machine Learning. We start by posing the simple questions: What is Machine Learning? What kind of problems can Machine Learning solve?\n\nWhat is Machine Learning?\nA great way to start understanding the idea behind Machine Learning is to think about what\u2019s new in Machine Learning compared to traditional programming.\n\nSo, essentially in traditional programming, we build a machine that outputs a result based on an input and a set of rules. Our job as programmers is to construct the correct set of rules for our system to output the desired result. Thanks to the computer\u2019s ability to follow complex instructions incredibly fast, this approach works very well for a vast set of tasks, from calculating taxes to landing a rocket on the moon.\n\nSo, why do we need Machine Learning? Well, machines are great at solving problems that look difficult or tedious to us. For example, imagine computing all the prime numbers up to 10,000,000. On the other hand, another set of problems, including some that seem trivial to us, are very hard to decompose into a set of deterministic rules.\n\nTake, for example, the problem of observing a picture and determining if there\u2019s a cat in it. The problem is trivial for any human, but not for a computer, especially considering all the variations between breeds and all the variations in angle, lighting, positioning, and occlusion. What can we do?\n\n\nCat photos showing occlusion, diversity, deformation, and lighting variations through the different photographs.\nMachine Learning to the rescue! Instead of building an explicit set of rules, Machine Learning systems learn the patterns that go from the input to the output based on observing a massive amount of data. This approach can solve many tasks, from identifying cats in pictures to modeling the consumer purchase decision.\n\nThe details of how our Machine Learning system learns these patterns depend on the specific algorithm we are using. These algorithms come in many shapes and flavors. There are whole families of them, but we can generally group them based on a couple of criteria.\n\nIf we look at how information is fed to the model, we can broadly classify Machine Learning techniques as supervised, unsupervised, or reinforcement. When labels are fed to the algorithm during training, the method is supervised. When no label is fed, the method is unsupervised. And when the system is allowed to make decisions and gets feedback on them, the method is Reinforcement Learning.\n\nOn the other hand, if we look at the kind of task they perform, Machine Learning techniques can be classified into several groups such as regression, clustering, single or multi-class classification, and so on.\n\nExtra readings\nCheck this post to learn more about what kind of tasks can be solved using Machine Learning.\nMicrosoft has a very comprehensive and extensive set of courses for beginners, one called ML for beginners and the other one called Data Science for beginners, be sure to check them out!\n\nA Shopping List to Machine Learning\nIf you had asked us 15 years ago what skills you needed to apply Machine Learning, our list would have included an endless list of mathematical, computer science, and statistical knowledge that could have discouraged even the most audacious.\n\nToday, thanks to the cumulative contributions of many in the Machine Learning community, the field has become much more approachable. Anyone who wants to do Machine Learning today stands on the shoulders of giants.\n\nIn fact, you might have heard that doing Machine Learning these days is as simple as writing a few lines of code. Well, the rumors are true, but that's only a fraction. Let's see that for ourselves.\n\nSay you were walking the countryside when suddenly you felt the unrelenting urge to differentiate the subspecies of a particular type of plant: the iris.\n\n\nIris flowers images.\nIris flower subspecies a to the left and subspecies b to the right. Images extrated from Unsplash.\nYou could take a picture of every iris flower you encounter and send it to your taxonomist friend or try to learn the pattern that goes from the dimensions of the flowers to their species. In Python, this looks like this:\n\nBoom! Problem solved with 96.7% accuracy in unseen data! Next time you see an iris, you can measure its flower and ask your model to predict the plant\u2019s species. Now, you might be looking at this unfamiliar code and wonder: Is it possible to learn this power? Or, if you\u2019ve been following along our blogposts and are already familiar with Python, you might wonder: What if I want to do something slightly more complex? What tools do I need? Where do I find the resources? Don\u2019t worry; we\u2019ve got you covered. We\u2019ll walk you through the Machine Learning pipeline step by step, referring you in each step to all the relevant material you need.\n\n1. Get Raw Data\nUnless you\u2019ve been living under a rock during the last decade, \u201cData is the New Oil\u201d is a headline you have seen several times. No wonder companies are so protective and private about their data - a lot of potential is there waiting to be exploited!\n\nDoes this mean that you can't access any interesting datasets? Well, if you are solving a novel or a very niche problem or maybe need a specialist like a doctor to tag medical images, you will likely need custom data and a custom process. But for simpler cases or educational purposes, the community has put an enormous effort into having publicly available and preprocessed data. What a time to be alive!\n\nSuppose you want to take a look at some curated datasets. To name just a few, you can check the COCO dataset for image segmentation, the ImageNet dataset for image classification, and Kaggle's Datasets for an extra wide variety of problems, from used car prices to a brain stroke prediction dataset.\n\nOnce you get your hands on your precious data, what's next? It depends on what kind of data we\u2019re talking about and its degree of \u201crawness.\u201d Are you dealing with images? Text? Is there a database (lucky you!)? Do you have a CSV (or JSON, or, God forbid!, XML)?\n\nIn any case, you need to be able to handle all these kinds of data and clean, tag, and transform it to your needs. You need to cook that raw data that you just got. For that purpose, let's jump to the next section called \u201cCook Transform Data.\u201d\n\nExtra readings\nLuminousmens article on data file formats and their comparative advantages.\n\n2. Transform Data\nOnce we have the data, we need to move it, change it, stretch it, and bend it to our will. Unless the dataset is very small, this requires some sort of programming.\n\nThere are many kinds of data, each requiring a different set of skills. We could mention images, videos, and text as popular data sources. However, for this blog, we will focus on the data type that (probably) rules them all: tabular data.\n\nA tabular dataset is basically a spreadsheet where each row is a new sample, and each column represents some information about those samples.\n\nManually changing things in spreadsheets is easy, albeit time-consuming, and usually unfeasible for large datasets. The equivalent way to manipulate tabular data in Python is with Pandas, the most widely adopted library for table manipulation in Python and one of the most popular libraries in the language.\n\nIf you are serious about doing Machine Learning in Python, you must learn Pandas. Luckily, there are many great resources available to help you learn.\n\nEssentials\nAs a starting resource, this Kaggle tutorial is a must.\nFurther, we recommend Towards Data Science\u2019s blog post on writing simpler and more consistent Pandas code.\nFor a hands-on guide from Pandas basics to its most important features, you can check our blog post on the subject.\n\n3. Pick a Machine Learning model\nThere are several ways to think about Machine Learning models. In the vignette below, XKCD summarizes quite well one of the most common ways people see Machine Learning models.\n\n\nVignet summaring one of the most common ways people see Machine Learning models.\nSource: XKCD.\nBut, we prefer a different way to think about them.\n\nWe think of a Machine Learning model as a coffee machine where you put coffee beans and water in and get magic on the other side. Even though making coffee is fairly easy, each coffee machine has its own way of working. If you are brewing coffee with a French press, you need a coarser grind, whereas if you use pour-over brewers, the grind needs to be much finer, almost finer than sand.\n\nSimilarly, we can think of Machine Learning models as machines. When given data (coffee beans), they output predictions (coffee/magic). We must be careful, though, because different models require different inputs. Some models require all data to be \u201cvery granular\u201d (normalized between [0, 1]), while other models accept arbitrary numbers. Almost all models require the input to be in a numeric format, neither text nor images.\n\nUnder this way of viewing Machine Learning, the most important task for working with a model is reading and understanding its instruction manual: dust or coarse, normalized or not, linearly independent or not, and so on.\n\nBut before turning the grinder on, it is wise to check if the coffee beans come with stones or dirt mixed in the same bag because you don't want to include them in your cup. That's why you will have to spend time separating good beans from dirt. The same is true for data: not every attribute will be useful, and maybe most will not. Looking at the data and deciding what will be useful for the model and what will not is often referred to as data cleaning and is a crucial part of Machine Learning.\n\nOnce you have your coffee ground, you may want to make a special blend choosing different origins of the beans. The reasons for blending can be many. You may want to bring more complex flavors or reduce the price of the package while maintaining the quality. The same goes for data. Depending on your model, you may want to transform a variable to have a normal distribution or help the image classification by tweaking the color balance of the images. This process is called feature engineering and is arguably one of the hardest parts of Machine Learning. As shown above, getting predictions from a Machine Learning model takes only one line of code, while a data processing pipeline can be several hundred \u2014if not thousands\u2014 of lines long.\n\nThere is a myriad of resources out there on how to become a great barista data scientist. If we need to name a few, those would be:\n\nTheory\nStatquest is an amazing YouTube channel to learn about the main ideas behind both specific models and overall Machine Learning concepts.\nA great resource on the theory behind Machine Learning can be found in Stanford's CS299 course. If you plan on taking this course, Garrett Thomas\u2019s math summary might also be useful.\n\nConcepts & exercises\nDataCamp offers a Data Scientist career track with almost 100 hours of hands-on courses on data manipulation and classic Machine Learning algorithms.\n\n4. Get the model to learn what you want, not what it wants\nSay you wanted to describe the concept of chairs to a friend, so you would start with the chairs you have at home as examples, and you start saying something like \"a chair is an item with 4 legs that you can sit on\".\n\nAt first, you are happy because it describes all the chairs at your place, but also your bed, your sofa, and a stool. Then you try a more complex description of what a chair is to exclude these other pieces of furniture. So you add more variables to describe your chairs, including color, shapes, or materials; and you come up with something like \"has 4 legs, you can sit on it, has a back, it\u2019s made of wood, the legs are brown, the seat is green with this specific pattern\". Not a great solution either, is it? Now you can perfectly classify the chairs back at home, but you won't be able to describe your friends\u2019 chairs. You\u2019ve overfitted your definition to the chairs you know.\n\nML models face the same problem as the person classifying chairs above. Go too simple, and you end up with a bad definition, go too complex, and you might end up just memorizing the data you see. This problem is known in machine learning as the bias-variance trade-off.\n\nOur job as developers is to design our model and its training process so that memorization is not a viable option and the only option left is to learn the relevant patterns. There are several tools that we use for this, and if you\u2019ve been following the courses above, you should be familiar with most of them. The following are some you should definitely know about.\n\nBasic concepts\nRegularization\n\nTrain - validation - test splits\n\nCross-validation\n\nEarly stopping\n\n5. Measure results\nBy now, you should have managed to have your model training for a couple of epochs and are probably wondering if it is doing any good.\n\nTo answer that, we need to introduce a new concept: metrics. Let's dig into this topic with an example. Imagine that you are training a model to predict the selling price of a house. You have 500,000 houses. For each of them, you have the real price and the predicted price. How do you check if the predictions are correct?\n\nSince evaluating a model\u2019s performance by looking at its prediction for every single data point is not feasible, we need a way to summarize the performance across all samples to get a single number that we can use to evaluate the model. When using metrics, you have to consider that it comes with a compromise: you will start evaluating things on average.\n\nThere are many, many metrics out there. You can even have custom metrics for your needs because not all metrics are well-posed to solve different problems with the same techniques. For regression problems, we have metrics such as MAE, RMSE, and MAPE, while for classification problems, we can use accuracy, f1-score, AUC, confusion matrix, etc. You should definitely get acquainted with their uses, strengths, and weaknesses.\n\n\nCommon metrics for regression and clasification graphic.\nCommon metrics for regression problems\n\nCommon metrics for regression and clasification graphic.\nCommon metrics for classification problems\nVery specific problems might even require custom metrics for the model to learn that something is meaningful. For example, when estimating the price of a house, it might be worse to underestimate the price of a house and lose money in the transaction rather than posting a very high price and not selling it. For such a case, a custom metric might be desirable, one that places a larger penalty on the model when underestimating the target and a smaller one when overestimating.\n\nA final remark regarding this topic: metrics can take more than one meaning in a Machine Learning project. One of them is, as we just described, that the metric can be used to evaluate the model after it\u2019s trained. Another meaning for metric might be the business metric that\u2019s expected to improve thanks to your machine learning model. You should be mindful of these metrics going into any project.\n\nExtra readings\nYou can find a review of the most used metrics for Machine Learning in this article.\n\n6. Visually inspect the results\nAlright! We have collected data, cleaned it, trained a model, and checked that the error is small. Are we done? Ye...Almost!\n\nWe, humans, are very visual creatures. It is often quite challenging to know if a problem has been correctly solved simply by looking at a metric. Furthermore, as discussed in the previous section, metrics can be deceiving. They often hide flaws in the way that our model has learned (remember the bad student that memorized all the answers).\n\nA great way to solve all these issues at once is to create charts showing the model\u2019s performance under several circumstances.\n\nThere are several types of charts that make it easier to understand the outcome of a model:\n\nShow the predicted value as a function of some of the features. For example, house selling price as a function of the house's area.\nA bar plot comparing the evaluation metric for different groups within the data. For instance, how does the error in predicting house prices change across different neighborhoods?\nIf the data contains images, show them, or if it contains geographic locations, plot a map.\nIf the data contains temporal data, show how the error evolves over time!\n\nCommon metrics for regression and clasification graphic.\nPlots make it easier to share the model\u2019s performance with others and help debug a model that is not working well.\n\nConsider, for example, a model trying to predict who will win the lottery: the model might predict that no one will win and have very high accuracy when, in fact, it is not learning anything (not that it could, but anyway). We might only learn about this behavior when plotting the model\u2019s predicted positive results and finding that there are none!\n\nBack to programming, to produce these visualizations in Python, you\u2019ll need to get a handle on two very important libraries: Matplotlib and Seaborn.\n\nMatplotlib and plotting in Python are synonymous. The library allows users to be as specific as they desire. It is a must-have in your tool belt. Most of the other plotting libraries in Python are implemented on top of Matplotlib, meaning that you will write code using this library sooner or later.\n\nSeaborn is almost a wrapper around Matplotlib that allows users to create plots that look nicer, faster. Since it is built on top of Matplotlib, you can always resort to the former library for very specific customizations.\n\nA word of advice: Plotting in Python is (a bit) cumbersome and takes time to reach a point where you can produce beautiful plots. The best way is to get started now! We recommend the following resources to do so:\n\nCourses\nHere again, first and foremost you should go through Kaggle's Visualization course\nDataCamp has some great hands-on courses on visualization\n\nDocumentation\nWhen in doubt, you can always check the documentation for Matplotlib and Seaborn or their respective tutorials (Matplotlib, Seaborn).\n\nBeyond the essentials\nIf you want to go beyond the basic visualizations, there\u2019s a lot of research with regards to optimizing plot format to deliver effective messaging. Check out data-viz resources.\n\nTakeaways\nOoooff, that was a long post. So where should you go from here?\n\nFirst, we recommend going through all the linked materials. Those are battle-tested and will kickstart your Data Science/Machine Learning Engineer journey.\n\nOnce you feel confident, you should get your hands dirty. Although learning from textbooks is a great way to learn the theory, it is when you put those learnings into practice that you get to master these concepts. If you don't know where to start, give some of these projects a try and let us know how you did!", "questions": ["In how many categories can we classify a Machine Learning algorithm depending on how inputs are fed to it?", "What are those categories?", "In which of those do I need to feed labels to the algorithm during training?", "What is Pandas used for?", "In which programming language is it available?", "What input formats are accepted by most models?", "What metrics are used for regression problems?", "How is MAE calculated?", "What is it used for?", "What are the most important recommendations of the last section?"], "answers": {"input_text": ["Three", "Supervised, unsupervised, or reinforcement.", "Supervised", "Tabular data manipulation", "Python", "Numeric format", "MAE, RMSE and MAPE", "I can't answer that.", "Create plots that look nicer, faster", "Going through all the linked materials and putting those learnings into practice"]}}, {"conversation_id": "7", "context_id": "1", "story": "Machine Learning 101: Build, Train, Test, Rinse & Repeat\n\nIn the previous episode of our Road to Data Science series, we went through all the general skills needed to become a proficient data scientist. In this post, we start to dig deeper into the specifics of Machine Learning. We start by posing the simple questions: What is Machine Learning? What kind of problems can Machine Learning solve?\n\nWhat is Machine Learning?\nA great way to start understanding the idea behind Machine Learning is to think about what\u2019s new in Machine Learning compared to traditional programming.\n\nSo, essentially in traditional programming, we build a machine that outputs a result based on an input and a set of rules. Our job as programmers is to construct the correct set of rules for our system to output the desired result. Thanks to the computer\u2019s ability to follow complex instructions incredibly fast, this approach works very well for a vast set of tasks, from calculating taxes to landing a rocket on the moon.\n\nSo, why do we need Machine Learning? Well, machines are great at solving problems that look difficult or tedious to us. For example, imagine computing all the prime numbers up to 10,000,000. On the other hand, another set of problems, including some that seem trivial to us, are very hard to decompose into a set of deterministic rules.\n\nTake, for example, the problem of observing a picture and determining if there\u2019s a cat in it. The problem is trivial for any human, but not for a computer, especially considering all the variations between breeds and all the variations in angle, lighting, positioning, and occlusion. What can we do?\n\n\nCat photos showing occlusion, diversity, deformation, and lighting variations through the different photographs.\nMachine Learning to the rescue! Instead of building an explicit set of rules, Machine Learning systems learn the patterns that go from the input to the output based on observing a massive amount of data. This approach can solve many tasks, from identifying cats in pictures to modeling the consumer purchase decision.\n\nThe details of how our Machine Learning system learns these patterns depend on the specific algorithm we are using. These algorithms come in many shapes and flavors. There are whole families of them, but we can generally group them based on a couple of criteria.\n\nIf we look at how information is fed to the model, we can broadly classify Machine Learning techniques as supervised, unsupervised, or reinforcement. When labels are fed to the algorithm during training, the method is supervised. When no label is fed, the method is unsupervised. And when the system is allowed to make decisions and gets feedback on them, the method is Reinforcement Learning.\n\nOn the other hand, if we look at the kind of task they perform, Machine Learning techniques can be classified into several groups such as regression, clustering, single or multi-class classification, and so on.\n\nExtra readings\nCheck this post to learn more about what kind of tasks can be solved using Machine Learning.\nMicrosoft has a very comprehensive and extensive set of courses for beginners, one called ML for beginners and the other one called Data Science for beginners, be sure to check them out!\n\nA Shopping List to Machine Learning\nIf you had asked us 15 years ago what skills you needed to apply Machine Learning, our list would have included an endless list of mathematical, computer science, and statistical knowledge that could have discouraged even the most audacious.\n\nToday, thanks to the cumulative contributions of many in the Machine Learning community, the field has become much more approachable. Anyone who wants to do Machine Learning today stands on the shoulders of giants.\n\nIn fact, you might have heard that doing Machine Learning these days is as simple as writing a few lines of code. Well, the rumors are true, but that's only a fraction. Let's see that for ourselves.\n\nSay you were walking the countryside when suddenly you felt the unrelenting urge to differentiate the subspecies of a particular type of plant: the iris.\n\n\nIris flowers images.\nIris flower subspecies a to the left and subspecies b to the right. Images extrated from Unsplash.\nYou could take a picture of every iris flower you encounter and send it to your taxonomist friend or try to learn the pattern that goes from the dimensions of the flowers to their species. In Python, this looks like this:\n\nBoom! Problem solved with 96.7% accuracy in unseen data! Next time you see an iris, you can measure its flower and ask your model to predict the plant\u2019s species. Now, you might be looking at this unfamiliar code and wonder: Is it possible to learn this power? Or, if you\u2019ve been following along our blogposts and are already familiar with Python, you might wonder: What if I want to do something slightly more complex? What tools do I need? Where do I find the resources? Don\u2019t worry; we\u2019ve got you covered. We\u2019ll walk you through the Machine Learning pipeline step by step, referring you in each step to all the relevant material you need.\n\n1. Get Raw Data\nUnless you\u2019ve been living under a rock during the last decade, \u201cData is the New Oil\u201d is a headline you have seen several times. No wonder companies are so protective and private about their data - a lot of potential is there waiting to be exploited!\n\nDoes this mean that you can't access any interesting datasets? Well, if you are solving a novel or a very niche problem or maybe need a specialist like a doctor to tag medical images, you will likely need custom data and a custom process. But for simpler cases or educational purposes, the community has put an enormous effort into having publicly available and preprocessed data. What a time to be alive!\n\nSuppose you want to take a look at some curated datasets. To name just a few, you can check the COCO dataset for image segmentation, the ImageNet dataset for image classification, and Kaggle's Datasets for an extra wide variety of problems, from used car prices to a brain stroke prediction dataset.\n\nOnce you get your hands on your precious data, what's next? It depends on what kind of data we\u2019re talking about and its degree of \u201crawness.\u201d Are you dealing with images? Text? Is there a database (lucky you!)? Do you have a CSV (or JSON, or, God forbid!, XML)?\n\nIn any case, you need to be able to handle all these kinds of data and clean, tag, and transform it to your needs. You need to cook that raw data that you just got. For that purpose, let's jump to the next section called \u201cCook Transform Data.\u201d\n\nExtra readings\nLuminousmens article on data file formats and their comparative advantages.\n\n2. Transform Data\nOnce we have the data, we need to move it, change it, stretch it, and bend it to our will. Unless the dataset is very small, this requires some sort of programming.\n\nThere are many kinds of data, each requiring a different set of skills. We could mention images, videos, and text as popular data sources. However, for this blog, we will focus on the data type that (probably) rules them all: tabular data.\n\nA tabular dataset is basically a spreadsheet where each row is a new sample, and each column represents some information about those samples.\n\nManually changing things in spreadsheets is easy, albeit time-consuming, and usually unfeasible for large datasets. The equivalent way to manipulate tabular data in Python is with Pandas, the most widely adopted library for table manipulation in Python and one of the most popular libraries in the language.\n\nIf you are serious about doing Machine Learning in Python, you must learn Pandas. Luckily, there are many great resources available to help you learn.\n\nEssentials\nAs a starting resource, this Kaggle tutorial is a must.\nFurther, we recommend Towards Data Science\u2019s blog post on writing simpler and more consistent Pandas code.\nFor a hands-on guide from Pandas basics to its most important features, you can check our blog post on the subject.\n\n3. Pick a Machine Learning model\nThere are several ways to think about Machine Learning models. In the vignette below, XKCD summarizes quite well one of the most common ways people see Machine Learning models.\n\n\nVignet summaring one of the most common ways people see Machine Learning models.\nSource: XKCD.\nBut, we prefer a different way to think about them.\n\nWe think of a Machine Learning model as a coffee machine where you put coffee beans and water in and get magic on the other side. Even though making coffee is fairly easy, each coffee machine has its own way of working. If you are brewing coffee with a French press, you need a coarser grind, whereas if you use pour-over brewers, the grind needs to be much finer, almost finer than sand.\n\nSimilarly, we can think of Machine Learning models as machines. When given data (coffee beans), they output predictions (coffee/magic). We must be careful, though, because different models require different inputs. Some models require all data to be \u201cvery granular\u201d (normalized between [0, 1]), while other models accept arbitrary numbers. Almost all models require the input to be in a numeric format, neither text nor images.\n\nUnder this way of viewing Machine Learning, the most important task for working with a model is reading and understanding its instruction manual: dust or coarse, normalized or not, linearly independent or not, and so on.\n\nBut before turning the grinder on, it is wise to check if the coffee beans come with stones or dirt mixed in the same bag because you don't want to include them in your cup. That's why you will have to spend time separating good beans from dirt. The same is true for data: not every attribute will be useful, and maybe most will not. Looking at the data and deciding what will be useful for the model and what will not is often referred to as data cleaning and is a crucial part of Machine Learning.\n\nOnce you have your coffee ground, you may want to make a special blend choosing different origins of the beans. The reasons for blending can be many. You may want to bring more complex flavors or reduce the price of the package while maintaining the quality. The same goes for data. Depending on your model, you may want to transform a variable to have a normal distribution or help the image classification by tweaking the color balance of the images. This process is called feature engineering and is arguably one of the hardest parts of Machine Learning. As shown above, getting predictions from a Machine Learning model takes only one line of code, while a data processing pipeline can be several hundred \u2014if not thousands\u2014 of lines long.\n\nThere is a myriad of resources out there on how to become a great barista data scientist. If we need to name a few, those would be:\n\nTheory\nStatquest is an amazing YouTube channel to learn about the main ideas behind both specific models and overall Machine Learning concepts.\nA great resource on the theory behind Machine Learning can be found in Stanford's CS299 course. If you plan on taking this course, Garrett Thomas\u2019s math summary might also be useful.\n\nConcepts & exercises\nDataCamp offers a Data Scientist career track with almost 100 hours of hands-on courses on data manipulation and classic Machine Learning algorithms.\n\n4. Get the model to learn what you want, not what it wants\nSay you wanted to describe the concept of chairs to a friend, so you would start with the chairs you have at home as examples, and you start saying something like \"a chair is an item with 4 legs that you can sit on\".\n\nAt first, you are happy because it describes all the chairs at your place, but also your bed, your sofa, and a stool. Then you try a more complex description of what a chair is to exclude these other pieces of furniture. So you add more variables to describe your chairs, including color, shapes, or materials; and you come up with something like \"has 4 legs, you can sit on it, has a back, it\u2019s made of wood, the legs are brown, the seat is green with this specific pattern\". Not a great solution either, is it? Now you can perfectly classify the chairs back at home, but you won't be able to describe your friends\u2019 chairs. You\u2019ve overfitted your definition to the chairs you know.\n\nML models face the same problem as the person classifying chairs above. Go too simple, and you end up with a bad definition, go too complex, and you might end up just memorizing the data you see. This problem is known in machine learning as the bias-variance trade-off.\n\nOur job as developers is to design our model and its training process so that memorization is not a viable option and the only option left is to learn the relevant patterns. There are several tools that we use for this, and if you\u2019ve been following the courses above, you should be familiar with most of them. The following are some you should definitely know about.\n\nBasic concepts\nRegularization\n\nTrain - validation - test splits\n\nCross-validation\n\nEarly stopping\n\n5. Measure results\nBy now, you should have managed to have your model training for a couple of epochs and are probably wondering if it is doing any good.\n\nTo answer that, we need to introduce a new concept: metrics. Let's dig into this topic with an example. Imagine that you are training a model to predict the selling price of a house. You have 500,000 houses. For each of them, you have the real price and the predicted price. How do you check if the predictions are correct?\n\nSince evaluating a model\u2019s performance by looking at its prediction for every single data point is not feasible, we need a way to summarize the performance across all samples to get a single number that we can use to evaluate the model. When using metrics, you have to consider that it comes with a compromise: you will start evaluating things on average.\n\nThere are many, many metrics out there. You can even have custom metrics for your needs because not all metrics are well-posed to solve different problems with the same techniques. For regression problems, we have metrics such as MAE, RMSE, and MAPE, while for classification problems, we can use accuracy, f1-score, AUC, confusion matrix, etc. You should definitely get acquainted with their uses, strengths, and weaknesses.\n\n\nCommon metrics for regression and clasification graphic.\nCommon metrics for regression problems\n\nCommon metrics for regression and clasification graphic.\nCommon metrics for classification problems\nVery specific problems might even require custom metrics for the model to learn that something is meaningful. For example, when estimating the price of a house, it might be worse to underestimate the price of a house and lose money in the transaction rather than posting a very high price and not selling it. For such a case, a custom metric might be desirable, one that places a larger penalty on the model when underestimating the target and a smaller one when overestimating.\n\nA final remark regarding this topic: metrics can take more than one meaning in a Machine Learning project. One of them is, as we just described, that the metric can be used to evaluate the model after it\u2019s trained. Another meaning for metric might be the business metric that\u2019s expected to improve thanks to your machine learning model. You should be mindful of these metrics going into any project.\n\nExtra readings\nYou can find a review of the most used metrics for Machine Learning in this article.\n\n6. Visually inspect the results\nAlright! We have collected data, cleaned it, trained a model, and checked that the error is small. Are we done? Ye...Almost!\n\nWe, humans, are very visual creatures. It is often quite challenging to know if a problem has been correctly solved simply by looking at a metric. Furthermore, as discussed in the previous section, metrics can be deceiving. They often hide flaws in the way that our model has learned (remember the bad student that memorized all the answers).\n\nA great way to solve all these issues at once is to create charts showing the model\u2019s performance under several circumstances.\n\nThere are several types of charts that make it easier to understand the outcome of a model:\n\nShow the predicted value as a function of some of the features. For example, house selling price as a function of the house's area.\nA bar plot comparing the evaluation metric for different groups within the data. For instance, how does the error in predicting house prices change across different neighborhoods?\nIf the data contains images, show them, or if it contains geographic locations, plot a map.\nIf the data contains temporal data, show how the error evolves over time!\n\nCommon metrics for regression and clasification graphic.\nPlots make it easier to share the model\u2019s performance with others and help debug a model that is not working well.\n\nConsider, for example, a model trying to predict who will win the lottery: the model might predict that no one will win and have very high accuracy when, in fact, it is not learning anything (not that it could, but anyway). We might only learn about this behavior when plotting the model\u2019s predicted positive results and finding that there are none!\n\nBack to programming, to produce these visualizations in Python, you\u2019ll need to get a handle on two very important libraries: Matplotlib and Seaborn.\n\nMatplotlib and plotting in Python are synonymous. The library allows users to be as specific as they desire. It is a must-have in your tool belt. Most of the other plotting libraries in Python are implemented on top of Matplotlib, meaning that you will write code using this library sooner or later.\n\nSeaborn is almost a wrapper around Matplotlib that allows users to create plots that look nicer, faster. Since it is built on top of Matplotlib, you can always resort to the former library for very specific customizations.\n\nA word of advice: Plotting in Python is (a bit) cumbersome and takes time to reach a point where you can produce beautiful plots. The best way is to get started now! We recommend the following resources to do so:\n\nCourses\nHere again, first and foremost you should go through Kaggle's Visualization course\nDataCamp has some great hands-on courses on visualization\n\nDocumentation\nWhen in doubt, you can always check the documentation for Matplotlib and Seaborn or their respective tutorials (Matplotlib, Seaborn).\n\nBeyond the essentials\nIf you want to go beyond the basic visualizations, there\u2019s a lot of research with regards to optimizing plot format to deliver effective messaging. Check out data-viz resources.\n\nTakeaways\nOoooff, that was a long post. So where should you go from here?\n\nFirst, we recommend going through all the linked materials. Those are battle-tested and will kickstart your Data Science/Machine Learning Engineer journey.\n\nOnce you feel confident, you should get your hands dirty. Although learning from textbooks is a great way to learn the theory, it is when you put those learnings into practice that you get to master these concepts. If you don't know where to start, give some of these projects a try and let us know how you did!", "questions": ["What is a Machine Learning system?", "What datasets are there for image related tasks?", "What basic libraries should I learn to work on Machine Learning?", "Mention one learning from the section \"2. Transform Data\"", "Why?"], "answers": {"input_text": ["Machine Learning systems learn the patterns that go from an input to an output based on observing a massive amount of data. ", "COCO and ImageNet", "Pandas", "If you are serious about doing Machine Learning in Python, you must learn Pandas", "Pandas is the most widely adopted library for table manipulation"]}}, {"conversation_id": "10", "context_id": "2", "story": "Price optimization for e-commerce: a case study\n\nWe have previously discussed how a data-driven price optimization is no longer an option for retailers, but a question of how and when to do it. With a world that's moving towards changing prices more and more dynamically, static pricing strategies can't keep up, and data-driven approaches have arrived to stay. In this post, we'll be focusing on how to perform data-driven price optimization, using a case study to showcase what can be done using sales data to predict demand and optimize prices.\n\nHaving read our previous post, you may now ask yourself:\n\nIs price optimization with Machine Learning worth the investment in my case?\nHow do I know if the quality of my sales data is good enough?\nOr putting it simply: price optimization, here I go! But am I ready?\nTo address these and other questions, we will demonstrate a practical example using the tools we've developed in the past years while excelling in our client's pricing strategies. To do so, we will take the publicly available dataset from the Brazilian marketplace Olist as our use case foundation.\n\nThis will be a glimpse of what you can achieve by sharing your sales data with us. We could perform an opportunity summary report in line with this use case, and uncover your own opportunities regarding price optimization. So, fasten your seatbelt and get ready for the take-off.\n\nAnalyzing an E-Commerce's Sales Data\nWhile diving into the practical example and explain how we solve these kinds of problems, let's refresh some price optimization concepts and introduce the work pipeline we will follow. There are different ways to tackle a price optimization problem and different strategies to perform a Machine Learning approach for this task.\n\nOur price optimization pipeline will be based on six phases or steps:\n\nData Collection and Integration: build a data asset that integrates all the relevant information for demand forecasting.\nData analysis: generate explanatory data analysis reports and categorize items according to demand patterns.\nAnomaly detection report: generate an anomaly detection report to identify inconsistent fields, detect out-of-stock issues, observe sales distribution, and detect outliers.\nDemand forecasting: estimate the demand curves for each one of the items, i.e., understand the elasticity of the demand.\nExperimentation: elaborate the strategy on how to perform exploration and exploitation.\nOptimization: optimize the price for the items, subject to a list of specific business constraints.\nIn the remainder of the post we will describe each one of these steps using the Olist dataset as an explanatory example.\n\n\nIllustrations flow with pricing elements like data and charts\nPrice optimization pipeline\n1. Data Collection and Integration\nThe first step is to build our data asset, taking different sources of information that will be used to train our demand forecasting models.\n\nThe cornerstone to begin working on any price optimization project is the sales dataset, i.e., to have the sales information of your business. Whether you are an e-commerce or a brick-and-mortar store, sales information is the first asset we have to focus on, and where our analysis will begin.\n\nWhat do we mean by the sales dataset? To put it simply, we need access to the sales information of the business segment that the company is willing to optimize the pricing strategy for. If you already know beforehand which segment this is, great! You are one step ahead. But if you don't, no worries, we can help you decide which is the best segment to start based on your sales information, as we will see in our example.\n\nUsually, companies choose to store their sales information in one of the following structures:\n\nBy sales transactions: each transaction detail with its item identifier, the timestamp, the number of items sold, and its sell price. Other data fields that can be useful to complement this information are the payment method (cash, credit card, etc.) and the shipping information.\nBy daily aggregation: some companies store their sales information this way because they need to have an accurate cash flow.\nComplementing your sales data\n\nNaturally, sales data won't be the only source of information we will consider when addressing the price optimization task. As demand is affected by several factors (such as the price, competitor's price, season, holidays, and macroeconomic variables), this is only one of the many factors to consider.\n\nAt this point, the knowledge of a business expert will play a big part to jointly work on a data structure that works for each specific problem.\n\nOnce again, don't worry if you are not storing all this data currently. We can quickly provide you with several of these public datasets.\n\nHere we list other sources of information that are usually considered when collecting data:\n\nItems information: data such as category, sub-category, description, number of photos, photos, and general characteristics can be used to profile each item and relate it with others.\nMacroeconomics: different indicators such as GDP, unemployment rate, consumer confidence, and currency exchange. It is crucial to keep in mind that these indicators are all gathered during different periods. Some of them are calculated quarterly, some others monthly, and some others daily.\nStores or seller information: location, size, etc.\nHolidays and events: certain holidays and events can have a direct impact on the number of sales of different items, e.g., Valentine's Day, Christmas, Mother's Day, etc.\nItem's reviews: reviews performed by customers can also be included in our forecasting model.\nCompetitor(s)' price: the presence of one or several strong competitors can have a direct impact on the demand of our items, so having information about how they are pricing can potentially be used as insights for our models.\nCustomer traffic: number of visitors (e-commerce and brick-and-mortar), average time on page, and number of clicks (e-commerce only).\nOther business-specific information.\n2. Data Analysis\nNow that the basics are clear let's dive into the Olist e-commerce data. We always start working on an exploratory data analysis of the sales data to gain insights about the different available items, their trends in the sales, and the pricing history as well.\n\nFor the Olist example, the sales data is stored as transactions, divided into three different datasets:\n\norder-items: which includes the items and sell price for each order.\norders: contains all the information related to the orders, such as customer, seller, order status, order timestamp, and the estimated delivery date.\norder-payments: it includes the payment information for each order.\nAfter processing these three datasets, we can build a standard sales dataset, as shown in the following figure, which aggregates item-seller sales per day:\n\n\nOlist Data set\nStandard sales dataset.\nOnce we define the sales dataset, we should question ourselves how many items should we choose to begin the optimization? Or even further: are all items ready to be optimized?\n\nTo begin with, we will set a threshold on the number of historical sales observations. To estimate the future demand for a given item, we need historical sales data. Therefore, we will start considering those articles that have, at minimum, ten weeks with at least 1 unit sold.\n\nUsing this threshold, in our Olist use case, we will have a universe of 1,197 items. On the other hand, we could think of a price automation solution for those unique items that are sold only once or very few times.\n\n\n1.197 articles have, at minimum, ten weeks with at least 1 unit sold. \nMoreover, this subset of items can be further categorized. In particular, we will classify items based on demand history and the number of different prices that have historically been used. The shape of the demand history will have a direct impact on the modeling efforts, and so will the number of price changes on the experimentation strategy. Therefore, we will divide our approach into two different stages to select the first set of items:\n\nStudy the variability of the demand history concerning timing and magnitude.\nStudy the number of price changes and timing.\nVariability of the demand history\n\nAt this first stage we will study the variability of the demand history, as we are looking to identify those items that have a simpler selling pattern than others. This approach will not only impact the final accuracy of the sales forecasting but also the feature engineering and modeling efforts.\n\nTo perform this categorization, we will use the one proposed by Syntetos et al. (2005), in which the demands are divided into four different categories:\n\nIntermittent, for items with infrequent demand occurrences but regular in magnitude\nLumpy, for items whose demand is intermittent and highly variable in magnitude\nSmooth, for items whose demand is both frequent and regular in magnitude\nErratic, for items whose demand is frequent but highly variable in magnitude\nIn the following figure, we show how each of these demands would look like in a typical scenario, plotting the demand magnitude to the time.\n\n\nDemand behavior for each scenario.\nSmooth and intermittent demands are easier to model than lumpy and erratic. In particular, lumpy items are hard to model, because its demand shape is both erratic and irregular.\n\nInstead of looking at the shape of each item's demand, we can calculate two magnitudes that will help us to classify all items at once:\n\nHow spaced the sales are over time. Calculated as the mean inter-demand interval (p).\nA measure of how bumpy the sales are. Calculated as the squared coefficient of variation of demand sizes (CV^2CV \n2\n ).\nAlthough we'll not dive into the details, we can use both values to perform a scatter plot of the items and cluster them into the four mentioned categories. This classification will also help us to define the level of aggregation of our data. In the following figures, we compare how the items are classified when aggregating the sales data daily and weekly.\n\n\nDaily aggregation chart\n\nWeekly aggregation chart\nAs you can see, a weekly aggregation of the data provides us a higher number of items to consider given our classification. This is also shown in the table below.\n\nItems classification according to demand variability\n\nWe will aggregate our data by week and we will select those items that fall in the smooth, erratic, and intermittent categories (629 items), leaving out of the scope those with a lumpy demand (568 items) for which we should think of another level of aggregation.\n\n\nanalysis flow\n638 items have intermittent, erratic and smooth demand pattern, while 559 items present lumpy demand pattern.\nNumber of price changes and timing\n\nAnother important aspect to consider for each one of the items is how many prices have been historically tried and during which period of time.\n\nThe more information we have about the demand changes relative to the price, the less the exploration efforts needed, as we will see later on.\n\nThese are the filters that we are considering for the Olist dataset:\n\nHave historically changed their price at least three times in the 2-year span\nEach of these prices has at least four days of sales.\nWith these constraints, we will be able to filter those items that have a significant amount of information. This allows us to estimate the demand curves within our historical sales, i.e., we want to select those items in which we have information about the sales behavior at different points in time, using different prices.\n\nIf we only filter by the number of price changes and timing specified in 1. and 2. the number of valid items is reduced to 144.\n\nItems selection\n\nWe are now ready to join both conditions together \u2014pattern shape of historical demand & the number of price changes and timing\u2014 to reach the final number of items that, without doing any other further experimentation, are ready to optimize. Taking both conditions into consideration, we reach a universe of 51 items that are ready to start.\n\nAs mentioned before, the rest of the items won't be discarded, but further price exploration will have to take place to generate meaningful datapoints. The price exploration strategy will be explained in the \"Experimentation setup\" section.\n\n\nLast stage of the funnel\n51 items ready to optimize when considering pattern shape of historical demand & the number of price changes and timing.\n3. Anomaly detection\nFrom our experience working with multiple datasets and several retailers, we have been able to develop an anomaly detection report. This report provides useful information to the data science team working with the dataset and provides meaningful insights to the business team. It has been of great help to identify missing information or misconceptions of the available data.\n\nAnomaly detection involves identifying the differences, deviations, and exceptions from the norm in the data.\n\nIn a typical retail scenario, our anomaly detection report consists of three main topics:\n\nIdentifying missing and inconsistent fields\nDetecting and handling out-of-stocks\nObserving sales distribution and detecting outliers\nWe will dig into the details of each one of these topics in the following sections.\n\n1. Identify missing and inconsistent fields\n\n1.1 Missing fields\n\nThough there is a lot of meaningful information in the sales information of Olist, there are missing features that are essential to tackle our price optimization problem. For example, we don't have information about the manufacturer's suggested retail price (MSRP) of the items nor about its costs.\n\nLet's say that we want to optimize our prices to maximize profit. Let's take a look then to the definition of the profit function:\n\n\nQ represents the demand, p the price, and c the cost, and the sum is performed over our universe of items (N). As you can see, the cost is needed to calculate the profit, and so is to perform the optimization. If we don't have this information, we can assume a zero cost for the items, and then optimize our prices to maximize revenue instead of profit:\n \n\nOn the other hand, the MSRP can be used to compare the selling prices against the manufacturer's suggestion. It can be useful when analyzing each item's historical pricing strategy and its effects on the demand.\n\nAlso, we don't have much information about the items themselves. We lack information such as the item's description, images, name, sub-category, or brand, which are of great help to relate and cluster similar items. Nevertheless, as we will see, we can still get meaningful results.\n\n1.2 Inconsistent fields\n\nWe can also run a series of inconsistency checks between the features. For example, if we have the sales data aggregated by date and not by order or basket, we can check if the \"Total Sales\" for a particular item/date is equal to the number of units sold times the price.\n\nWe would all expect this to be true, but what if:\n\nthere have been refunds on that given date\nsome units were sold with a privileged discount\nsome customer paid with a different currency\nWe need to make sure that all of these details are correctly recorded, and if not, try to detect these anomalies automatically to take specific actions.\n\nMoreover, many times retailers have business-specific rules which can be checked in a data-driven way to make sure they are being applied or not. Does any of the following business rules sound familiar to you?\n\nPrices for the same item must be consistent across stores\nNo more than 40% discount from MSRP is allowed\nSell price must always be lower than the competitor's price\nDifferent price promotions cannot overlap\nWith our anomaly detection report, you can specify the business rules, and we can check these quickly, easily, and automatically.\n\nGoing back to the Olist sample dataset, we don't have information about the MSRP, the cost of each item, or any specific business rule. However, we could still check for certain general inconsistent fields such as the following:\n\nDo we have negative values on sales units or prices?\n\nNegative values screenshot\nIs the same item sold at different prices on the same date?\n\nDifferent prices screenshot\n2. Detecting and handling out-of-stocks\n\nAnother chapter in our anomaly detection report for retailers consists of detecting out-of-stock issues. This is very important because of two main reasons:\n\nRevenue loss: having out-of-stock issues implies losing sales and sometimes even clients. Therefore, pricing and inventory management decisions have to be taken jointly.\nDemand curve estimation: to correctly estimate a demand curve for each item/store, we need to identify the causes for which we observe zero sales for certain items in certain moments. In particular, we need to distinguish between zero demand and zero supply.\nIdeally, we could incorporate inventory data and flag days as out-of-stock if the stock level for a specific item is equal to zero. Nevertheless, from our experience working with retailers, we have learned that inventory data is not necessarily 100% accurate. For example, in a brick-and-mortar store, we may observe zero sales for consecutive days, even when the inventory data suggests there should be five or even ten items in-store. Still, these items are not properly displayed to the public.\n\nTherefore, we have designed different heuristics according to the client's data to correctly identify out-of-stock days and better estimate the demand curves. Some of these heuristics are complex and client-specific, while others are simple and straightforward.\n\nAs a preliminary approach, we can identify per item/store long gaps of zero sales and assume these are out-of-stock problems. This definition of out-of-stock will depend on the consecutive zeros threshold we define. For example, we can determine the consecutive zeros threshold as 30 days; this is to say that an item will be considered as out-of-stock if there are no sales in 30 consecutive days. We can visualize the out-of-stock periods as the shadowed area on the following plot:\n\nOut of stock report.\nGiven this definition, we can estimate the revenue lost as a consequence of out-of-stock problems, e.g., assuming that the sales during the out-of-stock periods would have been equal to an average date.\n\nFurthermore, we could also download a report with the estimated revenue lost for each item/seller on each out-of-stock period. The Out-of-Stock Report would look something like this:\n\n\nOut of stock analysis.\nEven though this report gives us some insights into the impact that out-of-stocks is having on our revenue, these are just preliminary estimations. To better approximate the revenue loss and to identify the out-of-stock days better, we can use the same Machine Learning models we use to perform the forecast of the demands.\n\nLet's picture this with an example. Imagine our Machine Learning model estimates that 53 units would've been sold of a specific item, on a particular date, given a certain price and conditions, and we observe that only 1 unit has been sold. This would raise an alert since it could be that a specific item has potentially run out-of-stock.\n\nFinally, but not least important, once the Machine Learning models are trained, they can also be used in a proactive way instead of reactive. Several weeks in advance, we could estimate how many units will be sold and generate stock replenishment recommendations to avoid out-of-stock issues.\n\n3. Observe sales distribution and detect outliers\n\nTo detect outliers, we can plot the daily sales distribution per seller or store by day of the week, month, or year. In particular, we can observe the distribution and detect outliers by generating different visualizations of the data.\n\nFor example, we can generate a boxplot to observe the distribution of units sold by day of the week. These plots are useful to understand how many units are sold on average, by a seller or store, per item on each category group.\n\nAs shown in the following figure, we can observe from the boxes sizes that, on average, all days have a similar behavior where most sellers per day sell only one unit per item. Nevertheless, there seem to be several outliers (dots that are far from the normal distribution). If we hover over the plot, for example, on the point which is farther away on the Friday category, we can see that it corresponds to a particular seller who sold 25 units for a specific item on Friday, 24th November 2017.\n\nDistribution per day chart\nDoes that date ring a bell to you? Why could you sell a totally different amount of units than what you usually sell? BINGO! That was Black Friday!\n\nEven though we were considering national holidays as features for our demand estimation, from the data, we have learned that it may be relevant to include Black Friday as a special date.\n\nHowever, we may also wonder what happened on Black Friday in other years? Was it also the date with the highest sales in 2018? Or what about Black Friday for other sellers? Did their sales increase dramatically too? By having a look at the boxplot by year, we can check what happened in 2018:\n\nDistribution per year chart\nFor 2016, we have almost no observations (data started in September 2016), but if we compare 2017 and 2018, the distribution seems very similar. Nevertheless, the seller who sold the maximum amount of units per item in 2018 was a different seller and did so on a different date. In particular, it was on May 8, 2018. What could have happened on that specific date?\n\nWhen looking in detail at the list of holidays and events for Brazil in 2018, we can see that May 8 corresponds to the week before Sunday of Mother's Day (May 13). Also, this makes total sense when we see that the category of the item is marked as \"watches gifts.\"\n\nTaking consideration of holidays is important to detect and classify the outliers into different groups, i.e., those potentially related to errors in the data ingestion process \u2014and that need to be treated as misleading\u2014 and those strictly associated with special events.\n\nFor sure, we need to continue exploring the data in order to gain more insights. We could, for example, observe the location for that particular outlier seller and check the sales distribution of the nearest stores to see whether it was one specific event that affected several stores or just this one. Or we could check if a sudden decrease in price can explain the increase in sales by plotting both variables together, as shown in the following plot.\n\nSales units and price chart\nFollowing our intuition, this sudden increase in sales may be explained by a decrease in price. But we need to continue thinking of possible features that could help us better understand and explain the price elasticity of demand. This is where our Machine Learning models come into play. By including several different features, we can learn from the data which variables are mostly affecting the demand for each item. By modeling this particular item, we would have learned that it is not only the reduction in price that explained the increase in sales but also the days until Mother's Day, the day of the week, and several other variables.\n\nFinally, it's important to stress the relevance of business expertise to help point out those relevant dates when you know sales go up or down. The Machine Learning models help you validate and test new hypotheses.\n\n4. Modeling the demand\nHaving our data ready, we proceed to train the demand forecasting models. Our goal is to be able to build the demand curve for each item. The models will learn the relation between the demand and several factors such as the price, holidays/events, and macroeconomics.\n\nDuring this phase, the collaboration with the business team is particularly important. Business insights help to validate the selected features and ensures that we are not missing any important aspect that can be potentially used to forecast the demand.\n\nBelow we show an example of the demand curve for an item in the garden tools category.\n\n\nDemand curve chart\nThis plot shows the predicted demand curve (blue line) for a particular date (April 29, 2018). The vertical green lines are the historically tested prices (from where our model learned the price elasticity). The red line shows the expected profit (assuming item cost equal to zero, as we don't have this information).\n\nWhat insights do we get from the estimated demand curve? In this case, we can see that between $50 and $60, the number of units sold would be equal to 14, meaning that in this particular price range and week, the demand curve is completely inelastic. Nevertheless, if we increase the price somewhere close to $63, the number of units sold will decrease, one unit less will be sold but still obtain higher revenue. According to our model estimations, without having a retailer and business background, we would recommend starting with a price between $63 and $70 to explore how the demand behaves on this point.\n\nOnce we have information about the demand using this price, we can use the latest information to feed the model and then decide which is the following price to set. The advantage of trying prices far from the historically tested ones would be that we may find a new optimum price (exploring). Still, trying prices that have never been tested before brings more uncertainty as the accuracy of the model in those regions is lower.\n\nMoreover, explainability is a desired feature of the models in price optimization. Most of the time, it provides new business insights to retailers and validates that the model is learning correctly. Our models can tell us which features were the most relevant to fit the data. Below we show the top-10 feature importance for the model associated with our garden tool item.\n\n\nFeature importance chart.\nThe most important features according to the model are:\n\nDays until Black Friday\nItem ID\nConsumer Confidence Index\nTotal imports in USD\nWeek number of the year\nMonth number of the year\nDays until Spring\nDays until Winter\nSeller ID\nSell price\nAs you can see, Black Friday is the most important event for this retailer. This means that each time the feature was used, the model achieves a greater increase in its accuracy. Also, the Consumer Confidence Index (CCI) and the total imports in USD (Imports) are the most relevant macroeconomic factors for this particular item. Furthermore, the beginning of Winter and Spring seasons also play an important role in explaining the demand for our garden tool item.\n\n5. Experimental setup\nOnce we have our demand curves, we need to decide what exact prices we will try. This is to say, we have to plan what concrete actions we will take, and these actions or strategies will depend on our experimental setup. Three main questions need to be answered to define the experimental setup:\n\n1. How are we going to measure the profit gain for the business as usual?\n\nWe need to define a measure against which we will be compared in order to evaluate the progress of our pricing system. We can have different options here, for example:\n\nControl store: we can start by comparing our store's performance against a control one, that will keep using the same business-as-usual pricing strategy. This scenario usually applies to brick-and-mortar stores.\n\nSynthetic control: when we lack a reference or control metric to compare against, we can synthesize one using other available metrics. This is known as synthetic control and provides us a way to measure the profit gain with respect to the business-as-usual even when there is no clear control measure. This method can be used for both e-commerce and brick-and-mortar retail companies.\n\n2. Which price change frequency is best suited in my case?\n\nWe need to define the frequency of the price changes, that is if we are going to change the prices hourly, daily, weekly, etc. This will depend on two major factors:\n\nIs your business Brick-and-mortar or e-commerce?\n\nBrick-and-mortar stores, due to the operational limitations of physically changing price tags, usually use weekly or bi-weekly frequencies.\n\nSome of these companies use digital price tags that remove this limitation and let them change prices more frequently. However, if the frequency is too high, customers may feel that the prices in the store are unfair, so this should be taken cautiously.\n\nOn the other hand, e-commerce retailers usually select their prices more dynamically, changing prices in the hour-daily range. Often, customers are more likely to accept these higher pricing frequencies since they see this behavior in big players like Amazon.\n\nHow accurate are our models for each price change frequency?\n\nThe other major factor to consider is the ability of the models to accurately predict the sales units in different time windows. This will depend on various factors, such as the shapes of the demands and the amount of historically price changes, as we discussed in the previous section.\n\nSo as a general rule, we should choose the frequency in which our models are accurate enough to perform the predictions, and that maximizes our objective function (e.g., profit) while remaining operationally viable.\n\n3. What will be the Exploration vs. Exploitation strategy.\n\nDuring exploration, we try new unseen prices that will add crucial information to the data about the shape of the demand curve. On the other hand, during exploitation, we exploit the information that we already know from the demand curves and try to maximize our objective function.\n\nDifferent strategies can be applied depending on the necessities of the retailer and the quality of the data regarding historical prices. In the Olist dataset, we have 51 items ready to start exploitation. The rest of the items have very few prices tried or very few historical sales and will require some level of exploration.\n\nHere below, we list some recommendations that are useful when defining the exploration strategy:\n\nIf we count on the MSRP, start at this price, and make variations around this price to see how the demand behaves.\nIf not, start setting a price near the ones that have been historically tried and make variations around this price to see how the demand behaves.\nWith the business team, set a reasonable threshold to the upper and lower boundaries of the exploration price range.\n6. Optimization\nOnce all these questions have been answered, you have everything set to optimize your pricing strategy, as many other retailers are already doing.\n\nOptimizing your pricing strategy means defining your function to be maximized and consider any restrictions to include as constraints. In most scenarios, the objective function to be maximized will be the overall profit subject to the stock level or sales velocity preferred. However, you may be willing to maximize sales volume or customer\u2019s lifetime value. The definition of the optimization problem must be clearly defined with the client.\n\nAll in all, we show the opportunity summary report that showcases several promising lines of work.\n\nOpportunity summary report\n\nFinal thoughts and concluding remarks\nWe have shown how a typical price optimization pipeline would look like, and how we would asses the potential opportunities for a pricing model.\n\nIn this particular example, working with the publicly available sales data of the Brazilian marketplace Olist and with no further information nor insights from their side, we have shown that there is room for improvement regarding their pricing decisions.\n\nCombining their available dataset and other publicly available information, such as holidays/events data and macroeconomic series, we have been able to estimate demand curves for a subset of the items which would allow them to take optimal pricing decisions. The remaining items should undergo an exploration phase where new prices would be tried in order to be able to estimate their demand curves accurately. The exploration strategy is generally decided jointly with the client.\n\nFor this example, given the data available and our past experience, we would suggest performing weekly price changes during the exploration phase. Furthermore, we have shown that there is plenty of room for inventory management improvement since there seems to be an important amount of revenue lost due to understock.\n\nFinally, during this analysis, we have focused on working with items that have been sold several times during the two years time span. Nevertheless, it is essential to note that there are thousands of items that have been sold only once. For those items, we could think of a price automation solution that could also help your company boost its profit.\n\nIn summary, we have shown that price optimization results are immediate and disruptive for companies using traditional pricing. We have also demonstrated that demand forecasting can also be used in other areas of operations, including assortment optimization, marketing, and inventory planning. In other words, investing in a good demand forecasting system opens the door for new opportunities that can be developed component by component based on business priorities.\n\nWe can show you how this works, and help you find the opportunities you may be missing.", "questions": ["How many steps are there on the price optimization pipeline?", "Give me some details about the last step.", "What is needed to predict the future sales of a product?", "In how many categories can the demand history be classified?", "What's the difference between an intermittent demand history and a lumpy one?", "And what about an erratic demand history?", "Which of these demand histories is easier to work with?", "Why?", "What strategies are there for evaluating the progress of the pricing system?", "Explain the first one", "What is the main achievement of the system described in the blog post?", "Why is this good?", "Why is this good?"], "answers": {"input_text": ["six", "Optimize the price for the items, subject to a list of specific business constraints", "Historical sales data", "four", "Intermittent demand histories have infrequent demand occurrences but regular in magnitude. Lumpy ones also have infrequent demand occurrences but highly variable in magnitude", "It has high frequency and is highly variable in magnitude", "Smooth and intermittent", "Lumpy demand histories are harder to model because its demand shape is both erratic and irregular.", "Control store and synthetic control", "We can compare our system's performance against a store that is still using the business-as-usual pricing strategy.", "It was able to estimate demand curves for a subset of the items ", "Because it allows to take optimal pricing decisions", "Because it allows to take optimal pricing decisions"]}}, {"conversation_id": "11", "context_id": "2", "story": "Price optimization for e-commerce: a case study\n\nWe have previously discussed how a data-driven price optimization is no longer an option for retailers, but a question of how and when to do it. With a world that's moving towards changing prices more and more dynamically, static pricing strategies can't keep up, and data-driven approaches have arrived to stay. In this post, we'll be focusing on how to perform data-driven price optimization, using a case study to showcase what can be done using sales data to predict demand and optimize prices.\n\nHaving read our previous post, you may now ask yourself:\n\nIs price optimization with Machine Learning worth the investment in my case?\nHow do I know if the quality of my sales data is good enough?\nOr putting it simply: price optimization, here I go! But am I ready?\nTo address these and other questions, we will demonstrate a practical example using the tools we've developed in the past years while excelling in our client's pricing strategies. To do so, we will take the publicly available dataset from the Brazilian marketplace Olist as our use case foundation.\n\nThis will be a glimpse of what you can achieve by sharing your sales data with us. We could perform an opportunity summary report in line with this use case, and uncover your own opportunities regarding price optimization. So, fasten your seatbelt and get ready for the take-off.\n\nAnalyzing an E-Commerce's Sales Data\nWhile diving into the practical example and explain how we solve these kinds of problems, let's refresh some price optimization concepts and introduce the work pipeline we will follow. There are different ways to tackle a price optimization problem and different strategies to perform a Machine Learning approach for this task.\n\nOur price optimization pipeline will be based on six phases or steps:\n\nData Collection and Integration: build a data asset that integrates all the relevant information for demand forecasting.\nData analysis: generate explanatory data analysis reports and categorize items according to demand patterns.\nAnomaly detection report: generate an anomaly detection report to identify inconsistent fields, detect out-of-stock issues, observe sales distribution, and detect outliers.\nDemand forecasting: estimate the demand curves for each one of the items, i.e., understand the elasticity of the demand.\nExperimentation: elaborate the strategy on how to perform exploration and exploitation.\nOptimization: optimize the price for the items, subject to a list of specific business constraints.\nIn the remainder of the post we will describe each one of these steps using the Olist dataset as an explanatory example.\n\n\nIllustrations flow with pricing elements like data and charts\nPrice optimization pipeline\n1. Data Collection and Integration\nThe first step is to build our data asset, taking different sources of information that will be used to train our demand forecasting models.\n\nThe cornerstone to begin working on any price optimization project is the sales dataset, i.e., to have the sales information of your business. Whether you are an e-commerce or a brick-and-mortar store, sales information is the first asset we have to focus on, and where our analysis will begin.\n\nWhat do we mean by the sales dataset? To put it simply, we need access to the sales information of the business segment that the company is willing to optimize the pricing strategy for. If you already know beforehand which segment this is, great! You are one step ahead. But if you don't, no worries, we can help you decide which is the best segment to start based on your sales information, as we will see in our example.\n\nUsually, companies choose to store their sales information in one of the following structures:\n\nBy sales transactions: each transaction detail with its item identifier, the timestamp, the number of items sold, and its sell price. Other data fields that can be useful to complement this information are the payment method (cash, credit card, etc.) and the shipping information.\nBy daily aggregation: some companies store their sales information this way because they need to have an accurate cash flow.\nComplementing your sales data\n\nNaturally, sales data won't be the only source of information we will consider when addressing the price optimization task. As demand is affected by several factors (such as the price, competitor's price, season, holidays, and macroeconomic variables), this is only one of the many factors to consider.\n\nAt this point, the knowledge of a business expert will play a big part to jointly work on a data structure that works for each specific problem.\n\nOnce again, don't worry if you are not storing all this data currently. We can quickly provide you with several of these public datasets.\n\nHere we list other sources of information that are usually considered when collecting data:\n\nItems information: data such as category, sub-category, description, number of photos, photos, and general characteristics can be used to profile each item and relate it with others.\nMacroeconomics: different indicators such as GDP, unemployment rate, consumer confidence, and currency exchange. It is crucial to keep in mind that these indicators are all gathered during different periods. Some of them are calculated quarterly, some others monthly, and some others daily.\nStores or seller information: location, size, etc.\nHolidays and events: certain holidays and events can have a direct impact on the number of sales of different items, e.g., Valentine's Day, Christmas, Mother's Day, etc.\nItem's reviews: reviews performed by customers can also be included in our forecasting model.\nCompetitor(s)' price: the presence of one or several strong competitors can have a direct impact on the demand of our items, so having information about how they are pricing can potentially be used as insights for our models.\nCustomer traffic: number of visitors (e-commerce and brick-and-mortar), average time on page, and number of clicks (e-commerce only).\nOther business-specific information.\n2. Data Analysis\nNow that the basics are clear let's dive into the Olist e-commerce data. We always start working on an exploratory data analysis of the sales data to gain insights about the different available items, their trends in the sales, and the pricing history as well.\n\nFor the Olist example, the sales data is stored as transactions, divided into three different datasets:\n\norder-items: which includes the items and sell price for each order.\norders: contains all the information related to the orders, such as customer, seller, order status, order timestamp, and the estimated delivery date.\norder-payments: it includes the payment information for each order.\nAfter processing these three datasets, we can build a standard sales dataset, as shown in the following figure, which aggregates item-seller sales per day:\n\n\nOlist Data set\nStandard sales dataset.\nOnce we define the sales dataset, we should question ourselves how many items should we choose to begin the optimization? Or even further: are all items ready to be optimized?\n\nTo begin with, we will set a threshold on the number of historical sales observations. To estimate the future demand for a given item, we need historical sales data. Therefore, we will start considering those articles that have, at minimum, ten weeks with at least 1 unit sold.\n\nUsing this threshold, in our Olist use case, we will have a universe of 1,197 items. On the other hand, we could think of a price automation solution for those unique items that are sold only once or very few times.\n\n\n1.197 articles have, at minimum, ten weeks with at least 1 unit sold. \nMoreover, this subset of items can be further categorized. In particular, we will classify items based on demand history and the number of different prices that have historically been used. The shape of the demand history will have a direct impact on the modeling efforts, and so will the number of price changes on the experimentation strategy. Therefore, we will divide our approach into two different stages to select the first set of items:\n\nStudy the variability of the demand history concerning timing and magnitude.\nStudy the number of price changes and timing.\nVariability of the demand history\n\nAt this first stage we will study the variability of the demand history, as we are looking to identify those items that have a simpler selling pattern than others. This approach will not only impact the final accuracy of the sales forecasting but also the feature engineering and modeling efforts.\n\nTo perform this categorization, we will use the one proposed by Syntetos et al. (2005), in which the demands are divided into four different categories:\n\nIntermittent, for items with infrequent demand occurrences but regular in magnitude\nLumpy, for items whose demand is intermittent and highly variable in magnitude\nSmooth, for items whose demand is both frequent and regular in magnitude\nErratic, for items whose demand is frequent but highly variable in magnitude\nIn the following figure, we show how each of these demands would look like in a typical scenario, plotting the demand magnitude to the time.\n\n\nDemand behavior for each scenario.\nSmooth and intermittent demands are easier to model than lumpy and erratic. In particular, lumpy items are hard to model, because its demand shape is both erratic and irregular.\n\nInstead of looking at the shape of each item's demand, we can calculate two magnitudes that will help us to classify all items at once:\n\nHow spaced the sales are over time. Calculated as the mean inter-demand interval (p).\nA measure of how bumpy the sales are. Calculated as the squared coefficient of variation of demand sizes (CV^2CV \n2\n ).\nAlthough we'll not dive into the details, we can use both values to perform a scatter plot of the items and cluster them into the four mentioned categories. This classification will also help us to define the level of aggregation of our data. In the following figures, we compare how the items are classified when aggregating the sales data daily and weekly.\n\n\nDaily aggregation chart\n\nWeekly aggregation chart\nAs you can see, a weekly aggregation of the data provides us a higher number of items to consider given our classification. This is also shown in the table below.\n\nItems classification according to demand variability\n\nWe will aggregate our data by week and we will select those items that fall in the smooth, erratic, and intermittent categories (629 items), leaving out of the scope those with a lumpy demand (568 items) for which we should think of another level of aggregation.\n\n\nanalysis flow\n638 items have intermittent, erratic and smooth demand pattern, while 559 items present lumpy demand pattern.\nNumber of price changes and timing\n\nAnother important aspect to consider for each one of the items is how many prices have been historically tried and during which period of time.\n\nThe more information we have about the demand changes relative to the price, the less the exploration efforts needed, as we will see later on.\n\nThese are the filters that we are considering for the Olist dataset:\n\nHave historically changed their price at least three times in the 2-year span\nEach of these prices has at least four days of sales.\nWith these constraints, we will be able to filter those items that have a significant amount of information. This allows us to estimate the demand curves within our historical sales, i.e., we want to select those items in which we have information about the sales behavior at different points in time, using different prices.\n\nIf we only filter by the number of price changes and timing specified in 1. and 2. the number of valid items is reduced to 144.\n\nItems selection\n\nWe are now ready to join both conditions together \u2014pattern shape of historical demand & the number of price changes and timing\u2014 to reach the final number of items that, without doing any other further experimentation, are ready to optimize. Taking both conditions into consideration, we reach a universe of 51 items that are ready to start.\n\nAs mentioned before, the rest of the items won't be discarded, but further price exploration will have to take place to generate meaningful datapoints. The price exploration strategy will be explained in the \"Experimentation setup\" section.\n\n\nLast stage of the funnel\n51 items ready to optimize when considering pattern shape of historical demand & the number of price changes and timing.\n3. Anomaly detection\nFrom our experience working with multiple datasets and several retailers, we have been able to develop an anomaly detection report. This report provides useful information to the data science team working with the dataset and provides meaningful insights to the business team. It has been of great help to identify missing information or misconceptions of the available data.\n\nAnomaly detection involves identifying the differences, deviations, and exceptions from the norm in the data.\n\nIn a typical retail scenario, our anomaly detection report consists of three main topics:\n\nIdentifying missing and inconsistent fields\nDetecting and handling out-of-stocks\nObserving sales distribution and detecting outliers\nWe will dig into the details of each one of these topics in the following sections.\n\n1. Identify missing and inconsistent fields\n\n1.1 Missing fields\n\nThough there is a lot of meaningful information in the sales information of Olist, there are missing features that are essential to tackle our price optimization problem. For example, we don't have information about the manufacturer's suggested retail price (MSRP) of the items nor about its costs.\n\nLet's say that we want to optimize our prices to maximize profit. Let's take a look then to the definition of the profit function:\n\n\nQ represents the demand, p the price, and c the cost, and the sum is performed over our universe of items (N). As you can see, the cost is needed to calculate the profit, and so is to perform the optimization. If we don't have this information, we can assume a zero cost for the items, and then optimize our prices to maximize revenue instead of profit:\n \n\nOn the other hand, the MSRP can be used to compare the selling prices against the manufacturer's suggestion. It can be useful when analyzing each item's historical pricing strategy and its effects on the demand.\n\nAlso, we don't have much information about the items themselves. We lack information such as the item's description, images, name, sub-category, or brand, which are of great help to relate and cluster similar items. Nevertheless, as we will see, we can still get meaningful results.\n\n1.2 Inconsistent fields\n\nWe can also run a series of inconsistency checks between the features. For example, if we have the sales data aggregated by date and not by order or basket, we can check if the \"Total Sales\" for a particular item/date is equal to the number of units sold times the price.\n\nWe would all expect this to be true, but what if:\n\nthere have been refunds on that given date\nsome units were sold with a privileged discount\nsome customer paid with a different currency\nWe need to make sure that all of these details are correctly recorded, and if not, try to detect these anomalies automatically to take specific actions.\n\nMoreover, many times retailers have business-specific rules which can be checked in a data-driven way to make sure they are being applied or not. Does any of the following business rules sound familiar to you?\n\nPrices for the same item must be consistent across stores\nNo more than 40% discount from MSRP is allowed\nSell price must always be lower than the competitor's price\nDifferent price promotions cannot overlap\nWith our anomaly detection report, you can specify the business rules, and we can check these quickly, easily, and automatically.\n\nGoing back to the Olist sample dataset, we don't have information about the MSRP, the cost of each item, or any specific business rule. However, we could still check for certain general inconsistent fields such as the following:\n\nDo we have negative values on sales units or prices?\n\nNegative values screenshot\nIs the same item sold at different prices on the same date?\n\nDifferent prices screenshot\n2. Detecting and handling out-of-stocks\n\nAnother chapter in our anomaly detection report for retailers consists of detecting out-of-stock issues. This is very important because of two main reasons:\n\nRevenue loss: having out-of-stock issues implies losing sales and sometimes even clients. Therefore, pricing and inventory management decisions have to be taken jointly.\nDemand curve estimation: to correctly estimate a demand curve for each item/store, we need to identify the causes for which we observe zero sales for certain items in certain moments. In particular, we need to distinguish between zero demand and zero supply.\nIdeally, we could incorporate inventory data and flag days as out-of-stock if the stock level for a specific item is equal to zero. Nevertheless, from our experience working with retailers, we have learned that inventory data is not necessarily 100% accurate. For example, in a brick-and-mortar store, we may observe zero sales for consecutive days, even when the inventory data suggests there should be five or even ten items in-store. Still, these items are not properly displayed to the public.\n\nTherefore, we have designed different heuristics according to the client's data to correctly identify out-of-stock days and better estimate the demand curves. Some of these heuristics are complex and client-specific, while others are simple and straightforward.\n\nAs a preliminary approach, we can identify per item/store long gaps of zero sales and assume these are out-of-stock problems. This definition of out-of-stock will depend on the consecutive zeros threshold we define. For example, we can determine the consecutive zeros threshold as 30 days; this is to say that an item will be considered as out-of-stock if there are no sales in 30 consecutive days. We can visualize the out-of-stock periods as the shadowed area on the following plot:\n\nOut of stock report.\nGiven this definition, we can estimate the revenue lost as a consequence of out-of-stock problems, e.g., assuming that the sales during the out-of-stock periods would have been equal to an average date.\n\nFurthermore, we could also download a report with the estimated revenue lost for each item/seller on each out-of-stock period. The Out-of-Stock Report would look something like this:\n\n\nOut of stock analysis.\nEven though this report gives us some insights into the impact that out-of-stocks is having on our revenue, these are just preliminary estimations. To better approximate the revenue loss and to identify the out-of-stock days better, we can use the same Machine Learning models we use to perform the forecast of the demands.\n\nLet's picture this with an example. Imagine our Machine Learning model estimates that 53 units would've been sold of a specific item, on a particular date, given a certain price and conditions, and we observe that only 1 unit has been sold. This would raise an alert since it could be that a specific item has potentially run out-of-stock.\n\nFinally, but not least important, once the Machine Learning models are trained, they can also be used in a proactive way instead of reactive. Several weeks in advance, we could estimate how many units will be sold and generate stock replenishment recommendations to avoid out-of-stock issues.\n\n3. Observe sales distribution and detect outliers\n\nTo detect outliers, we can plot the daily sales distribution per seller or store by day of the week, month, or year. In particular, we can observe the distribution and detect outliers by generating different visualizations of the data.\n\nFor example, we can generate a boxplot to observe the distribution of units sold by day of the week. These plots are useful to understand how many units are sold on average, by a seller or store, per item on each category group.\n\nAs shown in the following figure, we can observe from the boxes sizes that, on average, all days have a similar behavior where most sellers per day sell only one unit per item. Nevertheless, there seem to be several outliers (dots that are far from the normal distribution). If we hover over the plot, for example, on the point which is farther away on the Friday category, we can see that it corresponds to a particular seller who sold 25 units for a specific item on Friday, 24th November 2017.\n\nDistribution per day chart\nDoes that date ring a bell to you? Why could you sell a totally different amount of units than what you usually sell? BINGO! That was Black Friday!\n\nEven though we were considering national holidays as features for our demand estimation, from the data, we have learned that it may be relevant to include Black Friday as a special date.\n\nHowever, we may also wonder what happened on Black Friday in other years? Was it also the date with the highest sales in 2018? Or what about Black Friday for other sellers? Did their sales increase dramatically too? By having a look at the boxplot by year, we can check what happened in 2018:\n\nDistribution per year chart\nFor 2016, we have almost no observations (data started in September 2016), but if we compare 2017 and 2018, the distribution seems very similar. Nevertheless, the seller who sold the maximum amount of units per item in 2018 was a different seller and did so on a different date. In particular, it was on May 8, 2018. What could have happened on that specific date?\n\nWhen looking in detail at the list of holidays and events for Brazil in 2018, we can see that May 8 corresponds to the week before Sunday of Mother's Day (May 13). Also, this makes total sense when we see that the category of the item is marked as \"watches gifts.\"\n\nTaking consideration of holidays is important to detect and classify the outliers into different groups, i.e., those potentially related to errors in the data ingestion process \u2014and that need to be treated as misleading\u2014 and those strictly associated with special events.\n\nFor sure, we need to continue exploring the data in order to gain more insights. We could, for example, observe the location for that particular outlier seller and check the sales distribution of the nearest stores to see whether it was one specific event that affected several stores or just this one. Or we could check if a sudden decrease in price can explain the increase in sales by plotting both variables together, as shown in the following plot.\n\nSales units and price chart\nFollowing our intuition, this sudden increase in sales may be explained by a decrease in price. But we need to continue thinking of possible features that could help us better understand and explain the price elasticity of demand. This is where our Machine Learning models come into play. By including several different features, we can learn from the data which variables are mostly affecting the demand for each item. By modeling this particular item, we would have learned that it is not only the reduction in price that explained the increase in sales but also the days until Mother's Day, the day of the week, and several other variables.\n\nFinally, it's important to stress the relevance of business expertise to help point out those relevant dates when you know sales go up or down. The Machine Learning models help you validate and test new hypotheses.\n\n4. Modeling the demand\nHaving our data ready, we proceed to train the demand forecasting models. Our goal is to be able to build the demand curve for each item. The models will learn the relation between the demand and several factors such as the price, holidays/events, and macroeconomics.\n\nDuring this phase, the collaboration with the business team is particularly important. Business insights help to validate the selected features and ensures that we are not missing any important aspect that can be potentially used to forecast the demand.\n\nBelow we show an example of the demand curve for an item in the garden tools category.\n\n\nDemand curve chart\nThis plot shows the predicted demand curve (blue line) for a particular date (April 29, 2018). The vertical green lines are the historically tested prices (from where our model learned the price elasticity). The red line shows the expected profit (assuming item cost equal to zero, as we don't have this information).\n\nWhat insights do we get from the estimated demand curve? In this case, we can see that between $50 and $60, the number of units sold would be equal to 14, meaning that in this particular price range and week, the demand curve is completely inelastic. Nevertheless, if we increase the price somewhere close to $63, the number of units sold will decrease, one unit less will be sold but still obtain higher revenue. According to our model estimations, without having a retailer and business background, we would recommend starting with a price between $63 and $70 to explore how the demand behaves on this point.\n\nOnce we have information about the demand using this price, we can use the latest information to feed the model and then decide which is the following price to set. The advantage of trying prices far from the historically tested ones would be that we may find a new optimum price (exploring). Still, trying prices that have never been tested before brings more uncertainty as the accuracy of the model in those regions is lower.\n\nMoreover, explainability is a desired feature of the models in price optimization. Most of the time, it provides new business insights to retailers and validates that the model is learning correctly. Our models can tell us which features were the most relevant to fit the data. Below we show the top-10 feature importance for the model associated with our garden tool item.\n\n\nFeature importance chart.\nThe most important features according to the model are:\n\nDays until Black Friday\nItem ID\nConsumer Confidence Index\nTotal imports in USD\nWeek number of the year\nMonth number of the year\nDays until Spring\nDays until Winter\nSeller ID\nSell price\nAs you can see, Black Friday is the most important event for this retailer. This means that each time the feature was used, the model achieves a greater increase in its accuracy. Also, the Consumer Confidence Index (CCI) and the total imports in USD (Imports) are the most relevant macroeconomic factors for this particular item. Furthermore, the beginning of Winter and Spring seasons also play an important role in explaining the demand for our garden tool item.\n\n5. Experimental setup\nOnce we have our demand curves, we need to decide what exact prices we will try. This is to say, we have to plan what concrete actions we will take, and these actions or strategies will depend on our experimental setup. Three main questions need to be answered to define the experimental setup:\n\n1. How are we going to measure the profit gain for the business as usual?\n\nWe need to define a measure against which we will be compared in order to evaluate the progress of our pricing system. We can have different options here, for example:\n\nControl store: we can start by comparing our store's performance against a control one, that will keep using the same business-as-usual pricing strategy. This scenario usually applies to brick-and-mortar stores.\n\nSynthetic control: when we lack a reference or control metric to compare against, we can synthesize one using other available metrics. This is known as synthetic control and provides us a way to measure the profit gain with respect to the business-as-usual even when there is no clear control measure. This method can be used for both e-commerce and brick-and-mortar retail companies.\n\n2. Which price change frequency is best suited in my case?\n\nWe need to define the frequency of the price changes, that is if we are going to change the prices hourly, daily, weekly, etc. This will depend on two major factors:\n\nIs your business Brick-and-mortar or e-commerce?\n\nBrick-and-mortar stores, due to the operational limitations of physically changing price tags, usually use weekly or bi-weekly frequencies.\n\nSome of these companies use digital price tags that remove this limitation and let them change prices more frequently. However, if the frequency is too high, customers may feel that the prices in the store are unfair, so this should be taken cautiously.\n\nOn the other hand, e-commerce retailers usually select their prices more dynamically, changing prices in the hour-daily range. Often, customers are more likely to accept these higher pricing frequencies since they see this behavior in big players like Amazon.\n\nHow accurate are our models for each price change frequency?\n\nThe other major factor to consider is the ability of the models to accurately predict the sales units in different time windows. This will depend on various factors, such as the shapes of the demands and the amount of historically price changes, as we discussed in the previous section.\n\nSo as a general rule, we should choose the frequency in which our models are accurate enough to perform the predictions, and that maximizes our objective function (e.g., profit) while remaining operationally viable.\n\n3. What will be the Exploration vs. Exploitation strategy.\n\nDuring exploration, we try new unseen prices that will add crucial information to the data about the shape of the demand curve. On the other hand, during exploitation, we exploit the information that we already know from the demand curves and try to maximize our objective function.\n\nDifferent strategies can be applied depending on the necessities of the retailer and the quality of the data regarding historical prices. In the Olist dataset, we have 51 items ready to start exploitation. The rest of the items have very few prices tried or very few historical sales and will require some level of exploration.\n\nHere below, we list some recommendations that are useful when defining the exploration strategy:\n\nIf we count on the MSRP, start at this price, and make variations around this price to see how the demand behaves.\nIf not, start setting a price near the ones that have been historically tried and make variations around this price to see how the demand behaves.\nWith the business team, set a reasonable threshold to the upper and lower boundaries of the exploration price range.\n6. Optimization\nOnce all these questions have been answered, you have everything set to optimize your pricing strategy, as many other retailers are already doing.\n\nOptimizing your pricing strategy means defining your function to be maximized and consider any restrictions to include as constraints. In most scenarios, the objective function to be maximized will be the overall profit subject to the stock level or sales velocity preferred. However, you may be willing to maximize sales volume or customer\u2019s lifetime value. The definition of the optimization problem must be clearly defined with the client.\n\nAll in all, we show the opportunity summary report that showcases several promising lines of work.\n\nOpportunity summary report\n\nFinal thoughts and concluding remarks\nWe have shown how a typical price optimization pipeline would look like, and how we would asses the potential opportunities for a pricing model.\n\nIn this particular example, working with the publicly available sales data of the Brazilian marketplace Olist and with no further information nor insights from their side, we have shown that there is room for improvement regarding their pricing decisions.\n\nCombining their available dataset and other publicly available information, such as holidays/events data and macroeconomic series, we have been able to estimate demand curves for a subset of the items which would allow them to take optimal pricing decisions. The remaining items should undergo an exploration phase where new prices would be tried in order to be able to estimate their demand curves accurately. The exploration strategy is generally decided jointly with the client.\n\nFor this example, given the data available and our past experience, we would suggest performing weekly price changes during the exploration phase. Furthermore, we have shown that there is plenty of room for inventory management improvement since there seems to be an important amount of revenue lost due to understock.\n\nFinally, during this analysis, we have focused on working with items that have been sold several times during the two years time span. Nevertheless, it is essential to note that there are thousands of items that have been sold only once. For those items, we could think of a price automation solution that could also help your company boost its profit.\n\nIn summary, we have shown that price optimization results are immediate and disruptive for companies using traditional pricing. We have also demonstrated that demand forecasting can also be used in other areas of operations, including assortment optimization, marketing, and inventory planning. In other words, investing in a good demand forecasting system opens the door for new opportunities that can be developed component by component based on business priorities.\n\nWe can show you how this works, and help you find the opportunities you may be missing.", "questions": ["What steps constitute the price optimization pipeline?", "Explain the first one.", "Why should I do an exploratory data analysis of the sales?", "What are the datasets in which the sales data from Olist is stored?", "What is the information related specifically to orders?", "Why is it important to consider how many prices have been historically tried for each item?", "How many items are left after filtering by number of price changes?", "Explain how the pricing system can be used in a proactive way instead of reactive", "What is the Optimization step about?"], "answers": {"input_text": ["Data Collection and Integration, Data analysis, Anomaly detection report, Demand forecasting, Experimentation, and Optimization.", "It involves building a data asset that integrates all the relevant information for demand forecasting.", "To gain insights about the different available items, their trends in sales, and the pricing history", "order-items, orders, order-payments", "customer, seller, order status, order timestamp, and the estimated delivery date", "The more information we have about the demand changes relative to the price, the less the exploration efforts needed", "144", "We could estimate how many units will be sold and generate stock replenishment recommendations.", "It is about defining your function to be maximized and considering any restrictions to include as constraints."]}}, {"conversation_id": "12", "context_id": "2", "story": "Price optimization for e-commerce: a case study\n\nWe have previously discussed how a data-driven price optimization is no longer an option for retailers, but a question of how and when to do it. With a world that's moving towards changing prices more and more dynamically, static pricing strategies can't keep up, and data-driven approaches have arrived to stay. In this post, we'll be focusing on how to perform data-driven price optimization, using a case study to showcase what can be done using sales data to predict demand and optimize prices.\n\nHaving read our previous post, you may now ask yourself:\n\nIs price optimization with Machine Learning worth the investment in my case?\nHow do I know if the quality of my sales data is good enough?\nOr putting it simply: price optimization, here I go! But am I ready?\nTo address these and other questions, we will demonstrate a practical example using the tools we've developed in the past years while excelling in our client's pricing strategies. To do so, we will take the publicly available dataset from the Brazilian marketplace Olist as our use case foundation.\n\nThis will be a glimpse of what you can achieve by sharing your sales data with us. We could perform an opportunity summary report in line with this use case, and uncover your own opportunities regarding price optimization. So, fasten your seatbelt and get ready for the take-off.\n\nAnalyzing an E-Commerce's Sales Data\nWhile diving into the practical example and explain how we solve these kinds of problems, let's refresh some price optimization concepts and introduce the work pipeline we will follow. There are different ways to tackle a price optimization problem and different strategies to perform a Machine Learning approach for this task.\n\nOur price optimization pipeline will be based on six phases or steps:\n\nData Collection and Integration: build a data asset that integrates all the relevant information for demand forecasting.\nData analysis: generate explanatory data analysis reports and categorize items according to demand patterns.\nAnomaly detection report: generate an anomaly detection report to identify inconsistent fields, detect out-of-stock issues, observe sales distribution, and detect outliers.\nDemand forecasting: estimate the demand curves for each one of the items, i.e., understand the elasticity of the demand.\nExperimentation: elaborate the strategy on how to perform exploration and exploitation.\nOptimization: optimize the price for the items, subject to a list of specific business constraints.\nIn the remainder of the post we will describe each one of these steps using the Olist dataset as an explanatory example.\n\n\nIllustrations flow with pricing elements like data and charts\nPrice optimization pipeline\n1. Data Collection and Integration\nThe first step is to build our data asset, taking different sources of information that will be used to train our demand forecasting models.\n\nThe cornerstone to begin working on any price optimization project is the sales dataset, i.e., to have the sales information of your business. Whether you are an e-commerce or a brick-and-mortar store, sales information is the first asset we have to focus on, and where our analysis will begin.\n\nWhat do we mean by the sales dataset? To put it simply, we need access to the sales information of the business segment that the company is willing to optimize the pricing strategy for. If you already know beforehand which segment this is, great! You are one step ahead. But if you don't, no worries, we can help you decide which is the best segment to start based on your sales information, as we will see in our example.\n\nUsually, companies choose to store their sales information in one of the following structures:\n\nBy sales transactions: each transaction detail with its item identifier, the timestamp, the number of items sold, and its sell price. Other data fields that can be useful to complement this information are the payment method (cash, credit card, etc.) and the shipping information.\nBy daily aggregation: some companies store their sales information this way because they need to have an accurate cash flow.\nComplementing your sales data\n\nNaturally, sales data won't be the only source of information we will consider when addressing the price optimization task. As demand is affected by several factors (such as the price, competitor's price, season, holidays, and macroeconomic variables), this is only one of the many factors to consider.\n\nAt this point, the knowledge of a business expert will play a big part to jointly work on a data structure that works for each specific problem.\n\nOnce again, don't worry if you are not storing all this data currently. We can quickly provide you with several of these public datasets.\n\nHere we list other sources of information that are usually considered when collecting data:\n\nItems information: data such as category, sub-category, description, number of photos, photos, and general characteristics can be used to profile each item and relate it with others.\nMacroeconomics: different indicators such as GDP, unemployment rate, consumer confidence, and currency exchange. It is crucial to keep in mind that these indicators are all gathered during different periods. Some of them are calculated quarterly, some others monthly, and some others daily.\nStores or seller information: location, size, etc.\nHolidays and events: certain holidays and events can have a direct impact on the number of sales of different items, e.g., Valentine's Day, Christmas, Mother's Day, etc.\nItem's reviews: reviews performed by customers can also be included in our forecasting model.\nCompetitor(s)' price: the presence of one or several strong competitors can have a direct impact on the demand of our items, so having information about how they are pricing can potentially be used as insights for our models.\nCustomer traffic: number of visitors (e-commerce and brick-and-mortar), average time on page, and number of clicks (e-commerce only).\nOther business-specific information.\n2. Data Analysis\nNow that the basics are clear let's dive into the Olist e-commerce data. We always start working on an exploratory data analysis of the sales data to gain insights about the different available items, their trends in the sales, and the pricing history as well.\n\nFor the Olist example, the sales data is stored as transactions, divided into three different datasets:\n\norder-items: which includes the items and sell price for each order.\norders: contains all the information related to the orders, such as customer, seller, order status, order timestamp, and the estimated delivery date.\norder-payments: it includes the payment information for each order.\nAfter processing these three datasets, we can build a standard sales dataset, as shown in the following figure, which aggregates item-seller sales per day:\n\n\nOlist Data set\nStandard sales dataset.\nOnce we define the sales dataset, we should question ourselves how many items should we choose to begin the optimization? Or even further: are all items ready to be optimized?\n\nTo begin with, we will set a threshold on the number of historical sales observations. To estimate the future demand for a given item, we need historical sales data. Therefore, we will start considering those articles that have, at minimum, ten weeks with at least 1 unit sold.\n\nUsing this threshold, in our Olist use case, we will have a universe of 1,197 items. On the other hand, we could think of a price automation solution for those unique items that are sold only once or very few times.\n\n\n1.197 articles have, at minimum, ten weeks with at least 1 unit sold. \nMoreover, this subset of items can be further categorized. In particular, we will classify items based on demand history and the number of different prices that have historically been used. The shape of the demand history will have a direct impact on the modeling efforts, and so will the number of price changes on the experimentation strategy. Therefore, we will divide our approach into two different stages to select the first set of items:\n\nStudy the variability of the demand history concerning timing and magnitude.\nStudy the number of price changes and timing.\nVariability of the demand history\n\nAt this first stage we will study the variability of the demand history, as we are looking to identify those items that have a simpler selling pattern than others. This approach will not only impact the final accuracy of the sales forecasting but also the feature engineering and modeling efforts.\n\nTo perform this categorization, we will use the one proposed by Syntetos et al. (2005), in which the demands are divided into four different categories:\n\nIntermittent, for items with infrequent demand occurrences but regular in magnitude\nLumpy, for items whose demand is intermittent and highly variable in magnitude\nSmooth, for items whose demand is both frequent and regular in magnitude\nErratic, for items whose demand is frequent but highly variable in magnitude\nIn the following figure, we show how each of these demands would look like in a typical scenario, plotting the demand magnitude to the time.\n\n\nDemand behavior for each scenario.\nSmooth and intermittent demands are easier to model than lumpy and erratic. In particular, lumpy items are hard to model, because its demand shape is both erratic and irregular.\n\nInstead of looking at the shape of each item's demand, we can calculate two magnitudes that will help us to classify all items at once:\n\nHow spaced the sales are over time. Calculated as the mean inter-demand interval (p).\nA measure of how bumpy the sales are. Calculated as the squared coefficient of variation of demand sizes (CV^2CV \n2\n ).\nAlthough we'll not dive into the details, we can use both values to perform a scatter plot of the items and cluster them into the four mentioned categories. This classification will also help us to define the level of aggregation of our data. In the following figures, we compare how the items are classified when aggregating the sales data daily and weekly.\n\n\nDaily aggregation chart\n\nWeekly aggregation chart\nAs you can see, a weekly aggregation of the data provides us a higher number of items to consider given our classification. This is also shown in the table below.\n\nItems classification according to demand variability\n\nWe will aggregate our data by week and we will select those items that fall in the smooth, erratic, and intermittent categories (629 items), leaving out of the scope those with a lumpy demand (568 items) for which we should think of another level of aggregation.\n\n\nanalysis flow\n638 items have intermittent, erratic and smooth demand pattern, while 559 items present lumpy demand pattern.\nNumber of price changes and timing\n\nAnother important aspect to consider for each one of the items is how many prices have been historically tried and during which period of time.\n\nThe more information we have about the demand changes relative to the price, the less the exploration efforts needed, as we will see later on.\n\nThese are the filters that we are considering for the Olist dataset:\n\nHave historically changed their price at least three times in the 2-year span\nEach of these prices has at least four days of sales.\nWith these constraints, we will be able to filter those items that have a significant amount of information. This allows us to estimate the demand curves within our historical sales, i.e., we want to select those items in which we have information about the sales behavior at different points in time, using different prices.\n\nIf we only filter by the number of price changes and timing specified in 1. and 2. the number of valid items is reduced to 144.\n\nItems selection\n\nWe are now ready to join both conditions together \u2014pattern shape of historical demand & the number of price changes and timing\u2014 to reach the final number of items that, without doing any other further experimentation, are ready to optimize. Taking both conditions into consideration, we reach a universe of 51 items that are ready to start.\n\nAs mentioned before, the rest of the items won't be discarded, but further price exploration will have to take place to generate meaningful datapoints. The price exploration strategy will be explained in the \"Experimentation setup\" section.\n\n\nLast stage of the funnel\n51 items ready to optimize when considering pattern shape of historical demand & the number of price changes and timing.\n3. Anomaly detection\nFrom our experience working with multiple datasets and several retailers, we have been able to develop an anomaly detection report. This report provides useful information to the data science team working with the dataset and provides meaningful insights to the business team. It has been of great help to identify missing information or misconceptions of the available data.\n\nAnomaly detection involves identifying the differences, deviations, and exceptions from the norm in the data.\n\nIn a typical retail scenario, our anomaly detection report consists of three main topics:\n\nIdentifying missing and inconsistent fields\nDetecting and handling out-of-stocks\nObserving sales distribution and detecting outliers\nWe will dig into the details of each one of these topics in the following sections.\n\n1. Identify missing and inconsistent fields\n\n1.1 Missing fields\n\nThough there is a lot of meaningful information in the sales information of Olist, there are missing features that are essential to tackle our price optimization problem. For example, we don't have information about the manufacturer's suggested retail price (MSRP) of the items nor about its costs.\n\nLet's say that we want to optimize our prices to maximize profit. Let's take a look then to the definition of the profit function:\n\n\nQ represents the demand, p the price, and c the cost, and the sum is performed over our universe of items (N). As you can see, the cost is needed to calculate the profit, and so is to perform the optimization. If we don't have this information, we can assume a zero cost for the items, and then optimize our prices to maximize revenue instead of profit:\n \n\nOn the other hand, the MSRP can be used to compare the selling prices against the manufacturer's suggestion. It can be useful when analyzing each item's historical pricing strategy and its effects on the demand.\n\nAlso, we don't have much information about the items themselves. We lack information such as the item's description, images, name, sub-category, or brand, which are of great help to relate and cluster similar items. Nevertheless, as we will see, we can still get meaningful results.\n\n1.2 Inconsistent fields\n\nWe can also run a series of inconsistency checks between the features. For example, if we have the sales data aggregated by date and not by order or basket, we can check if the \"Total Sales\" for a particular item/date is equal to the number of units sold times the price.\n\nWe would all expect this to be true, but what if:\n\nthere have been refunds on that given date\nsome units were sold with a privileged discount\nsome customer paid with a different currency\nWe need to make sure that all of these details are correctly recorded, and if not, try to detect these anomalies automatically to take specific actions.\n\nMoreover, many times retailers have business-specific rules which can be checked in a data-driven way to make sure they are being applied or not. Does any of the following business rules sound familiar to you?\n\nPrices for the same item must be consistent across stores\nNo more than 40% discount from MSRP is allowed\nSell price must always be lower than the competitor's price\nDifferent price promotions cannot overlap\nWith our anomaly detection report, you can specify the business rules, and we can check these quickly, easily, and automatically.\n\nGoing back to the Olist sample dataset, we don't have information about the MSRP, the cost of each item, or any specific business rule. However, we could still check for certain general inconsistent fields such as the following:\n\nDo we have negative values on sales units or prices?\n\nNegative values screenshot\nIs the same item sold at different prices on the same date?\n\nDifferent prices screenshot\n2. Detecting and handling out-of-stocks\n\nAnother chapter in our anomaly detection report for retailers consists of detecting out-of-stock issues. This is very important because of two main reasons:\n\nRevenue loss: having out-of-stock issues implies losing sales and sometimes even clients. Therefore, pricing and inventory management decisions have to be taken jointly.\nDemand curve estimation: to correctly estimate a demand curve for each item/store, we need to identify the causes for which we observe zero sales for certain items in certain moments. In particular, we need to distinguish between zero demand and zero supply.\nIdeally, we could incorporate inventory data and flag days as out-of-stock if the stock level for a specific item is equal to zero. Nevertheless, from our experience working with retailers, we have learned that inventory data is not necessarily 100% accurate. For example, in a brick-and-mortar store, we may observe zero sales for consecutive days, even when the inventory data suggests there should be five or even ten items in-store. Still, these items are not properly displayed to the public.\n\nTherefore, we have designed different heuristics according to the client's data to correctly identify out-of-stock days and better estimate the demand curves. Some of these heuristics are complex and client-specific, while others are simple and straightforward.\n\nAs a preliminary approach, we can identify per item/store long gaps of zero sales and assume these are out-of-stock problems. This definition of out-of-stock will depend on the consecutive zeros threshold we define. For example, we can determine the consecutive zeros threshold as 30 days; this is to say that an item will be considered as out-of-stock if there are no sales in 30 consecutive days. We can visualize the out-of-stock periods as the shadowed area on the following plot:\n\nOut of stock report.\nGiven this definition, we can estimate the revenue lost as a consequence of out-of-stock problems, e.g., assuming that the sales during the out-of-stock periods would have been equal to an average date.\n\nFurthermore, we could also download a report with the estimated revenue lost for each item/seller on each out-of-stock period. The Out-of-Stock Report would look something like this:\n\n\nOut of stock analysis.\nEven though this report gives us some insights into the impact that out-of-stocks is having on our revenue, these are just preliminary estimations. To better approximate the revenue loss and to identify the out-of-stock days better, we can use the same Machine Learning models we use to perform the forecast of the demands.\n\nLet's picture this with an example. Imagine our Machine Learning model estimates that 53 units would've been sold of a specific item, on a particular date, given a certain price and conditions, and we observe that only 1 unit has been sold. This would raise an alert since it could be that a specific item has potentially run out-of-stock.\n\nFinally, but not least important, once the Machine Learning models are trained, they can also be used in a proactive way instead of reactive. Several weeks in advance, we could estimate how many units will be sold and generate stock replenishment recommendations to avoid out-of-stock issues.\n\n3. Observe sales distribution and detect outliers\n\nTo detect outliers, we can plot the daily sales distribution per seller or store by day of the week, month, or year. In particular, we can observe the distribution and detect outliers by generating different visualizations of the data.\n\nFor example, we can generate a boxplot to observe the distribution of units sold by day of the week. These plots are useful to understand how many units are sold on average, by a seller or store, per item on each category group.\n\nAs shown in the following figure, we can observe from the boxes sizes that, on average, all days have a similar behavior where most sellers per day sell only one unit per item. Nevertheless, there seem to be several outliers (dots that are far from the normal distribution). If we hover over the plot, for example, on the point which is farther away on the Friday category, we can see that it corresponds to a particular seller who sold 25 units for a specific item on Friday, 24th November 2017.\n\nDistribution per day chart\nDoes that date ring a bell to you? Why could you sell a totally different amount of units than what you usually sell? BINGO! That was Black Friday!\n\nEven though we were considering national holidays as features for our demand estimation, from the data, we have learned that it may be relevant to include Black Friday as a special date.\n\nHowever, we may also wonder what happened on Black Friday in other years? Was it also the date with the highest sales in 2018? Or what about Black Friday for other sellers? Did their sales increase dramatically too? By having a look at the boxplot by year, we can check what happened in 2018:\n\nDistribution per year chart\nFor 2016, we have almost no observations (data started in September 2016), but if we compare 2017 and 2018, the distribution seems very similar. Nevertheless, the seller who sold the maximum amount of units per item in 2018 was a different seller and did so on a different date. In particular, it was on May 8, 2018. What could have happened on that specific date?\n\nWhen looking in detail at the list of holidays and events for Brazil in 2018, we can see that May 8 corresponds to the week before Sunday of Mother's Day (May 13). Also, this makes total sense when we see that the category of the item is marked as \"watches gifts.\"\n\nTaking consideration of holidays is important to detect and classify the outliers into different groups, i.e., those potentially related to errors in the data ingestion process \u2014and that need to be treated as misleading\u2014 and those strictly associated with special events.\n\nFor sure, we need to continue exploring the data in order to gain more insights. We could, for example, observe the location for that particular outlier seller and check the sales distribution of the nearest stores to see whether it was one specific event that affected several stores or just this one. Or we could check if a sudden decrease in price can explain the increase in sales by plotting both variables together, as shown in the following plot.\n\nSales units and price chart\nFollowing our intuition, this sudden increase in sales may be explained by a decrease in price. But we need to continue thinking of possible features that could help us better understand and explain the price elasticity of demand. This is where our Machine Learning models come into play. By including several different features, we can learn from the data which variables are mostly affecting the demand for each item. By modeling this particular item, we would have learned that it is not only the reduction in price that explained the increase in sales but also the days until Mother's Day, the day of the week, and several other variables.\n\nFinally, it's important to stress the relevance of business expertise to help point out those relevant dates when you know sales go up or down. The Machine Learning models help you validate and test new hypotheses.\n\n4. Modeling the demand\nHaving our data ready, we proceed to train the demand forecasting models. Our goal is to be able to build the demand curve for each item. The models will learn the relation between the demand and several factors such as the price, holidays/events, and macroeconomics.\n\nDuring this phase, the collaboration with the business team is particularly important. Business insights help to validate the selected features and ensures that we are not missing any important aspect that can be potentially used to forecast the demand.\n\nBelow we show an example of the demand curve for an item in the garden tools category.\n\n\nDemand curve chart\nThis plot shows the predicted demand curve (blue line) for a particular date (April 29, 2018). The vertical green lines are the historically tested prices (from where our model learned the price elasticity). The red line shows the expected profit (assuming item cost equal to zero, as we don't have this information).\n\nWhat insights do we get from the estimated demand curve? In this case, we can see that between $50 and $60, the number of units sold would be equal to 14, meaning that in this particular price range and week, the demand curve is completely inelastic. Nevertheless, if we increase the price somewhere close to $63, the number of units sold will decrease, one unit less will be sold but still obtain higher revenue. According to our model estimations, without having a retailer and business background, we would recommend starting with a price between $63 and $70 to explore how the demand behaves on this point.\n\nOnce we have information about the demand using this price, we can use the latest information to feed the model and then decide which is the following price to set. The advantage of trying prices far from the historically tested ones would be that we may find a new optimum price (exploring). Still, trying prices that have never been tested before brings more uncertainty as the accuracy of the model in those regions is lower.\n\nMoreover, explainability is a desired feature of the models in price optimization. Most of the time, it provides new business insights to retailers and validates that the model is learning correctly. Our models can tell us which features were the most relevant to fit the data. Below we show the top-10 feature importance for the model associated with our garden tool item.\n\n\nFeature importance chart.\nThe most important features according to the model are:\n\nDays until Black Friday\nItem ID\nConsumer Confidence Index\nTotal imports in USD\nWeek number of the year\nMonth number of the year\nDays until Spring\nDays until Winter\nSeller ID\nSell price\nAs you can see, Black Friday is the most important event for this retailer. This means that each time the feature was used, the model achieves a greater increase in its accuracy. Also, the Consumer Confidence Index (CCI) and the total imports in USD (Imports) are the most relevant macroeconomic factors for this particular item. Furthermore, the beginning of Winter and Spring seasons also play an important role in explaining the demand for our garden tool item.\n\n5. Experimental setup\nOnce we have our demand curves, we need to decide what exact prices we will try. This is to say, we have to plan what concrete actions we will take, and these actions or strategies will depend on our experimental setup. Three main questions need to be answered to define the experimental setup:\n\n1. How are we going to measure the profit gain for the business as usual?\n\nWe need to define a measure against which we will be compared in order to evaluate the progress of our pricing system. We can have different options here, for example:\n\nControl store: we can start by comparing our store's performance against a control one, that will keep using the same business-as-usual pricing strategy. This scenario usually applies to brick-and-mortar stores.\n\nSynthetic control: when we lack a reference or control metric to compare against, we can synthesize one using other available metrics. This is known as synthetic control and provides us a way to measure the profit gain with respect to the business-as-usual even when there is no clear control measure. This method can be used for both e-commerce and brick-and-mortar retail companies.\n\n2. Which price change frequency is best suited in my case?\n\nWe need to define the frequency of the price changes, that is if we are going to change the prices hourly, daily, weekly, etc. This will depend on two major factors:\n\nIs your business Brick-and-mortar or e-commerce?\n\nBrick-and-mortar stores, due to the operational limitations of physically changing price tags, usually use weekly or bi-weekly frequencies.\n\nSome of these companies use digital price tags that remove this limitation and let them change prices more frequently. However, if the frequency is too high, customers may feel that the prices in the store are unfair, so this should be taken cautiously.\n\nOn the other hand, e-commerce retailers usually select their prices more dynamically, changing prices in the hour-daily range. Often, customers are more likely to accept these higher pricing frequencies since they see this behavior in big players like Amazon.\n\nHow accurate are our models for each price change frequency?\n\nThe other major factor to consider is the ability of the models to accurately predict the sales units in different time windows. This will depend on various factors, such as the shapes of the demands and the amount of historically price changes, as we discussed in the previous section.\n\nSo as a general rule, we should choose the frequency in which our models are accurate enough to perform the predictions, and that maximizes our objective function (e.g., profit) while remaining operationally viable.\n\n3. What will be the Exploration vs. Exploitation strategy.\n\nDuring exploration, we try new unseen prices that will add crucial information to the data about the shape of the demand curve. On the other hand, during exploitation, we exploit the information that we already know from the demand curves and try to maximize our objective function.\n\nDifferent strategies can be applied depending on the necessities of the retailer and the quality of the data regarding historical prices. In the Olist dataset, we have 51 items ready to start exploitation. The rest of the items have very few prices tried or very few historical sales and will require some level of exploration.\n\nHere below, we list some recommendations that are useful when defining the exploration strategy:\n\nIf we count on the MSRP, start at this price, and make variations around this price to see how the demand behaves.\nIf not, start setting a price near the ones that have been historically tried and make variations around this price to see how the demand behaves.\nWith the business team, set a reasonable threshold to the upper and lower boundaries of the exploration price range.\n6. Optimization\nOnce all these questions have been answered, you have everything set to optimize your pricing strategy, as many other retailers are already doing.\n\nOptimizing your pricing strategy means defining your function to be maximized and consider any restrictions to include as constraints. In most scenarios, the objective function to be maximized will be the overall profit subject to the stock level or sales velocity preferred. However, you may be willing to maximize sales volume or customer\u2019s lifetime value. The definition of the optimization problem must be clearly defined with the client.\n\nAll in all, we show the opportunity summary report that showcases several promising lines of work.\n\nOpportunity summary report\n\nFinal thoughts and concluding remarks\nWe have shown how a typical price optimization pipeline would look like, and how we would asses the potential opportunities for a pricing model.\n\nIn this particular example, working with the publicly available sales data of the Brazilian marketplace Olist and with no further information nor insights from their side, we have shown that there is room for improvement regarding their pricing decisions.\n\nCombining their available dataset and other publicly available information, such as holidays/events data and macroeconomic series, we have been able to estimate demand curves for a subset of the items which would allow them to take optimal pricing decisions. The remaining items should undergo an exploration phase where new prices would be tried in order to be able to estimate their demand curves accurately. The exploration strategy is generally decided jointly with the client.\n\nFor this example, given the data available and our past experience, we would suggest performing weekly price changes during the exploration phase. Furthermore, we have shown that there is plenty of room for inventory management improvement since there seems to be an important amount of revenue lost due to understock.\n\nFinally, during this analysis, we have focused on working with items that have been sold several times during the two years time span. Nevertheless, it is essential to note that there are thousands of items that have been sold only once. For those items, we could think of a price automation solution that could also help your company boost its profit.\n\nIn summary, we have shown that price optimization results are immediate and disruptive for companies using traditional pricing. We have also demonstrated that demand forecasting can also be used in other areas of operations, including assortment optimization, marketing, and inventory planning. In other words, investing in a good demand forecasting system opens the door for new opportunities that can be developed component by component based on business priorities.\n\nWe can show you how this works, and help you find the opportunities you may be missing.", "questions": ["What data is used during the analysis?", "Is it public?", "What is the most important dataset needed for starting a price optimization project?", "What it contains?", "Which structures exist to store the sales information?", "What other data sources should I consider when collecting data?", "What do you mean by macroeconomic data?", "What is the first indicator?", "What is anomaly detection?", "Why do pricing and inventory management decisions need to be made together?", "Why would I test prices far from the prices used historically?", "How's this method called?", "What's a disadvantage of this method?", "In this case, what is the most important feature?", "What's the difference between exploration and exploitation?"], "answers": {"input_text": ["Brazilian marketplace Olist ", "Yes", "Sales dataset", "The sales information of the business segment that the company wants to optimize the pricing strategy for.", "By sales transactions and by daily aggregation.", "Items information, macroeconomics, stores or seller information, holidays and events, item's reviews, competitors price, customer traffic and other business-specific information", "Different indicators such as GDP, unemployment rate, consumer confidence and currency exchange.", "I can't answer that", "Identifying the differences, deviations, and exceptions from the norm in the data", "having out-of-stock issues implies losing sales and clients", "We may find a new optimum price", "Exploring", "Brings more uncertainty as the accuracy of the model in those regions is lower", "Days until Black Friday", "During exploration, we try new unseen prices that will add crucial information to the data. During exploitation, we exploit the information that we already know."]}}, {"conversation_id": "15", "context_id": "3", "story": "Automatically measuring soccer ball possession with AI and video analytics\n\nQatar 2022 World Cup is just around the corner and in Tryolabs everybody is excited to have their national team compete. As the teams prepare for the event, they more than ever rely on AI-assisted sports analytics for inspecting their performance based on both recordings of their previous games and real-time information delivered to the coach during the matches.\n\nAI is being used to identify patterns that lead to success and to compute indicators that give coaches objective numbers over which to optimize to maximize their team\u2019s chances of scoring. One such indicator is ball possession.\n\nLeveraging Norfair, our open-source tracking library, we put our video analytics skills to the test and wrote some code to showcase how AI can compute ball possession by team automatically by watching a video of a match. Here is how it looks in a snippet of a match between Chelsea and Manchester City:\n\nWe also made it open source on GitHub. You can inspect every single part of the code, and use it as a base to build something else!\n\nIn this post, we are going to dig a bit deeper into what exactly ball possession is, explain how this system was built from simpler components and how these work.\n\nDefining ball possession\nBall possession is one of the most important statistics in a soccer game. According to a 2020 study conducted over 625 UEFA Champions League matches, teams with more ball possession won 49.2%, drew 22.0%, and lost 28.7% of the matches overall, exceeding the winning rates of their rivals. This effect was even greater when the gap of ball possession percentage between two teams in a match was higher.\n\nThere is definitely something in this number. The team with a higher possession is more likely controlling the match.\n\nFrom \"Relationship between ball possession and match outcome in UEFA Champions League\u201d, 2020.\nWe now know how important ball possession is to soccer analytics. But what exactly is it? How is it computed? There are actually two methods that yield different results.\n\nMethod 1: based on calculating passes\nThe first method is to consider ball possession in relation to the number of passes. The possession metric would be the number of passes on each team during the match, divided by the total number of passes in the game.\n\nWhile straightforward to compute if counting passes manually (with a clicker), this method has a big drawback: it doesn\u2019t account for the total time that players have control of the ball.\n\nMethod 2: based on time\nAnother method that is widely used consists of controlling a clock manually for each team. A person has to be in charge of starting a timer when a team gets a hold of the ball and stopping it when they lose it. The timers also need to be paused when the game is stopped. While accurate, this approach needs the scorekeeper to be very focused for the duration of the game. It\u2019s also susceptible to human errors. Forgetting to turn on or off the clock can mess up the metrics (and you don\u2019t want that!).\n\nUsing AI to compute ball possession\nDo we really need to burden someone with this task in this day and age?\n\nTurns out that with deep learning and computer vision techniques it should be very feasible to automate this. For the remainder of this post, we will understand ball possession as the % of the time that each team has the ball (method 2).\n\nIn a nutshell, we will build a clock powered by AI instead of a human.\n\nDivide & conquer: the steps\nLet\u2019s start by breaking down the problem. We are going to need to:\n\nGet a few videos to test.\nDetect important objects.\nDetect the players.\nDetect the ball.\nIdentify each player\u2019s team.\nDetermine which player has the ball at all times.\nThere are also some \u201cnice to haves\u201d:\n\nDrawing the trailing path of the ball during the match (for better visual inspection, remember we are doing sports analytics here!).\nDetecting and drawing passes completed by each team.\nNow that we know what we need to do, let\u2019s see how we can develop each step.\n\nStep 1: Getting a few videos to test\nThere is no better way to test our algorithm than on live soccer footage videos. However, we can\u2019t just use any live soccer footage video, we have to be sure that the video does not change between different cameras. Fortunately, this can be easily accomplished by trimming a live soccer match from Youtube.\n\nWe searched for a few matches with different teams to test our project in different situations. We are going to test our project in the following matches:\n\nChelsea vs Manchester City.\nReal Madrid vs Barcelona.\nFrance vs Croatia \u2014 FIFA World Cup Final 2018.\nStep 2: Detecting objects\nWhat is a soccer match without players and a ball? Just a referee running from one side of the field to the other \ud83e\udd41.\n\nIt\u2019s crucial to know the location of the players and the ball before ever thinking about calculating the ball possession. To do this, we need two deep learning models: one for detecting the players and another for detecting the ball.\n\nDetecting players\nWhat would have been a challenging problem a decade ago is nothing but a few lines of code in 2022. There are multiple pre-trained models we can use to detect people. In this case, we will use YOLOv5 trained in the COCO dataset. This will get us bounding boxes around each player, and confidence scores. Here are the results:\n\nDetecting players using YOLOv5.\nDetecting the ball\nThe COCO dataset has a specific label for sports balls, but the results we got using a model trained using it were not good enough for this type of live soccer footage so we had to come up with a different approach.\n\nIn order to improve the results for our specific task, we finetuned the YOLOv5 detector trained in COCO with a specific dataset of balls containing only footage from live soccer matches.\n\nBall detection using YOLOv5 and a custom-built dataset.\nThe dataset was created with footage of soccer videos with a similar camera view, and the labeling was made with an open-source labeling tool named LabelImg. The finetuning process was made following the instructions of the official YOLOv5 repository.\n\nPlease note that this is by no means a fully robust ball detection model, the development of which is outside the scope of this blog post. Our resulting model mostly works on the videos we used for these demos. If you want to run it on other videos, the model will need more finetuning over labeled data.\n\nPutting things together\nGreat! We now have two models that we can use as if it was only one model by adding both of the detections. The result is a perfect model for soccer video analytics.\n\nDetection of players and ball.\nStep 3: Identifying player\u2019s teams\nHow can we know which team has the ball if we don\u2019t know the team of each player? We need to have a stage that takes the players\u2019 detection as inputs and outputs the detections with their respective classifications.\n\nHere, there are several approaches we could follow, each one with its pros and cons. In this section, we will restrict ourselves to two simple ones and leave out more complex techniques such as clustering, siamese networks, or contrastive learning (we encourage you to try these, though!).\n\nApproach 1: Neural network based on jersey\nOne approach is to train a neural network for image classification based on the team's jersey. The dataset can be generated by running a video with a player detection model and saving the crops of the detections as a dataset for training. The labeling can be easily done from a Jupyter Notebook with a tool like pigeonXT.\n\nA neural network could be a good approach when there are complex scenarios to classify. For example, to distinguish between jerseys with similar colors. Or when there are occlusions of players. However, this advantage comes with a cost. This approach requires you to create a dataset and train the model for every match that you want to analyze. This can be daunting if you would want to be able to analyze the ball possession in many different soccer matches.\n\nApproach 2: Color filtering with HSV\nAs in most sports, different teams are expected to use easily distinguishable jerseys for a match, so what if we can leverage that information using some classical computer vision?\n\nLet's take an example of a match between Chelsea and Manchester City. Here we have four distinctive colors:\n\nClassification\tJersey Color\nChelsea player\tBlue\nManchester City player\tSky Blue\nChelsea Goalkeeper\tGreen\nReferee\tBlack\nNote that there is no color for the Manchester City Goalkeeper. This is because his jersey is black, and would therefore have the same color as the referee's. We didn't claim this approach could cover every single case :)\n\nFor each color, we created an HSV filter that will tell us how many pixels of that color the image has.\n\nThe reason we chose to filter with HSV values instead of RGB is that HSV filtering is more robust to lighting changes. By adjusting only the hue value, you are choosing what type of color you want to keep (blue, red, yellow, etc), independent of how dark or light the color is.\n\nBefore filtering the colors, the player image is cropped in order to keep just the jersey. The crop is a specific percentage of the image. This helps to reduce unnecessary information for the classification algorithm. For example, in this image taken from a bounding box, the cropped image removes most of the pixels of the unwanted player, but still keeps most of the information of the desired one.\n\n\nCropping to the player\u2019s jersey.\nFor each color range, we created a filter that keeps the pixels in the color range and sets the rest to black.\n\nThe cropped image is then fed to the 4 color filters specified before. The output of the 4 filters is then passed through a median blur filter that will be in charge of removing unwanted noise.\n\nFor each output, we count the number of pixels that passed through the filter (i.e. the non-black ones). The filter with the highest count will give us the team that represents the player!\n\nThe following animations show the described process:\n\nHSV Classifier classifying a Manchester City player.\nHSV Classifier classifying a Chelsea player.\nHSV Classifier classifying the Chelsea Goalkeeper.\nIf a team has more than one color, the sum of the non-black pixels of all the corresponding colors will be taken into account for the comparison.\n\nFor more details on how the classifier is implemented, please take a look at this file.\n\nImproving classification with inertia\nOur HSV classifier works well\u2026 most of the time. Occlusions and imprecisions of the bounding boxes sometimes make the predictions unstable. In order to stabilize them, we need tracking. By introducing Norfair, we can link players\u2019 bounding boxes from one frame to the next, allowing us to look not just at the current team prediction but past ones for the same player.\n\nLet\u2019s see Norfair in action:\n\nTracking players in a soccer match using Norfair.\nIt\u2019s common sense that a player shouldn't be able to change teams across the video. We can use this fact to our advantage. Therefore, the principle of inertia for classification states that a player's classification shouldn't be based only on the current frame but on the history of the nn previous classifications of the same object.\n\nFor example, if the classifier has set inertia equal to 1010, the player classification on frame ii will be decided by the mode of classification results from frames i-10i\u221210 to ii. This ensures that a subtle change in the classification due to noise or occlusion will not necessarily change the player\u2019s team.\n\nIdeally, infinite inertia would be great. However, the tracker can mix some ids of the players. In this case, if inertia is too large, it can take too much time for the classifier to start predicting the correct team of the player. An ideal value of inertia is not too big or small.\n\nHere are examples comparing the HSV classifier with and without inertia:\n\nHSV classifier without inertia.\nHSV classifier with inertia 20 on a 25 FPS video.\nStep 4: Determining the player with the ball\nThis is the final piece to our computation of ball possession. We need to decide who has the ball at all times.\n\nA simple method works by determining the distance from all the players to the ball. The closest player will be the one who has the ball. This is not infallible of course but mostly works fine.\n\nFor our demo, we defined the distance as the distance from the closest foot of the player to the center of the ball. For simplicity, we will consider the bottom left corner of the bounding box as the left foot of the player, and the bottom right corner of the bounding box as the right foot.\n\n\n\nDistance from the right foot to the center of the ball.\n\n\nDistance from the left foot to the center of the ball.\nOnce we know who the closest player to the ball is, we have to define a threshold to know if the player is near enough to the ball to be considered in possession of it.\n\n\n\nNo player with the ball, the closest player is too far from the ball.\n\n\nA player has the ball, the closest player is near the ball.\nFrom models and heuristics to possession\nWe now have everything that we needed in order to calculate the ball possession metrics. However, determining which team is in possession of the ball is not as simple as stating that it's the team of the player that has the ball in each frame. As with our team classification process, the algorithm for ball possession should also have some inertia.\n\nIn this case and given a team with the ball, inertia states how many consecutive frames the players from the other team should have the ball for us to consider that the possession changed. It\u2019s very important to consider consecutive frames. If there is an interruption, the inertia restarts.\n\nWithout inertia, the possession will change unnecessarily in events such as a rebound, or when the ball passes near a player that didn't have the ball.\n\nNo inertia: ball possession mistakenly changes during a rebound.\nInertia: ball possession doesn\u2019t change during the rebound.\nMaking our demo pretty\nThanks to the new release of Norfair, we can understand how the camera moves in a specific video. This information allowed us to draw the path of the ball in the exact location of the field even when the camera moved from one side of the field to the other.\n\nWe also developed an easy-customizable scoreboard that can be used with every possible team to keep track of the metrics in a fashionable manner.\n\nOur final result with every step together and our custom design looks like this:\n\nFinal result with the Tryolabs AI possession scoreboard.\nFreebie: detecting passes!\nOur approach so far can also give us more useful information to analyze a soccer match.\n\nWe can define a passing event as when the ball changes from one player to another one from the same team. As we have seen before, we have everything in place for this. Let\u2019s draw the passes and see how it looks!\n\nAI automatically marks passes in a soccer match.\nTrying it on different matches\nHere we have results over other video snippets:\n\nReal Madrid vs Barcelona.\n2018 FIFA World Cup Final \u2014 France vs Croatia.\nShow me the code!\n\nThe entire repository is available on GitHub.\n\nApart from the computation of possession, we also include code for drawing the passes of each team with arrows.\n\nCaveats\nIt\u2019s important to keep in mind that this project is only a demo with the purpose of showing what can be done with AI and video analytics in a short amount of time. It does not attempt to be a robust system and cannot be used on real matches. Much work remains to be done for that!\n\nSome weak points of the current system:\n\nThe ball detection model that we used is not robust. This can be easily improved by fine-tuning an object detection model on a better dataset.\nThe player detection model might not work very well when many players are near each other, especially when there is a significant occlusion.\nThe team jersey detection method can be improved.\nOur system breaks if the camera vantage point changes. Most live TV footage usually takes close-ups of players with different cameras.\nOur system doesn't detect special events in the match such as corners, free kicks, and injuries. In these events, the ball possession will still be calculated and there will be a difference between the calculation of the real ball possession and the one calculated with our algorithm. In order to perfect our system, we would need to detect these events and stop the clock accordingly.\nConclusion\nVideo analytics is a lot of fun, but it\u2019s no magic. With this blog post, we tried to shed some light on how some of these solutions are implemented behind the scenes and release the code for everyone to play with. We hope that you found it useful!\n\nAgain, by no means is this a perfect system. Professional development of these tools for sports analytics will likely need to use several high-precision cameras, larger custom datasets, and possibly recognizing the 3D position of the objects for improved accuracy.\n\nWhen faced with the development of a large system, we might feel daunted. We must remember that every complex system starts from simple components.", "questions": ["What are the differences between the two methods for defining ball possession?", "What cons does the first method have?", "What about the second one?", "What are the steps for building the system?", "How many matches are used to test the project?", "Which matches are those?", "Did YOLOv5 trained with the COCO dataset work well for detecting balls at first?", "How many approaches are there for identifying the player's teams?", "What makes the classification of the teams unstable?", "What tool did you use to solve this?", "How does it help in stabilizing the predictions?"], "answers": {"input_text": ["The first method considers ball possession in relation to the number of passes on each team. The second one consists of starting a timer when a team gets a hold of the ball and stopping it when they lose it.", "It doesn't account for the total time that players have control of the ball.", "It needs the scorekeeper to be very focused for the duration of the game and it's susceptible to human errors.", "Get a few videos to test, detect the players and the ball, identify each player's team, and determine which player has the ball at all times", "three", "Chelsea vs Manchester City, Real Madrid vs Barcelona and France vs Croatia", "No", "two", "Occlusions and imprecisions of the bounding boxes", "Norfair", "Linking players' bounding boxes from one frame to the next"]}}, {"conversation_id": "16", "context_id": "3", "story": "Automatically measuring soccer ball possession with AI and video analytics\n\nQatar 2022 World Cup is just around the corner and in Tryolabs everybody is excited to have their national team compete. As the teams prepare for the event, they more than ever rely on AI-assisted sports analytics for inspecting their performance based on both recordings of their previous games and real-time information delivered to the coach during the matches.\n\nAI is being used to identify patterns that lead to success and to compute indicators that give coaches objective numbers over which to optimize to maximize their team\u2019s chances of scoring. One such indicator is ball possession.\n\nLeveraging Norfair, our open-source tracking library, we put our video analytics skills to the test and wrote some code to showcase how AI can compute ball possession by team automatically by watching a video of a match. Here is how it looks in a snippet of a match between Chelsea and Manchester City:\n\nWe also made it open source on GitHub. You can inspect every single part of the code, and use it as a base to build something else!\n\nIn this post, we are going to dig a bit deeper into what exactly ball possession is, explain how this system was built from simpler components and how these work.\n\nDefining ball possession\nBall possession is one of the most important statistics in a soccer game. According to a 2020 study conducted over 625 UEFA Champions League matches, teams with more ball possession won 49.2%, drew 22.0%, and lost 28.7% of the matches overall, exceeding the winning rates of their rivals. This effect was even greater when the gap of ball possession percentage between two teams in a match was higher.\n\nThere is definitely something in this number. The team with a higher possession is more likely controlling the match.\n\nFrom \"Relationship between ball possession and match outcome in UEFA Champions League\u201d, 2020.\nWe now know how important ball possession is to soccer analytics. But what exactly is it? How is it computed? There are actually two methods that yield different results.\n\nMethod 1: based on calculating passes\nThe first method is to consider ball possession in relation to the number of passes. The possession metric would be the number of passes on each team during the match, divided by the total number of passes in the game.\n\nWhile straightforward to compute if counting passes manually (with a clicker), this method has a big drawback: it doesn\u2019t account for the total time that players have control of the ball.\n\nMethod 2: based on time\nAnother method that is widely used consists of controlling a clock manually for each team. A person has to be in charge of starting a timer when a team gets a hold of the ball and stopping it when they lose it. The timers also need to be paused when the game is stopped. While accurate, this approach needs the scorekeeper to be very focused for the duration of the game. It\u2019s also susceptible to human errors. Forgetting to turn on or off the clock can mess up the metrics (and you don\u2019t want that!).\n\nUsing AI to compute ball possession\nDo we really need to burden someone with this task in this day and age?\n\nTurns out that with deep learning and computer vision techniques it should be very feasible to automate this. For the remainder of this post, we will understand ball possession as the % of the time that each team has the ball (method 2).\n\nIn a nutshell, we will build a clock powered by AI instead of a human.\n\nDivide & conquer: the steps\nLet\u2019s start by breaking down the problem. We are going to need to:\n\nGet a few videos to test.\nDetect important objects.\nDetect the players.\nDetect the ball.\nIdentify each player\u2019s team.\nDetermine which player has the ball at all times.\nThere are also some \u201cnice to haves\u201d:\n\nDrawing the trailing path of the ball during the match (for better visual inspection, remember we are doing sports analytics here!).\nDetecting and drawing passes completed by each team.\nNow that we know what we need to do, let\u2019s see how we can develop each step.\n\nStep 1: Getting a few videos to test\nThere is no better way to test our algorithm than on live soccer footage videos. However, we can\u2019t just use any live soccer footage video, we have to be sure that the video does not change between different cameras. Fortunately, this can be easily accomplished by trimming a live soccer match from Youtube.\n\nWe searched for a few matches with different teams to test our project in different situations. We are going to test our project in the following matches:\n\nChelsea vs Manchester City.\nReal Madrid vs Barcelona.\nFrance vs Croatia \u2014 FIFA World Cup Final 2018.\nStep 2: Detecting objects\nWhat is a soccer match without players and a ball? Just a referee running from one side of the field to the other \ud83e\udd41.\n\nIt\u2019s crucial to know the location of the players and the ball before ever thinking about calculating the ball possession. To do this, we need two deep learning models: one for detecting the players and another for detecting the ball.\n\nDetecting players\nWhat would have been a challenging problem a decade ago is nothing but a few lines of code in 2022. There are multiple pre-trained models we can use to detect people. In this case, we will use YOLOv5 trained in the COCO dataset. This will get us bounding boxes around each player, and confidence scores. Here are the results:\n\nDetecting players using YOLOv5.\nDetecting the ball\nThe COCO dataset has a specific label for sports balls, but the results we got using a model trained using it were not good enough for this type of live soccer footage so we had to come up with a different approach.\n\nIn order to improve the results for our specific task, we finetuned the YOLOv5 detector trained in COCO with a specific dataset of balls containing only footage from live soccer matches.\n\nBall detection using YOLOv5 and a custom-built dataset.\nThe dataset was created with footage of soccer videos with a similar camera view, and the labeling was made with an open-source labeling tool named LabelImg. The finetuning process was made following the instructions of the official YOLOv5 repository.\n\nPlease note that this is by no means a fully robust ball detection model, the development of which is outside the scope of this blog post. Our resulting model mostly works on the videos we used for these demos. If you want to run it on other videos, the model will need more finetuning over labeled data.\n\nPutting things together\nGreat! We now have two models that we can use as if it was only one model by adding both of the detections. The result is a perfect model for soccer video analytics.\n\nDetection of players and ball.\nStep 3: Identifying player\u2019s teams\nHow can we know which team has the ball if we don\u2019t know the team of each player? We need to have a stage that takes the players\u2019 detection as inputs and outputs the detections with their respective classifications.\n\nHere, there are several approaches we could follow, each one with its pros and cons. In this section, we will restrict ourselves to two simple ones and leave out more complex techniques such as clustering, siamese networks, or contrastive learning (we encourage you to try these, though!).\n\nApproach 1: Neural network based on jersey\nOne approach is to train a neural network for image classification based on the team's jersey. The dataset can be generated by running a video with a player detection model and saving the crops of the detections as a dataset for training. The labeling can be easily done from a Jupyter Notebook with a tool like pigeonXT.\n\nA neural network could be a good approach when there are complex scenarios to classify. For example, to distinguish between jerseys with similar colors. Or when there are occlusions of players. However, this advantage comes with a cost. This approach requires you to create a dataset and train the model for every match that you want to analyze. This can be daunting if you would want to be able to analyze the ball possession in many different soccer matches.\n\nApproach 2: Color filtering with HSV\nAs in most sports, different teams are expected to use easily distinguishable jerseys for a match, so what if we can leverage that information using some classical computer vision?\n\nLet's take an example of a match between Chelsea and Manchester City. Here we have four distinctive colors:\n\nClassification\tJersey Color\nChelsea player\tBlue\nManchester City player\tSky Blue\nChelsea Goalkeeper\tGreen\nReferee\tBlack\nNote that there is no color for the Manchester City Goalkeeper. This is because his jersey is black, and would therefore have the same color as the referee's. We didn't claim this approach could cover every single case :)\n\nFor each color, we created an HSV filter that will tell us how many pixels of that color the image has.\n\nThe reason we chose to filter with HSV values instead of RGB is that HSV filtering is more robust to lighting changes. By adjusting only the hue value, you are choosing what type of color you want to keep (blue, red, yellow, etc), independent of how dark or light the color is.\n\nBefore filtering the colors, the player image is cropped in order to keep just the jersey. The crop is a specific percentage of the image. This helps to reduce unnecessary information for the classification algorithm. For example, in this image taken from a bounding box, the cropped image removes most of the pixels of the unwanted player, but still keeps most of the information of the desired one.\n\n\nCropping to the player\u2019s jersey.\nFor each color range, we created a filter that keeps the pixels in the color range and sets the rest to black.\n\nThe cropped image is then fed to the 4 color filters specified before. The output of the 4 filters is then passed through a median blur filter that will be in charge of removing unwanted noise.\n\nFor each output, we count the number of pixels that passed through the filter (i.e. the non-black ones). The filter with the highest count will give us the team that represents the player!\n\nThe following animations show the described process:\n\nHSV Classifier classifying a Manchester City player.\nHSV Classifier classifying a Chelsea player.\nHSV Classifier classifying the Chelsea Goalkeeper.\nIf a team has more than one color, the sum of the non-black pixels of all the corresponding colors will be taken into account for the comparison.\n\nFor more details on how the classifier is implemented, please take a look at this file.\n\nImproving classification with inertia\nOur HSV classifier works well\u2026 most of the time. Occlusions and imprecisions of the bounding boxes sometimes make the predictions unstable. In order to stabilize them, we need tracking. By introducing Norfair, we can link players\u2019 bounding boxes from one frame to the next, allowing us to look not just at the current team prediction but past ones for the same player.\n\nLet\u2019s see Norfair in action:\n\nTracking players in a soccer match using Norfair.\nIt\u2019s common sense that a player shouldn't be able to change teams across the video. We can use this fact to our advantage. Therefore, the principle of inertia for classification states that a player's classification shouldn't be based only on the current frame but on the history of the nn previous classifications of the same object.\n\nFor example, if the classifier has set inertia equal to 1010, the player classification on frame ii will be decided by the mode of classification results from frames i-10i\u221210 to ii. This ensures that a subtle change in the classification due to noise or occlusion will not necessarily change the player\u2019s team.\n\nIdeally, infinite inertia would be great. However, the tracker can mix some ids of the players. In this case, if inertia is too large, it can take too much time for the classifier to start predicting the correct team of the player. An ideal value of inertia is not too big or small.\n\nHere are examples comparing the HSV classifier with and without inertia:\n\nHSV classifier without inertia.\nHSV classifier with inertia 20 on a 25 FPS video.\nStep 4: Determining the player with the ball\nThis is the final piece to our computation of ball possession. We need to decide who has the ball at all times.\n\nA simple method works by determining the distance from all the players to the ball. The closest player will be the one who has the ball. This is not infallible of course but mostly works fine.\n\nFor our demo, we defined the distance as the distance from the closest foot of the player to the center of the ball. For simplicity, we will consider the bottom left corner of the bounding box as the left foot of the player, and the bottom right corner of the bounding box as the right foot.\n\n\n\nDistance from the right foot to the center of the ball.\n\n\nDistance from the left foot to the center of the ball.\nOnce we know who the closest player to the ball is, we have to define a threshold to know if the player is near enough to the ball to be considered in possession of it.\n\n\n\nNo player with the ball, the closest player is too far from the ball.\n\n\nA player has the ball, the closest player is near the ball.\nFrom models and heuristics to possession\nWe now have everything that we needed in order to calculate the ball possession metrics. However, determining which team is in possession of the ball is not as simple as stating that it's the team of the player that has the ball in each frame. As with our team classification process, the algorithm for ball possession should also have some inertia.\n\nIn this case and given a team with the ball, inertia states how many consecutive frames the players from the other team should have the ball for us to consider that the possession changed. It\u2019s very important to consider consecutive frames. If there is an interruption, the inertia restarts.\n\nWithout inertia, the possession will change unnecessarily in events such as a rebound, or when the ball passes near a player that didn't have the ball.\n\nNo inertia: ball possession mistakenly changes during a rebound.\nInertia: ball possession doesn\u2019t change during the rebound.\nMaking our demo pretty\nThanks to the new release of Norfair, we can understand how the camera moves in a specific video. This information allowed us to draw the path of the ball in the exact location of the field even when the camera moved from one side of the field to the other.\n\nWe also developed an easy-customizable scoreboard that can be used with every possible team to keep track of the metrics in a fashionable manner.\n\nOur final result with every step together and our custom design looks like this:\n\nFinal result with the Tryolabs AI possession scoreboard.\nFreebie: detecting passes!\nOur approach so far can also give us more useful information to analyze a soccer match.\n\nWe can define a passing event as when the ball changes from one player to another one from the same team. As we have seen before, we have everything in place for this. Let\u2019s draw the passes and see how it looks!\n\nAI automatically marks passes in a soccer match.\nTrying it on different matches\nHere we have results over other video snippets:\n\nReal Madrid vs Barcelona.\n2018 FIFA World Cup Final \u2014 France vs Croatia.\nShow me the code!\n\nThe entire repository is available on GitHub.\n\nApart from the computation of possession, we also include code for drawing the passes of each team with arrows.\n\nCaveats\nIt\u2019s important to keep in mind that this project is only a demo with the purpose of showing what can be done with AI and video analytics in a short amount of time. It does not attempt to be a robust system and cannot be used on real matches. Much work remains to be done for that!\n\nSome weak points of the current system:\n\nThe ball detection model that we used is not robust. This can be easily improved by fine-tuning an object detection model on a better dataset.\nThe player detection model might not work very well when many players are near each other, especially when there is a significant occlusion.\nThe team jersey detection method can be improved.\nOur system breaks if the camera vantage point changes. Most live TV footage usually takes close-ups of players with different cameras.\nOur system doesn't detect special events in the match such as corners, free kicks, and injuries. In these events, the ball possession will still be calculated and there will be a difference between the calculation of the real ball possession and the one calculated with our algorithm. In order to perfect our system, we would need to detect these events and stop the clock accordingly.\nConclusion\nVideo analytics is a lot of fun, but it\u2019s no magic. With this blog post, we tried to shed some light on how some of these solutions are implemented behind the scenes and release the code for everyone to play with. We hope that you found it useful!\n\nAgain, by no means is this a perfect system. Professional development of these tools for sports analytics will likely need to use several high-precision cameras, larger custom datasets, and possibly recognizing the 3D position of the objects for improved accuracy.\n\nWhen faced with the development of a large system, we might feel daunted. We must remember that every complex system starts from simple components.", "questions": ["What technology was used for tracking the ball?", "Is it open-source?", "Is the team with higher ball possesion more likely to win?", "Why?", "According to what?", "What approaches are there for identifying the player's teams?", "Mention one disadvantage of the first approach", "Mention one advantage of the first approach", "What metric was used for determining who has the ball?", "How is this used for determining which player has the ball?", "What is one of the weak points of the system?"], "answers": {"input_text": ["Norfair", "Yes", "Yes", "Teams with more ball possession won 49.2%", "A 2020 study conducted over 625 UEFA Champions League matches.", "Neural network based on jersey and color filtering with HSV.", "It requires you to create a dataset and train the model for every match that you want to analyze", "It could be a good approach when there are complex scenarios to classify", "Distance from all the players to the ball", "The closest player will be the one who has the ball", "Many player near each other"]}}, {"conversation_id": "17", "context_id": "3", "story": "Automatically measuring soccer ball possession with AI and video analytics\n\nQatar 2022 World Cup is just around the corner and in Tryolabs everybody is excited to have their national team compete. As the teams prepare for the event, they more than ever rely on AI-assisted sports analytics for inspecting their performance based on both recordings of their previous games and real-time information delivered to the coach during the matches.\n\nAI is being used to identify patterns that lead to success and to compute indicators that give coaches objective numbers over which to optimize to maximize their team\u2019s chances of scoring. One such indicator is ball possession.\n\nLeveraging Norfair, our open-source tracking library, we put our video analytics skills to the test and wrote some code to showcase how AI can compute ball possession by team automatically by watching a video of a match. Here is how it looks in a snippet of a match between Chelsea and Manchester City:\n\nWe also made it open source on GitHub. You can inspect every single part of the code, and use it as a base to build something else!\n\nIn this post, we are going to dig a bit deeper into what exactly ball possession is, explain how this system was built from simpler components and how these work.\n\nDefining ball possession\nBall possession is one of the most important statistics in a soccer game. According to a 2020 study conducted over 625 UEFA Champions League matches, teams with more ball possession won 49.2%, drew 22.0%, and lost 28.7% of the matches overall, exceeding the winning rates of their rivals. This effect was even greater when the gap of ball possession percentage between two teams in a match was higher.\n\nThere is definitely something in this number. The team with a higher possession is more likely controlling the match.\n\nFrom \"Relationship between ball possession and match outcome in UEFA Champions League\u201d, 2020.\nWe now know how important ball possession is to soccer analytics. But what exactly is it? How is it computed? There are actually two methods that yield different results.\n\nMethod 1: based on calculating passes\nThe first method is to consider ball possession in relation to the number of passes. The possession metric would be the number of passes on each team during the match, divided by the total number of passes in the game.\n\nWhile straightforward to compute if counting passes manually (with a clicker), this method has a big drawback: it doesn\u2019t account for the total time that players have control of the ball.\n\nMethod 2: based on time\nAnother method that is widely used consists of controlling a clock manually for each team. A person has to be in charge of starting a timer when a team gets a hold of the ball and stopping it when they lose it. The timers also need to be paused when the game is stopped. While accurate, this approach needs the scorekeeper to be very focused for the duration of the game. It\u2019s also susceptible to human errors. Forgetting to turn on or off the clock can mess up the metrics (and you don\u2019t want that!).\n\nUsing AI to compute ball possession\nDo we really need to burden someone with this task in this day and age?\n\nTurns out that with deep learning and computer vision techniques it should be very feasible to automate this. For the remainder of this post, we will understand ball possession as the % of the time that each team has the ball (method 2).\n\nIn a nutshell, we will build a clock powered by AI instead of a human.\n\nDivide & conquer: the steps\nLet\u2019s start by breaking down the problem. We are going to need to:\n\nGet a few videos to test.\nDetect important objects.\nDetect the players.\nDetect the ball.\nIdentify each player\u2019s team.\nDetermine which player has the ball at all times.\nThere are also some \u201cnice to haves\u201d:\n\nDrawing the trailing path of the ball during the match (for better visual inspection, remember we are doing sports analytics here!).\nDetecting and drawing passes completed by each team.\nNow that we know what we need to do, let\u2019s see how we can develop each step.\n\nStep 1: Getting a few videos to test\nThere is no better way to test our algorithm than on live soccer footage videos. However, we can\u2019t just use any live soccer footage video, we have to be sure that the video does not change between different cameras. Fortunately, this can be easily accomplished by trimming a live soccer match from Youtube.\n\nWe searched for a few matches with different teams to test our project in different situations. We are going to test our project in the following matches:\n\nChelsea vs Manchester City.\nReal Madrid vs Barcelona.\nFrance vs Croatia \u2014 FIFA World Cup Final 2018.\nStep 2: Detecting objects\nWhat is a soccer match without players and a ball? Just a referee running from one side of the field to the other \ud83e\udd41.\n\nIt\u2019s crucial to know the location of the players and the ball before ever thinking about calculating the ball possession. To do this, we need two deep learning models: one for detecting the players and another for detecting the ball.\n\nDetecting players\nWhat would have been a challenging problem a decade ago is nothing but a few lines of code in 2022. There are multiple pre-trained models we can use to detect people. In this case, we will use YOLOv5 trained in the COCO dataset. This will get us bounding boxes around each player, and confidence scores. Here are the results:\n\nDetecting players using YOLOv5.\nDetecting the ball\nThe COCO dataset has a specific label for sports balls, but the results we got using a model trained using it were not good enough for this type of live soccer footage so we had to come up with a different approach.\n\nIn order to improve the results for our specific task, we finetuned the YOLOv5 detector trained in COCO with a specific dataset of balls containing only footage from live soccer matches.\n\nBall detection using YOLOv5 and a custom-built dataset.\nThe dataset was created with footage of soccer videos with a similar camera view, and the labeling was made with an open-source labeling tool named LabelImg. The finetuning process was made following the instructions of the official YOLOv5 repository.\n\nPlease note that this is by no means a fully robust ball detection model, the development of which is outside the scope of this blog post. Our resulting model mostly works on the videos we used for these demos. If you want to run it on other videos, the model will need more finetuning over labeled data.\n\nPutting things together\nGreat! We now have two models that we can use as if it was only one model by adding both of the detections. The result is a perfect model for soccer video analytics.\n\nDetection of players and ball.\nStep 3: Identifying player\u2019s teams\nHow can we know which team has the ball if we don\u2019t know the team of each player? We need to have a stage that takes the players\u2019 detection as inputs and outputs the detections with their respective classifications.\n\nHere, there are several approaches we could follow, each one with its pros and cons. In this section, we will restrict ourselves to two simple ones and leave out more complex techniques such as clustering, siamese networks, or contrastive learning (we encourage you to try these, though!).\n\nApproach 1: Neural network based on jersey\nOne approach is to train a neural network for image classification based on the team's jersey. The dataset can be generated by running a video with a player detection model and saving the crops of the detections as a dataset for training. The labeling can be easily done from a Jupyter Notebook with a tool like pigeonXT.\n\nA neural network could be a good approach when there are complex scenarios to classify. For example, to distinguish between jerseys with similar colors. Or when there are occlusions of players. However, this advantage comes with a cost. This approach requires you to create a dataset and train the model for every match that you want to analyze. This can be daunting if you would want to be able to analyze the ball possession in many different soccer matches.\n\nApproach 2: Color filtering with HSV\nAs in most sports, different teams are expected to use easily distinguishable jerseys for a match, so what if we can leverage that information using some classical computer vision?\n\nLet's take an example of a match between Chelsea and Manchester City. Here we have four distinctive colors:\n\nClassification\tJersey Color\nChelsea player\tBlue\nManchester City player\tSky Blue\nChelsea Goalkeeper\tGreen\nReferee\tBlack\nNote that there is no color for the Manchester City Goalkeeper. This is because his jersey is black, and would therefore have the same color as the referee's. We didn't claim this approach could cover every single case :)\n\nFor each color, we created an HSV filter that will tell us how many pixels of that color the image has.\n\nThe reason we chose to filter with HSV values instead of RGB is that HSV filtering is more robust to lighting changes. By adjusting only the hue value, you are choosing what type of color you want to keep (blue, red, yellow, etc), independent of how dark or light the color is.\n\nBefore filtering the colors, the player image is cropped in order to keep just the jersey. The crop is a specific percentage of the image. This helps to reduce unnecessary information for the classification algorithm. For example, in this image taken from a bounding box, the cropped image removes most of the pixels of the unwanted player, but still keeps most of the information of the desired one.\n\n\nCropping to the player\u2019s jersey.\nFor each color range, we created a filter that keeps the pixels in the color range and sets the rest to black.\n\nThe cropped image is then fed to the 4 color filters specified before. The output of the 4 filters is then passed through a median blur filter that will be in charge of removing unwanted noise.\n\nFor each output, we count the number of pixels that passed through the filter (i.e. the non-black ones). The filter with the highest count will give us the team that represents the player!\n\nThe following animations show the described process:\n\nHSV Classifier classifying a Manchester City player.\nHSV Classifier classifying a Chelsea player.\nHSV Classifier classifying the Chelsea Goalkeeper.\nIf a team has more than one color, the sum of the non-black pixels of all the corresponding colors will be taken into account for the comparison.\n\nFor more details on how the classifier is implemented, please take a look at this file.\n\nImproving classification with inertia\nOur HSV classifier works well\u2026 most of the time. Occlusions and imprecisions of the bounding boxes sometimes make the predictions unstable. In order to stabilize them, we need tracking. By introducing Norfair, we can link players\u2019 bounding boxes from one frame to the next, allowing us to look not just at the current team prediction but past ones for the same player.\n\nLet\u2019s see Norfair in action:\n\nTracking players in a soccer match using Norfair.\nIt\u2019s common sense that a player shouldn't be able to change teams across the video. We can use this fact to our advantage. Therefore, the principle of inertia for classification states that a player's classification shouldn't be based only on the current frame but on the history of the nn previous classifications of the same object.\n\nFor example, if the classifier has set inertia equal to 1010, the player classification on frame ii will be decided by the mode of classification results from frames i-10i\u221210 to ii. This ensures that a subtle change in the classification due to noise or occlusion will not necessarily change the player\u2019s team.\n\nIdeally, infinite inertia would be great. However, the tracker can mix some ids of the players. In this case, if inertia is too large, it can take too much time for the classifier to start predicting the correct team of the player. An ideal value of inertia is not too big or small.\n\nHere are examples comparing the HSV classifier with and without inertia:\n\nHSV classifier without inertia.\nHSV classifier with inertia 20 on a 25 FPS video.\nStep 4: Determining the player with the ball\nThis is the final piece to our computation of ball possession. We need to decide who has the ball at all times.\n\nA simple method works by determining the distance from all the players to the ball. The closest player will be the one who has the ball. This is not infallible of course but mostly works fine.\n\nFor our demo, we defined the distance as the distance from the closest foot of the player to the center of the ball. For simplicity, we will consider the bottom left corner of the bounding box as the left foot of the player, and the bottom right corner of the bounding box as the right foot.\n\n\n\nDistance from the right foot to the center of the ball.\n\n\nDistance from the left foot to the center of the ball.\nOnce we know who the closest player to the ball is, we have to define a threshold to know if the player is near enough to the ball to be considered in possession of it.\n\n\n\nNo player with the ball, the closest player is too far from the ball.\n\n\nA player has the ball, the closest player is near the ball.\nFrom models and heuristics to possession\nWe now have everything that we needed in order to calculate the ball possession metrics. However, determining which team is in possession of the ball is not as simple as stating that it's the team of the player that has the ball in each frame. As with our team classification process, the algorithm for ball possession should also have some inertia.\n\nIn this case and given a team with the ball, inertia states how many consecutive frames the players from the other team should have the ball for us to consider that the possession changed. It\u2019s very important to consider consecutive frames. If there is an interruption, the inertia restarts.\n\nWithout inertia, the possession will change unnecessarily in events such as a rebound, or when the ball passes near a player that didn't have the ball.\n\nNo inertia: ball possession mistakenly changes during a rebound.\nInertia: ball possession doesn\u2019t change during the rebound.\nMaking our demo pretty\nThanks to the new release of Norfair, we can understand how the camera moves in a specific video. This information allowed us to draw the path of the ball in the exact location of the field even when the camera moved from one side of the field to the other.\n\nWe also developed an easy-customizable scoreboard that can be used with every possible team to keep track of the metrics in a fashionable manner.\n\nOur final result with every step together and our custom design looks like this:\n\nFinal result with the Tryolabs AI possession scoreboard.\nFreebie: detecting passes!\nOur approach so far can also give us more useful information to analyze a soccer match.\n\nWe can define a passing event as when the ball changes from one player to another one from the same team. As we have seen before, we have everything in place for this. Let\u2019s draw the passes and see how it looks!\n\nAI automatically marks passes in a soccer match.\nTrying it on different matches\nHere we have results over other video snippets:\n\nReal Madrid vs Barcelona.\n2018 FIFA World Cup Final \u2014 France vs Croatia.\nShow me the code!\n\nThe entire repository is available on GitHub.\n\nApart from the computation of possession, we also include code for drawing the passes of each team with arrows.\n\nCaveats\nIt\u2019s important to keep in mind that this project is only a demo with the purpose of showing what can be done with AI and video analytics in a short amount of time. It does not attempt to be a robust system and cannot be used on real matches. Much work remains to be done for that!\n\nSome weak points of the current system:\n\nThe ball detection model that we used is not robust. This can be easily improved by fine-tuning an object detection model on a better dataset.\nThe player detection model might not work very well when many players are near each other, especially when there is a significant occlusion.\nThe team jersey detection method can be improved.\nOur system breaks if the camera vantage point changes. Most live TV footage usually takes close-ups of players with different cameras.\nOur system doesn't detect special events in the match such as corners, free kicks, and injuries. In these events, the ball possession will still be calculated and there will be a difference between the calculation of the real ball possession and the one calculated with our algorithm. In order to perfect our system, we would need to detect these events and stop the clock accordingly.\nConclusion\nVideo analytics is a lot of fun, but it\u2019s no magic. With this blog post, we tried to shed some light on how some of these solutions are implemented behind the scenes and release the code for everyone to play with. We hope that you found it useful!\n\nAgain, by no means is this a perfect system. Professional development of these tools for sports analytics will likely need to use several high-precision cameras, larger custom datasets, and possibly recognizing the 3D position of the objects for improved accuracy.\n\nWhen faced with the development of a large system, we might feel daunted. We must remember that every complex system starts from simple components.", "questions": ["Is the showcase public?", "Where can I find it?", "What does the second method for measuring ball possession involves?", "What ML libraries are mentioned in this use case?", "Can you give me more details about LabelImg?", "Which part of the player's outfit is used for identifying the player's team?", "Why is only the jersey used?", "Describe one of the weak points of the system", "How can this be solved?"], "answers": {"input_text": ["Yes", "GitHub", "Controlling a clock for each team and measuring the time each team has the ball", "Norfair, LabelImg and pigeonXT.", "It's an open-source labelling tool", "The jersey", "Helps to reduce unnecessary information for the classification algorithm", "The ball detection model that was used is not robust", "Fine-tuning an object detection model on a better dataset"]}}, {"conversation_id": "20", "context_id": "4", "story": "Becoming an AI organization\n\nAI may sound like something from the future, but nowadays, it's becoming a must to gain a competitive advantage. As the industry is booming and the budgets are quickly expanding, this is fueling the expansion of AI-based solutions.\n\nHowever, according to Harvard Business Review, only 8% of companies have adopted AI in their core activities because most still think of AI as isolated projects in their organizations.\n\nRecently, Maia Brenner and Alan Descoins, our Head of Business Development and CTO, respectively, hosted a roundtable for ELC. They shared some insights on how to start the AI journey. That session then was continued on a podcast episode, proving a genuine audience interest in getting more details about the challenges companies face when adopting AI in their process.\n\nIn this blog post, we will discuss the issues many companies face in AI adoption and the outstanding results they get.\n\nFundamental elements of an AI strategy\nDemystify AI\nOne of the first steps to becoming an AI organization is understanding what AI can and cannot do. Knowing its limitations helps you manage your expectations and realize its adoption isn't going to make all the problems magically go away. We have covered this before, the many myths around AI, and our position about them.\n\nAnother concept to keep in mind is not to fall in love with the tools. Instead, focus on the problem you're trying to solve. Remember that AI is just another set of tools, and there are as many solutions as possible problems. So look for the one that better suits your needs.\n\nFocus on the proper business case\nThe second challenge to tackle is to find the right business use case to apply AI. Your first AI project shouldn't be the most ambitious one; try starting small and look for those opportunities with higher ROI.\n\nThere are two possible approaches here, you can attack a problem you've already identified, or you can make a first explorative phase in your AI journey and try to discover the most appropriate use cases with which to start. We always share our work and experiences. Check them out; you might feel identified with some of them!\n\nGet your senior management on board\nBusiness leaders must understand how AI will affect their companies and get prepared, so they won't be left behind. A clear understanding also means everyone knows about its limitations and makes it easier to get rid of unrealistic expectations. You also have to state clearly to all the stakeholders how you measure the success of the project. Getting your senior management AI literate in the subject is always going to be in your favor. Here are some guides to the hottest topics in AI, such as price optimization, real estate price prediction, computer vision, and AI for finance.\n\nTake a look at your data\nOne of the mantras within the AI world is garbage in, garbage out. Even though I strongly agree with this statement, because the input we give to our models is one of the essential points to obtain good results, you don't have to fall into the data perfection trap.\n\nObtaining more data and improving quality is an iterative process in your AI journey. You can find more practical advice on data prep here.\n\nBuild the right team\nTo get your machine learning project off the ground, you must have a team with a broad range of skills, including non-technical roles. This means having the right combination of people who can identify the limitations of both the technical side and the business side.\n\nInterdisciplinary collaboration in AI projects helps you cover many aspects so that your project is on the right path to success, utilizing the best of each profile.\n\nAnother aspect of planning is devising is a system that allows collaborators to work in a coordinated way with a goal-oriented mindset.\n\nCommon pitfalls to avoid\nSince AI is still in its early stages in the business world, things going wrong is always a possibility. So here are some common pitfalls to avoid!\n\nAll-in projects\nI want to start with a piece of handy advice: don't try to boil the ocean on day one. You are definitely not going to transform all the processes in your entire organization in a blink.\n\nThe key to achieving remarkable results is to find the low-hanging fruit projects. Go tier by tier, as opportunities may not be straightforward to notice. Here's where your domain knowledge of the company can shine. You should analyze all the high-cost components of your operation and think whether something could be better by using data or partially automating a process.\n\nLook for high-impact projects that make the investment worth it. But don't overdo it by going with the most ambitious venture; prioritize opportunities with higher RoI first.\n\nquotes\nLook at the big picture but start small while measuring and showing impact.\n\n\nStarting without the right people in the right place\nWhen starting an AI project, one of the first questions is whether to outsource or build an in-house data science team. The answer depends greatly on the degree of maturity your organization has concerning its data, technical capabilities, and culture.\n\nWhile building an in-house data science team sounds appealing in the long run, based on our experience, outsourcing seems like the right decision to start this mind-blowing journey, especially when you have time constraints.\n\nRemember that partnering with an AI consulting firm means fast-tracking your AI journey. Specialized companies provide you guidance, knowledge, and tools based on their experience on similar successful projects and shorten the path to success and achieve outstanding results.\n\nIf we look at the international evidence, outsourcing is a growing trend, especially for companies that want to minimize their non-core activity cost structure. According to the Appen Whitepaper, The State of AI and Machine Learning, organizations outsourcing AI projects are more likely to deploy them and produce a significant ROI.\n\nStay halfway\nHaving a great idea to introduce AI in your organization is not enough to achieve results in your bottom line. You need the correct implementation too. Machine Learning models are intended to guide us in our business decisions. Not everything is about predictions, but what we do with them.\n\nA good way to avoid this trap is to think about how our model will be integrated into our current workflow. For example, if our model tells us that a customer will probably leave us, state in advance what specific actions we will take and determine who is responsible for setting them in motion.\n\n\nMeasure results\nUnderstanding how you'll measure success in AI projects is one of the most important elements of a good strategic plan. So, pay close attention. When you measure the results of an AI project, you have to bear in mind two types of benefits for your organization, tangible and intangible.\n\nWhen you decide on your intangible desired results, you have to ask yourself what are the problems you intend to resolve, something usually underestimated or forgotten. This means broadening the spectrum to concepts such as the company's objectives, strategy, and processes. A clear result for those who are just dipping their toes into the AI pool is to start spinning the AI wheel internally. It doesn't matter if the first project falls short of expectations. It will be the beginning of a valuable AI journey.\n\nquotes\nWhat's important is to fail fast and learn faster.\n\n\nOne of the central aspects of measuring tangible results is to separate business KPI's from ML models metrics. Companies that focus only on chasing higher accuracy metrics clearly leave money on the table and fail to maximize resources. The key is to explicitly link the output of your AI model to specific business objectives.\n\nComing back to our business problem, you have to define success from the very beginning of your project. To do so, we advise you to define SMART objectives, those that are Specific, Measurable, Achievable, Relevant, and Time-Bound. A few examples of business KPIs that can be improved using AI can be defined as increasing revenue, reducing cost, optimizing inventory, or reducing sales loss due to out-of-stock.\n\nAnother well-used option could be propensity score matching. This is a quasi-experimental setting in which statistical techniques are used to construct an artificial control group. This means matching each treatment item with a non-treatment one of similar characteristics. Measuring the differences in performance between those items can be a really good solution when the matching is done correctly.\n\nHaving an experimental mindset will be in your favor since it's one of the most assertive ways to measure impact. When talking about experimental settings, think about A/B testing, control, and treatment groups.\n\nA/B testing is a basic randomized control experiment. For example, for an e-commerce business, you can apply your solution to a set of defined users and contrast the results with users that remain unchanged. Keeping in mind that both groups are similar in any other characteristics, the difference in performance can be attributed to your solution. If you are a brick-and-mortar company, you can apply the model in some of your stores and use the rest as a control group.\n\n\nConclusions\nNobody said that becoming an AI organization is an easy task to implement and deploy. However, it is a path worth going through to obtain a competitive advantage that differentiates you from your competitors and increases your revenue.\n\nIf you want to explore further the path of becoming an AI Organization, we recommend you start with an assessment of your organization's current situation through these 11 questions. And don't hesitate and get in touch, let us understand where you are standing.", "questions": ["What is Alan Descoins' position on Tryolabs?", "What is a CTO?", "Explain the following sentence: \"One of the mantras within the AI world is garbage in, garbage out.\"", "What is a good characteristic for a machine learning team?", "What are the kind of projects I should be looking for when starting my AI organization?", "What is the advantage of outsourcing a data science team to an AI consulting firm?", "What does the SMART acronym mean?", "What is A/B testing?"], "answers": {"input_text": ["CTO", "I can't answer that.", "If you have bad data, you will get bad results.", "Broad range of skills.", "High-impact projects that make the investment worth it", "Guidance, knowledge, and tools based on their experience", "Specific, Measurable, Achievable, Relevant, and Time-Bound", "A basic randomized control experiment"]}}, {"conversation_id": "21", "context_id": "4", "story": "Becoming an AI organization\n\nAI may sound like something from the future, but nowadays, it's becoming a must to gain a competitive advantage. As the industry is booming and the budgets are quickly expanding, this is fueling the expansion of AI-based solutions.\n\nHowever, according to Harvard Business Review, only 8% of companies have adopted AI in their core activities because most still think of AI as isolated projects in their organizations.\n\nRecently, Maia Brenner and Alan Descoins, our Head of Business Development and CTO, respectively, hosted a roundtable for ELC. They shared some insights on how to start the AI journey. That session then was continued on a podcast episode, proving a genuine audience interest in getting more details about the challenges companies face when adopting AI in their process.\n\nIn this blog post, we will discuss the issues many companies face in AI adoption and the outstanding results they get.\n\nFundamental elements of an AI strategy\nDemystify AI\nOne of the first steps to becoming an AI organization is understanding what AI can and cannot do. Knowing its limitations helps you manage your expectations and realize its adoption isn't going to make all the problems magically go away. We have covered this before, the many myths around AI, and our position about them.\n\nAnother concept to keep in mind is not to fall in love with the tools. Instead, focus on the problem you're trying to solve. Remember that AI is just another set of tools, and there are as many solutions as possible problems. So look for the one that better suits your needs.\n\nFocus on the proper business case\nThe second challenge to tackle is to find the right business use case to apply AI. Your first AI project shouldn't be the most ambitious one; try starting small and look for those opportunities with higher ROI.\n\nThere are two possible approaches here, you can attack a problem you've already identified, or you can make a first explorative phase in your AI journey and try to discover the most appropriate use cases with which to start. We always share our work and experiences. Check them out; you might feel identified with some of them!\n\nGet your senior management on board\nBusiness leaders must understand how AI will affect their companies and get prepared, so they won't be left behind. A clear understanding also means everyone knows about its limitations and makes it easier to get rid of unrealistic expectations. You also have to state clearly to all the stakeholders how you measure the success of the project. Getting your senior management AI literate in the subject is always going to be in your favor. Here are some guides to the hottest topics in AI, such as price optimization, real estate price prediction, computer vision, and AI for finance.\n\nTake a look at your data\nOne of the mantras within the AI world is garbage in, garbage out. Even though I strongly agree with this statement, because the input we give to our models is one of the essential points to obtain good results, you don't have to fall into the data perfection trap.\n\nObtaining more data and improving quality is an iterative process in your AI journey. You can find more practical advice on data prep here.\n\nBuild the right team\nTo get your machine learning project off the ground, you must have a team with a broad range of skills, including non-technical roles. This means having the right combination of people who can identify the limitations of both the technical side and the business side.\n\nInterdisciplinary collaboration in AI projects helps you cover many aspects so that your project is on the right path to success, utilizing the best of each profile.\n\nAnother aspect of planning is devising is a system that allows collaborators to work in a coordinated way with a goal-oriented mindset.\n\nCommon pitfalls to avoid\nSince AI is still in its early stages in the business world, things going wrong is always a possibility. So here are some common pitfalls to avoid!\n\nAll-in projects\nI want to start with a piece of handy advice: don't try to boil the ocean on day one. You are definitely not going to transform all the processes in your entire organization in a blink.\n\nThe key to achieving remarkable results is to find the low-hanging fruit projects. Go tier by tier, as opportunities may not be straightforward to notice. Here's where your domain knowledge of the company can shine. You should analyze all the high-cost components of your operation and think whether something could be better by using data or partially automating a process.\n\nLook for high-impact projects that make the investment worth it. But don't overdo it by going with the most ambitious venture; prioritize opportunities with higher RoI first.\n\nquotes\nLook at the big picture but start small while measuring and showing impact.\n\n\nStarting without the right people in the right place\nWhen starting an AI project, one of the first questions is whether to outsource or build an in-house data science team. The answer depends greatly on the degree of maturity your organization has concerning its data, technical capabilities, and culture.\n\nWhile building an in-house data science team sounds appealing in the long run, based on our experience, outsourcing seems like the right decision to start this mind-blowing journey, especially when you have time constraints.\n\nRemember that partnering with an AI consulting firm means fast-tracking your AI journey. Specialized companies provide you guidance, knowledge, and tools based on their experience on similar successful projects and shorten the path to success and achieve outstanding results.\n\nIf we look at the international evidence, outsourcing is a growing trend, especially for companies that want to minimize their non-core activity cost structure. According to the Appen Whitepaper, The State of AI and Machine Learning, organizations outsourcing AI projects are more likely to deploy them and produce a significant ROI.\n\nStay halfway\nHaving a great idea to introduce AI in your organization is not enough to achieve results in your bottom line. You need the correct implementation too. Machine Learning models are intended to guide us in our business decisions. Not everything is about predictions, but what we do with them.\n\nA good way to avoid this trap is to think about how our model will be integrated into our current workflow. For example, if our model tells us that a customer will probably leave us, state in advance what specific actions we will take and determine who is responsible for setting them in motion.\n\n\nMeasure results\nUnderstanding how you'll measure success in AI projects is one of the most important elements of a good strategic plan. So, pay close attention. When you measure the results of an AI project, you have to bear in mind two types of benefits for your organization, tangible and intangible.\n\nWhen you decide on your intangible desired results, you have to ask yourself what are the problems you intend to resolve, something usually underestimated or forgotten. This means broadening the spectrum to concepts such as the company's objectives, strategy, and processes. A clear result for those who are just dipping their toes into the AI pool is to start spinning the AI wheel internally. It doesn't matter if the first project falls short of expectations. It will be the beginning of a valuable AI journey.\n\nquotes\nWhat's important is to fail fast and learn faster.\n\n\nOne of the central aspects of measuring tangible results is to separate business KPI's from ML models metrics. Companies that focus only on chasing higher accuracy metrics clearly leave money on the table and fail to maximize resources. The key is to explicitly link the output of your AI model to specific business objectives.\n\nComing back to our business problem, you have to define success from the very beginning of your project. To do so, we advise you to define SMART objectives, those that are Specific, Measurable, Achievable, Relevant, and Time-Bound. A few examples of business KPIs that can be improved using AI can be defined as increasing revenue, reducing cost, optimizing inventory, or reducing sales loss due to out-of-stock.\n\nAnother well-used option could be propensity score matching. This is a quasi-experimental setting in which statistical techniques are used to construct an artificial control group. This means matching each treatment item with a non-treatment one of similar characteristics. Measuring the differences in performance between those items can be a really good solution when the matching is done correctly.\n\nHaving an experimental mindset will be in your favor since it's one of the most assertive ways to measure impact. When talking about experimental settings, think about A/B testing, control, and treatment groups.\n\nA/B testing is a basic randomized control experiment. For example, for an e-commerce business, you can apply your solution to a set of defined users and contrast the results with users that remain unchanged. Keeping in mind that both groups are similar in any other characteristics, the difference in performance can be attributed to your solution. If you are a brick-and-mortar company, you can apply the model in some of your stores and use the rest as a control group.\n\n\nConclusions\nNobody said that becoming an AI organization is an easy task to implement and deploy. However, it is a path worth going through to obtain a competitive advantage that differentiates you from your competitors and increases your revenue.\n\nIf you want to explore further the path of becoming an AI Organization, we recommend you start with an assessment of your organization's current situation through these 11 questions. And don't hesitate and get in touch, let us understand where you are standing.", "questions": ["What is the first thing a company should know to become an AI organization?", "Why?", "Why getting your senior management on board with AI adoption is useful?", "Explain the following sentence: \"don't try to boil the ocean on day one\"", "How can avoid staying halfway with the introduction of AI in my organization?", "What are the types of benefits that an AI project can bring to the organization?", "What is the main takeaway from the conclusions?"], "answers": {"input_text": ["What AI can and cannot do.", "Knowing its limitations helps you manage your expectations", "Business leaders must understand how AI will affect their companies and get prepared, so they won't be left behind", "Don't try to transform all the processes in your entire organization in a blink", "Think about how your model will be integrated into the current workflow", "tangible and intangible", "Becoming an AI organization is not an easy task, but it is worth to do it to obtain competitive advantage"]}}, {"conversation_id": "22", "context_id": "4", "story": "Becoming an AI organization\n\nAI may sound like something from the future, but nowadays, it's becoming a must to gain a competitive advantage. As the industry is booming and the budgets are quickly expanding, this is fueling the expansion of AI-based solutions.\n\nHowever, according to Harvard Business Review, only 8% of companies have adopted AI in their core activities because most still think of AI as isolated projects in their organizations.\n\nRecently, Maia Brenner and Alan Descoins, our Head of Business Development and CTO, respectively, hosted a roundtable for ELC. They shared some insights on how to start the AI journey. That session then was continued on a podcast episode, proving a genuine audience interest in getting more details about the challenges companies face when adopting AI in their process.\n\nIn this blog post, we will discuss the issues many companies face in AI adoption and the outstanding results they get.\n\nFundamental elements of an AI strategy\nDemystify AI\nOne of the first steps to becoming an AI organization is understanding what AI can and cannot do. Knowing its limitations helps you manage your expectations and realize its adoption isn't going to make all the problems magically go away. We have covered this before, the many myths around AI, and our position about them.\n\nAnother concept to keep in mind is not to fall in love with the tools. Instead, focus on the problem you're trying to solve. Remember that AI is just another set of tools, and there are as many solutions as possible problems. So look for the one that better suits your needs.\n\nFocus on the proper business case\nThe second challenge to tackle is to find the right business use case to apply AI. Your first AI project shouldn't be the most ambitious one; try starting small and look for those opportunities with higher ROI.\n\nThere are two possible approaches here, you can attack a problem you've already identified, or you can make a first explorative phase in your AI journey and try to discover the most appropriate use cases with which to start. We always share our work and experiences. Check them out; you might feel identified with some of them!\n\nGet your senior management on board\nBusiness leaders must understand how AI will affect their companies and get prepared, so they won't be left behind. A clear understanding also means everyone knows about its limitations and makes it easier to get rid of unrealistic expectations. You also have to state clearly to all the stakeholders how you measure the success of the project. Getting your senior management AI literate in the subject is always going to be in your favor. Here are some guides to the hottest topics in AI, such as price optimization, real estate price prediction, computer vision, and AI for finance.\n\nTake a look at your data\nOne of the mantras within the AI world is garbage in, garbage out. Even though I strongly agree with this statement, because the input we give to our models is one of the essential points to obtain good results, you don't have to fall into the data perfection trap.\n\nObtaining more data and improving quality is an iterative process in your AI journey. You can find more practical advice on data prep here.\n\nBuild the right team\nTo get your machine learning project off the ground, you must have a team with a broad range of skills, including non-technical roles. This means having the right combination of people who can identify the limitations of both the technical side and the business side.\n\nInterdisciplinary collaboration in AI projects helps you cover many aspects so that your project is on the right path to success, utilizing the best of each profile.\n\nAnother aspect of planning is devising is a system that allows collaborators to work in a coordinated way with a goal-oriented mindset.\n\nCommon pitfalls to avoid\nSince AI is still in its early stages in the business world, things going wrong is always a possibility. So here are some common pitfalls to avoid!\n\nAll-in projects\nI want to start with a piece of handy advice: don't try to boil the ocean on day one. You are definitely not going to transform all the processes in your entire organization in a blink.\n\nThe key to achieving remarkable results is to find the low-hanging fruit projects. Go tier by tier, as opportunities may not be straightforward to notice. Here's where your domain knowledge of the company can shine. You should analyze all the high-cost components of your operation and think whether something could be better by using data or partially automating a process.\n\nLook for high-impact projects that make the investment worth it. But don't overdo it by going with the most ambitious venture; prioritize opportunities with higher RoI first.\n\nquotes\nLook at the big picture but start small while measuring and showing impact.\n\n\nStarting without the right people in the right place\nWhen starting an AI project, one of the first questions is whether to outsource or build an in-house data science team. The answer depends greatly on the degree of maturity your organization has concerning its data, technical capabilities, and culture.\n\nWhile building an in-house data science team sounds appealing in the long run, based on our experience, outsourcing seems like the right decision to start this mind-blowing journey, especially when you have time constraints.\n\nRemember that partnering with an AI consulting firm means fast-tracking your AI journey. Specialized companies provide you guidance, knowledge, and tools based on their experience on similar successful projects and shorten the path to success and achieve outstanding results.\n\nIf we look at the international evidence, outsourcing is a growing trend, especially for companies that want to minimize their non-core activity cost structure. According to the Appen Whitepaper, The State of AI and Machine Learning, organizations outsourcing AI projects are more likely to deploy them and produce a significant ROI.\n\nStay halfway\nHaving a great idea to introduce AI in your organization is not enough to achieve results in your bottom line. You need the correct implementation too. Machine Learning models are intended to guide us in our business decisions. Not everything is about predictions, but what we do with them.\n\nA good way to avoid this trap is to think about how our model will be integrated into our current workflow. For example, if our model tells us that a customer will probably leave us, state in advance what specific actions we will take and determine who is responsible for setting them in motion.\n\n\nMeasure results\nUnderstanding how you'll measure success in AI projects is one of the most important elements of a good strategic plan. So, pay close attention. When you measure the results of an AI project, you have to bear in mind two types of benefits for your organization, tangible and intangible.\n\nWhen you decide on your intangible desired results, you have to ask yourself what are the problems you intend to resolve, something usually underestimated or forgotten. This means broadening the spectrum to concepts such as the company's objectives, strategy, and processes. A clear result for those who are just dipping their toes into the AI pool is to start spinning the AI wheel internally. It doesn't matter if the first project falls short of expectations. It will be the beginning of a valuable AI journey.\n\nquotes\nWhat's important is to fail fast and learn faster.\n\n\nOne of the central aspects of measuring tangible results is to separate business KPI's from ML models metrics. Companies that focus only on chasing higher accuracy metrics clearly leave money on the table and fail to maximize resources. The key is to explicitly link the output of your AI model to specific business objectives.\n\nComing back to our business problem, you have to define success from the very beginning of your project. To do so, we advise you to define SMART objectives, those that are Specific, Measurable, Achievable, Relevant, and Time-Bound. A few examples of business KPIs that can be improved using AI can be defined as increasing revenue, reducing cost, optimizing inventory, or reducing sales loss due to out-of-stock.\n\nAnother well-used option could be propensity score matching. This is a quasi-experimental setting in which statistical techniques are used to construct an artificial control group. This means matching each treatment item with a non-treatment one of similar characteristics. Measuring the differences in performance between those items can be a really good solution when the matching is done correctly.\n\nHaving an experimental mindset will be in your favor since it's one of the most assertive ways to measure impact. When talking about experimental settings, think about A/B testing, control, and treatment groups.\n\nA/B testing is a basic randomized control experiment. For example, for an e-commerce business, you can apply your solution to a set of defined users and contrast the results with users that remain unchanged. Keeping in mind that both groups are similar in any other characteristics, the difference in performance can be attributed to your solution. If you are a brick-and-mortar company, you can apply the model in some of your stores and use the rest as a control group.\n\n\nConclusions\nNobody said that becoming an AI organization is an easy task to implement and deploy. However, it is a path worth going through to obtain a competitive advantage that differentiates you from your competitors and increases your revenue.\n\nIf you want to explore further the path of becoming an AI Organization, we recommend you start with an assessment of your organization's current situation through these 11 questions. And don't hesitate and get in touch, let us understand where you are standing.", "questions": ["Who is Tryolabs' CTO?", "What does it mean to \"demystify AI\"?", "What does \"Focus on the proper business case\" refers to?", "What are some of the most popular topics in AI?", "What propensity score matching is about?", "Why is it good to have an experimental mindset?", "What is the key to measuring tangible results for an ML model?"], "answers": {"input_text": ["Alan Descoins", "Understanding what AI can and cannot do", "Try starting small and look for those opportunities with higher ROI", "Price optimization, real estate price prediction, computer vision, and AI for finance", "This is a quasi-experimental setting in which statistical techniques are used to construct an artificial control group.", "It's one of the most assertive ways to measure impact", "Explicitly link the output of your AI model to specific business objectives"]}}]