Nvidia Jetson is a series of embedded computing boards from Nvidia. The Jetson TK1, TX1, and TX2 models all carry a Tegra processor (or SoC) from Nvidia that integrates an ARM architecture central processing unit (CPU). Jetson is a low-power system designed for accelerating machine learning applications. Nvidia Jetson is used by professional developers to create breakthrough AI products across all industries, and by students and enthusiasts for hands-on AI learning and projects. Deploying SLMs on edge devices such as Jetson enables better implementation of industrial generative AI application scenarios.
Developers working on autonomous robotics and embedded devices can leverage Phi-3 Mini. Phi-3's relatively small size makes it ideal for edge deployment, and its parameters were meticulously tuned during training, ensuring high accuracy in responses.
NVIDIA's TensorRT-LLM library optimizes large language model inference. It supports Phi-3 Mini's long context window, improving throughput and reducing latency. The optimizations include techniques such as LongRoPE, FP8, and in-flight batching.
Developers can explore Phi-3 Mini with the 128K context window in NVIDIA's AI catalog (ai.nvidia.com). It is packaged as an NVIDIA NIM, a microservice with a standard API that can be deployed anywhere. Additionally, the TensorRT-LLM implementations are available on GitHub.
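Because the NIM exposes an OpenAI-compatible API, the hosted Phi-3 Mini endpoint can be queried with a plain HTTP request. The snippet below is a minimal sketch, not an official example: it assumes the catalog's `integrate.api.nvidia.com` endpoint and the `microsoft/phi-3-mini-128k-instruct` model id, and it requires an API key generated from the catalog.

```bash
# Minimal sketch: query the hosted Phi-3 Mini NIM via its OpenAI-compatible API.
# Assumes the integrate.api.nvidia.com endpoint and the
# microsoft/phi-3-mini-128k-instruct model id from NVIDIA's AI catalog;
# NVIDIA_API_KEY must hold a key generated there.
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/phi-3-mini-128k-instruct",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128
      }'
```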
a. Jetson Orin NX / Jetson NX
b. JetPack 5.1.2+
c. CUDA 11.8
d. Python 3.8+
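Before installing anything, it is worth confirming that the board meets these requirements. The commands below are a quick sanity check on a stock JetPack image; the `/etc/nv_tegra_release` file, the `nvidia-jetpack` package name, and the CUDA install path follow the usual JetPack conventions, so adjust them if your image differs.

```bash
# Quick check of the prerequisites on a Jetson board.
# Paths and package names follow the usual JetPack conventions;
# adjust if your image differs.
cat /etc/nv_tegra_release                              # L4T release info
apt show nvidia-jetpack 2>/dev/null | grep Version     # JetPack version (want 5.1.2+)
/usr/local/cuda/bin/nvcc --version                     # CUDA toolkit (want 11.8)
python3 --version                                      # Python (want 3.8+)
```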
We can choose Ollama or LlamaEdge to run Phi-3 on Jetson.
If you want to use GGUF models in the cloud and on edge devices at the same time, LlamaEdge is a good choice. LlamaEdge can be understood as WasmEdge (WasmEdge is a lightweight, high-performance, extensible WebAssembly runtime suitable for cloud-native, edge, and decentralized applications. It supports serverless applications, embedded functions, microservices, smart contracts, and IoT devices). You can deploy GGUF quantized models to edge devices and the cloud through LlamaEdge.
Here are the steps to use it:
- Install WasmEdge and download the related libraries and files
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
Note: llama-api-server.wasm and chatbot-ui need to be in the same directory
- Run the following command in a terminal to start the API server
wasmedge --dir .:. --nn-preload default:GGML:AUTO:{Your gguf path} llama-api-server.wasm -p phi-3-chat
Here is the running result:
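Once the server is up, you can open the chatbot UI in a browser or call the OpenAI-compatible API directly. The request below is a minimal sketch: it assumes llama-api-server's default port of 8080 and its `/v1/chat/completions` route, and the `model` value here is just a placeholder name.

```bash
# Minimal sketch: test the local LlamaEdge API server.
# Assumes llama-api-server's default port (8080) and its
# OpenAI-compatible /v1/chat/completions route; "phi-3-mini"
# is a placeholder model name.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Nvidia Jetson?"}
        ],
        "model": "phi-3-mini"
      }'
```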
Sample code: Phi-3 mini WASM Notebook Sample
In summary, Phi-3 Mini represents a leap forward in language modeling, combining efficiency, context awareness, and NVIDIA's optimization prowess. Whether you're building robots or edge applications, Phi-3 Mini is a powerful tool to be aware of.