In the last project, you trained a simple PyTorch model using SGD. Now, you'll write a program that can load a model (similar to the one from P2) and use it to make predictions upon request (this kind of program is called a model server).
Your model server will use multiple threads and cache the predictions (to save effort when the server is given the same inputs repeatedly).
You'll start by writing your code in a Python class and calling the methods in it. By the end, though, your model server will run in a Docker container and receive requests over a network (via gRPC calls).
Learning objectives:
- Use threads and locks correctly
- Cache expensive compute results with an LRU policy
- Measure performance statistics like cache hit rate
- Communicate between clients and servers via gRPC
Before starting, please review the general project directions.
- Sept 30: Match up `test_prediction_cache.py` and `test_modelserver.py` with specification
- Sept 30: Update Testing section
- Sept 30: Add additional `Predict` test case in `docker_autograde.py`
- Sept 30: Add comments in `autograde.py`
- Sept 30: Use `-Request` and `-Response` suffixes in `modelserver.proto`
- Oct 02: Reword hit/miss rate to hit and miss rate (your hit rate is the number of hits divided by the total number of calls)
- Oct 03: Added `--break-system-packages` to installation command
- Oct 05: Increased time that autograder waits for server to boot up to 5 seconds
Write a class called `PredictionCache` with two methods, `SetCoefs(coefs)` and `Predict(X)`, in a file called `server.py`.

`SetCoefs` will store `coefs` in the PredictionCache object; `coefs` will be a PyTorch tensor containing a vertical vector of `float32`s.
`Predict` will take a 2D tensor and use it to predict `y` values (which it will return) using the previously set `coefs`, like this:
y = X @ coefs
In Python, a return statement can have multiple values; in this case it should have two:
- the predicted `y` values
- a `bool`, indicating whether PredictionCache could make the prediction using a cache (see below)
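As a quick sanity check of the shapes involved, here is a minimal sketch (these particular tensor values are made up for illustration):

```python
import torch

coefs = torch.tensor([[1.0], [2.0], [3.0]], dtype=torch.float32)  # 3x1 "vertical vector"
X = torch.tensor([[1.0, 1.0, 1.0],
                  [2.0, 0.0, 0.0]], dtype=torch.float32)          # 2x3 input

y = X @ coefs              # 2x1 result: [[6.0], [2.0]]

# A function can return two values at once (really a tuple):
def predict(X, coefs):
    return X @ coefs, False   # (predictions, cache-hit flag)

y, hit = predict(X, coefs)
```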
Add code for an LRU cache to your `PredictionCache` class (a sketch follows this list). Requirements:
- `Predict` should round the X values to 4 decimal places before using them for anything (https://pytorch.org/docs/stable/generated/torch.round.html); the idea is to be able to use cached results for inputs that are approximately the same
- The cache should hold a maximum of 10 entries
- Whenever `SetCoefs` is called, invalidate the cache (meaning clear out all the entries in the cache) because we won't expect the same predictions for the same inputs now that the model itself has changed
- The second value returned by `Predict` should indicate whether there was a hit
- When adding an `X` value to a caching dictionary or looking it up, first convert X to a tuple, like this: `tuple(X.flatten().tolist())`. The reason is that PyTorch tensors don't work as you would expect as keys in a Python `dict` (but tuples do work)
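Here is one possible sketch of the cache logic. Using `collections.OrderedDict` to track recency is our choice, not a requirement; locking is added in the next step:

```python
import torch
from collections import OrderedDict

class PredictionCache:
    def __init__(self):
        self.coefs = None
        self.cache = OrderedDict()   # maps tuple(X) -> y tensor, in LRU order

    def SetCoefs(self, coefs):
        self.coefs = coefs
        self.cache.clear()           # old predictions are no longer valid

    def Predict(self, X):
        X = torch.round(X, decimals=4)       # treat near-identical inputs as equal
        key = tuple(X.flatten().tolist())    # tensors don't work as dict keys

        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key], True

        y = X @ self.coefs
        self.cache[key] = y
        if len(self.cache) > 10:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return y, False
```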
Use `test_prediction_cache.py` and verify that it produces the expected output indicated by the comments.
There will eventually be multiple threads calling methods in `PredictionCache` simultaneously, so add a lock (see the sketch after this list). The lock should:
- be held when any shared data (for example, attributes in the class) are modified
- get released at the end of each call, even if there is an exception
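A `with` block on a `threading.Lock` gives you exactly that release-on-exception guarantee. A minimal sketch, building on the class above:

```python
import threading

class PredictionCache:
    def __init__(self):
        self.lock = threading.Lock()
        # ... other attributes as before ...

    def Predict(self, X):
        with self.lock:   # released automatically at the end, even on exception
            ...           # all reads/writes of self.cache and self.coefs go here
```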
Read this guide for gRPC with Python:
Install the tools (be sure to upgrade pip first, as described in the directions):
pip3 install grpcio==1.58.0 grpcio-tools==1.58.0 --break-system-packages
Create a file called `modelserver.proto` containing a service called `ModelServer`. Specify `syntax="proto3";` at the top of your file.
`ModelServer` will contain 2 RPCs:
- `SetCoefs`
  - Request: `coefs` (repeated `float`)
  - Response: `error` (`string`)
- `Predict`
  - Request: `X` (repeated `float`)
  - Response: `y` (`float`), `hit` (`bool`), and `error` (`string`)
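Based on this spec (and the changelog note about `-Request`/`-Response` suffixes), the file might look roughly like the following; the exact message names here are our assumption, as long as the service, RPCs, and fields match the spec:

```proto
syntax = "proto3";

service ModelServer {
    rpc SetCoefs (SetCoefsRequest) returns (SetCoefsResponse);
    rpc Predict (PredictRequest) returns (PredictResponse);
}

message SetCoefsRequest {
    repeated float coefs = 1;
}

message SetCoefsResponse {
    string error = 1;
}

message PredictRequest {
    repeated float X = 1;
}

message PredictResponse {
    float y = 1;
    bool hit = 2;
    string error = 3;
}
```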
You can build your `.proto` with:
python3 -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. modelserver.proto
Verify `modelserver_pb2_grpc.py` was generated.
Add a `ModelServer` class to `server.py` that inherits from `modelserver_pb2_grpc.ModelServerServicer`. `ModelServer` should override the two methods of `ModelServerServicer` and use a `PredictionCache` to help calculate the answers. You'll need to manipulate the data to translate back and forth between the repeated `float` values from gRPC and the tensors in the shapes needed by `PredictionCache`.
Although `PredictionCache.Predict` can work on multiple rows of `X` data at once, the `X` values received by `ModelServer` should be arranged as a single row.
The `error` fields should contain the empty string `""` when all is well, or an error message that can help you debug when there was an exception or other issue (otherwise, exceptions happening on the server side won't show up anywhere, which makes troubleshooting difficult).
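Here is a sketch of what the servicer could look like, assuming the message names from the proto sketch above; the exact conversion code is up to you:

```python
import torch
import modelserver_pb2, modelserver_pb2_grpc

class ModelServer(modelserver_pb2_grpc.ModelServerServicer):
    def __init__(self):
        self.cache = PredictionCache()

    def SetCoefs(self, request, context):
        try:
            # repeated float -> Nx1 column tensor
            coefs = torch.tensor(request.coefs, dtype=torch.float32).reshape(-1, 1)
            self.cache.SetCoefs(coefs)
            return modelserver_pb2.SetCoefsResponse(error="")
        except Exception as e:
            return modelserver_pb2.SetCoefsResponse(error=str(e))

    def Predict(self, request, context):
        try:
            # repeated float -> single 1xN row tensor
            X = torch.tensor(request.X, dtype=torch.float32).reshape(1, -1)
            y, hit = self.cache.Predict(X)
            return modelserver_pb2.PredictResponse(y=float(y), hit=hit, error="")
        except Exception as e:
            return modelserver_pb2.PredictResponse(error=str(e))
```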
Start your server like this:
import grpc
from concurrent import futures
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4), options=(('grpc.so_reuseport', 0),))
modelserver_pb2_grpc.add_ModelServerServicer_to_server(ModelServer(), server)
server.add_insecure_port("[::]:5440")
server.start()
server.wait_for_termination()
You can do this directly at the bottom of your `server.py`, or within a `main` function; feel free to move imports to the top of your file if you like.
Use `test_modelserver.py` after starting the server to verify that it produces the expected output indicated by the comments.
Write a gRPC client named `client.py` that can be run like this:
python3 client.py <PORT> <COEF> <THREAD1-WORK.csv> <THREAD2-WORK.csv> ...
For example, say you run `python3 client.py 5440 "1.0,2.0,3.0" x1.csv x2.csv x3.csv`.
For this example, your client should do the following, in order:
- Connect to the server at port `5440`.
- Call `SetCoefs` with `[1.0,2.0,3.0]`.
- Launch three threads, each responsible for one of the 3 CSV files.
- Each thread should loop over the rows in its CSV file. Each row will contain a list of floats that should be used to make a `Predict` call to the server. The threads should collect statistics about the numbers of hits and misses.
- The main thread should call `join` to wait until the 3 threads are finished.
The client can print other stuff, but its very last line of output should be the overall hit rate. For example, if the hit and total counts for the three threads are 1/1, 0/1, and 3/8, then the overall hit rate would be (1+0+3) / (1+1+8) = 0.4.
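One possible shape for the client (a sketch that assumes the proto names from earlier, assumes every CSV row is data with no header, and omits argument validation):

```python
import sys, csv, threading
import grpc
import modelserver_pb2, modelserver_pb2_grpc

port, coefs = sys.argv[1], [float(c) for c in sys.argv[2].split(",")]
paths = sys.argv[3:]

channel = grpc.insecure_channel(f"localhost:{port}")
stub = modelserver_pb2_grpc.ModelServerStub(channel)
stub.SetCoefs(modelserver_pb2.SetCoefsRequest(coefs=coefs))

hits = [0] * len(paths)    # per-thread counters: each thread only touches
total = [0] * len(paths)   # its own slot, so no lock is needed here

def worker(idx, path):
    with open(path) as f:
        for row in csv.reader(f):
            X = [float(v) for v in row]
            resp = stub.Predict(modelserver_pb2.PredictRequest(X=X))
            hits[idx] += resp.hit   # bool counts as 0 or 1
            total[idx] += 1

threads = [threading.Thread(target=worker, args=(i, p)) for i, p in enumerate(paths)]
for t in threads: t.start()
for t in threads: t.join()

print(sum(hits) / sum(total))   # overall hit rate: the last line of output
```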
You should write a `Dockerfile` to build an image with everything needed to run both your server and client. Your Docker image should:
- Build via `docker build -t p3 .`
- Run via `docker run -p 127.0.0.1:54321:5440 p3` (i.e., you can map any external port to the internal port of 5440)

You should then be able to run the client outside of the container (using port 54321), or use a `docker exec` to enter the container and run the client with port 5440.
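A minimal Dockerfile sketch; the base image, the `torch` install, and the file layout are our assumptions, not requirements:

```dockerfile
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install torch grpcio==1.58.0 grpcio-tools==1.58.0 --break-system-packages

COPY . /app
WORKDIR /app
# regenerate the gRPC code inside the image
RUN python3 -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. modelserver.proto

EXPOSE 5440
CMD ["python3", "server.py"]
```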
You should organize and commit your files such that we can run the code in your repo like this:
python3 -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. modelserver.proto
python3 autograde.py
Run `python3 autograde.py` to get your score, which contains the following tests:
- `docker_build_run` (10): docker image builds and can start running
- `run_docker_autograde` (0): starts the docker autograder (within the docker container)
- `protobuf_interface` (10): interface matches spec
- `set_coefs` (10): is correct, no errors
- `predict` (10): get expected result
- `predict_single_call_cache` (10): repeated call will hit cache
- `predict_full_cache_eviction` (10): repeated call will not hit cache after cache fills up
- `set_coefs_cache_invalidation` (10): cache is invalidated after `SetCoefs`
- `client_workload_1` (10): no repeats/cache hits
- `client_workload_2` (10): LRU check
- `client_workload_3` (10): handles multiple CSVs
There will also be a manual grading portion, so the score you see in `test.json` may not reflect your final score.
We will check the following:
- Client uses threads
- Server uses threads
- Server locking is correct
- Etc.