.. _rasp_tutorial:

ReAgent Serving Platform (RASP)
===============================

Welcome to the ReAgent Serving Platform! This tutorial gets readers
familiar with reasoning at scale by building an artificial e-commerce
site.

What is RASP?
-------------

RASP is a set of scoring and ranking functions and a systematic way to
collect data and deploy models. A set of potential actions is input to
RASP and a ranked list of scores is output. The method for scoring and
ranking actions is called a decision plan. In this tutorial, we will
create several different decision plans and simulate user traffic to see
the results.

Installing ReAgent
------------------

Before beginning this tutorial, please install ReAgent by following
these instructions:
https://github.com/facebookresearch/ReAgent/blob/master/docs/installation.rst

Store set-up
------------

For this tutorial, we will be in charge of recommendations for
Pig-E-Barbeque, a pork e-store. Pig-E-Barbeque sells two products: Ribs
and Bacon. Our product manager has told us to optimize for clicks. Since
we are just starting out, we don’t know anything about our visitors, but
we know that bacon is delicious. We give bacon a score of 1.1 and ribs a
score of 1.0. (If we were optimizing for revenue, we could set the score
to the price, or we could have a custom scoring function.)

We also need to provide a ranking function that takes our scores and
decides which items to recommend. In Pig-E-Barbeque, we only have one
spot for recommendations, so the first item will be shown to visitors
and the second choice is discarded. If we use a greedy ranking function,
we will always show bacon (with its score of 1.1) and never show ribs
(with a score of 1.0). This means we will never know the true
performance of recommending ribs and can’t improve our system in the
future. This is known as the cold-start or explore-exploit problem
( https://arxiv.org/abs/1812.00116 ).

To avoid that problem, we will use the SoftmaxRanker, which will show
bacon 52% of the time and ribs 48% of the time. The SoftmaxRanker
operator is based on the softmax function:

::

    >>> import numpy as np
    >>>
    >>> def softmax(x):
    ...     e_x = np.exp(x - np.max(x))
    ...     return e_x / e_x.sum()
    ...
    >>> print(softmax([1.1, 1.0]))
    [0.52497919 0.47502081]

Here is the decision plan generator:

::

    def softmaxranker_decision_plan():
        op = SoftmaxRanker(temperature=1.0, values={"Bacon": 1.1, "Ribs": 1.0})
        return DecisionPlanBuilder().set_root(op).build()

And here is the generated decision plan:

::

    {
      "operators": [
        {
          "name": "SoftmaxRanker_1",
          "op_name": "SoftmaxRanker",
          "input_dep_map": {
            "temperature": "constant_2",
            "values": "constant_3"
          }
        }
      ],
      "constants": [
        {
          "name": "constant_2",
          "value": {
            "double_value": 1.0
          }
        },
        {
          "name": "constant_3",
          "value": {
            "map_double_value": {
              "Bacon": 1.1,
              "Ribs": 1.0
            }
          }
        }
      ],
      "num_actions_to_choose": 1,
      "reward_function": "reward",
      "reward_aggregator": "sum"
    }

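To make the structure of the plan concrete, here is a minimal sketch of how
the names in ``input_dep_map`` resolve to entries in ``constants``. The helper
``resolve_inputs`` is hypothetical and is not part of RASP; it only
illustrates the name-based wiring visible in the JSON above, and it handles
constant inputs only (an input may also name another operator).

```python
import json

# The plan shown above, abbreviated to the fields this sketch needs.
PLAN_JSON = """
{
  "operators": [
    {
      "name": "SoftmaxRanker_1",
      "op_name": "SoftmaxRanker",
      "input_dep_map": {"temperature": "constant_2", "values": "constant_3"}
    }
  ],
  "constants": [
    {"name": "constant_2", "value": {"double_value": 1.0}},
    {"name": "constant_3",
     "value": {"map_double_value": {"Bacon": 1.1, "Ribs": 1.0}}}
  ],
  "num_actions_to_choose": 1
}
"""


def resolve_inputs(plan, operator_name):
    """Hypothetical helper: map each operator input to its constant's payload."""
    # Index constants by name; each "value" holds exactly one typed field.
    constants = {
        c["name"]: next(iter(c["value"].values())) for c in plan["constants"]
    }
    operator = next(op for op in plan["operators"] if op["name"] == operator_name)
    return {arg: constants[dep] for arg, dep in operator["input_dep_map"].items()}


plan = json.loads(PLAN_JSON)
inputs = resolve_inputs(plan, "SoftmaxRanker_1")
print(inputs)
```

The resolved inputs match the arguments passed to ``SoftmaxRanker`` in the
generator: a temperature of 1.0 and the two hand-picked scores.
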
User simulator
--------------

Because this isn’t a real store, we need a way to simulate users. Our
simulator has a few rules:

1. Visitors click on bacon recommendations 50% of the time
2. 10% of visits are by rib lovers and the rest are regular visitors

   1. Rib lovers click on rib recommendations 90% of the time
   2. Regular visitors click on rib recommendations 10% of the time

We will be using the built-in web service directly for this tutorial.
The simulator code can be found at:
serving/examples/ecommerce/customer_simulator.py

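The rules above can be sketched in a few lines of Python. This is not the
code in ``customer_simulator.py`` (which calls the RASP web service); the
function names, the fixed seed, and the ``p_bacon`` parameter are assumptions
made for illustration.

```python
import random

RIB_LOVER_FRACTION = 0.10  # rule 2: share of visits made by rib lovers


def click_probability(action, is_rib_lover):
    """Probability that a visitor clicks the recommended action."""
    if action == "Bacon":
        return 0.5  # rule 1: the same for every visitor
    return 0.9 if is_rib_lover else 0.1  # rules 2.1 and 2.2


def simulate_visit(rng, p_bacon):
    """Simulate one visit; return (action, reward), reward is 1.0 on a click."""
    is_rib_lover = rng.random() < RIB_LOVER_FRACTION
    action = "Bacon" if rng.random() < p_bacon else "Ribs"
    clicked = rng.random() < click_probability(action, is_rib_lover)
    return action, 1.0 if clicked else 0.0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
visits = [simulate_visit(rng, p_bacon=0.525) for _ in range(100_000)]
average_reward = sum(reward for _, reward in visits) / len(visits)
print(f"Average reward: {average_reward:.3f}")
```

With bacon recommended 52.5% of the time, the expected reward works out to
``0.525 * 0.5 + 0.475 * (0.1 * 0.9 + 0.9 * 0.1) = 0.348``, so the simulated
average should land close to 0.35.
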
Makin’ bacon
------------

In one terminal window, start the RASP server:

::

    ➜ ./serving/build/RaspCli --logtostderr
    I1014 17:23:19.736086 457250240 DiskConfigProvider.cpp:10] READING CONFIGS FROM serving/examples/ecommerce/plans
    I1014 17:23:19.738142 457250240 DiskConfigProvider.cpp:42] GOT CONFIG multi_armed_bandit.json AT serving/examples/ecommerce/plans/multi_armed_bandit.json
    I1014 17:23:19.738286 457250240 DiskConfigProvider.cpp:46] Registered decision config: multi_armed_bandit.json
    I1014 17:23:19.738932 457250240 DiskConfigProvider.cpp:42] GOT CONFIG contextual_bandit.json AT serving/examples/ecommerce/plans/contextual_bandit.json
    I1014 17:23:19.739020 457250240 DiskConfigProvider.cpp:46] Registered decision config: contextual_bandit.json
    I1014 17:23:19.739610 457250240 DiskConfigProvider.cpp:42] GOT CONFIG heuristic.json AT serving/examples/ecommerce/plans/heuristic.json
    I1014 17:23:19.739682 457250240 DiskConfigProvider.cpp:46] Registered decision config: heuristic.json
    I1014 17:23:19.739843 131715072 Server.cpp:58] STARTING SERVER

Then, in another terminal, run our simulator. The simulator will spawn
many threads and call RASP 1,000 times:

::

    ➜ python serving/examples/ecommerce/customer_simulator.py heuristic.json
    0
    200
    100
    400
    300
    500
    600
    700
    800
    900
    Average reward: 0.363
    Action Distribution: {'Ribs': 471, 'Bacon': 529}

As expected, we recommend Bacon about 52% of the time and Ribs about 48%
of the time. We get an average reward (in this case, the average number of
clicks per visit) of about 0.36.

This is our baseline performance, but can we do better? From the log, we
can see that more bacon recommendations were clicked on:

::

    ➜ cat /tmp/rasp_logging/log.txt | grep '"name":"Ribs"}]' | grep '"reward":0.0' | wc -l
    390 # Ribs not clicked
    ➜ cat /tmp/rasp_logging/log.txt | grep '"name":"Ribs"}]' | grep '"reward":1.0' | wc -l
    88 # Ribs clicked
    ➜ cat /tmp/rasp_logging/log.txt | grep '"name":"Bacon"}]' | grep '"reward":1.0' | wc -l
    266 # Bacon clicked
    ➜ cat /tmp/rasp_logging/log.txt | grep '"name":"Bacon"}]' | grep '"reward":0.0' | wc -l
    253 # Bacon not clicked

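The same tally can be done in Python. The sketch below reuses the exact
substrings from the ``grep`` commands above; everything else about the layout
of a log line is an assumption, and the sample lines are made up for
illustration.

```python
from collections import Counter


def tally_clicks(lines, actions=("Bacon", "Ribs")):
    """Count clicked and not-clicked recommendations per action.

    Mirrors the grep pipeline: a line is attributed to an action when it
    contains '"name":"<action>"}]' and counts as clicked when it contains
    '"reward":1.0'.
    """
    counts = Counter()
    for line in lines:
        for action in actions:
            if f'"name":"{action}"}}]' in line:
                clicked = '"reward":1.0' in line
                counts[(action, clicked)] += 1
    return counts


def click_through_rate(counts, action):
    """Fraction of recommendations of `action` that were clicked."""
    clicked = counts[(action, True)]
    total = clicked + counts[(action, False)]
    return clicked / total if total else 0.0


# Made-up lines containing only the substrings the grep commands match on.
sample = [
    '{"actions":[{"name":"Ribs"}],"reward":0.0}',
    '{"actions":[{"name":"Bacon"}],"reward":1.0}',
    '{"actions":[{"name":"Bacon"}],"reward":0.0}',
    '{"actions":[{"name":"Ribs"}],"reward":0.0}',
]
counts = tally_clicks(sample)
print(click_through_rate(counts, "Bacon"), click_through_rate(counts, "Ribs"))
```

To tally the real log, pass an open file handle for
``/tmp/rasp_logging/log.txt`` to ``tally_clicks``. For the counts shown above,
the click-through rates are 88 / (88 + 390), about 0.18, for Ribs and
266 / (266 + 253), about 0.51, for Bacon.
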
This makes sense: from our simulator definition, most visitors aren’t
rib lovers and only click on ribs 10% of the time. We can change
the decision plan to use a multi-armed bandit that will learn to show
bacon much more often. For this tutorial, we will use the UCB1 bandit
ranker. Here is the plan generator:

::

    def ucb_decision_plan():
        op = UCB(method="UCB1", batch_size=16)
        return DecisionPlanBuilder().set_root(op).build()

And here is the generated plan:

::

    ➜ cat serving/examples/ecommerce/plans/multi_armed_bandit.json
    {
      "operators": [
        {
          "name": "UCB_1",
          "op_name": "Ucb",
          "input_dep_map": {
            "method": "constant_2",
            "batch_size": "constant_3"
          }
        }
      ],
      "constants": [
        {
          "name": "constant_2",
          "value": {
            "string_value": "UCB1"
          }
        },
        {
          "name": "constant_3",
          "value": {
            "int_value": 16
          }
        }
      ],
      "num_actions_to_choose": 1,
      "reward_function": "reward",
      "reward_aggregator": "sum"
    }

Running with this new plan gives:

::

    ➜ python serving/examples/ecommerce/customer_simulator.py multi_armed_bandit.json
    0
    200
    100
    400
    300
    500
    600
    700
    800
    900
    Average reward: 0.447
    Action Distribution: {'Ribs': 184, 'Bacon': 816}

This is already better than our previous score of 0.363. While we were
running, the bandit was learning and adapting the scores. Let’s run
again:

::

    ➜ python serving/examples/ecommerce/customer_simulator.py multi_armed_bandit.json
    0
    200
    100
    400
    300
    500
    600
    700
    800
    900
    Average reward: 0.497
    Action Distribution: {'Bacon': 926, 'Ribs': 74}

So the new ranker chooses bacon more often and gets more reward on
average than our first plan. If we keep running, eventually the model
will stop exploring the Ribs action and the average reward will approach
50% (the click probability for bacon that we set in our simulator).

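For intuition about why the bandit drifts toward bacon, here is a minimal
sketch of the textbook UCB1 rule run against the simulator’s click rates. It
is not RASP’s implementation: it ignores ``batch_size``, and the class and
variable names are assumptions.

```python
import math
import random


class UCB1:
    """Textbook UCB1: pick the arm with the highest mean plus exploration bonus."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}

    def choose(self):
        # Play every arm once before applying the UCB formula.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())

        def upper_bound(arm):
            n = self.counts[arm]
            return self.totals[arm] / n + math.sqrt(2.0 * math.log(t) / n)

        return max(self.counts, key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Marginal click rates implied by the simulator rules:
# Bacon 0.5; Ribs 0.1 * 0.9 + 0.9 * 0.1 = 0.18.
CLICK_RATE = {"Bacon": 0.5, "Ribs": 0.18}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
bandit = UCB1(CLICK_RATE)
for _ in range(5000):
    arm = bandit.choose()
    reward = 1.0 if rng.random() < CLICK_RATE[arm] else 0.0
    bandit.update(arm, reward)

print(bandit.counts)
```

The exploration bonus shrinks as an arm is played more often, so Ribs is
tried less and less once its lower click rate becomes clear, which is the
same trend as in the two runs above.
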
Straight Outta Context
----------------------

While running the store, our data scientist has discovered a way to
figure out who is a rib lover. Now we can pass a context feature which
is 1 when the visitor is a rib lover and 0 otherwise. In this section we
will train a contextual bandit that learns to show ribs to rib lovers
and bacon to everyone else.

As we specified in our config, RASP has been writing a log of visits and
feedback to a file. We can pass this file, along with a training config, to
ReAgent to train a contextual bandit model. First, let’s clear our
training data and start over: send a SIGINT (Ctrl-C) to our instance of
RaspCli, delete the log, and restart the server:

::

    …
    I1014 17:45:36.613893 6602752 Server.cpp:58] STARTING SERVER
    ^C
    ➜ rm /tmp/rasp_logging/log.txt
    ➜ ./serving/build/RaspCli --logtostderr
    I1014 17:48:49.674149 144418240 DiskConfigProvider.cpp:10] READING CONFIGS FROM serving/examples/ecommerce/plans
    I1014 17:48:49.678155 144418240 DiskConfigProvider.cpp:42] GOT CONFIG multi_armed_bandit.json AT serving/examples/ecommerce/plans/multi_armed_bandit.json
    I1014 17:48:49.679606 144418240 DiskConfigProvider.cpp:46] Registered decision config: multi_armed_bandit.json
    I1014 17:48:49.680496 144418240 DiskConfigProvider.cpp:42] GOT CONFIG contextual_bandit.json AT serving/examples/ecommerce/plans/contextual_bandit.json
    I1014 17:48:49.680778 144418240 DiskConfigProvider.cpp:46] Registered decision config: contextual_bandit.json
    I1014 17:48:49.682201 144418240 DiskConfigProvider.cpp:42] GOT CONFIG heuristic.json AT serving/examples/ecommerce/plans/heuristic.json
    I1014 17:48:49.682344 144418240 DiskConfigProvider.cpp:46] Registered decision config: heuristic.json
    I1014 17:48:49.682667 65638400 Server.cpp:58] STARTING SERVER

Now let’s run the heuristic model a few times to generate enough data
(this may take a few minutes). At the end there should be 10,000 samples
(we can verify this with the ``wc`` command):

::

    ➜ for run in {1..10}; do python serving/examples/ecommerce/customer_simulator.py heuristic.json; done
    0
    200
    ...
    900
    Average reward: 0.36
    Action Distribution: {'Bacon': 516, 'Ribs': 484}
    ➜ wc -l /tmp/rasp_logging/log.txt
    10000 /tmp/rasp_logging/log.txt

RASP’s logging format and the ReAgent models’ input format are slightly
different. Fortunately, there’s a tool to convert from one to the other:

::

    ➜ python serving/scripts/rasp_to_model.py /tmp/rasp_logging/log.txt /tmp/input_df.pkl
    ...
    INFO:__main__: ds mdp_id sequence_number state_features action reward action_probability possible_actions metrics
    0 2019-01-01 1287515757457242569 0 {0: 0.0, 1: 1.0} Ribs 0.0 0.475021 [Bacon, Ribs] {'reward': 0.0}
    1 2019-01-01 -1441171268272508658 0 {0: 0.0, 1: 1.0} Ribs 0.0 0.475021 [Bacon, Ribs] {'reward': 0.0}
    2 2019-01-01 -267723109738500267 0 {0: 0.0, 1: 1.0} Bacon 1.0 0.524979 [Bacon, Ribs] {'reward': 1.0}
    3 2019-01-01 7619952535038766490 0 {0: 0.0, 1: 1.0} Ribs 0.0 0.475021 [Bacon, Ribs] {'reward': 0.0}
    4 2019-01-01 -2393212434904546228 0 {0: 0.0, 1: 1.0} Bacon 0.0 0.524979 [Bacon, Ribs] {'reward': 0.0}

Because we are training a contextual bandit (or RL) model, we first need
to build a timeline:

::

    # Set the config
    ➜ export CONFIG=serving/examples/ecommerce/training/contextual_bandit.yaml

    # First clean up derby database from last run
    ➜ rm -Rf spark-warehouse derby.log metastore_db preprocessing/spark-warehouse preprocessing/metastore_db preprocessing/derby.log

    # Run timeline operator
    ➜ ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.timeline_operator "$CONFIG"

The `Click <https://click.palletsprojects.com/en/7.x/>`_ command submits a Spark job that uploads the timeline table to Hive.
Now we can train the contextual bandit.

::

    ➜ ./reagent/workflow/cli.py run reagent.workflow.training.identify_and_train_network "$CONFIG"
    ...
    I0524 112136.208 model_manager.py:213] Saved torchscript model to model_1590344496.torchscript

At this point, we have a model saved at ``model_*.torchscript``. We
are going to combine this scoring model with a Softmax ranker. The
ranker chooses the best action most of the time, but occasionally
chooses other actions to explore:

::

    {
      "operators": [
        {
          "name": "ActionValueScoringOp",
          "op_name": "ActionValueScoring",
          "input_dep_map": {
            "model_id": "model_id",
            "snapshot_id": "snapshot_id"
          }
        },
        {
          "name": "SoftmaxRankerOp",
          "op_name": "SoftmaxRanker",
          "input_dep_map": {
            "temperature": "constant_2",
            "values": "ActionValueScoringOp"
          }
        }
      ],
      "constants": [
        {
          "name": "model_id",
          "value": {
            "int_value": 0
          }
        },
        {
          "name": "snapshot_id",
          "value": {
            "int_value": 0
          }
        },
        {
          "name": "constant_2",
          "value": {
            "double_value": 0.001
          }
        }
      ],
      "num_actions_to_choose": 1,
      "reward_function": "reward",
      "reward_aggregator": "sum"
    }

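The effect of the small temperature (0.001) in this plan can be checked with
a temperature-scaled softmax. This sketch assumes that SoftmaxRanker divides
the scores by the temperature before applying softmax, which is the usual
convention but is not confirmed by this tutorial; the example scores are made
up.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over scores / temperature, shifted by the max for stability."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.50, 0.49]  # hypothetical model scores for two actions

for temperature in (1.0, 0.1, 0.001):
    probs = softmax(scores, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

Under that assumption, at temperature 1.0 the two actions are almost equally
likely, while at temperature 0.001 a score gap of only 0.01 already gives the
top action more than 99.9% of the probability. The ranker is therefore
nearly greedy, with a small amount of exploration left over.
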
The “model_id” and “snapshot_id” tell us where to find the model. Let’s
put the model there so we can find it:

::

    ➜ mkdir -p /tmp/0
    ➜ cp model_*.torchscript /tmp/0/0

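The same copy step can be scripted. This sketch assumes, based only on the
commands above, that the model for a given ``model_id`` and ``snapshot_id``
is expected at ``/tmp/<model_id>/<snapshot_id>``; the function names and the
``base_dir`` parameter are hypothetical.

```python
import shutil
from pathlib import Path


def latest_model(directory="."):
    """Return the newest model_*.torchscript file in `directory`, or None."""
    candidates = list(Path(directory).glob("model_*.torchscript"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def publish_model(torchscript_path, model_id, snapshot_id, base_dir="/tmp"):
    """Copy a TorchScript file to <base_dir>/<model_id>/<snapshot_id>."""
    target = Path(base_dir) / str(model_id) / str(snapshot_id)
    target.parent.mkdir(parents=True, exist_ok=True)  # same as `mkdir -p`
    shutil.copyfile(torchscript_path, target)
    return target
```

For example, ``publish_model(latest_model(), model_id=0, snapshot_id=0)``
would reproduce the two shell commands above. Picking the newest file
explicitly also avoids the ``cp`` failure that occurs when several
``model_*.torchscript`` files match the glob.
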
Let’s run with our model:

::

    ➜ python serving/examples/ecommerce/customer_simulator.py contextual_bandit.json
    0
    200
    100
    400
    300
    500
    600
    700
    800
    900
    Average reward: 0.52
    Action Distribution: {'Bacon': 883, 'Ribs': 117}

Nice! We have a reward higher than 50%, which is the click-through rate
for bacon. This means that we must be getting most of the rib lovers. In
case you were curious, the best possible score is
``0.9 * 0.5 + 0.1 * 0.9 = 0.54``. We still have some exploration in our
new plan, so we won’t get exactly 0.54 even with many iterations, but we
need that exploration to generate an even better model next time, when we
learn more about our customers.

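The best possible score follows directly from the simulator rules. The sketch
below works it out by choosing the most-clicked action for each visitor type;
the variable and function names are ours, not part of ReAgent.

```python
# Click probabilities from the simulator rules, keyed by visitor type.
CLICK_PROB = {
    "rib_lover": {"Bacon": 0.5, "Ribs": 0.9},
    "regular": {"Bacon": 0.5, "Ribs": 0.1},
}
VISITOR_SHARE = {"rib_lover": 0.1, "regular": 0.9}


def expected_reward(policy):
    """Expected clicks per visit for a policy mapping visitor type -> action."""
    return sum(
        share * CLICK_PROB[visitor][policy[visitor]]
        for visitor, share in VISITOR_SHARE.items()
    )


always_bacon = {"rib_lover": "Bacon", "regular": "Bacon"}
# The optimal contextual policy picks the most-clicked action per visitor type.
optimal = {
    visitor: max(probs, key=probs.get) for visitor, probs in CLICK_PROB.items()
}

print(optimal)
print(round(expected_reward(always_bacon), 2), round(expected_reward(optimal), 2))
```

Always showing bacon yields 0.5, while showing ribs only to rib lovers
yields 0.54, so the context feature is worth at most 0.04 extra clicks per
visit in this simulator.
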
All of the decisions made so far have been pointwise: we don’t consider
repeat visitors. ReAgent can also optimize for long-term value in
sequential decisions using reinforcement learning, but that is out of
the scope of this starting tutorial.