Model problem - general help #1046
-
Thanks for your detailed description. Would it be better to start from a simplified version, then gradually work up to your final goal? Imagine that there will be bugs to fix; it is important to have clues about where they come from.
-
Hi
I've read all the really interesting blogs on Towards Data Science and looked through how you do it in this repo too. I've done a lot of trial and error and have gotten some way.
My current case has many similarities but also a few slightly different nuances, so I'm basically trying to get some general advice from more knowledgeable people :-). Basically: how would you approach such a thing in general if presented with it as described - like your go-to setup based on experience?
I'm trying to build a reinforcement learning model and am using PPO as a starting point.
Data:
Let's imagine one has a lot of "decoupled" time series, like in your basic stock trading scenario. Decoupled in the sense that one sample has a fixed number of time steps, like 100 (or very close to it). Maybe (I'm not a financial expert) it's conceptually a bit like a stock option, where you can buy one on a given day and then choose to sell it on a later day, but ultimately on some fixed future date it is either worth something (you buy the stock at a discount) or worthless (you lose the investment because the option has expired).
Imagine you have 10,000 samples of how such a conceptual thing plays out - "price" movements until termination. Of course the variance across the 10,000 samples can be great.
It's an incomplete-information situation in the sense that no one knows what the future will bring, unlike some Gyms where you can better guide / shape the reward in the direction of the goal, like in your stock price examples.
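Just so the shape of the data is concrete, here is a toy stand-in - purely made-up numbers, only the array shape matters:

```python
import numpy as np

# 10,000 independent samples, each a "price" path of ~100 steps that at the end
# is either worth something or worthless; the random walk here is only a placeholder
num_samples, horizon = 10_000, 100
rng = np.random.default_rng(0)
prices = np.abs(rng.normal(0.0, 1.0, size=(num_samples, horizon)).cumsum(axis=1))
print(prices.shape)  # (10000, 100)
```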
Goal:
Maximize the mean reward across samples - like predicting on the next, never-seen sample.
Environment
This is one of the fundamental things I'm struggling with. How should episodes be structured?
1) One sample per episode, i.e. roughly 100 steps, then reset to the next random sample from the pool?
2) Bundle samples together into one "series", e.g. 1,000 out of the 10,000 samples, terminate / return done after 1,000 * 100 steps, summing up the reward across all samples, and then reset the environment to the next 1,000 random samples from the total pool? (Sketched below.)
3) Like 2) but with all samples in sequence - possibly in random order?
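To make 2) concrete, this is roughly the kind of environment I have in mind - only a sketch, with the observation and reward left as placeholders (using the gymnasium API; older gym is analogous):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class BundledSamplesEnv(gym.Env):
    """Sketch of option 2): one episode = a bundle of random samples, each sample
    being a ~100-step "price" path. Observation and reward are placeholders."""

    def __init__(self, samples, bundle_size=1000):
        super().__init__()
        self.samples = samples            # np.ndarray of shape (num_samples, horizon)
        self.bundle_size = bundle_size
        self.action_space = spaces.Discrete(3)   # e.g. 0 = hold, 1 = buy, 2 = sell (fixed amount)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # draw a fresh random bundle from the total pool for this episode
        idx = self.np_random.choice(len(self.samples), size=self.bundle_size, replace=False)
        self.bundle = self.samples[idx]
        self.sample_i, self.t = 0, 0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.bundle[self.sample_i, self.t]], dtype=np.float32)

    def step(self, action):
        reward = 0.0   # placeholder: P&L of `action` on the current sample at step t
        self.t += 1
        if self.t >= self.bundle.shape[1]:               # current sample is finished
            self.sample_i += 1
            self.t = 0
        terminated = self.sample_i >= self.bundle_size   # done after bundle_size * horizon steps
        obs = self._obs() if not terminated else np.zeros(1, dtype=np.float32)
        return obs, reward, terminated, False, {}
```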
I have three actions, and so far they are discrete [0;1] in each step - like if in your implementation you could only buy or sell a fixed amount of stock. Just for environment simplicity for now; it will be continuous later. I might have read that some policies / algorithms are better with continuous actions - any thoughts on this?
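What I mean by discrete now vs. continuous later, roughly (the hold/buy/sell mapping is just how I think of it):

```python
import numpy as np
from gymnasium import spaces

# current: three discrete actions with a fixed trade size
discrete_actions = spaces.Discrete(3)   # e.g. 0 = hold, 1 = buy fixed amount, 2 = sell fixed amount

# later: a continuous amount in [-1, 1] (sign = direction, magnitude = fraction of a max size)
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
```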
Hyper params
I'm using PPO so far (with RLlib).
Can anything be said about sensible starting hyperparameters for the described situation? Not that they're not up for tuning, but I'm kind of in the dark as to where to start, even just roughly.
Not limited to, but maybe:
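Something along these lines is where I'm starting from - assuming a Ray 2.x-style PPOConfig builder, with "my_env" standing in for however the environment is registered; the numbers are just PPO defaults / common choices, not recommendations from this repo:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="my_env", env_config={"bundle_size": 1000})   # "my_env": placeholder name
    .training(
        gamma=0.99,              # how much the ~100-step horizon is discounted
        lr=5e-5,
        train_batch_size=4000,
        sgd_minibatch_size=128,
        num_sgd_iter=10,
        clip_param=0.2,
        lambda_=0.95,
        entropy_coeff=0.0,       # maybe > 0 to keep exploring across very different samples
    )
    .rollouts(num_rollout_workers=4)
)
algo = config.build()
```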
After a lot of trial and error it seems 1) will never work, and 2) is better in the sense that the reward is smoother and moving upwards, but 3) might be even better - just very compute intensive, of course. Not in itself a problem. On backtesting the result is still negative, even after training reaches a 3,000+ mean reward.
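What I mean by backtesting on never-seen samples, as a sketch (all_samples is just my placeholder name for the full (10,000, 100) array):

```python
import numpy as np

# hold out part of the pool so the backtest only sees samples never used in training
rng = np.random.default_rng(0)
idx = rng.permutation(len(all_samples))      # all_samples: placeholder, shape (10000, 100)
train_samples = all_samples[idx[:8000]]      # fed to the training environment
test_samples = all_samples[idx[8000:]]       # used only for backtesting the trained policy
```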
I hope someone has some insights based on experience. Thanks in advance - anything is appreciated.