Model problem - general help #1046
-
Thanks for your detailed description. Would it be better to start from a simplified version, then gradually work up to your final goal? Imagine that there will be bugs to fix; it is important to have clues about where they come from.
-
Hi
I've read all the really interesting blogs on Towards Data Science and looked through how you do it in this repo too. I've done a lot of trial and error and have gotten some way.
My current case has many similarities but also a few slightly different nuances, so I'm basically trying to get some general advice from more knowledgeable people :-). Basically: how would you approach such a thing in general if presented with it as described - like your go-to setup based on experience?
I'm trying to build a reinforcement learning model and am using PPO as a starting point.
Data:
Let's imagine one has a lot of "decoupled" time series, like in your basic stock trading scenario. Decoupled in the sense that one sample has a fixed number of time steps, like 100 (or very close to it). Maybe (I'm not a financial expert) it's conceptually a bit like a stock option, where you can buy one on a given day and then choose to sell it on a later day, but ultimately on some fixed future date it is either worth something (you buy the stock at a discount) or worthless (you lose the investment because the option has expired).
Imagine you have 10,000 samples of how such a conceptual thing plays out - "price" movements until termination. Of course the variance across the 10,000 samples can be great.
It's an incomplete-information situation in the sense that no one knows what the future will bring, unlike some Gyms where you can better guide / shape the reward in the direction of the goal, like in your stock price examples.
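Just so the shape of the data is concrete, here is a toy stand-in - purely made-up numbers, only the array shape matters:

```python
import numpy as np

# 10,000 independent samples, each a "price" path of ~100 steps that at the end
# is either worth something or worthless; the random walk here is only a placeholder
num_samples, horizon = 10_000, 100
rng = np.random.default_rng(0)
prices = np.abs(rng.normal(0.0, 1.0, size=(num_samples, horizon)).cumsum(axis=1))
print(prices.shape)  # (10000, 100)
```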
Goal:
Maximize the mean reward across samples - like predicting on the next, never-seen sample.
Environment
This is one of the fundamental things I'm struggling with. How should episodes be structured?
1) One sample per episode, i.e. roughly 100 steps, then reset to the next random sample from the pool?
2) Bundle samples together into one "series", e.g. 1,000 out of the 10,000 samples, terminate / return done after 1,000 * 100 steps, summing up the reward across all samples, and then reset the environment to the next 1,000 random samples from the total pool? (Sketched below.)
3) Like 2) but with all samples in sequence - possibly in random order?
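To make 2) concrete, this is roughly the kind of environment I have in mind - only a sketch, with the observation and reward left as placeholders (using the gymnasium API; older gym is analogous):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class BundledSamplesEnv(gym.Env):
    """Sketch of option 2): one episode = a bundle of random samples, each sample
    being a ~100-step "price" path. Observation and reward are placeholders."""

    def __init__(self, samples, bundle_size=1000):
        super().__init__()
        self.samples = samples            # np.ndarray of shape (num_samples, horizon)
        self.bundle_size = bundle_size
        self.action_space = spaces.Discrete(3)   # e.g. 0 = hold, 1 = buy, 2 = sell (fixed amount)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # draw a fresh random bundle from the total pool for this episode
        idx = self.np_random.choice(len(self.samples), size=self.bundle_size, replace=False)
        self.bundle = self.samples[idx]
        self.sample_i, self.t = 0, 0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.bundle[self.sample_i, self.t]], dtype=np.float32)

    def step(self, action):
        reward = 0.0   # placeholder: P&L of `action` on the current sample at step t
        self.t += 1
        if self.t >= self.bundle.shape[1]:               # current sample is finished
            self.sample_i += 1
            self.t = 0
        terminated = self.sample_i >= self.bundle_size   # done after bundle_size * horizon steps
        obs = self._obs() if not terminated else np.zeros(1, dtype=np.float32)
        return obs, reward, terminated, False, {}
```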
I have three actions, and so far they are discrete [0;1] in each step - like if in your implementation you could only buy or sell a fixed amount of stock. Just for environment simplicity for now; it will be continuous later. I might have read that some policies / algorithms are better with continuous actions - any thoughts on this?
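What I mean by discrete now vs. continuous later, roughly (the hold/buy/sell mapping is just how I think of it):

```python
import numpy as np
from gymnasium import spaces

# current: three discrete actions with a fixed trade size
discrete_actions = spaces.Discrete(3)   # e.g. 0 = hold, 1 = buy fixed amount, 2 = sell fixed amount

# later: a continuous amount in [-1, 1] (sign = direction, magnitude = fraction of a max size)
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
```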
Hyper params
I'm using PPO so far (with RLlib).
Can anything be said about sensible starting hyperparameters for the described situation? Not that they're not up for tuning, but I'm kind of in the dark as to where to start, even just roughly.
Not limited to, but maybe:
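Something along these lines is where I'm starting from - assuming a Ray 2.x-style PPOConfig builder, with "my_env" standing in for however the environment is registered; the numbers are just PPO defaults / common choices, not recommendations from this repo:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="my_env", env_config={"bundle_size": 1000})   # "my_env": placeholder name
    .training(
        gamma=0.99,              # how much the ~100-step horizon is discounted
        lr=5e-5,
        train_batch_size=4000,
        sgd_minibatch_size=128,
        num_sgd_iter=10,
        clip_param=0.2,
        lambda_=0.95,
        entropy_coeff=0.0,       # maybe > 0 to keep exploring across very different samples
    )
    .rollouts(num_rollout_workers=4)
)
algo = config.build()
```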
After a lot of trial and error it seems 1) will never work, and 2) is better in the sense that the reward is smoother and moving upwards, but 3) might be even better - just very compute intensive, of course. Not in itself a problem. On backtesting the result is still negative, even after training reaches a 3,000+ mean reward.
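What I mean by backtesting on never-seen samples, as a sketch (all_samples is just my placeholder name for the full (10,000, 100) array):

```python
import numpy as np

# hold out part of the pool so the backtest only sees samples never used in training
rng = np.random.default_rng(0)
idx = rng.permutation(len(all_samples))      # all_samples: placeholder, shape (10000, 100)
train_samples = all_samples[idx[:8000]]      # fed to the training environment
test_samples = all_samples[idx[8000:]]       # used only for backtesting the trained policy
```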
I hope someone has some insights based on experience. Thanks in advance - anything is appreciated.