I am a graduate student at Georgia Tech, currently taking RL. I can certainly say that the first book you mentioned (Sutton and Barto) is like the Bible for RL, and DeepMind's course by David Silver is the best course out there; it follows the book closely, so I recommend doing both together. For coding assignments, check out this Udacity nanodegree: https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893
Thanks for the recommendation, Aerys! I will also point the OP to the continuous-action-space-specific implementation details here: https://costa.sh/blog-the-32-implementation-details-of-ppo.html#https://www.notion.so/7d8d15d1b5dc45bca7791ff1f1946532
In this paper they present a new algorithm that reaches LSTD's level of performance without requiring explicit inversion. The downside is that this adds another hyperparameter.
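For anyone who hasn't seen it written out, here is a minimal NumPy sketch of plain LSTD(0) for linear value estimation, just to make the explicit inversion mentioned above concrete (the function name, regularizer, and batch format are my own choices, not from the paper):

```python
import numpy as np

def lstd(transitions, gamma=0.99, reg=1e-6):
    # transitions: list of (phi, r, phi_next) with feature vectors phi.
    d = transitions[0][0].shape[0]
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)  # A = sum phi (phi - gamma phi')^T
        b += r * phi                                # b = sum r phi
    # The explicit (regularized) solve/inversion that Zap-style methods avoid.
    return np.linalg.solve(A + reg * np.eye(d), b)
```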
Only yesterday did I finally decide to study the matrix gain update for a while on toy MDPs, and I noticed that the steady-state matrix, when inverted, results in one that does optimal credit assignment per update. In other words, the reason for LSTD/Zap's great performance is that they are doing model-based RL under the hood. The LSTD paper I've linked to actually says as much, but I went a long time without being aware of this at all. Such bad luck.
But this intuition is important: had I known it months ago, I'd definitely have tested Zap on RNNs to see if it leads to the net learning past backpropagation through time's truncation window.
Has this been tried?
Edit: I've tried it and it works about as well as SGD. That is one idea down; it was too greedy to possibly work anyway. One thing worth noting is that switching from the Cholesky-decomposition-based matrix inverse (which I use in KFAC to invert covariance matrices) to the LU-decomposition-based one for the sake of inverting the steady-state matrix slowed things down considerably.
The PolSA algorithm in the paper might be worth using if it were not for the note that it might not work with batch optimization. It might be worth testing whether that is true, but if it is, that would be bad: it would be unusable with RNNs, since BPTT counts as a batch update. I once tried updating the critic on each round of the backward pass and it made the network deeply unstable, even with Zap.
Other info:
No, I am with you on providing the results on all the seeds (as that would help others report the statistical uncertainty in your results). But reporting only a point estimate on those seeds, which is often done to report aggregate benchmark performance across tasks, is quite bad as it ignores the uncertainty in aggregate performance.
Re determinism: non-deterministic CUDA ops can make your code stochastic even in simulation (e.g., using JAX and TensorFlow on GPU), and there is a non-trivial cost to making hardware fully deterministic; see this paper for more details.
Also, replicating someone's results often requires the same hardware too; otherwise we get different results, especially in RL (even with identical seeds). From the PyTorch documentation: "Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds."
Use a vectorized env if possible; otherwise, what you want to do is pass a local CPU copy of the model to each process and periodically send each process the updated state dictionary of the model, as in the sketch below. Read the official page for more info.
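Here is a rough, minimal sketch of that pattern with torch.multiprocessing (my own toy example, not the official one; the Linear layer is a stand-in for your policy network):

```python
import torch.nn as nn
import torch.multiprocessing as mp

def worker(local_model, weight_queue):
    # Stand-in for a rollout worker: wait for the learner to publish new
    # weights, refresh the local CPU copy, then (in real code) keep
    # collecting experience with it.
    new_weights = weight_queue.get()
    local_model.load_state_dict(new_weights)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    learner_model = nn.Linear(4, 2)        # stand-in for your policy network
    queue = mp.Queue()
    p = mp.Process(target=worker, args=(nn.Linear(4, 2), queue))
    p.start()
    # After each learner update, ship a CPU copy of the weights to the worker.
    queue.put({k: v.cpu() for k, v in learner_model.state_dict().items()})
    p.join()
```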
https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893
Maaaybe you can find it on a certain unspecified Russian tracker.
Hi 👋,
The prerequisites are solid Python skills and some notions of Deep Learning, but there are no prerequisites in Deep RL.
If that's not the case yet, you can check these free resources:
Intro course in Python: https://www.udacity.com/course/introduction-to-python--ud1110
And this one in Deep Learning: https://www.udacity.com/course/deep-learning-pytorch--ud188
Hi, I use RL for research projects, but there are not too many projects right now. I'm trying to apply RL to automatic trading; you can write me privately if you're interested. I recommend taking a look at this nanodegree (https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).
I'm a mentor in this nanodegree; write me privately if you're interested :)
How about this microcontroller? It has a built-in Wi-Fi module, so it can deliver the sensor readings over Wi-Fi, right?
First, I am sorry, but I don't really see the point of this question; I find it too broad and too specific at the same time.
Secondly, the best tool when it comes to an OS is the one that you master the most. If by "RL" you mean programming (software development), studying (paper reading and web navigation), and running reinforcement learning algorithms, then both Windows and any Linux distro might fit.
That being said, my personal opinion is that Linux is way nicer to work with than Windows, specifically for this kind of workflow:
- It is often lighter and less bloated than Windows.
- Many great tools exist for programmers and scientists.
- It is the assumed default environment in domains such as CS research (you find installation instructions for things more easily on Linux).
Now, which Linux distro you might want to go for is another question. Here are three options that I would recommend:
- Ubuntu is the standard distribution, especially for scientific work. You will thus find a ton of guides, forums, and instructions that refer primarily to Ubuntu. It is kind of the safest choice.
- Linux Mint is a really great noob-friendly distro, based on Ubuntu. I would specifically recommend it if you come from Windows, as the graphical design is really close to it.
- PopOS is another Ubuntu-based, beginner-friendly distro, developed by the folks at System76. I think it is my preferred one among the three: it has the benefits of Ubuntu's wide support without most of its drawbacks, and the desktop environment is an enhanced version of GNOME. It is a really great, modern, and reliable distro.
All in all, you cannot really make a bad choice if you go with one of those three. I hope this helps a bit. Feel free to elaborate on your experience and needs for more precise recommendations.
Happy hacking and learning! :)
Udacity has 1 month free.
Anyone want to do the DRL nanodegree in one month with me?
https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893
I don't know if that is even doable. Compressing 15 hrs/week over 4 months into 1 month works out to 60 hrs/week, right?
If you don't mind spending money, I would recommend the Deep Reinforcement Learning Nanodegree from Udacity (https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893). It will take you about four months to complete if you are diligent.
It was created by leading practitioners in the field and the learning material is excellent. If you invest in the course and complete it, you will get a large return on your investment. After completing the program you will have four reinforcement learning projects under your belt that you can show future employers.
You can find additional content on recent experimental results with ETFs in the following presentation:
https://www.slideshare.net/KamerAliYuksel/how-to-invest-in-etfs-like-a-pro-220498662
Whether it is partially observable depends on how the learning problem is being modeled, as explained by /u/satya_gupta, but normally games like chess and backgammon are not categorized as partially observable.
For example, in perfect information two-player zero-sum games, you can do dynamic programming like value/policy iteration. This is essentially what AlphaZero does on a much larger scale (and using search to enhance policy improvement).
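To make "dynamic programming like value/policy iteration" concrete, here is a toy tabular value-iteration loop (the transition-table format is my own choice for the sketch; it is just the textbook update, not AlphaZero's machinery):

```python
import numpy as np

def value_iteration(P, gamma=0.99, tol=1e-8):
    # P[s][a] is a list of (prob, next_state, reward) triples for a small MDP.
    V = np.zeros(len(P))
    while True:
        # Bellman optimality backup for every state.
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s])))
            for s in range(len(P))
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```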
Using fixed opponent policies is a good example, because then it's really just an MDP, since you can think of the opponents as part of your transition function (see https://www.semanticscholar.org/paper/Solving-for-Best-Responses-in-Extensive-Form-Games-Greenwald-Li/85fcb8a3209b1b8ef5a7939a35946fe56dd73a8b for the reductions). But when all the agents are learning, things do indeed get weird; it can be even worse than partially observable, namely non-Markovian from the perspective of a single agent. The examples in this paper demonstrate the problem nicely: https://www.semanticscholar.org/paper/The-world-of-independent-learners-is-not-markovian-Laurent-Matignon/3be171b274728549e6a348dc40597e17284e7e36
Best book I've read so far:
https://www.amazon.de/Deep-Reinforcement-Learning-Python-distributional/dp/1839210680
It covers as much Python as necessary. The most important part of RL is the theory, and this book explains it really well in simple language.
Potentially... I'd like this account to stay anonymous, so I'll redact identifying details and the discussion of my current work and host it somewhere. The slides themselves also aren't super detailed (I kept most of that for the talk itself). Still, I hope they help someone!
I was curious whether my advice was sound, so I tried implementing it myself. You were right that this needs a model to match what was shown in the video, but without it TD(0) will still converge in about 50 steps. Forget what I said about averaging. Hopefully the F# code won't be too hard to understand.
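(For anyone who doesn't want to parse F#, the update being discussed is just tabular TD(0); here is a rough Python restatement, with the trajectory format being my own choice:)

```python
def td0(V, episodes, alpha=0.1, gamma=1.0):
    # V: array or dict of state values; episodes: lists of (s, r, s_next, done).
    for episode in episodes:
        for s, r, s_next, done in episode:
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # move V[s] toward the TD target
    return V
```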
It was in one of the lectures by Levine. TRPO needs second-order information to do approximate natural gradient updates, while PPO uses only first-order information, approximating the same effect by clipping the objective and making only small updates to the policy.
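For reference, the clipping being described is the standard PPO clipped surrogate; a minimal sketch (function and argument names are mine):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated and the data-collecting policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the clipped surrogate, i.e. minimize its negative mean.
    return -torch.min(unclipped, clipped).mean()
```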
One thing particularly interesting about natural gradient methods is that they are invariant to reward scaling.
In either case though, TRPO and PPO are PG methods that do Monte Carlo updates, so they are closer to supervised learning than Q-learning, which does bootstrapping. I've found MC updates to be significantly more stable than Q-learning, which is not exactly a surprise.
The video by Sutton on TD learning convinced me that MC updates are a dead end in the long run. PG methods are popular at the moment, but they have significant flaws.
Assuming that natural gradient updates are the key to stabilizing deep Q/TD learning, it is really a mystery what an architecture that does full natural gradient updates with just first-order information would look like.
There are some hints in this direction. It is rumored that the reason for batch norm's success is that it pushes standard gradient updates closer to natural gradient updates. Extensions of it that do whitening, like EigenNets and decorrelated batch norm, give even better results. Maybe there exist architectures capable of going all the way in an online fashion? RL could really use such a thing.
Even though I say that, from what I can tell batch norm does not actually seem to help RL for some reason. I'd like to know why.
I love the video. I'm going to hate myself for saying this, but to be very nitpicky I'll raise the point that NEAT is not, in its original formulation, an Evolution Strategy. It is, however, an evolutionary algorithm, which is a broader class.
Unless they used a meta-evolution strategy to optimise the parameters for NEAT, which would be pretty cool.
I researched this more: it seems "return" and "cumulative reward" are indeed the same. Here is an excerpt from http://www.scholarpedia.org/article/Reinforcement_learning:
> The goal of an RL algorithm is to select actions that maximize the expected cumulative reward (the return) of the agent.
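In the discounted setting that quantity is usually written (in Sutton & Barto's notation) as

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1},

so "return" and "cumulative (discounted) reward" really do name the same thing.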
I'm not sure you will find an option that's easier to configure than SageMaker + Docker. Either way, you will have to learn something new; I haven't found any options that aren't a little painful. Maybe someone else knows of one.
I've come across distributed TensorFlow on Google Cloud. It's probably similar to "offloading TensorFlow sessions", but I doubt it's painless.
This may not be applicable, but personally, if I want to scale ML tasks I start with a dedicated instance. Then, if I really need to scale (in production), I use Kubernetes Jobs: I set up a Docker container that takes command-line arguments as inputs and start Kubernetes Jobs that run and then spin down. It will probably be more complicated to set up, though, and some setups want you to allocate a set number of dedicated EC2 instances for the jobs to run on anyway.
I haven’t had a chance to read your code. I just implemented DQN last night in my game engine and this article was a huge help: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
This is very coincidental for me, since I've been working on this recently. I've been using this example from PyTorch:
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
However, it doesn't work for me. I downloaded it as an .ipynb and didn't modify it at all (besides increasing num_episodes), and it simply doesn't work.
Here is a graph of the durations: https://i.imgur.com/9chjrX7.png
Unfortunately I don't have much in the way of help for you, but hopefully this starts some discussion.
> The model input is a stack of 4 246x332 b/w images. I've verified this in the past.
I'm not talking about verifying the shape - I mean verifying the actual values in the tensor. If you're accidentally zeroing the data or otherwise incorrectly manipulating the frames, your model won't learn.
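A quick way to do that kind of check (my own helper, not from the tutorial): dump summary statistics of the frame stack right before it goes into the network.

```python
import torch

def check_frames(state: torch.Tensor) -> None:
    # Call this on whatever tensor you currently feed to the model.
    print("shape:", tuple(state.shape), "dtype:", state.dtype)
    print("min:", state.min().item(),
          "max:", state.max().item(),
          "mean:", state.mean().item())
    # All zeros, or values outside the range you expect (e.g. not in [0, 1]
    # after normalization), usually points to a preprocessing bug.
```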
> I haven't graphed those... I'll need to research pytorch tools to see how to best show this information.
The easiest way to do this is with a PyTorch TensorBoard writer. Check out https://pytorch.org/docs/stable/tensorboard.html and use add_histogram() for your gradients. This is also a useful tool for logging other information e.g. loss, return, reward per step, etc.
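Something like this is all it takes (the stand-in Linear model and dummy loss are just so the snippet runs on its own; swap in your network and real loss):

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Linear(8, 4)                            # stand-in for your DQN
writer = SummaryWriter(log_dir="runs/dqn_debug")

loss = model(torch.randn(32, 8)).pow(2).mean()     # dummy loss for the demo
loss.backward()

step = 0                                           # your global step counter
for name, param in model.named_parameters():
    writer.add_histogram("grad/" + name, param.grad, step)
writer.add_scalar("loss", loss.item(), step)
writer.close()
```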
> What do you mean correctly calculated?
Your return should probably be a discounted sum over rewards. Implementing discounting can be tricky (I know from previously screwing it up). Have you sat down to confirm that your return tensor's values are correct?
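One way to check it: compute the discounted returns with a simple backward pass and compare against your tensor (a generic helper, with gamma as an assumed discount factor):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    # Walk the episode backwards so each return reuses the one after it.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)

# e.g. discounted_returns([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.80, 2.00]
```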
I mean this with no disrespect: I think you should invest more heavily in checking that each component of the system is doing exactly what you think it's doing.
Matplotlib isn't made for generating images quickly, and as such is likely too slow for RL to be feasible on anything but incredibly simple tasks.
I can't say for sure without seeing your environment, but something like PyGame would probably be a better choice for you.
If your 'maze' is stored as a grid though, just feed it into the agent like that. It's already basically an image at that point (from a data perspective).
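For instance, something as simple as this (the toy grid values are made up) already gives you a one-channel "image" observation without rendering anything:

```python
import numpy as np
import torch

# 1 = wall, 0 = free cell; replace with however your maze grid is stored.
maze = np.array([[0, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]], dtype=np.float32)

obs = torch.from_numpy(maze).unsqueeze(0)   # shape (1, H, W), like a b/w frame
```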
If you are not sure, then you are definitely not using one.
I would recommend creating a virtual environment for each of your machine learning projects, so as to avoid the conflicts between libraries and dependencies that might occur if you were to use your native system's Python. Furthermore, it allows you to easily choose which version of Python to use.
I have been using MuJoCo 2 for a while now and have not had build errors (at least not on Ubuntu). I would recommend trying to install it inside a virtual environment. Here are some general guidelines I would suggest you try; come back if you get stuck.
1. Install Anaconda as your virtual environment manager from https://www.anaconda.com/distribution/ (the Python 3.7 version is recommended).
2. Create a virtual environment for your project with conda create -n mujoco-env python=3.7 -y
3. Once it has downloaded all the packages, activate the environment with conda activate mujoco-env. Once the environment is activated, you should see a prefix in your terminal such as (mujoco-env) ~/
4. Run pip install gym mujoco-py
This should get you a working installation of MuJoCo 2. To run your reinforcement learning scripts, just make sure the environment is activated, then run python myscript.py as you would normally.
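If the install went through, a quick smoke test like this should run without errors (HalfCheetah-v2 is one of the standard MuJoCo tasks in gym, using the old 4-tuple step API that mujoco-py-era gym exposes):

```python
import gym

env = gym.make("HalfCheetah-v2")            # any MuJoCo task will do
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(obs.shape, reward, done)
env.close()
```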
For more industrial applications of RL, check out this one: https://www.amazon.com/Reinforcement-Learning-Industrial-Applications-Intelligent/dp/1098114833/ref=pd_aw_sim_8/140-9243744-4332920?pd_rd_w=7WH5D&pf_rd_p=61e03cde-d57c-4984-9f73-f76bf2c32442&pf_rd_r=BYV126AZAB92YEB1TF2K&pd_rd_r=6e448c5...
If you want, you can download the real game that I modeled my Python game after: https://play.google.com/store/apps/details?id=com.differencetenderwhite.skirt
As the rules dictate, the point system is like this:
- You get as many points as there are blocks on the puzzle piece.
- Additionally, for each line you clear you get points: 1 line = 10 points, 2 lines = 30 points, 3 lines = 50 points, etc.
First I tried using that point system directly (with no penalty for invalid actions); that didn't work.
Then I tried a -0.01 penalty for invalid actions, but that didn't help. Maybe I need a more "neutral" point system, without worrying too much about how the real game does it; something like the sketch below.
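For example, here is a hedged sketch of a more neutral shaping (every constant is a guess to tune, not something from the real game beyond dividing its 10/30/50 line bonuses by 10):

```python
def shaped_reward(valid, blocks_placed, lines_cleared):
    # Invalid moves: a noticeably negative signal (stronger than the -0.01 tried above).
    if not valid:
        return -0.1
    # Line bonuses follow the real game's 10/30/50 pattern, scaled down by 10;
    # beyond 3 lines, continue the same arithmetic pattern.
    line_bonus = {0: 0.0, 1: 1.0, 2: 3.0, 3: 5.0}.get(
        lines_cleared, 2.0 * lines_cleared - 1.0)
    # Small positive signal per block placed, so legal moves are never worthless.
    return 0.1 * blocks_placed + line_bonus
```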
BTW, when I did the 2.5B-timestep training run (it took 3-4 days), it didn't work AT ALL. The agent couldn't perform a single correct step and kept performing the same invalid action.
This was recommended in another thread: Foundations of Deep Reinforcement Learning: Theory and Practice in Python (Addison-Wesley Data & Analytics Series) https://www.amazon.com/dp/0135172381/ref=cm_sw_r_cp_api_fab_VlEDFbQW9EM07
It's true. I used to do online RL applied to embedded systems. If you know your problem, you can do RL on a Beetle Arduino. Your problem defines your hardware needs.