NO.PZ2023091601000107
Question:
Suppose that there are four states and three actions, and that the current Q(S, A) values are as indicated in Table 14.2.
a. Suppose that on the next trial, Action 3 is taken in State 4 and the total subsequent reward is 1.0. If α = 0.05, what will the value of Q(4,3) be after it is updated using the Monte Carlo method?
b. Suppose that the next decision that has to be made on the trial we are considering turns out to be when we are in State 3, and that a reward of 0.2 is earned between the two decisions. Using the temporal difference method, we would note that the value of being in State 3 is currently estimated to be 0.9. If α = 0.05, what will the value of Q(4,3) be after it is updated using temporal difference learning?
Options:
Explanation:
a. The Monte Carlo method updates Q(S, A) toward the total subsequent reward G: Qnew(S,A) = Qold(S,A) + α[G − Qold(S,A)]. With α = 0.05, this leads to Q(4,3) being updated from 0.8 to: 0.8 + 0.05(1.0 − 0.8) = 0.81
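A minimal sketch of this update in Python (the function name and argument layout are illustrative, not from the source):

# Monte Carlo update: move Q(S, A) toward the total subsequent reward G.
def mc_update(q_old: float, total_reward: float, alpha: float) -> float:
    return q_old + alpha * (total_reward - q_old)

# Part (a): Action 3 in State 4, total subsequent reward 1.0, alpha = 0.05.
print(mc_update(q_old=0.8, total_reward=1.0, alpha=0.05))  # 0.81 (up to floating-point rounding)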
b. Suppose that when we take action A in state S we move to state S'. We can use the current value for V(S') to update as follows:
Qnew(S,A) = Qold(S,A) + α[R + γV(S') − Qold(S,A)]
where R is the reward at the next step and γ is the discount factor.
Thus, in this example, with the discount factor taken to be γ = 1 (so the estimated value V(3) = 0.9 enters the update undiscounted), the temporal difference method would lead to Q(4,3) being updated from 0.8 to: 0.8 + 0.05(0.2 + 0.9 − 0.8) = 0.815
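A matching sketch of the temporal difference update (again, names are illustrative; the default gamma = 1 mirrors the assumption in the solution above):

# TD update: Qnew = Qold + alpha * (R + gamma * V(S') - Qold).
def td_update(q_old: float, reward: float, v_next: float,
              alpha: float, gamma: float = 1.0) -> float:
    return q_old + alpha * (reward + gamma * v_next - q_old)

# Part (b): reward 0.2 between decisions, V(3) = 0.9, alpha = 0.05, gamma = 1.
print(td_update(q_old=0.8, reward=0.2, v_next=0.9, alpha=0.05))  # 0.815 (up to floating-point rounding)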
Follow-up question: does the 0.9 here need to be discounted?