A random walk over the states 0-5 is a Markov process: the next state depends only on the current state, e.g., the one-step transition probability P(4 | 3). If the process is second-order, with transitions of the form

P(X_{t+1} \mid X_t, X_{t-1}),

the Markov property can be restored by augmenting the state. Define S_t = (X_t, X_{t-1}); for a two-value chain (e.g., sunny/rainy weather), S_t takes the four values (s,s), (s,r), (r,s), (r,r), and the augmented random walk is again first-order Markov.
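A minimal sketch of this state-augmentation trick in Python; the sunny/rainy transition numbers are made up for illustration and are not from the slides:

```python
import numpy as np

# Hypothetical second-order weather chain P(X_{t+1} | X_t, X_{t-1}),
# indexed as p2[x_prev, x_curr, x_next] with 0 = sunny (s), 1 = rainy (r).
p2 = np.array([
    [[0.8, 0.2], [0.5, 0.5]],   # x_prev = s
    [[0.6, 0.4], [0.3, 0.7]],   # x_prev = r
])

# Augmented state S_t = (X_t, X_{t-1}) takes the four values
# (s,s), (s,r), (r,s), (r,r); build the first-order 4x4 matrix.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]      # (x_curr, x_prev)
P = np.zeros((4, 4))
for i, (x_curr, x_prev) in enumerate(states):
    for j, (y_curr, y_prev) in enumerate(states):
        if y_prev == x_curr:                   # histories must chain up
            P[i, j] = p2[x_prev, x_curr, y_curr]

print(P.sum(axis=1))   # each row sums to 1: a valid first-order chain
```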
Markov reward process (MRP)

MRP = Markov process + reward/utility function: every state S now carries a utility u(S). An MRP is specified by its state transition probabilities, its reward function, and its discount factor γ, i.e., by the tuple (S, P, U, γ).

An example MRP:

[Figure: a chain over states 0-5 with rewards 20, 5, 0, u(S=3) = 6, u(S=4) = 2, u(S=5) = 9 and transition probabilities 0.1, 0.9, 0.2, 0.8, 1.0, 1.0; state 3 moves to state 4 with probability 0.2 and to state 5 with probability 0.8.]

The reward is immediate: u(S) is collected on visiting S. Define H(S) as the expected total discounted reward "starting from here", i.e., accumulated from state S onward. H(S) can be computed by backward induction.
Backward induction starts at the terminal states, which have no successors:

H(S=4) = u(S=4) = 2,    H(S=5) = u(S=5) = 9.

One step back, state 3 reaches state 4 with probability 0.2 and state 5 with probability 0.8, so

H(S=3) = u(S=3) + \gamma \left[ 0.2\, H(S=4) + 0.8\, H(S=5) \right] = 6 + \gamma \, (0.2 \times 2 + 0.8 \times 9),

where γ ∈ (0, 1) is the discount factor. Continuing backward gives H(S=2) and H(S=1) in the same way.
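For concreteness, with the assumed value γ = 0.9 (the slides leave γ symbolic):

H(S=3) = 6 + 0.9 \times (0.4 + 7.2) = 6 + 0.9 \times 7.6 = 12.84.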
In general, the value of the current state is its immediate reward plus the discounted expected value of the next state:

H(S_t) = \mathbb{E}\left[ u(S_t) + \gamma\, H(S_{t+1}) \right],

or, written out over successor states,

H(S) = u(S) + \gamma \sum_{S' \in \mathcal{S}} P(S, S')\, H(S').

Backward induction needs terminal or absorbing states to anchor the recursion; an absorbing state transitions only to itself.

[Figure: a four-state chain (rewards 20, 5, 0, 6) in which every transition has probability 1.0, illustrating absorbing states.]

MRP value iteration. When backward induction is inconvenient, H can instead be computed iteratively: initialize H(S) ← 0 for all S ∈ S, then repeatedly sweep

H(S) \leftarrow u(S) + \gamma \sum_{S' \in \mathcal{S}} P(S, S')\, H(S')

until the values converge.
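Because the recursion is linear, H can also be obtained in closed form as H = (I - γP)^{-1} u. A minimal sketch showing that both routes agree, on a made-up two-state MRP (the matrix and rewards are assumptions, not the chain from the figure):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],     # hypothetical 2-state transition matrix
              [0.4, 0.6]])
u = np.array([5.0, 1.0])      # hypothetical rewards

# Closed form: solve (I - gamma * P) H = u.
H_exact = np.linalg.solve(np.eye(2) - gamma * P, u)

# Value iteration: sweep H <- u + gamma * P @ H until convergence.
H = np.zeros(2)
while True:
    H_new = u + gamma * P @ H
    if np.max(np.abs(H_new - H)) < 1e-10:
        break
    H = H_new

print(H_exact, H)   # the two results match
```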
Markov decision process (MDP)

MDP = Markov process + actions + reward functions: in each state the decision maker chooses an action, and different actions lead to different rewards and different transition probabilities.

[Figure: a current state (state 1) with two possible future states (2 and 3); action A1 earns Reward 20 and action A2 earns Reward 5, with transition probabilities 0.1 and 0.9.]
The new ingredient is the action set A, so an MDP is the tuple (S, A, P, U, γ). (Richer variants exist, e.g., the constrained MDP (CMDP) and the partially observable MDP (POMDP).) The reward of an action/decision now depends on both state and action, u(S, A), and so do the transitions, P(S, A, S').

A policy π : S → A maps each state S to an action A.

Bellman equation. The action values and state values satisfy

H(S, A) = u(S, A) + \gamma \sum_{S' \in \mathcal{S}} P(S, A, S')\, U(S'),

U(S) = \max_{A \in \mathcal{A}} H(S, A),

\pi(S) = \arg\max_{A \in \mathcal{A}} H(S, A).

The Bellman equation can again be solved by backward induction, which is exactly the value iteration algorithm, starting from values initialized to 0.
[Figure: the same example, state 1 with actions A1 (Reward 20) and A2 (Reward 5) and transition probabilities 0.1, 0.9.]

Initialize U_0(S) ← 0 for all S ∈ S, then iterate

H_{n+1}(S, A) = u(S, A) + \gamma \sum_{S' \in \mathcal{S}} P(S, A, S')\, U_n(S'),

U_{n+1}(S) = \max_{A \in \mathcal{A}} H_{n+1}(S, A).

Value iteration algorithm
For each state S: U_0(S) ← 0.
Repeat until convergence:
    For each state S:
        For each action A:
            Compute H_{n+1}(S, A) = u(S, A) + \gamma \sum_{S' \in \mathcal{S}} P(S, A, S')\, U_n(S').
        Compute and store \pi_{n+1}(S) = \arg\max_{A} H_{n+1}(S, A).
        Compute and store U_{n+1}(S) = \max_{A \in \mathcal{A}} H_{n+1}(S, A).
Return π(S), U(S) for all S ∈ S.

Policy iteration algorithm. An alternative solution method, similar in spirit to value iteration: initialize a policy π_0(S) for all S ∈ S, then alternate policy evaluation with policy improvement, producing π_{n+1}(S) ∈ A for all S ∈ S, until the policy stops changing.

Both methods rest on the principle of optimality. One value iteration sweep computes |A|·|S| action values at a cost of O(|A|·|S|^2). Value iteration is a fixed-point iteration: the optimal values solve an equation of the form f(x) = x, and repeatedly applying the update converges to that fixed point.
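A minimal policy iteration sketch under the conventions above; the array shapes and the tiny 2-state, 2-action test MDP are my own stand-ins, not a model from the slides:

```python
import numpy as np

def policy_iteration(P, u, gamma):
    """P: (A, S, S) transition tensor; u: (S, A) rewards."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)               # pi_0: arbitrary initial policy
    while True:
        # Policy evaluation: solve U = u_pi + gamma * P_pi @ U exactly.
        P_pi = P[pi, np.arange(S), :]          # (S, S) transitions under pi
        u_pi = u[np.arange(S), pi]
        U = np.linalg.solve(np.eye(S) - gamma * P_pi, u_pi)
        # Policy improvement: act greedily w.r.t. H(S, A).
        H = u + gamma * np.einsum('ast,t->sa', P, U)
        pi_new = H.argmax(axis=1)
        if np.array_equal(pi_new, pi):         # policy stable: done
            return pi, U
        pi = pi_new

# Tiny stand-in MDP: 2 actions, 2 states.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
u = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, u, gamma=0.9))
```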
An example. A four-state chain: S = {0, 1, 2, 3} and A = {Left, Right}.

[Figure: states 0, 1, 2, 3 in a line, with actions Left and Right moving one step.]

Reward: -1 for every step moved. Discount factor: 0.5. The transition matrices of this MDP are

P(A = \text{Left}) =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix},
\qquad
P(A = \text{Right}) =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix},

so state 0 is absorbing under both actions. Value iteration produces:

Period 1: H = (0.0, 0.0, 0.0, 0.0), action: /
Period 2: H = (0.0, -1.0, -1.0, -1.0)
Period 3: H = (0.0, -1.0, -1.5, -1.5)

The values converge to the policy of always moving Left, toward the absorbing state 0.
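A runnable value iteration sketch on exactly this MDP; it reproduces the period-by-period values above (the variable names are mine):

```python
import numpy as np

gamma = 0.5
# P[a][s, s'] for a in {0: Left, 1: Right}.
P = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],   # Left
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],   # Right
], dtype=float)
# Reward -1 for every step moved; staying in the absorbing state 0 is free.
u = np.array([[0, 0], [-1, -1], [-1, -1], [-1, -1]], dtype=float)

U = np.zeros(4)                                    # Period 1: all zeros
print("Period 1: H =", U)
for period in (2, 3, 4):
    H = u + gamma * np.einsum('ast,t->sa', P, U)   # H_{n+1}(S, A)
    U = H.max(axis=1)                              # U_{n+1}(S)
    print(f"Period {period}: H =", U)              # (0,-1,-1,-1), then (0,-1,-1.5,-1.5)
```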
MDP application example: a mobile wireless energy gateway.

RF energy can be transmitted and received (Tx/Rx); the received power is governed by the Friis formula and can be boosted with beamforming. Commercial hardware exists, e.g., the Powercaster Tx and Rx deployed as a charging station.
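The Friis formula mentioned above determines how much RF power actually arrives at the receiver; a small sketch (the transmitter numbers are illustrative assumptions, not taken from the slides):

```python
import math

def friis_received_power(p_tx_w, gain_tx, gain_rx, freq_hz, dist_m):
    """Friis free-space equation: P_r = P_t * G_t * G_r * (lambda / (4 pi d))^2."""
    wavelength = 3e8 / freq_hz
    return p_tx_w * gain_tx * gain_rx * (wavelength / (4 * math.pi * dist_m)) ** 2

# Illustrative numbers: a 3 W transmitter in the 915 MHz RF-harvesting band,
# a 6 dBi (~4x) Tx antenna, a unity-gain Rx antenna, 5 m apart.
p_r = friis_received_power(3.0, 4.0, 1.0, 915e6, 5.0)
print(f"Received power: {p_r * 1e6:.0f} microwatts")   # roughly 327 uW
```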
The scenario has three roles:
- Electricity chargers: at different fixed locations, e.g., power outlets, base stations.
- End users of energy: those who need energy, but are not covered by chargers.
- Mobile energy gateway: moves around and charges/transfers energy wirelessly.

Buy/sell energy:
- The energy gateway buys from chargers (charging). Each charger asks a certain price when charging.
- The energy gateway sells to end users (transferring). More users, more payments; a near user gets more energy, thus higher payments.
The mobile energy gateway thus serves the end users of energy over RF links, and its movement and charging/transferring decisions are formulated as an MDP.

State: S = (L, E, N, P), where L is the gateway's location, E its energy level, N the number of end users (N decides the end-user payment), and P the price asked by the charger.

Action: A = {0, 1, 2}.

Payment model: with N users, the distance l of the n-th nearest user has the order-statistic density

f(n, l \mid N) = \frac{3}{R \, B(N-n+1, n)} \left(\frac{l}{R}\right)^{3n-1} \left(1 - \frac{l^3}{R^3}\right)^{N-n},

and the expected payment earned from the n-th user when transferring energy E_S is

R(n, E_S) = \int_0^{R^\star} f(n, l \mid N)\, r(e_{D_n})\, dl + \int_{R^\star}^{R} f(n, l \mid N)\, r\!\left(\frac{g E_S}{l^2}\right) dl,

i.e., within the reference distance R^⋆ the user receives the full delivered energy e_{D_n}, while beyond it the received energy decays with the path loss g E_S / l^2. Summing over n gives the overall payment.

Transition probabilities: the slides list the action-dependent transition matrices, e.g., P(A=1) (with entries such as 0.3, 0.7) and P(A=0) (with entries such as 1.0, 0.0).

The MDP is solved with the value iteration algorithm, using pymdptoolbox (the Python MDP toolbox; mdptoolbox is the Matlab counterpart). Baseline schemes for comparison:
- Greedy scheme (GRDY): maximizing the immediate utility.
- Random scheme (RND): randomly taking any action (i.e., 0, 1, 2) from the action set.
- Location-aware scheme (LOCA): charging when at a charger, transferring when near end users.
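A minimal sketch of solving such a model with pymdptoolbox; the 3-action, 2-state arrays below are stand-ins, not the actual matrices from the slides:

```python
import numpy as np
import mdptoolbox   # pip install pymdptoolbox

# Stand-in model: 3 actions (0, 1, 2), 2 states.
# transitions has shape (A, S, S); every row must sum to 1.
transitions = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0
    [[0.3, 0.7], [0.3, 0.7]],   # action 1
    [[0.5, 0.5], [0.5, 0.5]],   # action 2
])
# rewards has shape (S, A): utility of taking action A in state S.
rewards = np.array([[0.0, 1.0, 0.5],
                    [0.5, 0.0, 1.0]])

vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, 0.95)
vi.run()
print(vi.policy)   # optimal action for each state
print(vi.V)        # optimal value of each state
```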