Accumulation vs. replacement; model-free vs. model-based RL



  1. Accumulation vs. replacement; model-free vs. model-based RL

  2. Administrivia • Pseudo-HW3 today • Not graded • Worth doing anyway • Good for your soul... • ... better for your final exam. • We can discuss in class next Tues

  3. Today in history • Last time: • Action selection • Use of experience • Eligibility traces • SARSA(λ) • Today • Replacing vs accumulating traces • Thinking about eligibility • Model-free vs. model-based learning (?) • R3 discussion

  4. Presentation hints • Formal presentation to an audience • Trying to convince the audience of something • E.g., you have invented a great idea and proven that it works • Subtext: you’re smart and they should invest in you • Think of it as a sales pitch (sort of) • Get the core idea across • Don’t dwell on tedious detail • Don’t be fluffy

  5. Presentation hints • Practice! • Time will be tight -- time yourself • Get friends/colleagues to help you practice • Practice! • Think about order of material presentation • Practice!

  6. Presentation hints • Avoid • using • every • clever • PowerPoint • trick And be careful with cute but pointless images

  7. Presentation hints Oh, and avoid using bizarre fonts and really tiny font sizes just so that you can cram as much junk on the screen as possible. Remember: it’s more important that the audience actually understand your material than that you convey more ‘volume’ of material in the same time. It’s essentially pointless to ream through bunches of text or incredible amounts of math if nobody in the audience gets it. At best, they will be bored and zone out for most of your talk. At worst, they will be actively put off or annoyed by your presentation. And, presumably, you want them all to like you and be impressed with your material and ideas, so it’s counterproductive to antagonize your audience. Remember: at some point, your project, future funding, and/or job may depend on a presentation like this, so it behooves you to keep your audience happy. I have actually seen people give abysmally bad presentations and be completely rejected from the job opening because of their poor presentations. Now that that has been said, I still need to fill out this page with a large blob of text so that it’s as intimidating as possible. Honestly, I don’t expect anybody to actually read this far even in the online copy, let alone in class. If you do actually get this far while I’m flashing this page up in class, do please shout out. I’ll be most impressed and you’ll get brownie points for speed reading. Even if you happen to read this far in the online copy, please send me a note, just to satisfy my curiosity about who’s determined enough to get that far. Hm. Still half a page to fill. This is a pretty drastically condensed slide. Let’s see. Need more text. Maybe a little web mining... Ok, here we go: APRIL is the cruellest month, breeding / Lilacs out of the dead land, mixing / Memory and desire, stirring / Dull roots with spring rain. / Winter kept us warm, covering / Earth in forgetful snow, feeding / A little life with dried tubers. / Summer surprised us, coming over the Starnbergersee / With a shower of rain; we stopped in the colonnade, / And went on in sunlight, into the Hofgarten, / And drank coffee, and talked for an hour. / Bin gar keine Russin, stamm' aus Litauen, echt deutsch. / And when we were children, staying at the archduke's, / My cousin's, he took me out on a sled, / And I was frightened. He said, Marie, / Marie, hold on tight. And down we went. / In the mountains, there you feel free. / I read, much of the night, and go south in the winter. / / What are the roots that clutch, what branches grow / Out of this stony rubbish? Son of man, / You cannot say, or guess, for you know only / A heap of broken images, where the sun beats, / And the dead tree gives no shelter, the cricket no relief, / And the dry stone no sound of water. Only / There is shadow under this red rock, / (Come in under the shadow of this red rock), / And I will show you something different from either / Your shadow at morning striding behind you / Or your shadow at evening rising to meet you; / I will show you fear in a handful of dust. / Frisch weht der Wind / Der Heimat zu. / Mein Irisch Kind, / Wo weilest du? / 'You gave me hyacinths first a year ago; / 'They called me the hyacinth girl.' / —Yet when we came back, late, from the Hyacinth garden, / Your arms full, and your hair wet, I could not / Speak, and my eyes failed, I was neither / Living nor dead, and I knew nothing, / Looking into the heart of light, the silence. / Od' und leer das Meer.

  8. Presentation hints Oh yeah. Don’t switch slides too quickly.

  9. Presentation hints • Be sure to look at audience • Don’t just read from your slides • Don’t stare at screen whole time • Be careful w/ laser pointers • Practice!

  10. The Q-learning algorithm • Algorithm: Q_learn • Inputs: State space S; Act. space A • Discount γ (0<=γ<1); Learning rate α (0<=α<1) • Outputs: Q • Repeat { • s=get_current_world_state() • a=pick_next_action(Q,s) • (r,s’)=act_in_world(a) • Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a)) • } Until (bored)
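Slide 10 in runnable form: a minimal tabular Q-learning sketch in Python. The env object and its methods are hypothetical stand-ins for the slide’s get_current_world_state / act_in_world helpers, and ε-greedy is just one reasonable implementation of pick_next_action:

```python
import random
from collections import defaultdict

def q_learn(env, num_steps, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning, following the slide's pseudocode.

    The env interface is a hypothetical stand-in for the slide's helpers:
      env.current_state() -> s          (get_current_world_state)
      env.actions(s)      -> [a, ...]   (legal actions at s)
      env.act(a)          -> (r, s')    (act_in_world)
    """
    Q = defaultdict(float)              # Q[(s, a)], implicitly 0 for unseen pairs
    for _ in range(num_steps):
        s = env.current_state()
        # epsilon-greedy: one reasonable choice for pick_next_action
        acts = env.actions(s)
        if random.random() < epsilon:
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda act: Q[(s, act)])
        r, s2 = env.act(a)
        # off-policy backup: bootstrap from the greedy action at s'
        best_next = max(Q[(s2, a2)] for a2 in env.actions(s2))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```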

  11. SARSA-learning algorithm • Algorithm: SARSA_learn • Inputs: State space S; Act. space A • Discount γ (0<=γ<1); Learning rate α (0<=α<1) • Outputs: Q • s=get_current_world_state() • a=pick_next_action(Q,s) • Repeat { • (r,s’)=act_in_world(a) • a’=pick_next_action(Q,s’) • Q(s,a)=Q(s,a)+α*(r+γ*Q(s’,a’)-Q(s,a)) • a=a’; s=s’; • } Until (bored)
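And slide 11’s SARSA loop under the same hypothetical env interface. Factoring out pick_next_action makes the on-policy difference visible: a’ is chosen before the update, and the backup bootstraps from that same a’:

```python
import random
from collections import defaultdict

def pick_next_action(Q, s, env, epsilon=0.1):
    """Epsilon-greedy choice over the legal actions at s (one possibility)."""
    acts = env.actions(s)
    if random.random() < epsilon:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(s, a)])

def sarsa_learn(env, num_steps, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular SARSA; same hypothetical env interface as q_learn above."""
    Q = defaultdict(float)
    s = env.current_state()
    a = pick_next_action(Q, s, env, epsilon)
    for _ in range(num_steps):
        r, s2 = env.act(a)
        a2 = pick_next_action(Q, s2, env, epsilon)   # choose a' first...
        # ...then back up from Q(s', a'): the action actually executed next
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
    return Q
```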

  12. SARSA vs. Q • SARSA and Q-learning are very similar • SARSA updates Q(s,a) for the policy it’s actually executing • Lets the pick_next_action() function pick the action to update • Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q • Uses max_a to pick the action to update • which may differ from the action it actually executes at s’ • In practice: Q-learning will learn the “true” π*, but SARSA will learn about what it’s actually doing • Exploration can get Q-learning in trouble...

  13. Radioactive breadcrumbs • Can now define eligibility traces for SARSA • In addition to Q(s,a) table, keep an e(s,a) table • Records “eligibility” (real number) for each state/action pair • At every step ((s,a,r,s’,a’) tuple): • Increment e(s,a) for current (s,a) pair by 1 • Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’) • Decay all e(s’’,a’’) by factor of λγ • Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL

  14. SARSA(λ)-learning alg. • Algorithm: SARSA(λ)_learn • Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1) • Outputs: Q • e(s,a)=0 // for all s, a • s=get_curr_world_st(); a=pick_nxt_act(Q,s); • Repeat { • (r,s’)=act_in_world(a) • a’=pick_next_action(Q,s’) • δ=r+γ*Q(s’,a’)-Q(s,a) • e(s,a)+=1 • foreach (s’’,a’’) pair in (S×A) { • Q(s’’,a’’)=Q(s’’,a’’)+α*e(s’’,a’’)*δ • e(s’’,a’’)*=λγ } • a=a’; s=s’; • } Until (bored)
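A runnable sketch of the SARSA(λ) pseudocode, again under the hypothetical env interface from the earlier sketches. One liberty taken: instead of sweeping all of S×A, it touches only pairs whose trace is nonzero, a standard and equivalent shortcut:

```python
import random
from collections import defaultdict

def sarsa_lambda_learn(env, num_steps, gamma=0.9, alpha=0.1, lam=0.9, epsilon=0.1):
    """Tabular SARSA(lambda) with accumulating traces (hypothetical env)."""
    def pick(s):
        # epsilon-greedy again, as in the earlier sketches
        acts = env.actions(s)
        if random.random() < epsilon:
            return random.choice(acts)
        return max(acts, key=lambda act: Q[(s, act)])

    Q = defaultdict(float)
    e = defaultdict(float)        # eligibility per (s, a) pair, 0 by default
    s = env.current_state()
    a = pick(s)
    for _ in range(num_steps):
        r, s2 = env.act(a)
        a2 = pick(s2)
        delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
        e[(s, a)] += 1.0          # accumulating trace: +1 on each visit
        # The slide sweeps all of SxA; updating only nonzero-trace pairs
        # gives the same result, since e=0 pairs contribute nothing.
        for sa in list(e):
            Q[sa] += alpha * e[sa] * delta
            e[sa] *= lam * gamma
        s, a = s2, a2
    return Q
```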

  15. The trail of crumbs • [figure: eligibility traces decaying along the agent’s recent trajectory] • Sutton & Barto, Sec 7.5

  16. The trail of crumbs • λ=0 • [figure: with λ=0 only the most recent state/action pair stays eligible, recovering one-step SARSA] • Sutton & Barto, Sec 7.5

  17. The trail of crumbs • [figure: continued; Sutton & Barto, Sec 7.5]

  18. Eligibility for a single state • [figure: e(si,aj) over time, incremented at the 1st and 2nd visits and decaying in between] • Sutton & Barto, Sec 7.5

  19. Eligibility trace followup • Eligibility trace allows: • Tracking where the agent has been • Backup of rewards over longer periods • Credit assignment: state/action pairs rewarded for having contributed to getting to the reward • Why does it work?

  20. The “forward view” of elig. • Original SARSA did a “one step” backup: • Rt(1) = rt + γ*Q(st+1,at+1) • [figure: only rt and Q(st+1,at+1) are backed up into Q(s,a); the rest of the trajectory is ignored]

  21. The “forward view” of elig. • Original SARSA did a “one step” backup: • Could also do a “two step” backup: • Rt(2) = rt + γ*rt+1 + γ^2*Q(st+2,at+2) • [figure: two rewards from the trajectory are backed up into Q(s,a)]

  22. The “forward view” of elig. • Original SARSA did a “one step” backup: • Could also do a “two step” backup: • Or even an “n step” backup: • Rt(n) = rt + γ*rt+1 + ... + γ^(n-1)*rt+n-1 + γ^n*Q(st+n,at+n)

  23. The “forward view” of elig. • Small-step backups (n=1, n=2, etc.) are slow and nearsighted • Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects • Want a way to combine them • Can take a weighted average of different backups • E.g. (next slide):

  24. The “forward view” of elig. • [figure: a compound backup mixing two n-step returns with weights 1/3 and 2/3]

  25. The “forward view” of elig. • How do you know which number of steps to avg over? And what the weights should be? • Accumulating eligibility traces are just a clever way to easily avg. over all n: • Rtλ = (1-λ) * Σn=1..∞ λ^(n-1) * Rt(n)

  26. The “forward view” of elig. • [figure: the n-step returns weighted geometrically by λ^0, λ^1, λ^2, ..., λ^(n-1)]
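Why those weights form a proper average: the (1-λ)*λ^(n-1) factors are a geometric series that sums to 1. A quick standalone check (the λ value here is arbitrary):

```python
# Lambda-return weights are (1 - lam) * lam**(n - 1) for n = 1, 2, 3, ...
# By the geometric series they sum to exactly 1 in the limit, so the
# lambda-return really is a weighted average of the n-step returns.
lam = 0.9
partial = sum((1 - lam) * lam ** (n - 1) for n in range(1, 200))
print(partial)  # ~0.9999999992 -- approaches 1 as more terms are included
```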

  27. Replacing traces • The kind just described are accumulating e-traces • Every time you revisit a state/action pair, you add extra eligibility • There are also replacing eligibility traces • Every time you revisit a state/action pair, reset e(s,a) to 1 • Works better in some domains • Sutton & Barto, Sec 7.8
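The two variants differ by a single line of trace bookkeeping. A minimal sketch, assuming the same e(s,a) dictionary used in the SARSA(λ) sketch above; the per-step λγ decay is identical in both variants, only the visit-time bump changes:

```python
def bump_trace(e, s, a, replacing=False):
    """Visit-time trace update for the (s, a) pair just taken.

    Accumulating (Sec 7.5): revisits stack up, so e(s,a) can grow past 1.
    Replacing    (Sec 7.8): e(s,a) is reset to 1, capping the credit a
    frequently revisited pair can claim when a reward finally arrives.
    """
    if replacing:
        e[(s, a)] = 1.0
    else:
        e[(s, a)] += 1.0
```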
