>'Simplified Prototype Model'
Haha, good thing too. :^) Nice graphic, btw.
>So I've been working on developing a model that combines MuZero, the Intrinsic Curiosity Module, Go-Explore, Hindsight Experience Replay and Divide-and-Conquer MCTS to solve SNES RPGs and am faced with some pretty tough questions to solve:
Q: how are you even doing that? In the most basic practical sense, I mean. Python-scripting modules to talk together?
>1. How can an agent learn to set its own abstract goals? For example, if an agent attacks an enemy with a certain spell, it may wish to go back and try a different spell on that enemy. Perhaps enemies don't re-spawn again in the same area and the agent must try it on a similar enemy in another area.
If an agent attacks an enemy with a certain weapon--a sword, say--and the enemy survives but observable damage occurs, then that's a clue the sword was at least somewhat effective, so maybe try a stronger one of the same class--let's say a bastard sword. Kind of like if a BFG worked some but wasn't quite there, then whip out the BFG9000 on him. If, on the other hand, not even the slightest damage occurred on the first attempt, then the algorithm probably should favor an alternate approach that doesn't involve that general class of weapon against the enemy at all.
So, keep track of some kind of 'score-card association' that's temporally-constrained and bounded by the set of current circumstances (stored in a set of dynamic Python dictionaries, say) for both the weapon class and that particular weapon. Then just re-sort the multi-variate dictionary after the first encounter using the updated scoring to find the next top three choices, pick one of those three at random, and go for it. This should vaguely simulate a reasonable 'choice' in the circumstances.
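Something like this maybe (a minimal sketch; the enemy/weapon names, the blend factor, and the top-three cut are all made up, not from any real framework):
[code]
import random

# scorecard[enemy_type][weapon] -> running effectiveness estimate
scorecard = {
    "slime": {"sword": 0.4, "bastard_sword": 0.0, "fire_spell": 0.0},
}

def update_score(enemy, weapon, damage_dealt, max_hp):
    """Blend the observed damage into the running score (simple moving average)."""
    old = scorecard[enemy].get(weapon, 0.0)
    scorecard[enemy][weapon] = 0.7 * old + 0.3 * (damage_dealt / max_hp)

def pick_weapon(enemy, top_n=3):
    """Re-sort the scorecard, then pick one of the top choices at random."""
    ranked = sorted(scorecard[enemy].items(), key=lambda kv: kv[1], reverse=True)
    return random.choice([w for w, _ in ranked[:top_n]])

update_score("slime", "sword", damage_dealt=12, max_hp=100)
print(pick_weapon("slime"))
[/code]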
>2. How can an agent bail on a goal that is not achievable? Suppose an agent took a wrong turn of choices in a visual novel and its waifu dies from its decisions. It's unable to go back and do something different it wishes it could do unless it restarts the game. How can the agent discern possible goals from impossible goals?
To continue the above scenario: if you've tried a couple of different choices and neither works, then you'd probably begin to move the Goal variable more towards flight mode and less towards fight mode. Fail one more time at it, say, then just haul ass. If you need to re-spawn as a result of a series of bad choices, then at the least you know not to take that precise sequence of choices again. You should be storing a record of the previous history of states, not just the last particular one. During each temporal snapshot of the upcoming play sequence, compare 'frame-by-frame' the similarity to previous temporal sequences (pruning out duplicate irrelevancies such as taking the same entrance into the single-entrance-only dungeon) and keep a 'running commentary' on the current progress, as it were. Discerning 'possible' from 'impossible' may prove, well, impossible. Do we always know the difference, for example? If humans tend to fail at a type of endeavor, then in general it's not unreasonable at this point in history to presume an AI will too. But don't let the impossible stop you Anon, haha. After all, does the bumblebee know it can't fly? Note: we did finally figure that one out in the end heh.
>3. How can it transfer that desired goal to a realistic goal? This is similar to the above two questions. In the case of Question 1 it wants to transfer that goal to attacking a similar enemy in a different area with the same spell. In the case of Question 2, it wants to make sure it doesn't make the same mistake again that got its waifu killed by transferring that desired goal to protect another waifu.
So part a might just use the same type of sword against a similarly-classed enemy in a slightly different circumstance, based on the scoring approach mentioned above. For part b, it might use the previous encounter's 'commentary playback stream' mentioned above to make a brief analysis of the current circumstances, then tend to randomly choose slight variations early on during the encounter to potentially alter the outcome (if it was a bad one), or tend to reinforce the previous choice sequences (if it was a good outcome).
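One crude reading of part b in code form (the window size and deviation probability are illustrative guesses, nothing more):
[code]
import random

def choose_action(step, remembered_actions, last_outcome_good, alternatives,
                  early_window=3, deviation_prob=0.5):
    """Replay a good run; vary a bad run early on to try to alter the outcome."""
    remembered = remembered_actions[step] if step < len(remembered_actions) else None
    if remembered is None:
        return random.choice(alternatives)        # off the recorded path: explore
    if last_outcome_good:
        return remembered                         # reinforce a sequence that worked
    if step < early_window and random.random() < deviation_prob:
        others = [a for a in alternatives if a != remembered]
        if others:
            return random.choice(others)          # slight early variation
    return remembered
[/code]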
>4. How can an agent be instructed to perform abstract goals with difficult to describe success conditions without a reward function? MERLIN (arXiv:1803.10760) provided some insight to this by training an agent to respond to entered text commands and getting a reward once it was achieved. However, it is limited by what you can implement as a reward function. Ideally you want to be able to instruct the agent to do many things. Something as simple as asking an agent to run in a circle is extremely difficult to implement into a reward function and only applicable to that one task.
By enforcing behavioral dictates at a level above the straightforward reward-function-only level, maybe? When all else fails, the agent can just rely on a pre-programmed set of directives provided by the oracle (you, the developer, ofc). For an example analogy, say a Christian faces a conundrum: what to do about Satanism being promoted on your previous imageboard, a circumstance you yourself allowed to take root simply by ignoring it. That Christian might ignore his own past failures and any merely social current embarrassments and look outward to the Bible--a guidebook directed for him by a higher Oracle--for guidance. In some similar sense you might direct particular outcomes for the agent at a higher level, when the lower-level systems such as reward mechanisms fail in a given circumstance.
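In code terms that could be as simple as a thin wrapper that checks the oracle's directives before deferring to whatever the reward-trained policy wants (the rule format and names here are invented for illustration):
[code]
def make_overseen_policy(learned_policy, directives):
    """directives: list of (condition(state) -> bool, action) pairs, checked in order."""
    def act(state):
        for condition, action in directives:
            if condition(state):
                return action            # hard dictate wins over the learned policy
        return learned_policy(state)     # otherwise trust the trained behavior
    return act

# Example: never attack a friendly NPC, no matter what the reward says.
directives = [(lambda s: s.get("target_is_friendly", False), "hold")]
policy = make_overseen_policy(lambda s: "attack", directives)
print(policy({"target_is_friendly": True}))   # -> "hold"
[/code]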
>5. How can novelty search be enhanced with a value function? There's biological evidence that dopamine release in animals and human beings is enhanced when the perceived value of the novelty is high, whether it's beneficial or an unforeseen threat. Should the value function be merely based off survival of the agent's identity? How can and should the agent's identity expand and develop as it gains experiences? For example, it might not control the other party members but they are working together as one unit. It seems like this would require implementing some sort of abstract identity the agent is trying to preserve while exploring novel states.
<some sort of abstract identity the agent is trying to preserve while exploring novel states.
Precisely. The Theory of Mind could be valuable here. Mere survival is a baser instinct, and one we humans share with other biological systems around the planet. But as a human being, higher-order priorities may come into play: self-sacrifice for the greater good. A soldier throwing himself on top of a grenade tossed into their bunker to save all his buddies is a decent example of this. Animals won't do this, but humans might. Animals don't seem to carry an internal 'sense of personhood' (aka Theory of Mind), but normal humans obviously do. These are more or less philosophical questions you're asking in the end. Well-worn-path philosophical answers may prove valuable here Anon.
>Also a thought I've had for implementing goals is to represent them as a change in the state's latent variables. If the state has latent variables for counting money, a goal vector to increase money would be mostly zero except for positive values for the variables that count money. But I don't think it will work out that simply because the network will learn its own compressed encoding to store more information.
As indicated in my responses above to 1 & 2, keeping a running, multi-variate scorecard in a set of dictionaries might help you sort through this problem Anon. It's also pretty much directly suited to DoD (data-oriented design), which by now is a very tried-and-true programming approach to vidya dev.
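Just to make your latent-goal idea concrete, here's a toy version of it (it assumes you somehow know which latent indices count money, which, as you say yourself, a learned compressed encoding won't hand you):
[code]
import numpy as np

latent_dim = 16
money_idx = [3, 4]            # hypothetical: pretend we know the money-counting latents

goal = np.zeros(latent_dim)
goal[money_idx] = 1.0         # 'increase money', zero everywhere else

def goal_progress(z_before, z_after):
    """Score a transition by how well the latent change aligns with the goal."""
    return float(np.dot(z_after - z_before, goal))
[/code]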
In fact, many of the issues you're bringing up here have already seen attempted answers by vidya dev teams in the past, some approaches more successful, some less so. It might be informative to research the literature that's out there on this topic, as well as the guidebooks that exist. The GPU Gems set of collections comes to mind for me here.
Good luck. Great questions Anon.