[ot][spam][crazy] draft: learning RL

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Mon May 9 05:17:01 PDT 2022


On Mon, May 9, 2022, 8:15 AM Undiscussed Horrific Abuse, One Victim of Many
<gmkarl at gmail.com> wrote:

>
>
> On Mon, May 9, 2022, 8:14 AM Undiscussed Horrific Abuse, One Victim of
> Many <gmkarl at gmail.com> wrote:
>
>>
>>
>> On Mon, May 9, 2022, 8:12 AM Undiscussed Horrific Abuse, One Victim of
>> Many <gmkarl at gmail.com> wrote:
>>
>>>
>>>
>>> On Mon, May 9, 2022, 8:05 AM Undiscussed Horrific Abuse, One Victim of
>>> Many <gmkarl at gmail.com> wrote:
>>>
>>>> To represent normal goal behavior with maximization, the
>>>
>>> This is all still confusing to me, but normally when we meet goals we
>>> don't influence things unrelated to the goal. That constraint is not
>>> usually included in maximization, unless
>>>
>>>> return function needs to not only be incredibly complex, but
>>>
>>> the return being maximized were made to include them, perhaps by
>>> otherwise always being 1.0; I don't really know (a sketch of this idea
>>> follows below).
>>>
>>>> also feed back to its own evaluation, in a way not
>>>
>>> Maybe this relates to not learning habits that are unrelated to the goal
>>> and would influence other goals badly.
>>>
>>>> provided for in these libraries.
>>>
>>> But something different is doing the thinking at this time. It is the
>>> role of one part of a mind to try to relate to the other parts. Improving
>>> this in a general way is likely well known to be important.
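
Picking up the point above about a return that is otherwise always 1.0: here
is a rough sketch in plain Python of what that could look like. All names and
numbers are invented for illustration; this is not from any RL library.

    # Hypothetical sketch: a return that stays at 1.0 unless the agent either
    # misses the goal or influences things unrelated to it.
    # "goal_error" and "off_goal_change" are stand-in measurements (>= 0.0,
    # where 0.0 means nothing went wrong); they are not library calls.

    def episode_return(goal_error: float, off_goal_change: float) -> float:
        return 1.0 / (1.0 + goal_error + off_goal_change)

    print(episode_return(0.0, 0.0))  # 1.0: goal met, nothing else disturbed
    print(episode_return(0.0, 3.0))  # 0.25: goal met, but unrelated things changed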
>>>
>>>
>>>> Daydreaming: I'm thinking of how, in reality, we normally have many,
>>>> many goals going at once (most of them "common sense" and/or "staying a
>>>> living human"). Similarly, I'm thinking of how, with normal transformer
>>>> models, one trains against a loss rather than a reward.
>>>>
>>>> I'm considering what it would be like if the interesting event were an
>>>> agent _failing_ to meet a goal. Its reward would usually be full, 1.0,
>>>> but would be multiplied by losses when goals are not met.
>>>>
>>>> This seems much nicer to me.
>>>>
>>>
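
Trying the daydream out as code (again only a sketch; the per-goal losses and
the function name are invented for the example): the reward starts at a full
1.0 and is multiplied down by a factor for every goal that is not met.

    # Sketch of the "reward is usually full, multiplied by losses on failure"
    # idea. Each goal contributes a loss in [0, 1]; 0.0 means the goal was met.
    # The numbers below are made up.

    def multiplicative_reward(goal_losses: list[float]) -> float:
        reward = 1.0  # usually full
        for loss in goal_losses:
            reward *= 1.0 - loss  # an unmet goal multiplies the reward down
        return reward

    # many goals at once, most of them fine ("common sense", "staying alive"),
    # one of them badly missed:
    print(multiplicative_reward([0.0, 0.0, 0.0, 0.7]))  # ~0.3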
>> I don't know how RL works, since I haven't taken the course, but from a
>> distance it looks like it would just learn at a different (slower) rate
>> [with other differences].
>> > yes
>>
> I think it relates to the other inhibited concept, of value vs. action
> learning. A reward starts at just the event of interest, for example, but
> the system then learns to apply rewards to things that relate to that
> event, such as preceding time points [states].
>
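
That propagation can be seen in a tiny tabular value-learning sketch. This
uses a standard TD(0) update on a made-up 4-state corridor where only reaching
the last state gives reward; after repeated sweeps, the states preceding the
event of interest pick up (discounted) value too.

    # Tiny tabular TD(0) sketch: reward sits only at the final "event of
    # interest", but repeated updates push value back onto preceding states.
    # The corridor environment and constants here are invented for illustration.

    states = [0, 1, 2, 3]          # state 3 is the rewarding event
    values = [0.0] * len(states)   # V(s), initially zero everywhere
    alpha, gamma = 0.5, 0.9        # learning rate, discount factor

    for episode in range(50):
        for s in states[:-1]:                 # walk 0 -> 1 -> 2 -> 3
            next_s = s + 1
            reward = 1.0 if next_s == 3 else 0.0
            # TD(0): move V(s) toward reward + gamma * V(next state)
            values[s] += alpha * (reward + gamma * values[next_s] - values[s])

    print(values)  # roughly [0.81, 0.9, 1.0, 0.0]: earlier states gained value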
> In the end, what is important is what you are asking to change in the
> real world. If the final goal state has an infinite quantity, then
> maximization has been misused [still thinking, though; this leaked out].

>