[wrong] [ot] crazy journal

30 May 2021

Maybe wrong is a reasonable replacement for spam.  Not using my spam tag
seems a little worrisome.  Wrong might remind me not to use it if what I'm
saying is clearly right.

I worked on gnuradio blocks for a bit!  Yay!  Radios are exciting.

I'm stuck around my cognitive inhibitions with a stats problem.  I'm trying
to estimate which among a set of population histograms a sample is most
likely to fit, and I keep on freakin' out trying to make my brain do it
rightly.  Really I'd like to give a number to each option, so the user has
an idea of how certain to be, and about what.

I have two half-clang heuristics. One is to use a bernoulli distribution
for each bin to give the portion of samplings that would be a worse guess
for the bin, and take the product of that metric for all the bins.  I don't
remember it well: it gave a nice similarity metric from 0% to 100%.

The other half-clang heuristic is a matrix solution.  I make a matrix out
of all the population histograms, and solve for what vector multiplies them
to get the histogram in question.  This comes out with a nice result where
1.0 occupies choices that are precisely the same, but makes incredibly poor
guesses for situations where there is no really good choice.
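
A minimal sketch of what such a matrix heuristic might look like, in
Python.  The exact formulation isn't given above, so everything here is an
assumption: the function names are made up, each histogram is normalized
to sum to 1, and the weights come from an ordinary least-squares fit solved
through the normal equations.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("population histograms are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, n + 1):
                    aug[row][k] -= factor * aug[col][k]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mixing_weights(sample, populations):
    """Least-squares weights w so that sum_i w[i] * populations[i] ~ sample.

    Assumption: every histogram is normalized to sum to 1 before fitting.
    """
    def normalize(hist):
        total = sum(hist)
        return [h / total for h in hist]

    pops = [normalize(p) for p in populations]
    target = normalize(sample)
    # Normal equations: (P P^T) w = P target, where the rows of P are the populations.
    gram = [[sum(a * b for a, b in zip(p, q)) for q in pops] for p in pops]
    rhs = [sum(a * b for a, b in zip(p, target)) for p in pops]
    return solve_linear_system(gram, rhs)


if __name__ == "__main__":
    populations = [[50, 30, 20], [20, 30, 50], [10, 80, 10]]
    print(mixing_weights([2, 3, 5], populations))
```

Nothing in this fit keeps the weights non-negative or makes them sum to 1,
which may be one reason the numbers stop looking like certainties when no
population is a close match.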

I made the heuristics while playing around with actual probability and
statistics, but it isn't quite cognitively working for me.  At this point I
can barely think about it!  I can barely review my existing heuristics,
even.

Ideally, I'd like to output, for each histogram in the set, the actual
probability that it is the one the sampled histogram was drawn from.

I've never taken many probability or stats classes, and the few I did take
seemed to be just parroting things that were already obvious, so I didn't
attend to them well, and don't have much experience or training with these
things.

When I google things like this, it usually tells me to go through a song
and dance involving confidence intervals and significance and such.  These
are things I value, but as I guess I was a hacker I really value
understanding the things I use, and picking the best solution based on that
understanding.

One thing I've noticed that makes it a little easier for me, is that stats
descriptions can leave out the properties they are describing the statistic
of, which can make it more confusing.  The probability of something
happening is different than the probability of your guess about it
happening being right, which is different than the probability of it
happening if undescribed information about it is known, etc etc.  A stats
page might mention the distinction once and then assume everyone remembers,
and that's hard for me nowadays.

Notably, the probability of something happening given data, is different
from the probability of it happening in the real world.  That's confusing
to me.  It could be fun to model everything I measure to give it a good
prior thingy, but that's not even what I want: I don't want to know what is
most likely overall, I want to know what the data indicates is most
likely.  I'm not totally sure how to think about that, but the fundamental
concept of summing favorable events and dividing them by total outcomes
clearly assumes a uniform distribution of outcomes, and my brain does that
too when I consider things around me.  If you want to make a fair
comparison of things, assume they have a uniform distribution so the
comparison acts fairly.  I don't really know.

So, if we want to figure out the likelihood of one histogram being sampled
from a particular one among a set of distribution histograms, we could
obviously enumerate all possible histograms that could be sampled from each
one in the set, count how many from each are the same as the one we
sampled, and divide the count for the one in question by the total count
across the set.
I think I'm leaving something out there, statistically, when I consider it,
but it seems like a helpful grounding point.  The goal is clearly possible.
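
One way the counting idea above could be written down without enumerating
anything, as a hedged sketch in Python.  It rests on assumptions that are
not stated in the text: samples are drawn independently and with
replacement, so the chance of getting a given sampled histogram from a
population is the multinomial probability, and every population is taken
to be equally likely beforehand (the uniform assumption).  The function
names are made up for illustration.

```python
from math import exp, lgamma, log


def log_multinomial_likelihood(counts, probs):
    """Log-probability of drawing exactly `counts` given bin probabilities `probs`."""
    n = sum(counts)
    result = lgamma(n + 1)  # log(n!)
    for k, p in zip(counts, probs):
        result -= lgamma(k + 1)  # minus log(k!)
        if k > 0:
            if p == 0.0:
                return float("-inf")  # this population can never produce the sample
            result += k * log(p)
    return result


def posterior_over_populations(sample_counts, population_histograms):
    """Probability of each population, assuming all were equally likely beforehand."""
    log_likes = []
    for hist in population_histograms:
        total = sum(hist)
        probs = [h / total for h in hist]
        log_likes.append(log_multinomial_likelihood(sample_counts, probs))
    peak = max(log_likes)
    if peak == float("-inf"):
        raise ValueError("sample is impossible under every population")
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [exp(ll - peak) for ll in log_likes]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    populations = [[50, 30, 20], [20, 30, 50], [34, 33, 33]]
    print(posterior_over_populations([5, 3, 2], populations))
```

The outputs sum to 1 across the set, so each can be read as how certain to
be about that option, given only the data and the equal-prior assumption.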

With small histograms and relatively few samples, it would even be possible
to simulate the above brute-force solution, to give the probability a
histogram fits among a set, chart some data along every variable, and
empirically derive an equation of probability.
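
A rough sketch of that brute-force simulation in Python, under the same
kind of assumptions as any such sketch needs: draws are independent and
with replacement, each population gets the same number of simulated
trials (which amounts to treating them as equally likely beforehand), and
the function name and parameters are invented for illustration.

```python
import random
from collections import Counter


def simulate_posterior(sample_counts, population_histograms, trials=20000, seed=0):
    """Estimate each population's probability by counting matching simulated samples."""
    rng = random.Random(seed)
    n = sum(sample_counts)
    bins = range(len(sample_counts))
    target = tuple(sample_counts)
    matches = []
    for hist in population_histograms:
        hits = 0
        for _ in range(trials):
            # Draw n samples with replacement, weighted by the population histogram.
            draws = Counter(rng.choices(bins, weights=hist, k=n))
            if tuple(draws[b] for b in bins) == target:
                hits += 1
        matches.append(hits)
    total = sum(matches)
    if total == 0:
        return None  # no simulated sample matched; more trials would be needed
    return [m / total for m in matches]


if __name__ == "__main__":
    populations = [[50, 30, 20], [20, 30, 50]]
    print(simulate_posterior([2, 1, 0], populations, trials=5000))
```

Exact matches get rare quickly as the number of bins and samples grows, so
this only stays practical for small cases like the ones described above.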

But it seems like such a basic thing that, had I the education, I expect
there would be some formula solution one would know from the problem.

Karl

