Population estimates from samples
Karl had been asking about how to estimate population statistics from a sample. I came across a fascinating approach to this, called bootstrapping or the bootstrap method. I learned about it in an edX data science course offered by Berkeley.

The method is described in the course's free online textbook: https://inferentialthinking.com/chapters/13/2/Bootstrap.html

This method makes some assumptions, including that your sample is reasonably large and that it was drawn at random, so that it resembles the population. It does not require the population distribution to be normal. The basic approach is to take your sample and then repeatedly re-sample from it at random, with replacement, computing the statistic of interest each time. This builds up an empirical distribution of that statistic, which approximates how it would vary across fresh samples from the population. The text includes some worked examples.

The course uses Python (Jupyter notebooks) and a computational framework based on Pandas. The course sequence: https://www.edx.org/professional-certificate/berkeleyx-foundations-of-data-s...

It's the second course, "Inferential thinking through simulations," which introduces and builds on the bootstrap concept. Lots of the materials are free, including the computational framework and examples. I'm not sure whether you can audit the (self-paced) course for free.

I found the bootstrap method to be very interesting. It is not something that came up in my many grad and undergrad statistics courses or other research methods (maybe it emerged since I was a university student).

I hope this helps.
Greg
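The resampling idea described above can be sketched in a few lines of plain Python. This is only a rough illustration, not the textbook's code: it uses the standard library rather than the course's framework, the function name `bootstrap_percentile_interval` is made up for this example, and the data are synthetic stand-ins for one random sample.

```python
import random
import statistics

def bootstrap_percentile_interval(sample, stat=statistics.median,
                                  n_resamples=2000, level=0.95, seed=0):
    """Approximate an interval for `stat` by resampling `sample` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n values from the sample *with replacement*,
    # and we record the statistic of every resample.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Keep the middle `level` share of the resampled statistics.
    tail = (1 - level) / 2
    lo = estimates[int(tail * n_resamples)]
    hi = estimates[int((1 - tail) * n_resamples) - 1]
    return lo, hi

# Made-up, skewed data standing in for one random sample from a population.
data_rng = random.Random(42)
sample = [data_rng.expovariate(1 / 50) for _ in range(200)]

low, high = bootstrap_percentile_interval(sample)
print(f"sample median: {statistics.median(sample):.1f}")
print(f"approx. 95% interval for the population median: {low:.1f} to {high:.1f}")
```

The interval comes from the spread of the resampled medians, so no formula for the standard error of a median is needed. Swapping `stat` for `statistics.fmean` or any other function of a list gives the same procedure for a different statistic.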
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Monday, May 31, 2021 8:51 AM, Greg Newby <gbnewby@pglaf.org> wrote: ...
I came across a fascinating approach to this, called bootstrapping or the bootstrap method. I learned about this in an edX data science course offered by Berkeley.
The method is described in the course's free online textbook: https://inferentialthinking.com/chapters/13/2/Bootstrap.html
"Any method based on sampling has the possibility of being off. The beauty of methods based on random sampling is that we can quantify how often they are likely to be off." indeed :)
Just a note that getting familiar with Jupyter notebooks can get you into machine learning really fast. I've actually written a general equation solver for histograms; it's in the gr-blocks subfolder of my openemissions repo. Given histograms of known values, it computes a histogram of an unknown, using the expression I crazily posted. I'm very curious how and why the reference quantifies the relation between the properties of resampling a sample and the properties of the population.
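One way to probe that question empirically is a small simulation. This is a sketch under stated assumptions rather than anything taken from the textbook: build a synthetic population whose true mean is known, draw many independent samples from it, compute a bootstrap interval from each sample alone, and count how often the interval contains the true mean. The helper name `percentile_interval`, the exponential population, and all the sizes are arbitrary choices for illustration.

```python
import random
import statistics

def percentile_interval(sample, rng, n_resamples=200, level=0.95):
    """Bootstrap percentile interval for the mean (hypothetical helper)."""
    n = len(sample)
    # Means of n_resamples resamples, each drawn with replacement.
    means = sorted(statistics.fmean(rng.choices(sample, k=n))
                   for _ in range(n_resamples))
    tail = (1 - level) / 2
    return means[int(tail * n_resamples)], means[int((1 - tail) * n_resamples) - 1]

rng = random.Random(1)
# Synthetic, deliberately non-normal population, so the true mean is known.
population = [rng.expovariate(1.0) for _ in range(20000)]
true_mean = statistics.fmean(population)

trials = 200
hits = 0
for _ in range(trials):
    # A fresh random sample; the interval is built from this sample only.
    sample = rng.sample(population, 60)
    lo, hi = percentile_interval(sample, rng)
    hits += lo <= true_mean <= hi

coverage = hits / trials
print(f"intervals containing the true mean: {coverage:.0%}")
```

If resampling a sample mimics sampling the population, the share of intervals that contain the true mean should land near the nominal 95%. With small or strongly skewed samples it can fall somewhat short, which is one reason a reasonably large sample matters.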
participants (3): coderman, Greg Newby, Karl