Plot the histogram of a normal distribution

A simple interview question for data scientists. Since I am preparing for an interview, I need to choose whether to memorize the random number generator functions in the random module or in the numpy package. I choose the former, expecting that only the uniform and normal distributions will be needed.
Question:
Write a function that draws N samples from a population with mean = 0, SD = 1
and plot the histogram.

import random
import matplotlib.pyplot as plt
N = 100
x2 = [random.gauss(0,1) for _ in range(N)] ## this
plt.hist(x2)

Does the function run efficiently enough? In particular, is there any way to make the “this” step more efficient?
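Since the question asks for a function, here is a minimal sketch of the same idea wrapped in one, using only the random module for sampling. The names draw_samples and plot_histogram, and the seed and bins parameters, are illustrative assumptions and not part of the original snippet.

```python
import random


def draw_samples(n, mu=0.0, sigma=1.0, seed=None):
    """Return a list of n draws from a normal distribution (hypothetical helper)."""
    rng = random.Random(seed)  # a private generator, so a seed gives repeatable draws
    return [rng.gauss(mu, sigma) for _ in range(n)]


def plot_histogram(samples, bins=20):
    """Plot a histogram of the samples (hypothetical helper)."""
    # Imported here so that sampling works even where matplotlib is not installed.
    import matplotlib.pyplot as plt

    plt.hist(samples, bins=bins)
    plt.show()
```

Calling plot_histogram(draw_samples(100)) should reproduce the plot from the snippet above, up to the choice of bins.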

Sufficiently efficient, yes.

I think I read somewhere that a for: loop can be sped up by hardcoding the range() value so that the range() function doesn’t have to look up a variable value at each generation cycle–or at least that it’s faster to look up a constant than a variable. Whether you can do that or not depends on your intended use for the code, of course.

I’m guessing that your efficiency question was only about generating the gaussian distribution, but… so many plot libraries exist that some are sure to be more efficient than others.

I’ve only made plots with matplotlib and pandas (mostly for time series). I didn’t compare their speeds, but with pandas it was much easier to overlay plots as a composite. The rendering was also more polished. As a bonus, pandas is a purpose-built data science package, so its plotting and data massaging are all-inclusive.

I think a more efficient way is to use np.random.randn(N).
A for loop is generally slower than built-in functions or methods written in C/C++.
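A minimal sketch of the vectorized version, assuming NumPy is installed. It shows np.random.randn as mentioned above, plus the Generator interface that the NumPy documentation recommends for new code; the seed value is an arbitrary assumption for repeatability.

```python
import numpy as np

N = 100

# Legacy convenience function: N draws from the standard normal in one call.
x_legacy = np.random.randn(N)

# Generator interface; the seed is an arbitrary choice for this sketch.
rng = np.random.default_rng(seed=0)
x = rng.standard_normal(N)

print(x.shape)  # (100,)
```

Either array can be passed directly to plt.hist in place of the list.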

more efficient way

Absolutely. numpy.random.randn() generates an n-dimensional array in C code. However, John said that he chose the random module in the Standard Library rather than an external library.

To Kazuya’s point, you could write from <library> import <function> and only use the functions you need.
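A small sketch of what that form of import does, using the random module as a stand-in. One caveat worth knowing: Python still loads and runs the whole module, so this keeps the namespace tidy but does not reduce what gets loaded.

```python
import sys
from random import gauss  # binds only the name gauss in this namespace

# The full module was still loaded behind the scenes.
print("random" in sys.modules)  # True

sample = gauss(0, 1)  # called without the module prefix
```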

There is no One True Answer here, because you would need to know what the interviewer had in mind by “efficient”. It could be:

  • speed (CPU cycles)
  • memory usage
  • code simplicity (module imports could be viewed as “complexity”)
  • code readability (which includes transparency)
  • …and more

Or maybe they just want you to describe the ways that your code is efficient. An open-ended approach is usually better unless it’s a standardized exercise where you need to compare the solutions of multiple candidates.
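For the first two readings of “efficient”, a rough sketch of how one might put numbers on them with the Standard Library. The helper name draw and the repetition count are assumptions for illustration, and the figures will vary by machine and Python version.

```python
import random
import sys
import timeit

N = 100


def draw():
    """The list comprehension from the question (hypothetical wrapper)."""
    return [random.gauss(0, 1) for _ in range(N)]


# Speed: total wall-clock time for 1000 calls.
seconds = timeit.timeit(draw, number=1000)

# Memory: getsizeof on a list counts only the list itself,
# so the float objects it points to are added separately.
samples = draw()
container_bytes = sys.getsizeof(samples)
element_bytes = sum(sys.getsizeof(v) for v in samples)

print(f"speed: {seconds / 1000 * 1e6:.1f} microseconds per call")
print(f"memory: {container_bytes} bytes for the list, {element_bytes} bytes for the floats")
```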

Oh, I just overlooked that OP says “I choose the former”.

BTW, as OP imported plt, numpy is ready to use, and you can write plt.np.random.randn... :wink:

By Leland Parker via Discussions on Python.org at 17Jun2022 17:42:

> I think I read somewhere that a for: loop can be sped up by
> hardcoding the range() value so that the range() function doesn’t
> have to look up a variable value at each generation cycle–or at least
> that it’s faster to look up a constant than a variable. Whether you
> can do that or not depends on your intended use for the code.

??? He uses range() once. It looks up N when the range object is
made.

It is true in any language that computing something once instead of many
times is faster, but I don’t think that’s what’s happening with
range() - it is computed only once.
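One way to check this is to replace N with a function call and count how often it runs. This is only a sketch: get_n and calls are hypothetical names introduced for the demonstration.

```python
import random

calls = 0


def get_n():
    """Stand-in for looking up N; counts how many times it is evaluated."""
    global calls
    calls += 1
    return 100


samples = [random.gauss(0, 1) for _ in range(get_n())]

print(calls)         # 1
print(len(samples))  # 100
```

The argument to range() is evaluated a single time, before the first iteration, so a hardcoded constant would save one lookup in total.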

To my eye his code is efficient, and can really only be improved if
numpy has a bulk-gauss function of some kind.

> I’m guessing that your efficiency question was only about generating
> the gaussian distribution, but… so many plot libraries exist that
> some are sure to be more efficient than others.

I wouldn’t worry about the plot library itself. Your plot call is so
simple that switching to some other library would be trivial. And
irrelevant.

Cheers,
Cameron Simpson cs@cskk.id.au

This topic is a hypothetical scenario and probably more of a thought exercise than anything. It’s anyone’s guess whether the range() is run often enough to be an opportunity to “optimize”, whatever that means in the given context.

Kazuya’s recommendation of NumPy was explicitly ruled out by the OP, but it had more than enough merit to be a point of discussion, since the OP seemed to want to avoid importing NumPy. Importing only the function(s) used was a good compromise there; of course, you still need to install the package, and maybe the hardware is an IoT device with very little memory. Again, we can’t rule anything out because it’s an interview exercise and anything is on the table.

I took this topic as a brainstorming session.