Loops, arrays, dictionaries -- oh my

ncnels · October 12, 2022, 7:15pm

Is there a way to carry out this set of instructions using a list or dictionary? Or what would your approach be to make the code more efficient? Thx!

steven.daprano · October 12, 2022, 8:59pm

Please don’t post screenshots of code. We don’t use Photoshop to edit code.

Please copy and paste the code, as text, and place it between “code fences”:

```
code goes here
```

You can also use three tildes ~~~ instead of backticks.

ncnels · October 12, 2022, 10:19pm

Sorry about that.

import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
i = 0
for spectrum in y2_arrayT:
    #  bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])
    
    diff1 = spectrum - bkg1
    diff2 = spectrum - bkg2
    diff3 = spectrum - bkg3

    if i == 0:
        d1 = diff1
        d2 = diff2
        d3 = diff3
        i += 1
        continue

    d1 = np.vstack((d1, diff1))
    d2 = np.vstack((d2, diff2))
    d3 = np.vstack((d3, diff3))

print(d1, 2*'\n', d2, 2*'\n', d3)

cameron · October 12, 2022, 10:46pm

Is there a way to carry out this set of instructions using a list or
dictionary? Or what would your approach be to make the code more
efficient? Thx!

Do you mean: instead of using np.array?

You can certainly do all of this with lists - semantically your 2x3
np.array can be done with nested lists.

But I wouldn’t. numpy (a) has a huge suite of nice math functions for
bulk computation, which come into play when you use things like
np.vstack or spectrum - bkg1 and (b) those operations are far
faster than what you’d get replicating them in pure Python, because they
will be vectorised and run at essentially machine speed - there’s a
little cost to orchestrating the setup, but once kicked off the gains
far outweigh the setup because of their bulk nature.
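
As a small illustration of the point above, here is the same subtraction written once with nested lists and once with numpy. The names `spectrum_list` and `bkg_list` are made up for this sketch; the values are the toy numbers from the question.

```python
import numpy as np

spectrum_list = [1, 2, 3]
bkg_list = [[1, 2, 3], [4, 5, 6]]

# Pure Python: an explicit loop over rows and elements.
diff_list = [
    [s - b for s, b in zip(spectrum_list, row)]
    for row in bkg_list
]

# numpy: one expression; the looping happens inside numpy's compiled code.
diff_array = np.array(spectrum_list) - np.array(bkg_list)

print(diff_list)            # [[0, 0, 0], [-3, -3, -3]]
print(diff_array.tolist())  # [[0, 0, 0], [-3, -3, -3]]
```

Both give the same numbers; the numpy form is shorter and should scale better as the arrays grow.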

numpy will only be less efficient if you’re mangling some natural and
efficient method to use numpy methods where the numpy methods are a
bad fit to the task.

If numpy expresses what you’re doing clearly and directly, just use
it.

I’ve got a couple of very minor comments/queries about the code:

for spectrum in y2_arrayT:
   #  bkg1-3 change for each spectrum in y2_arrayT
   bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
   bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
   bkg3 = np.array([[0, 0, 0], [0, 0, 0]])

It seems to me that the comment is incorrect; bkg1 et al are set up
the same way each time and not modified, and so you could define them
just once before the loop. Unless these are special-cased for the
example, adapted from a larger, more general piece of code.

print(d1, 2*'\n', d2, 2*'\n', d3)

Any reason for 2*'\n' instead of just '\n\n' here?

Cheers,
Cameron Simpson cs@cskk.id.au

ncnels · October 13, 2022, 3:44pm

Thanks for your comments Cameron

No, I mean using np.array, but just condensing all the repeat code. y2_arrayT in the actual code (the one above is structurally identical) is 11x1037. The original data is 11 csv files with 1037 data points and I read that it is best to convert the dataframe created on importing the files into an array for manipulation.

Yes, the comment is not valid for the above code, but using actual data the bkg1 et al. are a function of ‘spectrum’. No reason, I guess 2*'\n' popped into my head instead of '\n\n'.

cameron · October 13, 2022, 10:48pm

No, I mean using np.array, but just condensing all the repeat code.
y2_arrayT in the actual code (the one above is structurally identical)
is 11x1037. The original data is 11 csv files with 1037 data points and
I read that it is best to convert the dataframe created on importing
the files into an array for manipulation.

It would be interesting to see that recommendation for its context. AIUI
a DataFrame is a collection of Series (the columns) and a Series usually
contains a numpy array anyway.

I am not a numpy expert, just starting to use it myself, so take all my
comments with caveats.

Yes, the comment is not valid for the above code, but using actual data the bkg1 et al. are a function of ‘spectrum’.

Ah ok. My comments were more general numpy vs pure Python. Let’s revisit
the code a little. I think there’s certainly scope for DRYing it up
(“don’t repeat yourself”) and also for performance improvement around
the vstack. But if you’re really only computing 3 things removing the
repetition may only be worth it if the loop body becomes more complex.

 y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
 i = 0
 for spectrum in y2_arrayT:

So each spectrum is a 1-d ndarray, eg: [1, 2, 3].

     #  bkg1-3 change for each spectrum in y2_arrayT
     bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
     bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
     bkg3 = np.array([[0, 0, 0], [0, 0, 0]])

And a bkg (background?) is a 2x3 ndarray (or maybe n x 3 in
reality?)

     diff1 = spectrum - bkg1
     diff2 = spectrum - bkg2
     diff3 = spectrum - bkg3

Subtracting one from the other gets you another 2x3 ndarray in the
diff variable.
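
A quick way to check those shapes, as a minimal sketch using the toy values from the question: the 1-d `spectrum` is broadcast against each row of the 2x3 `bkg`.

```python
import numpy as np

spectrum = np.array([1, 2, 3])            # shape (3,)
bkg = np.array([[1, 1, 1], [1, 1, 1]])    # shape (2, 3)

# The 1-d spectrum is subtracted from (broadcast against) every row of bkg.
diff = spectrum - bkg

print(spectrum.shape, bkg.shape, diff.shape)  # (3,) (2, 3) (2, 3)
print(diff.tolist())                          # [[0, 1, 2], [0, 1, 2]]
```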

I could imagine writing a loop to iterate over the 3 flavours of bkg:

 for bkg_i in range(3): # counts 0, 1, 2
     # compute one of the bkgs using bkg_i
     bkg = func(spectrum, bkg_i)
     diff = spectrum - bkg

and then accumulate the diff in a list of diffs for that bkg_i
value.

This avoids repeating the bkg setup and difference. If you’re doing more
stuff in there, this avoids tedious (and error prone) repetition.

The accumulation is to aid replacing the vstack which follows.
Repeatedly vstacking has a cumulative cost because it copies things,
potentially many times. Instead we can collect the diffs and
concatenate them once at the end.

So something shaped like this (untested):

 # in case you want something more general
 n_bkgs = 3

 # make a list of list-of-diffs, one per bkg
 # note: _not_ [[]] * n_bkgs, for reasons I can explain if needed
 diffs = [ [] for _ in range(n_bkgs) ]

 for spectrum in y2_arrayT:
     # counts 0, 1, 2
     for bkg_i in range(n_bkgs):
         # compute one of the bkgs using bkg_i
         bkg = func(spectrum, bkg_i)
         diff = spectrum - bkg
         diffs[bkg_i].append(diff)
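
The comment in the sketch above warns against `[[]] * n_bkgs`. A short demonstration of the likely reason, using only plain lists (the names `shared` and `separate` are made up here): multiplying a list repeats references to one inner list, so an append through any slot shows up in all of them.

```python
n_bkgs = 3

# Multiplying the outer list repeats a reference to the *same* inner list.
shared = [[]] * n_bkgs
shared[0].append("x")
print(shared)                  # [['x'], ['x'], ['x']]
print(shared[0] is shared[1])  # True

# The comprehension builds a new inner list on every iteration.
separate = [[] for _ in range(n_bkgs)]
separate[0].append("x")
print(separate)                # [['x'], [], []]
```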

Then at the end, outside the loops, you can go:

 stacked_diffs = [
     np.concatenate(diffs[bkg_i])
     for bkg_i in range(n_bkgs)
 ]

and have a list stacked_diffs with the stacked diff arrays for each
bkg_i.
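
Putting the two pieces above together gives something runnable. This is only a sketch: `make_bkg` is a hypothetical stand-in for `func` (the real backgrounds depend on the spectrum in a way not shown in this thread), and here it just returns a constant 2-row background.

```python
import numpy as np

def make_bkg(spectrum, bkg_i):
    """Hypothetical stand-in for func: a 2-row background filled with bkg_i."""
    return np.full((2, spectrum.size), bkg_i)

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
n_bkgs = 3

# One independent list of diffs per background.
diffs = [[] for _ in range(n_bkgs)]

for spectrum in y2_arrayT:
    for bkg_i in range(n_bkgs):
        bkg = make_bkg(spectrum, bkg_i)
        diffs[bkg_i].append(spectrum - bkg)

# Join each list of 2x3 diffs along the first axis, once, at the end.
stacked_diffs = [np.concatenate(diffs[bkg_i]) for bkg_i in range(n_bkgs)]

for stacked in stacked_diffs:
    print(stacked.shape)  # (4, 3) each time
```

Swapping in a real background function should only require changing `make_bkg`.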

I suspect you do not need to unpack the DataFrame you had, but we’d need
to know its shape and indices in order to see what should be done to use
it directly. Basically I expect you could just pull things straight out of
the DataFrame to get the equivalent of y2_arrayT, then process the
same way. And maybe put the diffs back into the DataFrame as new
columns, depending on their shape and whether that made sense or was
even useful.

Regarding the vstack replacement with concatenate, this:

 diffs[bkg_i].append(diff)

is practically free, because it is just appending a reference to the
array in diff to the list (diffs[bkg_i]). Costs nothing, and makes no
copies. Then concatenate copies the diffs just once at the end. Doing
repeated vstacks copies the diff and the
accumulated-copy-of-earlier-diffs on every loop iteration. That is much
more expensive.
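
A minimal sketch of the two approaches side by side, with made-up 1x3 rows standing in for the diffs. It checks only that the results match, not the timing. It also shows one thing to watch for: if each diff is 1-d rather than 2-d, `np.concatenate` joins them end to end, while `np.vstack` stacks them as rows.

```python
import numpy as np

rows = [np.array([[1, 2, 3]]) * k for k in range(1, 5)]  # four 1x3 arrays

# Repeated stacking: every np.vstack call allocates a new array and
# copies everything accumulated so far.
acc = rows[0]
for row in rows[1:]:
    acc = np.vstack((acc, row))

# Collect references in a list, then copy once at the end.
collected = []
for row in rows:
    collected.append(row)
once = np.concatenate(collected)

print(np.array_equal(acc, once))  # True
print(once.shape)                 # (4, 3)

# With 1-d pieces the two functions differ.
flat_rows = [r[0] for r in rows]          # four arrays of shape (3,)
print(np.concatenate(flat_rows).shape)    # (12,)
print(np.vstack(flat_rows).shape)         # (4, 3)
```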

Finally, a bit of Python generic criticism. Your loop went:

 i = 0
 for spectrum in y2_arrayT:
     ... stuff ...
     if i == 0:
         d1 = diff1
         d2 = diff2
         d3 = diff3
         i += 1
         continue

     d1 = np.vstack((d1, diff1))
     d2 = np.vstack((d2, diff2))
     d3 = np.vstack((d3, diff3))

This is to split the initial/first diff (eg d1) from the accumulation
with vstack for the second and following diffs. So this is a 2-value
state: first run and later runs. I usually write that like this:

 first = True
 for spectrum in y2_arrayT:
     ... stuff ...
     if first:
         d1 = diff1
         ... etc ...
         first = False
     else:
         d1 = np.vstack(.....)
         ... etc ...

Not that you should need this now. An alternative construction, which
is effectively what the diffs=[.....] setup above is, looks like:

 # empty lists
 diffs = [......]
 for spectrum in y2_arrayT:
     ... stuff ...
     ... append to diffs in some way ...

which is always the same code, with no if first: test because the
first append is appending to an empty list, just like the later ones.

Finally, if you actually cared about the row/index of the spectrum
you’re doing, you can do:

 for spectrum_i, spectrum in enumerate(y2_arrayT):

This iterates spectrum through y2_arrayT exactly as before, but also
provides spectrum_i being the index of the spectrum counting 0, 1,
2, etc. enumerate counts from 0 by default but you can have it start at
one or some other number if that is useful.
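
For example, a small sketch with the toy array from the question (the `labels` list is just for illustration):

```python
import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])

# Default: the index counts from 0.
for spectrum_i, spectrum in enumerate(y2_arrayT):
    print(spectrum_i, spectrum)

# start=1 begins the count at 1 instead.
labels = [f"spectrum {n}" for n, _ in enumerate(y2_arrayT, start=1)]
print(labels)  # ['spectrum 1', 'spectrum 2']
```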

Cheers,
Cameron Simpson cs@cskk.id.au

ncnels · October 14, 2022, 9:47pm

It would be interesting to see that recommendation for its context. AIUI
a DataFrame is a collection of Series (the columns) and a Series usually
contains a numpy array anyway.

I saw the recommendation on numpy.org somewhere (couldn’t find it just now). But there is a very good chance that I misinterpreted what I read. There are a lot of new terms that I am not familiar with. So even bigger caveat on my end.

Yes, it is ‘background’ and 1x1037.

Thanks for the information on the costs of vstack. I typically just google how to accomplish things and I am not at the level where I learn about (or am even concerned with) the tradeoffs with speed/efficiency, other than lines of code I guess.

Yes, I probably could pull the data from the DataFrame. The only reason I didn’t is b/c I saw that comment about converting something into an array for better reproducibility(?)

I will try my hand at the last suggestion you made without the if statement. That seems to be the most straightforward and I think I understand it the most. Thanks again for your help. I will post back with an update. Cheers!

ncnels · October 26, 2022, 10:00pm

Here is what I came up with. Seems to work just fine with a few less lines of code, kind of…

import numpy as np

def process(arr, bkg):
    diffs = []
    for y in arr:
        for list in bkg:
            diffs.append(np.array(y) - np.array(list))
    return diffs

y2_arrayT = [[1, 2, 3], [4, 5, 6]]

bkg1 = [[1, 2, 3], [4, 5, 6]]
bkg2 = [[1, 1, 1], [1, 1, 1]]
bkg3 = [[0, 0, 0], [0, 0, 0]]
bkgs = [bkg1, bkg2, bkg3]

for bkg in bkgs:
    print(process(y2_arrayT, bkg))
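
If the backgrounds can be held in one array, the two inner loops could probably be replaced by broadcasting. This is a hedged sketch, not something proposed in the thread: `process_vectorised` is a hypothetical name, and it returns a single 2-d array (rows in the same order as the loop version) rather than a list of arrays.

```python
import numpy as np

def process_vectorised(arr, bkg):
    """Hypothetical loop-free variant; returns one 2-d array instead of a list."""
    arr = np.asarray(arr)
    bkg = np.asarray(bkg)
    # Shapes: (n_spectra, 1, n_points) - (1, n_bkg_rows, n_points)
    diffs = arr[:, None, :] - bkg[None, :, :]
    # Merge the first two axes so rows come out in the loop's order.
    return diffs.reshape(-1, arr.shape[1])

y2_arrayT = [[1, 2, 3], [4, 5, 6]]
bkg2 = [[1, 1, 1], [1, 1, 1]]

print(process_vectorised(y2_arrayT, bkg2).tolist())
# [[0, 1, 2], [0, 1, 2], [3, 4, 5], [3, 4, 5]]
```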

cameron · October 27, 2022, 8:57pm

Nice! - Cameron Simpson cs@cskk.id.au