# Loops, arrays, dictionaries -- oh my

Is there a way to carry out this set of instructions using a list or dictionary? Or what would your approach be to make the code more efficient? Thx!

Please don’t post screenshots of code. We don’t use Photoshop to edit code.

Please copy and paste the code, as text, and place it between “code fences”:

````
```
code goes here
```
````

You can also use three tildes `~~~` instead of backticks.

```python
import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
i = 0
for spectrum in y2_arrayT:
    # bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])

    diff1 = spectrum - bkg1
    diff2 = spectrum - bkg2
    diff3 = spectrum - bkg3

    if i == 0:
        # first spectrum: start the accumulators
        d1 = diff1
        d2 = diff2
        d3 = diff3
        i += 1
        continue

    # later spectra: stack onto what has been accumulated so far
    d1 = np.vstack((d1, diff1))
    d2 = np.vstack((d2, diff2))
    d3 = np.vstack((d3, diff3))

print(d1, 2*'\n', d2, 2*'\n', d3)
```

> Is there a way to carry out this set of instructions using a list or
> dictionary? Or what would your approach be to make the code more
> efficient? Thx!

Do you mean: instead of using `np.array`?

You can certainly do all of this with `list`s - semantically your 2x3
`np.array` can be done with nested `list`s.

But I wouldn’t. `numpy` (a) has a huge suite of nice math functions for
bulk computation, which come into play when you use things like
`np.vstack` or `spectrum - bkg1`, and (b) runs those operations far
faster than what you’d get replicating them in pure Python, because they
are vectorised and run at essentially machine speed. There’s a little
cost to orchestrating the setup, but once kicked off the gains far
outweigh that cost because of the bulk nature of the work.

`numpy` will only be less efficient if you’re mangling some natural and
efficient method in order to use `numpy` methods where the `numpy`
methods are a poor fit.

If `numpy` expresses what you’re doing clearly and directly, just use
it.
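To make the comparison above concrete, here is a minimal sketch of the same subtraction done with nested lists and with arrays. The sample values are taken from the example in the question; the variable names `diff_lists` and `diff_array` are made up for illustration.

```python
import numpy as np

# Sample data with the same shapes as in the question.
spectrum = [1, 2, 3]
bkg = [[1, 2, 3], [4, 5, 6]]

# Pure Python: subtract element by element with nested comprehensions.
diff_lists = [[s - b for s, b in zip(spectrum, row)] for row in bkg]

# numpy: one expression; the 1-d spectrum is applied to each row of bkg.
diff_array = np.array(spectrum) - np.array(bkg)

print(diff_lists)
print(diff_array)
```

Both forms give the same numbers. The array version states the intent in one expression and leaves the looping to `numpy`.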

```python
for spectrum in y2_arrayT:
    # bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])
```

It seems to me that the comment is incorrect; `bkg1` et al are set up
the same way each time and not modified, so you could define them
just once before the loop. Unless these are special-cased for the
example, adapted from a larger, more general piece of code.

> `print(d1, 2*'\n', d2, 2*'\n', d3)`

Any reason for `2*'\n'` instead of just `'\n\n'` here?

Cheers,
Cameron Simpson cs@cskk.id.au

No, I mean using np.array, but just condensing all the repeat code. y2_arrayT in the actual code (the one above is structurally identical) is 11x1037. The original data is 11 csv files with 1037 data points and I read that it is best to convert the dataframe created on importing the files into an array for manipulation.

Yes, the comment is not valid for the above code, but using actual data the bkg1 et al. are a function of `spectrum`. No reason, I guess `2*'\n'` popped into my head instead of `'\n\n'`.

> No, I mean using np.array, but just condensing all the repeat code.
> y2_arrayT in the actual code (the one above is structurally identical)
> is 11x1037. The original data is 11 csv files with 1037 data points and
> I read that it is best to convert the dataframe created on importing
> the files into an array for manipulation.

It would be interesting to see that recommendation for its context. AIUI
a DataFrame is a collection of Series (the columns) and a Series usually
contains a numpy array anyway.

I am not a numpy expert, just starting to use it myself, so take all my
remarks with that caveat.

> Yes, the comment is not valid for the above code, but using actual data the bkg1 et al. are a function of `spectrum`.

Ah ok. My comments were more general numpy vs pure Python. Let’s revisit
the code a little. I think there’s certainly scope for DRYing it up
(“don’t repeat yourself”) and also for performance improvement around
the `vstack`. But if you’re really only computing 3 things removing the
repetition may only be worth it if the loop body becomes more complex.

```python
y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
i = 0
for spectrum in y2_arrayT:
```

So each `spectrum` is a 1-d `ndarray`, eg: `[1, 2, 3]`.

```python
    # bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])
```

And a `bkg` (background?) is a 2x3 `ndarray` (or maybe n x 3 in
reality?)

```python
    diff1 = spectrum - bkg1
    diff2 = spectrum - bkg2
    diff3 = spectrum - bkg3
```

Subtracting one from the other gets you another 2x3 `ndarray` in the
`diff` variable.
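The shapes described above can be checked directly. This is a small sketch using the arrays from the question; the `shapes` list is only there to record what each iteration sees.

```python
import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
bkg2 = np.array([[1, 1, 1], [1, 1, 1]])

shapes = []
for spectrum in y2_arrayT:
    # Iterating a 2-d array yields its rows as 1-d arrays.
    diff = spectrum - bkg2
    shapes.append((spectrum.shape, bkg2.shape, diff.shape))
    print(spectrum.shape, bkg2.shape, diff.shape)
```

Each `spectrum` has shape `(3,)`, and subtracting a `(2, 3)` background broadcasts it across both rows, so each `diff` is `(2, 3)`.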

I could imagine writing a loop to iterate over the 3 flavours of `bkg`:

```python
    for bkg_i in range(3):  # counts 0, 1, 2
        # compute one of the bkgs using bkg_i
        bkg = func(spectrum, bkg_i)
        diff = spectrum - bkg
```

and then accumulate the `diff` in a list of diffs for that `bkg_i`
value.

This avoids repeating the bkg setup and difference. If you’re doing more
stuff in there, this avoids tedious (and error prone) repetition.

The accumulation is there to help replace the `vstack` which follows.
Repeatedly `vstack`ing has a cumulative cost because it copies things,
potentially many times. Instead we can collect the diffs and
`concatenate` them once at the end.

So something shaped like this (untested):

```python
# in case you want something more general
n_bkgs = 3

# make a list of lists-of-diffs, one per bkg
# note: _not_ [[]] * n_bkgs, for reasons I can explain if needed
diffs = [[] for _ in range(n_bkgs)]

for spectrum in y2_arrayT:
    # counts 0, 1, 2
    for bkg_i in range(n_bkgs):
        # compute one of the bkgs using bkg_i
        bkg = func(spectrum, bkg_i)
        diff = spectrum - bkg
        diffs[bkg_i].append(diff)
```
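The warning about `[[]] * n_bkgs` in the comment above is most likely about list aliasing: multiplying a list repeats references to the same inner list rather than creating new ones. A minimal sketch (the names `shared` and `separate` are made up for illustration):

```python
n_bkgs = 3

# Three references to the SAME inner list.
shared = [[]] * n_bkgs

# Three distinct inner lists.
separate = [[] for _ in range(n_bkgs)]

shared[0].append("diff")
separate[0].append("diff")

print(shared)    # [['diff'], ['diff'], ['diff']]
print(separate)  # [['diff'], [], []]
```

With the multiplied form, every append would show up under all three backgrounds, which is why the list comprehension is used instead.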

Then at the end, outside the loops, you can go:

```python
stacked_diffs = [
    np.concatenate(diffs[bkg_i])
    for bkg_i in range(n_bkgs)
]
```

and have a list `stacked_diffs` with the stacked diff arrays for each
`bkg_i`.
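Putting the pieces above together, a runnable sketch might look like the following. The `make_bkg` function is a hypothetical stand-in for `func`, since the real background calculation is not shown in the thread; here it simply returns a constant background of shape `(1, n)`.

```python
import numpy as np


def make_bkg(spectrum, bkg_i):
    """Hypothetical stand-in for `func`: a constant background, shape (1, n)."""
    return np.full((1, spectrum.shape[0]), bkg_i)


y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
n_bkgs = 3

# One list of diffs per background.
diffs = [[] for _ in range(n_bkgs)]

for spectrum in y2_arrayT:
    for bkg_i in range(n_bkgs):
        bkg = make_bkg(spectrum, bkg_i)
        diffs[bkg_i].append(spectrum - bkg)  # each diff has shape (1, 3)

# Join each list of diffs once, after the loops.
stacked_diffs = [np.concatenate(d) for d in diffs]

for stacked in stacked_diffs:
    print(stacked)
```

One thing to watch: this relies on each diff being 2-d. If the diffs were 1-d, `np.concatenate` would join them end to end into one long 1-d array, and `np.vstack` (called once on the whole list) would be the way to keep one row per spectrum.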

I suspect you do not need to unpack the DataFrame you had, but we’d need
to know its shape and indices in order to see what should be done to use
it directly. Basically I expect you could just pull things straight out of
the DataFrame to get the equivalent of `y2_arrayT`, then process the
same way. And maybe put the diffs back into the DataFrame as new
columns, depending on their shape and whether that made sense or was
even useful.

Regarding the `vstack` replacement with `concatenate`, this:

```python
diffs[bkg_i].append(diff)
```

is practically free, because it is just appending a reference to the
array in `diff` to the list (`diffs[bkg_i]`). Costs nothing, and makes no
copies. Then `concatenate` copies the diffs just once at the end. Doing
repeated `vstack`s copies the diff and the
accumulated-copy-of-earlier-diffs on every loop iteration. That is much
more expensive.
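As a small check of the claim above, the two approaches can be compared on made-up data. The `rows` list here is a hypothetical stand-in for the per-spectrum diffs.

```python
import numpy as np

# Hypothetical diffs: four arrays of shape (1, 5).
rows = [np.arange(5).reshape(1, 5) + i for i in range(4)]

# Repeated stacking: each iteration copies everything accumulated so far.
stacked = rows[0]
for row in rows[1:]:
    stacked = np.vstack((stacked, row))

# Collect first, then copy once at the end.
joined = np.concatenate(rows)

print(stacked.shape, joined.shape)
print(np.array_equal(stacked, joined))
```

The results are identical; only the amount of copying differs. With a handful of spectra the difference is unlikely to be noticeable, but it grows with the number of iterations.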

Finally, a bit of generic Python criticism. Your loop went:

```python
i = 0
for spectrum in y2_arrayT:
    # ... stuff ...
    if i == 0:
        d1 = diff1
        d2 = diff2
        d3 = diff3
        i += 1
        continue

    d1 = np.vstack((d1, diff1))
    d2 = np.vstack((d2, diff2))
    d3 = np.vstack((d3, diff3))
```

This is to split the initial/first diff (eg `d1`) from the accumulation
with `vstack` for the second and following diffs. So this is a 2-value
state: first run and later runs. I usually write that like this:

```python
first = True
for spectrum in y2_arrayT:
    # ... stuff ...
    if first:
        d1 = diff1
        # ... etc ...
        first = False
    else:
        d1 = np.vstack(...)
        # ... etc ...
```

Not that you should need this now. An alternative construction, which
is effectively what the `diffs = [...]` setup above does, looks like:

```python
# empty lists
diffs = [...]
for spectrum in y2_arrayT:
    # ... stuff ...
    # ... append to diffs in some way ...
```

which is always the same code, with no `if first:` test because the
first append is appending to an empty list, just like the later ones.

Finally, if you actually cared about the row/index of the spectrum
you’re doing, you can do:

```python
for spectrum_i, spectrum in enumerate(y2_arrayT):
```

This iterates `spectrum` through `y2_arrayT` exactly as before, but also
provides `spectrum_i`, the index of the `spectrum`, counting 0, 1, 2,
etc. `enumerate` counts from 0 by default, but you can have it start at
one or some other number if that is useful.
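For example, a short sketch of counting from 1 using the `start` parameter of `enumerate` (the `labels` list is only for illustration):

```python
import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])

labels = []
for spectrum_i, spectrum in enumerate(y2_arrayT, start=1):
    # spectrum_i counts 1, 2, ... because of start=1
    labels.append(f"spectrum {spectrum_i}: {spectrum.tolist()}")
    print(labels[-1])
```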

Cheers,
Cameron Simpson cs@cskk.id.au

> It would be interesting to see that recommendation for its context. AIUI
> a DataFrame is a collection of Series (the columns) and a Series usually
> contains a numpy array anyway.

I saw the recommendation on numpy.org somewhere (couldn’t find it just now). But there is a very good chance that I misinterpreted what I read. There are a lot of new terms that I am not familiar with. So even bigger caveat on my end.

Yes, it is ‘background’ and 1x1037.

Thanks for the information on the costs of vstack. I typically just google how to accomplish things, and I am not at the level where I learn about (or am even concerned with) the tradeoffs in speed/efficiency, other than lines of code I guess.

Yes, I probably could pull the data from the DataFrame. The only reason I didn’t is b/c I saw that comment about converting something into an array for better reproducibility(?)

I will try my hand at the last suggestion you made without the if statement. That seems to be the most straightforward and I think I understand it the most. Thanks again for your help. I will post back with an update. Cheers!

Here is what I came up with. Seems to work just fine with a few less lines of code, kind of…

    import numpy as np

    def process(arr, bkg):
        diffs = []
        for y in arr:
            for list in bkg:
                diffs.append(np.array(y) - np.array(list))
        return diffs

    y2_arrayT = [[1, 2, 3], [4, 5, 6]]

    bkg1 = [[1, 2, 3], [4, 5, 6]]
    bkg2 = [[1, 1, 1], [1, 1, 1]]
    bkg3 = [[0, 0, 0], [0, 0, 0]]
    bkgs = [bkg1, bkg2, bkg3]

    for bkg in bkgs:
        print(process(y2_arrayT, bkg))

Nice! - Cameron Simpson cs@cskk.id.au
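Since the original question also asked about dictionaries, here is one possible sketch that keys the backgrounds by name and lets broadcasting replace the explicit loops. It assumes the backgrounds are fixed arrays, as in the toy example; if each background is computed from its `spectrum`, as in the real data, a loop like the one discussed earlier in the thread is still needed. The `bkgs` and `diffs` dictionaries are illustrative names, not part of the original code.

```python
import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])

# Backgrounds keyed by name (values copied from the example in the thread).
bkgs = {
    "bkg1": np.array([[1, 2, 3], [4, 5, 6]]),
    "bkg2": np.array([[1, 1, 1], [1, 1, 1]]),
    "bkg3": np.array([[0, 0, 0], [0, 0, 0]]),
}

# y2_arrayT[:, None, :] has shape (2, 1, 3). Subtracting a (2, 3) background
# broadcasts to (2, 2, 3): one 2x3 diff per spectrum.
diffs = {name: y2_arrayT[:, None, :] - bkg for name, bkg in bkgs.items()}

for name, diff in diffs.items():
    # reshape(-1, 3) flattens to one row per (spectrum, background row) pair
    print(name, diff.shape)
    print(diff.reshape(-1, 3))
```

The reshaped rows come out in the same order as the nested loops in `process`: every background row for the first spectrum, then every background row for the second.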