Is there a way to carry out this set of instructions using a list or dictionary? Or what would your approach be to make the code more efficient? Thx!
Please don’t post screenshots of code. We don’t use Photoshop to edit code
Please copy and paste the code, as text, and place it between “code fences”:
```
code goes here
```
You can also use three tildes ~~~
instead of backticks.
Sorry about that.
```
import numpy as np

y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
i = 0
for spectrum in y2_arrayT:
    # bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])
    diff1 = spectrum - bkg1
    diff2 = spectrum - bkg2
    diff3 = spectrum - bkg3
    if i == 0:
        d1 = diff1
        d2 = diff2
        d3 = diff3
        i += 1
        continue
    d1 = np.vstack((d1, diff1))
    d2 = np.vstack((d2, diff2))
    d3 = np.vstack((d3, diff3))
print(d1, 2*'\n', d2, 2*'\n', d3)
```
Is there a way to carry out this set of instructions using a list or
dictionary? Or what would your approach be to make the code more
efficient? Thx!
Do you mean: instead of using `np.array`?

You can certainly do all of this with `list`s - semantically your 2x3 `np.array` can be done with nested `list`s.
But I wouldn’t. `numpy` (a) has a huge suite of nice math functions for bulk computation, which come into play when you use things like `np.vstack` or `spectrum - bkg1`, and (b) those operations are far faster than what you’d get replicating them in pure Python, because they will be vectorised and run at essentially machine speed - there’s a little cost to orchestrating the setup, but once kicked off the gains far outweigh the setup because of their bulk nature.

`numpy` will only be less efficient if you’re mangling some natural and efficient method to use `numpy` methods where the `numpy` methods are a bad fit to the task.

If `numpy` expresses what you’re doing clearly and directly, just use it.
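To put a rough number on the speed claim above, here is a micro-benchmark sketch (not from the thread; exact timings vary by machine, but the vectorised form is typically much faster on arrays of this size):

```python
import timeit
import numpy as np

# Illustrative data sized like the real spectra (1037 points).
spectrum = list(range(1037))
bkg = [1] * 1037
spectrum_a = np.array(spectrum)
bkg_a = np.array(bkg)

def pure_python():
    # element-by-element subtraction in a Python-level loop
    return [s - b for s, b in zip(spectrum, bkg)]

def with_numpy():
    # the same subtraction, vectorised in one numpy operation
    return spectrum_a - bkg_a

t_py = timeit.timeit(pure_python, number=1000)
t_np = timeit.timeit(with_numpy, number=1000)
print(f"pure Python: {t_py:.4f}s, numpy: {t_np:.4f}s")
```

Both forms compute the same values; only the per-element work moves from Python bytecode into compiled numpy loops.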
I’ve got a couple of very minor comments/queries about the code:
```
for spectrum in y2_arrayT:
    # bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])
```
It seems to me that the comment is incorrect; `bkg1` et al. are set up the same way each time and not modified, and so you could define them just once before the loop. Unless these are special-cased for the example, adapted from a larger, more general piece of code.
```
print(d1, 2*'\n', d2, 2*'\n', d3)
```

Any reason for `2*'\n'` instead of just `'\n\n'` here?
Cheers,
Cameron Simpson cs@cskk.id.au
Thanks for your comments Cameron
No, I mean using np.array, but just condensing all the repeat code. y2_arrayT in the actual code (the one above is structurally identical) is 11x1037. The original data is 11 csv files with 1037 data points and I read that it is best to convert the dataframe created on importing the files into an array for manipulation.
Yes, the comment is not valid for the above code, but using actual data the bkg1 et al. are a function of ‘spectrum’. No reason, I guess `2*'\n'` popped into my head instead of `'\n\n'`.
No, I mean using np.array, but just condensing all the repeat code.
y2_arrayT in the actual code (the one above is structurally identical)
is 11x1037. The original data is 11 csv files with 1037 data points and
I read that it is best to convert the dataframe created on importing
the files into an array for manipulation.
It would be interesting to see that recommendation for its context. AIUI
a DataFrame is a collection of Series (the columns) and a Series usually
contains a numpy array anyway.
I am not a numpy expert, just starting to use it myself, so take all my
comments with caveats.
Yes, the comment is not valid for the above code, but using actual data the bkg1 et al. are a function of ‘spectrum’.
Ah ok. My comments were more general numpy vs pure Python. Let’s revisit
the code a little. I think there’s certainly scope for DRYing it up
(“don’t repeat yourself”) and also for performance improvement around
the vstack
. But if you’re really only computing 3 things removing the
repetition may only be worth it if the loop body becomes more complex.
```
y2_arrayT = np.array([[1, 2, 3], [4, 5, 6]])
i = 0
for spectrum in y2_arrayT:
```

So each `spectrum` is a 1-d `ndarray`, eg: `[1, 2, 3]`.
```
    # bkg1-3 change for each spectrum in y2_arrayT
    bkg1 = np.array([[1, 2, 3], [4, 5, 6]])
    bkg2 = np.array([[1, 1, 1], [1, 1, 1]])
    bkg3 = np.array([[0, 0, 0], [0, 0, 0]])
```

And a `bkg` (background?) is a 2x3 `ndarray` (or maybe n x 3 in reality?)
```
    diff1 = spectrum - bkg1
    diff2 = spectrum - bkg2
    diff3 = spectrum - bkg3
```

Subtracting one from the other gets you another 2x3 `ndarray` in the `diff` variable.
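That subtraction works via numpy broadcasting: a 1-d `(3,)` spectrum is stretched across each row of a `(2, 3)` background, so the result is again `(2, 3)`. A quick sketch with the thread's toy values:

```python
import numpy as np

spectrum = np.array([1, 2, 3])              # shape (3,)
bkg = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3)

# broadcasting: spectrum is subtracted from every row of bkg
diff = spectrum - bkg
print(diff.shape)  # (2, 3)
print(diff)
# [[ 0  0  0]
#  [-3 -3 -3]]
```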
I could imagine writing a loop to iterate over the 3 flavours of `bkg`:

```
for bkg_i in range(3):  # counts 0, 1, 2
    # compute one of the bkgs using bkg_i
    bkg = func(spectrum, bkg_i)
    diff = spectrum - bkg
```

and then accumulate the `diff` in a list of diffs for that `bkg_i` value.
This avoids repeating the bkg setup and difference. If you’re doing more
stuff in there, this avoids tedious (and error prone) repetition.
The accumulation is to aid replacing the `vstack` which follows. Repeatedly `vstack`ing has a cumulative cost because it copies things, potentially many times. Instead we can collect the diffs and `concatenate` them once at the end.
So something shaped like this (untested):

```
# in case you want something more general
n_bkgs = 3
# make a list of list-of-diffs, one per bkg
# note: _not_ [[]] * n_bkgs, for reasons I can explain if needed
diffs = [ [] for _ in range(n_bkgs) ]
for spectrum in y2_arrayT:
    # counts 0, 1, 2
    for bkg_i in range(n_bkgs):
        # compute one of the bkgs using bkg_i
        bkg = func(spectrum, bkg_i)
        diff = spectrum - bkg
        diffs[bkg_i].append(diff)
```
Then at the end, outside the loops, you can go:

```
stacked_diffs = [
    np.concatenate(diffs[bkg_i])
    for bkg_i in range(n_bkgs)
]
```

and have a list `stacked_diffs` with the stacked diff arrays for each `bkg_i`.
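The "reasons I can explain if needed" about `[[]] * n_bkgs` can be shown in a couple of lines: multiplying a list repeats a reference to the *same* inner list, so every slot aliases one object, while a comprehension builds independent lists.

```python
# [[]] * 3 repeats a reference to ONE list: appending to any slot
# appears to change all of them.
aliased = [[]] * 3
aliased[0].append("diff")
print(aliased)  # [['diff'], ['diff'], ['diff']]

# A comprehension builds three separate lists.
independent = [[] for _ in range(3)]
independent[0].append("diff")
print(independent)  # [['diff'], [], []]
```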
I suspect you do not need to unpack the DataFrame you had, but we’d need to know its shape and indices in order to see what should be done to use it directly. Basically I expect you could just pull things straight out of the DataFrame to get the equivalent of `y2_arrayT`, then process the same way. And maybe put the diffs back into the DataFrame as new columns, depending on their shape and whether that made sense or was even useful.
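As a sketch of pulling things straight out of a DataFrame (assuming pandas; the column names and values here are made up for illustration): each column is a Series backed by a numpy array, so `.to_numpy()` gives you an ndarray without any separate conversion step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the imported CSV data.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

col = df["a"].to_numpy()    # 1-d ndarray for one column
table = df.to_numpy()       # 2-d ndarray for the whole frame
print(type(col), col.shape) # a numpy.ndarray of shape (2,)
print(table.shape)          # (2, 2)
```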
Regarding the `vstack` replacement with `concatenate`, this:

```
diffs[bkg_i].append(diff)
```

is practically free, because it is just appending a reference to the array in `diff` to the list (`diffs[bkg_i]`). Costs nothing, and makes no copies. Then `concatenate` copies the diffs just once at the end. Doing repeated `vstack`s copies the diff and the accumulated-copy-of-earlier-diffs on every loop iteration. That is much more expensive.
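The two approaches can be compared side by side in a toy sketch (not from the thread): both build the same stacked array, but the first copies the growing result every iteration while the second copies the data only once at the end.

```python
import numpy as np

# Four 1x3 "diff" rows to stack.
rows = [np.array([[i, i + 1, i + 2]]) for i in range(4)]

# Repeated vstack: the accumulated result is recopied on every iteration.
stacked = rows[0]
for row in rows[1:]:
    stacked = np.vstack((stacked, row))

# Collect-then-concatenate: appends are cheap, one copy at the end.
collected = []
for row in rows:
    collected.append(row)
once = np.concatenate(collected)

print(np.array_equal(stacked, once))  # True
print(once.shape)                     # (4, 3)
```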
Finally, a bit of generic Python criticism. Your loop went:

```
i = 0
for spectrum in y2_arrayT:
    ... stuff ...
    if i == 0:
        d1 = diff1
        d2 = diff2
        d3 = diff3
        i += 1
        continue
    d1 = np.vstack((d1, diff1))
    d2 = np.vstack((d2, diff2))
    d3 = np.vstack((d3, diff3))
```
This is to split the initial/first diff (eg `d1`) from the accumulation with `vstack` for the second and following diffs. So this is a 2-value state: first run and later runs. I usually write that like this:

```
first = True
for spectrum in y2_arrayT:
    ... stuff ...
    if first:
        d1 = diff1
        ... etc ...
        first = False
    else:
        d1 = np.vstack(.....)
        ... etc ...
```
Not that you should need this now. An alternative construction, which is effectively what the `diffs=[.....]` setup above is, looks like:

```
# empty lists
diffs = [......]
for spectrum in y2_arrayT:
    ... stuff ...
    ... append to diffs in some way ...
```

which is always the same code, with no `if first:` test, because the first append is appending to an empty list, just like the later ones.
Finally, if you actually cared about the row/index of the spectrum you’re doing, you can do:

```
for spectrum_i, spectrum in enumerate(y2_arrayT):
```

This iterates `spectrum` through `y2_arrayT` exactly as before, but also provides `spectrum_i` being the index of the `spectrum`, counting 0, 1, 2, etc. `enumerate` counts from 0 by default but you can have it start at one or some other number if that is useful.
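For example, with the default start and with `start=1`:

```python
spectra = ["s0", "s1", "s2"]  # stand-ins for spectrum rows

print(list(enumerate(spectra)))           # [(0, 's0'), (1, 's1'), (2, 's2')]
print(list(enumerate(spectra, start=1)))  # [(1, 's0'), (2, 's1'), (3, 's2')]
```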
Cheers,
Cameron Simpson cs@cskk.id.au
It would be interesting to see that recommendation for its context. AIUI
a DataFrame is a collection of Series (the columns) and a Series usually
contains a numpy array anyway.
I saw the recommendation on numpy.org somewhere (couldn’t find it just now). But there is a very good chance that I misinterpreted what I read. There are a lot of new terms that I am not familiar with. So even bigger caveat on my end.
Yes, it is ‘background’ and 1x1037.
Thanks for the information on the costs of vstack. I typically just google how to accomplish things and I am not at the level where I learn (or even concerned) about the tradeoffs with speed/efficiency, other than lines of code I guess.
Yes, I probably could pull the data from the DataFrame. The only reason I didn’t is b/c I saw that comment about converting something into an array for better reproducibility(?)
I will try my hand at the last suggestion you made without the if statement. That seems to be the most straightforward and I think I understand it the most. Thanks again for your help. I will post back with an update. Cheers!
Here is what I came up with. Seems to work just fine with a few less lines of code, kind of…

```
import numpy as np

def process(arr, bkg):
    diffs = []
    for y in arr:
        for list in bkg:
            diffs.append(np.array(y) - np.array(list))
    return diffs

y2_arrayT = [[1, 2, 3], [4, 5, 6]]
bkg1 = [[1, 2, 3], [4, 5, 6]]
bkg2 = [[1, 1, 1], [1, 1, 1]]
bkg3 = [[0, 0, 0], [0, 0, 0]]
bkgs = [bkg1, bkg2, bkg3]

for bkg in bkgs:
    print(process(y2_arrayT, bkg))
```
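For what it's worth, the remaining loops can also be collapsed entirely with broadcasting (a sketch, not from the thread, assuming each background is a single 1-d array the same length as a spectrum, as with the real 1x1037 backgrounds): stacking the backgrounds into one 2-d array and adding axes lets numpy compute every spectrum-minus-background difference in a single expression.

```python
import numpy as np

spectra = np.arange(6).reshape(2, 3)   # 2 spectra, 3 points each
bkgs = np.array([[1, 2, 3],            # 3 backgrounds, 3 points each
                 [1, 1, 1],
                 [0, 0, 0]])

# Shapes (1, 2, 3) - (3, 1, 3) broadcast to (3, 2, 3):
# diffs[b, s] == spectra[s] - bkgs[b]
diffs = spectra[None, :, :] - bkgs[:, None, :]
print(diffs.shape)                                  # (3, 2, 3)
print(np.array_equal(diffs[1], spectra - bkgs[1]))  # True
```

This trades the explicit loops for one bulk operation, which is the kind of use numpy is fastest at.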
Nice! - Cameron Simpson cs@cskk.id.au