My data originally comes from a tuple of twenty-one variables. There are several such tuples to be added, one at a time, each with twenty-one variables. Each tuple is converted to a list and then appended to a file. My question concerns how to index the data once multiple tuples have been appended to the file (which now contains one long list), in the order the groups (tuple-to-list conversions) were appended. Do I index from 0 to the end of the list? Or should I work with subsets of twenty-one variables and then address the variable I need within a subset?
I believe both will work, as the math is the same in the end; you are addressing the same data point. This is about the best or preferred way to handle it, to keep it clear to those who have to work with the code later.
Thanks
I don't understand your question.
I guess that when you say you have "a tuple of twenty one variables" you mean that you have a tuple with 21 items in it, which you then turn into a list:
t = ('a', 'b', 'c', 'd', 'e', 'f', 'g',
     'h', 'i', 'j', 'k', 'l', 'm', 'n',
     'o', 'p', 'q', 'r', 's', 't', 'u')
assert len(t) == 21
mylist = list(t)
That much is reasonably clear. But then you say the list is appended to a file. You can't write lists to a file; you have to convert to a string first. So what are you doing? My guess is:
with open(somefile, "w") as f:
    for t in many_tuples:
        mylist = list(t)
        f.write(str(mylist))
but that's probably not right, because you say that your file contains "one long list".
You then ask:
which I don't understand at all. You can't index into a file.
Maybe instead of giving us a vague description like this, you can show us the code you are using, what you are trying to do, and the result you expect to get. Please simplify the code to a minimal working example, e.g. instead of using 21 items in the tuple, cut it down to just three.
This is the "list" that I want to append to a file. When you have several different lists, each with the same exact variables, each appended to the file (only the data differs), I want to then read the file (main_list) and index to the data variable I want: main_list[15], the sixteenth data item. But I want that data point in the third set. Each list has different values.
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'homework', 'Default Alarm', 0, 2, 30, 14, 13, 16, 43]
main_list:
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'homework', 'Default Alarm', 0, 1, 30, 14, 13, 16, 43][0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'Work', 'Alarm 4', 0, 4, 32, 14, 13, 16, 43][0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 'Meds', 'Fog Horn', 0, 2, 30, 14, 13, 16, 43]
Therefore I wanted the "2" for main_list[15] of the third set of data.
Now, do I index the list as one unit, or do I subset it by the number of data points (21 in this case)? The first set would be offset 0, the second set offset 21, the third set offset 42, plus the variable I want (15), so 42 + 15 = 57, and main_list[57] is the final index address. Understand that there can be many more datasets at one time as the program runs.
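The offset arithmetic described here can be written down as a tiny helper. This is just an illustrative sketch (the function name is mine, not from any real program):

```python
# Sketch of the flat-index arithmetic described above.
GROUP_SIZE = 21  # each appended dataset contributes 21 values


def flat_index(group, item, group_size=GROUP_SIZE):
    """Index into one long list for `item` of `group` (both zero-based)."""
    return group * group_size + item


# Third set (group 2, zero-based), sixteenth item (index 15):
print(flat_index(2, 15))  # -> 57, i.e. main_list[57]
```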
I hope I have made this clear. Thanks again.
I'm afraid not; sorry. You still haven't shown any actual code demonstrating the issue, and your description appears to have the same points of confusion/inaccuracies as @steven.daprano points out above, and hasn't answered any of the questions he poses. If you could just post some minimal working code that creates a file as you describe, reads from it, and attempts to do the "indexing" operation you're describing, that would allow us to give you far more useful feedback without having to guess.
If you have some list of tuples:
some_data = [(0, 42, "spam"), (1, 5, "eggs")]
you could save it to a file as a string (repr) and eval it, at which point you have (roughly) the same object you started with, assuming you're using primitive data types:
from pathlib import Path
data_file = Path(...)
data_file.write_text(repr(some_data), encoding="UTF-8")
some_data_read = eval(data_file.read_text(encoding="UTF-8"))
But that's a bad idea for many reasons and will only work in the simplest of cases; much better to write it as JSON (or a pickle, if you need arbitrary Python objects rather than basic data types):
import json
data_file.write_text(json.dumps(some_data), encoding="UTF-8")
some_data_read = json.loads(data_file.read_text(encoding="UTF-8"))
So long as you save it in a format that will round-trip to the same Python object, you can index into the list of tuples just as you could originally, e.g. to get the 16th item of the 3rd set of data (as above):
assert some_data == some_data_read
assert some_data_read[2][15] == 2
Alternatively, if you want to be able to arbitrarily append tuples to a file row by row, you might be better off writing to a CSV instead, which you can write row by row at different times (once you've written the headers). You can use Python's csv module and manually convert the data types yourself when you read it back in, but in that case you're way better off using the popular Pandas library, which is designed for exactly this kind of data manipulation and makes not only data I/O but also data processing much easier.
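To illustrate the csv-module half of that suggestion, here is a minimal sketch: append one record per row, then read them all back. The filename and the 6-item records (trimmed from 21 for brevity) are illustrative, and note that the reader returns every field as a string:

```python
# Sketch: append tuples to a CSV row by row, then read them back.
import csv

records = [
    (0, 1, 0, 'homework', 'Default Alarm', 2),  # trimmed to 6 items for brevity
    (0, 1, 0, 'Work', 'Alarm 4', 4),
]

# 'a' mode lets you append more rows at different times.
with open('records.csv', 'a', newline='') as f:
    csv.writer(f).writerows(records)

with open('records.csv', newline='') as f:
    rows = list(csv.reader(f))

# Each row is a list of strings; numeric fields need converting back.
print(rows[1][5])  # -> '4'
```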
If you have separate sets of data, why do you want to combine them into one? I'm not really sure what you're trying to achieve by doing that.
Also, in your first post above you mentioned wanting to make things clear to the people who have to work with the code later. Good - that's really important. A list of values of different types suggests that it isn't the position of the item that's important, but conceptually what the value represents. That's not to say that values of the same type represent the same concept - in your example, I imagine "homework" and "Default Alarm" are values for two entirely separate concepts, even though they're both strings.
To make things clearer then, why not represent the data using something that ascribes names to the various parts? You could use a dict for this. Instead of
a_book = ['Refactoring', 'Martin Fowler', True]
# Usage
print(a_book[2]) # What does 2 mean?
I might have
a_book = {"title": "Refactoring", "author": "Martin Fowler", "in_print": True}
print(a_book["in_print"])
Personally, I prefer to name the concept as a whole, because it gives more meaning when reading code. Therefore, I'd use a data class or named tuple:
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    in_print: bool

a_book = Book("Refactoring", "Martin Fowler", True)
print(a_book.in_print)
from collections import namedtuple
Book = namedtuple("Book", ["title", "author", "in_print"])
a_book = Book(title="Refactoring", author="Martin Fowler", in_print=True)
print(a_book.in_print)
There are reasons you'd want to choose one over the other, but you can learn about the differences between them.
If he's got a tool/function which processes a single data set and he's got a bunch of datasets, combining them in order to send them to that tool is entirely reasonable.
Real world example: I've got a solar inverter. There's a Windows app for
monitoring it, and the data export function is a button which writes
historic data as CSV files. Those CSV files cover only the recent past
because of storage constraints inside the inverter itself. I push that
button every few days.
If I want to examine data over a wider time span than the memory of the
inverter, I need to combine the data from a few different dumps.
A year ago I was doing that by loading up the CSV files without the header lines and piping them through sort -u to get a single bigger CSV file. Now I'm parsing new CSV files and storing their data in an accumulated data store.
But either way I'm effectively combining several datasets into a single data set.
Cheers,
Cameron Simpson cs@cskk.id.au
Yes! This is what I'm doing. Now, for clarity, what I wanted to know was how to show the unpacking to individual data points. Both ways end at the same data point. One is to count to the exact data point (forgetting about groups). Two is to use an offset of (in this case) 21, so you are able to count by groups of 21, then add the data point's position within the 21-item dataset group (0 through 20).
I think the second makes it clearer, and less likely to lead to mistakes, as you are counting off by groups. One could look at a printout of the data, count the groups to the data group needed, and then count to the data point.
It comes down to this: for example, 6th data group, data point 12; or data point = 117.
As these are just small examples, consider 1500 data groups.
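For concreteness, the two addressing styles being compared can be sketched side by side. The names here are illustrative, and the stand-in data is just a numbered range so the arithmetic is visible:

```python
# Sketch comparing the two addressing styles: one flat list vs. groups of 21.
GROUP_SIZE = 21
flat = list(range(1500 * GROUP_SIZE))  # stand-in for 1500 appended groups

group, point = 6, 12  # 6th data group (1-based), data point 12 (0-based)

# Style one: a single flat index into the whole list.
print(flat[117])  # -> 117

# Style two: offset by whole groups, then by the point within the group.
print(flat[(group - 1) * GROUP_SIZE + point])  # -> 117, same data point

# Or keep the groups as real sublists and index twice.
nested = [flat[i:i + GROUP_SIZE] for i in range(0, len(flat), GROUP_SIZE)]
print(nested[group - 1][point])  # -> 117 again
```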
Thanks so very much.
Maybe others are able to figure it out, but I still don't fully understand what you are attempting to describe, sorry, and I can't find where you've answered any of the specific questions we've posed above. Hopefully, our replies have been useful, but if you'd like further useful feedback, I'd suggest that you answer the specific questions posed to you above (with quotes, so it's clear what you're responding to - when you say
It isn't obvious which of the four different posts by three different people you're responding to, each with distinct ideas of what you're doing and what to suggest in reply.)
Best of luck
The last entry was to Cameron Simpson's explanation. He explained it very clearly.
Thanks
I'm going to try guessing what you want, based on the posts you have made so far.
You have a bunch of records with 21 items each. For simplicity and brevity, I'm going to cut that down to just four items. You want to combine them into one big list. I'm not sure how you are doing it.
Option 1:
# Some data blobs.
a = (100, 101, 102, 103)
b = (2000, 2001, 2002, 2003)
c = (30, 31, 32, 33)
combined = []
# Make one long list
for blob in (a, b, c):
    combined.extend(blob)
print(combined)
# Gives -> [100, 101, 102, 103, 2000, 2001, 2002, 2003, 30, 31, 32, 33]
This is the natural way to combine arrays of data in low-level languages like assembly.
If you want the third item of the second blob, you can get it from the combined list like this:
# Remember that Python counts from 0 instead of 1
index = (2-1)*4 + (3-1) # Subtract 1 to allow for zero-based indexing.
print(combined[index])
# gives 2002
In your case, you would use 21 instead of 4.
Here is the second option:
# The data blobs again.
a = (100, 101, 102, 103)
b = (2000, 2001, 2002, 2003)
c = (30, 31, 32, 33)
combined = []
# Make a list of sublists
for blob in (a, b, c):
    combined.append(list(blob))
print(combined)
# Gives -> [[100, 101, 102, 103], [2000, 2001, 2002, 2003], [30, 31, 32, 33]]
This is probably more natural for Python. Here is how you would retrieve the third item from the second blob:
# We subtract 1 to allow for zero-based indexing.
print(combined[2-1][3-1])
# gives 2002
None of this is related to writing the data to a file; I can't guess what you mean there.
Actually, it's not done much anymore, but you can use fixed-width records and seek.
Or a database. That's done a lot today.
Well, you can. That's what the file .seek() method does.
With text files, seek values need to be treated as opaque values (generally, assuming you don't know the text encoding, or with a variable-width encoding where computing it is fiddly - and UTF-8, the modern common text encoding, is variable width).
But nothing prevents you writing a text file and recording the file
offsets of the lines as you go.
With binary data the situation is better defined, particularly if the
records are fixed width.
If the OP's data are arbitrary things (like strings and so forth) this gets messy. But if, say, they're all floats then they can be written directly to files and directly seek()ed to (apologies for the grammar).
I'm not sure Leonard was clear about the type(s) in his data. But if they're all, say, floats and the tuples are fixed length then he can definitely save them in binary form, and either retrieve them with a seek and a read or, equally (probably better), mmap the file and just index straight into it. The array module lets you load a chunk of binary data (implicitly: from a file) and access it directly as an array of floats or other basic C-like types. It works really well. I believe numpy's ndarray stuff does something similar.
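A minimal sketch of the array-module approach mentioned above: write floats as raw binary, read them back, and index directly. The filename and values are illustrative:

```python
# Sketch: store floats in binary with the array module and index into them.
from array import array

values = array('d', [1.5, 2.5, 3.5, 4.5])  # 'd' = C double

with open('data.bin', 'wb') as f:
    values.tofile(f)  # raw binary, 8 bytes per double

loaded = array('d')
with open('data.bin', 'rb') as f:
    loaded.fromfile(f, 4)  # read 4 doubles back

print(loaded[2])  # -> 3.5, direct indexing into the binary data
```

With fixed-width records you could also f.seek(index * loaded.itemsize) and read a single value instead of loading the whole file.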
If you really care, here's an overengineered example of exactly this:
which is what I'm using to store my accumulated solar inverter data. It's a file-of-floats, and I do random access into it.
That is designed for time-sampled data, but Leonard could do the same kind of thing for whatever indexing scheme he has for his own data.
Cheers,
Cameron Simpson cs@cskk.id.au
Just looking at Leonard's example, he has numbers and strings. So a binary file full of floats won't do the trick unless he separates the strings out into a separate file (even more work, and getting into the realm of choosing a database).
Cheers,
Cameron
You can save strings in a file with fixed-length records. You just have to pick a maximum length for each one.
I want to explain what I am doing. I wondered which would be the best way to describe, for future programmers, what is being done in my program.
First, we are working with tuples. In my case I'm using both strings and integers in the tuple. Example: my_tuple = (1, 67, 'Donuts', 'Coffee', 98)
We know we can't add to a tuple directly; however, we can convert the tuple to a list, do the add or remove as needed, and turn the list back into a tuple.
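That round trip looks like this as a tiny sketch (the added item is just an example):

```python
# The tuple -> list -> tuple round trip described above.
my_tuple = (1, 67, 'Donuts', 'Coffee', 98)

as_list = list(my_tuple)    # tuples are immutable, so work on a list copy
as_list.append('Napkins')   # add (or remove) items here
my_tuple = tuple(as_list)   # freeze it back into a tuple

print(my_tuple)  # -> (1, 67, 'Donuts', 'Coffee', 98, 'Napkins')
```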
The next thing is to see that it does not matter if we add one data point or one hundred data points; we just have a larger tuple. In my case I'm adding a 21-variable sub-tuple every time. All the data is in 21-variable tuple sets, and all the tuples must have the same number of variables. This is how to get a tuple of tuples, or tuple of sub-tuples.
Now that we have sets of 21 variables, we can use that as an offset, counting by 21 to get to the beginning of the next sub-tuple. Then we can address any of the 21 variables of that sub-tuple. This also means we can loop through the WHOLE tuple of sub-tuples, which is good for sorting and finding the highest and lowest data point at the position we are addressing.
In my case I don't know if I'll have one sub-tuple or one hundred sub-tuples at any given time. The benefit of doing data this way is that we can address all the data in a sub-tuple, or the entire tuple of sub-tuples, regardless of the size at any time.
This is the question. How do I document this process so other programmers understand what is going on?
Sorry I was not able to express it correctly.
Thanks again for your help and patience.
Make your code communicate well. There are several things you can do:
- Model the domain in which your program lives - use classes to represent the concepts in that domain, rather than just using sequences to structure things, as I mentioned above.
- Name variables and functions in a meaningful way so as to reveal their intent. Break long functions down into small ones. The book Clean Code by Bob Martin gives a lot of good guidance on these sorts of things.
- Write tests, with e.g. unittest or the third-party pytest, to describe the intended behaviour of components of the program and the program as a whole. Not only do they make sure code continues to work as you and others change it, but they also let people know that the behaviour there was intentional.
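As a tiny sketch of that last point, tests for the offset arithmetic in this thread could look like this. The function and test names are illustrative; pytest would collect and run the test_ functions automatically:

```python
# Sketch: pytest-style tests documenting the intended indexing behaviour.

def flat_index(group, item, group_size=21):
    """Index into one long list for `item` of `group`, both zero-based."""
    return group * group_size + item


def test_first_group_is_unoffset():
    assert flat_index(0, 15) == 15


def test_third_group_is_offset_by_two_groups():
    assert flat_index(2, 15) == 57  # the main_list[57] case from the thread
```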
But why?
Modifying tuples is going to tend to give O(n^2) algorithms.
Also, a list of lists or list of tuples would be much more self-documenting.
If you're truly committed to this relative strangeness, you could say something like "item(n) is tuple_[n * 21:(n + 1) * 21] (zero-based)". Or better, wrap it up in a class that does the indexing for you.
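A minimal sketch of that class idea: a thin view over one flat sequence that does the offset arithmetic for you. The names are illustrative:

```python
# Sketch: a class that hides the group-of-21 indexing arithmetic.
class GroupedList:
    def __init__(self, flat, group_size=21):
        self.flat = flat
        self.group_size = group_size

    def __len__(self):
        # Number of complete groups in the flat sequence.
        return len(self.flat) // self.group_size

    def __getitem__(self, group):
        # Return group `group` (zero-based) as its own slice.
        start = group * self.group_size
        return self.flat[start:start + self.group_size]


data = GroupedList(list(range(63)), group_size=21)  # 3 groups of 21
print(len(data))      # -> 3
print(data[2][15])    # -> 57: 16th item of the 3rd group
```

Readers of the calling code then see data[2][15] instead of a bare magic offset like main_list[57], which largely answers the documentation question on its own.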
Thanks for your feedback. First, I won't know how many tuples there are at any given time. Second, it is easy to address any data point by direct or indirect (offset) address. Third, dense data storage, even though today that is not so important. Fourth, it is easy to detect corrupted data by size; no data in a data point may mean that a sensor went down. Fifth, it is easy to implement. Sixth, the most important yet a side note: I've had a major stroke. Somewhere in my brain I made a known connection between my pre-stroke brain and the rebuilt brain function. I had lost ALL of my computer skills, and this is the first time I have been able to program since the stroke. This approach was normal back when memory was at a premium.
How would you handle the top five differently? There are so many ways to see and do things in computing; no one way is always best in programming, as the project changes. So I want to hear your thoughts. I've been learning Python for just six months now.
Thanks for your time and thoughts.