List.append questions for indexing?

My data comes originally from a tuple of twenty-one variables. There are several tuples to be added one at a time, each with twenty-one variables. Each tuple is changed to a list and then appended to a file. My question is about indexing once multiple tuples have been appended to the file (which now contains one long list), in the order the groups (tuple to list) were appended. Do I index from 0 to the end of the list? Or should I use a subset of twenty-one variables and then address the variable I need?
I believe both will work, as the math is the same in the end: you are addressing the same data point. This is about the best or preferred way to handle this, to keep it clear to those who have to work with the code later.
Thanks

I don't understand your question.

I guess that when you say you have "a tuple of twenty-one variables" you mean that you have a tuple with 21 items in it, which you then turn into a list:

t = ('a', 'b', 'c', 'd', 'e', 'f', 'g',
     'h', 'i', 'j', 'k', 'l', 'm', 'n',
     'o', 'p', 'q', 'r', 's', 't', 'u')
assert len(t) == 21

mylist = list(t)

That much is reasonably clear. But then you say the list is appended to a file. You can't write lists to a file; you have to convert to a string first. So what are you doing? My guess is:

with open(somefile, "w") as f:
    for t in many_tuples:
        mylist = list(t)
        f.write(str(mylist))

but that's probably not right, because you say that your file contains "one long list".

You then ask:

"Do I index from 0 to the end of the list? Or should I use a subset of twenty-one variables and then address the variable I need?"

which I don't understand at all. You can't index into a file.

Maybe instead of giving us a vague description like this, you can show us the code you are using, what you are trying to do, and the result you expect to get. Please simplify the code to a minimal working example, e.g. instead of using 21 items in the tuple, cut it down to just three.


This is the 'list' that I want to append to a file. There can be several different lists, each with the same exact variables, appended to the file one after another (only the data differs). I then want to read the file (main_list) and index to the data variable I want: main_list[15], the sixteenth data item. But I want that data point in the third set. Each list has different values.

[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'homework', 'Default Alarm', 0, 2, 30, 14, 13, 16, 43]

main_list:
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'homework', 'Default Alarm', 0, 1, 30, 14, 13, 16, 43][0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 'Work', 'Alarm 4', 0, 4, 32, 14, 13, 16, 43][0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 'Meds', 'Fog Horn', 0, 2, 30, 14, 13, 16, 43]

Therefore I wanted the '2' at main_list[15] of the third set of data.

Now, do I index the list as one unit, or do I subset it by the number of data points (21 in this case)? The first set would be offset 0, the second set offset 21, the third set offset 42, plus the variable I want, 15, so 42 + 15 = 57, and main_list[57] is the final index address. Understand that there can be many more datasets at one time as the program runs.
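In code form, the two schemes look like this (a minimal sketch; main_list here just stands in for the data read back from the file):

WIDTH = 21  # fields per record

# Stand-in for the data read back from the file: three records of 21 fields.
main_list = list(range(3 * WIDTH))

# One: treat the list as a single flat unit.
flat_value = main_list[57]

# Two: compute the same index from a group number and a field offset.
group, field = 2, 15  # third group, sixteenth field (zero-based)
assert main_list[group * WIDTH + field] == flat_value  # 2*21 + 15 == 57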
I hope I have made this clear. Thanks again.

I'm afraid not; sorry. You still haven't shown any actual code demonstrating the issue, your description appears to have the same points of confusion/inaccuracies that @steven.daprano points out above, and it doesn't answer any of the questions he poses. If you could just post some minimal working code that creates a file as you describe, reads from it, and attempts the "indexing" operation you're describing, that would allow us to give you far more useful feedback without having to guess.

If you have some list of tuples:

some_data = [(0, 42, "spam"), (1, 5, "eggs")]

you could save it to a file as a string (repr) and eval it, at which point you have (roughly) the same object you started with, assuming you're using primitive data types:

from pathlib import Path
data_file = Path(...)
data_file.write_text(repr(some_data), encoding="UTF-8")
some_data_read = eval(data_file.read_text(encoding="UTF-8"))

But that's a bad idea for many reasons and will only work in the simplest of cases; much better to write it as JSON (or a pickle, if you need arbitrary Python objects rather than basic data types):

import json
data_file.write_text(json.dumps(some_data), encoding="UTF-8")
some_data_read = json.loads(data_file.read_text(encoding="UTF-8"))

So long as you save it in a format that will round-trip to the same Python object, you can just index into the list of tuples just as you could originally, e.g. to get the 16th item of the 3rd set of data (referring to the OP's 21-item records above):

assert some_data == some_data_read
# With the OP's three 21-item records in place of some_data, the 16th
# item of the 3rd set would be some_data_read[2][15], which is 2.

Alternatively, if you want to be able to arbitrarily append tuples to a file row by row, you might be better off writing to a CSV instead, which you can write row by row at different times (once you've written the headers). You can use Python's csv module and manually convert the data types yourself when you read it back in, but in that case you're way better off using the popular Pandas library, which is designed for exactly this kind of data manipulation and makes not only data I/O but also data processing much easier.
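For instance, a minimal sketch with the standard csv module (the file name and the shortened rows are illustrative):

import csv

# Append records as they arrive; the file can be opened many times.
rows = [
    (0, 1, 0, 'homework', 'Default Alarm', 2),
    (0, 1, 1, 'Meds', 'Fog Horn', 2),
]
with open("data.csv", "a", newline="") as f:
    csv.writer(f).writerows(rows)

# Read everything back; csv yields strings, so convert types as needed.
with open("data.csv", newline="") as f:
    data = list(csv.reader(f))

print(data[1][5])  # field 5 of the second record (as the string '2')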


If you have separate sets of data, why do you want to combine them into one? I'm not really sure what you're trying to achieve by doing that.

Also, in your first post above you mentioned wanting to make things clear to the people who have to work with the code later. Good - that's really important. A list of values of different types suggests that it isn't the position of the item that's important, but conceptually what the value represents. That's not to say that values of the same type represent the same concept - in your example, I imagine "homework" and "Default Alarm" are values for two entirely separate concepts, even though they're both strings.

To make things clearer then, why not represent the data using something that ascribes names to the various parts? You could use a dict for this. Instead of

a_book = ['Refactoring', 'Martin Fowler', True]

# Usage
print(a_book[2]) # What does 2 mean?

I might have

a_book = {"title": "Refactoring", "author": "Martin Fowler", "in_print": True}

print(a_book["in_print"])

Personally, I prefer to name the concept as a whole, because it gives more meaning when reading code. Therefore, I'd use a data class or named tuple:

from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    in_print: bool

a_book = Book("Refactoring", "Martin Fowler", True)

print(a_book.in_print)

or, with a named tuple:

from collections import namedtuple

Book = namedtuple("Book", ["title", "author", "in_print"])

a_book = Book(title="Refactoring", author="Martin Fowler", in_print=True)

print(a_book.in_print)

There are reasons you'd want to choose one over the other, but you can learn about the differences between them.


If he's got a tool/function which processes a single data set and he's got a bunch of datasets, combining them in order to send them to that tool is entirely reasonable.

Real world example: I've got a solar inverter. There's a Windows app for monitoring it, and the data export function is a button which writes historic data as CSV files. Those CSV files cover only the recent past because of storage constraints inside the inverter itself. I push that button every few days.

If I want to examine data over a wider time span than the memory of the inverter, I need to combine the data from a few different dumps.

A year ago I was doing that by loading up the CSV files without the header lines and piping them through sort -u to get a single bigger CSV file. Now I'm parsing new CSV files and storing their data in an accumulated data store.

But either way I'm effectively combining several datasets into a single data set.
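In Python terms, that first approach amounts to something like this (a sketch; the file names are made up):

# Merge several header-less CSV dumps, dropping duplicate lines --
# roughly what piping them through `sort -u` does.
lines = set()
for name in ("dump1.csv", "dump2.csv", "dump3.csv"):
    with open(name) as f:
        lines.update(f)

with open("combined.csv", "w") as f:
    f.writelines(sorted(lines))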

Cheers,
Cameron Simpson cs@cskk.id.au

Yes! This is what I'm doing. Now, for clarity, what I wanted to know was how best to show the unpacking to individual data points. Both ways end at the same data point. One is to count to the exact data point (forgetting about groups). Two is to use an offset of (in this case) 21, so you count by groups of 21 and then add the data point's position within the 21-item dataset group (0 through 20).
I think the second makes it clearer and is less likely to lead to mistakes, as you are counting off by groups. One could look at a print of the data, count the groups to the data group needed, and then count to the data point.
It comes down to this: for example, the 6th data group, data point 12; or data point = 117 (5 * 21 + 12 = 117).
As these are just small examples, consider 1500 data groups.
Thanks so very much.

Maybe others are able to figure it out, but I still don't fully understand what you are attempting to describe, sorry, and I can't find where you've answered any of the specific questions we've posed above. Hopefully our replies have been useful, but if you'd like further useful feedback, I'd suggest that you answer the specific questions posed to you above (with quotes, so it's clear what you're responding to. When you say

it isn't obvious which of the four different posts by three different people you're responding to, each with distinct ideas of what you're doing and what to suggest in reply.)

Best of luck

The last entry was a reply to Cameron Simpson's explanation. He explained it very clearly.

Thanks

I'm going to try guessing what you want, based on the posts you have made so far.

You have a bunch of records with 21 items each. For simplicity and brevity, I'm going to cut that down to just four items. You want to combine them into one big list. I'm not sure how you are doing it.

Option 1:

# Some data blobs.
a = (100, 101, 102, 103)
b = (2000, 2001, 2002, 2003)
c = (30, 31, 32, 33)

combined = []

# Make one long list
for blob in (a, b, c):
    combined.extend(blob)

print(combined)
# Gives -> [100, 101, 102, 103, 2000, 2001, 2002, 2003, 30, 31, 32, 33]

This is the natural way to combine arrays of data in low-level languages like assembly.

If you want the third item of the second blob, you can get it from the combined list like this:

# Remember that Python counts from 0 instead of 1
index = (2-1)*4 + (3-1)  # Subtract 1 to allow for zero-based indexing.
print(combined[index])
# gives 2002

In your case, you would use 21 instead of 4.
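If you do it this way, you might wrap the arithmetic in a small helper so it is written only once (a sketch; the name get_field is made up):

combined = [100, 101, 102, 103, 2000, 2001, 2002, 2003, 30, 31, 32, 33]

def get_field(data, group, field, width=4):
    # group and field are zero-based; width would be 21 for the real data.
    return data[group * width + field]

assert get_field(combined, 1, 2) == 2002  # third item of the second blob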

Here is the second option:

# The data blobs again.
a = (100, 101, 102, 103)
b = (2000, 2001, 2002, 2003)
c = (30, 31, 32, 33)

combined = []

# Make a list of sublists
for blob in (a, b, c):
    combined.append(list(blob))

print(combined)
# Gives -> [[100, 101, 102, 103], [2000, 2001, 2002, 2003], [30, 31, 32, 33]]

This is probably more natural for Python. Here is how you would retrieve the third item from the second blob:

# We subtract 1 to allow for zero-based indexing.
print(combined[2-1][3-1])
# gives 2002

None of this is related to writing the data to a file; I can't guess what you mean there.

Actually, it's not done much anymore, but you can use fixed-width records and seek.

Or a database. That's done a lot today.

Well, you can. That's what the file .seek() method does.

With text files, seek values need to be treated as opaque values (generally, assuming you don't know the text encoding, or with a variable-width encoding where computing it is... fiddly - and UTF-8, the modern common text encoding, is variable width).

But nothing prevents you writing a text file and recording the file offsets of the lines as you go.
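A minimal sketch of that idea, using tell() to record where each line starts (the file name is illustrative; text-mode seek() only accepts values previously returned by tell()):

offsets = []
with open("log.txt", "w", encoding="utf-8") as f:
    for line in ("first\n", "second\n", "third\n"):
        offsets.append(f.tell())  # remember where each line starts
        f.write(line)

with open("log.txt", encoding="utf-8") as f:
    f.seek(offsets[2])       # jump straight to the third line
    print(f.readline())      # -> third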

With binary data the situation is better defined, particularly if the records are fixed width.

If the OP's data are arbitrary things (like strings and so forth) this gets messy. But if, say, they're all floats then they can be written directly to files and directly seek()ed to (apologies for the grammar).

I'm not sure Leonard was clear about the type(s) in his data. But if they're all, say, floats and the tuples are fixed length, then he can definitely save them in binary form, and either retrieve them with a seek and a read or, equally (probably better), mmap the file and just index straight into it. The array module lets you load a chunk of binary data (implicitly: from a file) and access it directly as an array of floats or other basic C-like types. It works really well. I believe numpy's ndarray stuff does something similar.
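A minimal sketch of the array-module approach, assuming fixed-width, all-float records (the file name and sizes are illustrative):

import array

WIDTH = 4  # floats per record; 21 for the real data

# Write three records of WIDTH floats each as raw binary.
flat = array.array("d", [float(n) for n in range(3 * WIDTH)])
with open("data.bin", "wb") as f:
    flat.tofile(f)

# Read it all back and index record 1, field 2 (zero-based) directly.
loaded = array.array("d")
with open("data.bin", "rb") as f:
    loaded.frombytes(f.read())

assert loaded[1 * WIDTH + 2] == 6.0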

If you really care, here's an overengineered example of exactly this:

which is what I'm using to store my accumulated solar inverter data. It's a file-of-floats, and I do random access into it.

That is designed for time-sampled data, but Leonard could do the same kind of thing for whatever indexing scheme he has for his own data.

Cheers,
Cameron Simpson cs@cskk.id.au

Just looking at Leonard's example, he has numbers and strings. So a binary file full of floats won't do the trick unless he separates the strings out into a separate file (even more work, and getting into the realm of choosing a database).

Cheers,
Cameron

You can save strings in a file with fixed-length records. You just have to pick a maximum length for each one.
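For example, a sketch using the struct module with a 16-byte string field (the record layout here is made up):

import struct

# Record layout: two 4-byte ints, a 16-byte string field, one more int.
FMT = "<ii16si"
RECORD_SIZE = struct.calcsize(FMT)  # 28 bytes, always

def pack_record(a, b, name, c):
    # struct pads short byte strings with NULs and truncates long ones.
    return struct.pack(FMT, a, b, name.encode("utf-8"), c)

with open("records.bin", "wb") as f:
    f.write(pack_record(1, 67, "Donuts", 98))
    f.write(pack_record(2, 31, "Coffee", 45))

# Seek straight to record 1 (zero-based) without reading record 0.
with open("records.bin", "rb") as f:
    f.seek(1 * RECORD_SIZE)
    a, b, raw, c = struct.unpack(FMT, f.read(RECORD_SIZE))

assert (a, b, raw.rstrip(b"\0").decode("utf-8"), c) == (2, 31, "Coffee", 45)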

I want to explain what I am doing. I wondered which way would be best to describe, for future programmers, what is being done in my program.

First, we are working with tuples. In my case I'm using both strings and integers in the tuple. Example: my_tuple = (1, 67, 'Donuts', 'Coffee', 98)

We know we can't add to a tuple directly; however, we can convert the tuple to a list, do the add or remove as needed, and turn the list back into a tuple.
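For example:

t = (1, 67, 'Donuts', 'Coffee', 98)
items = list(t)        # tuples are immutable, so work on a list copy
items.append(42)       # add (or remove) items as needed
t = tuple(items)       # and turn it back into a tuple
assert t == (1, 67, 'Donuts', 'Coffee', 98, 42)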

The next thing is to see that it does not matter if we add one data point or one hundred data points; we just have a larger tuple. In my case I'm adding a 21-variable sub-tuple every time. All the data is in 21-variable tuple sets, and all the tuples must have the same number of variables. This is how to get a tuple of tuples, or a tuple of sub-tuples.

Now that we have sets of 21 variables, we can use 21 as an offset, counting by 21 to get to the beginning of the next sub-tuple. Now we can address any of the 21 variables of that sub-tuple. This also means we can loop through the WHOLE tuple of sub-tuples, which is good for sorting and for finding the highest and lowest data point at the position we are addressing.
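A small sketch of that looping, using a width of 3 instead of 21 to keep it short:

WIDTH = 3  # illustrative; the real records use 21

flat = (1, 10, 'a',
        2, 30, 'b',
        3, 20, 'c')

def field_values(data, field, width=WIDTH):
    # Walk the flat tuple one record at a time, collecting one field.
    return [data[start + field] for start in range(0, len(data), width)]

assert max(field_values(flat, 1)) == 30
assert min(field_values(flat, 1)) == 10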

In my case I don't know if I'll have one sub-tuple or one hundred sub-tuples at any given time. The benefit of doing data this way is that we can address all the data in a sub-tuple, or the entire tuple of sub-tuples, regardless of the size at any time.

This is the question. How do I document this process so other programmers understand what is going on?

Sorry I was not able to express it correctly.
Thanks again for your help and patience.

Make your code communicate well. There are several things you can do:

  • Model the domain in which your program lives - use classes to represent the concepts in that domain, rather than just using sequences to structure things, as I mentioned above.
  • Name variables and functions in a meaningful way so as to reveal their intent. Break long functions down into small ones. The book Clean Code by Bob Martin gives a lot of good guidance on these sorts of things.
  • Write tests, with e.g. unittest or the third-party pytest, to describe the intended behaviour of components of the program and the program as a whole. Not only do they make sure code continues to work as you and others change it, but they also let people know that the behaviour there was intentional (see the sketch after this list).
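For instance, a minimal test in the pytest style (the helper record_at is made up for illustration):

# test_records.py -- run with `pytest`

def record_at(flat, n, width=21):
    """Return the nth fixed-width record from a flat sequence."""
    return flat[n * width:(n + 1) * width]

def test_record_at_returns_the_nth_group():
    flat = tuple(range(42))  # two records of 21 fields each
    assert record_at(flat, 1)[0] == 21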

But why?

Modifying tuples is going to tend to give O(n^2) algorithms, since every change copies the whole tuple.

Also, a list of lists or list of tuples would be much more self-documenting.

If you're truly committed to this relative strangeness, you could say something like "item(n) is tuple_[n * 21:(n + 1) * 21] (zero-based)". Or better, wrap it up in a class that does the indexing for you.
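A sketch of such a class (the name RecordView is made up):

class RecordView:
    """Read-only view of a flat sequence as fixed-width records."""

    def __init__(self, flat, width):
        if len(flat) % width:
            raise ValueError("data length is not a multiple of the width")
        self.flat = flat
        self.width = width

    def __len__(self):
        return len(self.flat) // self.width

    def __getitem__(self, n):
        start = n * self.width
        return self.flat[start:start + self.width]

flat = (100, 101, 102, 103, 2000, 2001, 2002, 2003)
records = RecordView(flat, width=4)
assert records[1][2] == 2002  # third item of the second record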

Thanks for your feedback. First, I won't know how many tuples there are at any given time. Second, it is easy to address any data point by direct or indirect (offset) address. Third, dense data storage, even though today that is not so important. Fourth, it is easy to detect corrupted data by size; no data in a data point may mean that a sensor went down. Fifth, it is easy to implement. Sixth, the most important, yet a side note: I've had a major stroke. Somewhere in my brain I made a known connection between my pre-stroke brain and the rebuilt brain function. I had lost ALL of my computer skills, and this is the first time I have been able to program since the stroke. This approach was normal when memory was at a premium.

How would you handle the top five differently? There are so many ways to see and do things in computing; no one way is always best in programming, as every project is different. So I want to hear your thoughts. I've been learning Python for six months now.

Thanks for your time and thoughts.