Add search methods from bytes/bytearray objects to io.BytesIO

lingeringwill · November 5, 2022, 1:22pm

I worked heavily with BytesIO objects in one of my applications, and recently I thought that it would be useful to be able to look through the object for specific values.

For this I did something like file.read().find(text), which I think creates a copy of the entire bytes object.

So my suggestion is to add some methods from the bytes/bytearray object to the BytesIO class. I’m mainly thinking of: count, find, index, replace, rfind, rindex.

I think it’s pretty easy to implement and useful enough to be worth it.

It’s also possible to add the same methods to other in-memory file-like objects in the standard library.

barry-scott · November 5, 2022, 2:53pm

It is reading from the disk and does not have all the contents in memory as you seem to suggest.

lingeringwill · November 5, 2022, 4:08pm

file is a BytesIO instance in the example that I’ve shown.

Rosuav · November 5, 2022, 4:34pm

Are you able to use BytesIO.getbuffer() and do your searching on that?

tjreedy · November 5, 2022, 8:33pm

-1
A BytesIO is a seekable, buffered, non-tty, no-fileno, read-write IO file object implemented in memory. It is intended to be interchangeable with other file objects with the same properties, or a subset thereof. Its main use is to substitute for non-memory files during development and testing. A StringIO might be used as a writable text equivalent of bytearray. (Indeed, one might wonder whether you should use bytearrays instead of BytesIOs.)

lingeringwill · November 6, 2022, 2:05am

Possible, but as far as I know the memoryview that you get from this doesn’t have a native find method or anything similar, so you would have to write your own function, and I don’t think that my own implementation would be as fast as a native C implementation from the standard library.
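One way to search the getbuffer() view without writing a search loop by hand is to let the re module scan it, since bytes patterns accept bytes-like objects such as memoryview. This is only a minimal sketch: find_in_buffer is a hypothetical helper name, and it has not been benchmarked against bytes.find().

```python
import io
import re


def find_in_buffer(buf: io.BytesIO, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1 (hypothetical helper)."""
    # getbuffer() exposes the contents without copying. The BytesIO cannot be
    # resized while the view exists, so the view is released via `with`.
    with buf.getbuffer() as view:
        # re.escape() makes the needle match literally rather than as a regex.
        match = re.search(re.escape(needle), view)
        return match.start() if match else -1


buf = io.BytesIO(b"header\x00payload\x00trailer")
pos = find_in_buffer(buf, b"payload")
```

The stream position of buf is not touched, so reads can continue from wherever they were.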

lingeringwill · November 6, 2022, 2:18am

Do you have a resource that states that BytesIO objects should only be used for testing? I’m currently using them to load my files into memory and they work perfectly fine. Loading the whole file into a BytesIO object and working with it seems to be even faster than reading a hard drive file in small chunks. The only problem that I could think of is running out of memory if you load big files, but that’s not a concern for my application.

A bytearray is useable but not being able to seek to a certain position would be a headache. I thought one of the advantages of BytesIO was that it kept track of the position that you’re in while reading.

I didn’t use StringIO because I’m working with data containing multiple data types.

Rosuav · November 6, 2022, 2:29am

Hang on. If you’re reading the whole file in, that sounds like a job for memory mapping. A mmap object has a find method built in. Although I can’t go much further into that, as I haven’t used mmap very much myself.
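A rough sketch of the mmap approach, assuming the data lives in a regular file on disk; find_in_file is a hypothetical helper name, not something from the standard library.

```python
import mmap


def find_in_file(path: str, needle: bytes, start: int = 0) -> int:
    """Return the offset of needle in the file at path, or -1 (hypothetical helper)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mmap objects provide find() with the same return convention
            # as bytes.find().
            return mm.find(needle, start)
```

A mmap object also supports seek(), tell() and read(), so it can stand in for a file object in many places.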

barry-scott · November 6, 2022, 3:29pm

It’s an array, so you do not need to seek; just use an offset?

Since you are reading all the data into memory, why not read into a bytes object?

with open(file, 'rb') as f:
    data = f.read()
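To illustrate the offset idea: with a plain bytes object, find() and struct.unpack_from() both take a start offset, so an integer variable can play the role of the file position. The record layout below (a 4-byte marker followed by a little-endian uint16 and uint32) is made up for the example.

```python
import struct

# Hypothetical data: marker, then a little-endian uint16 and uint32, then the rest.
data = b"MAGC" + struct.pack("<HI", 7, 123456) + b"rest"

offset = data.find(b"MAGC")        # locate the marker; -1 if it is absent
offset += len(b"MAGC")             # step past it, as seek() would
version, length = struct.unpack_from("<HI", data, offset)
offset += struct.calcsize("<HI")   # advance the "position" by hand
remainder = data[offset:]
```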

marcelm · November 9, 2022, 8:46pm

If you are at position 0 in file when you run this, then an optimization in the CPython implementation of BytesIO.read() should apply: If the position is 0 and no explicit size was requested, then read() just returns a reference to the internal bytes buffer.

However, if I understand the code correctly, a copy is still triggered when getbuffer() is involved: If you call it before read(), then read() makes a copy. If you call it after read(), then getbuffer() makes a copy. This makes sense because the returned memoryview object allows modifications, while a bytes object is immutable. So they should not be backed by the same memory.

Edit: getvalue() also doesn’t copy as long as it’s not combined with getbuffer().
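A small sketch of the difference in behaviour between the two spellings. Whether a copy is avoided is a CPython implementation detail as described above; what the example shows is only the documented effect on the stream position.

```python
import io

buf = io.BytesIO(b"spam eggs spam")
buf.seek(5)  # pretend we are mid-stream

# getvalue() returns the whole contents and leaves the position untouched,
# so the result is an absolute index into the stream.
idx = buf.getvalue().find(b"spam", buf.tell())
pos_after_getvalue = buf.tell()

# read() returns only the data from the current position onward and moves
# the position to the end, so the result is relative to where we started.
rel = buf.read().find(b"spam")
pos_after_read = buf.tell()
```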

lingeringwill · November 12, 2022, 2:38pm

Thanks! I already noticed some performance improvements with getvalue().find() so I guess that’s what I’m gonna use for now.