Check for validity of raw file access (i.e. no compression)?

I am wondering if there is a well defined way to check for a file object which:

  1. Has fileno()
  2. Is not compressed: duping or using fileno() directly gives the same content (modulo buffering, I suppose).

The first is actually easy: you can check for some of the IO base classes and do try: f.fileno() / except OSError. The second is the tricky one.
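
For the first point, a rough sketch of that check (the helper name here is just made up):

import io

def has_fileno(f):
    """Point 1 above: is f an io object with a usable fileno()?"""
    if not isinstance(f, io.IOBase):
        return False
    try:
        f.fileno()
    except OSError:  # io.UnsupportedOperation is an OSError subclass
        return False
    return True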

NumPy has a few places that do direct file access (or do so conditionally). Whether or not that is always useful, it would be nice to be able to check precisely whether it is valid, if only to reject compressed files in the places that always do raw access.
(Python’s mmap probably works directly with the fileno for this type of reason.)

Maybe there is a ‘well defined way’ that another FM knows of.

I would use the likes of filetype which (by the looks of it, but untested by me) should be able to detect compressed files. From that you should be able to list out any files that are compressed, thus leaving you with a list of files that are not compressed.
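
Untested as well, but a sketch of what I mean might look like this (filetype.guess() sniffs the file’s magic bytes and returns a match object or None; the extension set and the file names are just ad-hoc examples):

import filetype  # third-party package: pip install filetype

# Ad-hoc list of extensions I would treat as "compressed"; adjust to taste.
COMPRESSED = {"gz", "bz2", "xz", "zip", "zst", "lz4", "7z"}

def looks_compressed(path):
    kind = filetype.guess(path)  # sniffs the file's leading magic bytes
    return kind is not None and kind.extension in COMPRESSED

paths = ["data.npy", "data.npy.gz"]  # hypothetical files
plain = [p for p in paths if not looks_compressed(p)]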

Just to this: mmap maps an OS file descriptor, so it has to use what
comes from fileno() (unless you do eg os.open and only have a file
descriptor).

I’m curious about your use case for what you’re trying to infer from a
file though. It might be enough to test isinstance(f, io.FileIO).

Cheers,
Cameron Simpson cs@cskk.id.au

Yes, maybe. I am looking for some confirmation that this is expected to work (that is, preferably not a long and incomplete list of file classes or an odd dependency). BZ2File and GzipFile don’t inherit from FileIO, so maybe it is at least an approximation.
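
For instance (and the plain buffered open() below is why it can only ever be an approximation):

import bz2, gzip, io

print(issubclass(gzip.GzipFile, io.FileIO))  # False
print(issubclass(bz2.BZ2File, io.FileIO))    # False

with open(__file__, "rb") as f:              # a perfectly plain file...
    print(isinstance(f, io.FileIO))          # False: it's a BufferedReader
    print(isinstance(f.raw, io.FileIO))      # True for the underlying raw object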

NumPy has functions that, unlike mmap, do accept file objects and will use raw access through fileno() either always or conditionally when it seems right (e.g. to use mmap internally).

There is, to my knowledge, no standard way to signal “this file contains compressed data”; thus, there cannot exist a well defined way to check whether a file is compressed.

Why not just try to read the file, and see what happens[1]? If UnicodeDecodeError is raised, the file is not valid for whatever it is you are trying to do, possibly because it is compressed.


  1. it’s easier to ask forgiveness than permission. ↩︎
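
A minimal sketch of that EAFP idea, assuming a seekable, text-mode file object:

def looks_usable(f):
    """Illustrative only: try a small read and see what happens.

    A gzip/bz2 stream opened as text will typically fail to decode on
    the first read.
    """
    pos = f.tell()
    try:
        f.read(1024)
    except UnicodeDecodeError:
        return False
    else:
        return True
    finally:
        f.seek(pos)  # rewind so the caller can still use the file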

There is no enforced requirement to use a file type that indicates compression.
Also, compression can be built into the file system itself.
For example, btrfs on Linux can use compression for all files by default, I think.
There are ways to turn compression off in the file system when this is a problem.

NumPy has functions that, unlike mmap, do accept file objects and
will use raw access through fileno() either always or conditionally
when it seems right (e.g. to use mmap internally).

Sebastian should probably clarify, but I have the impression that the
objective is to figure out (a) whether it is possible to make another view of
the file contents by dup()ing fileno(), and (b) whether what is thus
obtained is the same content as what reading the file gives.

His objective is still unclear to me, but I think he’s trying to know
whether a more “direct” access to the file content will be sane. His
citation of Numpy mmapping the underlying file suggests this to me. I
do not believe he cares if the underlying filesystem (which stores the
file itself, eg btrfs) compresses, just whether the raw uncompressed
content can be obtained via f.fileno() in a naive manner.

Personally I’d be pretty uncertain trying to infer this. This bodes ill:

 >>> f=open('../../.profile','rb')
 >>> type(f).__mro__
 (<class '_io.BufferedReader'>, <class '_io._BufferedIOBase'>, <class '_io._IOBase'>, <class 'object'>)
 >>> import io
 >>> isinstance(f,io.FileIO)
 False

That’s as direct as one gets, I suspect, and no joy.

What’s the end objective here where just reading from the file won’t do?

Cheers,
Cameron Simpson cs@cskk.id.au

Sounds that way to me, too, and that suggests that any caching within the file object itself spells probable doom too.

Yes, duping for direct access like a mmap, and yes, that presumably means that caching can spell doom as well (I don’t know offhand what happens when you dup right now).
The point was whether there is a simple way to remove the worst trap in 20+ year old code, not to make it fully safe. Few users will pass already-used files, but they may open a gzip file and pass that.

Is that good API? Maybe not from that point of view. Although there are some convenient things, like supporting memory-mapping a file’s data based on a kwarg, that would get cluttered (or even become hard) if you force a fileno/path. (The point of which is: I am not willing to deprecate that, at least not at this point.)

Yes, duping for direct access like a mmap, and yes, that presumably
means that caching can spell doom as well (I don’t know offhand what
happens when you dup right now).

Does this mean you do not know what os.dup() does? It obtains another
file descriptor for the original descriptor (the int from
f.fileno()).

They are not independent.

In POSIX, a file descriptor is just an int we use to talk about an
open file via the OS interfaces. A process has a mapping from the file
descriptors (these ints) to the file handle inside the kernel, which
represents the open file. In particular, it has a seek position for the
file. If you read from one file descriptor, that advances the seek
position, and that change will be visible via the other file descriptor
(because they both point at the same file handle).

Note there’s an os.pread call which doesn’t move the file pointer.
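
A small POSIX-flavoured demonstration of both points:

import os, tempfile

# dup'd descriptors share one file handle (and therefore one seek
# position); os.pread reads at an offset without moving it.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"0123456789")
    tmp.flush()

    fd = os.open(tmp.name, os.O_RDONLY)
    fd2 = os.dup(fd)                      # second descriptor, same handle

    os.read(fd, 4)                        # advances the shared position
    print(os.lseek(fd2, 0, os.SEEK_CUR))  # -> 4, visible via the dup too

    os.pread(fd2, 4, 0)                   # read at offset 0, position untouched
    print(os.lseek(fd2, 0, os.SEEK_CUR))  # -> still 4

    os.close(fd2)
    os.close(fd)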

But if you’re eg mmapping the file, you need not worry about that,
because you’re not doing anything which moves the file pointer that way
either.

The purpose of the dup() is to obtain a secondary file descriptor so
that if the first is closed (eg by closing the file) you’ve still got
valid access to the file (and, eg, its mmap) because there’s still a
file descriptor sitting around (you’ll need to close it yourself when
you’re done with it).
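
Roughly this pattern (the file name is a placeholder for some existing, non-empty file):

import mmap, os

f = open("data.bin", "rb")
fd = os.dup(f.fileno())  # secondary descriptor we own
f.close()                # the original file object can now go away

mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
head = mm[:16]           # raw bytes straight from the mapping, no read()

mm.close()
os.close(fd)             # our dup'd descriptor, so we close it ourselves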

Note that if you’re not using the file outside of the original
open/close sequence, and not moving its read pointer, you don’t need to
use os.dup() at all!

The point was whether there is a simple way to remove the worst trap in 20+ year old code, not to make it fully safe. Few users will pass already-used files, but they may open a gzip file and pass that.

I would test stat.S_ISREG(os.fstat(f.fileno()).st_mode). If that’s
True you’ve got a regular (data) file and you’re probably just fine.
And I’d expect it to raise some exception for some non-regular files or
“pseudofiles” of whatever kind.
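
As a sketch (the except clause is a guess at how pseudofiles or descriptor-less objects would surface):

import os, stat

def is_regular_file(f):
    """Does f.fileno() refer to a regular file?"""
    try:
        mode = os.fstat(f.fileno()).st_mode
    except (OSError, AttributeError):
        return False
    return stat.S_ISREG(mode)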

If you have a file-like object giving you the decompressed data from a
gzip file, I don’t expect there to be a working f.fileno(). (I could be
wrong; some gunzipping wrapper might keep the fileno hanging around.)

Is that good API? Maybe not from that point of view. Although there
are some convenient things, like supporting memory-mapping a file’s data
based on a kwarg, that would get cluttered (or even become hard) if you
force a fileno/path. (The point of which is: I am not willing to
deprecate that, at least not at this point.)

I think I need to see the source code, or more explanation; I still
don’t understand in enough detail, I think.

Cheers,
Cameron Simpson cs@cskk.id.au

The answer seems clearly “no”. Beyond that, we are trying to solve an XY problem, but since this is legacy code I doubt you can figure out a simple thing there; happy to be surprised though.

I don’t know what to say, there are two things:

np.fromfile(file)

which does raw access in C, but does accept open files. And yes, that code could be changed to just use read/write in chunks. Some code uses that indirectly:

if fileobj(f):  # has fileno() and is io.FileIO
    # np.fromfile(f)
else:
    # f.read()

But, more interesting:

if mmap:  # user request
    # should check if that makes sense.
    mmap(f.fileno())
else:
    if fileobj(f):  # same as above

Which has no clear “maybe it’s better to just use .read() and .write() anyway” fallback.

This is always about direct/raw file access making sense. Maybe it helps to say that e.g. fromfile accesses the file in C, although I think the mmap path is more interesting.
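
For concreteness, the fileobj(f) predicate in the snippets above might look roughly like this (my own reconstruction, not NumPy’s actual code):

import io

def fileobj(f):
    """Reconstruction of the check sketched above: f has a working
    fileno() and is a raw, unbuffered io.FileIO object."""
    try:
        f.fileno()
    except (AttributeError, OSError):
        return False
    return isinstance(f, io.FileIO)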

I would test stat.S_ISREG(os.fstat(f.fileno()).st_mode).

That doesn’t say anything about whether it makes sense to interpret the raw bytes you get via fileno() directly (e.g. through a mmap), since you have lost the information that it is a bz2 file:

In [11]: import bz2, stat, os
In [12]: f = bz2.BZ2File("/tmp/asdf.bz2", "w")
In [13]: stat.S_ISREG(os.fstat(f.fileno()).st_mode)
Out[13]: True

The answer seems clearly “no”. Beyond that, we are trying to solve an
XY problem, but since this is legacy code I doubt you can figure out a
simple thing there; happy to be surprised though.

I don’t know what to say, there are two things:

np.fromfile(file)

which does raw access in C, but does accept open files.

Maybe be quite picky here. Eg checking type(f) in (TextIOWrapper, BufferedReader, FileIO), or maybe just FileIO, and silently falling
back to plain old f.read() if it isn’t totally sure. That list of
classes is totally ad hoc, from the experiment below:

 Python 3.10.6 (main, Aug 11 2022, 13:47:18) [Clang 12.0.0 
 (clang-1200.0.32.29)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> f=open('../../.profile','r')
 >>> type(f)
 <class '_io.TextIOWrapper'>
 >>> f=open('../../.profile','rb')
 >>> type(f)
 <class '_io.BufferedReader'>
 >>> f=open('../../.profile','rb',buffering=0)
 >>> type(f)
 <class '_io.FileIO'>

Personally, I’d probably only feel safe constraining myself to classes
whose behaviour you truly know, and silently falling back to read() for
anything else.
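
Something along these lines, perhaps (an ad-hoc sketch; it assumes the raw file is still at position 0):

import io, os

def read_all(f):
    """Take the descriptor-based path only for a class whose behaviour
    we truly know; otherwise just use plain read()."""
    if type(f) is io.FileIO:
        # Raw, unbuffered file: reading the descriptor directly is safe
        # (assuming nothing has moved the position yet).
        fd = f.fileno()
        return os.read(fd, os.fstat(fd).st_size)
    # Buffered, text, gzip-wrapped, or anything else: boring fallback.
    return f.read()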

And yes, that code could be changed to just use read/write in chunks.
Some code uses that indirectly:

if fileobj(f):  # has fileno() and is io.FileIO
    # np.fromfile(f)
else:
    # f.read()

But, more interesting:

if mmap:  # user request
    # should check if that makes sense.
    mmap(f.fileno())
else:
    if fileobj(f):  # same as above

Which has no clear “maybe it’s better to just use .read() and .write() anyway” fallback.

No, but maybe the mmap (flag from the caller) says that the caller
might have a policy opinion there.

I think mmapping a big file and reading its data directly should be
faster than reading it in a stream progressively. If nothing else, you
can mmap the whole file and know exactly how big it is and therefore
how many values you need to allocate. If the data are already in machine
format, you (well, numpy) can probably copy them directly to an array
or numpy’s equivalent.
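
For example, a rough sketch of that direct path (the file name and dtype are made up):

import mmap
import numpy as np

with open("samples.f64", "rb") as f:  # hypothetical raw data file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The mapping's length tells us exactly how many values exist;
        # frombuffer then pulls them out without a streaming read loop.
        count = len(mm) // np.dtype(np.float64).itemsize
        values = np.frombuffer(mm, dtype=np.float64, count=count).copy()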

This is always about direct/raw file access making sense. Maybe it helps to say that e.g. fromfile accesses the file in C, although I think the mmap path is more interesting.

The C code might also mmap stuff. I played around with mmap and
array myself recently:
https://hg.sr.ht/~cameron-simpson/css/browse/lib/python/cs/timeseries.py#L1877

I would test stat.S_ISREG(os.fstat(f.fileno()).st_mode).

That doesn’t say anything about whether it makes sense to interpret the raw bytes you get via fileno() directly (e.g. through a mmap), since you have lost the information that it is a bz2 file:

In [11]: import bz2, stat, os
In [12]: f = bz2.BZ2File("/tmp/asdf.bz2", "w")
In [13]: stat.S_ISREG(os.fstat(f.fileno()).st_mode)
Out[13]: True

Ah, exactly the case I feared. So that test is no good.