File Upload in FastAPI (and from there to S3) without intermediate file

I’m trying to implement a file upload endpoint in FastAPI at work.
The goal is to accept files from the client and pass them on to an S3 bucket.
In order to fulfill security requirements it must run in a read-only container.
That’s why I’m currently struggling to do this using UploadFile (because this uses disk to store the uploaded file).

Does somebody know if it is possible to configure UploadFile such that no intermediate file is written? Ideally, I want to stream the incoming file directly to the S3 bucket (so reading it into memory first is not an option either).

Thanks for your time!

I’m not sure about a streaming option to S3, but you can use the boto3 S3 client’s put_object method to upload your file from an in-memory buffer (io.StringIO or io.BytesIO) created from the file contents, without the file ever being written to disk.

I don’t know what kind of file you want to write. Let’s say, for the sake of illustration, that it’s a CSV created from a dataframe data; then this would work:

import io
import boto3

s3_client = boto3.client("s3")

with io.StringIO() as csv_buffer:
    data.to_csv(csv_buffer, index=False)

    response = s3_client.put_object(Bucket=<bucket name>,
                                    Key=<s3 filename>,
                                    Body=csv_buffer.getvalue())

I don’t quite see the possibility of uploading a file from a source to S3 without at least an object representing either the file on disk, or its in-memory content - for put_object the body must be a “bytes or seekable file-like object” according to the documentation. I think this applies to all of the upload and put methods on the boto3 S3 client.

P. S. You may be able to find other 3rd party libraries that do this, but not boto3 as far as I can see. Not sure about FastAPI UploadFile either, but could it be combined with boto3 S3?

What does it mean to “accept files from the client” if you cannot read them into memory? How can you pass on something that you don’t have in memory?

Looking at the FastAPI (and underlying Starlette) source code for UploadFile, I don’t see anything that says the files must be stored on disk. In general, any in-memory BytesIO buffer (or really any file-like object) should be usable as the file. By default the files are assumed to be tempfile.SpooledTemporaryFile objects, but the source code is completely general and doesn’t require this (and those temp files are only written to disk when a particular max_size is exceeded or when file.rollover() is called). So, in the implementation of the endpoint, you should be able to just use a BytesIO object to handle the content that needs to be passed on?
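For illustration, a minimal sketch of what combining UploadFile with boto3 could look like (the bucket name here is made up, and whether anything ever touches disk depends on the spool’s max_size):

import boto3
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
s3_client = boto3.client("s3")


@app.post("/upload")
def upload(file: UploadFile = File(...)):
    # UploadFile wraps a SpooledTemporaryFile; hand the underlying
    # file object to boto3 instead of reading everything into one bytes blob.
    s3_client.upload_fileobj(file.file, "my-bucket", file.filename)
    return {"filename": file.filename}

Declaring the endpoint with a plain def lets FastAPI run it in a thread pool, so the blocking boto3 call doesn’t stall the event loop.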

A more general point on streaming to keep in mind is that you cannot know the size of the incoming stream/data in advance, whereas an upload is of something that already exists in its entirety and has a known size.


I can think of three more options:

  1. You can configure certain directories in Linux to be mounted in-memory, but you still have to save the entire file before sending to S3.

  2. You can return a pre-signed URL and have the client upload with it. If the request was a PUT and redirects are enabled, you can return an HTTP 307 so the client automatically starts uploading to S3. Otherwise you’ll need control over the client to have it use the returned pre-signed URL (a sketch of this follows the list).

  3. You can construct an HTML form for use in a browser with pre-authed POST upload. Given that you’re using FastAPI, you’re probably not dealing with browsers.
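A rough sketch of option 2 (the bucket name, key scheme and expiry here are made up; generate_presigned_url and a 307 RedirectResponse are the real pieces):

import boto3
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()
s3_client = boto3.client("s3")


@app.put("/upload/{key}")
async def upload(key: str):
    # Pre-sign a PUT against the bucket for this key.
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-bucket", "Key": key},
        ExpiresIn=3600,
    )
    # 307 preserves the method and body, so a client that follows
    # redirects re-sends its PUT body straight to S3.
    return RedirectResponse(url, status_code=307)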


Most upload requests come with a Content-Length header.


A stream is usually defined as a continuously self-updating data source, where its “size” isn’t known in advance. So, if you have a set of files sent by a client in an HTTP request, with the content length available in the headers, then you are dealing with a non-streaming scenario.

How can you pass on something that you don’t have in memory?

I’m thinking of a generator-like approach where each chunk of data is immediately processed (i.e. is sent to the S3 storage).
This way I hope there is no need to store the complete file in memory.

In the end I abandoned UploadFile and used a starlette.Request object.
This object has a method, Request.stream(), which lets you iterate asynchronously over the incoming data chunks.
I “hide” this stream behind a file-like interface which I pass on to boto3’s s3.upload_fileobj.
Unfortunately, boto3 does not support async, so I have to run the S3 upload within a thread.

The result looks roughly like this:

from queue import Queue
from threading import Thread

import boto3
from fastapi import FastAPI, Request

app = FastAPI()


class Filelike:
    """File-like wrapper that reads chunks from a queue."""

    def __init__(self, q):
        self.q = q
        self.buffer = b""
        self.eof = False

    def read(self, size=-1):
        # Pull chunks off the queue until we have `size` bytes
        # (or the None sentinel signals the end of the stream).
        while not self.eof and (size < 0 or len(self.buffer) < size):
            chunk = self.q.get()
            if chunk is None:
                self.eof = True
                break
            self.buffer += chunk

        if size < 0:
            result, self.buffer = self.buffer, b""
        else:
            result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result


def upload_to_s3(file_):
    s3 = boto3.client("s3")
    s3.upload_fileobj(file_, "bucket", "key")


@app.post("/upload")
async def upload(request: Request):
    q = Queue()

    t = Thread(target=upload_to_s3, args=(Filelike(q),))
    t.start()

    async for chunk in request.stream():
        q.put(chunk)
    q.put(None)

    t.join()

    return {"result": "success"}

This approach worked so far, but I’m happy about critique and suggestions.

You might run into issues with boto3 making concurrent requests on larger files, leading to out-of-order reads. To force boto3 to use a single-threaded approach, you can set the config as:

from boto3.s3.transfer import TransferConfig

# Disable thread use/transfer concurrency
config = TransferConfig(use_threads=False)

s3 = boto3.client('s3')
s3.upload_fileobj(file_, "bucket", "key", Config=config)

Taken from: File transfer configuration - Boto3 1.34.86 documentation, which uses download_file as an example, but the same applies to upload_fileobj.
