How to export 800 MB of data with the requests library and write it to a file in a minute or so?

@kpfleming I have tried using httpx:

import asyncio
from pathlib import Path

import httpx

# save_path, filename, access_token and url are defined elsewhere
pbix_fileName = Path(f'{save_path}/{filename}.pbix')

async def main():
    with open(pbix_fileName, 'wb') as f:
        headers = {'Authorization': f'Bearer {access_token}'}
        async with httpx.AsyncClient() as client:
            async with client.stream('GET', url, headers=headers, timeout=None) as r:
                async for chunk in r.aiter_bytes(20480):
                    f.write(chunk)

asyncio.run(main())

but exporting 6 MB is taking 3 minutes.

@steven.daprano Yes, I tried reading 1 MB chunks but it does not solve the problem.

If you change the code to just read the data but not write it, how long does it take to process 10MB?
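For example, a read-only timing sketch might look like this (it reuses the url and access_token names from your code, so it is only an illustration):

import asyncio
import time

import httpx

async def read_only():
    # Download the response in chunks but discard them, so only the
    # read side is being measured.
    headers = {'Authorization': f'Bearer {access_token}'}
    total = 0
    async with httpx.AsyncClient() as client:
        async with client.stream('GET', url, headers=headers, timeout=None) as r:
            async for chunk in r.aiter_bytes(20480):
                total += len(chunk)
    return total

start = time.perf_counter()
size = asyncio.run(read_only())
print(f'Read {size} bytes in {time.perf_counter() - start:.1f} seconds')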

@kpfleming For reading 6 MB of data it is taking around 3 minutes.

OK, then you have a reading issue, not a writing issue, as it took the same amount of time to read the data without writing it as it took to both read and write it. You’ll need to focus on that aspect of the problem.

@kpfleming I can download the same file in 10 minutes using PowerShell and the REST API.

Yes, you’ve made that clear. Unfortunately there’s not much else we can do here, as the Python code you’ve shown us is probably the most effective way to do the task you are trying to do.

The cause of the difference in performance is not your code, it’s something else. You might be using a poorly-optimized Python interpreter, or running it in an environment where it has restricted access to the network, or you might be using a different API endpoint in each of the two situations. There are a lot of potential variables, and you’ll need to investigate them to try to understand the cause of the difference in performance.

Hi Kevin,

I disagree with your statement that Priyanka’s code is as efficient as possible.

The code shown to us is reading from the remote site in 8KB blocks. That is not efficient.

At the very least, that should be fixed. Priyanka should try larger block sizes, or just leave it up to requests to set the block size.
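For instance, a requests-based sketch along those lines (reusing the url, access_token, and pbix_fileName names from the earlier code purely for illustration) might be:

import requests

headers = {'Authorization': f'Bearer {access_token}'}

with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(pbix_fileName, 'wb') as f:
        # Try a much larger block size, e.g. 1 MB, or pass
        # chunk_size=None to take the data as it arrives.
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)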

If possible, Priyanka should try to provide a Minimal Working Example that we can actually run ourselves, to confirm that when we run the code and download from the same URL, we get the same result.

Of course that assumes that we can download from the same URL that Priyanka is downloading from.

Another likely possibility is that the remote site is throttling the download. Perhaps it doesn’t like the default user-agent that Python is using, or that cookies are not set, or there is no referer.
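If the server does care about such headers, it may be worth experimenting with something like this (the values below are only examples, not known requirements of the remote site):

headers = {
    'Authorization': f'Bearer {access_token}',
    # Example values only; adjust to whatever the remote site expects.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://example.com/',
}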

Or maybe Priyanka has downloaded this same file so many times that the remote server is now throttling the connection.

Beyond that, as you say, there are too many potential variables for us to diagnose the problem just by reading the code.

@steven.daprano I have updated the code but am still facing the same issue.

One thing I did: I read an 840 MB file (the size I'm expecting from the API) and rewrote it to disk. It took just 24 seconds.
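A local read-and-rewrite timing check like that might look roughly like this (the file names are placeholders, not the actual code I ran):

import time

src = 'sample_840mb_file.pbix'   # placeholder name for the local test file
dst = 'rewritten_copy.pbix'

start = time.perf_counter()
with open(src, 'rb') as fin, open(dst, 'wb') as fout:
    while True:
        block = fin.read(1024 * 1024)   # read and rewrite in 1 MB blocks
        if not block:
            break
        fout.write(block)
print(f'Copied in {time.perf_counter() - start:.1f} seconds')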

Indeed, that was the case. Later, httpx-based code was posted which did not have that particular problem… but since I didn’t look closely at it at the time, it’s actually worse :slight_smile:

Priyanka, in your httpx code example above, you’ve specified 20480 as the argument to aiter_bytes. That limits the chunks to 20 KB each, which is extremely small. There is no need to specify an argument to aiter_bytes at all; just let it return whatever size chunks it wants to return.

Hi @kpfleming
I tried your suggestions but no change in execution time.

async with httpx.AsyncClient() as client:
    async with client.stream('GET', url, headers=headers, timeout=None) as r:
        data = b''
        async for chunk in r.aiter_bytes():
            if chunk:
                data += chunk
        print("success!")
        return data

Mystery solved! (I hope.)

This part of your code is a dangerous anti-pattern that can lead to extremely slow performance:

# Don't do this, it is an Anti-Pattern
data = b''
for chunk in something:
    data += chunk

The problem is that each time you go through the loop, the interpreter has to create a new byte-string by copying the old one. If you know Big O notation, this is O(n**2), so it can be very slow.

There is a FAQ about this:

https://docs.python.org/3/faq/programming.html#what-is-the-most-efficient-way-to-concatenate-many-strings-together

So you could use a bytearray object, as the FAQ suggests, or a list:

data = []
for chunk in something:
    data.append(chunk)
data = b''.join(data)  # Convert to bytes object.
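The bytearray version from the FAQ is similar:

data = bytearray()
for chunk in something:
    data += chunk      # in-place extension of a bytearray is cheap
data = bytes(data)     # Convert to an immutable bytes object.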

This is why you should always show your code.


Also, if your goal is to just write the response data into a file, you may consider combining the aiofiles package with httpx to asynchronously write the data as it is received.
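A rough sketch of that combination (reusing the url, access_token, and pbix_fileName names from earlier, and assuming the aiofiles package is installed) might look like this:

import asyncio

import aiofiles
import httpx

async def download():
    headers = {'Authorization': f'Bearer {access_token}'}
    async with httpx.AsyncClient() as client:
        async with client.stream('GET', url, headers=headers, timeout=None) as r:
            r.raise_for_status()
            # Write each chunk to disk as it arrives instead of
            # accumulating the whole response in memory.
            async with aiofiles.open(pbix_fileName, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)

asyncio.run(download())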