How to export 800 MB of data with the requests library and write it to a file in a minute or so?

I am exporting a large dataset from the ‘/export’ endpoint of Power BI. Downloading and writing a roughly 800 MB .pbix file is taking more than 5 hours. How can we reduce that to minutes?

import os
import requests

with requests.get(url, headers=headers, params=params, stream=True) as response:
    response.raise_for_status()
    with open(pbix_fileName, 'wb') as report_file:
        for chunk in response.iter_content(chunk_size=1024 ** 2):  # 1 MiB chunks
            if chunk:
                report_file.write(chunk)
                report_file.flush()
                os.fsync(report_file.fileno())
        print("success!")

How rapidly can you do the same thing with curl or wget on the same machine? That’s the lower bound for how quickly this operation can be done, so you need to measure that first.

@kpfleming Actually, I have tried the same thing using PowerShell. Downloading and writing to a file takes around 10 minutes. Yes, on the same machine.

Why flush+fsync? With an fsync after every chunk, terrible performance is to be expected.
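To see the cost, here is a small benchmark sketch (temp files; the chunk size and count are arbitrary) comparing plain buffered writes against fsync-per-chunk:

```python
import os
import tempfile
import time

def write_chunks(path, n_chunks, chunk, fsync_each=False):
    """Write n_chunks copies of chunk, optionally fsync'ing after each one."""
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(chunk)
            if fsync_each:
                f.flush()
                os.fsync(f.fileno())   # force the chunk to disk before continuing

chunk = b"\0" * (64 * 1024)            # 64 KiB per chunk (sizes are arbitrary)
with tempfile.TemporaryDirectory() as d:
    plain = os.path.join(d, "plain.bin")
    synced = os.path.join(d, "synced.bin")

    t0 = time.perf_counter()
    write_chunks(plain, 200, chunk)
    t_plain = time.perf_counter() - t0

    t0 = time.perf_counter()
    write_chunks(synced, 200, chunk, fsync_each=True)
    t_synced = time.perf_counter() - t0

    # Both files hold the same 12.5 MiB; only the sync policy differs.
    assert os.path.getsize(plain) == os.path.getsize(synced) == 200 * len(chunk)
    print(f"buffered: {t_plain:.3f}s  fsync-per-chunk: {t_synced:.3f}s")
```

On most systems the fsync-per-chunk variant is dramatically slower, because each sync waits for the disk rather than letting the OS batch writes.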

If it takes 10 minutes using curl or wget, I don’t think you are going to be able to make it take less time using a Python program.

@AndersMunch I just tried it. Downloading 6 MB of data took around 4 minutes.

with requests.get(url, headers=headers, params=params, stream=True) as response:
    response.raise_for_status()
    data = b''
    with open(pbix_fileName, 'wb') as report_file:
        for chunk in response.iter_content(chunk_size=1024 * 8 * 10):  # 80 KiB chunks
            if chunk:
                report_file.write(chunk)
                # report_file.flush()
                os.fsync(report_file.fileno())
        return "success"

@kpfleming I’m OK with 10 minutes, but with Python the download takes 5 hours, while PowerShell downloads it in 10 minutes.

OK, thanks.

As noted above, the use of ‘flush’ and ‘fsync’ will drastically reduce performance, and they should not be necessary. Beyond that, if you still can’t obtain the performance you want, you’ll need to consider one of the ‘async’ HTTP client libraries like requests-threads, httpx, etc. Using those will allow you to overlap reading from the HTTP server and writing to the local file.
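That overlap can be sketched without any network at all: a producer task (standing in for the HTTP stream) feeds chunks through an asyncio.Queue while a consumer writes them. This is an illustrative pattern with a simulated chunk source, not any particular library’s API:

```python
import asyncio
import os
import tempfile

CHUNK = b"\0" * (64 * 1024)   # simulated 64 KiB network chunk
N_CHUNKS = 50

async def produce(queue):
    # Stands in for `async for chunk in response.aiter_bytes()`.
    for _ in range(N_CHUNKS):
        await queue.put(CHUNK)
        await asyncio.sleep(0)        # yield control, as a real socket read would
    await queue.put(None)             # sentinel: stream finished

async def consume(queue, path):
    loop = asyncio.get_running_loop()
    with open(path, "wb") as f:
        while (chunk := await queue.get()) is not None:
            # Run the blocking file write in a thread so the event loop
            # stays free to keep pulling data off the "network".
            await loop.run_in_executor(None, f.write, chunk)

async def main(path):
    queue = asyncio.Queue(maxsize=8)  # bounded queue applies backpressure
    await asyncio.gather(produce(queue), consume(queue, path))

with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "out.bin")
    asyncio.run(main(out))
    size = os.path.getsize(out)
    print(size)
```

The bounded queue is the key design choice: if the disk falls behind, the producer blocks on `put()` instead of buffering the whole download in memory.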

@kpfleming Thank you for your quick response. Yes, I have tried removing flush and fsync, but it is still slow: downloading 6 MB takes around 4 minutes.

It would be a great help if you could share sample code.

Thanks in advance.

@kpfleming I just tried

import asyncio
import httpx
from pathlib import Path

pbix_fileName = Path(f'{save_path}/(unknown).pbix')

async def main():
    headers = {'Authorization': f'Bearer {access_token}'}
    with open(pbix_fileName, 'wb') as f:
        async with httpx.AsyncClient() as client:
            async with client.stream('GET', url, headers=headers, timeout=None) as r:
                async for chunk in r.aiter_bytes():
                    f.write(chunk)

asyncio.run(main())

but downloading 6 MB still takes around 4 minutes.

The code is fine. I can download a random file at 10MB/s using basically the same code. Whatever is slowing you down is specific to your computer or the server you are downloading from.

@AndersMunch Which code did you use?

@AndersMunch But if the problem were the server, then I should not be able to download with PowerShell either, because I am using the same machine and the same server to download.

Thanks

The server could be rate limiting certain clients. Who’s to say? It’s your mystery to solve.

Or you could just use subprocess to get curl or powershell or whatever to download the file for you.
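A minimal sketch of that fallback, assuming curl is on the PATH; the URL, token, and output name here are placeholders, and the example only builds the command rather than running it:

```python
import subprocess

def download_with_curl(url, token, dest, dry_run=False):
    """Hand the transfer off to curl (assumed to be on PATH).

    url/token/dest are placeholders here, not real credentials.
    """
    cmd = [
        "curl", "--fail", "--location", "--silent", "--show-error",
        "-H", f"Authorization: Bearer {token}",
        "-o", dest,
        url,
    ]
    if dry_run:                       # let callers inspect the command instead
        return cmd
    subprocess.run(cmd, check=True)   # raises CalledProcessError on failure
    return cmd

# Dry run only: build the command without touching the network.
cmd = download_with_curl("https://example.invalid/export", "TOKEN", "report.pbix",
                         dry_run=True)
print(" ".join(cmd))
```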

Not necessarily. The server could be saying “I recognise powershell, I will not limit the rate”, but with your script it goes “I have no idea what this bot is, so I will throttle it”.

Can we ask what URL you are downloading?

You have this line:

for chunk in response.iter_content(chunk_size=1024 * 8):  # 1MB chunks 

Isn’t 1024*8 an 8KB chunk, not 1MB?

@steven.daprano
I am exporting a Power BI .pbix report from the REST API

GET https://api.powerbi.com/v1.0/myorg/groups/{groupId}/reports/{reportId}/Export

Yes, it’s 8KB.

@steven.daprano I can download the 6 MB file using the REST API from Python, but it is slow: 6 MB takes around 4 minutes.

Why are you using 8KB chunks? That’s going to be slow.

Try reading it in 1MB chunks (like the comment says) and see if that improves performance.

It looks like it’s not the server throttling your code, but your code throttling the server.
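The arithmetic behind that: at 8 KiB per chunk, an 800 MB download makes over a hundred times as many loop iterations (and, in the fsync variant above, that many disk syncs) as it would at 1 MiB:

```python
total = 800 * 1024 * 1024   # roughly the 800 MB report

small = 1024 * 8            # 8 KiB  -- what the code actually used
large = 1024 ** 2           # 1 MiB  -- what the comment claimed

chunks_small = total // small
chunks_large = total // large
print(chunks_small, chunks_large)   # 102400 iterations vs 800
```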