How to export 800 MB of data with the requests library and write it to a file in a minute or so?

I am exporting a large dataset from the ‘/export’ endpoint of Power BI. Downloading and writing a roughly 800 MB .pbix file is taking more than 5 hours. How can we reduce that to minutes?

import os
import requests

with requests.get(url, headers=headers, params=params, stream=True) as response:
    response.raise_for_status()
    with open(pbix_fileName, 'wb') as report_file:
        for chunk in response.iter_content(chunk_size=1024 ** 2):  # 1 MiB chunks
            if chunk:
                report_file.write(chunk)
                report_file.flush()
                os.fsync(report_file.fileno())
        print("success!")

How rapidly can you do the same thing with curl or wget on the same machine? That’s the lower bound for how quickly this operation can be done, so you need to measure that first.

@kpfleming Actually, I have tried the same thing using PowerShell. Downloading and writing to a file takes around 10 minutes. Yes, on the same machine.

Why flush+fsync? With an fsync after every chunk, terrible performance is to be expected.
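To see the cost, here is a small benchmark sketch (temp files; the chunk size and count are arbitrary) comparing plain buffered writes against fsync-per-chunk:

```python
import os
import tempfile
import time

def write_chunks(path, n_chunks, chunk, fsync_each=False):
    """Write n_chunks copies of chunk, optionally fsync'ing after each one."""
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            f.write(chunk)
            if fsync_each:
                f.flush()
                os.fsync(f.fileno())   # force the chunk to disk before continuing

chunk = b"\0" * (64 * 1024)            # 64 KiB per chunk (sizes are arbitrary)
with tempfile.TemporaryDirectory() as d:
    plain = os.path.join(d, "plain.bin")
    synced = os.path.join(d, "synced.bin")

    t0 = time.perf_counter()
    write_chunks(plain, 200, chunk)
    t_plain = time.perf_counter() - t0

    t0 = time.perf_counter()
    write_chunks(synced, 200, chunk, fsync_each=True)
    t_synced = time.perf_counter() - t0

    # Both files hold the same 12.5 MiB; only the sync policy differs.
    assert os.path.getsize(plain) == os.path.getsize(synced) == 200 * len(chunk)
    print(f"buffered: {t_plain:.3f}s  fsync-per-chunk: {t_synced:.3f}s")
```

On most systems the fsync-per-chunk variant is dramatically slower, because each sync waits for the disk rather than letting the OS batch writes.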

If it takes 10 minutes using curl or wget, I don’t think you are going to be able to make it take less time using a Python program.

@AndersMunch I just tried it. Downloading 6 MB of data took around 4 minutes.

with requests.get(url, headers=headers, params=params, stream=True) as response:
    response.raise_for_status()
    data = b''
    with open(pbix_fileName, 'wb') as report_file:
        for chunk in response.iter_content(chunk_size=1024 * 8 * 10):  # 80 KiB chunks
            if chunk:
                report_file.write(chunk)
                # report_file.flush()
                os.fsync(report_file.fileno())
        return "success"

@kpfleming I’m OK with 10 minutes, but with Python the download takes 5 hours, while PowerShell downloads it in 10 minutes.

OK, thanks.

As noted above, the use of ‘flush’ and ‘fsync’ will drastically reduce performance, and they should not be necessary. Beyond that, if you still can’t obtain the performance you want, you’ll need to consider one of the ‘async’ HTTP client libraries like requests-threads, httpx, etc. Using those will allow you to overlap reading from the HTTP server and writing to the local file.
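That overlap can be sketched without any network at all: a producer task (standing in for the HTTP stream) feeds chunks through an asyncio.Queue while a consumer writes them. This is an illustrative pattern with a simulated chunk source, not any particular library’s API:

```python
import asyncio
import os
import tempfile

CHUNK = b"\0" * (64 * 1024)   # simulated 64 KiB network chunk
N_CHUNKS = 50

async def produce(queue):
    # Stands in for `async for chunk in response.aiter_bytes()`.
    for _ in range(N_CHUNKS):
        await queue.put(CHUNK)
        await asyncio.sleep(0)        # yield control, as a real socket read would
    await queue.put(None)             # sentinel: stream finished

async def consume(queue, path):
    loop = asyncio.get_running_loop()
    with open(path, "wb") as f:
        while (chunk := await queue.get()) is not None:
            # Run the blocking file write in a thread so the event loop
            # stays free to keep pulling data off the "network".
            await loop.run_in_executor(None, f.write, chunk)

async def main(path):
    queue = asyncio.Queue(maxsize=8)  # bounded queue applies backpressure
    await asyncio.gather(produce(queue), consume(queue, path))

with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "out.bin")
    asyncio.run(main(out))
    size = os.path.getsize(out)
    print(size)
```

The bounded queue is the key design choice: if the disk falls behind, the producer blocks on `put()` instead of buffering the whole download in memory.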

@kpfleming Thank you for your quick response. Yes, I have tried removing flush and fsync, but it is still slow: downloading 6 MB takes around 4 minutes.

It would be a great help if you could share sample code.

Thanks in advance.

@kpfleming I just tried

import asyncio
import httpx
from pathlib import Path

pbix_fileName = Path(f'{save_path}/(unknown).pbix')

async def main():
    headers = {'Authorization': f'Bearer {access_token}'}
    with open(pbix_fileName, 'wb') as f:
        async with httpx.AsyncClient() as client:
            async with client.stream('GET', url, headers=headers, timeout=None) as r:
                async for chunk in r.aiter_bytes():
                    f.write(chunk)

asyncio.run(main())

but downloading 6 MB still takes around 4 minutes.

The code is fine. I can download a random file at 10MB/s using basically the same code. Whatever is slowing you down is specific to your computer or the server you are downloading from.

@AndersMunch Which code did you use?

@AndersMunch But if the problem were the server, then I should not be able to download with PowerShell either, because I am using the same machine and the same server to download.

Thanks

The server could be rate limiting certain clients. Who’s to say? It’s your mystery to solve.

Or you could just use subprocess to get curl or powershell or whatever to download the file for you.
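A minimal sketch of that fallback, assuming curl is on the PATH; the URL, token, and output name here are placeholders, and the example only builds the command rather than running it:

```python
import subprocess

def download_with_curl(url, token, dest, dry_run=False):
    """Hand the transfer off to curl (assumed to be on PATH).

    url/token/dest are placeholders here, not real credentials.
    """
    cmd = [
        "curl", "--fail", "--location", "--silent", "--show-error",
        "-H", f"Authorization: Bearer {token}",
        "-o", dest,
        url,
    ]
    if dry_run:                       # let callers inspect the command instead
        return cmd
    subprocess.run(cmd, check=True)   # raises CalledProcessError on failure
    return cmd

# Dry run only: build the command without touching the network.
cmd = download_with_curl("https://example.invalid/export", "TOKEN", "report.pbix",
                         dry_run=True)
print(" ".join(cmd))
```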

Not necessarily. The server could be saying “I recognise powershell, I will not limit the rate”, but with your script it goes “I have no idea what this bot is, so I will throttle it”.

Can we ask what URL you are downloading?

You have this line:

for chunk in response.iter_content(chunk_size=1024 * 8):  # 1MB chunks 

Isn’t 1024*8 an 8KB chunk, not 1MB?

@steven.daprano
I am exporting a Power BI .pbix report from the REST API

GET https://api.powerbi.com/v1.0/myorg/groups/{groupId}/reports/{reportId}/Export

Yes, it’s 8KB.

@steven.daprano I can download the 6 MB file using the REST API from Python, but it is slow: 6 MB takes around 4 minutes.

Why are you using 8KB chunks? That’s going to be slow.

Try reading it in 1MB chunks (like the comment says) and see if that improves performance.

It looks like it’s not the server throttling your code, but your code throttling the server.
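The arithmetic behind that: at 8 KiB per chunk, an 800 MB download makes over a hundred times as many loop iterations (and, in the fsync variant above, that many disk syncs) as it would at 1 MiB:

```python
total = 800 * 1024 * 1024   # roughly the 800 MB report

small = 1024 * 8            # 8 KiB  -- what the code actually used
large = 1024 ** 2           # 1 MiB  -- what the comment claimed

chunks_small = total // small
chunks_large = total // large
print(chunks_small, chunks_large)   # 102400 iterations vs 800
```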