Issue with pagination of REST API in Python

Hello,

I have written code to paginate a REST API and write the results to a folder path. The code is not throwing any errors, but it runs for hours without writing any output. After running that long, the notebook stops with an internal error.

Any help on what I am doing wrong, or how the code can be improved, is much appreciated.

Below is my code:

getURL = 'https://api.xxx.com/v3/direct-access/abc'
baseURL = 'https://api.xxx.com/v3/direct-access'
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Bearer " + str(token)
}
results = []

response = requests.get(getURL, headers=headers)
r = response.json()

for i in r:
    results.append(i)

while response.links.get('next'):
    response = requests.get(baseURL + response.links['next']['url'], headers=headers)
    r1 = response.json()
    for i in response:
        results.append(i)

##assert len(results) == requests.get(getURL[:-6]).json()
return results
rdd = spark.sparkContext.parallelize(results)
print(rdd)
df = spark.read.option('multiline', 'true').json(rdd)
df.repartition(1).write.json(stagingpath, mode="overwrite")

Add print statements to your code so that you can find out what it is doing.

I expect that the reason the program aborts is that you run out of memory.
You are appending to results in a loop that seems to never terminate.

How does the program know that it has loaded the last URL?
How many unique URLs are you expecting to get?

For example, what is the next_link that you use to do the requests.get() in the loop?
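Something like this would show that (a rough sketch only, reusing your getURL, baseURL, and headers, and assuming the API puts relative next-page URLs in the Link header):

import requests

response = requests.get(getURL, headers=headers)
page = 0
while True:
    next_link = response.links.get('next')
    print("page", page, "-> next link:", next_link)  # watch whether this ever becomes None
    if not next_link:
        break
    response = requests.get(baseURL + next_link['url'], headers=headers)
    page += 1

If the printed link never changes, the loop is requesting the same page forever, which would explain both the run time and the memory use.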


To add: your code shows an error message at the end. Does that mean that you copied the code after running it in an interactive IPython session, and if so, could that be an indication of what goes wrong? (No idea what code -9 means.)

I have fixed the errors and modified the code, and it is working now. But the issue I am facing is with the return results below: it runs for hours and doesn't give any results. Any idea how I can print the results to a path?

return results
print(results)

rdd = spark.sparkContext.parallelize(results)
print(rdd)
df = spark.read.option('multiline', 'true').json(rdd)
df.coalesce(1).write.json(stagingpath, mode='overwrite')

This is what the API documentation says about terminating the code when it reaches the end of the pages. How do I incorporate this?

  • When paginating through your request if an empty set is returned, this indicates the end of your
    requested dataset. When you receive the first empty set, your process has successfully completed,
    and you may exit your program

To find out what your code is doing, you need to debug it.
Again, I suggest you put prints into the code to find out what it is doing.

Maybe all you need to do is print(i) instead of appending it to results?
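Based on the documentation snippet you quoted, the pagination loop could stop on the first empty page. Here is a rough sketch (not tested against your API; it reuses your getURL, baseURL, and headers, and assumes each page's JSON body is a list of records):

import requests

def fetch_all_pages(getURL, baseURL, headers):
    results = []
    url = getURL
    while True:
        response = requests.get(url, headers=headers)
        page = response.json()
        if not page:              # empty set -> end of the requested dataset
            break
        results.extend(page)      # iterate the parsed JSON, not the Response object
        next_link = response.links.get('next')
        if not next_link:         # no next link is also a natural stopping point
            break
        url = baseURL + next_link['url']
    return results

results = fetch_all_pages(getURL, baseURL, headers)
print(len(results), "records fetched")

Also note that return exits a function immediately, so in your snippet the print and the Spark write that come after return results can never run. Do the write after the function has returned, using whatever it returned.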