Download several files from GitHub

mfernandes · June 7, 2021, 9:20pm

Dear Python community,
I am brand new to Python.
I would like to download all the files available at this GitHub link: professions-foi-candidats/documents/LG17/1 at master · regardscitoyens/professions-foi-candidats · GitHub.
Is it possible to download all of the files with Python and save them on my laptop?
From what I understand, to download a single file I should write:

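import urllib.request  # in Python 3, urlretrieve lives in urllib.request
# Use the "raw" URL; the "blob" URL returns GitHub's HTML page, not the PDF itself.
urllib.request.urlretrieve("https://github.com/regardscitoyens/professions-foi-candidats/raw/master/documents/LG17/1/LG17-1-1-BLATRIX-CONTAT-2-tour1-profession_foi.pdf", "mydirectory/myfile.pdf")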

The issue in this case is that each file has its own link.

cameron · June 7, 2021, 11:12pm

You could use Beautiful Soup to fetch the page with the listing, grab all
the URLs from the contents, then fetch each as above.

The bs4 library:

https://pypi.org/project/beautifulsoup4/
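
The "fetch each as above" step might look like the sketch below. It assumes you already have a list of file-page URLs. The names to_raw_url, local_name, download_all and the "downloads" directory are made up for illustration, and the sketch assumes GitHub serves the file itself when "/blob/" in the URL is replaced by "/raw/".

```python
import os
import urllib.request
from urllib.parse import urlsplit


def to_raw_url(page_url):
    """Turn a GitHub file-page URL (.../blob/...) into a direct-download URL.

    Assumption: replacing the first "/blob/" with "/raw/" yields the file itself.
    """
    return page_url.replace("/blob/", "/raw/", 1)


def local_name(url):
    """Use the last path component of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download_all(page_urls, target_dir="downloads"):
    """Download every URL in page_urls into target_dir (needs network access)."""
    os.makedirs(target_dir, exist_ok=True)
    for page_url in page_urls:
        raw_url = to_raw_url(page_url)
        destination = os.path.join(target_dir, local_name(raw_url))
        urllib.request.urlretrieve(raw_url, destination)
        print("saved", destination)
```

Calling download_all(list_of_urls) would then save each file under its original name, so the only remaining work is building the list of URLs from the listing page.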

Cheers,
Cameron Simpson cs@cskk.id.au

Mariatta · June 7, 2021, 11:33pm

Is there a reason you’re not doing a git clone to get the files from GitHub to your computer?

https://github.com/regardscitoyens/professions-foi-candidats.git

erlendaasland · June 8, 2021, 10:31am

Sounds like a good case for trying out git sparse checkout.
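
Since the question is about Python, here is a rough sketch of driving a sparse checkout from a script with subprocess. It assumes git 2.25 or newer is installed and on the PATH. The helper names and the destination directory are made up for illustration; the sub-directory documents/LG17/1 comes from the link in the original question.

```python
import subprocess

REPO_URL = "https://github.com/regardscitoyens/professions-foi-candidats.git"


def sparse_checkout_commands(repo_url, dest, subdir):
    """Build the git commands for a sparse checkout of one sub-directory."""
    return [
        # Partial clone: skip file contents for now, check out only top-level files.
        ["git", "clone", "--filter=blob:none", "--sparse", repo_url, dest],
        # Restrict the working tree to the directory we care about.
        ["git", "-C", dest, "sparse-checkout", "set", subdir],
    ]


def run_sparse_checkout(repo_url, dest, subdir):
    """Run the commands in order; raises CalledProcessError if git fails."""
    for command in sparse_checkout_commands(repo_url, dest, subdir):
        subprocess.run(command, check=True)


# Example call (needs network access and git):
# run_sparse_checkout(REPO_URL, "professions-foi-candidats", "documents/LG17/1")
```

The same two git commands can of course be typed directly in a terminal; wrapping them in Python only helps if the download is part of a larger script.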

mfernandes · June 8, 2021, 11:00am

Thank you for your comments.
I tried to follow Cameron Simpson's advice on using Beautiful Soup. I coded:
html_doc = (their HTML, copied and pasted here; it was long)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

and I received the error:
File "", line 104

^
IndentationError: unindent does not match any outer indentation level

So I think I will try to understand how git sparse checkout works.
Thank you!

cameron · June 8, 2021, 11:18pm

That looks to me as though you pasted HTML straight into your Python
code. And Python then tries to parse the HTML as Python. Badness.

The cleanest way to try this is to put the HTML into a separate file, e.g.
named “foo.html”. Then:

from bs4 import BeautifulSoup
with open("foo.html") as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

which keeps the HTML code separate from the Python code.
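
Once the HTML parses, grabbing the URLs could look something like this sketch. The pdf_links helper and the sample HTML are invented for illustration, and it is an assumption that the saved GitHub page lists the files as plain <a href="..."> tags ending in ".pdf" (the page may instead be built by JavaScript, in which case the links will not be in the saved HTML).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def pdf_links(html_doc, base="https://github.com"):
    """Return absolute URLs for every <a href> in html_doc that ends in .pdf."""
    soup = BeautifulSoup(html_doc, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.lower().endswith(".pdf"):
            # Relative links such as /owner/repo/... become full URLs.
            links.append(urljoin(base, href))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(links))


# Made-up sample standing in for the contents of foo.html.
sample = """
<a href="/owner/repo/blob/master/docs/a.pdf">a.pdf</a>
<a href="/owner/repo/tree/master/docs">docs</a>
<a href="/owner/repo/blob/master/docs/b.PDF">b.PDF</a>
<a href="/owner/repo/blob/master/docs/a.pdf">a.pdf</a>
"""
print(pdf_links(sample))
```

With the real page you would pass html_doc from foo.html instead of the sample, then download each URL in the returned list.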

Cheers,
Cameron Simpson cs@cskk.id.au