Saving the contents of a web directory to a local folder

I’m trying to save all files (HTML and JPG) from a web directory to a local folder.

import bs4
import requests

username = 'test1'
password = 'test2'

url = "https://test/docs/28/"
z = requests.get(url, auth=(username, password))
data = bs4.BeautifulSoup(z.text, "html.parser")
for l in data.find_all("a"):
    r = requests.get(url + l["href"], auth=(username, password))
    print(r.status_code)
    print(data)

    with open(l, 'wb') as f:
        f.write(data)

Currently this code does not work. I’ve tried with urllib but can’t get authentication to work. I’m a bit stuck; any pointers on how to achieve this would be very much appreciated.

When I can’t get something to work, I try another method. Take a look at these videos from the past year: https://www.youtube.com/results?search_query=%2BPython+save+web+page+and+images&sp=EgIIBQ%253D%253D

I actually have not done this before, sorry.

  1. What doesn’t work?
  2. Does it run but you don’t get any saved output and no errors?
  3. Did you get an error? What line is the error on and what is the error?

I have tried everything I can think of. wget.download fails because I can’t get authentication to work with it. All I am trying to do is download files from a web directory with authentication.

I can scrape images from the site page, but I need to get the files in the directory.

Have literally tried everything now, hence the post.

The script above does not save the files; it only produces a list of the files on the server. I need to complete it with a save method.

This doesn’t look like a valid URL with a valid domain. Are you trying to get this to work on your local web server on the same machine that is running Python? That’s not a valid domain for a local web server either.

  1. Did you try a valid URL like https://gutenberg.org/?
  2. Also why do you have to use authentication? There must be a specific reason for that.

Be aware that some websites may block you in their robots.txt file. I don’t know if Gutenberg blocks crawlers or not.
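
If you want to check, the standard library can read a site’s robots.txt for you. A minimal sketch, using the Gutenberg address above only as an example URL:

from urllib.robotparser import RobotFileParser

# Read the site's robots.txt and ask whether a generic crawler ("*")
# is allowed to fetch a given path
rp = RobotFileParser("https://gutenberg.org/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://gutenberg.org/ebooks/"))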

Does anyone have a test URL that won’t block crawlers for sure?

I should add that simple crawlers cannot download pages from dynamic websites like Reddit, where content is drawn with JavaScript. They only get static HTML pages.

I cannot put the real username and password in the example. All I’m looking for is a way to save the files in a web directory; I’ve now tried every example I can find on the web :frowning:

If the page uses form login, which is very likely, you will need to use a library like selenium to drive the page. Start here: selenium · PyPI
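
A very rough sketch of that, with the caveat that the login URL and the element names ("username", "password", "submit") are hypothetical and would have to be read from the real page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()                 # or webdriver.Chrome()
driver.get("https://test/login")             # hypothetical login page

# Fill in and submit the login form; the element names are placeholders
driver.find_element(By.NAME, "username").send_keys("test1")
driver.find_element(By.NAME, "password").send_keys("test2")
driver.find_element(By.NAME, "submit").click()

# The driver now carries the session cookies, so protected pages can be loaded
driver.get("https://test/docs/28/")
print(driver.page_source)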

I can already log in, i.e. use authentication; what I can’t do is download the files on the server.

What error do you get?

If you want to download files from the server using Python, then Python has to handle logging into the page as well, which is what @barry-scott alluded to.

If you log into the page as a person and then run the Python program to grab the page, the Python program acts as a separate user: it has not logged in to the page, nor does it know that you have logged in.
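
For HTTP basic auth, which seems to be what the script in the first post uses, the program can do that itself. A minimal sketch with a requests.Session carrying the credentials; the commented-out form-login line is only a hypothetical alternative:

import requests

session = requests.Session()
session.auth = ("test1", "test2")   # credentials from the original example

# If the site used a login form instead, the session could POST it once and
# keep the resulting cookies (URL and field names here are hypothetical):
# session.post("https://test/login", data={"username": "test1", "password": "test2"})

r = session.get("https://test/docs/28/")
print(r.status_code)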

No error; I can connect and get the 200 response. All I am looking for is the code to download the files in the web directory.

Nothing works, as previously mentioned.

I can connect; this is not the issue. I simply need the code to download the files from a web server.

Then we’re in luck! This is a specific error that we can search for. Here’s a search for you. Try some of the solutions they offer and let us know what didn’t work and the error you got for each case.

I’m not familiar with Python wget myself. Do you have a webpage for it? I found a bunch of wgets on pypi.org. There’s no way for me to know which one you are using unless you send me the link to the module or docs page.

Also, if a third-party module is too old it may not work with newer Pythons, which is a problem I have with schedule not working with Python 3.12.

EDIT: This wget says nothing about supporting authentication and it hasn’t been updated since 2015. wget · PyPI

EDIT: And here we have a search for +python how to download a web page with authentication.

I can get authentication to work with requests. When I attempt to download all files in the web directory, I can’t get this part to work. I’ve searched and tried every example Google came up with.

I can even scrape the web page, but I need to download the image files in the web directory.

Surely someone could point me in the right direction?

I’ve never seen a way to download all the files on a web server that are in a given directory; that would seem to be a big security issue if it were allowed. Normally a person gets one HTML page and then downloads all the pages listed on that first HTML page, in a recursive fashion.
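
As a rough sketch of that approach, assuming the index page at the URL from the first post simply lists relative links such as 1.htm and 11.jpg, something like this would fetch the index, resolve each link and save the response bytes locally (the folder name downloads is just an example):

import os
from urllib.parse import urljoin, urlparse

import bs4
import requests

url = "https://test/docs/28/"
auth = ("test1", "test2")

# Fetch the index page and collect every link it contains
index = requests.get(url, auth=auth)
soup = bs4.BeautifulSoup(index.text, "html.parser")

os.makedirs("downloads", exist_ok=True)

for link in soup.find_all("a", href=True):
    file_url = urljoin(url, link["href"])            # resolve relative hrefs
    name = os.path.basename(urlparse(file_url).path)
    if not name:                                     # skip links that do not name a file
        continue
    r = requests.get(file_url, auth=auth)
    print(r.status_code, file_url)
    with open(os.path.join("downloads", name), "wb") as f:
        f.write(r.content)                           # bytes, written to disk as-is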

Sorry I can’t help more.

Do you have a link to the wget Python module you are using? Without that I can’t even write a test program to help.

Or are you using the wget application? Those are 2 different things.

There is no such thing in the HTTP world as a “directory”; it’s not a file system.
Do you mean all the URLs that a web page refers to?

Yes, it appears I’ve got my terminology wrong.

So as an example this is the link …

https://test/docs/28/

And within this link there are numerous files…

1.htm
2.htm
3.htm
11.jpg
22.jpg
33.jpg

Terminology: they are not files, they are URLs referring to resources.

That’s local to your network, nothing I can access.

If this is a task that is done rarely, you could use your browser’s ability to “save page”.

Or you could use a tool like wget from the command line to do the work if it’s better automated.

Or you can learn about the details of HTML markup and all the ways resources are referred to from web pages.
But beware that this will not work for dynamic web pages that use JavaScript to draw the page and load its resources.
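
As an illustration of that last point, and only as a sketch of the static-HTML case, different tags point at their resources through different attributes; something like this would list them from one page (URL and credentials taken from the earlier example):

from urllib.parse import urljoin

import bs4
import requests

url = "https://test/docs/28/"
page = requests.get(url, auth=("test1", "test2"))
soup = bs4.BeautifulSoup(page.text, "html.parser")

# Each tag type references its resource through a different attribute;
# only what is present in the static HTML will show up here.
references = {"a": "href", "img": "src", "script": "src", "link": "href"}

for tag, attr in references.items():
    for element in soup.find_all(tag):
        value = element.get(attr)
        if value:
            print(tag, urljoin(url, value))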