Get header (http or https)

I am developing an asynchronous HTTP web server, written in pure Python, to serve HTML pages and other static files. I’m using the socketserver module, among others, to structure the server. At this stage, I need to obtain the scheme (protocol: http or https) that is usually sent in the request header and appears in the URL on the client side. I know that WSGI establishes an interface for this kind of interaction, but I don’t want to use the wsgiref module. I want to be able to fully understand and develop a solution, so my question comes down to this: how do I obtain the scheme from the request headers, or otherwise capture the protocol (http or https) of the received URL?

I want to be able to fully understand and develop a
solution, so my question comes down to exploring how to obtain the header
scheme of the request or how to capture the protocol (http or https) of the
received URL?

HTTP will likely be served on a different TCP port than HTTPS. Have you got
your certificate set up already to encrypt the communication? If not, can you
share how far you have got? The content of the client request will be the same
in both cases, apart from the transport security that wraps it. Have you tried
making requests to an HTTPS server using tools like curl and inspecting the
network exchange?


Observe the HTTP request content after the HTTPS handshake with google.com
using curl. It’s just a GET / HTTP/2 followed by the Host: header and
User-Agent: identification:

* Server certificate:
*  start date: Mar 16 19:35:00 2021 GMT
*  expire date: Jun  8 19:34:59 2021 GMT
*  subjectAltName: host "www.google.com" matched cert's "www.google.com"
*  issuer: C=US; O=Google Trust Services; CN=GTS CA 1O1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x5631854cdfb0)
} [5 bytes data]
> GET / HTTP/2
> Host: www.google.com
> User-Agent: curl/7.64.0
> Accept: */*

You might also find the TLS handshake and cipher selection leading up to that
interesting if you log the traffic using curl -v, tshark, or Wireshark.

I am grateful for the initiative, but I have not yet reached the issue of dealing with the schemes (http and https). For context, I’m using the BaseHTTPRequestHandler class of the http.server module, but I can’t find a way to obtain the complete request URL (http://domain.com …). With the full URL, it would be easy to split it into its components using the urllib module and treat each one individually. I tried to share the code here, but it is long and a little laborious to follow, so please follow the link to the repository where the project is located.

I am grateful for the initiative, but I have not yet reached the issue of
dealing with the schemes (http and https).

The resolution of server name to address, and of port number to listening
program, happens before the request ever reaches your handler.

To understand how these parts play together, I recommend experimenting with
this test scenario:

Map a pretend web address to your own machine (for example, with an entry in
/etc/hosts):

127.0.0.8 myhost.mydomain.com

and either run a local Python HTTP server on port 8000:

python3 -m http.server --bind 127.0.0.8
Serving HTTP on 127.0.0.8 port 8000 (http://127.0.0.8:8000/) ...

or a simple socket data dumper on the same port:

nc -l 8000

then make a traced request to your listening program:

curl -v http://myhost.mydomain.com:8000

and notice that all the server sees is the HTTP request’s content:

GET / HTTP/1.1
Host: myhost.mydomain.com:8000
User-Agent: curl/7.64.0
Accept: */*

There is no notion of “scheme” at that point.
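
Since the request itself carries no scheme, one way for the server to recover a full URL is to combine what it already knows about the connection (whether its socket is wrapped in TLS) with the Host header and the request path. Below is a minimal sketch of that idea, assuming http.server's BaseHTTPRequestHandler as mentioned earlier in the thread. The names reconstruct_url, SchemeAwareHandler and make_server are made up for illustration, and the sketch assumes the client sends an origin-form path such as "/" rather than an absolute URL.

```python
import ssl
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


def reconstruct_url(is_tls, host_header, path):
    """Build an absolute URL from facts the server knows about the connection.

    Hypothetical helper: the scheme comes from the transport, not the request.
    """
    scheme = "https" if is_tls else "http"
    return f"{scheme}://{host_header}{path}"


class SchemeAwareHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The scheme is not in the request, so infer it from the socket type:
        # a TLS-wrapped listening socket hands out ssl.SSLSocket connections.
        is_tls = isinstance(self.connection, ssl.SSLSocket)
        # Fall back to the bound address if the client sent no Host header.
        host = self.headers.get("Host") or "%s:%d" % self.server.server_address[:2]
        url = reconstruct_url(is_tls, host, self.path)
        parts = urlsplit(url)  # urllib can now split the URL into components
        body = (
            f"{url}\n"
            f"scheme={parts.scheme} host={parts.hostname} path={parts.path}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8000):
    """Create (but do not start) a plain-HTTP server using the handler above."""
    return HTTPServer((host, port), SchemeAwareHandler)
```

Calling make_server().serve_forever() and then running curl -v http://127.0.0.1:8000/some/path should echo the rebuilt URL. If the server's socket were wrapped with an ssl.SSLContext before serving, the same handler would report https instead.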

You can look at the actual network traffic like so:

sudo tshark -f 'tcp port 8000' -i any

or, for packet-level information, you can use tcpdump listening to the
loopback interface (i.e. -i lo). The manual page tells you how to make it
dump to stdout (i.e. -w -) and how to dump packets as they come in with no
buffering (i.e. -U).

You can display your captures as ASCII text or in hexadecimal form for
convenience.

In summary, some layering (think: layer 3, 4, 7 of the OSI model):

  • DNS name to IP address
  • Port to listening program bound to that port
  • Plain HTTP request/response traffic
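
The three layers above can also be walked through in Python with nothing but the socket module. This is only a rough sketch: capture_request is a hypothetical name, it uses "localhost" and an OS-chosen free port rather than the myhost.mydomain.com:8000 setup from the test scenario, and it handles exactly one request.

```python
import socket
import threading


def capture_request(hostname="localhost", path="/"):
    """Send one plain HTTP request over loopback and return the raw bytes
    that the listening side received."""
    # DNS name to IP address.
    ip = socket.gethostbyname(hostname)

    # Port to listening program: bind a listener (port 0 lets the OS pick).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((ip, 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    received = []

    def serve_once():
        conn, _ = listener.accept()
        with conn:
            data = b""
            # Read until the blank line that ends the request headers.
            while b"\r\n\r\n" not in data:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            received.append(data)
            conn.sendall(b"HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n")

    worker = threading.Thread(target=serve_once)
    worker.start()

    # Plain HTTP request/response traffic.
    request = f"GET {path} HTTP/1.1\r\nHost: {hostname}:{port}\r\n\r\n"
    with socket.create_connection((ip, port)) as client:
        client.sendall(request.encode("ascii"))
        client.recv(1024)  # wait for the listener's reply

    worker.join()
    listener.close()
    return received[0]
```

Printing capture_request().decode() shows a request line and a Host header, much like the nc -l 8000 dump, with no scheme anywhere in the bytes.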

I am grateful for the initiative, but I have not yet reached the issue of dealing with the schemes (http and https).

You can also think of it this way: HTTPS security just wraps what is otherwise
a normal HTTP-based dialogue.

Your DNS to IP resolution conveniently specifies which target computer to
locate in a routed network, using a memorable name.

The scheme in the URL (http or https) both tells your browser which
application protocol to use to converse with the server and identifies the
default port for the conversation to happen on.

On the server, a program bound to the IP address you made your request to
listens on the corresponding port (80 for HTTP, 443 for HTTPS), so it will get
the incoming web request (and not some other type of traffic, like an incoming
e-mail, which would arrive on another port).
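
To make the scheme-to-port relationship concrete, here is a small sketch of what a client does with a URL before it connects. The function name connection_target and the DEFAULT_PORTS table are assumptions made for this example; only urllib.parse.urlsplit is standard library.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes discussed in this thread.
DEFAULT_PORTS = {"http": 80, "https": 443}


def connection_target(url):
    """Return (scheme, hostname, port) that a client would connect to."""
    parts = urlsplit(url)
    # An explicit port in the URL wins; otherwise the scheme implies one.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.scheme, parts.hostname, port
```

For example, connection_target("http://myhost.mydomain.com:8000") keeps the explicit port 8000, while a URL such as https://www.google.com/ falls back to 443. None of this is transmitted as a "scheme" field in the request; the server only learns it from the port and transport the request arrived on.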

For more info on schemes, here is their official assignment:

https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

with details about HTTP: