Error connecting PySpark to PostgreSQL

I’m running some tests in a Python Jupyter notebook in Visual Studio Code, trying to connect PySpark to a local PostgreSQL instance running as a Docker container.

```python
from pyspark.sql import SparkSession

# create a spark instance
spark = SparkSession.builder \
    .appName("ETL_PostgreSQL") \
    .config("spark.master", "local") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.4") \
    .getOrCreate()

# Source PostgreSQL database connection settings
source_url = "jdbc:postgresql://localhost:5430/chinook"
source_properties = {
    "user": "root",
    "password": "root",
    "driver": "org.postgresql.Driver"
}

table_df = spark.read.jdbc(url=source_url, table="genre", properties=source_properties)
table_df.show()

spark.stop()
```

I get the following error on the `spark.read.jdbc` call:

```
…
Py4JJavaError: An error occurred while calling o1946.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
```

Can you help me, please?
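One way to start diagnosing this: `spark.jars.packages` downloads the artifact into the local Ivy cache when the session starts, so you can check whether the driver was ever actually fetched. A minimal sketch, assuming the default cache location `~/.ivy2/jars` (this is Spark's default, but `spark.jars.ivy` can change it):

```python
from pathlib import Path

# spark.jars.packages resolves artifacts into the local Ivy cache;
# ~/.ivy2/jars is the default location (an assumption; spark.jars.ivy
# can override it). If no matching JAR is here, the driver was never
# downloaded, which would explain the ClassNotFoundException.
cache = Path.home() / ".ivy2" / "jars"
for jar in sorted(cache.glob("org.postgresql_postgresql-*.jar")):
    print(jar)
```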

Does the Docker container have Java installed?

Does it have an appropriate PostgreSQL driver for Java installed?
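(To be clear, Java and the JDBC driver need to be available on the machine where Spark runs, not inside the PostgreSQL container.) As a quick check, you could also bypass package resolution entirely: download the driver JAR yourself and point Spark at it with `spark.jars`. A minimal sketch, assuming you saved the JAR from https://jdbc.postgresql.org/ somewhere locally (the path below is a placeholder):

```python
from pyspark.sql import SparkSession

# Point Spark at a locally downloaded PostgreSQL JDBC driver JAR.
# "/path/to/postgresql-42.5.4.jar" is a placeholder for wherever you saved it.
spark = (
    SparkSession.builder
    .appName("ETL_PostgreSQL")
    .config("spark.master", "local")
    .config("spark.jars", "/path/to/postgresql-42.5.4.jar")
    .getOrCreate()
)
```

If this works, the problem is with how `spark.jars.packages` is being resolved, not with your connection settings.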

The PostgreSQL image was pulled directly with `docker run`, so I don’t think Java is installed in the container.
However, I can access the database from SQL clients and other Java applications, such as Pentaho Data Integration (Kettle), after installing the JDBC driver there.
I get the same error when I try to access the SQL Server instance on my local machine using PySpark.
Thank you for the help.
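The fact that it fails the same way for two different drivers suggests the config is not reaching the JVM at all. One thing that often causes this in notebooks: `getOrCreate()` returns any SparkSession that already exists in the kernel, and `spark.jars.packages` is only honored when the JVM first starts, so a session created earlier in the notebook silently ignores it. A sketch of one workaround, assuming you restart the kernel first so that no JVM is already running:

```python
import os
from pyspark.sql import SparkSession

# Set the packages option before any SparkSession (and its JVM) exists;
# in a notebook this means running this cell right after a kernel restart.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.postgresql:postgresql:42.5.4 pyspark-shell"
)

spark = SparkSession.builder \
    .appName("ETL_PostgreSQL") \
    .config("spark.master", "local") \
    .getOrCreate()
```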

Is it resolved? If so, can you please share the solution?

Thank you.

No, it was not resolved.

Thank you!
