Python – PySpark Connect to SQL

Python - PySpark Connect to SQL

Today in this article, we will see how to use PySpark Connect to SQL Database using Python code examples.

Here’s a basic example demonstrating how to read data from SQL Database, perform a transformation, and then write the results back to SQL Database.

We will make use of SQL Database Spark Connector which helps connect to SQL Database and allows us to read and write data between PySpark and SQL Database.

We have below SQL Database table,

Install the PySpark

Please install PySpark using the below command.

Command

pip install pyspark

To connect Apache Spark with a Microsoft SQL Server database using Python, you can use the JDBC (Java Database Connectivity) driver for SQL Server along with PySpark.

Download the SQL Server JDBC Driver:

Download the SQL Server JDBC driver from the Microsoft website.

You can find it here: Microsoft SQL Server JDBC Driver

Add JDBC Driver JAR to Spark:

Place the downloaded JAR file (e.g., mssql-jdbc-<version>.jar) in a location accessible to your Spark cluster, and make note of the path.

Initialize a Spark session

Initialize a Spark session as below using the downloaded jar file. Update the jar file as needed.

spark = SparkSession.builder 
    .appName("TheCodeBuzz") 
    .config("spark.jars", "/path/sqljdbc<version>.jar")  /// this is a example 
    .getOrCreate()

Read data from SQL Server into a DataFrame

Now, you can use the PySpark library to read data from the SQL Server database as below,

df = spark.read 
    .format("jdbc") 
    .option("url", database_url) 
    .option("dbtable", database_prop["table"])) 
    .option("user", database_prop["user"]) 
    .option("password", database_prop["password"]) 
    .option("driver", database_prop["driver"]) 
    .load()

# Show the DataFrame
df.show()

# do transformations or business logic execution on the DataFrame



# Stop the Spark session
spark.stop()

Write the DataFrame back to the SQL Server

Let’s now try to write the data back to the table.

I have below sample data

data = [("John", 28), ("Bob", 22), ("Victor", 25)]
columns = ["name", "age"]



df = spark.createDataFrame(data, columns)



Let’s use the above df object to write the DataFrame back to SQL Server,

df.write 
    .format("jdbc") 
    .option("url", database_url) 
    .option("dbtable", "<table_name>") 
    .option("user", <user>) 
    .option("password", <password>) 
    .option("driver", <driver>) 
    .mode("overwrite")  
    .save()


You can modify the mode as needed as overwrite or append, etc as needed.

Do you have any comments or ideas or any better suggestions to share?

Please sound off your comments below.

Happy Coding !!



Please bookmark this page and share it with your friends. Please Subscribe to the blog to receive notifications on freshly published(2024) best practices and guidelines for software design and development.



Leave a Reply

Your email address will not be published. Required fields are marked *