How to Install PySpark on a Windows Machine

Are you eager to dive into PySpark but puzzled by the installation process on your Windows machine? Fret not! In this step-by-step guide, we’ll walk you through installing PySpark without breaking a sweat.

What is PySpark?

PySpark, the Python API for Apache Spark, empowers users to conduct real-time, large-scale data processing in distributed settings using Python. It offers a PySpark shell for interactive data analysis, blending Python’s user-friendliness with the robust capabilities of Apache Spark. PySpark encompasses Spark’s full suite of functionalities, including Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core, making it accessible to Python-savvy individuals for data processing and analysis of any scale.

PySpark Installation Prerequisites

Before we jump into installing PySpark on Windows, make sure you have the following prerequisites in place:

  1. Python: PySpark requires Python. Ensure you have Python installed on your Windows machine. If not, download and install Python from the official website.
  2. Java: PySpark relies on Java, so ensure you have a Java Development Kit (JDK) installed. You can download the JDK from Oracle’s website.

Once Python and Java are set up on your local system, you are ready for the next step of installing PySpark on Windows.
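As a quick sanity check before continuing, the sketch below reports the running Python version and whether a java executable can be found on the PATH. The function name check_prerequisites is made up for this example, and finding java on the PATH is only a rough proxy for a correctly installed JDK.

```python
import shutil
import sys


def check_prerequisites():
    """Report the running Python version and where (if anywhere) java is found."""
    return {
        # Version of the interpreter running this script, e.g. "3.11.4"
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        # Full path to the java executable, or None if it is not on the PATH
        "java_path": shutil.which("java"),
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

If java_path comes back as None, the JDK is either not installed or its bin folder has not been added to the PATH yet.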

How to Configure PySpark on Windows

Now, let’s get to the heart of the matter: configuring PySpark on Windows.

Steps to Install PySpark on a Windows Machine:

Step 1: Download and Extract Apache Spark

  1. Visit the official Apache Spark website (https://spark.apache.org/downloads.html).
  2. Choose the latest stable version of Spark.
  3. Select “Pre-built for Apache Hadoop” as the package type and use the “Direct Download” link to download your chosen version.
  4. Extract the downloaded .tgz file to your preferred location.
  5. Let’s say the extracted folder is C:\Spark\spark-3.4.1-bin-hadoop3.2.
  6. This location will be set as SPARK_HOME.
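Windows does not always ship with a tool that opens .tgz archives, so one option is to extract the download with Python’s standard tarfile module. This is a minimal sketch: extract_spark is a hypothetical helper name, and the paths in the usage comment are placeholders for wherever you saved the archive. Only extract archives you trust, such as the one downloaded from the official Spark site.

```python
import tarfile
from pathlib import Path


def extract_spark(archive_path, dest_dir):
    """Extract a .tgz archive into dest_dir and return its top-level folder names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Collect the top-level entries so the caller can see the extracted folder name
        top_level = sorted(
            {Path(m.name).parts[0] for m in archive.getmembers() if Path(m.name).parts}
        )
        archive.extractall(dest)
    return top_level


# Example call (placeholder paths, adjust to your own download location):
# extract_spark(r"C:\Downloads\spark.tgz", r"C:\Spark")
```

The returned folder name, joined with the destination folder, is the path to use for SPARK_HOME.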

Step 2: Download winutils.exe for Hadoop on Windows

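  1. Download the winutils.exe file from GitHub.
  2. Select the Hadoop version that matches the Spark package chosen in Step 1.
  3. Click winutils.exe and download it to C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop\bin.
  4. The parent folder, C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop, will be set as HADOOP_HOME, because Hadoop looks for winutils.exe in the bin folder under HADOOP_HOME.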

This is a very important step when installing PySpark on Windows. Make sure winutils.exe ends up in the Hadoop\bin folder described above.
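Hadoop looks for the file at bin\winutils.exe underneath HADOOP_HOME, so it is worth confirming the file landed in the right place. The sketch below does that check; both function names are invented for this example, and the path in the usage comment is the example folder used in this guide.

```python
from pathlib import Path


def expected_winutils_path(hadoop_home):
    """Return where Hadoop expects winutils.exe for a given HADOOP_HOME folder."""
    return Path(hadoop_home) / "bin" / "winutils.exe"


def has_winutils(hadoop_home):
    """Return True if winutils.exe exists in the bin folder under hadoop_home."""
    return expected_winutils_path(hadoop_home).is_file()


# Example call with the folder used in this guide:
# has_winutils(r"C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop")
```

A result of False usually means the file was saved one level too high or too low in the folder tree.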

Step 3: Set Environment Variables for PySpark

  1. Open the File Explorer
  2. Right Click on “This PC”
  3. Click on “Properties”
  4. Click on “Advanced system settings” on the left.
  5. In the System Properties window, click the “Environment Variables” button.
  6. Under “System variables”, click “New” and add the following variables:
    • Variable name: SPARK_HOME
    • Variable value: The path to the Spark folder you extracted earlier (C:\Spark\spark-3.4.1-bin-hadoop3.2).
    • Variable name: HADOOP_HOME
    • Variable value: The path to the Hadoop folder within the Spark directory, the one that contains bin (e.g., C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop).
  7. Under “System variables”, select the existing “Path” variable, click “Edit”, and add the folder that contains your Python executable (e.g., C:\Users\userName\anaconda3).
  8. Click “OK” to save the environment variables.

Setting the environment variables correctly on your local Windows system is essential for installing PySpark, so double-check each variable name and path before moving on.
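One way to confirm the variables took effect is to read them back from Python. The sketch below is an illustration only: check_spark_env is a made-up helper, and it checks just the two variables named in this guide. Open a new command prompt or restart Jupyter first, because programs that were already running keep their old environment.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Map each required variable to 'missing', 'not a folder', or 'ok'."""
    env = os.environ if env is None else env
    report = {}
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            report[name] = "missing"
        elif not Path(value).is_dir():
            report[name] = "not a folder"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_spark_env().items():
        print(f"{name}: {status}")
```

The optional env argument exists only so the function can be tried against a plain dictionary instead of the real environment.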

Step 4: Install Findspark for PySpark

Findspark is a Python library that helps Python locate the Spark installation on your system. Open your command prompt or terminal and run the following command to install Findspark:

pip install findspark

Step 5: Install PySpark with pip

Open your command prompt or terminal and run the following command:

pip install pyspark
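If several Python installations exist on the machine, pip may have installed the packages into a different one than the interpreter you actually run. A small check with the standard importlib.metadata module shows what the current interpreter can see; installed_version is a hypothetical helper name used only here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("findspark", "pyspark"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either package shows as not installed, rerun the pip command from the same environment that runs this script.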

Step 6: Verify the PySpark Installation

To verify that PySpark is installed correctly, open a Python environment (e.g., Jupyter Notebook or your favorite Python IDE) and run the following code:

import findspark
findspark.init()
import pyspark

If you don’t encounter any errors, congratulations! You’ve successfully installed PySpark on your Windows machine.
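A clean import only shows that the package is on the Python path. As a slightly stronger check, the sketch below starts a local Spark session and counts the rows of a tiny generated DataFrame. The function name spark_smoke_test and the app name are arbitrary choices for this example, and the function deliberately returns None instead of failing when Java or PySpark cannot be found.

```python
import os
import shutil


def spark_smoke_test():
    """Count the rows of a 5-row DataFrame on a local Spark session.

    Returns None when Java or PySpark cannot be found.
    """
    if shutil.which("java") is None and not os.environ.get("JAVA_HOME"):
        return None
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        return None

    # local[1] runs Spark inside this process with a single worker thread
    spark = SparkSession.builder.master("local[1]").appName("smoke-test").getOrCreate()
    try:
        return spark.range(5).count()
    finally:
        spark.stop()


if __name__ == "__main__":
    print("Row count:", spark_smoke_test())
```

A row count of 5 means Python, Java, and Spark are all talking to each other.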

Test the PySpark Installation in Jupyter

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("read csv").getOrCreate()

# Read the CSV file, treating the first row as the header and inferring column types
df = spark.read.csv("employees.csv", inferSchema=True, header=True)

# Show the first rows of the DataFrame
df.show()
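The test above assumes a file named employees.csv already exists in the notebook’s working directory. If you do not have one, the sketch below writes a tiny stand-in with Python’s csv module. The column names and rows are invented for illustration and are not taken from any real dataset.

```python
import csv


def write_sample_employees(path="employees.csv"):
    """Write a small, made-up employees file and return the number of data rows."""
    rows = [
        {"id": 1, "name": "Asha", "department": "Engineering", "salary": 85000},
        {"id": 2, "name": "Ben", "department": "Finance", "salary": 72000},
        {"id": 3, "name": "Chen", "department": "Engineering", "salary": 91000},
    ]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "name", "department", "salary"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    print("Rows written:", write_sample_employees())
```

Run it once from the same folder as your notebook, then rerun the CSV test.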

Conclusion: PySpark Installation on Windows

In this brief guide, we’ve simplified the process of installing PySpark on your Windows machine. Now you’re ready to embark on your data adventures with PySpark and put its capabilities to work. For example, you can read large CSV files, and much more, quickly thanks to its distributed computing architecture.

Remember, the journey of data exploration and analysis begins with a single installation step. Happy coding!
