How to Convert Parquet to CSV File Format in Python

Reasons to Convert Parquet to CSV

There are a few reasons why you might want to convert a Parquet file to CSV:

  • Compatibility: CSV is more widely supported than Parquet. You can open and read CSV files in a variety of applications, including spreadsheets, databases, and data visualization tools.
  • Human Readability: CSV files are human-readable, which means that you can inspect the data without having to use any special tools.
  • Ease of sharing: CSV files are easy to share with other users. You can simply send the file via email or upload it to a cloud storage service.
  • Smaller Datasets: For smaller datasets, the performance benefits of Parquet may be negligible, and CSV’s simplicity becomes preferable.

Understanding Parquet and CSV File Formats

Before diving into the conversion process, let’s briefly understand the characteristics of Parquet and CSV:

Parquet:

  • Columnar storage format.
  • Highly efficient compression.
  • Optimized for analytical queries.
  • Ideal for large datasets.
  • Binary format.

CSV (Comma-Separated Values):

  • Row-based storage format.
  • Simple and human-readable.
  • Universally supported by various tools.
  • No built-in compression, so files are usually larger than the equivalent Parquet file.
  • Text-based format.

Python Libraries for File Format Conversion

To convert Parquet to CSV in Python, we'll use three libraries:

  • Pandas: A data manipulation and analysis library for Python. It reads and writes data in many formats, including CSV and Parquet.
  • PyArrow: The Python library for Apache Arrow, a cross-language development platform for in-memory data. PyArrow provides efficient support for Parquet files.
  • PySpark: The Python API for Apache Spark. It can read Parquet and write CSV in a distributed fashion, which suits datasets too large for a single machine.
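
Before running any of the examples, it can help to check which of these libraries are installed. The following is a minimal sketch using only the standard library; the function name available_libraries is hypothetical.

```python
from importlib.util import find_spec


def available_libraries(names=("pandas", "pyarrow", "pyspark")):
    """Return a mapping of library name -> True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


for name, installed in available_libraries().items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any library reported as missing can be installed with pip before continuing.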

How to Convert a Parquet File to CSV Using Pandas

Parquet File Format Example

The sample file used here is a Parquet file of employee records. Because Parquet is a binary format, it is not human-readable. Converting it to CSV (or another text-based format) lets us open and read it in any text editor.

  1. Import Pandas and, if you do not have a Parquet file yet, create a sample one from a CSV file:
Convert CSV to Parquet Pandas
import pandas as pd

# Path to the source CSV file
employee_file = "employees.csv"

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(employee_file)

# Write the DataFrame to a JSON file
df.to_json("employees.json")

# Write the DataFrame to a Parquet file (requires a Parquet engine such as PyArrow)
df.to_parquet("employees.parquet")

# Print the DataFrame (Pandas DataFrames have no show() method)
print(df)
  2. Read the Parquet file into a Pandas DataFrame:
Read Parquet File
df = pd.read_parquet('employees.parquet')
  3. Write the DataFrame to a CSV file:
Convert to CSV file
# index=False leaves the DataFrame's row index out of the CSV
df.to_csv('employees.csv', index=False)

Pandas relies on a Parquet engine to read Parquet files, so PyArrow (or an alternative engine such as fastparquet) must be installed before you run the code above.

PyArrow for Parquet to CSV

The complete Parquet-to-CSV conversion can be written as the following Python function:

PyArrow Python Script for Parquet to CSV conversion
import pyarrow.parquet as pq  # table.to_pandas() below also requires pandas to be installed

def parquet_to_csv(parquet_file, csv_file):
    try:
        # Read the Parquet file using pyarrow
        table = pq.read_table(parquet_file)

        # Convert the PyArrow table to a Pandas DataFrame
        df = table.to_pandas()

        # pandas DataFrame to a CSV file
        df.to_csv(csv_file, index=False)

        print(f"Successfully converted '{parquet_file}' to '{csv_file}'")

    except FileNotFoundError:
        print(f"Error: Parquet file '{parquet_file}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage:
parquet_to_csv("employees.parquet", "employees.csv")

PySpark for Parquet to CSV in Python

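PySpark Transfer Parquet into CSV
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("parquet2csv").getOrCreate()

# Read the Parquet file
df = spark.read.parquet("employees.parquet")

# Write the DataFrame as CSV with a header row.
# Spark creates a directory named "employees.csv" that holds one or more part-*.csv files,
# and by default it raises an error if that path already exists.
df.write.csv("employees.csv", header=True)

# Stop the Spark session (SparkSession has no close() method)
spark.stop()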
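
Spark writes CSV output as a directory of part files rather than as a single file. If one file is needed afterwards, the parts can be concatenated. Here is a minimal standard-library sketch; the function name merge_spark_csv_parts and its has_header parameter are hypothetical. With has_header=True it assumes every part file starts with the same header row, which is the case when the data was written with header=True; pass has_header=False for output written without a header.

```python
import glob
import os


def merge_spark_csv_parts(output_dir, merged_path, has_header=True):
    """Concatenate Spark's part-*.csv files into one CSV file.

    Returns the number of part files merged. With has_header=True, the
    header row of each part is read and written to the output only once.
    """
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    with open(merged_path, "w", encoding="utf-8", newline="") as merged:
        for index, part in enumerate(part_files):
            with open(part, "r", encoding="utf-8", newline="") as source:
                if has_header:
                    header = source.readline()
                    if index == 0:
                        merged.write(header)  # keep the header only once
                for line in source:
                    merged.write(line)
    return len(part_files)


# Example usage:
# merge_spark_csv_parts("employees.csv", "employees_merged.csv")
```

Alternatively, calling df.coalesce(1) before df.write.csv makes Spark produce a single part file, at the cost of funneling all the data through one task.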

Tips for Converting Parquet Files to Other Formats

  • Choose the right tools: If you are not familiar with programming, there are a number of online tools that can help you convert Parquet files to CSV. However, these tools may not be as flexible or powerful as a programming language.
  • Handle missing values: Parquet files can contain missing (null) values, and CSV has no standard way to represent them. When converting to CSV, decide how to handle these values: drop the rows or columns that contain them, replace them with a default value, or write an explicit marker.
  • Specify the encoding: CSV files carry no information about their encoding, so state the encoding explicitly when writing the file. UTF-8 is the most common choice and is the default in Pandas.
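
The last two tips can be sketched with Pandas as follows. The sample data is made up for illustration, and the "NULL" marker and default values are arbitrary choices; pick whatever the consumer of the CSV expects.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with one missing name and one missing salary.
df = pd.DataFrame(
    {"name": ["Ana", None, "Zoë"], "salary": [52000.0, 61000.0, None]}
)

# Option 1: keep missing values but write an explicit marker for them.
csv_with_marker = df.to_csv(index=False, na_rep="NULL")

# Option 2: drop every row that has at least one missing value.
csv_dropped = df.dropna().to_csv(index=False)

# Option 3: replace missing values with per-column defaults.
csv_filled = df.fillna({"name": "unknown", "salary": 0}).to_csv(index=False)

# Writing to a file: state the encoding explicitly and read it back the same way.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "employees.csv")
    df.to_csv(path, index=False, na_rep="NULL", encoding="utf-8")
    with open(path, encoding="utf-8") as handle:
        round_trip = handle.read().splitlines()

print(csv_with_marker)
```

If the CSV will be opened in Excel, encoding="utf-8-sig" adds a byte order mark that helps Excel detect UTF-8.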

Conclusion

Converting Parquet files to CSV is a relatively simple task. With Pandas, PyArrow, or PySpark, the conversion takes only a few lines of Python. Pandas and PyArrow suit files that fit on a single machine, while PySpark suits larger, distributed datasets.
