Convert Parquet File to CSV in Python
Have you stored data in Parquet format and now need to read it? Struggling to convert a Parquet file to CSV? This step-by-step guide shows how to convert a Parquet file to CSV in Python, quickly and easily.
Table of Contents
1. Convert a Parquet File Format in Python
2. Why convert to CSV?
3. Converting parquet to csv using Pandas
4. Convert parquet file to csv using PySpark
5. Tips for converting Parquet to CSV
Convert a Parquet File Format in Python
Parquet is a columnar storage format that is widely used for storing large datasets efficiently. It is a binary format, which means that it is not human-readable. CSV, on the other hand, is a text-based format that is easy to read and manipulate. Storing data as Parquet can reduce storage cost by around 70% compared to CSV, although the exact saving depends on the data and the compression used. So we can store data in Parquet format and, whenever required, convert it to CSV to read it.
Why convert to CSV?
There are a few reasons why you might want to convert Parquet files to CSV:
- Compatibility: CSV is more widely supported than Parquet. You can open and read CSV files in a variety of applications, including spreadsheets, databases, and data visualization tools.
- Human-readability: CSV files are human-readable, which means that you can inspect the data without having to use any special tools.
- Ease of sharing: CSV files are easy to share with other users. You can simply send the file via email or upload it to a cloud storage service.
Converting Parquet File to CSV using Pandas
This guide uses a sample Parquet file of employees. The file is not human-readable, so we need to convert it to CSV (or another readable format) before we can open it in a text editor.
The easiest way to convert Parquet files to CSV is to use a programming language. Although a number of libraries can help with this task, we will look at Pandas and PySpark here. The pandas.read_parquet function reads the Parquet file into a Pandas DataFrame. The DataFrame.to_csv method then writes that DataFrame to a CSV file, which can be opened in Excel or a text editor. A detailed analysis of Pandas DataFrames is covered in another blog post.
We will use the steps below to convert Parquet to CSV:
- Import the necessary libraries and create the sample Parquet file from a CSV file:
import pandas as pd
#CSV file with full path
employee_file = "employees.csv"
#Read csv file into pandas dataframe
df = pd.read_csv(employee_file)
#Write the DataFrame to a json file
df.to_json("employees.json")
#Write the DataFrame to a parquet file
df.to_parquet("employees.parquet")
#Print the DataFrame (pandas DataFrames have no show() method)
print(df)
- Read the Parquet file into a Pandas DataFrame:
df = pd.read_parquet('employees.parquet')
- Write the DataFrame to a CSV file:
#index=False leaves the DataFrame's row index out of the CSV
df.to_csv('employees.csv', index=False)
To convert a Parquet file to CSV using Pandas, a Parquet engine such as PyArrow must be installed on the system.
Convert Parquet File using PySpark
We can convert very large Parquet files to CSV easily and efficiently using the PySpark library. Installing PySpark is not complex if you follow the instructions properly. Please refer to my earlier blog post to learn how to install PySpark on Windows. If you run into any issues, please let me know.
import pyspark
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("parquet2csv").getOrCreate()
#read parquet file
df = spark.read.parquet("employees.parquet")
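#write the DataFrame to csv using pyspark
#(Spark creates a directory named employees.csv that contains part files)
df.write.csv("employees.csv")
#stop the Spark session (SparkSession has stop(), not close())
spark.stop()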
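Note that Spark's `df.write.csv("employees.csv")` produces a directory of part files rather than a single CSV file. If one file is needed, the parts can be concatenated afterwards. The sketch below uses only the Python standard library; the helper name `merge_spark_csv_parts` is hypothetical, and it assumes the part files match `part-*.csv` and that no header value contains a line break.

```python
import glob
import os
import shutil


def merge_spark_csv_parts(parts_dir: str, out_file: str,
                          has_header: bool = False) -> int:
    """Concatenate Spark's part-*.csv files into one CSV file.

    Returns the number of part files merged. Set has_header=True if the
    parts were written with a header row, so it is kept only once.
    """
    # Sort so the parts are merged in partition order
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    with open(out_file, "w", encoding="utf-8", newline="") as out:
        for i, part in enumerate(part_files):
            with open(part, "r", encoding="utf-8", newline="") as src:
                if has_header and i > 0:
                    src.readline()  # skip the repeated header line
                shutil.copyfileobj(src, out)
    return len(part_files)
```

For example, `merge_spark_csv_parts("employees.csv", "employees_single.csv")` would merge the directory written above into one file; `employees_single.csv` is just an illustrative output name. For small results, calling `coalesce(1)` on the DataFrame before writing is another common option.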
Tips for Converting Parquet File Format into Other Format
- Choose the right tools: If you are not familiar with programming, there are a number of online tools that can help you convert Parquet files to CSV. However, these tools may not be as flexible or powerful as using a programming language.
- Handle missing values: Parquet files can contain missing values. When converting to CSV, you need to decide how to handle these values. You can either drop the rows or columns with missing values, or you can replace them with a default value.
- Specify the encoding: CSV files can be encoded in different ways. When writing the CSV file, you need to specify the encoding that you want to use. The most common encoding is UTF-8.
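The last two tips can be sketched with Pandas as follows. This is one possible approach under stated assumptions: the helper name `write_csv_with_options` and its parameters are illustrative, and it uses `DataFrame.dropna` to drop rows with missing values or the `na_rep` argument of `to_csv` to write a placeholder for them.

```python
import pandas as pd


def write_csv_with_options(df: pd.DataFrame, csv_path: str,
                           drop_missing: bool = False,
                           missing_marker: str = "NA") -> None:
    """Write df to a UTF-8 CSV file (hypothetical helper).

    Rows with missing values are dropped when drop_missing is True;
    otherwise missing cells are written as `missing_marker`.
    """
    if drop_missing:
        df = df.dropna()
    # na_rep sets the text written for missing values;
    # encoding makes the output encoding explicit
    df.to_csv(csv_path, index=False, na_rep=missing_marker, encoding="utf-8")
```

A call such as `write_csv_with_options(df, "employees.csv", missing_marker="0")` would write `0` wherever a value is missing, while `drop_missing=True` removes those rows instead.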
Conclusion
Converting Parquet files to CSV is a relatively simple task. By following the steps outlined in this guide, you can easily convert your Parquet files to CSV and unlock a world of possibilities.
Similarly, you can convert a CSV file to Parquet format by following the steps given in this article.