Import Parquet File from Amazon S3
The Parquet file format is a popular column-oriented data format known for its efficient storage of large datasets and its support for fast analytical queries. PostgreSQL, on the other hand, is a powerful relational database system capable of handling complex datasets and transactions. In this blog post, we will demonstrate how to read a Parquet file from S3 in Python and how to import that data into PostgreSQL, with the assistance of the awswrangler library. Please read my earlier blog post for details of the Parquet file format.
Create a Parquet File with PyArrow
To create a Parquet table view with the columns employee_id, Name, DateofBirth, Salary, and Department, you can use the following steps:
- Create a PyArrow Table with the Desired Columns
- Write the PyArrow table to a Parquet file
- Read the Parquet file using a Parquet reader
- Select the desired columns from the Parquet reader
import pyarrow as pa
import pyarrow.parquet as pq
# Define the arrays
arrays = [
    pa.array([1, 2, 3, 4, 5]),
    pa.array(['Ashok', 'Gautam', 'Akshay', 'Birbal', 'Eve']),
    pa.array(['1980-01-01', '1981-02-02', '1982-03-03', '1983-04-04', '1984-05-05']),
    pa.array([100000, 110000, 120000, 130000, 140000]),
    pa.array(['Engineering', 'Sales', 'Marketing', 'Product', 'Customer Support'])
]
# Create a pyarrow table
table = pa.Table.from_arrays(arrays, names=['employee_id', 'Name', 'DateofBirth', 'Salary', 'Department'])
# Write the pyarrow table to Parquet file
pq.write_table(table, 'employees.parquet')
# Read the Parquet file back
table = pq.read_table('employees.parquet')
# Select the desired columns
employee_view = table.select(['employee_id', 'Name', 'DateofBirth', 'Salary', 'Department'])
# Print the employee view
print(employee_view)
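If you only need some of the columns, PyArrow can also project them at read time instead of loading the whole file. Here is a minimal sketch against the employees.parquet file created above; the columns argument of pq.read_table limits the read to the listed columns.
import pyarrow.parquet as pq
# Read only the Name and Salary columns from the Parquet file
subset = pq.read_table('employees.parquet', columns=['Name', 'Salary'])
print(subset)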
Write PyArrow Table into Parquet File
Why Import Data From S3 to PostgreSQL?
There are many reasons why you might want to import Parquet files to PostgreSQL, including:
- Performance and scalability: Parquet files are designed for analytical queries, and their compact, columnar layout makes them efficient to read and load in bulk, which helps keep imports into PostgreSQL fast even for large datasets.
- Flexibility and functionality: PostgreSQL is a feature-rich relational database management system (RDBMS) that offers a wide range of functionality, including ACID transactions, foreign keys, and complex queries.
- Cloud storage integration: Parquet files are often stored in cloud storage services, such as Amazon S3 or Azure Blob Storage. PostgreSQL supports foreign data wrappers (FDWs) that allow you to query Parquet files directly from your PostgreSQL database, without having to download them to your local machine.
- Big data integration: Parquet files are commonly used in big data frameworks, such as Spark and Hive. PostgreSQL supports a variety of tools and libraries that make it easy to import Parquet files from big data frameworks.
- Long-term retention and compliance: PostgreSQL is a reliable and durable database that is well-suited for storing historical and archival data. Parquet files are also designed for long-term retention, as they support efficient compression and encoding.
How to Import Data from S3 to PostgreSQL with Python?
Read Parquet Files to PostgreSQL with Python and AWS Wrangler
AWS Wrangler (the awswrangler package, now known as AWS SDK for pandas) is a Python library that simplifies the integration between AWS services and pandas-based data analysis. It supports reading and writing Parquet files from and to a variety of sources and destinations.
To read Parquet files from Amazon S3 to PostgreSQL with Python and AWS Wrangler, you can follow these steps:
- Install AWS Wrangler using pip or conda.
- Import AWS Wrangler in your Python script.
- Read the Parquet files from S3 using the wr.s3.read_parquet() function. This function returns a pandas DataFrame with the data from the Parquet files.
- Write the DataFrame to PostgreSQL using the wr.postgresql.to_sql() function, passing an open database connection. This function creates a table in PostgreSQL with the same schema as the DataFrame and inserts the data into it.
Here is an example of how to import Parquet files from S3 to PostgreSQL using AWS Wrangler:
import awswrangler as wr
import pg8000
# Connect to the PostgreSQL database
# (wr.postgresql.to_sql expects a pg8000 connection rather than a SQLAlchemy engine)
con = pg8000.connect(
    host='localhost',
    port=5432,
    database='my_database',
    user='postgres',
    password='password'
)
# Read the Parquet files from S3 into a pandas DataFrame
df = wr.s3.read_parquet("s3://my-bucket/my-parquet-files")
# Write the DataFrame to PostgreSQL ('overwrite' recreates the table if it already exists)
wr.postgresql.to_sql(df=df, con=con, schema="public", table="my_table", mode="overwrite")
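To confirm the load, you can query the table back through awswrangler as well. The following is a small sketch that assumes the con connection and my_table name from the example above; wr.postgresql.read_sql_query returns the query result as a pandas DataFrame.
# Verify the import by reading a few rows back from PostgreSQL
check_df = wr.postgresql.read_sql_query("SELECT * FROM public.my_table LIMIT 5", con=con)
print(check_df)
# Close the connection when you are done
con.close()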
AWS Wrangler also supports importing Parquet files from other sources, such as Redshift Spectrum and local disk. It also supports writing Parquet files to other destinations, such as S3 and Redshift Spectrum.
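In the other direction, a DataFrame can be written back to S3 as Parquet with wr.s3.to_parquet. Below is a minimal sketch assuming a hypothetical output prefix s3://my-bucket/exported-parquet/; with dataset=True, awswrangler writes the DataFrame as a Parquet dataset under that prefix.
# Write the DataFrame back to S3 as a Parquet dataset
wr.s3.to_parquet(df=df, path="s3://my-bucket/exported-parquet/", dataset=True)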
Conclusion
In this blog post, we showed you how to import Parquet files into PostgreSQL using Python and awswrangler. We explained what Parquet and PostgreSQL are, why you might want to import Parquet files into PostgreSQL, and how to do it with a few lines of code.
We hope you found this post useful and informative. If you have any questions or feedback, please leave a comment below.