Pandas Python Tutorial: Complete Guide
Introduction
Pandas is an indispensable Python library for data manipulation, offering a wide range of functionality for data handling and analysis. Its versatility and efficiency make it a go-to library across many industries, helping professionals extract insights from complex datasets. Let’s dive into 10 Minutes to Pandas.
In the context of data analysis, Pandas stands as the cornerstone for Data Scientists and Analysts, often overshadowed by the glitz of machine learning and dazzling visualizations. However, Pandas remains the bedrock of numerous data projects, offering unparalleled functionality.
For those venturing into the realm of data science, mastering Pandas is not just beneficial; it’s a necessity. This post serves as a primer on the crucial aspects of Pandas: its installation, its main uses, and its integration with other prevalent data analysis tools like matplotlib and scikit-learn.
Prerequisites of 10 Minutes to Python Pandas Tutorial
A basic understanding of Python programming concepts is recommended. Familiarity with the fundamentals, such as variables, data types, loops, and conditional statements, will make the learning curve smoother.
What’s Pandas for?
Pandas, a versatile tool, excels at data manipulation and analysis. It serves as a data haven, enabling data cleansing, transformation, and exploration.
Consider a CSV dataset: Pandas effortlessly converts it into a DataFrame, a structured table. From here, a world of possibilities awaits:
- Compute Statistics: Uncover insights by calculating averages, medians, maximums, minimums, or correlations between columns.
- Data Cleansing: Addressing data inconsistencies by removing missing values and filtering rows or columns based on specified criteria.
- Visualization: Working together with Matplotlib, visualize data as bar charts, line charts, histograms, and other graphical representations.
- Persistence: Effortlessly store refined data back into a CSV, another file format, or a database for easy access.
In essence, Pandas serves as a bridge for seamless data navigation, providing a comprehensive suite of functionalities for efficient data management.
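The workflow listed above (load, compute statistics, cleanse, persist) can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the CSV content, the column names, and the variable names are all made up for the example, and the CSV is inlined with io.StringIO so the sketch runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, inlined so no file is needed.
csv_text = """name,age,salary
Alice,30,60000
Bob,25,
Charlie,22,45000
"""

# Load: parse the CSV text into a DataFrame.
people = pd.read_csv(io.StringIO(csv_text))

# Compute statistics: mean of a numeric column.
mean_age = people["age"].mean()

# Cleanse: drop rows that have no salary.
clean = people.dropna(subset=["salary"])

# Persist: write the cleaned table back out as CSV text.
out_csv = clean.to_csv(index=False)

print(mean_age)
print(out_csv)
```

With a real file you would pass a path to pd.read_csv() and to to_csv() instead of using in-memory text.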
10 Minutes to Pandas: The Cornerstone of Data Manipulation and Analysis
In the ever-evolving landscape of data science, Pandas stands as an indispensable cornerstone, providing a robust toolkit for efficient data manipulation, analysis, and exploration. Its multifaceted functionalities streamline the handling of tabular data, making it the quintessential companion for data scientists seeking to extract meaningful insights from vast datasets.
At the heart of Pandas lies the DataFrame, a structured framework that effortlessly organizes and navigates data, transforming raw data into a well-defined format. This structured approach facilitates seamless interoperability with other prevalent libraries like NumPy, Matplotlib, and scikit-learn, empowering data scientists to seamlessly bridge the gap between raw data and actionable insights.
Pandas’ versatility shines through its comprehensive suite of functionalities, encompassing data cleansing, statistical analysis, and data visualization. Whether it’s eliminating inconsistencies, identifying patterns, or calculating statistical measures, Pandas ensures a smooth journey from data ingestion to actionable insights.
Key Highlights: 10 Minutes to Pandas
- Data Cleansing: Remove missing values, outliers, and inconsistencies to ensure data integrity.
- Statistical Analysis: Compute descriptive statistics, perform hypothesis testing, and uncover correlations to gain a deeper understanding of the data.
- Data Visualization: Create compelling visualizations, including bar charts, line graphs, histograms, and scatter plots, to communicate insights effectively.
10 Minutes to Pandas: The Indispensable Data Science Companion
Pandas’ unwavering commitment to efficiency, versatility, and interoperability has cemented its position as an indispensable tool for data scientists. Its ability to transform raw data into actionable insights makes it an essential component of the data science toolkit, empowering data scientists to tackle complex challenges and derive meaningful conclusions from vast datasets.
10 Minutes to Pandas from the Start: A Data Enthusiast’s Guide
Embark on your data exploration journey with Pandas as your trusty companion. Whether you’re an aspiring data enthusiast or a seasoned professional venturing into the realm of data science, mastering Pandas early on will prove invaluable. Its user-friendly interface and versatile functionalities make it an ideal starting point for beginners seeking to navigate and manipulate data efficiently.
Regardless of your chosen path, whether it’s data analysis, machine learning, or data visualization, early adoption of Pandas equips you with fundamental skills essential for seamless data handling. With Pandas as your foundation, you’ll be empowered to extract actionable insights from diverse datasets swiftly and effectively.
So, embrace Pandas from the outset and unlock your potential to transform raw data into meaningful solutions.
Getting Started with 10 Minutes to Pandas Python Tutorial
1. Installation of Pandas
Let’s start the Pandas Python Tutorial from the beginning. Ensure the Pandas library is installed in your Python environment. If it is not, use one of the following commands:
pip install pandas
OR
conda install pandas
You can also install Pandas from within a Jupyter Notebook cell:
!pip install pandas
2. Creating a DataFrame
The cornerstone of Pandas, a DataFrame is a tabular data structure consisting of rows and columns. We can create a DataFrame from various sources like lists, dictionaries, CSV files, or NumPy arrays. For instance:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [30, 25, 22],
'Occupation': ['Data Scientist', 'Software Engineer', 'Data Analyst'] }
df = pd.DataFrame(data)
print(df)
Print Pandas Dataframe as Table
We can use the to_string() method to print a Pandas DataFrame as a table. This method returns a string representation of the DataFrame in tabular format, which we can then print using the print() function.
print(df.to_string())
#Output of print pandas dataframe as table:
      Name  Age         Occupation
0    Alice   30     Data Scientist
1      Bob   25  Software Engineer
2  Charlie   22       Data Analyst
Add a Row to a Dataframe Pandas
Using Pandas Dataframe.loc
We can add a new row to an existing Pandas DataFrame using the DataFrame.loc indexer. DataFrame.loc selects rows by index label, and assigning to a label that does not exist yet appends a new row. With the default integer index, len(df) is the first unused label, so we build the new row as a pandas Series and assign it there:
data = {'Name': 'David', 'Age': 27, 'Occupation': 'Data Engineer'}
index = ['Name', 'Age', 'Occupation']
new_row = pd.Series(data=data, index=index)
last_position = len(df)
df.loc[last_position]=new_row
df
Using pandas.concat
We can also build a one-row DataFrame and use the pandas.concat function to append it to the existing DataFrame. pandas.concat returns a new DataFrame, so assign the result if you want to keep it.
new_row=pd.Series(data={'Name': 'David', 'Age': 27,
'Occupation': 'Data Engineer'},
index=['Name', 'Age','Occupation']
)
df2=pd.DataFrame([new_row])
pd.concat([df, df2], ignore_index=True)
Add Row to Empty Dataframe Pandas
To add a row to an empty Pandas DataFrame, we can use the pandas.concat function as described below:
import pandas as pd
df = pd.DataFrame()
new_row = pd.DataFrame({'Name': ['David'], 'Age': [28], 'Occupation': ['Data Engineer']})
df = pd.concat([df, new_row], ignore_index=True)
print(df)
AttributeError: ‘DataFrame’ object has no attribute ‘append’
The AttributeError: ‘DataFrame’ object has no attribute ‘append’ error arises when you call the append method on a Pandas DataFrame object in pandas 2.0 or later. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so code written for older versions fails with this error. Use either of the methods described above to resolve it. This is how you can add a row to a Pandas DataFrame easily and effectively.
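As a minimal sketch of the migration, the snippet below shows the old append call as a comment and a pandas.concat replacement that gives the same result. The DataFrame contents and the new_row dictionary are made-up sample data.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})
new_row = {"Name": "David", "Age": 27}  # hypothetical row to add

# pandas < 2.0 accepted: df = df.append(new_row, ignore_index=True)
# pandas >= 2.0: wrap the row in a one-row DataFrame and concatenate.
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df)
```

When adding many rows, it is usually cheaper to collect them in a list and call pd.concat once than to concatenate inside a loop.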
Pandas Subtract two Dataframes
To subtract two Pandas DataFrames, you can use the DataFrame.subtract() method (an alias of DataFrame.sub()). It performs an element-wise subtraction, matching values by row index and column name.
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [7, 8, 9], 'b': [10, 11, 12]})
# Subtract the two DataFrames using the DataFrame.subtract() method
df_sub = df1.subtract(df2)
# Print the result
print(df_sub)
# Output of pandas subtract two dataframes
   a  b
0 -6 -6
1 -6 -6
2 -6 -6
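Because the subtraction matches values by label, two DataFrames with different indexes do not line up row by row. The sketch below uses made-up data and names to show what happens when a label is missing on one side, and how the fill_value parameter of sub() changes the result.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"a": [10, 20]}, index=["x", "y"])

# Labels are aligned first; "z" has no partner in `right`, so it becomes NaN.
plain = left.sub(right)

# fill_value substitutes 0 for the missing side before subtracting.
filled = left.sub(right, fill_value=0)

print(plain)
print(filled)
```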
Slice Pandas Dataframe
We can use the DataFrame.iloc indexer to slice a Pandas DataFrame. It is used with square brackets rather than called like a method, and it selects rows and columns by integer position. Positions can be specified individually or as a range.
import pandas as pd
#Create a DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 10]})
#pandas dataframe slicing using the DataFrame.iloc indexer (first 3 rows, all columns)
df_sliced = df.iloc[:3, :]
# Print the sliced DataFrame
print(df_sliced)
OUTPUT:
   a  b
0  1  6
1  2  7
2  3  8
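Position-based slicing with iloc has a label-based counterpart, loc. The sketch below contrasts the two on a DataFrame with string row labels; the labels r1 to r5 are invented for the example. Note that an iloc range excludes its end position, while a loc range includes its end label.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]},
    index=["r1", "r2", "r3", "r4", "r5"],  # hypothetical row labels
)

# iloc: integer positions, end position excluded (rows 0 and 1, first column).
by_position = df.iloc[0:2, 0:1]

# loc: labels, end label included (rows r1 through r3, column "b").
by_label = df.loc["r1":"r3", ["b"]]

print(by_position)
print(by_label)
```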
Convert Pandas to Spark Dataframe: 10 Minutes to Pandas
To convert a Pandas DataFrame to a PySpark DataFrame, follow these steps:
- Create a SparkSession object.
- Load or create the Pandas DataFrame.
- Define the schema for the Spark DataFrame (optional).
- Use the createDataFrame() method to convert the Pandas DataFrame to a Spark DataFrame.
- Save the Spark DataFrame to a file or cache it in memory.
Here is an example of how to convert a Pandas DataFrame to a Spark DataFrame using the Python API:
from pyspark.sql import SparkSession
# Create a PySpark session
spark = SparkSession.builder.appName("PandasDf2SparkDf").getOrCreate()
# Enable Apache Arrow to speed up the Pandas to PySpark conversion
# (Spark 3.0+ key; older releases used spark.sql.execution.arrow.enabled)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# df is the Pandas DataFrame created earlier
spark_df = spark.createDataFrame(df)
spark_df.show(5)
Convert List to Pandas Dataframe
We can convert a list to a Pandas DataFrame using the pd.DataFrame() constructor in Pandas. Here’s an example:
Let’s assume you have a list of lists containing data:
import pandas as pd
# Sample list of lists
data = [
['Alice', 30, 'Data Scientist'],
['Bob', 25, 'Software Engineer'],
['Charlie', 22, 'Data Analyst']
]
# Define column names
columns = ['Name', 'Age', 'Occupation']
# Convert list to Pandas DataFrame
df = pd.DataFrame(data, columns=columns)
print(df)
      Name  Age         Occupation
0    Alice   30     Data Scientist
1      Bob   25  Software Engineer
2  Charlie   22       Data Analyst
Pandas Dataframe from Numpy Array
We can create a Pandas DataFrame from a NumPy array using the DataFrame() constructor. The constructor takes a NumPy array as an argument, and it creates a DataFrame with the same data as the array.
Here is an example of how to create a Pandas DataFrame from a NumPy array:
import pandas as pd
import numpy as np
# Create a NumPy array
array = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
# Create a Pandas DataFrame from the NumPy array
df = pd.DataFrame(array)
# Print the DataFrame
print(df)
OUTPUT
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
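By default the columns of a DataFrame built from an array are numbered 0, 1, 2. As a small sketch, the columns parameter of the constructor can supply names instead, and to_numpy() converts back to an array. The column names x, y, and z are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2, 3], [4, 5, 6]])

# Give the columns names instead of the default 0, 1, 2.
df_named = pd.DataFrame(array, columns=["x", "y", "z"])

# Convert the DataFrame back to a NumPy array.
round_trip = df_named.to_numpy()

print(df_named)
```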
Data Manipulation: Pandas Python Tutorial
Pandas offers a multitude of functions and attributes for data manipulation within DataFrames:
- Get first 3 rows of dataframe pandas
We can get the first n rows of a Pandas DataFrame using the head() method. In the following example n is 3.
#pandas head method can be used to get first n rows of dataframe
df.head(3)
- Selecting Columns:
age_column = df['Age']
- Filtering Rows:
filtered_df = df[df['Age'] > 25]
- Adding or Removing Columns:
# Adding a Salary column
df['Salary'] = [60000, 50000, 45000]
# Removing the 'Name' column
df = df.drop('Name', axis=1)
- Sorting Data:
sorted_df = df.sort_values('Age', ascending=False)
#reset_index renumbers the index after sorting
#drop=True discards the old index values instead of keeping them as a column
sorted_df = sorted_df.reset_index(drop=True)
Pandas Dataframe Drop Rows with Condition
We can drop rows from a Pandas DataFrame based on a condition using the DataFrame.drop() method. This method takes an index label or a list of index labels, so we first select the rows matching the condition and pass their index to drop(). The example below uses the original three-row DataFrame with the Name, Age, and Occupation columns.
#Drop the rows where the value in the column 'Name' is 'Bob'
df.drop(df[df.Name == 'Bob'].index, inplace=True)
OUTPUT
      Name  Age      Occupation
0    Alice   30  Data Scientist
2  Charlie   22    Data Analyst
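An alternative to drop() is to keep the rows that do not match the condition by filtering with a boolean mask. The sketch below is one way to do that, using made-up sample data; it leaves the original DataFrame unchanged and shows how several conditions can be combined.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [30, 25, 22]})

# Keep every row that does NOT match the condition; df itself is unchanged.
without_bob = df[df["Name"] != "Bob"]

# Conditions combine with & (and), | (or), ~ (not); wrap each in parentheses.
kept = df[~((df["Name"] == "Bob") | (df["Age"] < 25))].reset_index(drop=True)

print(without_bob)
print(kept)
```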
Data Wrangling: Pandas Python Tutorial
Data wrangling is a pivotal aspect of data preparation, and Pandas offers an array of tools facilitating this process within Python. This vital stage involves cleaning, transforming, and refining raw data into a structured format suitable for analysis. Pandas empowers users to tackle common data wrangling challenges seamlessly.
- Handling Missing Values:
Handling missing values is an essential part of data preprocessing. Pandas offers various methods to detect, handle, and manage missing data within DataFrames efficiently. Here are some common techniques to handle missing values:
Detecting Missing Values:
isnull() and notnull(): These methods return boolean masks. isnull() marks missing values as True, while notnull() marks non-missing values as True.
info(): Provides a summary of the DataFrame, showing the count of non-null values per column, which can help identify missing values.
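As a short sketch of these detection methods, the snippet below counts missing values per column and selects the rows that contain at least one missing value. The sample data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Salary": [60000.0, np.nan, np.nan],
})

# Count missing values in each column.
missing_per_column = df.isnull().sum()

# Select the rows that have a missing value in any column.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```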
Handling Missing Values:
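fillna(): Replaces missing values with specified values such as a constant, the mean, or the median. Forward and backward fills are available through ffill() and bfill(). Let’s assume we have a Pandas DataFrame in which a few rows have no salary. We can replace the missing salaries with any value using the fillna() method.
# Assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas versions
df['Salary'] = df['Salary'].fillna(500)
dropna(): Removes rows or columns containing missing values based on specified criteria (e.g., drop rows with any null value or only those with all null values).
# dropna() returns a new DataFrame; assign it to keep the result
df = df.dropna()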
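The criteria mentioned for dropna() map onto its how, subset, and thresh parameters. The following sketch runs each option on the same made-up DataFrame so the differences are visible; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [30, np.nan, np.nan],
    "Salary": [60000.0, 50000.0, np.nan],
})

# Default (how="any"): drop rows with at least one missing value.
any_missing = df.dropna()

# how="all": drop only rows where every value is missing.
all_missing = df.dropna(how="all")

# subset: only look at the listed columns when deciding.
need_salary = df.dropna(subset=["Salary"])

# thresh: keep rows that have at least this many non-missing values.
at_least_two = df.dropna(thresh=2)

print(any_missing)
```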
- Dealing with Duplicates:
df.drop_duplicates()
- Converting Data Types:
df['Salary'] = df['Salary'].astype(int)
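A direct astype(int) raises an error if the column still contains missing or non-numeric values. One way to handle this, sketched below with made-up data, is to parse with pd.to_numeric(errors="coerce"), fill the resulting gaps, and then cast. Filling with 0 is only an assumption for the example; the right replacement value depends on your data.

```python
import pandas as pd

raw = pd.Series(["60000", "50000", "n/a"])  # hypothetical raw salary column

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the gap, then cast to an integer dtype.
as_int = numeric.fillna(0).astype(int)

print(as_int)
```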
Data Visualization: 10 Minutes to Pandas
Basic plotting capabilities exist within Pandas for simple visualizations:
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame
df = pd.DataFrame({'Name': ['Ram', 'Sham', 'Jadu', 'Madhu', 'Tarit'], 'Age': [30, 25, 22, 32, 28]})
# Create a bar chart with one bar per employee, showing each employee's age
plt.bar(df['Name'], df['Age'])
# Add a title and labels
plt.title('Employee Age by Name')
plt.xlabel('Name')
plt.ylabel('Age')
# Show the chart
plt.show()
# The same chart using the built-in pandas plotting API
df.plot(x='Name', y='Age', kind='bar')
plt.show()
Advanced visualization can be achieved by integrating Pandas with libraries like Matplotlib and Seaborn.
Time Series Analysis: 10 Minutes to Pandas
Pandas excels in handling time series data:
dates = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01'])
data = [10, 20, 30]
time_series_df = pd.DataFrame({'Date': dates, 'Value': data})
Operations such as rolling averages, resampling, and trend analysis can then be performed with Pandas’ built-in functionality, which makes it well suited to time series analysis.
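As a minimal sketch of two of these operations, the snippet below computes a rolling average and resamples a small daily series into two-day bins. The dates, values, window size, and bin width are arbitrary choices for the example.

```python
import pandas as pd

# Six daily observations with a DatetimeIndex.
dates = pd.date_range("2020-01-01", periods=6, freq="D")
ts = pd.Series([10, 20, 30, 40, 50, 60], index=dates)

# Rolling average over a 3-day window; the first two entries are NaN
# because a full window is not yet available.
rolling_mean = ts.rolling(window=3).mean()

# Resample into 2-day bins and sum the values in each bin.
two_day_sum = ts.resample("2D").sum()

print(rolling_mean)
print(two_day_sum)
```

Both methods need a datetime index, or a datetime column passed through the on parameter, to work with time-based windows and bins.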
Conclusion: 10 Minutes to Pandas
In conclusion, this 10 Minutes to Pandas Python Tutorial covers the essential functions, attributes, and operations, laying the groundwork for using Pandas proficiently in data manipulation and analysis. You can also use PySpark to read very large CSV files more efficiently, using Spark’s distributed architecture. You can review my earlier blog posts on how to install PySpark on Windows and how to read a large CSV file using Python. I hope this article is useful. Please write in the comment section if you have any questions, suggestions, or ideas. Happy Coding!