10 Minutes to Pandas (Python Tutorial) – A Complete Guide

Pandas Python Tutorial includes functions, attributes, and operations, laying the groundwork for utilizing Pandas proficiently

Pandas Python Tutorial: Complete Guide

Introduction

Pandas is an indispensable Python library for data manipulation, offering a wide range of functionality for data handling and analysis. Its versatility and efficiency make it a go-to library across various industries, enabling professionals to extract valuable insights from complex datasets. Let’s dive into 10 Minutes to Pandas.

In data analysis, Pandas is a core tool for Data Scientists and Analysts, even though it is often overshadowed by the glitz of machine learning and dazzling visualizations. It remains the bedrock of numerous data projects, which rely on it for loading, cleaning, and reshaping data.

For those venturing into the realm of data science, mastering this Pandas Tutorial is not just beneficial; it’s a necessity. This post serves as a primer, shedding light on crucial aspects of Pandas—its installation, multifaceted utility, and seamless integration with other prevalent data analysis tools like matplotlib and scikit-learn.

Prerequisites of 10 Minutes to Python Pandas Tutorial

A basic understanding of Python programming concepts is recommended. Familiarity with the fundamentals, such as variables, data types, loops, and conditional statements, will make the material easier to follow.

What’s Pandas for?

Pandas, a versatile tool, excels at data manipulation and analysis. It serves as a data haven, enabling data cleansing, transformation, and exploration.

Consider a CSV dataset: Pandas effortlessly converts it into a DataFrame, a structured table. From here, a world of possibilities awaits:

  • Compute Statistics: Uncover insights by calculating averages, medians, maximums, minimums, or correlations between columns.
  • Data Cleansing: Address data inconsistencies by removing missing values and filtering rows or columns based on specified criteria.
  • Visualization: Together with Matplotlib, visualize data as bar charts, line charts, histograms, and other graphical representations.
  • Persistence: Effortlessly store refined data back into a CSV, another file format, or a database for easy access.

In essence, Pandas serves as a bridge for seamless data navigation, providing a comprehensive suite of functionalities for efficient data management.
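
The workflow described in this section can be sketched end to end. This is a minimal illustration rather than a definitive recipe: the CSV content, the column names (name, age, salary), and the in-memory buffers used in place of real files are all assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real project would pass a file path to read_csv
csv_text = """name,age,salary
Alice,30,60000
Bob,25,
Charlie,22,45000
"""

# Load the CSV into a DataFrame (Bob's empty salary becomes NaN)
df = pd.read_csv(io.StringIO(csv_text))

# Compute statistics on one column
stats = df['age'].agg(['mean', 'median', 'min', 'max'])

# Data cleansing: drop rows with a missing salary, then filter by a condition
clean = df.dropna(subset=['salary'])
over_24 = clean[clean['age'] > 24]

# Persistence: write the cleaned data back out as CSV (here to a buffer)
out = io.StringIO()
clean.to_csv(out, index=False)

print(stats)
print(over_24)
```

Replacing the two StringIO buffers with file paths gives the same flow against real files.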

10 Minutes to Pandas: The Cornerstone of Data Manipulation and Analysis

In the ever-evolving landscape of data science, Pandas stands as an indispensable cornerstone, providing a robust toolkit for efficient data manipulation, analysis, and exploration. Its multifaceted functionalities streamline the handling of tabular data, making it the quintessential companion for data scientists seeking to extract meaningful insights from vast datasets.

At the heart of Pandas lies the DataFrame, a structured framework that effortlessly organizes and navigates data, transforming raw data into a well-defined format. This structured approach facilitates seamless interoperability with other prevalent libraries like NumPy, Matplotlib, and scikit-learn, empowering data scientists to seamlessly bridge the gap between raw data and actionable insights.

Pandas’ versatility shines through its comprehensive suite of functionalities, encompassing data cleansing, statistical analysis, and data visualization. Whether it’s eliminating inconsistencies, identifying patterns, or calculating statistical measures, Pandas ensures a smooth journey from data ingestion to actionable insights.

Key Highlights: 10 Minutes to Pandas

  • Data Cleansing: Remove missing values, outliers, and inconsistencies to ensure data integrity.
  • Statistical Analysis: Compute descriptive statistics and correlations, and prepare data for hypothesis testing with libraries such as SciPy, to gain a deeper understanding of the data.
  • Data Visualization: Create compelling visualizations, including bar charts, line graphs, histograms, and scatter plots, to communicate insights effectively.

10 Minutes to Pandas: The Indispensable Data Science Companion

Pandas’ unwavering commitment to efficiency, versatility, and interoperability has cemented its position as an indispensable tool for data scientists. Its ability to transform raw data into actionable insights makes it an essential component of the data science toolkit, empowering data scientists to tackle complex challenges and derive meaningful conclusions from vast datasets.

10 Minutes to Pandas from the Start: A Data Enthusiast’s Guide

Embark on your data exploration journey with Pandas as your trusty companion. Whether you’re an aspiring data enthusiast or a seasoned professional venturing into the realm of data science, mastering Pandas early on will prove invaluable. Its user-friendly interface and versatile functionalities make it an ideal starting point for beginners seeking to navigate and manipulate data efficiently.

Regardless of your chosen path, whether it’s data analysis, machine learning, or data visualization, early adoption of Pandas equips you with fundamental skills essential for seamless data handling. With Pandas as your foundation, you’ll be empowered to extract actionable insights from diverse datasets swiftly and effectively.

So, embrace Pandas from the outset and unlock your potential to transform raw data into meaningful solutions.

Getting Started with 10 Minutes to Pandas Python Tutorial

1. Installation of Pandas

Let’s start the Pandas Python Tutorial from the beginning. Ensure the Pandas library is installed within your Python environment. If not, use the following pip command:

pip install pandas

OR

conda install pandas

To install Pandas from inside a Jupyter Notebook cell, prefix the command with an exclamation mark:

!pip install pandas

2. Creating a DataFrame

The cornerstone of Pandas, a DataFrame is a tabular data structure consisting of rows and columns. We can create a DataFrame from various sources like lists, dictionaries, CSV files, or NumPy arrays. For instance:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age':  [30, 25, 22],
        'Occupation': ['Data Scientist', 'Software Engineer', 'Data Analyst'] }

df = pd.DataFrame(data)

print(df)

OUTPUT:

      Name  Age         Occupation
0    Alice   30     Data Scientist
1      Bob   25  Software Engineer
2  Charlie   22       Data Analyst

We can use the to_string() method to print a Pandas DataFrame as a table. The method returns a string representation of the DataFrame in tabular format, which we can then print with the print() function.

print(df.to_string())

OUTPUT:

      Name  Age         Occupation
0    Alice   30     Data Scientist
1      Bob   25  Software Engineer
2  Charlie   22       Data Analyst
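
The to_string() method also accepts formatting options. The sketch below, which rebuilds the same sample DataFrame so that it runs on its own, shows two of them: index=False hides the row labels, and columns restricts the output to selected columns. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [30, 25, 22],
    'Occupation': ['Data Scientist', 'Software Engineer', 'Data Analyst'],
})

# Render the table without the 0, 1, 2 row labels
text = df.to_string(index=False)

# Render only two of the three columns
narrow = df.to_string(index=False, columns=['Name', 'Age'])

print(text)
print(narrow)
```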

Add a Row to a Dataframe Pandas

Using Pandas Dataframe.loc

We can add a new row to an existing Pandas DataFrame using the DataFrame.loc indexer. DataFrame.loc selects rows by index label, and assigning to a label that does not exist yet creates a new row. With the default integer index, len(df) is the next unused label, so we build the new row as a Pandas Series and assign it at that label to make it the last row of the DataFrame:

data  = {'Name': 'David', 'Age': 27, 'Occupation': 'Data Engineer'}
index = ['Name', 'Age', 'Occupation']
new_row = pd.Series(data=data, index=index)

last_position = len(df)
df.loc[last_position] = new_row

df

Using pandas.concat

We can create a new one-row Pandas DataFrame and then use the pandas.concat function to add it to the existing DataFrame. pandas.concat returns a new DataFrame and leaves its inputs unchanged, so keep the result in a variable:

new_row = pd.Series(data={'Name': 'David', 'Age': 27,
                          'Occupation': 'Data Engineer'},
                    index=['Name', 'Age', 'Occupation'])
df2 = pd.DataFrame([new_row])
result = pd.concat([df, df2], ignore_index=True)
result

Add Row to Empty Dataframe Pandas

To add a row to an empty Pandas DataFrame, we can use the pandas.concat function as described below:

import pandas as pd

df = pd.DataFrame()

new_row = pd.DataFrame({'Name': ['David'], 'Age': [28], 'Occupation': ['Data Engineer']})

df = pd.concat([df, new_row], ignore_index=True)

print(df)

AttributeError: ‘DataFrame’ object has no attribute ‘append’

The AttributeError: ‘DataFrame’ object has no attribute ‘append’ error arises when you call the append method on a Pandas DataFrame in Pandas 2.0 or later. DataFrame.append was deprecated in Pandas 1.4 and removed in Pandas 2.0, so DataFrames no longer have a built-in method for appending rows directly. Use either of the approaches above, DataFrame.loc or pandas.concat, to resolve this error and add a row to a Pandas DataFrame.
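
As a minimal sketch of the fix, the snippet below shows an old append call (left commented out, since it fails on Pandas 2.0 and later) next to a pandas.concat replacement. The sample data and variable names are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [30, 25]})
new_row = {'Name': 'Charlie', 'Age': 22}

# Old style, raises AttributeError on Pandas 2.0 and later:
# df = df.append(new_row, ignore_index=True)

# Replacement: wrap the dict in a one-row DataFrame and concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df)
```

When many rows need to be added, collecting them in a list and calling pandas.concat once is usually cheaper than concatenating inside a loop, because every call builds a new DataFrame.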

Pandas Subtract two Dataframes

To subtract two Pandas DataFrames, you can use the DataFrame.subtract() method, an alias of DataFrame.sub(). It performs an element-wise subtraction, matching values by row index and column name.

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [7, 8, 9], 'b': [10, 11, 12]})

# Subtract the two DataFrames using the DataFrame.subtract() method
df_sub = df1.subtract(df2)

# Print the result
print(df_sub)

OUTPUT:

   a  b
0 -6 -6
1 -6 -6
2 -6 -6
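
Because subtraction aligns on labels, DataFrames whose indexes only partly overlap produce NaN wherever a label exists on one side only. The following sketch, with made-up data, shows that behaviour and how the fill_value parameter of subtract() treats the missing side as a given value instead.

```python
import pandas as pd

# Two DataFrames whose row labels only partly overlap
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'a': [10, 20, 30]}, index=[1, 2, 3])

# Plain subtraction: labels 0 and 3 exist on one side only, so they become NaN
plain = df1.subtract(df2)

# fill_value=0 substitutes 0 for the missing side before subtracting
filled = df1.subtract(df2, fill_value=0)

print(plain)
print(filled)
```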

Slice Pandas Dataframe

We can use the DataFrame.iloc indexer to slice a Pandas DataFrame. It selects rows and columns by integer position and is used with square brackets rather than parentheses. Positions can be given individually, as a list, or as a range whose end is excluded.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 10]})

# Slice the first three rows and all columns with DataFrame.iloc
df_sliced = df.iloc[:3, :]

# Print the sliced DataFrame
print(df_sliced)

OUTPUT:

   a  b
0  1  6
1  2  7
2  3  8
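
Position-based slicing with iloc is easy to confuse with label-based slicing with loc. The sketch below contrasts the two on a DataFrame with string row labels; the labels 'v' to 'z' are assumptions picked so the difference is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 10]},
    index=['v', 'w', 'x', 'y', 'z'],
)

# iloc works on integer positions; the end of a range is excluded
by_position = df.iloc[1:3, 0:1]      # rows at positions 1 and 2, first column

# loc works on labels; both endpoints of a label slice are included
by_label = df.loc['w':'x', ['a']]    # rows 'w' and 'x', column 'a'

# A single cell: first row, second column
cell = df.iloc[0, 1]

print(by_position)
print(by_label)
print(cell)
```

Here both slices select the same two rows, which shows why the excluded end in iloc and the included end in loc need to be kept in mind.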

Convert Pandas to Spark Dataframe: 10 Minutes to Pandas

To convert a Pandas DataFrame to a PySpark DataFrame, follow these steps:

  • Create a SparkSession object.
  • Create or load the Pandas DataFrame.
  • Define the schema for the Spark DataFrame (optional).
  • Use the createDataFrame() method to convert the Pandas DataFrame to a Spark DataFrame.
  • Save the Spark DataFrame to a file or cache it in memory.

Here is an example of how to convert a Pandas DataFrame to a Spark DataFrame using the Python API:

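import pandas as pd
from pyspark.sql import SparkSession

# A small Pandas DataFrame to convert
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [30, 25, 22]})

# Create a PySpark session
spark = SparkSession.builder.appName("PandasDf2SparkDf").getOrCreate()

# Enable Apache Arrow to speed up the Pandas to PySpark conversion.
# On Spark 3.0 and later this key is deprecated in favor of
# "spark.sql.execution.arrow.pyspark.enabled".
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark_df = spark.createDataFrame(df)

spark_df.show(5)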

Convert List to Pandas Dataframe

We can convert a list to a Pandas DataFrame using the pd.DataFrame() constructor in Pandas. Here’s an example:

Let’s assume you have a list of lists containing data:

import pandas as pd

# Sample list of lists
data = [
    ['Alice', 30, 'Data Scientist'],
    ['Bob', 25, 'Software Engineer'],
    ['Charlie', 22, 'Data Analyst']
]

# Define column names
columns = ['Name', 'Age', 'Occupation']

# Convert list to Pandas DataFrame
df = pd.DataFrame(data, columns=columns)

print(df)

OUTPUT:

      Name  Age         Occupation
0    Alice   30     Data Scientist
1      Bob   25  Software Engineer
2  Charlie   22       Data Analyst

Pandas Dataframe from Numpy Array

We can create a Pandas DataFrame from a NumPy array using the DataFrame() constructor. The constructor takes a NumPy array as an argument, and it creates a DataFrame with the same data as the array.

Here is an example of how to create a Pandas DataFrame from a NumPy array:

import pandas as pd
import numpy as np

# Create a NumPy array
array = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Create a Pandas DataFrame from the NumPy array
df = pd.DataFrame(array)

# Print the DataFrame
print(df)

OUTPUT:

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
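
By default the columns and rows are numbered from 0. As a small sketch, the DataFrame() constructor also accepts columns and index arguments to name them, and to_numpy() converts back to an array. The labels 'x', 'y', 'z' and 'r1' to 'r3' are placeholders invented for this example.

```python
import numpy as np
import pandas as pd

array = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Name the columns and rows while building the DataFrame
df = pd.DataFrame(array, columns=['x', 'y', 'z'], index=['r1', 'r2', 'r3'])

# Look up a value by row and column label
value = df.loc['r2', 'y']

# Convert the DataFrame back into a NumPy array
back = df.to_numpy()

print(df)
print(value)
```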

Data Manipulation: Pandas Python Tutorial

Pandas offers a multitude of functions and attributes for data manipulation within DataFrames. The examples below assume df is the Name, Age, Occupation DataFrame created earlier:

  • Get the first 3 rows of a Pandas DataFrame:
    The head() method of a DataFrame returns its first n rows. In the following example n is 3.
# head(n) returns the first n rows of the DataFrame
df.head(3)
  • Selecting Columns:
age_column = df['Age']
  • Filtering Rows:
filtered_df = df[df['Age'] > 25]
  • Adding or Removing Columns:
# Adding a Salary column (one value per row)
df['Salary'] = [60000, 50000, 45000]

# Removing the 'Name' column
df = df.drop('Name', axis=1)
  • Sorting Data:
sorted_df = df.sort_values('Age', ascending=False)

# reset_index renumbers the rows after sorting;
# drop=True discards the old index values instead of keeping them as a column
sorted_df = sorted_df.reset_index(drop=True)
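
The operations in the list above can be tried together in one self-contained script. This is only a sketch: it rebuilds the sample DataFrame, and the salary figures and result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [30, 25, 22],
    'Occupation': ['Data Scientist', 'Software Engineer', 'Data Analyst'],
})

first_two = df.head(2)                  # first n rows
ages = df['Age']                        # one column, returned as a Series
older = df[df['Age'] > 24]              # rows matching a condition

df['Salary'] = [60000, 50000, 45000]    # add a column, one value per row
trimmed = df.drop(columns='Occupation') # drop a column, returns a new DataFrame

# Sort by age (ascending) and renumber the rows from 0
by_age = df.sort_values('Age').reset_index(drop=True)

print(by_age)
```

Most of these calls return a new object and leave df untouched, which is why the results are stored in separate variables.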

Pandas Dataframe Drop Rows with Condition

We can drop rows from a Pandas DataFrame based on a condition using the DataFrame.drop() method. This method takes an index label or a list of index labels, so we first select the rows that match the condition and pass their index to drop(). The example uses the original Name, Age, Occupation DataFrame:

# Drop the rows where the value in the column 'Name' is 'Bob'
df.drop(df[df.Name == 'Bob'].index, inplace=True)

OUTPUT
      Name  Age      Occupation
0    Alice   30  Data Scientist
2  Charlie   22    Data Analyst
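
An alternative that is often easier to read is to keep the rows that do not match the condition, using a boolean mask. The sketch below, on a rebuilt copy of the sample data, shows that this gives the same rows as drop() and how several conditions can be combined.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [30, 25, 22],
    'Occupation': ['Data Scientist', 'Software Engineer', 'Data Analyst'],
})

# Keep every row whose Name is not 'Bob' (df itself is left unchanged)
without_bob = df[df['Name'] != 'Bob']

# The drop()-based approach without inplace=True returns the same rows
dropped = df.drop(df[df['Name'] == 'Bob'].index)

# Combine conditions with | (or) and & (and); ~ negates the mask
mask = (df['Age'] < 25) | (df['Name'] == 'Bob')
remaining = df[~mask]

print(without_bob)
print(remaining)
```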

Data Wrangling: Pandas Python Tutorial

Data wrangling is a pivotal aspect of data preparation, and Pandas offers an array of tools facilitating this process within Python. This vital stage involves cleaning, transforming, and refining raw data into a structured format suitable for analysis. Pandas empowers users to tackle common data wrangling challenges seamlessly.

  • Handling Missing Values:

Handling missing values is an essential part of data preprocessing. Pandas offers various methods to detect, handle, and manage missing data within DataFrames efficiently. Here are some common techniques to handle missing values:

Detecting Missing Values:

isnull() and notnull(): These methods return boolean masks over the DataFrame or Series. isnull() marks missing values as True, and notnull() marks non-missing values as True.

info(): Provides a summary of the DataFrame, showing the count of non-null values per column, which can help identify missing values.
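
A short sketch of these detection methods follows. The DataFrame with one missing salary is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [60000, np.nan, 45000],
})

# Boolean mask with the same shape as df: True where a value is missing
mask = df.isnull()

# Summing the mask counts the missing values in each column
missing_per_column = df.isnull().sum()

# Select the rows where Salary is missing
rows_with_missing = df[df['Salary'].isnull()]

print(missing_per_column)
print(rows_with_missing)

# Prints a summary that includes the non-null count of every column
df.info()
```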

Handling Missing Values:

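fillna(): Replaces missing values with a specified value such as a constant, the mean, or the median. The related ffill() and bfill() methods fill forward or backward from neighboring values. Let’s assume we have a Pandas DataFrame in which a few rows have no salary. We can replace the missing salaries with any value using the fillna() method. Assigning the result back to the column is more reliable than calling fillna() with inplace=True on a selected column, which newer Pandas versions warn about:

df['Salary'] = df['Salary'].fillna(500)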

dropna(): It removes rows or columns containing missing values based on specified thresholds (e.g., drop rows with any null value or only those with all null values).

# dropna() returns a new DataFrame, so assign the result to keep it
df = df.dropna()
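
The options mentioned above can be sketched on a small DataFrame. The data is made up, and the choice of the column mean for Age and 0 for Salary is only an example of per-column fill values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [30, np.nan, 22, np.nan],
    'Salary': [60000, np.nan, 45000, np.nan],
})

# Fill each column with its own replacement value
filled = df.fillna({'Age': df['Age'].mean(), 'Salary': 0})

# Drop rows that contain at least one missing value (the default, how='any')
any_missing_dropped = df.dropna()

# Drop only the rows in which every value is missing
all_missing_dropped = df.dropna(how='all')

# Drop rows only when a specific column is missing
salary_required = df.dropna(subset=['Salary'])

print(filled)
print(all_missing_dropped)
```
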
  • Dealing with Duplicates:
df.drop_duplicates()
  • Converting Data Types:

df['Salary'] = df['Salary'].astype(int)
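
The last two items can be sketched together on a small, invented DataFrame in which one row is repeated and the salaries were read in as text. The subset and keep arguments of drop_duplicates() and the errors argument of pandas.to_numeric() are shown as options, not as the only way to do this.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob'],
    'Salary': ['60000', '50000', '50000'],
})

# Remove rows that are exact duplicates of an earlier row
deduped = df.drop_duplicates()

# Compare on selected columns only, keeping the last occurrence
deduped_by_name = df.drop_duplicates(subset='Name', keep='last')

# Convert the text salaries to integers; assign() returns a new DataFrame
clean = deduped.assign(Salary=deduped['Salary'].astype(int))

# to_numeric with errors='coerce' turns unparseable values into NaN
# instead of raising an error
coerced = pd.to_numeric(pd.Series(['1', 'x']), errors='coerce')

print(clean)
print(coerced)
```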

Data Visualization: 10 Minutes to Pandas

Pandas has basic plotting capabilities built on Matplotlib, and the columns of a DataFrame can also be passed directly to Matplotlib functions:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
df = pd.DataFrame({'Name': ['Ram', 'Sham', 'Jadu', 'Madhu', 'Tarit'], 'Age': [30, 25, 22, 32, 28]})

# Create a bar chart of each employee's age (names on the x-axis, ages as bar heights)
plt.bar(df['Name'], df['Age'])

# Add a title and labels
plt.title('Employee Age')
plt.xlabel('Name')
plt.ylabel('Age')

# Show the chart
plt.show()

The plot() method that Pandas provides on a Series or DataFrame draws a similar chart, with the row index on the x-axis:

df['Age'].plot(kind='bar')

Advanced visualization can be achieved by integrating Pandas with libraries like Matplotlib and Seaborn.

Time Series Analysis: 10 Minutes to Pandas

Pandas excels in handling time series data:

dates = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01'])
data = [10, 20, 30]
time_series_df = pd.DataFrame({'Date': dates, 'Value': data})

Pandas also supports time series operations such as rolling averages, resampling, and trend analysis, which makes it well suited to time series work.
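
As a minimal sketch of two of these operations, the snippet below computes a rolling average and resamples daily values into two-day totals. The six daily values and the window sizes are assumptions chosen to keep the arithmetic easy to check.

```python
import pandas as pd

# Six daily observations indexed by date
dates = pd.date_range('2020-01-01', periods=6, freq='D')
ts = pd.DataFrame({'Value': [10, 20, 30, 40, 50, 60]}, index=dates)

# Rolling average over a 3-day window; the first two entries are NaN
# because a full window is not yet available
rolling_mean = ts['Value'].rolling(window=3).mean()

# Resample into 2-day bins and sum the values in each bin
two_day_sum = ts['Value'].resample('2D').sum()

print(rolling_mean)
print(two_day_sum)
```

Both rolling() and resample() here rely on the dates being the index of the DataFrame rather than an ordinary column.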

Conclusion: 10 Minutes to Pandas

In conclusion, this guide covers the essentials of the 10 Minutes to Pandas Python Tutorial, including functions, attributes, and operations, laying the groundwork for using Pandas proficiently in data manipulation and analysis. You can also use PySpark to read very large CSV files more efficiently using Spark’s distributed architecture. You can review my earlier blog posts on installing PySpark on Windows and on reading large CSV files using Python. I hope this article is useful. Please write in the comment section if you have any questions, suggestions, or ideas. Happy Coding!
