This chapter introduces two foundational third-party libraries in Python's data science ecosystem: NumPy, which provides efficient numerical arrays, and Pandas, which offers labeled data structures for data manipulation and analysis. It is a high-level overview intended to prepare you for more specialized data work.
Much of Python's strength in data science comes from its rich third-party libraries. NumPy (Numerical Python) and Pandas are the foundation most of those libraries build on for handling large datasets efficiently.
1. NumPy (Numerical Python)
NumPy provides the central data structure of scientific computing in Python: the ndarray (N-dimensional array). It is a fast, memory-efficient container for numerical data.
A. The ndarray
Unlike Python lists, which can mix values of any type, a NumPy array holds data of a single, uniform type (e.g., all integers or all floats). This uniformity lets NumPy store the values in one contiguous block of memory and operate on the entire array much faster than Python can iterate over a list.
- Installation:
pip install numpy
- Import Convention:
import numpy as np # Standard alias
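To make the single-type rule concrete, here is a minimal sketch of how NumPy picks and reports an array's type. The variable names are illustrative only, and the default integer type depends on the platform, so no exact value is shown for it.

```python
import numpy as np

# All integers: NumPy picks a platform-dependent integer type
ints = np.array([1, 2, 3])

# Mixed integers and floats: every element is upcast to a float
floats = np.array([1, 2.5, 3])

print(ints.dtype)    # an integer dtype (exact width depends on the platform)
print(floats.dtype)  # float64

# A specific type can be requested with the dtype argument
small = np.array([1, 2, 3], dtype=np.int8)
print(small.itemsize)  # bytes used per element: 1
```

Choosing a smaller dtype such as int8 reduces memory use, at the cost of a smaller range of representable values.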
B. Vectorization (Optimized Operations)
NumPy allows you to perform operations on entire arrays without writing explicit loops, a concept called vectorization. This delegates the heavy lifting to highly optimized C code under the hood.
import numpy as np
# Creating an array
arr = np.array([1, 2, 3, 4])
# Vectorized operation: add 10 to every element without an explicit loop
result = arr + 10
print(result)
# Output: [11 12 13 14]
# Operations between two arrays (element-wise)
arr2 = np.array([5, 5, 5, 5])
multiplied = arr * arr2
print(multiplied)
# Output: [ 5 10 15 20]
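As a rough comparison, the sketch below writes the same addition once as an explicit Python loop and once as a vectorized expression, then shows a couple of built-in aggregations. It is only an illustration of the idea; the names are made up for this example.

```python
import numpy as np

values = [1, 2, 3, 4]

# Explicit loop over a plain Python list
looped = [v + 10 for v in values]

# Vectorized equivalent: the loop runs inside NumPy's compiled code
arr = np.array(values)
vectorized = arr + 10

print(looped)               # [11, 12, 13, 14]
print(vectorized.tolist())  # [11, 12, 13, 14]

# Aggregations are vectorized as well
print(arr.sum())   # 10
print(arr.mean())  # 2.5
```

Both approaches give the same values; the difference is that the vectorized form avoids per-element work in the Python interpreter, which matters as arrays grow large.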
C. Multidimensional Arrays
NumPy easily handles multi-dimensional arrays, which are crucial for linear algebra and machine learning.
# Create a 2x3 array (2 rows, 3 columns)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(matrix.shape) # Output: (2, 3)
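A few common operations on a two-dimensional array can be sketched as follows. The example redefines the same 2x3 matrix so that it runs on its own; which operations to show is a choice made for illustration.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

# Indexing uses [row, column], both counted from zero
print(matrix[1, 2])  # 6

# Transpose swaps rows and columns
print(matrix.T.shape)  # (3, 2)

# axis=0 aggregates down the rows (one result per column),
# axis=1 aggregates across the columns (one result per row)
print(matrix.sum(axis=0).tolist())  # [5, 7, 9]
print(matrix.sum(axis=1).tolist())  # [6, 15]

# The @ operator performs matrix multiplication: (2x3) @ (3x2) gives 2x2
product = matrix @ matrix.T
print(product.tolist())  # [[14, 32], [32, 77]]
```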
2. Pandas (Data Analysis Library)
Pandas builds on NumPy and provides highly intuitive, labeled data structures designed for manipulating, cleaning, and analyzing tabular data (like spreadsheets or SQL tables).
A. Core Data Structures
Pandas introduces two primary structures:
- Series: A one-dimensional array with explicit labels (an index). Think of it as a single column in a spreadsheet.
- DataFrame: A two-dimensional table with labeled rows (the index) and labeled columns. Each column is a Series. This is the object you will use most often for data analysis.
- Installation:
pip install pandas
- Import Convention:
import pandas as pd # Standard alias
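A Series can also be built directly, which may help make the idea of "an array with labels" concrete. The sketch below uses made-up names and ages as the labels and values.

```python
import pandas as pd

# A Series with explicit string labels as its index
ages = pd.Series([25, 30, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')

# Look up a value by its label rather than its position
print(ages['Bob'])  # 30

# Series support the same kind of vectorized operations as NumPy arrays
print(ages.max())            # 30
print((ages + 1).tolist())   # [26, 31, 23]
```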
B. Creating and Inspecting a DataFrame
DataFrames are often created by loading data from an external source (a CSV file, an Excel workbook, or a SQL database), but they can also be created from dictionaries or NumPy arrays.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['NY', 'LA', 'SF']
}
df = pd.DataFrame(data)
# Accessing a column (Series)
ages = df['Age']
# Viewing the first few rows (essential for data inspection)
print(df.head())
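The other creation routes mentioned above, a NumPy array and a CSV source, might look like the following sketch. The column names and CSV text are invented for the example, and an in-memory buffer stands in for a real file path so the code runs without any file on disk.

```python
import io

import numpy as np
import pandas as pd

# From a NumPy array: column labels must be supplied separately
scores = np.array([[90, 85], [70, 95]])
df_scores = pd.DataFrame(scores, columns=['Math', 'Art'])
print(df_scores.shape)  # (2, 2)

# From CSV text: read_csv accepts a file path or a file-like object
csv_text = "Name,Age,City\nAlice,25,NY\nBob,30,LA\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.columns.tolist())  # ['Name', 'Age', 'City']

# dtypes shows the type Pandas inferred for each column
print(df_csv.dtypes)
```

With a real file, the call would typically take a path string in place of the buffer.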
C. Data Manipulation (Cleaning and Filtering)
Pandas excels at letting you filter, aggregate, and reshape data using simple, expressive syntax.
# Filtering the DataFrame
# Selects all rows where the 'Age' column value is greater than 25
older_than_25 = df[df['Age'] > 25]
print("\nOlder than 25:")
print(older_than_25)
# Output:
# Name Age City
# 1 Bob 30 LA
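Filtering is only one of the operations mentioned above. As a hedged sketch of the others, the example below combines two filter conditions, aggregates with groupby, and sorts the rows. The sales table is hypothetical sample data, not part of the earlier example.

```python
import pandas as pd

# Hypothetical sample data for illustration
sales = pd.DataFrame({
    'City': ['NY', 'LA', 'NY', 'SF'],
    'Amount': [100, 200, 50, 75],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
big_ny = sales[(sales['City'] == 'NY') & (sales['Amount'] > 60)]
print(len(big_ny))  # 1

# Aggregate: total amount per city
totals = sales.groupby('City')['Amount'].sum()
print(totals['NY'])  # 150

# Sort rows by a column, largest first
ranked = sales.sort_values('Amount', ascending=False)
print(ranked.iloc[0]['City'])  # LA
```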
