Data Science involves:
- Collecting data
- Cleaning data
- Analyzing data
- Visualizing insights
Python is widely used in Data Science because of its powerful libraries.
Popular libraries:
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
NumPy Basics
NumPy stands for Numerical Python.
It is used for:
- Arrays
- Mathematical operations
- Scientific computing
Installing NumPy
pip install numpy
Import NumPy
import numpy as np
Creating Arrays
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr)
Output:
[1 2 3 4]
Array Operations
arr = np.array([1, 2, 3])
print(arr + 2)
print(arr * 2)
Output:
[3 4 5]
[2 4 6]
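The same element-by-element rule applies when both operands are arrays of the same shape. A minimal sketch, where the names `a` and `b` and their values are chosen only for illustration:

```python
import numpy as np

# Two arrays of the same length (illustrative values)
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Arithmetic is applied element by element
total = a + b      # [11 22 33]
product = a * b    # [10 40 90]

print(total)
print(product)
print(a.sum())     # sum of all elements: 6
```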
Multi-Dimensional Arrays
matrix = np.array([
    [1, 2],
    [3, 4]
])
print(matrix)
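Once a two-dimensional array exists, its size and individual elements can be inspected. The sketch below reuses the same 2x2 matrix and shows one possible way to read its shape and index into it:

```python
import numpy as np

matrix = np.array([
    [1, 2],
    [3, 4]
])

print(matrix.shape)   # (rows, columns) -> (2, 2)
print(matrix[0, 1])   # row 0, column 1 -> 2
print(matrix[1])      # the whole second row -> [3 4]
```

Indexes start at 0, so `matrix[0, 1]` refers to the first row and the second column.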
Useful NumPy Functions
print(np.zeros((2, 2)))
print(np.ones((3, 3)))
print(np.arange(1, 10))
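To make the behavior of these three functions easier to see, the sketch below stores each result in a variable (the variable names are illustrative) and prints a few of its properties:

```python
import numpy as np

zeros = np.zeros((2, 2))   # 2x2 array filled with 0.0 (floats by default)
ones = np.ones((3, 3))     # 3x3 array filled with 1.0
seq = np.arange(1, 10)     # integers 1 through 9; the stop value 10 is excluded

print(zeros.dtype)   # float64
print(ones.shape)    # (3, 3)
print(seq)           # [1 2 3 4 5 6 7 8 9]
```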
Pandas DataFrames
Pandas is used for:
- Data analysis
- Data manipulation
- Working with tables
Installing Pandas
pip install pandas
Import Pandas
import pandas as pd
Creating DataFrame
import pandas as pd
data = {
    "Name": ["Aditya", "Rahul"],
    "Marks": [90, 85]
}
df = pd.DataFrame(data)
print(df)
Output:
     Name  Marks
0  Aditya     90
1   Rahul     85
Reading CSV File
df = pd.read_csv("students.csv")
print(df.head())
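The contents of students.csv are not shown here, so the sketch below assumes a file with the columns Name and Marks and reads it from an in-memory string instead of from disk. The `csv_text` value is hypothetical:

```python
import io
import pandas as pd

# Hypothetical file contents; a real students.csv may look different
csv_text = "Name,Marks\nAditya,90\nRahul,85\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

Replacing `io.StringIO(csv_text)` with a file path such as "students.csv" reads the same kind of data from disk.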
Viewing Data
print(df.head())
print(df.tail())
df.info()  # info() prints its summary directly and returns None, so print() is not needed
Selecting Columns
print(df["Name"])
Filtering Data
high_marks = df[df["Marks"] > 80]
print(high_marks)
Adding New Column
df["Result"] = "Pass"
print(df)
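Assigning a single value gives every row the same result. A column can also be computed from another column. The sketch below is one possible approach using `np.where`; the pass mark of 40 and the marks themselves are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from a real dataset
df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman"],
    "Marks": [90, 35, 85]
})

PASS_MARK = 40  # assumed threshold

# np.where picks "Pass" where the condition is True, otherwise "Fail"
df["Result"] = np.where(df["Marks"] >= PASS_MARK, "Pass", "Fail")
print(df)
```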
Data Visualization
Data visualization helps you understand patterns and trends in data.
A popular library for this is Matplotlib.
Installing Matplotlib
pip install matplotlib
Import Matplotlib
import matplotlib.pyplot as plt
Line Chart
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Line Chart")
plt.show()
Bar Chart
subjects = ["Python", "Java", "C++"]
marks = [90, 80, 70]
plt.bar(subjects, marks)
plt.show()
Pie Chart
labels = ["Python", "Java", "C++"]
sizes = [50, 30, 20]
plt.pie(sizes, labels=labels)
plt.show()
Histogram
data = [10, 20, 20, 30, 40, 40, 40]
plt.hist(data)
plt.show()
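A histogram groups values into bins and counts how many fall into each one. The sketch below makes that visible by printing the counts that `plt.hist` returns. The choice of 4 bins, the non-interactive Agg backend, and the file name histogram.png are assumptions made for this example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

data = [10, 20, 20, 30, 40, 40, 40]

# bins=4 is an illustrative choice; plt.hist returns the counts, the bin edges, and the drawn bars
counts, edges, _ = plt.hist(data, bins=4)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.savefig("histogram.png")  # hypothetical file name; in an interactive session plt.show() works too

print(counts)  # number of values that fell into each bin
print(edges)   # boundaries of the bins
```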
Data Cleaning
Data cleaning means fixing:
- Missing values
- Duplicate data
- Incorrect formats
Detect Missing Values
print(df.isnull())
print(df.isnull().sum())
Remove Missing Values
df = df.dropna()  # drops every row that contains at least one missing value
Fill Missing Values
df["Marks"] = df["Marks"].fillna(0)
Remove Duplicate Rows
df = df.drop_duplicates()
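The cleaning steps above can be tried on a small table. The sketch below builds illustrative data with one missing mark and one repeated row, then shows dropping and filling as two alternative ways of handling the missing value:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing mark and one duplicated row
df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Rahul", "Aman"],
    "Marks": [90, 85, 85, np.nan]
})

print(df.isnull().sum())   # missing values per column

# Option 1: discard rows that contain a missing value
dropped = df.dropna()

# Option 2: keep the row and replace the missing value with 0
filled = df.copy()
filled["Marks"] = filled["Marks"].fillna(0)

# Remove the repeated "Rahul" row
deduped = filled.drop_duplicates()
print(deduped)
```

Dropping loses the row for Aman entirely, while filling keeps it with a placeholder mark. Which one is appropriate depends on the dataset.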
Rename Columns
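# rename() returns a new DataFrame, so df keeps its "Marks" column for the later examples
renamed_df = df.rename(columns={
    "Marks": "Score"
})
print(renamed_df)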
Change Data Types
# astype(int) raises an error if the column still contains missing values
df["Marks"] = df["Marks"].astype(int)
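Columns read from a file sometimes hold numbers as text, which is one form of incorrect format. The sketch below shows one possible way to handle this with `pd.to_numeric`. The sample values and the choice of 0 as a placeholder are assumptions:

```python
import pandas as pd

# Illustrative data: marks stored as text, with one entry that is not a number
df = pd.DataFrame({"Marks": ["90", "85", "absent"]})

# errors="coerce" turns values that cannot be parsed into NaN
df["Marks"] = pd.to_numeric(df["Marks"], errors="coerce")

# Fill the gap (0 is an assumed placeholder) before converting to int
df["Marks"] = df["Marks"].fillna(0).astype(int)
print(df["Marks"].tolist())
```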
Exploratory Data Analysis (EDA)
EDA is the process of analyzing datasets to discover:
- Patterns
- Trends
- Relationships
Basic Statistics
print(df.describe())
Value Counts
print(df["Result"].value_counts())
Correlation
print(df.corr(numeric_only=True))
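Correlation needs at least two numeric columns to be informative. The sketch below adds a hypothetical "Hours" column with made-up values chosen so that marks rise in a straight line with hours:

```python
import pandas as pd

# Illustrative data; the "Hours" column and all values are hypothetical
df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman"],
    "Hours": [1, 2, 3],
    "Marks": [70, 80, 90]
})

# numeric_only=True skips the text column "Name"
corr = df.corr(numeric_only=True)
print(corr)

# Correlation between two specific columns
hours_vs_marks = df["Hours"].corr(df["Marks"])
print(hours_vs_marks)
```

A value close to 1 means the two columns increase together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship.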
Grouping Data
grouped = df.groupby("Result")["Marks"].mean()
print(grouped)
Sorting Data
sorted_df = df.sort_values(
    by="Marks",
    ascending=False
)
print(sorted_df)
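The counting, grouping, and sorting steps can be run together on one small table. The sketch below uses illustrative marks and results, not data from a real file:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman"],
    "Marks": [90, 30, 80],
    "Result": ["Pass", "Fail", "Pass"]
})

# How many rows have each result
counts = df["Result"].value_counts()
print(counts)

# Average marks within each result group
grouped = df.groupby("Result")["Marks"].mean()
print(grouped)

# Highest marks first
sorted_df = df.sort_values(by="Marks", ascending=False)
print(sorted_df["Name"].tolist())
```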
Export Data
Save CSV
df.to_csv("output.csv", index=False)
Save Excel File
# Writing .xlsx files requires an Excel engine such as openpyxl (pip install openpyxl)
df.to_excel("output.xlsx", index=False)
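To check what an export contains, the data can be written and read back. The sketch below does this with an in-memory buffer in place of a file on disk; the buffer is an assumption made to keep the example self-contained:

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aditya", "Rahul"],
    "Marks": [90, 85]
})

buffer = io.StringIO()           # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)   # index=False leaves out the 0, 1, ... row labels

print(buffer.getvalue())         # the CSV text that would be written to the file

buffer.seek(0)                   # rewind before reading
restored = pd.read_csv(buffer)
print(restored.equals(df))       # True when the round trip preserved the data
```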
Practical Example
Student Data Analysis
import pandas as pd
data = {
"Name": ["Aditya", "Rahul", "Aman"],
"Marks": [90, 75, 85]
}
df = pd.DataFrame(data)
print("Average Marks:")
print(df["Marks"].mean())
print("Highest Marks:")
print(df["Marks"].max())
print("Students with Marks > 80:")
print(df[df["Marks"] > 80])
Advantages of Data Science Libraries
✅ Fast data processing
✅ Powerful analysis tools
✅ Easy visualization
✅ Handles large datasets
✅ Supports machine learning
Summary
In this chapter you learned:
✅ NumPy basics
✅ Pandas DataFrames
✅ Data visualization
✅ Data cleaning
✅ Exploratory Data Analysis (EDA)






