15. Data Science with Python

Data Science involves:

  • Collecting data
  • Cleaning data
  • Analyzing data
  • Visualizing insights

Python is widely used in Data Science because its libraries cover each of these steps, from loading and cleaning data to plotting and modeling.

Popular libraries:

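  • NumPy (arrays and numerical computing)
  • Pandas (tabular data analysis)
  • Matplotlib (plotting)
  • Seaborn (statistical plots built on Matplotlib)
  • Scikit-learn (machine learning)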

NumPy Basics

NumPy stands for:

Numerical Python

Used for:

  • Arrays
  • Mathematical operations
  • Scientific computing

Installing NumPy

pip install numpy

Import NumPy

import numpy as np

Creating Arrays

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr)

Output:

[1 2 3 4]

Array Operations

arr = np.array([1, 2, 3])

print(arr + 2)
print(arr * 2)

Output:

[3 4 5]
[2 4 6]
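
The same element-by-element behaviour applies when both operands are arrays of the same shape. The short sketch below illustrates this and contrasts it with a plain Python list; the variable names are chosen only for this example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Arithmetic between two arrays of the same shape works element by element.
total = a + b        # [11 22 33]
product = a * b      # [10 40 90]

# A plain Python list behaves differently: * repeats the list.
repeated = [1, 2, 3] * 2   # [1, 2, 3, 1, 2, 3]

print(total)
print(product)
print(repeated)
```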

Multi-Dimensional Arrays

matrix = np.array([
    [1, 2],
    [3, 4]
])

print(matrix)
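
A two-dimensional array can be inspected and indexed by row and column. The following is a minimal sketch using the same 2x2 matrix; indexes start at 0.

```python
import numpy as np

matrix = np.array([
    [1, 2],
    [3, 4],
])

print(matrix.shape)   # (2, 2): two rows, two columns
print(matrix.ndim)    # 2: number of dimensions
print(matrix[0, 1])   # 2: row 0, column 1
print(matrix[:, 0])   # [1 3]: every row, first column
```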

Useful NumPy Functions

print(np.zeros((2, 2)))
print(np.ones((3, 3)))
print(np.arange(1, 10))
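
To make the results of these functions easier to see, here is a small sketch that stores each array in a variable. The reshape call at the end is an extra step added for illustration.

```python
import numpy as np

zeros = np.zeros((2, 2))    # 2x2 array filled with 0.0 (float by default)
ones = np.ones((3, 3))      # 3x3 array filled with 1.0
seq = np.arange(1, 10)      # 1 through 9; the stop value 10 is excluded

print(zeros.dtype)          # float64
print(seq)                  # [1 2 3 4 5 6 7 8 9]

# The nine values of seq arranged as 3 rows and 3 columns.
grid = seq.reshape(3, 3)
print(grid)
```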

Pandas DataFrames

Pandas is used for:

  • Data analysis
  • Data manipulation
  • Working with tables

Installing Pandas

pip install pandas

Import Pandas

import pandas as pd

Creating DataFrame

import pandas as pd

data = {
    "Name": ["Aditya", "Rahul"],
    "Marks": [90, 85]
}

df = pd.DataFrame(data)

print(df)

Output:

     Name  Marks
0  Aditya     90
1   Rahul     85

Reading CSV File

df = pd.read_csv("students.csv")

print(df.head())
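
If no students.csv file is at hand, the same call can be tried on text held in memory. The sketch below assumes a hypothetical file with Name and Marks columns; a real file may look different.

```python
import io

import pandas as pd

# Hypothetical file contents, used here in place of a real students.csv.
csv_text = """Name,Marks
Aditya,90
Rahul,85
Aman,
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.dtypes)   # the empty Marks cell is read as a missing value (NaN)
```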

Viewing Data

print(df.head())
print(df.tail())
df.info()  # info() prints its summary itself and returns None

Selecting Columns

print(df["Name"])

Filtering Data

high_marks = df[df["Marks"] > 80]

print(high_marks)
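
Filters can also combine several conditions. This is a minimal sketch on a small made-up DataFrame; the threshold values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman"],
    "Marks": [90, 75, 85],
})

# Wrap each condition in parentheses; use & for "and", | for "or".
mid_range = df[(df["Marks"] > 80) & (df["Marks"] < 90)]
extremes = df[(df["Marks"] >= 90) | (df["Marks"] <= 75)]

print(mid_range)
print(extremes)
```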

Adding New Column

# A single value is assigned to every row of the new column.
df["Result"] = "Pass"

print(df)
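
A new column can also be calculated from an existing one. The sketch below uses np.where for this; the pass mark of 40 and the sample data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman"],
    "Marks": [90, 35, 85],
})

PASS_MARK = 40  # assumed threshold, not taken from any real grading rule

# np.where(condition, value_if_true, value_if_false) works row by row.
df["Result"] = np.where(df["Marks"] >= PASS_MARK, "Pass", "Fail")

print(df)
```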

Data Visualization

Data visualization helps understand patterns and trends.

Popular library:

Matplotlib

Installing Matplotlib

pip install matplotlib

Import Matplotlib

import matplotlib.pyplot as plt

Line Chart

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

plt.plot(x, y)

plt.xlabel("X-axis")
plt.ylabel("Y-axis")

plt.title("Line Chart")

plt.show()

Bar Chart

subjects = ["Python", "Java", "C++"]
marks = [90, 80, 70]

plt.bar(subjects, marks)

plt.show()

Pie Chart

labels = ["Python", "Java", "C++"]
sizes = [50, 30, 20]

plt.pie(sizes, labels=labels)

plt.show()

Histogram

data = [10, 20, 20, 30, 40, 40, 40]

plt.hist(data)

plt.show()
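
Charts can also be saved to a file instead of being shown in a window. The following sketch repeats the histogram with axis labels and a fixed number of bins; the file name histogram.png and the choice of 4 bins are assumptions for this example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt

data = [10, 20, 20, 30, 40, 40, 40]

fig, ax = plt.subplots()

# hist() returns the count per bin and the bin edges.
counts, bin_edges, _ = ax.hist(data, bins=4)

ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram")

fig.savefig("histogram.png")  # hypothetical output file name
plt.close(fig)

print(counts)
print(bin_edges)
```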

Data Cleaning

Data cleaning means fixing:

  • Missing values
  • Duplicate data
  • Incorrect formats

Detect Missing Values

print(df.isnull())
print(df.isnull().sum())

Remove Missing Values

df = df.dropna()

Fill Missing Values

df["Marks"] = df["Marks"].fillna(0)

Remove Duplicate Rows

df = df.drop_duplicates()

Rename Columns

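# rename() returns a new DataFrame; df itself keeps its "Marks" column,
# which the examples that follow still use.
renamed = df.rename(columns={"Marks": "Score"})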

Change Data Types

# Fill or drop missing values first; astype(int) fails on NaN.
df["Marks"] = df["Marks"].astype(int)
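
The cleaning steps above can be combined on one small dataset. This is a sketch with made-up data; filling missing marks with 0 is only one possible choice and may not suit real data.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Rahul", "Aman"],
    "Marks": [90.0, 85.0, 85.0, np.nan],
})

print(raw.isnull().sum())   # Marks has one missing value

# Drop the repeated Rahul row, then work on a copy of the result.
clean = raw.drop_duplicates().copy()

# Fill the missing mark with 0 (an assumption), then convert to integers.
clean["Marks"] = clean["Marks"].fillna(0).astype(int)

print(clean)
```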

Exploratory Data Analysis (EDA)

EDA is the process of analyzing datasets to discover:

  • Patterns
  • Trends
  • Relationships

Basic Statistics

print(df.describe())

Value Counts

print(df["Result"].value_counts())

Correlation

print(df.corr(numeric_only=True))
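
Correlation needs at least two numeric columns. The sketch below adds a hypothetical Hours column to show the idea; the values are made up so that marks rise in step with hours.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman", "Neha"],
    "Hours": [1, 2, 3, 4],        # hypothetical study hours
    "Marks": [50, 60, 70, 80],
})

# numeric_only=True skips the text column "Name".
corr = df.corr(numeric_only=True)

print(corr)
print(corr.loc["Hours", "Marks"])   # close to 1.0: a perfect linear relationship
```

Values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean no linear relationship.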

Grouping Data

grouped = df.groupby("Result")["Marks"].mean()

print(grouped)
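
A group can be summarized with several statistics at once by using agg(). This is a minimal sketch on made-up data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aditya", "Rahul", "Aman", "Neha"],
    "Marks": [90, 35, 85, 30],
    "Result": ["Pass", "Fail", "Pass", "Fail"],
})

# One row per Result value, one column per statistic.
summary = df.groupby("Result")["Marks"].agg(["mean", "max", "count"])

print(summary)
```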

Sorting Data

sorted_df = df.sort_values(by="Marks", ascending=False)

print(sorted_df)

Export Data

Save CSV

df.to_csv("output.csv", index=False)
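
To see what index=False changes, the CSV text can be inspected without writing a file. When no path is given, to_csv() returns the text as a string; the data here is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aditya", "Rahul"],
    "Marks": [90, 85],
})

with_index = df.to_csv()                # row numbers are written as a first column
without_index = df.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)
```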

Save Excel File

# Writing .xlsx files needs an Excel engine such as openpyxl (pip install openpyxl).
df.to_excel("output.xlsx", index=False)

Practical Example

Student Data Analysis

import pandas as pd

data = {
    "Name": ["Aditya", "Rahul", "Aman"],
    "Marks": [90, 75, 85]
}

df = pd.DataFrame(data)

print("Average Marks:")
print(df["Marks"].mean())

print("Highest Marks:")
print(df["Marks"].max())

print("Students with Marks > 80:")
print(df[df["Marks"] > 80])

Advantages of Data Science Libraries

✅ Fast data processing
✅ Powerful analysis tools
✅ Easy visualization
✅ Handles large datasets
✅ Supports machine learning


Summary

In this chapter you learned:

✅ NumPy basics
✅ Pandas DataFrames
✅ Data visualization
✅ Data cleaning
✅ Exploratory Data Analysis (EDA)