Axoudouxou/Shark_bait
Welcome to the Shark Bait Data Cleaning Project!

This project analyzes the Global Shark Attack File dataset, containing over 7,000 shark attack records from 1845 to 2022. The dataset required extensive cleaning due to inconsistent formatting, missing values, and unstructured text data across multiple columns. Using Python, Pandas, and regular expressions, this project demonstrates a systematic approach to data cleaning, transforming a messy dataset with 10,674 missing values into a high-quality, analysis-ready dataset of 6,280 records. The project answers the research question: "Is French Polynesia safe for water activities compared to other locations?"

Original Dataset:

  • 7,050 rows × 24 columns
  • 10,674 missing values
  • Time period: 1845-2022
  • Geographic coverage: 194 countries

Cleaned Dataset:

  • 6,280 rows × 16 columns
  • 0 missing values in critical columns
  • Standardized categorical variables
  • Ready for analysis

Methodology

Step 1: Exploratory Data Analysis

  • Used df.info(), df.head(), and df.describe() to understand the structure
  • Identified 10,674 missing values
  • Found 3 duplicate records
  • Discovered inconsistent formatting in text columns
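The exploratory step can be sketched as below. The miniature DataFrame and its column names are illustrative assumptions, not the real file:

```python
import pandas as pd

# Hypothetical miniature of the raw Global Shark Attack File (column names assumed).
df = pd.DataFrame({
    "Case Number": ["1845.01.01", "1900.05.12", "1900.05.12", None],
    "Age": ["30s", "teen", "teen", "Ca. 33"],
    "Country": ["USA", "AUSTRALIA", "AUSTRALIA", None],
})

df.info()                                   # dtypes and non-null counts per column
total_missing = int(df.isna().sum().sum())  # overall missing-value count
n_duplicates = int(df.duplicated().sum())   # fully identical rows
```

On the real dataset the same two expressions produce the 10,674 missing values and 3 duplicates reported above.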

Step 2: Missing Values Strategy

Decision Rules:

  • Less than 5% missing → drop the affected rows
  • 5-20% missing → fill with 'Unknown'
  • More than 20% missing → create a clean version using regex

Implementation:

  • Dropped 3 rows (Year, Case Number missing)
  • Filled 8 columns with 'Unknown'
  • Created clean versions for Age, Species, Activity
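A minimal sketch of the threshold rules, assuming the stated 5% and 20% cut-offs; the toy columns and their missing rates are invented to exercise each branch:

```python
import numpy as np
import pandas as pd

# Toy frame (column names assumed): Year ~2% missing, Sex ~6%, Activity ~50%.
df = pd.DataFrame({
    "Year": [np.nan if i == 0 else 1900 + i for i in range(50)],
    "Sex": ["M" if i % 2 else (np.nan if i < 5 else "F") for i in range(50)],
    "Activity": [np.nan if i % 2 else "Surfing" for i in range(50)],
})

missing = df.isna().mean()  # fraction missing per column, from the raw frame

for col in df.columns:
    if 0 < missing[col] < 0.05:
        df = df.dropna(subset=[col])          # rule 1: drop the few bad rows
    elif 0.05 <= missing[col] <= 0.20:
        df[col] = df[col].fillna("Unknown")   # rule 2: fill with 'Unknown'
    # rule 3 (>20%): keep the column and build a regex-cleaned companion later
```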

Step 3: Column-by-Column Cleaning

Age Column:

  • Problem: text values ("30s", "teen", "Ca. 33")
  • Solution: regex extraction + categorization
  • Result: 6 age groups (Child, Teen, Young Adult, Adult, Middle Age, Senior)
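One way the regex extraction and binning could look; the text-to-number map and the bin boundaries are assumptions, only the group labels come from the project:

```python
import re
import pandas as pd

ages = pd.Series(["30s", "teen", "Ca. 33", "18", "60s", None])

text_map = {"teen": 15}  # assumed numeric stand-in for common text values

def parse_age(value):
    """Pull the first number out of a free-text age, falling back to a text map."""
    if pd.isna(value):
        return None
    value = str(value).strip().lower()
    if value in text_map:
        return text_map[value]
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None

parsed = ages.map(parse_age).astype(float)
groups = pd.cut(parsed, bins=[0, 12, 19, 29, 44, 59, 120],  # boundaries assumed
                labels=["Child", "Teen", "Young Adult", "Adult", "Middle Age", "Senior"])
```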

Country Column:

  • Problem: typos, continents, and oceans listed as countries
  • Solution: standardization + removal of invalid entries
  • Result: 194 valid countries (removed ~100 invalid rows)
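A sketch of the standardize-then-filter idea; the sample values and the invalid-region set are assumptions (the project's real lookup tables are much longer):

```python
import pandas as pd

countries = pd.Series(["USA", "Usa", " AUSTRALIA ", "PACIFIC OCEAN", "AFRICA", "South Africa"])

# Oceans and continents are not countries (assumed short list for illustration).
invalid = {"PACIFIC OCEAN", "AFRICA"}

cleaned = countries.str.strip().str.upper()   # normalize case/whitespace typos
cleaned = cleaned[~cleaned.isin(invalid)]     # drop rows with invalid regions
```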

Species Column:

  • Problem: 3,000+ unique values
  • Solution: regex patterns to group variations
  • Result: 20 standardized species categories
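The regex grouping might look like the sketch below, with three illustrative patterns standing in for the project's ~20 categories; the sample strings and pattern table are assumptions:

```python
import re
import pandas as pd

species = pd.Series(["White shark, 4 m", "6' tiger shark", "Bull shark involved",
                     "Shark involvement not confirmed"])

# Illustrative pattern table; first match wins.
patterns = {
    "White Shark": r"white\s+shark",
    "Tiger Shark": r"tiger\s+shark",
    "Bull Shark": r"bull\s+shark",
}

def categorize(text):
    if pd.isna(text):
        return "Unknown"
    for label, pattern in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return "Unknown"

categorized = species.map(categorize)
```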

Activity Column:

  • Problem: 3,755 unique activities
  • Solution: regex with negative lookahead
  • Result: 15 meaningful categories
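The negative-lookahead trick keeps "Surf fishing" out of the Surfing bucket; the sample strings are assumptions:

```python
import pandas as pd

activities = pd.Series(["Surfing", "Surf fishing", "Body surfing", "Kite surfing", "Fishing"])

# "surf" only counts when NOT immediately followed by "fishing",
# so "Surf fishing" is excluded while "Body surfing" still matches.
is_surfing = activities.str.contains(r"surf(?!\s*fishing)", case=False, regex=True)
```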

Injury Column:

  • Problem: 3,755 unique descriptions
  • Solution: keyword matching + cross-validation
  • Result: 7 severity levels + 67 inconsistencies fixed
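Keyword matching can be sketched with an ordered rule list where the first match wins; the four rules and sample descriptions below are assumptions, not the project's full 7-level table:

```python
import pandas as pd

injuries = pd.Series(["FATAL", "Lacerations to left leg", "No injury, board bitten",
                      "Minor abrasions to foot"])

# Ordered (keyword, label) rules; first match wins.
rules = [("fatal", "Fatal"), ("no injury", "No Injury"),
         ("abrasion", "Minor"), ("laceration", "Moderate")]

def severity(text):
    lowered = str(text).lower()
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return "Unknown"

levels = injuries.map(severity)
```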

Step 4: Data Quality Verification

  • Removed duplicates
  • Cross-validated Fatal Y/N against Injury
  • Verified 0 missing values in critical columns
  • Created a sequential Case_ID
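The verification step can be sketched with a cross-tabulation, where off-diagonal cells flag contradictions between the fatality flag and the injury text; the toy rows and the correction rule are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Fatal (Y/N)": ["Y", "N", "N", "Y"],
    "Severity": ["Fatal", "Moderate", "Fatal", "Fatal"],  # row 3 contradicts Fatal=N
})

# Cross-tabulate the two columns; off-diagonal cells expose inconsistencies.
table = pd.crosstab(df["Fatal (Y/N)"], df["Severity"])

# Assumed correction rule: trust the injury text when the flag says N but text says Fatal.
mask = (df["Fatal (Y/N)"] == "N") & (df["Severity"] == "Fatal")
df.loc[mask, "Fatal (Y/N)"] = "Y"

df = df.drop_duplicates().reset_index(drop=True)
df["Case_ID"] = range(1, len(df) + 1)  # sequential identifier
```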

Key Findings

Geographic Distribution:

  • USA: 2,394 attacks (38.1%)
  • Australia: 1,360 attacks (21.7%)
  • South Africa: 562 attacks (8.9%)

Most Dangerous Species:

  • White Shark
  • Tiger Shark
  • Bull Shark

Riskiest Activities:

  • Surfing
  • Swimming
  • Spearfishing
French Polynesia Analysis:

  • Total attacks: 37 (0.6% of world total)
  • Conclusion: relatively safe compared to the USA, Australia, and South Africa

Fatality Rate:

  • Fatal: 1,305 (20.8%)
  • Non-fatal: 4,975 (79.2%)
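The country shares behind the research question reduce to a single value_counts call; the toy series below is invented (the real counts are USA 2,394 and French Polynesia 37 of 6,280):

```python
import pandas as pd

# Invented 100-row stand-in for the cleaned Country column.
countries = pd.Series(["USA"] * 38 + ["AUSTRALIA"] * 22 + ["SOUTH AFRICA"] * 9
                      + ["FRENCH POLYNESIA"] * 1 + ["OTHER"] * 30)

share = countries.value_counts(normalize=True)  # fraction of attacks per country
fp_share = share["FRENCH POLYNESIA"]
top_country = share.idxmax()
```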

Main Challenges

  • Challenge 1: Messy Age Data. Problem: "30s", "teen", "Ca. 33". Solution: regex extraction + text mapping.
  • Challenge 2: Species Variations. Problem: 3,000+ values for the same species. Solution: comprehensive regex patterns.
  • Challenge 3: Activity Disambiguation. Problem: "Surf fishing" vs "Surfing". Solution: negative lookahead patterns.
  • Challenge 4: Data Inconsistencies. Problem: Fatal Y/N contradicted Injury. Solution: cross-tabulation + correction.
  • Challenge 5: Missing Data Decisions. Problem: 0.01% to 51% missing per column. Solution: threshold-based rules.

About

Project for week two
