Axoudouxou/Shark_bait
Welcome to the Shark Bait Data Cleaning Project!

This project analyzes the Global Shark Attack File dataset, containing over 7,000 shark attack records from 1845 to 2022. The dataset required extensive cleaning due to inconsistent formatting, missing values, and unstructured text data across multiple columns. Using Python, Pandas, and regular expressions, this project demonstrates a systematic approach to data cleaning, transforming a messy dataset with 10,674 missing values into a high-quality, analysis-ready dataset of 6,280 records. The project answers the research question: "Is French Polynesia safe for water activities compared to other locations?"

Original Dataset:

  • 7,050 rows × 24 columns
  • 10,674 missing values
  • Time period: 1845-2022
  • Geographic coverage: 194 countries

Cleaned Dataset:

  • 6,280 rows × 16 columns
  • 0 missing values in critical columns
  • Standardized categorical variables
  • Ready for analysis

Methodology

Step 1: Exploratory Data Analysis

  • Used df.info(), df.head(), and df.describe() to understand the structure
  • Identified 10,674 missing values
  • Found 3 duplicate records
  • Discovered inconsistent formatting in text columns
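The exploratory step can be sketched as below. The miniature DataFrame and its column names are illustrative assumptions, not the real file:

```python
import pandas as pd

# Hypothetical miniature of the raw Global Shark Attack File (column names assumed).
df = pd.DataFrame({
    "Case Number": ["1845.01.01", "1900.05.12", "1900.05.12", None],
    "Age": ["30s", "teen", "teen", "Ca. 33"],
    "Country": ["USA", "AUSTRALIA", "AUSTRALIA", None],
})

df.info()                                   # dtypes and non-null counts per column
total_missing = int(df.isna().sum().sum())  # overall missing-value count
n_duplicates = int(df.duplicated().sum())   # fully identical rows
```

On the real dataset the same two expressions produce the 10,674 missing values and 3 duplicates reported above.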

Step 2: Missing Values Strategy

Decision Rules:

  • Less than 5% missing → drop the affected rows
  • 5-20% missing → fill with 'Unknown'
  • More than 20% missing → create a clean version using regex

Implementation:

  • Dropped 3 rows (Year, Case Number missing)
  • Filled 8 columns with 'Unknown'
  • Created clean versions for Age, Species, Activity
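A minimal sketch of the threshold rules, assuming the stated 5% and 20% cut-offs; the toy columns and their missing rates are invented to exercise each branch:

```python
import numpy as np
import pandas as pd

# Toy frame (column names assumed): Year ~2% missing, Sex ~6%, Activity ~50%.
df = pd.DataFrame({
    "Year": [np.nan if i == 0 else 1900 + i for i in range(50)],
    "Sex": ["M" if i % 2 else (np.nan if i < 5 else "F") for i in range(50)],
    "Activity": [np.nan if i % 2 else "Surfing" for i in range(50)],
})

missing = df.isna().mean()  # fraction missing per column, from the raw frame

for col in df.columns:
    if 0 < missing[col] < 0.05:
        df = df.dropna(subset=[col])          # rule 1: drop the few bad rows
    elif 0.05 <= missing[col] <= 0.20:
        df[col] = df[col].fillna("Unknown")   # rule 2: fill with 'Unknown'
    # rule 3 (>20%): keep the column and build a regex-cleaned companion later
```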

Step 3: Column-by-Column Cleaning

Age Column:

  • Problem: text values ("30s", "teen", "Ca. 33")
  • Solution: regex extraction + categorization
  • Result: 6 age groups (Child, Teen, Young Adult, Adult, Middle Age, Senior)
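One way the regex extraction and binning could look; the text-to-number map and the bin boundaries are assumptions, only the group labels come from the project:

```python
import re
import pandas as pd

ages = pd.Series(["30s", "teen", "Ca. 33", "18", "60s", None])

text_map = {"teen": 15}  # assumed numeric stand-in for common text values

def parse_age(value):
    """Pull the first number out of a free-text age, falling back to a text map."""
    if pd.isna(value):
        return None
    value = str(value).strip().lower()
    if value in text_map:
        return text_map[value]
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None

parsed = ages.map(parse_age).astype(float)
groups = pd.cut(parsed, bins=[0, 12, 19, 29, 44, 59, 120],  # boundaries assumed
                labels=["Child", "Teen", "Young Adult", "Adult", "Middle Age", "Senior"])
```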

Country Column:

  • Problem: typos, continents, and oceans listed as countries
  • Solution: standardization + removal of invalid entries
  • Result: 194 valid countries (removed ~100 invalid rows)
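A sketch of the standardize-then-filter idea; the sample values and the invalid-region set are assumptions (the project's real lookup tables are much longer):

```python
import pandas as pd

countries = pd.Series(["USA", "Usa", " AUSTRALIA ", "PACIFIC OCEAN", "AFRICA", "South Africa"])

# Oceans and continents are not countries (assumed short list for illustration).
invalid = {"PACIFIC OCEAN", "AFRICA"}

cleaned = countries.str.strip().str.upper()   # normalize case/whitespace typos
cleaned = cleaned[~cleaned.isin(invalid)]     # drop rows with invalid regions
```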

Species Column:

  • Problem: 3,000+ unique values
  • Solution: regex patterns to group variations
  • Result: 20 standardized species categories
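The regex grouping might look like the sketch below, with three illustrative patterns standing in for the project's ~20 categories; the sample strings and pattern table are assumptions:

```python
import re
import pandas as pd

species = pd.Series(["White shark, 4 m", "6' tiger shark", "Bull shark involved",
                     "Shark involvement not confirmed"])

# Illustrative pattern table; first match wins.
patterns = {
    "White Shark": r"white\s+shark",
    "Tiger Shark": r"tiger\s+shark",
    "Bull Shark": r"bull\s+shark",
}

def categorize(text):
    if pd.isna(text):
        return "Unknown"
    for label, pattern in patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return "Unknown"

categorized = species.map(categorize)
```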

Activity Column:

  • Problem: 3,755 unique activities
  • Solution: regex with negative lookahead
  • Result: 15 meaningful categories
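The negative-lookahead trick keeps "Surf fishing" out of the Surfing bucket; the sample strings are assumptions:

```python
import pandas as pd

activities = pd.Series(["Surfing", "Surf fishing", "Body surfing", "Kite surfing", "Fishing"])

# "surf" only counts when NOT immediately followed by "fishing",
# so "Surf fishing" is excluded while "Body surfing" still matches.
is_surfing = activities.str.contains(r"surf(?!\s*fishing)", case=False, regex=True)
```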

Injury Column:

  • Problem: 3,755 unique descriptions
  • Solution: keyword matching + cross-validation
  • Result: 7 severity levels + 67 inconsistencies fixed
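Keyword matching can be sketched with an ordered rule list where the first match wins; the four rules and sample descriptions below are assumptions, not the project's full 7-level table:

```python
import pandas as pd

injuries = pd.Series(["FATAL", "Lacerations to left leg", "No injury, board bitten",
                      "Minor abrasions to foot"])

# Ordered (keyword, label) rules; first match wins.
rules = [("fatal", "Fatal"), ("no injury", "No Injury"),
         ("abrasion", "Minor"), ("laceration", "Moderate")]

def severity(text):
    lowered = str(text).lower()
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return "Unknown"

levels = injuries.map(severity)
```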

Step 4: Data Quality Verification

  • Removed duplicates
  • Cross-validated Fatal Y/N against Injury
  • Verified 0 missing values in critical columns
  • Created a sequential Case_ID
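The verification step can be sketched with a cross-tabulation, where off-diagonal cells flag contradictions between the fatality flag and the injury text; the toy rows and the correction rule are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Fatal (Y/N)": ["Y", "N", "N", "Y"],
    "Severity": ["Fatal", "Moderate", "Fatal", "Fatal"],  # row 3 contradicts Fatal=N
})

# Cross-tabulate the two columns; off-diagonal cells expose inconsistencies.
table = pd.crosstab(df["Fatal (Y/N)"], df["Severity"])

# Assumed correction rule: trust the injury text when the flag says N but text says Fatal.
mask = (df["Fatal (Y/N)"] == "N") & (df["Severity"] == "Fatal")
df.loc[mask, "Fatal (Y/N)"] = "Y"

df = df.drop_duplicates().reset_index(drop=True)
df["Case_ID"] = range(1, len(df) + 1)  # sequential identifier
```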

Key Findings

Geographic Distribution:

  • USA: 2,394 attacks (38.1%)
  • Australia: 1,360 attacks (21.7%)
  • South Africa: 562 attacks (8.9%)

Most Dangerous Species:

  • White Shark
  • Tiger Shark
  • Bull Shark

Riskiest Activities:

  • Surfing
  • Swimming
  • Spearfishing
French Polynesia Analysis:

  • Total attacks: 37 (0.6% of world total)
  • Conclusion: relatively safe compared to the USA, Australia, and South Africa

Fatality Rate:

  • Fatal: 1,305 (20.8%)
  • Non-fatal: 4,975 (79.2%)
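The country shares behind the research question reduce to a single value_counts call; the toy series below is invented (the real counts are USA 2,394 and French Polynesia 37 of 6,280):

```python
import pandas as pd

# Invented 100-row stand-in for the cleaned Country column.
countries = pd.Series(["USA"] * 38 + ["AUSTRALIA"] * 22 + ["SOUTH AFRICA"] * 9
                      + ["FRENCH POLYNESIA"] * 1 + ["OTHER"] * 30)

share = countries.value_counts(normalize=True)  # fraction of attacks per country
fp_share = share["FRENCH POLYNESIA"]
top_country = share.idxmax()
```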

Main Challenges

  • Challenge 1: Messy Age Data. Problem: "30s", "teen", "Ca. 33". Solution: regex extraction + text mapping.
  • Challenge 2: Species Variations. Problem: 3,000+ values for the same species. Solution: comprehensive regex patterns.
  • Challenge 3: Activity Disambiguation. Problem: "Surf fishing" vs "Surfing". Solution: negative lookahead patterns.
  • Challenge 4: Data Inconsistencies. Problem: Fatal Y/N contradicted Injury. Solution: cross-tabulation + correction.
  • Challenge 5: Missing Data Decisions. Problem: 0.01% to 51% missing per column. Solution: threshold-based rules.

About

Project for week two
