Analyzing America's Pastime Through 10 Pivotal Eras (1927-2023)
A comprehensive web scraping and data analysis project that explores how baseball has evolved through its most transformative periods, combining advanced data science techniques with historical storytelling.
This project investigates the fundamental transformation of baseball through three core research questions:
- How have offensive capabilities evolved? From the Murderers' Row era to the Analytics Era
- What factors drive team success? The relationship between individual excellence and team victories
- How do historical events reshape the game? The impact of integration, rule changes, and technological advancement
Rather than random sampling, I strategically selected 10 watershed moments that fundamentally changed baseball:
| Year | Era | Significance |
|---|---|---|
| 1927 | Murderers' Row | Babe Ruth's 60 HR season, Yankees dominance |
| 1947 | Integration Era | Jackie Robinson breaks color barrier |
| 1961 | Expansion Era | Maris breaks Ruth's record, AL expands to 10 teams |
| 1969 | End of Pitcher Era | Mound lowered, strike zone reduced |
| 1994 | Strike Season | Season ended by labor dispute, offensive explosion begins |
| 1998 | Home Run Chase | McGwire vs Sosa, steroid era peak |
| 2001 | Bonds' Peak | 73 HRs, post-9/11 season |
| 2016 | Analytics Era | Cubs break 108-year drought, advanced metrics reshape game |
| 2020 | COVID Season | 60-game season, rule experiments |
| 2023 | Modern Rules | Pitch clock, shift restrictions implemented |
Source: Baseball-Almanac.com historical records
Key Challenges Solved:
- Anti-bot protection: User-agent rotation and intelligent fallback strategies
- Dynamic content: Selenium fallback for JavaScript-rendered pages
- Cross-era consistency: Handling different page structures across decades
- Data validation: Real-time filtering of navigation artifacts and non-baseball content
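The user-agent rotation and fallback strategy listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than the project's scraper: `USER_AGENTS`, `pick_headers`, `polite_delay`, and `fetch_with_fallback` are hypothetical names, and the fetchers are passed in as plain callables so that a requests-based function and a Selenium-based function could be plugged in.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical pool of browser-like User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def pick_headers(rng: random.Random) -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def polite_delay(rng: random.Random, low: float = 1.0, high: float = 3.0) -> float:
    """Return a randomized pause (in seconds) to sleep between requests."""
    return rng.uniform(low, high)


def fetch_with_fallback(
    url: str,
    fetchers: Sequence[Callable[[str, dict], Optional[str]]],
    rng: random.Random,
) -> Optional[str]:
    """Try each fetcher in order and return the first non-None HTML result."""
    for fetch in fetchers:
        try:
            html = fetch(url, pick_headers(rng))
        except Exception:
            html = None  # treat any failure as "blocked" and try the next fetcher
        if html is not None:
            return html
    return None
```

Ordering the fetchers from cheapest to most expensive keeps the slower browser-based path as a last resort, and calling `time.sleep(polite_delay(rng))` between pages would add the randomized delay.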
```python
from typing import Optional

from bs4 import BeautifulSoup


# Method of the scraper class (excerpt)
def scrape_page(self, url: str) -> Optional[BeautifulSoup]:
    """Robust scraping with intelligent fallback."""
    # Try requests first (faster)
    soup = self.scrape_with_requests(url)
    # Fall back to Selenium if the request was blocked
    if soup is None:
        soup = self.scrape_with_selenium(url)
    return soup
```
Major Challenge: Team name inconsistency across datasets
- Problem: Statistical data contained only city names ("New York"), while standings used full team names ("New York Yankees")
- Solution: Historical mapping system accounting for franchise moves and era changes
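One way such a historical mapping could look is a small lookup table keyed by city and year range. The sketch below is an assumption about the approach, not the project's actual table: `TEAM_HISTORY` and `standardize_team` are hypothetical names, and only a handful of American League entries are included for illustration.

```python
from typing import Optional

# (city, first_year, last_year, full team name); years are inclusive and
# 9999 marks an open-ended range. Illustrative excerpt only.
TEAM_HISTORY = [
    ("New York", 1913, 9999, "New York Yankees"),
    ("Philadelphia", 1901, 1954, "Philadelphia Athletics"),
    ("Kansas City", 1955, 1967, "Kansas City Athletics"),
    ("Kansas City", 1969, 9999, "Kansas City Royals"),
    ("Cleveland", 1915, 2021, "Cleveland Indians"),
    ("Cleveland", 2022, 9999, "Cleveland Guardians"),
]


def standardize_team(city: str, year: int) -> Optional[str]:
    """Map a city name and season to the full team name, or None if unknown."""
    key = city.strip().lower()
    for name, start, end, full_name in TEAM_HISTORY:
        if name.lower() == key and start <= year <= end:
            return full_name
    return None


print(standardize_team("Kansas City", 1961))  # Kansas City Athletics
print(standardize_team("Kansas City", 1969))  # Kansas City Royals
```

Returning `None` for unmatched pairs, rather than guessing, leaves those rows available for manual review.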
Before/After Results:
- Raw data: 120 hitting, 80 pitching, 111 standings, and 127 event records
- Cleaned data: Standardized team names, validated statistical ranges, 18 specific event categories
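Statistical range validation of this kind might be implemented along the following lines. The bounds in `VALID_RANGES`, the record layout (`statistic` and `value` keys), and the function names are all assumptions made for the example; the project's real limits are not documented here.

```python
from typing import Iterable, List

# Assumed plausibility bounds per statistic (hypothetical values).
VALID_RANGES = {
    "Batting Average": (0.150, 0.450),
    "Home Runs": (0, 80),
    "RBI": (0, 200),
}


def is_valid_record(record: dict) -> bool:
    """Keep a record only if its category is known and its value is plausible."""
    bounds = VALID_RANGES.get(record.get("statistic"))
    if bounds is None:
        return False  # unknown category, e.g. a scraped navigation link
    try:
        value = float(record.get("value"))
    except (TypeError, ValueError):
        return False  # missing or non-numeric value
    low, high = bounds
    return low <= value <= high


def clean_records(records: Iterable[dict]) -> List[dict]:
    """Return only the records that pass validation, preserving order."""
    return [r for r in records if is_valid_record(r)]
```

A row such as `{"statistic": "Home Runs", "value": "1927"}`, where a year was scraped into the value column, would be dropped by the range check.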
Four integrated analysis views revealing baseball's evolution:
- Offensive Evolution Timeline - Power vs Contact vs Patience trends
- Team Dominance Analysis - Individual performance correlation with team success
- Historical Events Timeline - Context-rich event categorization
- Interactive Data Tables - Filterable, explorable datasets
The most significant finding: Baseball fundamentally shifted from a contact-based game to a power-focused strategy.
Evidence:
- Batting Average: Steady decline from Harry Heilmann's .398 (1927) to .330 (2023)
- Home Runs: Cyclical peaks but consistently higher than early eras
- On Base Percentage: The Moneyball era briefly emphasized patience before the modern game returned to aggression
- RBI: Team-focused approach replaced individual production emphasis
Correlation analysis reveals changing success drivers:
- Early Era (1927-1961): Pitching dominance primary factor
- Modern Era (1998-2023): Balanced excellence across categories
- Overall: 0.67 correlation between statistical leaders and team wins
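The README does not say which correlation measure produced the 0.67 figure. Assuming it is a Pearson coefficient, the calculation could be sketched as below. `pearson_r` is a hypothetical helper, and the sample numbers are made up purely to exercise the function; they are not project data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented example: number of statistical leaders on a team vs. team wins.
leader_counts = [1, 2, 3, 4, 5]
team_wins = [70, 75, 85, 88, 97]
r = pearson_r(leader_counts, team_wins)
print(round(r, 2))
```

With only ten seasons in the sample, a coefficient computed this way is best read as descriptive rather than as evidence of a causal link.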
Major events profoundly shaped statistical trends:
- 1947 Integration: Expanded talent pool, increased competition
- 1998 Peak: Highest event activity (16 major occurrences)
- 2020 COVID: Rule experiments that became permanent fixtures
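Per-year event counts like the 1998 figure depend on each scraped event being categorized and tallied. A rough sketch of keyword-based classification and counting follows. The categories and keywords in `CATEGORY_KEYWORDS` are invented for illustration (the project's 18 categories are not listed here), and `classify_event` and `events_per_year` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical keyword rules; the first matching category wins, so order matters.
CATEGORY_KEYWORDS = {
    "Record": ("record", "home run"),
    "Rule Change": ("rule", "pitch clock", "mound"),
    "Labor": ("strike", "lockout"),
}


def classify_event(text: str) -> str:
    """Assign an event description to the first category whose keyword it contains."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


def events_per_year(events: Iterable[Tuple[int, str]]) -> Counter:
    """Count events per season from (year, description) pairs."""
    return Counter(year for year, _ in events)
```

Simple substring matching is easy to audit but brittle: "strike" appears in both labor stoppages and "strike zone", which is why the rule-change keywords are checked before the labor ones in this sketch.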
```bash
pip install -r requirements.txt
```
Dependencies:
- streamlit>=1.28.0 - Interactive dashboard framework
- selenium>=4.11.0 - Web scraping automation
- pandas>=2.0.0 - Data manipulation and analysis
- plotly>=5.15.0 - Interactive visualizations
- beautifulsoup4>=4.12.0 - HTML parsing
- requests>=2.31.0 - HTTP requests
- Clone the repository
  ```bash
  git clone https://github.com/your-username/mlb-historical-analysis.git
  cd mlb-historical-analysis
  ```
- Create virtual environment
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```
- Run data collection (Optional - cleaned data included)
  ```bash
  python src/scraper.py
  ```
- Process and clean data
  ```bash
  python simple_data_cleaner.py
  python quick_db_import.py
  ```
- Launch interactive dashboard
  ```bash
  streamlit run src/dashboard.py
  ```

```
mlb-historical-analysis/
├── src/
│   ├── scraper.py              # Enhanced Selenium web scraper
│   ├── dashboard.py            # Interactive Streamlit dashboard
│   ├── db_import.py            # Database creation and management
│   └── db_query.py             # Interactive querying tool
├── data/
│   ├── raw/                    # Original scraped data
│   ├── cleaned/                # Processed and standardized data
│   └── mlb_database.db         # SQLite database
├── screenshots/                # Dashboard and analysis screenshots
├── simple_data_cleaner.py      # Streamlined data cleaning script
├── quick_db_import.py          # Fast database setup
├── requirements.txt            # Python dependencies
└── README.md                   # Project documentation
```
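Since the cleaned data ends up in a SQLite file, querying it needs nothing beyond Python's built-in `sqlite3` module. The sketch below uses an in-memory database so it runs without the project files. The `hitting` table, its columns, and the helper names are assumptions for illustration and may not match the schema created by `db_import.py`.

```python
import sqlite3
from typing import Iterable, List, Tuple

# A few well-known seasons used as sample rows.
SAMPLE_ROWS = [
    (1927, "Babe Ruth", "New York Yankees", "Home Runs", 60.0),
    (1927, "Lou Gehrig", "New York Yankees", "Home Runs", 47.0),
    (1961, "Roger Maris", "New York Yankees", "Home Runs", 61.0),
]


def build_demo_db(rows: Iterable[tuple]) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `hitting` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hitting ("
        "year INTEGER, player TEXT, team TEXT, statistic TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO hitting VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def leaders_for_year(
    conn: sqlite3.Connection, year: int, statistic: str
) -> List[Tuple[str, str, float]]:
    """Return (player, team, value) rows for one season, highest value first."""
    cursor = conn.execute(
        "SELECT player, team, value FROM hitting "
        "WHERE year = ? AND statistic = ? ORDER BY value DESC",
        (year, statistic),
    )
    return cursor.fetchall()


conn = build_demo_db(SAMPLE_ROWS)
print(leaders_for_year(conn, 1927, "Home Runs"))
```

Pointing `sqlite3.connect` at `data/mlb_database.db` instead of `":memory:"` would run the same kind of parameterized query against the real file, provided the table and column names are adjusted to the actual schema.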
- Era Selection: Filter analysis by historical periods
- Team Comparison: Multi-team performance tracking
- Statistical Categories: Hitting, pitching, and event type filtering
- Year-over-Year Analysis: Temporal trend exploration
- Home Run Evolution - Power hitting trends across eras with historical context
- Four-Panel Offensive Analysis - Contact vs Power vs Patience vs Production
- Team Success Correlation - Bubble chart linking individual and team performance
- Historical Events Classification - 18 event categories with contextual annotations
- Success Rate: 95% successful page retrievals across 10 target years
- Fallback Usage: 23% of requests required Selenium fallback
- Data Points Collected: 400+ records in total (311 statistical records and 127 historical events)
- Team Name Standardization: 89% automatic matching success
- Statistical Range Validation: Removed outliers and artifacts
- Event Classification: 18 specific categories vs original generic types
- Cross-Reference Verification: Major records validated against known historical facts
- Team name matching across eras requires some manual verification
- Event classification based on text analysis may miss nuanced categories
- Statistical categories vary slightly between different historical periods
- Limited to American League data for consistency
- Robust error handling with intelligent fallback mechanisms
- Anti-detection techniques including user-agent rotation and delay randomization
- Cross-decade consistency handling varied page structures from 1920s to 2020s
- Real-time validation filtering non-baseball content during collection
- Complex entity matching across disparate historical datasets
- Historical context integration with franchise moves and rule changes
- Automated quality assurance with statistical range validation
- Scalable architecture supporting easy expansion to additional years
- Multi-dimensional analysis revealing patterns across multiple statistical categories
- Interactive exploration enabling user-driven discovery
- Historical contextualization connecting statistical trends to cultural events
- Professional presentation suitable for both technical and general audiences
- Comprehensive Coverage: Extend to 50+ years of baseball history
- Advanced Metrics: Include modern sabermetrics (WAR, wOBA, FIP)
- National League: Add NL data for complete league analysis
- Minor Leagues: Incorporate developmental system data
- Predictive Modeling: Use historical trends to forecast future performance
- Geographic Analysis: Map talent distribution and regional trends
- Economic Impact: Correlation with ticket prices, attendance, and revenue
- Social Media Integration: Modern era fan sentiment analysis
- Real-time Updates: Automated scraping for current season data
- Machine Learning: Automated event classification and trend detection
- API Development: Enable third-party access to cleaned datasets
- Mobile Optimization: Responsive dashboard design for all devices
- Complete Pipeline Example: Web scraping → Cleaning → Analysis → Visualization
- Real-world Challenges: Anti-bot protection, entity matching, temporal data consistency
- Industry Best Practices: Error handling, data validation, reproducible research
- Historical Context: Understanding how modern analytics relate to traditional methods
- Strategic Evolution: Evidence-based analysis of changing game philosophy
- Performance Evaluation: Correlation between individual and team success metrics
- Social Change Reflection: How sports mirror broader American cultural shifts
- Technological Impact: Evolution of data collection and analysis capabilities
- Historical Narrative: Connecting statistical trends to significant events
This project demonstrates the power of combining technical data science skills with domain expertise to uncover meaningful insights about cultural and historical phenomena.



