4 changes: 3 additions & 1 deletion .github/workflows/check.yml
@@ -7,10 +7,12 @@ on:

 jobs:
   check:
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
       - uses: actions/setup-python@v4
+        with:
+          python-version: "3.9"
       - name: ruff
         run: |
           pip install ruff
4 changes: 4 additions & 0 deletions .gitignore
@@ -162,3 +162,7 @@ cython_debug/

 # Vercel
 .vercel/
+
+# OpenMemory - IDE/Assistant specific rules
+.cursor/rules/openmemory.mdc
+.DS_Store
36 changes: 36 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,36 @@
# Changelog

All notable changes to this project will be documented in this file.

## [Unreleased]

### Added
- **National Teams Support**: Full support for national teams across all club endpoints
  - Added `isNationalTeam` boolean field to Club Profile response schema
  - Club Profile endpoint now automatically detects and indicates if a club is a national team
  - Club Players endpoint now supports fetching players from national teams
  - Club Competitions endpoint supports national teams

- **New Endpoint**: `/clubs/{club_id}/competitions`
  - Retrieves all competitions a club participates in for a given season
  - Supports both regular clubs and national teams
  - Returns competition ID, name, and URL for each competition
  - Defaults to current season if `season_id` is not provided

### Changed
- **Club Profile Schema**: Made several fields optional to accommodate national teams
  - `stadium_name`, `stadium_seats`, `current_transfer_record` are now optional
  - `squad.national_team_players` is now optional
  - `league` fields can now be `None` for national teams

- **Club Players Endpoint**: Enhanced to handle different HTML structures
  - Automatically detects national teams and uses appropriate parsing logic
  - Handles different table structures between clubs and national teams
  - Ensures all player data lists are properly aligned for correct parsing

### Technical Details
- Updated XPath expressions to support both club and national team HTML structures
- Added intelligent detection logic for national teams based on HTML structure
- Improved error handling and data validation for edge cases
- All changes maintain backward compatibility with existing club endpoints
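
As a rough illustration of how a client might address the new `/clubs/{club_id}/competitions` endpoint, the sketch below only builds the request URL. The helper name `club_competitions_url` is hypothetical, and the base URL is an assumption taken from the local address shown in the README; `season_id` is passed as a query parameter and omitted when not given, in which case the changelog says the current season is used.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def club_competitions_url(base_url: str, club_id: str, season_id: Optional[str] = None) -> str:
    """Build the URL for the club competitions endpoint (hypothetical helper)."""
    # Percent-encode the club id so it is safe to place in the path.
    url = f"{base_url.rstrip('/')}/clubs/{quote(club_id, safe='')}/competitions"
    # Only send season_id when the caller provides one; the API then
    # falls back to the current season, per the changelog entry.
    if season_id is not None:
        url += "?" + urlencode({"season_id": season_id})
    return url


if __name__ == "__main__":
    # Assumed local base URL; adjust for a deployed instance.
    print(club_competitions_url("http://localhost:8000", "131"))
    print(club_competitions_url("http://localhost:8000", "131", season_id="2024"))
```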

5 changes: 4 additions & 1 deletion Dockerfile
@@ -1,13 +1,16 @@
 FROM python:3.9-slim-bullseye

 ENV PYTHONUNBUFFERED=1
-ENV PYTHONPATH "${PYTHONPATH}:/app"
+ENV PYTHONPATH=/app

 WORKDIR /app
 COPY requirements.txt ./

 RUN pip install --no-cache-dir -r requirements.txt

+# Install playwright browsers if playwright is in requirements
+RUN python -c "import playwright" 2>/dev/null && playwright install chromium || true
+
 COPY . ./

 CMD ["python", "app/main.py"]
126 changes: 126 additions & 0 deletions RAILWAY_ENV_CONFIG.md
@@ -0,0 +1,126 @@
# Railway Environment Variables Configuration

This document describes the environment variables to set in Railway for optimal anti-scraping protection.

## Anti-Scraping Configuration

### Session Management
```bash
SESSION_TIMEOUT=3600 # Session timeout in seconds (1 hour)
MAX_SESSIONS=50 # Maximum number of concurrent sessions
MAX_CONCURRENT_REQUESTS=10 # Maximum concurrent requests per session
```

### Request Delays (Anti-Detection)
```bash
REQUEST_DELAY_MIN=1.0 # Minimum delay between requests (seconds)
REQUEST_DELAY_MAX=3.0 # Maximum delay between requests (seconds)
ENABLE_BEHAVIORAL_SIMULATION=false # Behavioral simulation (upcoming feature)
```
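
A minimal sketch of how the two delay settings could be consumed, assuming the service picks a uniformly random pause between them before each request. The function name `pick_request_delay` and the uniform distribution are assumptions for illustration; the source does not show how the delays are actually applied.

```python
import os
import random


def pick_request_delay(env=None, rng=None):
    """Return a random delay in seconds between REQUEST_DELAY_MIN and REQUEST_DELAY_MAX."""
    env = os.environ if env is None else env
    rng = random if rng is None else rng
    # Defaults mirror the example values documented above.
    low = float(env.get("REQUEST_DELAY_MIN", "1.0"))
    high = float(env.get("REQUEST_DELAY_MAX", "3.0"))
    # Tolerate a swapped configuration instead of failing.
    if low > high:
        low, high = high, low
    return rng.uniform(low, high)


if __name__ == "__main__":
    print(f"Would sleep for {pick_request_delay():.2f} seconds")
```

A caller would typically pass the result to `time.sleep()` before issuing the next request.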

### Rate Limiting (Existing)
```bash
RATE_LIMITING_ENABLE=false
RATE_LIMITING_FREQUENCY=2/3seconds
```

## Proxy Configuration

### Bright Data / Oxylabs Residential Proxies
```bash
PROXY_HOST=your-proxy-host.brightdata.com
PROXY_PORT=22225
PROXY_USERNAME=your-brightdata-username
PROXY_PASSWORD=your-brightdata-password
```

### Alternative: Multiple Proxy URLs
```bash
PROXY_URL_1=http://username:password@proxy1.example.com:8080
PROXY_URL_2=http://username:password@proxy2.example.com:8080
# ... up to PROXY_URL_10
```
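
The two proxy styles above could be merged into a single list along these lines. This is a sketch under stated assumptions: the helper name `collect_proxy_urls`, the `http://` scheme for the host/port variant, and the ordering (host/port entry first) are all illustrative and not taken from the project code.

```python
import os
from urllib.parse import quote


def collect_proxy_urls(env=None, max_urls=10):
    """Gather proxy URLs from PROXY_HOST/PROXY_PORT and PROXY_URL_1..PROXY_URL_N."""
    env = os.environ if env is None else env
    urls = []

    host = env.get("PROXY_HOST")
    port = env.get("PROXY_PORT")
    if host and port:
        user = env.get("PROXY_USERNAME", "")
        password = env.get("PROXY_PASSWORD", "")
        # Percent-encode credentials so characters like "@" or ":" stay valid.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@" if user else ""
        # Assumption: the proxy is reached over plain HTTP.
        urls.append(f"http://{auth}{host}:{port}")

    # Numbered URLs are used as given; missing numbers are skipped.
    for index in range(1, max_urls + 1):
        url = env.get(f"PROXY_URL_{index}")
        if url:
            urls.append(url)
    return urls


if __name__ == "__main__":
    print(collect_proxy_urls())
```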

## Browser Scraping Configuration

### Playwright Browser Settings
```bash
ENABLE_BROWSER_SCRAPING=true # Enable browser fallback
BROWSER_TIMEOUT=30000 # Timeout in milliseconds
BROWSER_HEADLESS=true # Run the browser in headless mode
```

## Existing Configuration (Tournament Sizes)

```bash
TOURNAMENT_SIZE_FIWC=32
TOURNAMENT_SIZE_EURO=24
TOURNAMENT_SIZE_COPA=12
TOURNAMENT_SIZE_AFAC=24
TOURNAMENT_SIZE_GOCU=16
TOURNAMENT_SIZE_AFCN=24
```

## Implementation Guide

1. **Start with the basic configuration** (without proxies)
2. **Monitor performance** with the new endpoints
3. **Adjust delays** based on success rates
4. **Track blocks** via the monitoring data

## Monitoring Endpoints

The API now includes comprehensive monitoring:

### `/health`
- **Purpose:** Basic health check
- **Returns:** Service status

### `/monitoring/anti-scraping`
- **Purpose:** Complete anti-scraping statistics
- **Metrics:**
  - Success rate (%)
  - Blocks detected
  - Average response time
  - Retries performed
  - Sessions created
  - Uptime

### `/monitoring/session`
- **Purpose:** Session manager statistics
- **Metrics:** Active/expired sessions, proxies, user agents

### `/monitoring/retry`
- **Purpose:** Retry configuration
- **Metrics:** Retry settings and performance

### Example monitoring data:
```json
{
  "uptime_seconds": 3600.5,
  "requests_total": 150,
  "requests_successful": 142,
  "success_rate_percent": 94.67,
  "blocks_detected": 3,
  "block_rate_percent": 2.0,
  "avg_response_time_seconds": 1.234,
  "session_manager_stats": {
    "active_sessions": 5,
    "user_agents_available": 12
  }
}
```
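
The percentage fields in the sample are consistent with simple ratios over `requests_total`. The sketch below reproduces them under that assumption; the function name `summarize_requests` is hypothetical, and the actual service may compute these values differently.

```python
def summarize_requests(requests_total, requests_successful, blocks_detected):
    """Derive success and block rates (in percent) from raw counters."""
    # Avoid division by zero before any request has been made.
    if requests_total == 0:
        return {"success_rate_percent": 0.0, "block_rate_percent": 0.0}
    return {
        "success_rate_percent": round(100 * requests_successful / requests_total, 2),
        "block_rate_percent": round(100 * blocks_detected / requests_total, 2),
    }


if __name__ == "__main__":
    # Counters from the sample payload: 150 total, 142 successful, 3 blocks.
    print(summarize_requests(150, 142, 3))
    # → {'success_rate_percent': 94.67, 'block_rate_percent': 2.0}
```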

## Monitoring

The API now includes session statistics, accessible via:
- `TransfermarktBase.get_session_stats()` - Python API
- Can be integrated as an endpoint for monitoring

## Proxy Service Recommendations

1. **Bright Data / Oxylabs** - Best residential proxies (~$500/month)
2. **Smart Proxy** - Cheaper alternative (~$300/month)
3. **ProxyMesh** - Good for testing (~$100/month)

Start with one month and test before committing to a permanent setup.
8 changes: 8 additions & 0 deletions README.md
@@ -60,3 +60,11 @@ $ open http://localhost:8000/
 |---------------------------|-----------------------------------------------------------|--------------|
 | `RATE_LIMITING_ENABLE` | Enable rate limiting feature for API calls | `false` |
 | `RATE_LIMITING_FREQUENCY` | Delay allowed between each API call. See [slowapi](https://slowapi.readthedocs.io/en/latest/) for more | `2/3seconds` |
+| `TOURNAMENT_SIZE_FIWC` | Expected number of participants for World Cup | `32` |
+| `TOURNAMENT_SIZE_EURO` | Expected number of participants for UEFA Euro | `24` |
+| `TOURNAMENT_SIZE_COPA` | Expected number of participants for Copa America | `12` |
+| `TOURNAMENT_SIZE_AFAC` | Expected number of participants for AFC Asian Cup | `24` |
+| `TOURNAMENT_SIZE_GOCU` | Expected number of participants for Gold Cup | `16` |
+| `TOURNAMENT_SIZE_AFCN` | Expected number of participants for Africa Cup of Nations | `24` |
+
+**Note:** Tournament size variables are used to limit participant lists for national team competitions, excluding non-qualified teams. If a tournament size is not configured or becomes outdated, a warning will be logged and all participants will be included without truncation.
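
The behaviour described in the note might look roughly like the sketch below. The helper name `limit_participants` is hypothetical, and only the "not configured" case is covered: the source does not say how an outdated size is detected, so that branch is left out rather than guessed.

```python
import logging
import os

logger = logging.getLogger(__name__)


def limit_participants(competition_id, participants, env=None):
    """Truncate a participant list to TOURNAMENT_SIZE_<ID>, if configured."""
    env = os.environ if env is None else env
    raw = env.get(f"TOURNAMENT_SIZE_{competition_id.upper()}")
    # Missing or non-numeric value: warn and return everyone untruncated.
    if raw is None or not raw.isdigit():
        logger.warning(
            "No tournament size configured for %s; returning all participants",
            competition_id,
        )
        return list(participants)
    return list(participants[: int(raw)])


if __name__ == "__main__":
    teams = [f"team-{n}" for n in range(40)]
    print(len(limit_participants("FIWC", teams, {"TOURNAMENT_SIZE_FIWC": "32"})))
```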
25 changes: 21 additions & 4 deletions app/api/endpoints/clubs.py
@@ -3,6 +3,7 @@
 from fastapi import APIRouter

 from app.schemas import clubs as schemas
+from app.services.clubs.competitions import TransfermarktClubCompetitions
 from app.services.clubs.players import TransfermarktClubPlayers
 from app.services.clubs.profile import TransfermarktClubProfile
 from app.services.clubs.search import TransfermarktClubSearch
@@ -12,9 +13,16 @@

 @router.get("/search/{club_name}", response_model=schemas.ClubSearch, response_model_exclude_none=True)
 def search_clubs(club_name: str, page_number: Optional[int] = 1) -> dict:
-    tfmkt = TransfermarktClubSearch(query=club_name, page_number=page_number)
-    found_clubs = tfmkt.search_clubs()
-    return found_clubs
+    try:
+        tfmkt = TransfermarktClubSearch(query=club_name, page_number=page_number)
+        found_clubs = tfmkt.search_clubs()
+        # Validate we got actual results
+        if not found_clubs.get("results"):
+            print(f"Warning: No results found for club search: {club_name} (page {page_number})")
+        return found_clubs
+    except Exception as e:
+        print(f"Error in search_clubs for {club_name}: {e}")
+        raise


 @router.get("/{club_id}/profile", response_model=schemas.ClubProfile, response_model_exclude_defaults=True)
@@ -26,6 +34,15 @@ def get_club_profile(club_id: str) -> dict:

 @router.get("/{club_id}/players", response_model=schemas.ClubPlayers, response_model_exclude_defaults=True)
 def get_club_players(club_id: str, season_id: Optional[str] = None) -> dict:
-    tfmkt = TransfermarktClubPlayers(club_id=club_id, season_id=season_id)
+    # Let TransfermarktClubPlayers use its DOM heuristics to detect national teams
+    # This avoids an extra HTTP request and relies on the players page structure
+    tfmkt = TransfermarktClubPlayers(club_id=club_id, season_id=season_id, is_national_team=None)
     club_players = tfmkt.get_club_players()
     return club_players
+
+
+@router.get("/{club_id}/competitions", response_model=schemas.ClubCompetitions, response_model_exclude_defaults=True)
+def get_club_competitions(club_id: str, season_id: Optional[str] = None) -> dict:
+    tfmkt = TransfermarktClubCompetitions(club_id=club_id, season_id=season_id)
+    club_competitions = tfmkt.get_club_competitions()
+    return club_competitions
23 changes: 19 additions & 4 deletions app/api/endpoints/competitions.py
@@ -5,19 +5,34 @@
 from app.schemas import competitions as schemas
 from app.services.competitions.clubs import TransfermarktCompetitionClubs
 from app.services.competitions.search import TransfermarktCompetitionSearch
+from app.services.competitions.seasons import TransfermarktCompetitionSeasons

 router = APIRouter()


 @router.get("/search/{competition_name}", response_model=schemas.CompetitionSearch)
 def search_competitions(competition_name: str, page_number: Optional[int] = 1):
-    tfmkt = TransfermarktCompetitionSearch(query=competition_name, page_number=page_number)
-    competitions = tfmkt.search_competitions()
-    return competitions
+    try:
+        tfmkt = TransfermarktCompetitionSearch(query=competition_name, page_number=page_number)
+        competitions = tfmkt.search_competitions()
+        # Validate we got actual results
+        if not competitions.get("results"):
+            print(f"Warning: No results found for competition search: {competition_name} (page {page_number})")
+        return competitions
+    except Exception as e:
+        print(f"Error in search_competitions for {competition_name}: {e}")
+        raise


-@router.get("/{competition_id}/clubs", response_model=schemas.CompetitionClubs)
+@router.get("/{competition_id}/clubs", response_model=schemas.CompetitionClubs, response_model_exclude_none=True)
 def get_competition_clubs(competition_id: str, season_id: Optional[str] = None):
     tfmkt = TransfermarktCompetitionClubs(competition_id=competition_id, season_id=season_id)
     competition_clubs = tfmkt.get_competition_clubs()
     return competition_clubs
+
+
+@router.get("/{competition_id}/seasons", response_model=schemas.CompetitionSeasons, response_model_exclude_none=True)
+def get_competition_seasons(competition_id: str):
+    tfmkt = TransfermarktCompetitionSeasons(competition_id=competition_id)
+    competition_seasons = tfmkt.get_competition_seasons()
+    return competition_seasons