Fix DOB/age parsing for player profiles#113
Fix DOB/age parsing for player profiles#113KernowMaiden wants to merge 3 commits intofelipeall:mainfrom
Conversation
WalkthroughReplaced strict DOB regex with a flexible dd[./-]dd[./-]yyyy pattern (optional parenthesized 1–2 digit age) and added DOB parsing utilities plus two optional DOB fields to player profile schemas; profile parsing now stores raw DOB and ISO-normalized DOB using the new utilities. Changes
Sequence Diagram(s)sequenceDiagram
participant S as Scraper / Source
participant P as profile parser
participant U as clean_dob util
participant D as PlayerProfile
S->>P: provide raw DOB string (e.g. "12/05/1990 (34)")
P->>P: trim & remove trailing age marker (if present)
P->>U: clean_dob(raw_trimmed)
U-->>P: ISO date string or null
P->>D: set dateOfBirthRaw = original trimmed (or None)
P->>D: set dateOfBirth = ISO (or None)
Note over P,D: TransfermarktPlayerProfile used in overall mapping
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
app/schemas/players/profile.py (1)
63-64: Type normalized DOB as a date; keep raw as string.For consistency with other fields (e.g., joined/contract_expires), prefer the normalized value to be a date. Keep Raw as string.
If API clients expect a string, keep as-is; otherwise:
- dateOfBirth: Optional[str] = None + dateOfBirth: Optional[date] = None dateOfBirthRaw: Optional[str] = NoneAdditionally, consider deprecating
date_of_birthor wiring an alias to avoid duplicate fields diverging. If aliases are used in TransfermarktBaseModel, ensure both serialize consistently.app/utils/regex.py (1)
6-6: Tighten DOB regex with boundaries and valid day/month ranges.Reduce false positives and keep permissive matching without anchors.
-REGEX_DOB_AGE: str = r"(?P<dob>\d{1,2}[./-]\d{1,2}[./-]\d{4})(?:\s*\((?P<age>\d{1,2})\))?" +REGEX_DOB_AGE: str = ( + r"\b(?P<dob>(0?[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[0-2])[./-](19|20)\d{2})\b" + r"(?:\s*\((?P<age>\d{1,2})\))?" +)app/services/players/profile.py (3)
5-7: Avoid regex drift: compile fallback from the shared constant.Using a local hardcoded pattern can diverge from app.utils.regex.REGEX_DOB_AGE over time.
If import order allows, prefer:
# after importing REGEX_DOB_AGE _DOB_FALLBACK = re.compile(REGEX_DOB_AGE)If needed, move the REGEX_DOB_AGE import above and initialize _DOB_FALLBACK there.
8-22: Type hint portability + narrower exception.
- Use Optional[str] for broader Python compatibility if <3.10 is supported.
- Catch ValueError instead of bare Exception.
-def clean_dob(raw_text: str) -> str | None: +from typing import Optional +def clean_dob(raw_text: Optional[str]) -> Optional[str]: @@ - except Exception: + except ValueError: return NoneIf you later switch schema dateOfBirth to a date type, have clean_dob return a date object and let Pydantic serialize it.
90-101: Reuse cached text; avoid second XPath; handle age in one place.You already have raw_dob_text. Use it for age extraction and drop the extra DOM read later.
-raw_dob_text = self.get_text_by_xpath(Players.Profile.DATE_OF_BIRTH_AGE) +raw_dob_text = self.get_text_by_xpath(Players.Profile.DATE_OF_BIRTH_AGE) @@ -# Store both versions -self.response["dateOfBirthRaw"] = raw_dob_only # "24/06/1987" -self.response["dateOfBirth"] = clean_dob(raw_dob_only) # "1987-06-24" +# Store both versions +self.response["dateOfBirthRaw"] = raw_dob_only # "24/06/1987" +self.response["dateOfBirth"] = clean_dob(raw_dob_only) # "1987-06-24" +# Extract age from the same text to avoid a second XPath query +self.response["age"] = safe_regex(raw_dob_text, REGEX_DOB_AGE, "age")Then remove the age block at Lines 106–110.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
app/schemas/players/profile.py(1 hunks)app/services/players/profile.py(2 hunks)app/utils/regex.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
app/services/players/profile.py (2)
app/services/base.py (1)
get_text_by_xpath(152-206)app/utils/xpath.py (5)
Players(1-102)Players(148-177)Profile(18-57)Profile(106-138)Profile(181-183)
🔇 Additional comments (1)
app/utils/regex.py (1)
6-6: safe_regex correctly uses re.search
The implementation in app/utils/utils.py (line 81) calls re.search, so the unanchored DOB/age pattern will match anywhere in the string as intended.
| def clean_dob(raw_text: str) -> str | None: | ||
| """ | ||
| Find a DD[./-]MM[./-]YYYY in the input and return ISO YYYY-MM-DD. | ||
| Returns None if not found or parsing fails. | ||
| """ | ||
| if not raw_text: | ||
| return None | ||
| m = _DOB_FALLBACK.search(raw_text) | ||
| if not m: | ||
| return None | ||
| dob = m.group("dob").replace("-", ".").replace("/", ".") # unify to dots | ||
| try: | ||
| return datetime.strptime(dob, "%d.%m.%Y").strftime("%Y-%m-%d") | ||
| except Exception: | ||
| return None |
There was a problem hiding this comment.
🧩 Analysis chain
Quick sanity checks for DOB parsing.
Add a tiny table‑driven test to guard formats you intend to support.
Also confirm the repo’s Python requirement to validate str | None usage vs Optional:
Also applies to: 90-101
🏁 Script executed:
#!/bin/bash
python - << 'PY'
from app.services.players.profile import clean_dob
samples = [
"24/06/1987 (38)",
"1-2-2000 (5)",
"03.11.1999",
None,
"",
]
for s in samples:
print(repr(s), "->", clean_dob(s))
PYLength of output: 353
🏁 Script executed:
#!/bin/bash
rg -n "requires-python|python_requires" -C2 pyproject.toml setup.cfg setup.py || trueLength of output: 98
🏁 Script executed:
#!/bin/bash
python3 - << 'PY'
import sys, types
# Stub bs4 to avoid ModuleNotFoundError
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
# Stub app.services.base before importing profile to bypass its imports
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod
from app.services.players.profile import clean_dob
samples = [
"24/06/1987 (38)",
"1-2-2000 (5)",
"03.11.1999",
None,
"",
]
for s in samples:
print(f"{s!r} -> {clean_dob(s)}")
PYLength of output: 168
🏁 Script executed:
#!/bin/bash
# Test clean_dob outputs with proper f-string and stubbed imports
python3 - << 'PY'
import sys, types
# Stub missing modules
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod
from app.services.players.profile import clean_dob
samples = ["24/06/1987 (38)", "1-2-2000 (5)", "03.11.1999", None, ""]
for s in samples:
print(f"{s!r} -> {clean_dob(s)}")
PY
# Search for Python version requirements
rg -n "requires-python|python_requires|requires_python|requires-python" pyproject.toml || echo "No python requirement found"Length of output: 196
🏁 Script executed:
#!/bin/bash
# 1. Test clean_dob outputs with correct f-string
python3 - << 'PY'
import sys, types
# Stub missing modules
sys.modules['bs4'] = types.ModuleType('bs4')
sys.modules['bs4'].BeautifulSoup = lambda *args, **kwargs: None
base_mod = types.ModuleType('app.services.base')
base_mod.TransfermarktBase = type('TransfermarktBase', (), {})
sys.modules['app.services.base'] = base_mod
from app.services.players.profile import clean_dob
samples = ["24/06/1987 (38)", "1-2-2000 (5)", "03.11.1999", None, ""]
for s in samples:
print(f"{s!r} -> {clean_dob(s)}")
PY
# 2. Inspect Python version requirement in pyproject.toml
sed -n '1,200p' pyproject.tomlLength of output: 1508
Align type annotations with project’s Python requirement
- The pyproject.toml pins
python = "^3.9", but PEP 604 union syntax (str | None) isn’t valid until Python 3.10+. Either raise the requirement to^3.10in pyproject.toml or revert toOptional[str](withfrom typing import Optional) inapp/services/players/profile.py. - Add a small, table-driven pytest for
clean_dobcovering “24/06/1987 (38)” →"1987-06-24", “1-2-2000 (5)” →"2000-02-01", “03.11.1999” →"1999-11-03", plusNoneand empty string cases.
🤖 Prompt for AI Agents
In app/services/players/profile.py around lines 8 to 22, the function uses PEP
604 union syntax `str | None` which is invalid under the project's Python 3.9
constraint; replace the return annotation with Optional[str] and add `from
typing import Optional` at the top of the file, keeping behavior unchanged. Also
add a small table-driven pytest (e.g., in tests/unit/test_profile.py) that
parametrizes inputs and expected outputs for cases: "24/06/1987 (38)" ->
"1987-06-24", "1-2-2000 (5)" -> "2000-02-01", "03.11.1999" -> "1999-11-03", None
-> None, and "" -> None to ensure parsing and edge cases are covered.
Summary by CodeRabbit
New Features
Bug Fixes