I'd split all these little checks and bits for cleaning the text into separate functions for readability + testability. For example, this one could be something like this:
from bs4 import BeautifulSoup


def get_year(soup: BeautifulSoup) -> str:
    # Check whether a 'date' tag with date-type 'accepted' exists, and whether it contains a 'year' tag
    # NOTE: no check for unicode, hex codes or XML tags
    if date := soup.find("date", {"date-type": "accepted"}):
        if year := date.find("year"):
            # Extract the text content of the 'year' tag if found
            return year.text
    # If the 'accepted' date or 'year' is missing, return an empty string
    return ""
(I'm not saying you have to do that on this PR -- just food for thought)
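In the same spirit, the text cleaning (the unicode / hex code / XML tag handling that the snippet above skips) could live in its own small helper. A rough sketch using only the standard library: `clean_text` is a hypothetical name, not something that exists in the PR, and the regex-based tag stripping is a simplification rather than a full XML parse.

```python
import html
import re

# Matches anything that looks like an XML/HTML tag, e.g. <italic> or </sup>
_TAG_RE = re.compile(r"<[^>]+>")
# Matches runs of whitespace (spaces, tabs, newlines)
_WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Hypothetical helper: strip leftover tags, decode entities, collapse whitespace."""
    # Remove tags first, so that escaped characters such as &lt; are not
    # mistaken for markup once they have been decoded
    without_tags = _TAG_RE.sub(" ", raw)
    # Decode named and numeric (decimal or hex) character references
    decoded = html.unescape(without_tags)
    # Collapse whitespace runs and trim the ends
    return _WS_RE.sub(" ", decoded).strip()
```

A function like this is easy to cover with a couple of plain asserts, which is most of the point of splitting it out.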
Originally posted by @alexdewar in #263 (comment)