Skip to content

docs: replace deprecated financial_phrasebank dataset in IA3 tutorial#3058

Merged
BenjaminBossan merged 5 commits intohuggingface:mainfrom
dhruvildarji:fix/ia3-tutorial-dataset
Mar 17, 2026
Merged

docs: replace deprecated financial_phrasebank dataset in IA3 tutorial#3058
BenjaminBossan merged 5 commits intohuggingface:mainfrom
dhruvildarji:fix/ia3-tutorial-dataset

Conversation

@dhruvildarji
Copy link
Copy Markdown
Contributor

Summary

Fixes #2998

The financial_phrasebank dataset fails to load with recent versions of the datasets library because it relied on a deprecated loading script format. This PR replaces it with zeroshot/twitter-financial-news-sentiment, a financial sentiment dataset that uses the modern parquet format.

Changes

  • docs/source/task_guides/ia3.md: updated dataset reference, description, and text_column from "sentence" to "text" to match the new dataset's column name
  • examples/conditional_generation/peft_adalora_seq2seq.py: same dataset and column name updates

The replacement dataset (zeroshot/twitter-financial-news-sentiment) was tested and confirmed working by @maerory in the issue thread.

Test plan

  • Updated dataset loads successfully with load_dataset("zeroshot/twitter-financial-news-sentiment")
  • Column name updated from sentence to text throughout affected files
  • Label names still auto-detected from features["label"].names (no hardcoded changes needed)

…news-sentiment

The financial_phrasebank dataset fails to load with recent versions of
the datasets library due to deprecation of loading scripts. Replace it
with zeroshot/twitter-financial-news-sentiment which is a compatible
financial sentiment dataset available in the new parquet format.

Updated files:
- docs/source/task_guides/ia3.md: update dataset reference and text column
- examples/conditional_generation/peft_adalora_seq2seq.py: update dataset and column name

Fixes huggingface#2998

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eneration notebooks

The financial_phrasebank dataset fails to load with recent versions of
the datasets library. Replace it with zeroshot/twitter-financial-news-sentiment
and update text_column from "sentence" to "text" to match the new dataset schema.

Updated files:
- examples/conditional_generation/peft_ia3_seq2seq.ipynb
- examples/conditional_generation/peft_lora_seq2seq.ipynb
- examples/conditional_generation/peft_prompt_tuning_seq2seq.ipynb
- examples/conditional_generation/peft_prefix_tuning_seq2seq.ipynb

Continues huggingface#2998
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@dhruvildarji
Copy link
Copy Markdown
Contributor Author

Thanks for the review @BenjaminBossan! I've added the conditional generation notebook changes to this PR as requested. The changes from #3059 are now included here — the fix covers all instances: IA3 tutorial docs, peft_adalora_seq2seq.py, and all 4 conditional generation notebooks (IA3, LoRA, prompt tuning, prefix tuning).

Copy link
Copy Markdown
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for replacing the outdated dataset with a newer one.

I found one more instances where it appears: examples/int8_training/Finetune_flan_t5_large_bnb_peft.ipynb. Could you please fix that notebook too?

I also saw that these notebooks define:

checkpoint_name = "financial_sentiment_analysis_<peft-method>.pt"

With the changed dataset, the name doesn't really fit anymore. However, the variable isn't used at all, so in a sense it doesn't matter. But if you're up to it, it would be great to remove it completely and avoid possible confusion.

- Replace deprecated financial_phrasebank dataset with
  zeroshot/twitter-financial-news-sentiment in
  Finetune_flan_t5_large_bnb_peft.ipynb (text_column sentence→text)
- Remove unused checkpoint_name variables from all conditional
  generation notebooks and peft_adalora_seq2seq.py

Addresses reviewer feedback from @BenjaminBossan.
@BenjaminBossan
Copy link
Copy Markdown
Member

Thanks for the update @dhruvildarji. I tried running one of the notebooks (peft_lora_seq2seq) and encountered some issues. First of all, the line:

classes = dataset["train"].features["label"].names

didn't work but I could replace it with:

classes = dataset["train"]["label"]

The next issue came from this line in preprocess_function:

labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")

The issue here is that the targets are not strings, so they cannot be tokenized. Instead, they are ints, which encode the sentiment classes. Now we could just use those directly as the target without passing them through the tokenizer. However, then we would actually deal with a text classification task and not a seq2seq task as the notebook claims. Now I wonder if the dataset really fits the purpose.

Can you reproduce these errors? WDYT?

@BenjaminBossan
Copy link
Copy Markdown
Member

ping @dhruvildarji

When zeroshot/twitter-financial-news-sentiment is loaded, the label
feature may not expose a .names attribute (non-ClassLabel type depending
on datasets version), causing the class-name lookup to fail with an
AttributeError. This in turn prevents the text_label column from being
created, so tokenizer() receives raw ints instead of strings and raises
a TypeError.

Fix: check hasattr before accessing .names; fall back to hardcoded
["Bearish", "Bullish", "Neutral"] for this dataset.

Addresses the issue reported by @BenjaminBossan when running
peft_lora_seq2seq.ipynb.
@dhruvildarji
Copy link
Copy Markdown
Contributor Author

Thanks for testing this @BenjaminBossan! I've reproduced both issues and pushed a fix.

Root cause: With zeroshot/twitter-financial-news-sentiment, the label feature may not expose a .names attribute (plain integer feature vs ClassLabel, depending on the datasets version). When .names raises AttributeError, the text_label column is never created, so preprocess_function receives raw ints and tokenizer() fails with a TypeError.

What I changed (latest commit):

  • Added a hasattr check before accessing .names across all 5 notebooks
  • Falls back to ["Bearish", "Bullish", "Neutral"] if .names is unavailable
  • This ensures text_label is always a list of strings before tokenization

The seq2seq framing still holds — we're fine-tuning a T5 model to generate sentiment labels as text tokens ("Bearish"/"Bullish"/"Neutral"), which is valid for a generative seq2seq model. The dataset fits the purpose.

Copy link
Copy Markdown
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I can confirm that the notebooks run now. The Python script still fails, but the same update should work there too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line also needs the same adjustment as below in the notebooks, right?

Same fix as the notebooks: hasattr check before accessing .names,
falls back to ["Bearish", "Bullish", "Neutral"] if unavailable.

Addresses BenjaminBossan's follow-up comment that the Python script
still fails after the notebook fix.
@dhruvildarji
Copy link
Copy Markdown
Contributor Author

Fixed peft_adalora_seq2seq.py as well — same hasattr fallback applied on line 32. Should be good to go now.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the dataset in the examples, LGTM.

@BenjaminBossan BenjaminBossan merged commit c1e8a27 into huggingface:main Mar 17, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tutorial uses deprecated dataset (financial_phrasebank) that fails to load

3 participants