docs: replace deprecated financial_phrasebank dataset in IA3 tutorial#3058
Conversation
…news-sentiment The financial_phrasebank dataset fails to load with recent versions of the datasets library due to deprecation of loading scripts. Replace it with zeroshot/twitter-financial-news-sentiment which is a compatible financial sentiment dataset available in the new parquet format. Updated files: - docs/source/task_guides/ia3.md: update dataset reference and text column - examples/conditional_generation/peft_adalora_seq2seq.py: update dataset and column name Fixes huggingface#2998 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eneration notebooks The financial_phrasebank dataset fails to load with recent versions of the datasets library. Replace it with zeroshot/twitter-financial-news-sentiment and update text_column from "sentence" to "text" to match the new dataset schema. Updated files: - examples/conditional_generation/peft_ia3_seq2seq.ipynb - examples/conditional_generation/peft_lora_seq2seq.ipynb - examples/conditional_generation/peft_prompt_tuning_seq2seq.ipynb - examples/conditional_generation/peft_prefix_tuning_seq2seq.ipynb Continues huggingface#2998
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Thanks for the review @BenjaminBossan! I've added the conditional generation notebook changes to this PR as requested. The changes from #3059 are now included here — the fix covers all instances: IA3 tutorial docs, peft_adalora_seq2seq.py, and all 4 conditional generation notebooks (IA3, LoRA, prompt tuning, prefix tuning). |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for replacing the outdated dataset with a newer one.
I found one more instances where it appears: examples/int8_training/Finetune_flan_t5_large_bnb_peft.ipynb. Could you please fix that notebook too?
I also saw that these notebooks define:
checkpoint_name = "financial_sentiment_analysis_<peft-method>.pt"
With the changed dataset, the name doesn't really fit anymore. However, the variable isn't used at all, so in a sense it doesn't matter. But if you're up to it, it would be great to remove it completely and avoid possible confusion.
- Replace deprecated financial_phrasebank dataset with zeroshot/twitter-financial-news-sentiment in Finetune_flan_t5_large_bnb_peft.ipynb (text_column sentence→text) - Remove unused checkpoint_name variables from all conditional generation notebooks and peft_adalora_seq2seq.py Addresses reviewer feedback from @BenjaminBossan.
|
Thanks for the update @dhruvildarji. I tried running one of the notebooks (peft_lora_seq2seq) and encountered some issues. First of all, the line:
didn't work but I could replace it with:
The next issue came from this line in
The issue here is that the targets are not strings, so they cannot be tokenized. Instead, they are ints, which encode the sentiment classes. Now we could just use those directly as the target without passing them through the tokenizer. However, then we would actually deal with a text classification task and not a seq2seq task as the notebook claims. Now I wonder if the dataset really fits the purpose. Can you reproduce these errors? WDYT? |
|
ping @dhruvildarji |
When zeroshot/twitter-financial-news-sentiment is loaded, the label feature may not expose a .names attribute (non-ClassLabel type depending on datasets version), causing the class-name lookup to fail with an AttributeError. This in turn prevents the text_label column from being created, so tokenizer() receives raw ints instead of strings and raises a TypeError. Fix: check hasattr before accessing .names; fall back to hardcoded ["Bearish", "Bullish", "Neutral"] for this dataset. Addresses the issue reported by @BenjaminBossan when running peft_lora_seq2seq.ipynb.
|
Thanks for testing this @BenjaminBossan! I've reproduced both issues and pushed a fix. Root cause: With What I changed (latest commit):
The seq2seq framing still holds — we're fine-tuning a T5 model to generate sentiment labels as text tokens ("Bearish"/"Bullish"/"Neutral"), which is valid for a generative seq2seq model. The dataset fits the purpose. |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the updates. I can confirm that the notebooks run now. The Python script still fails, but the same update should work there too.
There was a problem hiding this comment.
This line also needs the same adjustment as below in the notebooks, right?
Same fix as the notebooks: hasattr check before accessing .names, falls back to ["Bearish", "Bullish", "Neutral"] if unavailable. Addresses BenjaminBossan's follow-up comment that the Python script still fails after the notebook fix.
|
Fixed |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for fixing the dataset in the examples, LGTM.
Summary
Fixes #2998
The
financial_phrasebankdataset fails to load with recent versions of thedatasetslibrary because it relied on a deprecated loading script format. This PR replaces it withzeroshot/twitter-financial-news-sentiment, a financial sentiment dataset that uses the modern parquet format.Changes
docs/source/task_guides/ia3.md: updated dataset reference, description, andtext_columnfrom"sentence"to"text"to match the new dataset's column nameexamples/conditional_generation/peft_adalora_seq2seq.py: same dataset and column name updatesThe replacement dataset (
zeroshot/twitter-financial-news-sentiment) was tested and confirmed working by @maerory in the issue thread.Test plan
load_dataset("zeroshot/twitter-financial-news-sentiment")sentencetotextthroughout affected filesfeatures["label"].names(no hardcoded changes needed)