Modern Bert Support #15641
base: master
Conversation
…orted yet but working on getting conversion to work for encoder only
…ated gate split with views; GEGLU is now used, which does exactly this (see the sketch below these notes)
…when building attention keeps failing; setting the ubatch size to 1 (running llama-embedding with --ubatch-size 1) makes it work, but this needs to be looked into more
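For context on the gate-split change, here is a minimal numpy sketch of a GEGLU feed-forward block, assuming the up-projection packs a value half and a gate half into one matrix (which half is gated, and the tanh GELU approximation, are assumptions for illustration, not taken from this PR):

import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    # one up-projection produces 2 * d_ff values per token, then a view-style split
    # into a value half and a gate half -- the fused step a GEGLU op performs directly
    h = x @ w_in                              # (n_tokens, 2 * d_ff)
    val, gate = np.split(h, 2, axis=-1)       # two (n_tokens, d_ff) halves
    return (val * gelu(gate)) @ w_out         # (n_tokens, d_model)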
@gabe-l-hart thanks in advance :)
Also realizing this a little late haha, but should I be changing all of the ModernBert stuff to a Granite Embedding macro like LLM_ARCH_GRANITE_EMBD, or keep it as is?
You may want to check out an earlier attempt at ModernBert in #14014.
Thanks for getting this together @ryan-mangeno, and thanks for pointing out the previous work @CISC. Ryan, let me know if/when you've looked over that PR and found anything to fix and I'll take a pass at review.
In general, we want to keep things as generic as possible, so since this uses the …
will do
@gabe-l-hart I'm looking into ModernBert's research paper. I can't find a mention of symmetric sliding window attention, only local sliding window attention, so I am going to opt for LLAMA_SWA_TYPE_LOCAL over the LLAMA_SWA_TYPE_SYMMETRIC used in the previous attempt. It also uses global attention every third layer, so I am going to implement that and then it should be ready for a review :)
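A minimal numpy sketch of how I picture the layer pattern, assuming a 128-token local window and a global layer at the start of every group of three (the exact window size, which layer in each group is global, and the centred window are assumptions from my reading of the paper, not from this PR):

import numpy as np

SWA_WINDOW = 128  # assumed local window size
SWA_EVERY  = 3    # assumed pattern period: one global layer per group of three

def is_global_layer(il: int) -> bool:
    return il % SWA_EVERY == 0

def local_mask(n_tokens: int, window: int) -> np.ndarray:
    # bidirectional (encoder) sliding window: token i attends to tokens within +/- window // 2
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

def layer_mask(il: int, n_tokens: int) -> np.ndarray:
    # full attention on global layers, windowed attention everywhere else
    if is_global_layer(il):
        return np.ones((n_tokens, n_tokens), dtype=bool)
    return local_mask(n_tokens, SWA_WINDOW)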
@ryan-mangeno That sounds good! I haven't unpacked any of those mechanics myself, but I can try to get into it if you get stuck.
… per previous attempt, added local sliding window attention that alternates every third layer
ok 👍, made some changes but I'm not sure it's fully ready yet. I will ping you when I think it's ready, if that's ok.
Status update: I found out that ModernBert uses an alternating RoPE method, per https://arxiv.org/pdf/2412.13663. I am currently figuring out how to implement this.
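A rough sketch of what I understand the alternating RoPE to mean, assuming (from my reading of the paper, not verified against this PR or the reference implementation) that global-attention layers use a large rope base and local layers a small one:

import numpy as np

ROPE_THETA_GLOBAL = 160_000.0  # assumed base for global-attention layers
ROPE_THETA_LOCAL  = 10_000.0   # assumed base for sliding-window layers
HEAD_DIM          = 64         # assumed head dimension, for illustration only

def rope_inv_freqs(theta: float, head_dim: int) -> np.ndarray:
    # standard RoPE inverse frequencies for one attention head
    return 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)

def inv_freqs_for_layer(il: int, swa_every: int = 3) -> np.ndarray:
    # alternate the rope base per layer, following the same every-third-layer pattern
    theta = ROPE_THETA_GLOBAL if il % swa_every == 0 else ROPE_THETA_LOCAL
    return rope_inv_freqs(theta, head_dim=HEAD_DIM)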
@gabe-l-hart I believe this should be ready for review again. I added a new hparam, n_swa_pattern, which now works in the conversion script and can be pulled during model loading rather than being hardcoded. Let me know of any changes; finals are coming up, so I might be a bit slow for the next week, just fyi.
@ryan-mangeno it looks like there are some conflicts that need resolving. Can you merge in …
It looks like these are failing. I'm not clear if these are real issues, but I haven't seen these tests fail randomly on other PRs.
They are not real, it's a …
Edit: Keep in mind that a PR branch job will pick the default branch cache if no other one is available. I do routinely make sure to delete corrupt caches on master, though.
There are only two hard things in Computer Science...
I just deleted the cache in Actions and am rerunning now; is that all that needs to be done regarding that job?
The cache is long gone by now; it gets purged on a daily basis (hourly when exceeding 10GB).
Hmm, it seems to be failing again (ubuntu-cpu-cmake (x64, ubuntu-22.04)). Here is an excerpt from the logs. Is this related to my implementation, and if so, any idea where to start debugging? Thanks!
Nope, those are expected failures; what is actually failing are these: … but it has nothing to do with your PR, just ignore it.
Ok thanks! Will be on the lookout for review fixes.
One note as I'm testing this: when I run …
After some dependency flailing, I was able to get my comparison script running again, and the side-by-side results with granite_embed.py:

from sentence_transformers import SentenceTransformer
import numpy as np
import subprocess
import shlex

# baseline model (sentence-transformers) and the GGUF converted from it
model_path = "/Users/ghart/models/ibm-granite/granite-embedding-small-english-r2/"
model = SentenceTransformer(model_path)

input_queries = [
    "hello world",
    "tell me a story about a developer and their dog",
    "123sfg this is a r@nd0m t35t",
]

def cosine_similarity(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    vector_a = np.asarray(vector_a)
    vector_b = np.asarray(vector_b)
    numerator = np.dot(vector_a, vector_b)
    denominator_a = np.linalg.norm(vector_a)
    denominator_b = np.linalg.norm(vector_b)
    if denominator_a == 0 or denominator_b == 0:
        return 0.0
    return numerator / (denominator_a * denominator_b)

for query in input_queries:
    # reference embedding from sentence-transformers
    print("### BASELINE ###")
    embedding = model.encode([query])
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding[:, :8])

    # same query through llama-embedding on the converted GGUF
    print("### llama.cpp ###")
    lcpp_exe = "/Users/ghart/Projects/github/ggml-org/llama.cpp/build/bin/llama-embedding"
    lcpp_model = f"{model_path}/granite-embedding-small-english-r2-BF16.gguf"
    cmd = f"{lcpp_exe} -m {lcpp_model} -p \"{query}\" --temp 0 --embd-normalize -1"
    print(f"llama.cpp command: {cmd}")
    proc = subprocess.Popen(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    # llama-embedding prints the embedding after the last ':' on stdout
    vals = out.decode("utf-8").split(":")[-1]
    vals = [float(v) for v in vals.split() if v.strip()]
    lcpp_emb = np.array(vals)
    print("llama.cpp Embedding shape:", lcpp_emb.shape)
    print("llama.cpp Embedding vector:", lcpp_emb[:8])
    print()
    cos_sim = cosine_similarity(embedding[0], lcpp_emb)  # embedding is (1, dim); compare its single row
    print(f"COSINE SIMILARITY: {cos_sim}")
    print("--------------------------------")
    print()
This is actually handled now, but requires a little more care for …
gabe-l-hart left a comment:
One small naming NIT, but that aside, I think this is ready to go! Thank you @ryan-mangeno
SHARED_KV_LAYERS       = "{arch}.attention.shared_kv_layers"
SLIDING_WINDOW_PATTERN = "{arch}.attention.sliding_window_pattern"
TEMPERATURE_SCALE      = "{arch}.attention.temperature_scale"
DENSE_EVERY_N_LAYERS   = "{arch}.attention.dense_every_n_layers"
It looks like we now have several different ways to indicate the layer type for different hybrid patterns. This DENSE_EVERY_N_LAYERS implies a strictly periodic pattern. The above SLIDING_WINDOW_PATTERN appears to be for interleaved sliding-window hybrid architectures, but strangely I don't see the corresponding key anywhere on the C++ side (here). We also use the fact that head_count_kv can be either a scalar or a list of scalars to differentiate recurrent vs attention layers for GraniteHybrid (here).
Since this is already fairly confusing and potentially redundant, I don't think we need to hold up this PR, but I'm curious if others can think of a clean way to accomplish the goal of layer type designation without a net-new hparam.
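To make the overlap concrete, here is a hypothetical sketch (arch prefix and values invented for illustration) of how the same "global attention every third layer" schedule could be written under these keys for a 6-layer model:

# a) strictly periodic, as a single integer period:
{"modernbert.attention.dense_every_n_layers": 3}

# b) an explicit per-layer pattern under the sliding-window key
#    (True here meaning "this layer uses the sliding window"):
{"modernbert.attention.sliding_window_pattern": [False, True, True, False, True, True]}

# c) the same period as a single integer under the sliding-window key,
#    roughly what models hardcode today via hparams.set_swa_pattern:
{"modernbert.attention.sliding_window_pattern": 3}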
I think sliding_window_pattern and dense_every_n_layers are basically the same thing, but support is indeed missing (probably forgotten at some stage); all models seem to just hardcode the hparams.set_swa_pattern call.
I wonder if we can just reuse the sliding_window_pattern key and differentiate between loading a single integer and an array of bools, instead of introducing a new key?
Alright, I think what we can do is make an overload of llama_model_loader::get_key_or_arr that has uint32_t as its result (and no n) and which simply fails if the metadata is an array.
That way we can ml.get_key_or_arr the sliding_window_pattern into a uint32_t (with the previously hardcoded value as default) that we feed to hparams.set_swa_pattern.
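Not the actual C++, just a Python sketch of the proposed loading semantics (names are placeholders): read the key as a single integer, fall back to the previously hardcoded default, and fail if the metadata turns out to be an array:

def get_swa_pattern(metadata: dict, key: str, default: int) -> int:
    # scalar-only lookup with a default, erroring out on array-valued metadata
    value = metadata.get(key, default)
    if isinstance(value, (list, tuple)):
        raise ValueError(f"{key}: expected a single integer, got an array")
    return int(value)

# e.g. n_pattern = get_swa_pattern(md, "modernbert.attention.sliding_window_pattern", 3)
# which would then be fed to hparams.set_swa_pattern on the C++ side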
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Adding support to run Granite Embedding Small, which primarily pulls in the ModernBert architecture: https://huggingface.co/ibm-granite/granite-embedding-small-english-r2. Currently still working on it; I haven't figured out the pre-tokenizer type or whether I need to implement it. Also, for the ubatch size, the assert in llama-graph.cpp fails; I hacked it to accept a ubatch size of 1 for testing, but it seems to keep failing there and I'm not sure why.
If I comment out the line in llama-graph.cpp
then it works.