Modern Bert Support #15641
base: master
`gguf-py/gguf/constants.py`:

```diff
@@ -176,11 +176,13 @@ class Attention:
         SHARED_KV_LAYERS       = "{arch}.attention.shared_kv_layers"
         SLIDING_WINDOW_PATTERN = "{arch}.attention.sliding_window_pattern"
         TEMPERATURE_SCALE      = "{arch}.attention.temperature_scale"
+        DENSE_EVERY_N_LAYERS   = "{arch}.attention.dense_every_n_layers"
```
Collaborator: It looks like we now have several different ways to indicate the layer type for different hybrid patterns, and this `dense_every_n_layers` key adds yet another. Since this is already fairly confusing and potentially redundant, I don't think we need to hold up this PR, but I'm curious if others can think of a clean way to accomplish the goal of layer-type designation without a net-new key.

Collaborator: I wonder if we can just reuse the existing `sliding_window_pattern` key.

Collaborator: Alright, I think what we can do is make an overload of the existing `set_swa_pattern` helper. That way we can reuse the same mechanism here.
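To make the trade-off in this thread concrete, below is a hedged sketch (not code from this PR) of the two metadata shapes being discussed, written against gguf-py's `GGUFWriter`. The file name, arch string, and pattern values are illustrative, and Option A assumes this PR's new constant is present.

```python
# Hedged sketch, not code from this PR: two ways a converter could
# record ModernBERT's hybrid attention layout in GGUF metadata.
import gguf

writer = gguf.GGUFWriter("model.gguf", "modern-bert")

# Option A (what this PR adds): a net-new scalar key meaning
# "every Nth layer uses dense/global attention"; N = 3 is illustrative.
writer.add_uint32(
    gguf.Keys.Attention.DENSE_EVERY_N_LAYERS.format(arch="modern-bert"),
    3,
)

# Option B (reviewer suggestion): reuse the existing pattern key, here
# assumed to hold a repeating per-layer "is sliding-window" cycle.
writer.add_array(
    gguf.Keys.Attention.SLIDING_WINDOW_PATTERN.format(arch="modern-bert"),
    [True, True, False],  # two sliding-window layers, then one global
)
```

Option B avoids the net-new key, at the cost of readers having to infer "dense every N" from the cycle length. (Diff continues below.)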
```diff
     class Rope:
         DIMENSION_COUNT     = "{arch}.rope.dimension_count"
         DIMENSION_SECTIONS  = "{arch}.rope.dimension_sections"
         FREQ_BASE           = "{arch}.rope.freq_base"
+        FREQ_BASE_SWA       = "{arch}.rope.freq_base_swa"
         SCALING_TYPE        = "{arch}.rope.scaling.type"
         SCALING_FACTOR      = "{arch}.rope.scaling.factor"
         SCALING_ATTN_FACTOR = "{arch}.rope.scaling.attn_factor"
```
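For context on the `FREQ_BASE_SWA` addition: ModernBERT-style models use a different rotary base for local (sliding-window) layers than for global layers, so a single `rope.freq_base` cannot describe both. A hedged sketch of how a converter might emit the pair; the 160000/10000 values are ModernBERT's published config defaults, used here only as an illustration:

```python
# Hedged sketch, not from this PR: emitting separate RoPE bases for
# global vs. sliding-window layers.
import gguf

writer = gguf.GGUFWriter("model.gguf", "modern-bert")
writer.add_rope_freq_base(160000.0)  # global-attention layers
writer.add_float32(
    gguf.Keys.Rope.FREQ_BASE_SWA.format(arch="modern-bert"),
    10000.0,  # local (sliding-window) layers
)
```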
```diff
@@ -354,6 +356,7 @@ class MODEL_ARCH(IntEnum):
     STARCODER      = auto()
     REFACT         = auto()
     BERT           = auto()
+    MODERN_BERT    = auto()
     NOMIC_BERT     = auto()
     NOMIC_BERT_MOE = auto()
     NEO_BERT       = auto()
```
```diff
@@ -747,6 +750,7 @@ class MODEL_TENSOR(IntEnum):
     MODEL_ARCH.STARCODER:      "starcoder",
     MODEL_ARCH.REFACT:         "refact",
     MODEL_ARCH.BERT:           "bert",
+    MODEL_ARCH.MODERN_BERT:    "modern-bert",
     MODEL_ARCH.NOMIC_BERT:     "nomic-bert",
     MODEL_ARCH.NOMIC_BERT_MOE: "nomic-bert-moe",
     MODEL_ARCH.NEO_BERT:       "neo-bert",
```
```diff
@@ -1367,6 +1371,20 @@ class MODEL_TENSOR(IntEnum):
         MODEL_TENSOR.CLS,
         MODEL_TENSOR.CLS_OUT,
     ],
+    MODEL_ARCH.MODERN_BERT: [
+        MODEL_TENSOR.TOKEN_EMBD,
+        MODEL_TENSOR.TOKEN_EMBD_NORM,
+        MODEL_TENSOR.OUTPUT_NORM,
+        MODEL_TENSOR.ATTN_NORM,
+        MODEL_TENSOR.ATTN_OUT,
+        MODEL_TENSOR.ATTN_QKV,
+        MODEL_TENSOR.POS_EMBD,
+        MODEL_TENSOR.FFN_UP,
+        MODEL_TENSOR.FFN_DOWN,
+        MODEL_TENSOR.FFN_NORM,
+        MODEL_TENSOR.CLS,
+        MODEL_TENSOR.CLS_OUT,
+    ],
     MODEL_ARCH.NOMIC_BERT: [
         MODEL_TENSOR.TOKEN_EMBD,
         MODEL_TENSOR.TOKEN_EMBD_NORM,
```
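With the constants above merged, the new arch becomes addressable through gguf-py's public tables; a quick sanity check (assuming a gguf-py build that includes this change) might look like:

```python
# Sanity check against gguf-py's public tables, assuming this diff
# is part of the installed gguf package.
import gguf

arch = gguf.MODEL_ARCH.MODERN_BERT
print(gguf.MODEL_ARCH_NAMES[arch])                             # modern-bert
print(gguf.MODEL_TENSOR.ATTN_QKV in gguf.MODEL_TENSORS[arch])  # True
```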
`src/llama-arch.h`:
```diff
@@ -24,6 +24,7 @@ enum llm_arch {
     LLM_ARCH_STARCODER,
     LLM_ARCH_REFACT,
     LLM_ARCH_BERT,
+    LLM_ARCH_MODERN_BERT,
     LLM_ARCH_NOMIC_BERT,
     LLM_ARCH_NOMIC_BERT_MOE,
     LLM_ARCH_NEO_BERT,
```
```diff
@@ -188,6 +189,7 @@ enum llm_kv {
     LLM_KV_EMBEDDING_SCALE,
     LLM_KV_TOKEN_SHIFT_COUNT,
     LLM_KV_INTERLEAVE_MOE_LAYER_STEP,
+    LLM_KV_DENSE_EVERY_N_LAYERS,
 
     LLM_KV_ATTENTION_HEAD_COUNT,
     LLM_KV_ATTENTION_HEAD_COUNT_KV,
```
```diff
@@ -208,6 +210,7 @@ enum llm_kv {
     LLM_KV_ATTENTION_GATE_LORA_RANK,
     LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT,
     LLM_KV_ATTENTION_SLIDING_WINDOW,
+    LLM_KV_ATTENTION_DENSE_EVERY_N_LAYERS,
     LLM_KV_ATTENTION_SCALE,
     LLM_KV_ATTENTION_OUTPUT_SCALE,
     LLM_KV_ATTENTION_TEMPERATURE_LENGTH,
```
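These enum values are what the C++ loader keys its lookups on; the underlying GGUF key string is the same one the Python constant expands to. One hedged way to confirm a converted file actually carries the new KV (the path is hypothetical):

```python
# Hedged sketch: verify the new KV round-trips in a converted file.
# "model.gguf" is a hypothetical path; the key string matches the
# Python constant added earlier, expanded for this arch.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")
field = reader.get_field("modern-bert.attention.dense_every_n_layers")
if field is not None:
    # scalar fields keep their value in the part indexed by data[-1]
    print(int(field.parts[field.data[-1]][0]))
```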
```diff
@@ -218,6 +221,7 @@ enum llm_kv {
     LLM_KV_ROPE_DIMENSION_COUNT,
     LLM_KV_ROPE_DIMENSION_SECTIONS,
     LLM_KV_ROPE_FREQ_BASE,
 
+    LLM_KV_ROPE_FREQ_BASE_SWA,
     LLM_KV_ROPE_SCALE_LINEAR,
     LLM_KV_ROPE_SCALING_TYPE,
     LLM_KV_ROPE_SCALING_FACTOR,
```

Collaborator: NIT: Seems like this should be one line up so it's next to `LLM_KV_ROPE_FREQ_BASE`.
You forgot to commit the mapping?

Originally I had made support for granite small embedding, and it was using the modern arch under the hood.

Which model uses this naming convention? I'm not seeing any of these naming conventions in the `granite-embedding-small-english-r2` model. Either way, I think the point still stands that this is not the right place to do name mappings. Unless I'm misunderstanding, these would easily fit into `tensor_mapping.py`.
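For reference, what the reviewer is pointing at would look roughly like the following in `tensor_mapping.py` style. The HF-side names here are assumptions drawn from ModernBERT's published checkpoint layout, not names taken from this PR:

```python
# Hedged sketch of the reviewer's suggestion: register ModernBERT's
# HF tensor names in gguf-py's tensor_mapping.py tables. HF-side
# names below are assumptions based on ModernBERT's checkpoint layout.
from gguf import MODEL_TENSOR

modern_bert_mappings: dict[MODEL_TENSOR, tuple[str, ...]] = {
    MODEL_TENSOR.TOKEN_EMBD:      ("model.embeddings.tok_embeddings",),
    MODEL_TENSOR.TOKEN_EMBD_NORM: ("model.embeddings.norm",),
    MODEL_TENSOR.ATTN_NORM:       ("model.layers.{bid}.attn_norm",),
    MODEL_TENSOR.ATTN_QKV:        ("model.layers.{bid}.attn.Wqkv",),
    MODEL_TENSOR.ATTN_OUT:        ("model.layers.{bid}.attn.Wo",),
    MODEL_TENSOR.FFN_NORM:        ("model.layers.{bid}.mlp_norm",),
    MODEL_TENSOR.FFN_UP:          ("model.layers.{bid}.mlp.Wi",),
    MODEL_TENSOR.FFN_DOWN:        ("model.layers.{bid}.mlp.Wo",),
    MODEL_TENSOR.OUTPUT_NORM:     ("model.final_norm",),
}
```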