MDEV-37009 bulk insert into vector index#5150
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces bulk insert support for hierarchical navigable small world (mhnsw) indexes during table copy operations. The reviewer identified several critical issues, including compilation errors from accessing bulk_insert_active through the wrong pointer, an uninitialized member variable, and an ignored return value during bulk insert initialization. Suggestions were also made to reset the active flag on failure and correct formatting to match the repository's style.
| DBUG_ASSERT(s->hlindexes() == (hlindex != NULL)); | ||
| if (hlindex && hlindex->in_use) | ||
| { | ||
| if (hlindex->bulk_insert_active) | ||
| return mhnsw_bulk_insert_row(this, key_info + s->keys); | ||
| else | ||
| return mhnsw_insert(this, key_info + s->keys); | ||
| } | ||
| return 0; |
There was a problem hiding this comment.
The member bulk_insert_active was added to the TABLE struct in sql/table.h, but here it is accessed via hlindex->bulk_insert_active. Since hlindex is not of type TABLE*, this will cause a compilation error. It should be accessed directly as bulk_insert_active (or this->bulk_insert_active).
Additionally, please update the indentation to 2 spaces to match the repository's implicit style guide.
DBUG_ASSERT(s->hlindexes() == (hlindex != NULL));
if (hlindex && hlindex->in_use)
{
if (bulk_insert_active)
return mhnsw_bulk_insert_row(this, key_info + s->keys);
else
return mhnsw_insert(this, key_info + s->keys);
}
return 0;| int TABLE::hlindexes_bulk_insert_begin(ha_rows rows) | ||
| { | ||
| if (hlindex && hlindex->in_use) | ||
| { | ||
| hlindex->bulk_insert_active= true; | ||
| return mhnsw_bulk_insert_begin(this, key_info + s->keys, rows); | ||
| } | ||
| return 0; | ||
| } | ||
|
|
||
| int TABLE::hlindexes_bulk_insert_end() | ||
| { | ||
| if (hlindex && hlindex->in_use) | ||
| { | ||
| hlindex->bulk_insert_active= false; | ||
| return mhnsw_bulk_insert_end(this, key_info + s->keys); | ||
| } | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
There are three issues here:
bulk_insert_activeis a member ofTABLE, nothlindex, so accessing it viahlindex->bulk_insert_activewill cause a compilation error.- If
mhnsw_bulk_insert_beginfails (returns a non-zero error code),bulk_insert_activeshould be reset tofalseso that subsequent inserts do not incorrectly attempt bulk insertion. - The indentation should be updated to 2 spaces to match the repository's implicit style guide.
int TABLE::hlindexes_bulk_insert_begin(ha_rows rows)
{
if (hlindex && hlindex->in_use)
{
bulk_insert_active= true;
int err= mhnsw_bulk_insert_begin(this, key_info + s->keys, rows);
if (err)
bulk_insert_active= false;
return err;
}
return 0;
}
int TABLE::hlindexes_bulk_insert_end()
{
if (hlindex && hlindex->in_use)
{
bulk_insert_active= false;
return mhnsw_bulk_insert_end(this, key_info + s->keys);
}
return 0;
}| if (!to->s->long_unique_table) | ||
| { | ||
| to->file->ha_start_bulk_insert(from->file->stats.records, | ||
| ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT); | ||
| bulk_insert_started= 1; | ||
| to->file->ha_start_bulk_insert(from->file->stats.records, | ||
| ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT); | ||
| bulk_insert_started= 1; | ||
|
|
||
| if (to->s->hlindexes()) | ||
| { | ||
| to->hlindexes_bulk_insert_begin(from->file->stats.records); | ||
| hlindex_bulk_started= 1; | ||
| } | ||
| } |
There was a problem hiding this comment.
The return value of to->hlindexes_bulk_insert_begin is currently ignored. If the bulk insert initialization fails, we should handle the error, print it, and avoid setting hlindex_bulk_started to 1 so that we do not proceed with the bulk insert or attempt to end it later.
Additionally, please update the indentation to 2 spaces to match the repository's implicit style guide.
if (!to->s->long_unique_table)
{
to->file->ha_start_bulk_insert(from->file->stats.records,
ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT);
bulk_insert_started= 1;
if (to->s->hlindexes())
{
if ((error= to->hlindexes_bulk_insert_begin(from->file->stats.records)))
{
to->file->print_error(error, MYF(0));
}
else
{
hlindex_bulk_started= 1;
}
}
}| if (hlindex_bulk_started && to->hlindexes_bulk_insert_end() && error <= 0) | ||
| { | ||
| if (!thd->is_error()) | ||
| to->file->print_error(my_errno, MYF(0)); | ||
| error= 1; | ||
| } | ||
| hlindex_bulk_started=0; |
There was a problem hiding this comment.
Please update the indentation to 2 spaces and add proper spacing around operators (e.g., hlindex_bulk_started = 0;) to match the repository's implicit style guide.
if (hlindex_bulk_started && to->hlindexes_bulk_insert_end() && error <= 0)
{
if (!thd->is_error())
to->file->print_error(my_errno, MYF(0));
error= 1;
}
hlindex_bulk_started= 0;| */ | ||
| bool alias_name_used; /* true if table_name is alias */ | ||
| bool get_fields_in_item_tree; /* Signal to fix_field */ | ||
| bool bulk_insert_active; /* mhnsw bulk_insert_started flag */ |
There was a problem hiding this comment.
…orrectly set the grefs already faster by ~15%
2c44b07 to
ffaa6a1
Compare
iMineLink
left a comment
There was a problem hiding this comment.
Thank you for sharing the first draft @shabbann, I did a preliminary review and found minor things and a missing initialization of MHNSW_Share::bulk_active worth checking.
Possibly out of scope, but it may be interesting to enable falling back to normal insert if some memory budget is exceeded for the bulk insert, to avoid unexpected OOM issues.
| @@ -510,7 +510,7 @@ class MHNSW_Share : public Sql_alloc | |||
| const uint M; | |||
| metric_type metric; | |||
| bool use_subdist; | |||
|
|
|||
| bool bulk_active; | |||
There was a problem hiding this comment.
Can you explicitely initialize this to false?
|
|
||
| if (to->s->hlindexes()) | ||
| { | ||
| to->hlindexes_bulk_insert_begin(from->file->stats.records); |
There was a problem hiding this comment.
I think error handling is needed here.
| return err; | ||
| } | ||
|
|
||
| if (int err= graph->file->ha_end_bulk_insert()) |
There was a problem hiding this comment.
I think this needs to be placed inside a SCOPE_EXIT or similar, to avoid premature exit on error if node->save() fails in the loop above.
| if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*), | ||
| rows + rows * 0.1, rows, MYF(0))) | ||
| { | ||
| delete bulk; |
There was a problem hiding this comment.
I think this delete bulk statements and the one few lines below can be removed.
| @@ -1012,6 +1012,8 @@ int FVectorNode::load_from_record(TABLE *graph) | |||
| FVector *vec_ptr= FVector::align_ptr(tref() + tref_len()); | |||
| memcpy(vec_ptr->data(), v->ptr(), v->length()); | |||
| vec_ptr->postprocess(ctx->use_subdist, ctx->vec_len); | |||
| if (ctx->metric == COSINE) | |||
| vec_ptr->abs2= 0.5f; | |||
There was a problem hiding this comment.
I see you already prepared #5184 to fix this separately, good.
| delete bulk; | ||
| return err; | ||
| } | ||
|
|
There was a problem hiding this comment.
Here you may add DBUG_ASSERT(!bulk->ctx->start); to document that the bulk build assumes an empty target graph
Add support for bulk insertion into MHNSW vector indexes during ALTER TABLE