
Commit 1943ae2

split and embed documentation fixed
1 parent a47373a commit 1943ae2

File tree

1 file changed: +31 -8 lines changed


docs/content/sandbox/tools/split_embed.md

Lines changed: 31 additions & 8 deletions
@@ -8,22 +8,45 @@ Copyright (c) 2023, 2024, Oracle and/or its affiliates.
Licensed under the Universal Permissive License v1.0 as shown at http://oss.oracle.com/licenses/upl.
-->

- The first phase of building a RAG Chatbot start with the chunking of document base and generation of vector embeddings will be stored into a vector store to be retrieved by vectors distance search and added to the context to answer the question grounded to the information provided.
+ The first phase of building a RAG Chatbot starts with chunking the document base and generating vector embeddings, which are stored in a vector store so they can be retrieved by vector distance search and added to the context, grounding the answer in the information provided.
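Illustratively, the split-and-embed step amounts to something like the following sketch. This is not the sandbox's actual code: it assumes LangChain's `RecursiveCharacterTextSplitter` is available, and `embed_chunks()` is a hypothetical stand-in for whichever embedding model is configured (OpenAI, Cohere, OLLAMA, etc.).

```python
# Minimal sketch of the "split and embed" phase, assuming the LangChain
# text-splitters package; embed_chunks() is a hypothetical placeholder for
# the embedding model configured in the sandbox.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Placeholder: in the real tool the vectors come from the selected model.
    raise NotImplementedError

def split_and_embed(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # maximum size of each chunk
        chunk_overlap=chunk_overlap,  # shared content between adjacent chunks
    )
    chunks = splitter.split_text(text)
    vectors = embed_chunks(chunks)
    # (chunk, embedding) pairs ready to be loaded into the vector store
    return list(zip(chunks, vectors))
```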

- We choose the freedom to exploit LLMs for embeddings provided by public services like Cohere, OpenAI, and Perplexity, or running on top a GPU compute node managed by the customer and exposed through the open source platform OLLAMA, to avoid sharing data with external services that are beyond full customer control.
+ We leave the freedom to use LLMs for vector embeddings either provided by public services like Cohere, OpenAI, and Perplexity, or running on top of a GPU compute node managed by the user and exposed through open source platforms like OLLAMA or HuggingFace, in order to avoid sharing data with external services that are beyond the customer's full control.

- From the Split/Embed voice of menu, you’ll access to the ingestion page:
+ From the **Split/Embed** item in the left-side menu, you’ll reach the ingestion page:

![Split](images/split.png)

+ The Load and Split Documents part of the Split/Embed form allows you to choose documents (txt, pdf, html, etc.) stored on the Object Storage service available on Oracle Cloud Infrastructure, on the client’s desktop, or fetched from URLs, as shown in the following snapshot:

- The Embedding models available list will depend from the Configuration/Models page.
+ ![Embed](images/embed.png)

- You’ll define quickly the embedding size, that depends on model type, the overlap size, the distance metrics adopted and availble in the Oracle DB 23ai that will play the role as vector store.
+ A “speaking” table will be created, like TEXT_EMBEDDING_3_SMALL_8191_1639_COSINE in the example. On the same set of documents you can create several vector store table variants, since nobody normally knows in advance which chunking size is best, and then test them independently.

- The Load and Split Documents part of Split/Embed form will allow to choose documents (txt,pdf,html,etc.) stored on the Object Storage service available on the Oracle Cloud Infrastructure, on the client’s desktop or getting from URLs, like shown in following snapshot:
+ ## Embedding Configuration
+
+ Choose one of the **Embedding models available** from the listbox; the list depends on the **Configuration/Models** page.
+ The **Embedding Server** URL associated with the chosen model will be shown. The **Chunk Size (tokens)** will change according to the kind of embedding model selected, as will the **Chunk Overlap (% of Chunk Size)**.
+ Then you have to choose one of the **Distance Metric** options available in Oracle DB 23ai:
+ - COSINE
+ - EUCLIDEAN_DISTANCE
+ - DOT_PRODUCT
+ - MAX_INNER_PRODUCT
+ To understand the meaning of these metrics, please refer to [Vector Distance Metrics](https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/vector-distance-metrics.html) in the Oracle DB 23ai "*AI Vector Search User's Guide*".
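As a rough, tool-independent illustration of how these options compare two embedding vectors (using the usual textbook definitions, not necessarily the exact formulas applied by the database; refer to the Oracle doc linked above for those):

```python
# Rough illustration of the four distance/similarity options.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

cosine_distance = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)
dot_product = np.dot(a, b)        # higher means more similar
max_inner_product = np.dot(a, b)  # same quantity; "max" refers to how search ranks candidates

print(f"COSINE            : {cosine_distance:.4f}")
print(f"EUCLIDEAN_DISTANCE: {euclidean_distance:.4f}")
print(f"DOT_PRODUCT       : {dot_product:.4f}")
print(f"MAX_INNER_PRODUCT : {max_inner_product:.4f}")
```

Note that COSINE and EUCLIDEAN_DISTANCE are distances (smaller is closer), while DOT_PRODUCT and MAX_INNER_PRODUCT rank by similarity (larger is closer).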
+
+ The **Embedding Alias** field lets you add more meaningful information to the vector store table name, allowing you to keep more than one vector table with the same *model + chunksize + chunk_overlap + distance_strategy* combination.
+
+ ## Load and Split Documents
+
+ The process started by clicking the **Populate Vector Store** button needs:
+ - **File Source**: you can include txt, pdf, and html documents from one of these sources:
+   - **OCI**: you can browse and add more than one document to the same vector store table at a time;
+   - **Local**: you can upload more than one document to the same vector store table at a time;
+   - **Web**: you can load one txt, pdf, or html document from the URL provided.
+
+ - **Rate Limit (RPM)**: so that a public LLM embedding service does not ban you for sending more requests per minute than your subscription allows (see the pacing sketch after this list).
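As a generic illustration of what an RPM limit implies (this is not the tool's implementation, and the `rpm` default here is an arbitrary assumption), embedding requests can be paced like this:

```python
# Generic requests-per-minute pacing sketch: sleep between calls so the
# request rate stays under the configured RPM limit.
import time

def paced_calls(batches, call, rpm: int = 60):
    min_interval = 60.0 / rpm  # minimum seconds between requests
    results = []
    for batch in batches:
        start = time.monotonic()
        results.append(call(batch))  # e.g. one embedding request per batch
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results
```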
+
+ The **Vector Store** field will show the name of the table that will be populated in the DB, following a naming convention that reflects the parameters used.
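As an illustration only, a hypothetical helper that composes such a name from the parameters (the tool's real rule may differ) could look like this:

```python
# Hypothetical sketch of the naming convention described above:
# model + chunk size + chunk overlap + distance metric, plus an optional alias.
def vector_table_name(model: str, chunk_size: int, chunk_overlap: int,
                      distance: str, alias: str = "") -> str:
    parts = [model.replace("-", "_").upper(), str(chunk_size),
             str(chunk_overlap), distance.upper()]
    if alias:
        parts.append(alias.upper())
    return "_".join(parts)

# vector_table_name("text-embedding-3-small", 8191, 1639, "COSINE")
# -> "TEXT_EMBEDDING_3_SMALL_8191_1639_COSINE"
```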

- ![Embed](images/embed.png)

- It will be created a “speaking” table, like the TEXT_EMBEDDING_3_SMALL_8191_1639_COSINE in the example. You could create on the same documents set several options of vector store, since nobody normally knows which is the best chunking size, and test them indipendently.

0 commit comments
