Commit 6872c23

Author: Ziqun Ye
change the example and remove keyword search example
1 parent 60a77d0 commit 6872c23

1 file changed

+57
-72
lines changed

llm_application/Use_of_Cohere_embed_models_for_Semantic_Search_in_OCI_OpenSearch.ipynb

Lines changed: 57 additions & 72 deletions
@@ -42,7 +42,7 @@
 "from langchain.text_splitter import MarkdownHeaderTextSplitter\n",
 "\n",
 "with fsspec.open(\n",
-" \"https://raw.githubusercontent.com/oracle-samples/oci-data-science-ai-samples/155f76109d24860ceeb72ed6b742ced33a46ce22/README.md\",\n",
+" \"https://raw.githubusercontent.com/oracle-samples/oci-data-science-ai-samples/main/distributed_training/Tensorboard.md\",\n",
 " \"r\"\n",
 ") as f:\n",
 " report = f.read()\n",
@@ -68,10 +68,14 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Number of documents: 15\n",
+"Number of documents: 4\n",
 "First document:\n",
-"The Oracle Cloud Infrastructure (OCI) Data Science service has created this repo to make demos, tutorials, and code examples that highlight various features of the [OCI Data Science service](https://www.oracle.com/data-science/cloud-infrastructure-data-science.html) and AI services. We welcome your feedback and would like to know what content is useful and what content is missing. Open an [issue](https://github.com/oracle/oci-data-science-ai-samples/issues) to do this. We know that a lot of you are creating great content and we would like to help you share it. See the [contributions](CONTRIBUTING.md) document. \n",
-"Oracle Cloud Infrastructure (OCI) Data Science Services provide a powerful suite of tools for data scientists, enabling faster machine learning model development and deployment. With features like the Accelerated Data Science (ADS) SDK, distributed training, batch processing and machine learning pipelines, OCI Data Science Services offer the scalability and flexibility needed to tackle complex data science and machine learning challenges. Whether you're a beginner or an experienced machine learning practitioner or data scientist, OCI Data Science Services provide the resources you need to build, train, and deploy your models with ease.\n"
+"TensorBoard helps visualizing your experiments. You bring up a ``TensorBoard`` session on your workstation and point to\n",
+"the directory that contains the TensorBoard logs. \n",
+"`OCI` = Oracle Cloud Infrastructure\n",
+"`DT` = Distributed Training\n",
+"`ADS` = Oracle Accelerated Data Science Library\n",
+"`OCIR` = Oracle Cloud Infrastructure Registry\n"
 ]
 }
 ],
@@ -91,20 +95,10 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"/Users/z7ye/miniconda3/envs/ads_testing/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
-" from .autonotebook import tqdm as notebook_tqdm\n"
-]
-}
-],
+"outputs": [],
 "source": [
-"import ads\n",
 "from ads.llm import GenerativeAIEmbeddings\n",
 " \n",
 "oci_embedings = GenerativeAIEmbeddings(\n",
@@ -123,7 +117,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Number of embeddings: 15\n",
+"Number of embeddings: 4\n",
 "Embedding dimensions: 1024\n"
 ]
 }
@@ -185,9 +179,20 @@
 "cell_type": "code",
 "execution_count": 7,
 "metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/plain": [
+"{'acknowledged': True, 'shards_acknowledged': True, 'index': 'tensorboard'}"
+]
+},
+"execution_count": 7,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
 "source": [
-"INDEX_NAME = \"cosine-similarity\"\n",
+"INDEX_NAME = \"tensorboard\"\n",
 "VECTOR_1_NAME = \"embedding_vector\"\n",
 "VECTOR_2_NAME = \"text\"\n",
 " \n",
@@ -254,7 +259,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Step 5: Semantic Search\n",
+"\n",
 "A new query coming in, first calcualte the embedding vector and then conduct a semantic search.\n",
 "\n",
 "- `k`: the number of neighbors the search will return\n",
263268
},
264269
{
265270
"cell_type": "code",
266-
"execution_count": 9,
271+
"execution_count": 15,
267272
"metadata": {},
268273
"outputs": [
269274
{
270275
"name": "stdout",
271276
"output_type": "stream",
272277
"text": [
273-
"Oracle Cloud Infrastructure (OCI) [Data Science Jobs](https://docs.oracle.com/en-us/iaas/data-science/using/jobs-about.htm) is a powerful tool that allows you to define and run repeatable machine learning tasks on a fully managed infrastructure. With Jobs, you have the flexibility to apply custom tasks to meet your specific use cases, such as data preparation, model training, hyperparameter optimization, batch inference, large model training and more. \n",
274-
"On-demand jobs and batch processing are especially important for businesses that need to process large volumes of data on a regular basis, as they enable companies to automate data processing workflows, reduce the need for manual intervention, and save costs associated with running compute resources for extended periods of time. With the ability to define and schedule jobs to run at specific times, businesses can automate their data processing workflows and reduce the need for manual intervention. This helps to improve efficiency, reduce errors, and save valuable time and resources. Additionally, by using a fully managed infrastructure, businesses can ensure that their data processing workflows are secure and compliant with industry regulations. Overall, OCI Data Science Jobs is a powerful tool that can help businesses to scale their machine learning workflows and improve their data processing capabilities.\n"
278+
"It is required that ``tensorboard`` is installed in a dedicated conda environment or virtual environment. Prepare an\n",
279+
"environment yaml file for creating conda environment with following command - \n",
280+
"**tensorboard-dep.yaml**: \n",
281+
"```yaml\n",
282+
"dependencies:\n",
283+
"- python=3.8\n",
284+
"- pip\n",
285+
"- pip:\n",
286+
"- ocifs\n",
287+
"- tensorboard\n",
288+
"name: tensorboard\n",
289+
"``` \n",
290+
"Create the conda environment from the yaml file generated in the preceeding step \n",
291+
"```bash\n",
292+
"conda env create -f tensorboard-dep.yaml\n",
293+
"``` \n",
294+
"This will create a conda environment called tensorboard. Activate the conda environment by running - \n",
295+
"```bash\n",
296+
"conda activate tensorboard\n",
297+
"``` \n",
298+
"**Using TensorBoard Logs:** \n",
299+
"To launch a TensorBoard session on your local workstation, run - \n",
300+
"```bash\n",
301+
"export OCIFS_IAM_KEY=api_key\n",
302+
"tensorboard --logdir oci://my-bucket@my-namespace/path/to/logs\n",
303+
"``` \n",
304+
"`OCIFS_IAM_KEY=api_key` - If you are using resource principal, set `resource_principal` \n",
305+
"This will bring up TensorBoard app on your workstation. Access TensorBoard at ``http://localhost:6006/`` \n",
306+
"**Note**: The logs take some initial time (few minutes) to reflect on the tensorboard dashboard.\n"
275307
]
276308
}
277309
],
278310
"source": [
279-
"query_vector = oci_embedings.embed_query(text=\"oci job\")\n",
311+
"query_vector = oci_embedings.embed_query(text=\"how to set up tensorboard in oci?\")\n",
280312
"query = {\n",
281313
" \"size\": 2,\n",
282-
" \"query\": {\"knn\": {VECTOR_1_NAME: {\"vector\": query_vector, \"k\": 3}}},\n",
283-
"}\n",
284-
" \n",
285-
"response = es.search(body=query, index=INDEX_NAME) # the same as before\n",
286-
"print(response[\"hits\"][\"hits\"][0]['_source']['text'])"
287-
]
288-
},
289-
{
290-
"cell_type": "markdown",
291-
"metadata": {},
292-
"source": [
293-
"#### Comparing Semantic Search with Keywords search\n",
294-
"\n",
295-
"here is the result of keyword search. you can see that since it does not understand the meaning of oci job, it gave irrelevant results."
296-
]
297-
},
298-
{
299-
"cell_type": "code",
300-
"execution_count": 10,
301-
"metadata": {},
302-
"outputs": [
303-
{
304-
"name": "stdout",
305-
"output_type": "stream",
306-
"text": [
307-
"Check out the following resources for more information about the OCI Data Science and AI services: \n",
308-
"* [ADS class documentation](https://accelerated-data-science.readthedocs.io/en/latest/modules.html)\n",
309-
"* [ADS user guide](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n",
310-
"* [AI & Data Science blog](https://blogs.oracle.com/ai-and-datascience/)\n",
311-
"* [OCI Data Science service guide](https://docs.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n",
312-
"* [OCI Data Science service release notes](https://docs.cloud.oracle.com/en-us/iaas/releasenotes/services/data-science/)\n",
313-
"* [YouTube playlist](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n",
314-
"* [OCI Data Labeling Service guide](https://docs.oracle.com/en-us/iaas/data-labeling/data-labeling/using/home.htm)\n",
315-
"* [OCI DLS DP API](https://docs.oracle.com/en-us/iaas/api/#/en/datalabeling-dp/20211001/)\n",
316-
"* [OCI DLS CP API](https://docs.oracle.com/en-us/iaas/api/#/en/datalabeling/20211001/)\n"
317-
]
318-
}
319-
],
320-
"source": [
321-
"query = {\n",
322-
" \"query\": {\n",
323-
" \"match\": {\n",
324-
" \"text\": {\n",
325-
" \"query\": \"oci job\",\n",
326-
" \"analyzer\": \"standard\"\n",
327-
" }\n",
328-
" }\n",
329-
" }\n",
314+
" \"query\": {\"knn\": {VECTOR_1_NAME: {\"vector\": query_vector, \"k\": 2}}},\n",
330315
"}\n",
331316
" \n",
332317
"response = es.search(body=query, index=INDEX_NAME) # the same as before\n",
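
Note on the last hunk: the remaining search cell builds a k-NN query body with a `size` and a `k`, then prints the text of the first hit. The sketch below isolates that pattern so it can be checked without a cluster. It is only an illustration under stated assumptions: the helper names `build_knn_query` and `top_hit_text` are hypothetical and do not appear in the notebook, and the response dictionary is a hand-written stub standing in for the result of `es.search`.

```python
from typing import Any, Dict, List, Optional


def build_knn_query(field: str, vector: List[float], k: int, size: int) -> Dict[str, Any]:
    """Return a k-NN search body shaped like the one in the notebook cell.

    Hypothetical helper: `k` is the number of neighbors requested from the
    vector search, `size` caps how many hits come back in the response.
    """
    if k < 1 or size < 1:
        raise ValueError("k and size must be positive")
    return {
        "size": size,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


def top_hit_text(response: Dict[str, Any]) -> Optional[str]:
    """Return the `text` field of the first hit, or None when there are no hits."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0]["_source"]["text"]


# Stubbed inputs (assumptions): a tiny vector instead of a 1024-dimension
# embedding, and a fake response instead of a live es.search(...) call.
query = build_knn_query("embedding_vector", [0.1, 0.2, 0.3], k=2, size=2)
fake_response = {"hits": {"hits": [{"_source": {"text": "TensorBoard setup steps"}}]}}
print(top_hit_text(fake_response))  # prints: TensorBoard setup steps
```

Guarding against an empty hit list avoids the `IndexError` that indexing `[0]` directly would raise when the index has no matching documents.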
