技術專欄

Oracle Database 23ai 新功能技術介紹- 系列(三) – AI Vector Search

Oracle 甲骨文 企業軟體供應商
2024/09/20

Oracle Database 23ai 引入了 Oracle AI Vector Search,這是一項令人興奮的新功能,您可以透過使用 embedding model 對資料產生向量 (vectors) 並在資料庫中儲存並管理它們,然後提供語意相似性搜尋 (semantic search),以根據含義和上下文查找相關資訊,其流程圖如下:

由於非結構化數據很難直接搜尋,比如:文本檔案、圖像檔、聲音檔與影像檔,因此根據語意查找相關數據的相似性搜尋超越了簡單的關鍵字搜索。 許多行業都需要在大型數據集中進行相似數據的匹配,其應用包括以下:

本篇文章將以非結構化文本檔案的語意搜尋為例,將 Oracle 所提供的 all-MiniLM-L12-v2 模型載入資料庫中 (https://blogs.oracle.com/machinelearning/post/use-our-prebuilt-onnx-model-now-available-for-embedding-generation-in-oracle-database-23ai),透過該模型對文檔中的文字數據向量化,最後將用戶搜尋文字的向量數據與存入表格中的向量化數據進行相似性搜尋:

  1. 載入 Embedding model

這裡使用了 Oracle Database Oracle 23ai free 資料庫版本進行演示,在作業系統目錄 /opt/oracle 下創建一個目錄來存放下載的 model:

mkdir -p /opt/oracle/models

 

cd /opt/oracle/models/

 

wget https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip

 

unzip -oq all_MiniLM_L12_v2_augmented.zip

接著登入該資料庫,創建一個資料庫用戶 testuser1 與建立一個資料庫目錄物件 model_dir 指向實體目錄路徑 /opt/oracle/models,然後授予所需權限給該用戶:

sqlplus sys/"1qaz2wsx#EDC"@localhost:1521/freepdb1 as sysdba

 

create user if not exists testuser1 identified by testuser1 quota unlimited on users;

 

grant create session, db_developer_role, create mining model to testuser1;

 

create or replace directory model_dir as '/opt/oracle/models';

 

grant read, write on directory model_dir to testuser1;

將 Embedding model 透過 DBMS_VECTOR 套件載入資料中:

begin

  dbms_vector.drop_onnx_model (

    model_name => 'ALL_MINILM_L12_V2',

    force => true);

 

  dbms_vector.load_onnx_model (

    directory  => 'model_dir',

    file_name  => 'all_MiniLM_L12_v2.onnx',

    model_name => 'ALL_MINILM_L12_V2');

end;

/

我們可以透過 USER_MINING_MODELS 視圖來查看該 model 的資訊:

column model_name format a30

column algorithm format a10

column mining_function format a15

 

select model_name, algorithm, mining_function

from   user_mining_models

where  model_name = 'ALL_MINILM_L12_V2';

 

MODEL_NAME        ALGORITHM  MINING_FUNCTION

------------------------------ ---------------  ---------------

ALL_MINILM_L12_V2   ONNX       EMBEDDING

 

  1. 產生向量資料

在 model 載入後,我們可以測試使用 VECTOR_EMBEDDING 函數將文字數據產生向量數據,以下以文字 “Quick test” 為例:

set long 1000000

 

select vector_embedding(all_minilm_l12_v2 using 'Quick test' as data) AS my_vector;

 

MY_VECTOR

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[-3.86444256E-002,7.27762803E-002,-6.99377712E-003,-7.29618035E-003,8.81512091E-003,-6.36086613E-002,4.39666817E-003,-4.20215651E-002,-1.32307202E-001,-5.83761605E-003,-1.3236966E-002,-1.62914731E-002

,6.54898351E-003,-4.983522E-002,-1.98450536E-002,-4.69920225E-002,1.03937663E-001,-8.96753445E-002,-2.77861813E-003,4.13947664E-002,-6.51626661E-002,-1.09901905E-001,-8.73053819E-003,2.533352E-002,-1.

42030632E-002,-2.42071245E-002,1.91591978E-002,4.93748812E-003,6.30869251E-003,-1.24127813E-001,-7.17297941E-003,3.7317384E-002,4.97635901E-002,4.52162437E-002,1.49683114E-002,-2.21795831E-002,-3.6793

6082E-002,-6.20233943E-004,7.16803819E-002,5.33913262E-003,1.92087106E-002,-9.91346017E-002,3.90679464E-002,2.22725421E-002,5.04363105E-002,1.81943253E-002,5.34031466E-002,1.44161871E-002,-1.99907795E

-002,-1.20323608E-002,-2.63888389E-002,-4.14667316E-002,6.2473774E-002,-4.68838662E-002,1.16748568E-002,-2.43180972E-002,-3.11982706E-002,-7.5750039E-003,2.25466359E-002,-4.17359956E-002,1.23237111E-0

02,4.31706831E-002,-7.83750787E-002,1.24918511E-002,5.42060807E-002,4.33742851E-002,2.52278382E-003,-1.15482137E-002,-9.98658361E-004,-2.12613102E-002,1.00960173E-002,3.17986645E-002,-1.13146752E-002,

-1.26893371E-002,2.66182758E-002,-7.50683714E-003,-3.70341949E-002,1.94851588E-002,-2.9213747E-002,-2.61210538E-002,2.86212545E-002,-9.15900841E-002,1.50552345E-002,-4.98168021E-002,2.29324233E-002,7.

82517716E-003,4.22972552E-002,3.37974802E-002,-4.2345725E-002,-6.32970557E-002,3.84949856E-002,-1.93851739E-002,1.96233811E-003,-3.91593436E-004,7.80334743E-003,5.63595518E-002,4.45814878E-002,-4.9701

6348E-002,1.36384079E-002,2.76547611E-001,6.3580215E-002,-1.69337653E-002,-3.25948671E-002,2.74621621E-002,-1.84809547E-002,-3.58916223E-002,3.18280957E-003,-3.92074026E-002,-5.03925188E-003,-3.951984

27E-002,2.64224112E-002,5.44404946E-002,-2.97634304E-003,1.06564369E-002,4.55005467E-002,-9.6166715E-002,4.53019142E-002,3.02239861E-002,-1.11025631E-001,6.18582554E-002,8.55141804E-002,-1.51456865E-0

02,-5.64082488E-002,-5.93042118E-004,1.07500188E-001,-6.81523383E-002,1.85917299E-002,3.75312977E-002,-3.27163041E-002,-4.72422279E-002,5.59753366E-002,2.20437758E-002,2.74991542E-002,2.6306238E-002,-

 

MY_VECTOR

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

5.36229946E-002,6.86868653E-003,5.06033236E-003,8.8686645E-002,3.97679023E-002,-1.49526203E-003,-1.11991554E-001,-1.48920557E-002,-1.42183565E-002,5.44682518E-002,-5.46902828E-002,-3.37714851E-002,-3.

93099487E-002,-8.88748933E-003,-2.50034276E-002,-3.86718176E-002,7.15422928E-002,-1.72947664E-002,5.7217218E-002,1.23445597E-002,-6.2534079E-002,-1.97963249E-002,4.08163741E-002,9.22357664E-003,2.3645

6636E-002,-4.27602604E-003,-1.24366455E-001,8.28649253E-002,-5.27119003E-002,-1.11121042E-002,4.35752422E-002,1.35777146E-002,-2.25060452E-002,4.55260389E-002,3.8973894E-002,-8.93306658E-002,1.1729340

3E-001,5.51190414E-002,-2.56631672E-002,-5.30632474E-002,-8.3953537E-002,4.83216904E-003,6.57674596E-002,8.87271464E-002,-1.52742835E-002,1.0525425E-002,-1.5814418E-002,-3.10783181E-002,-2.90690996E-0

02,7.04224128E-003,-3.09849493E-002,-4.46302071E-003,-7.20088482E-002,-7.05658421E-002,4.65546139E-002,1.10276654E-001,3.60872261E-002,1.86069943E-002,-6.10642694E-002,3.21829356E-002,-1.43657662E-002

,-6.75653145E-002,8.0748558E-002,1.68782603E-002,-1.0059043E-001,-7.55800977E-002,-1.69591643E-002,-4.4571083E-002,-8.6054299E-003,4.3378789E-002,4.29520719E-002,3.94066535E-002,8.67496245E-003,-8.521

2335E-002,1.20206423E-001,-1.14268221E-001,-1.7028559E-002,8.87670461E-003,-4.69081141E-002,-3.02138384E-002,4.61057387E-002,-4.92519997E-002,1.5618098E-002,-9.27053913E-002,-6.08293712E-002,1.4645159

2E-002,-1.84691269E-002,-1.40407518E-001,5.35490997E-002,5.85880894E-033,7.62652457E-002,-3.07706036E-002,-6.74770074E-003,1.03074081E-001,7.20860362E-002,-9.75818709E-002,1.51840553E-001,7.43321329E-

002,-2.99238227E-002,9.39517915E-002,1.50299724E-002,4.35530096E-002,-7.5808214E-003,-7.49262646E-002,-5.07647395E-002,4.01099063E-002,-7.43360221E-002,4.62087579E-002,9.61421523E-003,3.15185694E-004,

6.20259941E-002,1.57011077E-002,3.29307318E-002,5.69748059E-002,-7.89974183E-002,9.78372246E-003,1.16776042E-002,-3.65987606E-002,-5.30387387E-002,-1.22491773E-002,5.65313101E-002,3.41438241E-002,-4.2

6849052E-002,9.84478816E-002,1.52461289E-003,-6.92429617E-002,9.64930356E-002,-1.85021386E-002,4.28027324E-002,-4.41830754E-002,-2.54553296E-002,5.20384498E-002,-1.38082858E-002,-1.59469545E-002,2.100

 

MY_VECTOR

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

03126E-002,-1.85695048E-002,2.29395889E-002,1.91418324E-002,4.09490801E-002,2.35137548E-002,-3.91654707E-002,3.57466713E-002,4.80409227E-002,-1.02699148E-002,1.45040629E-002,-4.27465364E-002,-5.092588

44E-002,-7.12790638E-002,-9.19121876E-002,1.79740377E-002,-3.53490934E-002,-2.26370133E-002,1.6497435E-002,1.05952598E-001,-3.52565758E-002,-3.41151431E-002,-5.7282865E-002,-3.10265832E-002,6.97579682

E-002,-2.50360705E-002,-3.91423441E-002,1.37532474E-002,6.76135859E-003,-5.15896529E-002,-3.55789214E-002,6.91288933E-002,-3.4734223E-002,-1.098355E-002,-2.45214198E-002,-3.24611887E-002,1.08489497E-0

02,8.21795501E-003,-2.97738295E-002,4.81432006E-002,-5.78631125E-002,2.8562462E-002,4.02920581E-002,2.82907318E-002,-3.79493348E-002,8.59354064E-003,1.02058174E-002,2.48055868E-002,7.16195162E-003,-6.

24535196E-002,-3.25725861E-002,4.26037687E-033,-7.57265091E-003,-4.15650047E-002,-4.9813509E-002,1.02479653E-002,3.28872614E-002,1.50397036E-003,-6.39198944E-002,-7.53579438E-002,-2.46183965E-002,-3.0

6450091E-002,4.16100211E-002,7.04020783E-002,-8.15085769E-002,2.55300757E-002,1.89818796E-002,4.26408388E-002,-2.19986811E-002,7.13623175E-003,-3.42554823E-002,3.70062445E-003,-3.15255509E-003,1.41580

394E-002,5.00133969E-002,7.54985511E-002,6.42605647E-002,7.55612105E-002,1.52721936E-002,1.15661152E-001,-2.45987345E-002,1.08358106E-002,5.02406769E-002,6.28810748E-002,-5.52952401E-002,-5.5196926E-0

02,-4.60026506E-003,-1.46539938E-002,6.40283972E-002,5.18338159E-002,2.51765456E-002,6.45218045E-002,-8.35603252E-002,3.57579924E-002,6.28178008E-003,3.15947039E-003,2.320843E-002,4.76812162E-002,-5.4

7788292E-003,-1.06323622E-001,-1.45862857E-002,-5.92180677E-002,-1.59236323E-002,-1.90922078E-002,4.61261906E-002,2.41158754E-002,-7.90221989E-003,1.11448206E-001,1.11205513E-002,-2.0573806E-002,-4.08

658385E-002,5.5462148E-002,5.37177995E-002,4.76263314E-002,-3.29908058E-002,4.37314026E-002]

接著,我們需要一些資料來測試相似性搜尋,我們建立一個表格 movie_quotes 並將目錄中的文字檔案 movie_quotes.csv 載入至該表格中,該文字檔中包含電影台詞,也就是預計進行相似性搜尋的數據集:

cd /opt/oracle/models/

 

wget https://huggingface.co/datasets/ygorgeurts/movie-quotes/resolve/main/movie_quotes.csv?download=true -O movie_quotes.csv

 

sqlplus testuser1/testuser1@localhost:1521/freepdb1

 

drop table if exists movie_quotes purge;

 

create table movie_quotes as

select movie_quote, movie, movie_type, movie_year

from   external (

         (

           movie_quote  varchar2(400),

           movie        varchar2(200),

           movie_type   varchar2(50),

           movie_year   number(4)

         )

         type oracle_loader

         default directory model_dir

         access parameters (

           records delimited by newline

           skip 1

           badfile model_dir

           logfile model_dir:'moview_quotes_ext_tab_%a_%p.log'

           discardfile model_dir

           fields csv with embedded terminated by ',' optionally enclosed by '"'

           missing field values are null

           (

             movie_quote char(400),

             movie,

             movie_type,

             movie_year

           )

        )

        location ('movie_quotes.csv')

        reject limit unlimited

      );

 

該表格目前有四個欄位,其中欄位movie_quote 儲存電影台詞,我們將對該欄位產生向量數據,並儲存至新增的欄位 movie_quote_vector,該新增欄位的數據類型為 vector:

alter table movie_quotes add (

  movie_quote_vector vector

);

 

SQL> desc movie_quotes

 Name                       Null?  Type

 ----------------------------------------- -------- ----------------------------

 MOVIE_QUOTE                     VARCHAR2(400)

 MOVIE                            VARCHAR2(200)

 MOVIE_TYPE                       VARCHAR2(50)

 MOVIE_YEAR                       NUMBER(4)

 MOVIE_QUOTE_VECTOR             VECTOR(*, *)

 

update movie_quotes

set    movie_quote_vector = vector_embedding(all_minilm_l12_v2 using movie_quote as data);

 

commit;

 

  1. 使用 VECTOR_DISTANCE 函數進行相似性搜尋

該函數接受兩個向量為參數,並返回它們之間的距離,由於我們以文字數據產生向量,因此我們預期兩個向量的距離會更小。 在以下範例中,我們將提問文字 (user question) 建立向量數據,並將提問文字的向量數據與 movie_quote_vector 欄位中的向量數據進行相似性搜尋,並對查詢輸出進行排序。

首先,我們提問問題 "Films with motivational speaking in them" 以查找有勵志演講的電影:

variable search_text varchar2(100);

exec :search_text := 'Films with motivational speaking in them';

 

set linesize 200

column movie format a50

column movie_quote format a100

 

SELECT vector_distance(movie_quote_vector, (vector_embedding(all_minilm_l12_v2 using :search_text as data))) as distance,

       movie,

       movie_quote

FROM   movie_quotes

order by 1

fetch approximate first 5 rows only;

 

  DISTANCE MOVIE                        MOVIE_QUOTE

---------- -------------------------------------------------- ----------------------------------------------------------------------------------------------------

6.786E-001 Once Upon a Time in Hollywood   That was the best acting i've ever seen in my whole life.

6.979E-001 Dead Poets Society              You must strive to find your own voice because the longer you wait to begin, the less likely you are

                                         going to find it at all.

 

7.169E-001 The Pursuit of Happyness        Walk that walk and go forward all the time. Don't just talk that talk, walk it and go forward. Also,

                                        the walk didn't have to be long strides; baby steps counted too. Go forward.

 

7.186E-001 Joker                         My mother always tells me to smile and put on a happy face. She told me I had a purpose to bring lau

                                        ghter and joy to the world.

 

7.234E-001 Blazing Saddles               Men, you are about to embark on a great crusade to stamp out runaway decency in the west. Now you me

 

  DISTANCE MOVIE                      MOVIE_QUOTE

---------- -------------------------------------------------- ----------------------------------------------------------------------------------------------------

                                       n will only be risking your lives, whilst I will be risking an almost certain Academy Award nominati

                                       on for Best Supporting Actor.

接著提問 "Films about war" 以查找有關戰爭的電影:

variable search_text varchar2(100);

exec :search_text := 'Films about war';

 

set linesize 200

column movie format a50

column movie_quote format a100

 

SELECT vector_distance(movie_quote_vector, (vector_embedding(all_minilm_l12_v2 using :search_text as data))) as distance,

       movie,

       movie_quote

FROM   movie_quotes

order by 1

fetch approximate first 5 rows only;

 

  DISTANCE MOVIE                   MOVIE_QUOTE

---------- -------------------------------------------------- ----------------------------------------------------------------------------------------------------

5.682E-001 Dr. Strangelove             Gentlemen, you can't fight in here! This is the War Room!

6.346E-001 Blazing Saddles            Men, you are about to embark on a great crusade to stamp out runaway decency in the west. Now you me

                                    n will only be risking your lives, whilst I will be risking an almost certain Academy Award nominati

                                    on for Best Supporting Actor.

 

6.587E-001 Fury                      Ideals are peaceful; history is violent.

7.243E-001 The Kill Team              You give me your loyalty, and I?ll guarantee that each and every one of you will have a chance to be

                                    a warrior, to actually be a part of history.

 

7.253E-001 Dr. No                    Bond. James Bond

以上步驟簡單演示了 Oracle AI Vector Search 的過程,其相似性搜尋的結果還可以進一步被 RAG 使用,來提高大型語言模型對企業資料問題的回應準確性,更多的資訊請參考以下連結或是洽詢邁達特,我們將有專員提供咨詢

[1] Oracle AI Vector Search – Introduction - https://www.oracle.com/tw/database/ai-vector-search/

[2] Oracle AI Vector Search – Basics - https://apexapps.oracle.com/pls/apex/f?p=133:180:10942923485662::::wid:1070

[3] AI Vector Search - Complete RAG Application using PL/SQL in Oracle Database 23ai -https://apexapps.oracle.com/pls/apex/f?p=133:180:10942923485662::::wid:3934

聯絡 我們