Oracle Database 23ai 引入了 Oracle AI Vector Search,這是一項令人興奮的新功能,您可以透過使用 embedding model 對資料產生向量 (vectors) 並在資料庫中儲存並管理它們,然後提供語意相似性搜尋 (semantic search),以根據含義和上下文查找相關資訊,其流程圖如下:
由於非結構化數據很難直接搜尋,比如:文本檔案、圖像檔、聲音檔與影像檔,因此根據語意查找相關數據的相似性搜尋超越了簡單的關鍵字搜索。 許多行業都需要在大型數據集中進行相似數據的匹配,其應用包括以下:
本篇文章將以非結構化文本檔案的語意搜尋為例,將 Oracle 所提供的 all-MiniLM-L12-v2 模型載入資料庫中 (https://blogs.oracle.com/machinelearning/post/use-our-prebuilt-onnx-model-now-available-for-embedding-generation-in-oracle-database-23ai),透過該模型對文檔中的文字數據向量化,最後將用戶搜尋文字的向量數據與存入表格中的向量化數據進行相似性搜尋:
這裡使用了 Oracle Database Oracle 23ai free 資料庫版本進行演示,在作業系統目錄 /opt/oracle 下創建一個目錄來存放下載的 model:
mkdir -p /opt/oracle/models
cd /opt/oracle/models/
wget https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/VBRD9P8ZFWkKvnfhrWxkpPe8K03-JIoM5h_8EJyJcpE80c108fuUjg7R5L5O7mMZ/n/adwc4pm/b/OML-Resources/o/all_MiniLM_L12_v2_augmented.zip
unzip -oq all_MiniLM_L12_v2_augmented.zip |
接著登入該資料庫,創建一個資料庫用戶 testuser1 與建立一個資料庫目錄物件 model_dir 指向實體目錄路徑 /opt/oracle/models,然後授予所需權限給該用戶:
sqlplus sys/"1qaz2wsx#EDC"@localhost:1521/freepdb1 as sysdba
create user if not exists testuser1 identified by testuser1 quota unlimited on users;
grant create session, db_developer_role, create mining model to testuser1;
create or replace directory model_dir as '/opt/oracle/models';
grant read, write on directory model_dir to testuser1; |
將 Embedding model 透過 DBMS_VECTOR 套件載入資料中:
begin dbms_vector.drop_onnx_model ( model_name => 'ALL_MINILM_L12_V2', force => true);
dbms_vector.load_onnx_model ( directory => 'model_dir', file_name => 'all_MiniLM_L12_v2.onnx', model_name => 'ALL_MINILM_L12_V2'); end; / |
我們可以透過 USER_MINING_MODELS 視圖來查看該 model 的資訊:
column model_name format a30 column algorithm format a10 column mining_function format a15
select model_name, algorithm, mining_function from user_mining_models where model_name = 'ALL_MINILM_L12_V2';
MODEL_NAME ALGORITHM MINING_FUNCTION ------------------------------ --------------- --------------- ALL_MINILM_L12_V2 ONNX EMBEDDING |
在 model 載入後,我們可以測試使用 VECTOR_EMBEDDING 函數將文字數據產生向量數據,以下以文字 “Quick test” 為例:
set long 1000000
select vector_embedding(all_minilm_l12_v2 using 'Quick test' as data) AS my_vector;
MY_VECTOR -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [-3.86444256E-002,7.27762803E-002,-6.99377712E-003,-7.29618035E-003,8.81512091E-003,-6.36086613E-002,4.39666817E-003,-4.20215651E-002,-1.32307202E-001,-5.83761605E-003,-1.3236966E-002,-1.62914731E-002 ,6.54898351E-003,-4.983522E-002,-1.98450536E-002,-4.69920225E-002,1.03937663E-001,-8.96753445E-002,-2.77861813E-003,4.13947664E-002,-6.51626661E-002,-1.09901905E-001,-8.73053819E-003,2.533352E-002,-1. 42030632E-002,-2.42071245E-002,1.91591978E-002,4.93748812E-003,6.30869251E-003,-1.24127813E-001,-7.17297941E-003,3.7317384E-002,4.97635901E-002,4.52162437E-002,1.49683114E-002,-2.21795831E-002,-3.6793 6082E-002,-6.20233943E-004,7.16803819E-002,5.33913262E-003,1.92087106E-002,-9.91346017E-002,3.90679464E-002,2.22725421E-002,5.04363105E-002,1.81943253E-002,5.34031466E-002,1.44161871E-002,-1.99907795E -002,-1.20323608E-002,-2.63888389E-002,-4.14667316E-002,6.2473774E-002,-4.68838662E-002,1.16748568E-002,-2.43180972E-002,-3.11982706E-002,-7.5750039E-003,2.25466359E-002,-4.17359956E-002,1.23237111E-0 02,4.31706831E-002,-7.83750787E-002,1.24918511E-002,5.42060807E-002,4.33742851E-002,2.52278382E-003,-1.15482137E-002,-9.98658361E-004,-2.12613102E-002,1.00960173E-002,3.17986645E-002,-1.13146752E-002, -1.26893371E-002,2.66182758E-002,-7.50683714E-003,-3.70341949E-002,1.94851588E-002,-2.9213747E-002,-2.61210538E-002,2.86212545E-002,-9.15900841E-002,1.50552345E-002,-4.98168021E-002,2.29324233E-002,7. 82517716E-003,4.22972552E-002,3.37974802E-002,-4.2345725E-002,-6.32970557E-002,3.84949856E-002,-1.93851739E-002,1.96233811E-003,-3.91593436E-004,7.80334743E-003,5.63595518E-002,4.45814878E-002,-4.9701 6348E-002,1.36384079E-002,2.76547611E-001,6.3580215E-002,-1.69337653E-002,-3.25948671E-002,2.74621621E-002,-1.84809547E-002,-3.58916223E-002,3.18280957E-003,-3.92074026E-002,-5.03925188E-003,-3.951984 27E-002,2.64224112E-002,5.44404946E-002,-2.97634304E-003,1.06564369E-002,4.55005467E-002,-9.6166715E-002,4.53019142E-002,3.02239861E-002,-1.11025631E-001,6.18582554E-002,8.55141804E-002,-1.51456865E-0 02,-5.64082488E-002,-5.93042118E-004,1.07500188E-001,-6.81523383E-002,1.85917299E-002,3.75312977E-002,-3.27163041E-002,-4.72422279E-002,5.59753366E-002,2.20437758E-002,2.74991542E-002,2.6306238E-002,-
MY_VECTOR -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 5.36229946E-002,6.86868653E-003,5.06033236E-003,8.8686645E-002,3.97679023E-002,-1.49526203E-003,-1.11991554E-001,-1.48920557E-002,-1.42183565E-002,5.44682518E-002,-5.46902828E-002,-3.37714851E-002,-3. 93099487E-002,-8.88748933E-003,-2.50034276E-002,-3.86718176E-002,7.15422928E-002,-1.72947664E-002,5.7217218E-002,1.23445597E-002,-6.2534079E-002,-1.97963249E-002,4.08163741E-002,9.22357664E-003,2.3645 6636E-002,-4.27602604E-003,-1.24366455E-001,8.28649253E-002,-5.27119003E-002,-1.11121042E-002,4.35752422E-002,1.35777146E-002,-2.25060452E-002,4.55260389E-002,3.8973894E-002,-8.93306658E-002,1.1729340 3E-001,5.51190414E-002,-2.56631672E-002,-5.30632474E-002,-8.3953537E-002,4.83216904E-003,6.57674596E-002,8.87271464E-002,-1.52742835E-002,1.0525425E-002,-1.5814418E-002,-3.10783181E-002,-2.90690996E-0 02,7.04224128E-003,-3.09849493E-002,-4.46302071E-003,-7.20088482E-002,-7.05658421E-002,4.65546139E-002,1.10276654E-001,3.60872261E-002,1.86069943E-002,-6.10642694E-002,3.21829356E-002,-1.43657662E-002 ,-6.75653145E-002,8.0748558E-002,1.68782603E-002,-1.0059043E-001,-7.55800977E-002,-1.69591643E-002,-4.4571083E-002,-8.6054299E-003,4.3378789E-002,4.29520719E-002,3.94066535E-002,8.67496245E-003,-8.521 2335E-002,1.20206423E-001,-1.14268221E-001,-1.7028559E-002,8.87670461E-003,-4.69081141E-002,-3.02138384E-002,4.61057387E-002,-4.92519997E-002,1.5618098E-002,-9.27053913E-002,-6.08293712E-002,1.4645159 2E-002,-1.84691269E-002,-1.40407518E-001,5.35490997E-002,5.85880894E-033,7.62652457E-002,-3.07706036E-002,-6.74770074E-003,1.03074081E-001,7.20860362E-002,-9.75818709E-002,1.51840553E-001,7.43321329E- 002,-2.99238227E-002,9.39517915E-002,1.50299724E-002,4.35530096E-002,-7.5808214E-003,-7.49262646E-002,-5.07647395E-002,4.01099063E-002,-7.43360221E-002,4.62087579E-002,9.61421523E-003,3.15185694E-004, 6.20259941E-002,1.57011077E-002,3.29307318E-002,5.69748059E-002,-7.89974183E-002,9.78372246E-003,1.16776042E-002,-3.65987606E-002,-5.30387387E-002,-1.22491773E-002,5.65313101E-002,3.41438241E-002,-4.2 6849052E-002,9.84478816E-002,1.52461289E-003,-6.92429617E-002,9.64930356E-002,-1.85021386E-002,4.28027324E-002,-4.41830754E-002,-2.54553296E-002,5.20384498E-002,-1.38082858E-002,-1.59469545E-002,2.100
MY_VECTOR -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 03126E-002,-1.85695048E-002,2.29395889E-002,1.91418324E-002,4.09490801E-002,2.35137548E-002,-3.91654707E-002,3.57466713E-002,4.80409227E-002,-1.02699148E-002,1.45040629E-002,-4.27465364E-002,-5.092588 44E-002,-7.12790638E-002,-9.19121876E-002,1.79740377E-002,-3.53490934E-002,-2.26370133E-002,1.6497435E-002,1.05952598E-001,-3.52565758E-002,-3.41151431E-002,-5.7282865E-002,-3.10265832E-002,6.97579682 E-002,-2.50360705E-002,-3.91423441E-002,1.37532474E-002,6.76135859E-003,-5.15896529E-002,-3.55789214E-002,6.91288933E-002,-3.4734223E-002,-1.098355E-002,-2.45214198E-002,-3.24611887E-002,1.08489497E-0 02,8.21795501E-003,-2.97738295E-002,4.81432006E-002,-5.78631125E-002,2.8562462E-002,4.02920581E-002,2.82907318E-002,-3.79493348E-002,8.59354064E-003,1.02058174E-002,2.48055868E-002,7.16195162E-003,-6. 24535196E-002,-3.25725861E-002,4.26037687E-033,-7.57265091E-003,-4.15650047E-002,-4.9813509E-002,1.02479653E-002,3.28872614E-002,1.50397036E-003,-6.39198944E-002,-7.53579438E-002,-2.46183965E-002,-3.0 6450091E-002,4.16100211E-002,7.04020783E-002,-8.15085769E-002,2.55300757E-002,1.89818796E-002,4.26408388E-002,-2.19986811E-002,7.13623175E-003,-3.42554823E-002,3.70062445E-003,-3.15255509E-003,1.41580 394E-002,5.00133969E-002,7.54985511E-002,6.42605647E-002,7.55612105E-002,1.52721936E-002,1.15661152E-001,-2.45987345E-002,1.08358106E-002,5.02406769E-002,6.28810748E-002,-5.52952401E-002,-5.5196926E-0 02,-4.60026506E-003,-1.46539938E-002,6.40283972E-002,5.18338159E-002,2.51765456E-002,6.45218045E-002,-8.35603252E-002,3.57579924E-002,6.28178008E-003,3.15947039E-003,2.320843E-002,4.76812162E-002,-5.4 7788292E-003,-1.06323622E-001,-1.45862857E-002,-5.92180677E-002,-1.59236323E-002,-1.90922078E-002,4.61261906E-002,2.41158754E-002,-7.90221989E-003,1.11448206E-001,1.11205513E-002,-2.0573806E-002,-4.08 658385E-002,5.5462148E-002,5.37177995E-002,4.76263314E-002,-3.29908058E-002,4.37314026E-002] |
接著,我們需要一些資料來測試相似性搜尋,我們建立一個表格 movie_quotes 並將目錄中的文字檔案 movie_quotes.csv 載入至該表格中,該文字檔中包含電影台詞,也就是預計進行相似性搜尋的數據集:
cd /opt/oracle/models/
wget https://huggingface.co/datasets/ygorgeurts/movie-quotes/resolve/main/movie_quotes.csv?download=true -O movie_quotes.csv
sqlplus testuser1/testuser1@localhost:1521/freepdb1
drop table if exists movie_quotes purge;
create table movie_quotes as select movie_quote, movie, movie_type, movie_year from external ( ( movie_quote varchar2(400), movie varchar2(200), movie_type varchar2(50), movie_year number(4) ) type oracle_loader default directory model_dir access parameters ( records delimited by newline skip 1 badfile model_dir logfile model_dir:'moview_quotes_ext_tab_%a_%p.log' discardfile model_dir fields csv with embedded terminated by ',' optionally enclosed by '"' missing field values are null ( movie_quote char(400), movie, movie_type, movie_year ) ) location ('movie_quotes.csv') reject limit unlimited );
|
該表格目前有四個欄位,其中欄位movie_quote 儲存電影台詞,我們將對該欄位產生向量數據,並儲存至新增的欄位 movie_quote_vector,該新增欄位的數據類型為 vector:
alter table movie_quotes add ( movie_quote_vector vector );
SQL> desc movie_quotes Name Null? Type ----------------------------------------- -------- ---------------------------- MOVIE_QUOTE VARCHAR2(400) MOVIE VARCHAR2(200) MOVIE_TYPE VARCHAR2(50) MOVIE_YEAR NUMBER(4) MOVIE_QUOTE_VECTOR VECTOR(*, *)
update movie_quotes set movie_quote_vector = vector_embedding(all_minilm_l12_v2 using movie_quote as data);
commit; |
該函數接受兩個向量為參數,並返回它們之間的距離,由於我們以文字數據產生向量,因此我們預期兩個向量的距離會更小。 在以下範例中,我們將提問文字 (user question) 建立向量數據,並將提問文字的向量數據與 movie_quote_vector 欄位中的向量數據進行相似性搜尋,並對查詢輸出進行排序。
首先,我們提問問題 "Films with motivational speaking in them" 以查找有勵志演講的電影:
variable search_text varchar2(100); exec :search_text := 'Films with motivational speaking in them';
set linesize 200 column movie format a50 column movie_quote format a100
SELECT vector_distance(movie_quote_vector, (vector_embedding(all_minilm_l12_v2 using :search_text as data))) as distance, movie, movie_quote FROM movie_quotes order by 1 fetch approximate first 5 rows only;
DISTANCE MOVIE MOVIE_QUOTE ---------- -------------------------------------------------- ---------------------------------------------------------------------------------------------------- 6.786E-001 Once Upon a Time in Hollywood That was the best acting i've ever seen in my whole life. 6.979E-001 Dead Poets Society You must strive to find your own voice because the longer you wait to begin, the less likely you are going to find it at all.
7.169E-001 The Pursuit of Happyness Walk that walk and go forward all the time. Don't just talk that talk, walk it and go forward. Also, the walk didn't have to be long strides; baby steps counted too. Go forward.
7.186E-001 Joker My mother always tells me to smile and put on a happy face. She told me I had a purpose to bring lau ghter and joy to the world.
7.234E-001 Blazing Saddles Men, you are about to embark on a great crusade to stamp out runaway decency in the west. Now you me
DISTANCE MOVIE MOVIE_QUOTE ---------- -------------------------------------------------- ---------------------------------------------------------------------------------------------------- n will only be risking your lives, whilst I will be risking an almost certain Academy Award nominati on for Best Supporting Actor. |
接著提問 "Films about war" 以查找有關戰爭的電影:
variable search_text varchar2(100); exec :search_text := 'Films about war';
set linesize 200 column movie format a50 column movie_quote format a100
SELECT vector_distance(movie_quote_vector, (vector_embedding(all_minilm_l12_v2 using :search_text as data))) as distance, movie, movie_quote FROM movie_quotes order by 1 fetch approximate first 5 rows only;
DISTANCE MOVIE MOVIE_QUOTE ---------- -------------------------------------------------- ---------------------------------------------------------------------------------------------------- 5.682E-001 Dr. Strangelove Gentlemen, you can't fight in here! This is the War Room! 6.346E-001 Blazing Saddles Men, you are about to embark on a great crusade to stamp out runaway decency in the west. Now you me n will only be risking your lives, whilst I will be risking an almost certain Academy Award nominati on for Best Supporting Actor.
6.587E-001 Fury Ideals are peaceful; history is violent. 7.243E-001 The Kill Team You give me your loyalty, and I?ll guarantee that each and every one of you will have a chance to be a warrior, to actually be a part of history.
7.253E-001 Dr. No Bond. James Bond |
以上步驟簡單演示了 Oracle AI Vector Search 的過程,其相似性搜尋的結果還可以進一步被 RAG 使用,來提高大型語言模型對企業資料問題的回應準確性,更多的資訊請參考以下連結或是洽詢邁達特,我們將有專員提供咨詢。
[1] Oracle AI Vector Search – Introduction - https://www.oracle.com/tw/database/ai-vector-search/
[2] Oracle AI Vector Search – Basics - https://apexapps.oracle.com/pls/apex/f?p=133:180:10942923485662::::wid:1070
[3] AI Vector Search - Complete RAG Application using PL/SQL in Oracle Database 23ai -https://apexapps.oracle.com/pls/apex/f?p=133:180:10942923485662::::wid:3934