課程網頁
Chap 7
文件的前置處理
這個是把 input 的東西,轉成一個一個的 word/token
-
自動 indexing
數字不太好,但是會有像是 B6, B12 這種東西
連字號要不要拆開的問題:拆開會增加 recall 減低 precision
-
大小寫:保留可以增加 precision 減低 recall
一般的商用系統是以 recall 為主,所以說會保留數字當 index,但不區分大小寫
query 中可以包含一些運算符號或者是 grouping 的東西
消除 stopword
避免抓出一堆垃圾
最好的作法是在做語法分析的時候就幹掉
stemming
其實大多數的 search engine 不用
考量是正確性,擷取的好壞跟壓縮率
做的時間點
找 stem 的方式
找出一堆字根變化的規則,然後砍掉(Porter algo),在 match 的時候可以分為 longest match 或者 partial matching(只考慮字根的前幾個字)。
successor variety 是說在一個 word 中,每一個字母後面可以接的字母的數量(只考慮現在你有的文件們),而通常在 stem 的時候,會有一個 peak 跳出來,因此找 stem 的方式有三種:cut-off(先定義好 successor variety 超過多少的時候就算是 stem)/ peak and plateau(找突然跳起來的 peak!)/ complete word
-
-
-
table lookup(查表啦 XD)
n-gram
先算字跟字共用的 n-gram,算法是兩個字共用的 n-gram 數量 * 2 / 兩個字的 n-gram 個數,然後造出一個 matrix,用 single link 的方式 cluster 連起來,於是就可以找到 stem。這個比較像是 term cluster 的東西。
Chap 8
indexing
searching
Project
done
On P4 2.8G, 2G memory
read file, extract <TI>, <TEXT> (extract-text.py)
extract terms per document into DOCID.terms (extract-term.py)
標點符號
stop word
also counts term frequency
stemming
indexing (invert file Q_Q)
10m with pysco
165m36.654s (~ 1G?), 100 dbs (per tf,idf,inv)
not yet
目錄結構
實驗
exp1
exp2
FBIS3
tf: hash, idf:btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
idf is calculated in the final
87m2.240s (1G)
rev 2042
exp3
FBIS3
tf: hash, idf:btree, inv: btree, NUM_DB = 50
tf key: docno
inv_db ync per 100 documents
tf sync in the final
idf is calculated in the final
108m11.242s (942M)
rev 2045
exp4
FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
67m (1.1G)
rev 2048
exp5
ALL PREVIOUS EXP IS NOT USABLE check log of rev 2052
FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
84m (1.3G)
rev 2053
exp6
FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
26m28.852s (430M)
search: 34.784s (480176 264384)
disk: 250M
rev 2063
exp7
FBIS3+FBIS4
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
90m37s (822M)
search: 1m7.741s (721724 435184)
490M
rev 2063
exp8
FBIS3+FBIS4-inc
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
53m12.581s (560M) (incremental only)
search: 1m5.634s (723380 429724)
490M (slight larger than exp7)
rev 2068
exp9
FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
1557s (430M)
search: 34.728s (480192 262708)
disk: 244M
rev 2074
exp10
FBIS3+FBIS4
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
5277s (822M)
search: 1m8.774s (721740 434736)
478M
rev 2074
exp11
FBIS3+FBIS4-inc
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
1728s (560M) (incremental only)
search: 1m7.604s (497028 271148)
503M (slight larger than exp7)
rev 2074
exp12
FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
stem cache
1192s (430M)
search: 34.728s (488440 451364)
disk: 242M
rev 2075
exp13
FBIS3+FBIS4
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
stem cache
4633s (870MB)
search: 34.728s (488440 451364)
disk: 473M
rev 2075
exp14
FBIS3+FBIS4 (inc)
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (>
2
stem cache
2773s (606MB) (inc)
search: 34.728s (488440 451364)
disk: 475M
rev 2075