Table of Contents
Chap 7
文件的前置處理
- 語法分析 (lexical analysis)
這個是把 input 的東西,轉成一個一個的 word/token
-
- 自動 indexing
- 數字不太好,但是會有像是 B6, B12 這種東西
- 連字號要不要拆開的問題:拆開會增加 recall 減低 precision
- 念法的標記,通常沒用,像是 OS/2
- 大小寫:保留可以增加 precision 減低 recall
一般的商用系統是以 recall 為主,所以說會保留數字當 index,但不區分大小寫
-
- 處理 query
query 中可以包含一些運算符號或者是 grouping 的東西
- 消除 stopword
- 避免抓出一堆垃圾
- 最好的作法是在做語法分析的時候就幹掉
- stemming
- 其實大多數的 search engine 不用
- 考量是正確性,擷取的好壞跟壓縮率
- 做的時間點
- index 的時候,好處是比較有效率跟好的壓縮,但是一些資訊就會消失
- search 的時候,把類似的結果都丟出來,給使用者選
- 找 stem 的方式
- 手動
- stemmers (自動)
- affix removel
找出一堆字根變化的規則,然後砍掉(Porter algo),在 match 的時候可以分為 longest match 或者 partial matching(只考慮字根的前幾個字)。
-
-
-
- successor variety
-
-
successor variety 是說在一個 word 中,每一個字母後面可以接的字母的數量(只考慮現在你有的文件們),而通常在 stem 的時候,會有一個 peak 跳出來,因此找 stem 的方式有三種:cut-off(先定義好 successor variety 超過多少的時候就算是 stem)/ peak and plateau(找突然跳起來的 peak!)/ complete word
-
-
-
- table lookup(查表啦 XD)
- n-gram
-
-
先算字跟字共用的 n-gram,算法是兩個字共用的 n-gram 數量 * 2 / 兩個字的 n-gram 個數,然後造出一個 matrix,用 single link 的方式 cluster 連起來,於是就可以找到 stem。這個比較像是 term cluster 的東西。
Chap 8
indexing
- 照字母順序的
- sorted indices
- inverted files
- 每篇文章列出一些 keywords,每個 keyword 都有 relevance weight,那 inverted file 就是以 keyword 當 key,會連到有出現這個 keyword 的文件去
- 缺點是 size 會是 text size 的 10%~100%,而且原本文件改過後,應重新做
- inverted list 是記錄每個字出現在文章的那邊,file 只有記說出現在那個文件,那也可以用 block 為單位來記出現的地方,而一個 block 看需要甚至可以是一份文件
- 實際上可以用 sorted array, b-tree, tries, hasing 或者綜合的方式來實作
- tries: 字彙樹,依照有的字彙,從第一個字母開始造樹,leaf 就是真正的字跟出現的地方
- b-tree: 就是那個 b-tree
- sorted array: 照 term sorted 去查 term 出現在那個文件裡面,但是這樣子很慢,所以我們可以把出現在那份文件跟位置的資訊存在另外一個檔案(posting file)中,查 term 的時候就是查這個檔的 offset,然後跑過去。這樣子有點糟糕,有一種解法是在造那個 dictionary 的時候,用 right-threaded bst(這是啥?),來避免 sorting。
- fast inversion algorithm: 兩個原則,盡量用 memory 來處理東西跟合併,還有直接用原本 input 的次序就好,sort 大東西 cost 太高(even O(nlgn) 的)。作法是先準備一份照文件跟 concept no. (in each doc) 的 array,然後 merge 起來成每個 concept 在那些文件(一樣用 posting file offset)(其實有點像 DBMS 的 sort-merge),利用這個表來決定原本的 doc 要怎麼切割(依照 conecpt # 來分),最後就是把每一個小份做 inverted file
- Patricia (PAT) trees (similar to suffix tree and array)
- cluster file (chap 7)
- 用 hash 的
- signature files
searching
- 在 query 中找出 words 跟 patterns (vocabulary search)
- 抓出相關的文章
- 算出 query 中 word/pattern 真正出現的地方
Project
done
On P4 2.8G, 2G memory
- read file, extract <TI>, <TEXT> (extract-text.py)
- 3m12.629s with pysco
- extract terms per document into DOCID.terms (extract-term.py)
- 標點符號
- stop word
- also counts term frequency
- stemming
- indexing (invert file Q_Q)
- 10m with pysco
- 165m36.654s (~ 1G?), 100 dbs (per tf,idf,inv)
not yet
- ranking… (basic vector model?)
目錄結構
- 照 set 跟 doc no 分
FBIS[34]/[0-9]{2}/doc no
實驗
exp1
- FBIS3
- tf: hash, idf:btree, inv: btree, NUM_DB = 100
- tf key: docno:term
- 165m36.654s (> 1G?)
exp2
- FBIS3
- tf: hash, idf:btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 100 documents
- idf is calculated in the final
- 87m2.240s (1G)
- rev 2042
exp3
- FBIS3
- tf: hash, idf:btree, inv: btree, NUM_DB = 50
- tf key: docno
- inv_db ync per 100 documents
- tf sync in the final
- idf is calculated in the final
- 108m11.242s (942M)
- rev 2045
exp4
- FBIS3
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 100 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- 67m (1.1G)
- rev 2048
exp5
- ALL PREVIOUS EXP IS NOT USABLE check log of rev 2052
- FBIS3
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 100 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- 84m (1.3G)
- rev 2053
exp6
- FBIS3
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 100 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- 26m28.852s (430M)
- search: 34.784s (480176 264384)
- disk: 250M
- rev 2063
exp7
- FBIS3+FBIS4
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 100 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- 90m37s (822M)
- search: 1m7.741s (721724 435184)
- 490M
- rev 2063
exp8
- FBIS3+FBIS4-inc
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 100 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- 53m12.581s (560M) (incremental only)
- search: 1m5.634s (723380 429724)
- 490M (slight larger than exp7)
- rev 2068
exp9
- FBIS3
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 500documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- 1557s (430M)
- search: 34.728s (480192 262708)
- disk: 244M
- rev 2074
exp10
- FBIS3+FBIS4
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 500 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- 5277s (822M)
- search: 1m8.774s (721740 434736)
- 478M
- rev 2074
exp11
- FBIS3+FBIS4-inc
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 500 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- 1728s (560M) (incremental only)
- search: 1m7.604s (497028 271148)
- 503M (slight larger than exp7)
- rev 2074
exp12
- FBIS3
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 500 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- stem cache
- 1192s (430M)
- search: 34.728s (488440 451364)
- disk: 242M
- rev 2075
exp13
- FBIS3+FBIS4
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 500 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- stem cache
- 4633s (870MB)
- search: 34.728s (488440 451364)
- disk: 473M
- rev 2075
exp14
- FBIS3+FBIS4 (inc)
- tf: btree, idf: btree, inv: btree, NUM_DB = 100
- tf key: docno
- inv_db sync per 500 documents
- tf sync in the final
- idf is calculated in the final
- for each directory, open dbs and close after use
- save a term to which db's hash table
- index threshold (>
2
- stem cache
- 2773s (606MB) (inc)
- search: 34.728s (488440 451364)
- disk: 475M
- rev 2075