Chap 7

文件的前置處理

語法分析 (lexical analysis)

這個是把 input 的東西，轉成一個一個的 word/token

- 自動 indexing
  - 數字不太好，但是會有像是 B6, B12 這種東西
  - 連字號要不要拆開的問題：拆開會增加 recall 減低 precision
  - 念法的標記，通常沒用，像是 OS/2
  - 大小寫：保留可以增加 precision 減低 recall

一般的商用系統是以 recall 為主，所以說會保留數字當 index，但不區分大小寫

- 處理 query

query 中可以包含一些運算符號或者是 grouping 的東西

消除 stopword
- 避免抓出一堆垃圾
- 最好的作法是在做語法分析的時候就幹掉
stemming
- 其實大多數的 search engine 不用
- 考量是正確性，擷取的好壞跟壓縮率
- 做的時間點
  - index 的時候，好處是比較有效率跟好的壓縮，但是一些資訊就會消失
  - search 的時候，把類似的結果都丟出來，給使用者選
- 找 stem 的方式
  - 手動
  - stemmers (自動)
    - affix removel

找出一堆字根變化的規則，然後砍掉（Porter algo），在 match 的時候可以分為 longest match 或者 partial matching（只考慮字根的前幾個字）。

- - - successor variety

successor variety 是說在一個 word 中，每一個字母後面可以接的字母的數量（只考慮現在你有的文件們），而通常在 stem 的時候，會有一個 peak 跳出來，因此找 stem 的方式有三種：cut-off（先定義好 successor variety 超過多少的時候就算是 stem）/ peak and plateau（找突然跳起來的 peak!）/ complete word

- - - table lookup（查表啦 XD）
    - n-gram

先算字跟字共用的 n-gram，算法是兩個字共用的 n-gram 數量 * 2 / 兩個字的 n-gram 個數，然後造出一個 matrix，用 single link 的方式 cluster 連起來，於是就可以找到 stem。這個比較像是 term cluster 的東西。

Chap 8

indexing

照字母順序的
- sorted indices
- inverted files
  - 每篇文章列出一些 keywords，每個 keyword 都有 relevance weight，那 inverted file 就是以 keyword 當 key，會連到有出現這個 keyword 的文件去
  - 缺點是 size 會是 text size 的 10%~100%，而且原本文件改過後，應重新做
  - inverted list 是記錄每個字出現在文章的那邊，file 只有記說出現在那個文件，那也可以用 block 為單位來記出現的地方，而一個 block 看需要甚至可以是一份文件
  - 實際上可以用 sorted array, b-tree, tries, hasing 或者綜合的方式來實作
    - tries: 字彙樹，依照有的字彙，從第一個字母開始造樹，leaf 就是真正的字跟出現的地方
    - b-tree: 就是那個 b-tree
    - sorted array: 照 term sorted 去查 term 出現在那個文件裡面，但是這樣子很慢，所以我們可以把出現在那份文件跟位置的資訊存在另外一個檔案(posting file)中，查 term 的時候就是查這個檔的 offset，然後跑過去。這樣子有點糟糕，有一種解法是在造那個 dictionary 的時候，用 right-threaded bst(這是啥?)，來避免 sorting。
    - fast inversion algorithm: 兩個原則，盡量用 memory 來處理東西跟合併，還有直接用原本 input 的次序就好，sort 大東西 cost 太高（even O(nlgn) 的）。作法是先準備一份照文件跟 concept no. (in each doc) 的 array，然後 merge 起來成每個 concept 在那些文件（一樣用 posting file offset）（其實有點像 DBMS 的 sort-merge），利用這個表來決定原本的 doc 要怎麼切割（依照 conecpt # 來分），最後就是把每一個小份做 inverted file
- Patricia (PAT) trees (similar to suffix tree and array)
- cluster file (chap 7)
- 用 hash 的
  - signature files

searching

在 query 中找出 words 跟 patterns (vocabulary search)
抓出相關的文章
算出 query 中 word/pattern 真正出現的地方

Project

done

On P4 2.8G, 2G memory

read file, extract <TI>, <TEXT> (extract-text.py)
- 3m12.629s with pysco
extract terms per document into DOCID.terms (extract-term.py)
- 標點符號
- stop word
- also counts term frequency
- stemming
- indexing (invert file Q_Q)
- 10m with pysco
- 165m36.654s (~ 1G?), 100 dbs (per tf,idf,inv)

not yet

ranking… (basic vector model?)

目錄結構

照 set 跟 doc no 分

FBIS[34]/[0-9]{2}/doc no

實驗

exp1

FBIS3
tf: hash, idf:btree, inv: btree, NUM_DB = 100
tf key: docno:term
165m36.654s (> 1G?)

exp2

FBIS3
tf: hash, idf:btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
idf is calculated in the final
87m2.240s (1G)
rev 2042

exp3

FBIS3
tf: hash, idf:btree, inv: btree, NUM_DB = 50
tf key: docno
inv_db ync per 100 documents
tf sync in the final
idf is calculated in the final
108m11.242s (942M)
rev 2045

exp4

FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
67m (1.1G)
rev 2048

exp5

ALL PREVIOUS EXP IS NOT USABLE check log of rev 2052
FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
84m (1.3G)
rev 2053

exp6

FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
26m28.852s (430M)
search: 34.784s (480176 264384)
disk: 250M
rev 2063

exp7

FBIS3+FBIS4
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
90m37s (822M)
search: 1m7.741s (721724 435184)
490M
rev 2063

exp8

FBIS3+FBIS4-inc
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 100 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
53m12.581s (560M) (incremental only)
search: 1m5.634s (723380 429724)
490M (slight larger than exp7)
rev 2068

exp9

FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
1557s (430M)
search: 34.728s (480192 262708)
disk: 244M
rev 2074

exp10

FBIS3+FBIS4
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
5277s (822M)
search: 1m8.774s (721740 434736)
478M
rev 2074

exp11

FBIS3+FBIS4-inc
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
1728s (560M) (incremental only)
search: 1m7.604s (497028 271148)
503M (slight larger than exp7)
rev 2074

exp12

FBIS3
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
stem cache
1192s (430M)
search: 34.728s (488440 451364)
disk: 242M
rev 2075

exp13

FBIS3+FBIS4
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
stem cache
4633s (870MB)
search: 34.728s (488440 451364)
disk: 473M
rev 2075

exp14

FBIS3+FBIS4 (inc)
tf: btree, idf: btree, inv: btree, NUM_DB = 100
tf key: docno
inv_db sync per 500 documents
tf sync in the final
idf is calculated in the final
for each directory, open dbs and close after use
save a term to which db's hash table
index threshold (> 2
stem cache
2773s (606MB) (inc)
search: 34.728s (488440 451364)
disk: 475M
rev 2075

rafan 的筆記本

Table of Contents