[[http://nlg3.csie.ntu.edu.tw/courses/IR/course_IR.html|課程網頁]] ====== Chap 7 ====== ===== 文件的前置處理 ===== * 語法分析 (lexical analysis) 這個是把 input 的東西,轉成一個一個的 word/token * * 自動 indexing * 數字不太好,但是會有像是 B6, B12 這種東西 * 連字號要不要拆開的問題:拆開會增加 recall 減低 precision * 念法的標記,通常沒用,像是 OS/2 * 大小寫:保留可以增加 precision 減低 recall 一般的商用系統是以 recall 為主,所以說會保留數字當 index,但不區分大小寫 * * 處理 query query 中可以包含一些運算符號或者是 grouping 的東西 * 消除 stopword * 避免抓出一堆垃圾 * 最好的作法是在做語法分析的時候就幹掉 * stemming * 其實大多數的 search engine 不用 * 考量是正確性,擷取的好壞跟壓縮率 * 做的時間點 * index 的時候,好處是比較有效率跟好的壓縮,但是一些資訊就會消失 * search 的時候,把類似的結果都丟出來,給使用者選 * 找 stem 的方式 * 手動 * stemmers (自動) * affix removel 找出一堆字根變化的規則,然後砍掉(Porter algo),在 match 的時候可以分為 longest match 或者 partial matching(只考慮字根的前幾個字)。 * * * * successor variety successor variety 是說在一個 word 中,每一個字母後面可以接的字母的數量(只考慮現在你有的文件們),而通常在 stem 的時候,會有一個 peak 跳出來,因此找 stem 的方式有三種:cut-off(先定義好 successor variety 超過多少的時候就算是 stem)/ peak and plateau(找突然跳起來的 peak!)/ complete word * * * * table lookup(查表啦 XD) * n-gram 先算字跟字共用的 n-gram,算法是兩個字共用的 n-gram 數量 * 2 / 兩個字的 n-gram 個數,然後造出一個 matrix,用 single link 的方式 cluster 連起來,於是就可以找到 stem。這個比較像是 term cluster 的東西。 ====== Chap 8 ====== ===== indexing ===== * 照字母順序的 * sorted indices * inverted files * 每篇文章列出一些 keywords,每個 keyword 都有 relevance weight,那 inverted file 就是以 keyword 當 key,會連到有出現這個 keyword 的文件去 * 缺點是 size 會是 text size 的 10%~100%,而且原本文件改過後,應重新做 * inverted list 是記錄每個字出現在文章的那邊,file 只有記說出現在那個文件,那也可以用 block 為單位來記出現的地方,而一個 block 看需要甚至可以是一份文件 * 實際上可以用 sorted array, b-tree, tries, hasing 或者綜合的方式來實作 * tries: 字彙樹,依照有的字彙,從第一個字母開始造樹,leaf 就是真正的字跟出現的地方 * b-tree: 就是那個 b-tree * sorted array: 照 term sorted 去查 term 出現在那個文件裡面,但是這樣子很慢,所以我們可以把出現在那份文件跟位置的資訊存在另外一個檔案(posting file)中,查 term 的時候就是查這個檔的 offset,然後跑過去。這樣子有點糟糕,有一種解法是在造那個 dictionary 的時候,用 right-threaded bst(這是啥?),來避免 sorting。 * fast inversion algorithm: 兩個原則,盡量用 memory 來處理東西跟合併,還有直接用原本 input 的次序就好,sort 大東西 cost 太高(even O(nlgn) 的)。作法是先準備一份照文件跟 concept no. (in each doc) 的 array,然後 merge 起來成每個 concept 在那些文件(一樣用 posting file offset)(其實有點像 DBMS 的 sort-merge),利用這個表來決定原本的 doc 要怎麼切割(依照 conecpt # 來分),最後就是把每一個小份做 inverted file * Patricia (PAT) trees (similar to suffix tree and array) * cluster file (chap 7) * 用 hash 的 * signature files ===== searching ===== * 在 query 中找出 words 跟 patterns (vocabulary search) * 抓出相關的文章 * 算出 query 中 word/pattern 真正出現的地方 ====== Project ====== ===== done ===== On P4 2.8G, 2G memory * read file, extract , (extract-text.py) * 3m12.629s with pysco * extract terms per document into DOCID.terms (extract-term.py) * 標點符號 * stop word * also counts term frequency * stemming * indexing (invert file Q_Q) * 10m with pysco * 165m36.654s (~ 1G?), 100 dbs (per tf,idf,inv) ===== not yet ===== * ranking... (basic vector model?) ===== 目錄結構 ===== * 照 set 跟 doc no 分 ''FBIS[34]/[0-9]{2}/doc no'' ===== 實驗 ===== ==== exp1 ==== * FBIS3 * tf: hash, idf:btree, inv: btree, NUM_DB = 100 * tf key: docno:term * 165m36.654s (> 1G?) ==== exp2 ==== * FBIS3 * tf: hash, idf:btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 100 documents * idf is calculated in the final * 87m2.240s (1G) * rev 2042 ==== exp3 ==== * FBIS3 * tf: hash, idf:btree, inv: btree, NUM_DB = 50 * tf key: docno * inv_db ync per 100 documents * tf sync in the final * idf is calculated in the final * 108m11.242s (942M) * rev 2045 ==== exp4 ==== * FBIS3 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 100 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * 67m (1.1G) * rev 2048 ==== exp5 ==== * ALL PREVIOUS EXP IS NOT USABLE check log of rev 2052 * FBIS3 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 100 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * 84m (1.3G) * rev 2053 ==== exp6 ==== * FBIS3 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 100 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * 26m28.852s (430M) * search: 34.784s (480176 264384) * disk: 250M * rev 2063 ==== exp7 ==== * FBIS3+FBIS4 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 100 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * 90m37s (822M) * search: 1m7.741s (721724 435184) * 490M * rev 2063 ==== exp8 ==== * FBIS3+FBIS4-inc * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 100 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * 53m12.581s (560M) (incremental only) * search: 1m5.634s (723380 429724) * 490M (slight larger than exp7) * rev 2068 ==== exp9 ==== * FBIS3 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 500documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * 1557s (430M) * search: 34.728s (480192 262708) * disk: 244M * rev 2074 ==== exp10 ==== * FBIS3+FBIS4 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 500 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * 5277s (822M) * search: 1m8.774s (721740 434736) * 478M * rev 2074 ==== exp11 ==== * FBIS3+FBIS4-inc * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 500 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * 1728s (560M) (incremental only) * search: 1m7.604s (497028 271148) * 503M (slight larger than exp7) * rev 2074 ==== exp12 ==== * FBIS3 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 500 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * stem cache * 1192s (430M) * search: 34.728s (488440 451364) * disk: 242M * rev 2075 ==== exp13 ==== * FBIS3+FBIS4 * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 500 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * stem cache * 4633s (870MB) * search: 34.728s (488440 451364) * disk: 473M * rev 2075 ==== exp14 ==== * FBIS3+FBIS4 (inc) * tf: btree, idf: btree, inv: btree, NUM_DB = 100 * tf key: docno * inv_db sync per 500 documents * tf sync in the final * idf is calculated in the final * for each directory, open dbs and close after use * save a term to which db's hash table * index threshold (>=) 2 * stem cache * 2773s (606MB) (inc) * search: 34.728s (488440 451364) * disk: 475M * rev 2075