Entry tags:
CYK parser for PCFG in Clojure
Two weeks ago I wrote about a bug in AIMA. (In my spare time I am implementing random bits from AIMA). After fixing the bug, here is my implementation of the algorithm in Clojure.
It surprised me that it is not as terse as I expected. I think Java version would have only 1.5-2x times more LoC.
The algorithm's core (marked as blue text above) is terse, but the helper functions it needs to run defined above take I think too much space. I only wonder how terse would it have been in Scala?
Now will have to convert it to functional manner, without using Java array aget/aset.
(defn CYK-parser [words lexicon grammar0]
" CYK parser for PCFG. Derived from AIMA 3rd edition, p. 894; Bug on line 12 is fixed.
words - arary of words in input sentence
lexicon - hash-map of {word {:term :prob}}
grammar0 - array of {:term :left :right :prob}"
(let [N (count words)
; Merge grammar and lexic
grammar1 (reduce conj grammar0
(map #(hash-map :term %1 :left nil :right nil)
(distinct (map #(% :term) (vals lexicon)))))
M (count grammar1)
grammar (map #(assoc %1 :num %2) grammar1 (range M)) ; Add indexes to grammar for P array references
P (make-array Float M N N) ; create array
noref (dorun (for [i (range M) j (range N) k (range N)] (aset P i j k (Float. 0.0)))) ; set array to zeroes
set-word (fn [word index] ; FIXME not to crash on unknown lexic, to add ambiguity support
(let [matching-lexic (filter #(= (% :term) ((lexicon word) :term)) grammar)]
(dorun (map #(aset P (% :num) index 0 (Float. ((lexicon word) :prob))) matching-lexic))))
get-index (fn [term]
((first (filter #(= (% :term) term) grammar)) :num))
new-val (fn [old rule1 start1 len1 rule2 start2 len2 p]
(max old
(* (aget P rule1 start1 len1) (aget P rule2 start2 len2) p)))]
(dorun (map #(set-word %1 %2) words (range N))) ; set probabilities for lexic, length=1
(dorun
(for [length (range 2 (inc N))
start (range (- N length -1))
len1 (range 1 length)
X (filter #(not (= nil (% :left))) grammar)]
(let [len2 (- length len1)
current (aget P (X :num) start (dec length))
Y (get-index (X :left))
Z (get-index (X :right))
p (X :prob)
new (new-val current, Y start (dec len1), Z (+ start len1) (dec len2), p)]
(aset P (X :num) start (dec length) (Float. new)))))
P))
It surprised me that it is not as terse as I expected. I think Java version would have only 1.5-2x times more LoC.
The algorithm's core (marked as blue text above) is terse, but the helper functions it needs to run defined above take I think too much space. I only wonder how terse would it have been in Scala?
Now will have to convert it to functional manner, without using Java array aget/aset.