2012-09-02

Two weeks ago I wrote about a bug in AIMA. (In my spare time I am implementing random bits from AIMA). After fixing the bug, here is my implementation of the algorithm in Clojure.

(defn CYK-parser [words lexicon grammar0]
  " CYK parser for PCFG. Derived from AIMA 3rd edition, p. 894; Bug on line 12 is fixed.
    words - arary of words in input sentence
    lexicon - hash-map of {word {:term :prob}}
    grammar0 - array of {:term :left :right :prob}"
  (let [N (count words)
        ; Merge grammar and lexic
        grammar1 (reduce conj grammar0 
                         (map #(hash-map :term %1 :left nil :right nil)
                              (distinct (map #(% :term) (vals lexicon)))))
        M (count grammar1)
        grammar (map #(assoc %1 :num %2) grammar1 (range M))  ; Add indexes to grammar for P array references
        P (make-array Float M N N) ; create array
        noref (dorun (for [i (range M) j (range N) k (range N)] (aset P i j k (Float. 0.0)))) ; set array to zeroes
        set-word (fn [word index] ; FIXME not to crash on unknown lexic, to add ambiguity support
                   (let [matching-lexic (filter #(= (% :term) ((lexicon word) :term)) grammar)]
                     (dorun (map #(aset P (% :num) index 0 (Float. ((lexicon word) :prob))) matching-lexic))))
        get-index (fn [term]
                    ((first (filter #(= (% :term) term) grammar)) :num))
        new-val (fn [old rule1 start1 len1 rule2 start2 len2 p] 
                  (max old 
                       (* (aget P rule1 start1 len1) (aget P rule2 start2 len2) p)))]
    (dorun (map #(set-word %1 %2) words (range N))) ; set probabilities for lexic, length=1
    (dorun
      (for [length (range 2 (inc N))
            start (range (- N length -1))
            len1 (range 1 length)
            X (filter #(not (= nil (% :left))) grammar)]
        (let [len2 (- length len1)
              current (aget P (X :num) start (dec length))
              Y (get-index (X :left))
              Z (get-index (X :right))
              p (X :prob)
              new (new-val current, Y start (dec len1), Z (+ start len1) (dec len2), p)]
          (aset P (X :num) start (dec length) (Float. new)))))
    P))

It surprised me that it is not as terse as I expected. I think Java version would have only 1.5-2x times more LoC.
The algorithm's core (marked as blue text above) is terse, but the helper functions it needs to run defined above take I think too much space. I only wonder how terse would it have been in Scala?

Now will have to convert it to functional manner, without using Java array aget/aset.

2012-09-02

CYK parser for PCFG in Clojure

Very cool flame war.