More on PCFG parsing in Clojure (last post)
Oct. 25th, 2012 08:42 am
Developing this topic further: when I change the code from handling a toy grammar-parsing example to something more robust, it grows way too much:
diff:
+ ; Added storage for valid parsing trees
+ (let [N (count words)
+       tree (ref (vec (take N (cycle [[]]))))
+       update-tree (fn [i toadd]
+                     (dosync (ref-set tree (vec
+                                             (map #(if (= % i)
+                                                     (conj (nth @tree i) toadd)
+                                                     (nth @tree %))
+                                                  (range N))))))
+; changed set-word
+       set-word (fn [word index]
+                  (let [matching-words (lexicon word)
+                        filter-lexic (fn [matching-word]
+                                       (first (filter #(and (= (% :term) (matching-word :term))
+                                                            (= nil (% :left))
+                                                            ) grammar)))
+                        matching-lexic (map filter-lexic matching-words)
+                        get-prob (fn [term]
+                                   (Float. ((first (filter #(= nil (% :left)) matching-words)) :prob)))]
+                    (do
+                      (dorun (map #(aset P (% :num) index 0 (get-prob %)) matching-lexic))
+                      (dosync
+                        (ref-set tree (vec
+                                        (map
+                                          (fn [i] (if (= i 0)
+                                                    (reduce conj (nth @tree i)
+                                                            (vec (map #(hash-map :term (% :term) :start index :len 0
+                                                                                 :len1 1 :len2 1) matching-lexic)))
+                                                    (nth @tree i)))
+                                          (range N))))))))
+       ; Add to tree
+       get-nodes (fn [term]
+                   (filter #(= (% :term) term) grammar))
+       new-val (fn [old rules1 start1 len1 rules2 start2 len2 p]
+                 (let [getp #(aget P %1 %2 %3)
+                       get-maxp-index (fn [rules start len]
+                                        (apply max (map #(getp (% :num) start len) rules)))
+                       leftp (get-maxp-index rules1 start1 len1)
+                       rightp (get-maxp-index rules2 start2 len2)]
+                   (max old
+                        (* leftp rightp p))))]
+   X (filter
+       #(and (not (= nil (% :left)))
+             (xor (= (% :term) :start)
+                  (< length N)))
+       grammar)] ; X = all non-terminals in grammar, start nodes are used only on full sentence
+   (update-tree (dec length) {:term (X :term) :start start :len (dec length) :prob new
+                              :left (X :left) :right (X :right) :len1 len1 :len2 len2})); Add current term to tree
+   (aset P (X :num) start (dec length) (Float. new))))))
+ @tree))
And that is only part of the code; augmenting the grammar with semantic rules is still missing (but planned :)
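Since the growth is the complaint here, one place that could probably shrink is the tree bookkeeping. The sketch below is only an assumption about how `update-tree` might be written more compactly: it keeps the ref, but uses `alter` with `update-in` instead of rebuilding the whole vector with `map` and `ref-set`. The names `make-tree` and `update-tree!` are made up for illustration and do not appear in the diff above.

```clojure
;; Sketch only: `make-tree` and `update-tree!` are hypothetical names,
;; not the functions from the diff above.

(defn make-tree
  "Returns a ref holding a vector of n empty vectors, one slot per index."
  [n]
  (ref (vec (repeat n []))))

(defn update-tree!
  "Appends node to slot i of the tree ref; the other slots are left as they are."
  [tree i node]
  (dosync
    ;; update-in replaces only slot i, so no pass over (range N) is needed
    (alter tree update-in [i] conj node)))

;; Usage with made-up nodes
(def tree (make-tree 3))
(update-tree! tree 1 {:term :np :start 0 :len 1})
(update-tree! tree 1 {:term :vp :start 1 :len 1})
(println @tree)
```

After the two calls, slot 1 holds both nodes and slots 0 and 2 stay empty. The same idea might apply to the tree update inside `set-word`, where something like `(alter tree update-in [0] into new-nodes)` could stand in for the `reduce conj` over slot 0, but that is untested against the real parser.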