Recursively parse org-mode hierarchy - parsing

I'm trying to parse org-mode text in this way:
* head
** sub-head
- word :: description
** sub-head
- word :: description
- some notes
* head2
** sub-head2
- some more notes
I am trying to capture the data (such as "word :: description" and "some notes") in such a way that each piece of data preserves what its parent headers are and what the parent's parents are, etc. I envision the data coming out in such a form in elisp:
(
("head"
("sub-head" ("word :: definition"))
("sub-head" ("word :: description" "some notes"))
)
("head2"
("sub-head2" ("some more notes"))
)
)
I am guessing there is an elegant solution using recursion. I'm open to structuring the data in elisp a different way, if there's a better way to do it.

The function org-element-parse-buffer should help. It parses the whole org-mode buffer into a lisp list. You will get more properties than you need.
http://orgmode.org/worg/exporters/org-element-docstrings.html#sec-10

Here's a recursive solution:
(defun org-splitter (str lvl)
(let* ((lst (split-string
str
(concat lvl " ")))
(out (unless (= (length (car lst))
(length str))
(mapcar
(lambda (s)
(and
(string-match "\\([^\n]+\\)\n\\(.*\\)" s)
(list (match-string 1 s)
(org-splitter
(substring-no-properties
s (match-beginning 2))
(concat lvl "\\*")))))
(cdr lst)))))
(if (string= (car lst) "")
out
(cons (car lst) out))))
(defun org-recurse-all ()
(let ((str (buffer-substring-no-properties
(point-min) (point-max))))
(org-splitter str "^\\*")))

Related

Counting level of recursion in Racket

I am working on a document structure, which includes chapters, subchapters etc. Each "section" I would like to mark with different HTML header tag, so title would be in <h1>, chapters in <h2>, subchapters in <h3> and so on. Structrures that I am working on are:
(struct section (title text next) #:transparent) ; title is a string, text is list of paragraphs, next is following list of sections
(struct paragraph (text) #:transparent) ; text is a string
(struct image (file alt) #:transparent)
I would like to transform a document declared with such structures to be made of headers I mentioned earlier and I wrote such function:
(define level 0)
(define (inc-nest-level)
(set! level (+ level 1)))
(define (doc->txt t)
(match t
[(image img txt) (tag-img img txt)]
[(paragraph x) (tag-p x)]
;; p1 = title, p2 = text (list of paragraphs), p3 = next (list of subsections)
[(section p1 p2 p3) (inc-nest-level)
(string-append (tag-header p1 level) "\n"
(strings-list->string (map doc->txt p2)) "\n"
(strings-list->string (map doc->txt p3)) "\n")]))
The problem is, that when I parse a document to include HTML tags, each header <h*> is wrong, as it does not count levels of recursion, but number of occurences of any header, which means that parsing this document:
(define test-doc
(section "test-doc-title"
(list (paragraph "title-paragraph-1")
(paragraph "title-paragraph-2")
(paragraph "title-paragraph-3")
(image "image3.jpg" "alternate text")
(paragraph "title-paragraph-4"))
(list (section "test-doc-chapter-1"
(list (paragraph "ch-1-paragraph-1")
(paragraph "ch-1-paragraph-2")
(paragraph "ch-1-paragraph-3"))
null)
(section "test-doc-chapter-2"
(list (paragraph "ch-2-paragraph-1"))
(list (section "test-doc-chapter-2-sub-1"
(list (paragraph "ch-2-sub-1-paragraph-1")
(paragraph "ch-2-sub-1-paragraph-2")
(paragraph "ch-2-sub-1-paragraph-3")
(paragraph "ch-2-sub-1-paragraph-4")
(paragraph "ch-2-sub-1-paragraph-5")
(paragraph "ch-2-sub-1-paragraph-6"))
null)
(section "test-doc-chapter-2-sub-2"
(list (paragraph "ch-2-sub-2-paragraph-1")
(paragraph "ch-2-sub-2-paragraph-2")
(paragraph "ch-2-sub-2-paragraph-3"))
null)))
(section "test-doc-chapter-3"
(list (paragraph "ch-3-paragraph-1")
(paragraph "ch-3-paragraph-2"))
null))))
returns following answer (with (display (doc->txt test-doc)) function:
<h1>test-doc-title</h1>
<p>title-paragraph-1</p>
<p>title-paragraph-2</p>
<p>title-paragraph-3</p>
<img src="image3.jpg" alt="alternate text"><br><p>title-paragraph-4</p>
<h2>test-doc-chapter-1</h2>
<p>ch-1-paragraph-1</p>
<p>ch-1-paragraph-2</p>
<p>ch-1-paragraph-3</p>
<h3>test-doc-chapter-2</h3>
<p>ch-2-paragraph-1</p>
<h4>test-doc-chapter-2-sub-1</h4>
<p>ch-2-sub-1-paragraph-1</p>
<p>ch-2-sub-1-paragraph-2</p>
<p>ch-2-sub-1-paragraph-3</p>
<p>ch-2-sub-1-paragraph-4</p>
<p>ch-2-sub-1-paragraph-5</p>
<p>ch-2-sub-1-paragraph-6</p>
<h5>test-doc-chapter-2-sub-2</h5>
<p>ch-2-sub-2-paragraph-1</p>
<p>ch-2-sub-2-paragraph-2</p>
<p>ch-2-sub-2-paragraph-3</p>
<h6>test-doc-chapter-3</h6>
<p>ch-3-paragraph-1</p>
<p>ch-3-paragraph-2</p>
The problem is, that I would like to count each level of reccursion independently, so I would get something like:
<h1>test-doc-title</h1>
<p>title-paragraph-1</p>
<p>title-paragraph-2</p>
<p>title-paragraph-3</p>
<img src="image3.jpg" alt="alternate text"><br><p>title-paragraph-4</p>
<h2>test-doc-chapter-1</h2>
<p>ch-1-paragraph-1</p>
<p>ch-1-paragraph-2</p>
<p>ch-1-paragraph-3</p>
<h2>test-doc-chapter-2</h2>
<p>ch-2-paragraph-1</p>
<h3>test-doc-chapter-2-sub-1</h3>
<p>ch-2-sub-1-paragraph-1</p>
<p>ch-2-sub-1-paragraph-2</p>
<p>ch-2-sub-1-paragraph-3</p>
<p>ch-2-sub-1-paragraph-4</p>
<p>ch-2-sub-1-paragraph-5</p>
<p>ch-2-sub-1-paragraph-6</p>
<h3>test-doc-chapter-2-sub-2</h3>
<p>ch-2-sub-2-paragraph-1</p>
<p>ch-2-sub-2-paragraph-2</p>
<p>ch-2-sub-2-paragraph-3</p>
<h2>test-doc-chapter-3</h2>
<p>ch-3-paragraph-1</p>
<p>ch-3-paragraph-2</p>
I need help with remaking doc->txt function, so it would parse headers the way I mentioned.
Edit: tag-header function is defined as:
(define (tag-header title level)
(if (> level 6)
(set! level 6)
void)
(let [(lvl (number->string level))]
(string-append "<h" lvl ">" title "</h" lvl ">")))
After few hours of struggles, I finally got that to work. What I needed to do, was to define a helper function map-2, which takes as an input function f, list xs and a "nest-level counter" lvl, so in every deeper section it applies f to every element of the list and saves the nest-level counter. Final function:
(define (doc->txt t)
(define (map-2 f xs lvl)
(if (null? xs)
null
(cons (f (car xs) lvl)
(map-2 f (cdr xs) lvl))))
(define (aux t l) ;; l stands for level of nests, so headers are smaller with each level
(match t
[(image img txt) (tag-img img txt)]
[(paragraph x) (tag-p x)]
;; p1 = title, p2 = text (list of paragraphs), p3 = next (list of subsections)
[(section p1 p2 p3) (string-append (tag-header p1 l) "\n"
(strings-list->string (map-2 aux p2 (add1 l))) "\n"
(strings-list->string (map-2 aux p3 (add1 l))) "\n")]))
(aux t 1))

Parsing concrete syntax in Scheme

I wrote a procedure that gets a valid prefix list for subtraction (e.g, "(- 6 5)" for what we know as "6-5"). Here is my code:
(define parse-diff-list
(lambda (datum)
(cond
((number? datum) (const-exp datum)) ;; if datum is a number, return const-exp
((pair? datum) ;; if datum is a pair:
(let ((sym (car datum))) ;; let sym be the first of the pair
(cond
((eqv? sym '-) ;; if sym is minus:
(let ((lst1 (parse-diff-list (cdr datum)))) ;; parse second element of subtraction
(let ((lst2 (parse-diff-list (cdr lst1)))) ;; parse first element of subtraction
(cons (diff-exp (car lst1) (car lst2)) (cdr lst2))))) ;; "perform" the subtraction
((number? sym) ;; if sym is number:
(cons (const-exp sym) (cdr datum))) ;; return const-exp with the remainder of the list, yet to be processed
(else (eopl:error 'parse-diff-list "bad prefix-expression, expected - ~s" sym)))))
(eopl:error 'parse-diff-list "bad prefix-expression ~s" datum))))
(define parse-prefix
(lambda (lst)
(car (parse-diff-list lst))))
It works fine logically, but I don't understand the logic of the indentation in printing. For the input:
(parse-prefix '(- - 1 2 - 3 - 4 5))
It prints:
#(struct:diff-exp
#(struct:diff-exp #(struct:const-exp 1) #(struct:const-exp 2))
#(struct:diff-exp #(struct:const-exp 3) #(struct:diff-exp #(struct:const-exp 4) #(struct:const-exp 5)))
While I would want the following print style:
#(struct:diff-exp
#(struct:diff-exp
#(struct:const-exp 1)
#(struct:const-exp 2))
#(struct:diff-exp
#(struct:const-exp 3)
#(struct:diff-exp
#(struct:const-exp 4)
#(struct:const-exp 5)))
It's more than a petty question for me, as it does create indentations but I don't know how it does it.
Thanks a lot!
Take a look at racket/pretty the pretty printing library.
In particular note the parameter (pretty-print-columns) which
you can set like this:
`(pretty-print-columns 40)`
in order to avoid long lines.
http://docs.racket-lang.org/reference/pretty-print.html
(I am guessing you are using DrRacket based on the way the structures are printing)

An iterative program for appending lists in scheme

I am reading Section 2.2 in SICP where the book introduced the procedure for appending two lists. I am trying to implement the append using iteration.
This is my code:
(define (append list1 list2)
(define (append-iter item1 reversed-item1 result)
(if (null? item1)
(if (null? reversed-item1)
result
(append-iter item1
(cdr reversed-item1)
(cons (car reverse) result)))
(append-iter (cdr item1)
(cons (car item1) reversed-item1)
result)))
(append-iter list1 '() list2))
Though it works, but noting the number of the iterations is double the length of list1. Is there a solution whose number of the iterations equals to the length of list1. (without using any fold function)?
Basically how your procedure works is like this:
(define (append l1 l2)
(define (reverse-append rev app)
(if (null? rev)
app
(reverse-append (cdr rev)
(cons (car rev) app))))
(reverse-append (reverse l1) l2))
It's O(N) but it wastes some memory since (reverse l1) space is just used for iteration. If you really need to fix that you need to use mutation:
(define (append-iter . rest)
(let ((result (list 1)))
(let loop ((p result) (lst '()) (rest rest))
(cond
((not (null? lst))
(set-cdr! p (list (car lst)))
(loop (cdr p) (cdr lst) rest))
((null? rest) (cdr result))
((null? (cdr rest))
(set-cdr! p (car rest))
(cdr result))
(else (loop p (car rest) (cdr rest)))))))

Thinking in Clojure: Avoid OOP for simple string parser

I'm currently implementing a small parser in Clojure that takes an input string like:
aaa (bbb(ccc)ddd(eee)) fff (ggg) hhh
and returns the string without characters that are not in brackets, i.e.
(bbb(ccc)ddd(eee))(ggg)
I've written the following function:
(defn- parse-str [input]
(let [bracket (atom 0)
output (atom [])]
(doseq [ch (seq input)]
(case ch
\( (swap! bracket inc)
\) (swap! bracket dec)
nil)
(if (or (> #bracket 0) (= ch \)))
(swap! output conj ch)))
(apply str #output)))
which works for me:
(parse-str "aaa (bbb(ccc)ddd(eee)) fff (ggg) hhh")
"(bbb(ccc)ddd(eee))(ggg)"
I am however concerned that my approach is a too object oriented since it uses atoms as some kind of local variables to keep the current state of the parser.
Is it possible to write the same function from a more functional programming perspective? (avoiding the atoms?)
Any comments to improve my code are appreciated as well.
Two ways: You can use explicit recursion or reduce.
(defn parse-str [input]
(letfn [(parse [input bracket result]
(if (seq input)
(let [[ch & rest] input]
(case ch
\( (recur rest (inc bracket) (conj result ch))
\) (recur rest (dec bracket) (conj result ch))
(recur rest bracket (if (> bracket 0)
(conj result ch)
result))))
result))]
(clojure.string/join (parse input 0 []))))
(defn parse-str [input]
(clojure.string/join
(second (reduce (fn [acc ch]
(let [[bracket result] acc]
(case ch
\( [(inc bracket) (conj result ch)]
\) [(dec bracket) (conj result ch)]
[bracket (if (> bracket 0)
(conj result ch)
result)])))
[0 []]
input))))
In a lot of cases where you would use local variables, you just put any variable that changes as a parameter to loop, thereby using recursion instead of mutation.
(defn- parse-str [input]
;; Instead of using atoms to hold the state, use parameters in loop
(loop [output []
bracket 0
;; The [ch & tail] syntax is called destructuring,
;; it means let ch be the first element of (seq input),
;; and tail the rest of the elements
[ch & tail] (seq input)]
;; If there's no elements left, ch will be nil, which is logical false
(if ch
(let [bracket* (case ch
\( (inc bracket)
\) (dec bracket)
bracket)
output* (if (or (> bracket* 0) (= ch \)))
(conj output ch)
output)]
;; Recurse with the updated values
(recur output* bracket* tail))
;; If there's no characters left, apply str to the output
(apply str output))))
This is an iterative version of your function; but it's still functionally pure. I find having the code laid out like this makes it easy to read. Remember, when using recursion, always check your termination condition first.
(defn parse-str [s]
(loop [[x & xs] (seq s), acc [], depth 0]
(cond
(not x) (clojure.string/join acc)
(= x \() (recur xs (conj acc x) (inc depth))
(= x \)) (recur xs (conj acc x) (dec depth))
(<= depth 0) (recur xs acc depth)
:else (recur xs (conj acc x) depth))))

Parsing strings with Scheme

I am trying to write a simple parser which creates a sxml-expression from a string, e. g.
"This is a [Test]" ===> (item "This is a" (subitem "Test"))
Anybody who is wondering about the square brackets within the given example may have a look at the so called Leiden conventions.
This is the code I have written so far:
(define my-sequence '("this" "[" "is" "a" "]" "test"))
(define (left-square-bracket? item)
(or (equal? item "[")
(eq? item #\x005b)))
(define (right-square-bracket? item)
(or (equal? item "]")
(eq? item #\x005d)))
(define (parse-sequence sequence)
(cond ((null? sequence) '())
((left-square-bracket? (car sequence))
(let ((subsequence (get-subsequence (cdr sequence))))
(list subsequence)))
(else
(cons (car sequence)
(parse-sequence (cdr sequence))))))
(define (get-subsequence sequence)
(if (right-square-bracket? (car sequence))
'()
(cons (car sequence)
(get-subsequence (cdr sequence)))))
Evaluating (parse-sequence my-sequence) yields ("this" ("is" "a")). A nested expression has been created, but the program finished without having evaluated the last item "test". The question is, how do I return from get-subsequence to parse-sequence?
Any help is appreciated, many thanks in advance! :)
To address your initial questions, how to return multiple values: use the "values" form. Here is an example implementation where the inner procedure returns both the remaining list to be processed and the result so far. It recurses on opening brackets.
(define (parse-sequence lst)
(define (parse-seq lst)
(let loop ((lst lst) (res null))
(cond
((null? lst) (values null res))
((string=? (car lst) "[")
(let-values ([(lst2 res2) (parse-seq (cdr lst))])
(loop lst2 (append res (list res2)))))
((string=? (car lst) "]")
(values (cdr lst) res))
(else
(loop (cdr lst) (append res (list (car lst))))))))
(let-values ([(lst res) (parse-seq lst)])
res))
then
(parse-sequence '("this" "is" "a" "test"))
(parse-sequence '("this" "[" "is" "a" "]" "test"))
(parse-sequence '("this" "[" "is" "[" "a" "]" "]" "test"))
will yield
'("this" "is" "a" "test")
'("this" ("is" "a") "test")
'("this" ("is" ("a")) "test")
I made some progress by using open-input-string in combination with read-char:
(define my-sequence (open-input-string "this [is a] test"))
(define (parse-sequence sequence)
`(item
,#(let loop ((next-char (read-char sequence)))
(cond ((eof-object? next-char) '())
((left-square-bracket? next-char)
(let ((subsequence (get-subsequence sequence)))
(cons subsequence
(loop (read-char sequence)))))
(else
(cons next-char
(loop (read-char sequence))))))))
(define (get-subsequence sequence)
`(subitem
,#(let loop ((next-char (read-char sequence)))
(if (right-square-bracket? next-char)
'()
(cons next-char
(loop (read-char sequence)))))))
(parse-sequence my-sequence)
===> (item #\t #\h #\i #\s #\space (subitem #\i #\s #\space #\a) #\space #\t #\e #\s #\t)
Now work goes on, step by step. :)
Any comments and suggestions are still appreciated. :)

Resources