Lemma extraction for Commentary (T.E.I. -book) - anchor

I am trying to extract some text to build a lemma for the commentary at the back of a book, and I don't know how to select the nodes.
Consider the following:
<anchor type='commentary' xml:id='A1'/>Romeo and Juliet<note type='commentary' xml:id='N1'>Some blabla</note> is a tragedy written by William Shakespeare
With my XSLT I want to access the text between anchor and note so that I can write a lemma at the back of the book:
p. 1 Romeo and Juliet) Some blabla
The problem is that almost any other element can occur between anchor and note, so I can't just use text(). For example, sometimes there is a <c> element between anchor and note:
<anchor type='commentary' xml:id='A1'/>Romeo <c rendition="ampersand"/> Juliet<note type='commentary' xml:id='N1'>Some blabla</note>
with the undesired result of 'Romeo Juliet' instead of 'Romeo & Juliet'.
How can I copy the nodes between anchor and note so that I can access them a second time?
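The selection logic can be sketched outside XSLT. The following Python sketch is not an XSLT solution; it only illustrates walking the siblings between the anchor and the note and turning each intervening element into text. It assumes that anchor and note are siblings under the same parent, that the elements carry no TEI namespace (as in the sample), and that <c rendition="ampersand"/> should render as "&". The function name lemma_between and the wrapping <p> element are hypothetical.

```python
import xml.etree.ElementTree as ET

# xml:id is always bound to the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Sample from the question, wrapped in a hypothetical <p> parent.
SAMPLE = (
    "<p><anchor type='commentary' xml:id='A1'/>Romeo "
    "<c rendition='ampersand'/> Juliet"
    "<note type='commentary' xml:id='N1'>Some blabla</note>"
    " is a tragedy written by William Shakespeare</p>"
)


def lemma_between(parent, anchor_id, note_id):
    """Collect the text between an anchor and its note (assumed siblings)."""
    parts = []
    collecting = False
    for child in parent:
        if child.tag == "anchor" and child.get(XML_ID) == anchor_id:
            collecting = True
            parts.append(child.tail or "")  # text right after the anchor
        elif child.tag == "note" and child.get(XML_ID) == note_id:
            break  # the lemma ends where the note starts
        elif collecting:
            if child.tag == "c" and child.get("rendition") == "ampersand":
                parts.append("&")  # assumed rendering of the <c> element
            else:
                parts.append("".join(child.itertext()))
            parts.append(child.tail or "")  # text following the element
    return "".join(parts)


print(lemma_between(ET.fromstring(SAMPLE), "A1", "N1"))
```

In XSLT the same idea would mean selecting the anchor's following siblings that precede the matching note and applying templates to them, so that elements such as <c> get their own rendering rule.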


Lua find and extract tags within a string

I feel similar questions have been asked before, but not for HTML-like tags or for Lua 5.4.
I have a string <NS>my_file_path.py</NS> <NS>count</NS> <NS>type: :model</NS> <TS>do some counting</TS> and ideally I'll be able to pick specific tags (and everything between it) such as <NS>type: :model</NS>, and remove it from the string before doing any further formatting.
I'm guessing some matching with <NS>type: would be a start but how I stop at </NS> is the confusing part!
First of all: Do not attempt to parse HTML (or XML) with RegEx (or Lua patterns). Use libraries instead.
However, if you're only interested in removing innermost tags (i.e. "leaf" tags; tags without children), your tags are strictly formatted in the simple fashion of your example (no <tag spacing or attributes inside="tag" > allowed), and the scope of your project is very limited, you could use string.gsub and a pattern to remove these tags:
str = str:gsub("<NS>type:.-</NS>", "")
Pattern explanation:
find substrings starting with "<NS>type:"
allow for arbitrary content - zero or more arbitrary characters (.); note that this has to be lazy (-) instead of greedy (*) to work
stop matching the substring at the first occurrence of </NS>, closing the tag; if you used a greedy quantifier before, this would have stopped at the last occurrence of </NS>, exceeding the tag
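The lazy-versus-greedy point above applies to most pattern languages. As a rough illustration in Python (not Lua), the sketch below uses the re module, where .*? is the lazy counterpart of Lua's .- and .* is greedy. The helper name remove_type_tag and the shortened sample string are made up for this example.

```python
import re

LAZY = r"<NS>type:.*?</NS>"    # stops at the first </NS>
GREEDY = r"<NS>type:.*</NS>"   # runs on to the last </NS>


def remove_type_tag(text, pattern=LAZY):
    """Remove the <NS>type: ...</NS> tag matched by the given pattern."""
    return re.sub(pattern, "", text)


sample = "<NS>type: :model</NS> <NS>count</NS> <TS>do some counting</TS>"

# Lazy: only the type tag is removed.
print(repr(remove_type_tag(sample)))
# Greedy: the following <NS>count</NS> tag is swallowed as well.
print(repr(remove_type_tag(sample, GREEDY)))
```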

Extracting text from APA citation

I have a spreadsheet containing APA citation style text and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in field I2 I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck on extracting the title, Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns part of the title. The output should contain every character between "). " and the following ". ".
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works, but it does not stop at the first ". ", as in this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: Get the substring starting after the first occurrence of )., up to and including the first occurrence of . following.
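The same find-then-slice logic can be sketched in Python to make the index arithmetic easier to follow. This is only an illustration, not a spreadsheet formula: str.find is 0-based while FIND is 1-based, and the function name title_by_find is invented for the example.

```python
def title_by_find(citation):
    """Return the text after the first '). ' up to and including the next '.'."""
    marker = citation.find("). ")
    if marker == -1:
        return None  # no '(year). ' style marker found
    start = marker + 3  # skip over the three characters of '). '
    end = citation.find(". ", start)
    if end == -1:
        return None  # no sentence-ending '. ' after the title
    return citation[start:end + 1]  # +1 keeps the closing period


parikka = (
    "Parikka, J. (2010). Insect Media: An Archaeology of Animals and "
    "Technology. Minneapolis: Univ Of Minnesota Press."
)
print(title_by_find(parikka))
```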
If you wish to use REGEXEXTRACT, then this works (on your two examples). (You can also see a Regex101 demo.):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, you were capturing (.*[^\.]), which greedily matches any number of characters followed by one character that is neither a backslash nor a dot, so multiple sentences can be captured. The expression finished with \.\s, which is outside the capture group, so the captured text ends just before a period-then-space rather than including the period.
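To see the difference between the two expressions, here is a small Python sketch using the re module. The patterns are adapted from the ones above, and Python's regex dialect may differ in details from the spreadsheet's; the names TITLE_RE, GREEDY_RE and extract_title are invented for the example.

```python
import re

# Title = everything after '(year). ' up to and including the first period.
TITLE_RE = re.compile(r"\(\d{4}\)\.\s([^.]*\.)")
# The pattern from the question: greedy, so it can span several sentences.
GREEDY_RE = re.compile(r"\)\.\s(.*[^\.])\.\s")

SANDERS = (
    "Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for "
    "organizing the tools and techniques of participatory design. In "
    "Proceedings of the 11th biennial participatory design conference "
    "(pp. 195–198). ACM. Retrieved from "
    "http://dl.acm.org/citation.cfm?id=1900476"
)


def extract_title(citation):
    """Return the title, or None if the citation does not fit the pattern."""
    match = TITLE_RE.search(citation)
    return match.group(1) if match else None


print(extract_title(SANDERS))
# The greedy pattern runs on into the following sentences:
print(GREEDY_RE.search(SANDERS).group(1))
```

Note that [^.]* stops at the first period of any kind, so a title containing an abbreviation or a decimal number would be cut short.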
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't remove the parentheses around 2010, the spreadsheet interprets it as the negative number -2010.
For the title, try wrapping your existing formula in INDEX and SPLIT:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."

Remove \text generated by TeXForm

I need to remove all \text generated by TeXForm in Mathematica.
What I am doing now is this:
MyTeXForm[a_]:=StringReplace[ToString[TeXForm[a]], "\\text" -> ""]
But the result keeps the braces, for example:
for a=fx,
the result of TeXForm[a] is \text{fx}
the result of MyTeXForm[a] is {fx}
But I would like it to be just fx.
You should be able to use string patterns. Based on http://reference.wolfram.com/mathematica/tutorial/StringPatterns.html, something like the following should work:
MyTeXForm[a_]:=StringReplace[ToString[TeXForm[a]], "\\text{"~~s___~~"}"->s]
I don't have Mathematica handy right now, but this should say 'Match "\text{" followed by zero or more characters that are stored in the variable s, followed by "}", then replace all of that with whatever is stored in s.'
UPDATE:
The above works in the simplest case of there being a single "\text{...}" element, but the pattern s___ is greedy, so on input a+bb+xx+y, which Mathematica's TeXForm renders as "a+\text{bb}+\text{xx}+y", it matches everything between the first "\text{" and last "}" --- so, "bb}+\text{xx" --- leading to the output
In[1]:= MyTeXForm[a+bb+xx+y]
Out[1]= a+bb}+\text{xx+y
A fix for this is to wrap the pattern with Shortest[], leading to a second definition
In[2]:= MyTeXForm2[a_] := StringReplace[
ToString[TeXForm[a]],
Shortest["\\text{" ~~ s___ ~~ "}"] -> s
]
which yields the output
In[3]:= MyTeXForm2[a+bb+xx+y]
Out[3]= a+bb+xx+y
as desired.
Unfortunately this still won't work when the text itself contains a closing brace. For example, the input f["a}b","c}d"] (for some reason...) would give
In[4]:= MyTeXForm2[f["a}b","c}d"]]
Out[4]= f(a$\$b},c$\$d})
instead of "f(a$\}$b,c$\}$d)", which would be the proper processing of the TeXForm output "f(\text{a$\}$b},\text{c$\}$d})".
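One way around the escaped-brace problem is to make the pattern itself aware of backslash escapes. The Python sketch below works on the TeX strings quoted above rather than inside Mathematica, so it is only an illustration of the idea. It assumes that braces inside \text{...} only occur escaped (as \{ or \}); nested unescaped braces are not handled. The names TEXT_RE and strip_text are made up.

```python
import re

# \text{ ... } where the body is any run of: a character that is not a
# brace or backslash, or a backslash followed by any one character
# (so an escaped brace like \} does not end the match).
TEXT_RE = re.compile(r"\\text\{((?:[^{}\\]|\\.)*)\}")


def strip_text(tex):
    """Replace each \\text{...} with its body, leaving escapes intact."""
    # A function is used as the replacement so that backslashes in the
    # body are not reinterpreted as replacement-template escapes.
    return TEXT_RE.sub(lambda m: m.group(1), tex)


print(strip_text(r"a+\text{bb}+\text{xx}+y"))
print(strip_text(r"f(\text{a$\}$b},\text{c$\}$d})"))
```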
This is what I did (works fine for me):
MyTeXForm[a_] := ToString[ToExpression[StringReplace[ToString[TeXForm[a]], "\\text" -> ""]][[1]]]
This is a really late reply, but I just came up against the same issue and discovered a simple solution. Put a space between the variables in the Mathematica expression that you wish to convert using TeXForm.
For the original poster's example, the following code works great:
a=f x
TeXForm[a]
The output is as desired: f x
Since LaTeX will ignore that space in math mode, things will format correctly.
(As an aside, I was having the same issue with subscripted expressions that have two side-by-side variables in the subscript. Inserting a space between them solved the issue.)

Haskell/Parsec: How do you use the functions in Text.Parsec.Indent?

I'm having trouble working out how to use any of the functions in the Text.Parsec.Indent module provided by the indents package for Haskell, which is a sort of add-on for Parsec.
What do all these functions do? How are they to be used?
I can understand the brief Haddock description of withBlock, and I've found examples of how to use withBlock, runIndent and the IndentParser type here, here and here. I can also understand the documentation for the four parsers indentBrackets and friends. But many things are still confusing me.
In particular:
What is the difference between withBlock f a p and
do aa <- a
   pp <- block p
   return (f aa pp)
Likewise, what's the difference between withBlock' a p and do {a; block p}
In the family of functions indented and friends, what is ‘the level of the reference’? That is, what is ‘the reference’?
Again, with the functions indented and friends, how are they to be used? With the exception of withPos, it looks like they take no arguments and are all of type IParser () (IParser defined like this or this) so I'm guessing that all they can do is to produce an error or not and that they should appear in a do block, but I can't figure out the details.
I did at least find some examples on the usage of withPos in the source code, so I can probably figure that out if I stare at it for long enough.
<+/> comes with the helpful description “<+/> is to indentation sensitive parsers what ap is to monads” which is great if you want to spend several sessions trying to wrap your head around ap and then work out how that's analogous to a parser. The other three combinators are then defined with reference to <+/>, making the whole group unapproachable to a newcomer.
Do I need to use these? Can I just ignore them and use do instead?
The ordinary lexeme combinator and whiteSpace parser from Parsec will happily consume newlines in the middle of a multi-token construct without complaining. But in an indentation-style language, sometimes you want to stop parsing a lexical construct or throw an error if a line is broken and the next line is indented less than it should be. How do I go about doing this in Parsec?
In the language I am trying to parse, ideally the rules for when a lexical structure is allowed to continue on to the next line should depend on what tokens appear at the end of the first line or the beginning of the subsequent line. Is there an easy way to achieve this in Parsec? (If it is difficult then it is not something which I need to concern myself with at this time.)
So, the first hint is to take a look at IndentParser
type IndentParser s u a = ParsecT s u (State SourcePos) a
I.e. it's a ParsecT keeping an extra close watch on SourcePos, an abstract container which can be used to access, among other things, the current column number. So, it's probably storing the current "level of indentation" in SourcePos. That'd be my initial guess as to what "level of reference" means.
In short, indents gives you a new kind of Parsec which is context sensitive—in particular, sensitive to the current indentation. I'll answer your questions out of order.
(2) The "level of reference" is what the current parser state believes to be the position where this indentation level starts. To make that clearer, let me give some test cases under (3).
(3) In order to start experimenting with these functions, we'll build a little test runner. It'll run the parser with a string that we give it and then unwrap the inner State part using an initialPos which we get to modify. In code
import Text.Parsec
import Text.Parsec.Pos
import Text.Parsec.Indent
import Control.Monad.State
testParse :: (SourcePos -> SourcePos)
          -> IndentParser String () a
          -> String -> Either ParseError a
testParse f p src = fst $ flip runState (f $ initialPos "") $ runParserT p () "" src
(Note that this is almost runIndent, except I gave a backdoor to modify the initialPos.)
Now we can take a look at indented. By examining the source, I can tell it does two things. First, it'll fail if the current SourcePos column number is less-than-or-equal-to the "level of reference" stored in the SourcePos stored in the State. Second, it somewhat mysteriously updates the State SourcePos's line counter (not column counter) to be current.
Only the first behavior is important, to my understanding. We can see the difference here.
>>> testParse id indented ""
Left (line 1, column 1): not indented
>>> testParse id (spaces >> indented) " "
Right ()
>>> testParse id (many (char 'x') >> indented) "xxxx"
Right ()
So, in order to have indented succeed, we need to have consumed enough whitespace (or anything else!) to push our column position out past the "reference" column position. Otherwise, it'll fail saying "not indented". Similar behavior exists for the next three functions: same fails unless the current position and reference position are on the same line, sameOrIndented fails if the current column is strictly less than the reference column, unless they are on the same line, and checkIndent fails unless the current and reference columns match.
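As a language-neutral way to keep these checks apart, here is a toy Python model of three of them. It is not the library's code; it only restates the description above as predicates on a current position and a reference position. The Pos type and the function names are invented for the sketch.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """A simplified source position (1-based line and column)."""
    line: int
    column: int


def indented(current, reference):
    """Succeeds only when the current column is past the reference column."""
    return current.column > reference.column


def same(current, reference):
    """Succeeds only when both positions are on the same line."""
    return current.line == reference.line


def check_indent(current, reference):
    """Succeeds only when both columns match exactly."""
    return current.column == reference.column


reference = Pos(line=1, column=1)          # the initial reference position
print(indented(Pos(1, 1), reference))      # nothing consumed yet
print(indented(Pos(1, 2), reference))      # after consuming one space
print(indented(Pos(1, 5), reference))      # after consuming "xxxx"
```

The three calls mirror the three testParse runs shown earlier: the first corresponds to the "not indented" failure, the other two to Right ().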
withPos is slightly different. It's not just an IndentParser, it's an IndentParser-combinator—it transforms the input IndentParser into one that thinks the "reference column" (the SourcePos in the State) is exactly where it was when we called withPos.
This gives us another hint, btw. It lets us know we have the power to change the reference column.
(1) So now let's take a look at how block and withBlock work using our new, lower level reference column operators. withBlock is implemented in terms of block, so we'll start with block.
-- simplified from the actual source
block p = withPos $ many1 (checkIndent >> p)
So, block resets the "reference column" to whatever the current column is and then consumes at least one parse of p, as long as each one starts at exactly this newly set "reference column". Now we can take a look at withBlock
withBlock f a p = withPos $ do
  r1 <- a
  r2 <- option [] (indented >> block p)
  return (f r1 r2)
So, it resets the "reference column" to the current column, parses a single a parse, tries to parse an indented block of ps, then combines the results using f. Your implementation is almost correct, except that you need to use withPos to choose the correct "reference column".
Then, once you have withBlock, withBlock' = withBlock (\_ bs -> bs).
(5) So, indented and friends are exactly the tools for doing this: they'll cause a parse to immediately fail if it's indented incorrectly with respect to the "reference position" chosen by withPos.
(4) Yes, don't worry about these guys until you learn how to use Applicative style parsing in base Parsec. It's often a much cleaner, faster, simpler way of specifying parses. Sometimes they're even more powerful, but if you understand Monads then they're almost always completely equivalent.
(6) And this is the crux. The tools mentioned so far can only produce indentation failures if you can describe your intended indentation using withPos. Offhand, I don't think it's possible to choose the withPos position based on the success or failure of other parses... so you'll have to go another level deeper. Fortunately, the mechanism that makes IndentParsers work is obvious—it's just an inner State monad containing SourcePos. You can use lift :: MonadTrans t => m a -> t m a to manipulate this inner state and set the "reference column" however you like.
Cheers!

How to write an array into a text file in maxima?

I am relatively new to maxima. I want to know how to write an array into a text file using maxima.
I know it's late in the game for the original post, but I'll leave this here in case someone finds it in a search.
Let A be a Lisp array, Maxima array, matrix, list, or nested list. Then:
write_data (A, "some_file.data");
Let S be an output stream (created by openw or opena). Then:
write_data (A, S);
Entering ?? numericalio at the input prompt, or ?? write_ or ?? read_, will show some info about this function and related ones.
I've never used maxima (or even heard of it), but a little Google searching out of curiosity turned up this: http://arachnoid.com/maxima/files_functions.html
From what I can gather, you should be able to do something like this:
stringout("my_new_file.txt",values);
It says the second parameter to the stringout function can be one or more of these:
input: all user entries since the beginning of the session.
values: all user variable and array assignments.
functions: all user-defined functions (including functions defined within any loaded packages).
all: all of the above. Such a list is normally useful only for editing and extraction of useful sections.
So by passing values it should save your array assignments to file.
A bit more necroposting, since Google leads here and I didn't find the existing answers sufficient. I needed to export the data in the following form:
-0.8000,-0.8000,-0.2422,-0.242
-0.7942,-0.7942,-0.2387,-0.239
-0.7776,-0.7776,-0.2285,-0.228
-0.7514,-0.7514,-0.2124,-0.212
-0.7168,-0.7168,-0.1912,-0.191
-0.6750,-0.6750,-0.1655,-0.166
-0.6272,-0.6272,-0.1362,-0.136
-0.5746,-0.5746,-0.1039,-0.104
So I've found how to do this with printf:
with_stdout(filename, for i:1 thru length(z_points) do
printf (true,"~,4f,~,4f,~,4f,~,3f~%",bot_points[i],bot_points[i],top_points[i],top_points[i]));
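For readers unfamiliar with the format string: each ~,4f directive prints a number with four digits after the decimal point, ~,3f with three, and ~% ends the line. The Python sketch below reproduces the same row layout purely as an illustration of what the directives produce; it is not Maxima code, and the names format_row and rows_to_text are made up.

```python
def format_row(bottom, top):
    """One output row: bottom twice with 4 decimals, top with 4 and then 3."""
    return f"{bottom:.4f},{bottom:.4f},{top:.4f},{top:.3f}"


def rows_to_text(bottoms, tops):
    """Join one formatted row per pair of values, each ending in a newline."""
    return "".join(format_row(b, t) + "\n" for b, t in zip(bottoms, tops))


# Matches the first line of the sample output above.
print(format_row(-0.8, -0.2422))
```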
A slightly cleaner variation on @ProdoElmit's answer:
list : [1,2,3,4,5]$
with_stdout("file.txt", apply(print, list))$
/* 1 2 3 4 5 is then what appears in file.txt */
Here the trick with apply is needed as you probably don't want to have square brackets in your output, as is produced by print(list).
To print a matrix, I would do the following:
m : matrix([1,2],[3,4])$
with_stdout("file.txt", for row in args(m) do apply(print, row))$
/* 1 2
3 4
is what you then have in file.txt */
Note that in my solution the values are separated with spaces and the format of your values is fixed to that provided by print. Another caveat is that there is a limit on the number of function parameters: for example, for me (GCL 2.6.12) my method does not work if length(list) > 64.
