Incremental processing and enrichment with storm - stream

We have a system which is designed for processing social media content. In our storm topology we have some bolts to process, such as sentiment analysis, language detection, spam detection and so on. All tutorials and examples prepared on storm, we have seen that a bolt can emit the tuple fields which has declared in declareOutputFields() methods. Is there any option to emit the current bolt's field with input tuple?
For example i have an input tuple which contains the fields below:
<
text : bla bla
username: paul
date: 01.01.2013
source:twitter
>
I want to define the output tuple as:
<
text : bla bla
username: paul
date: 01.01.2013
source:twitter
lang:tr
>
Note that i want my bolts don't need to know anything about before bolt's output tuple schema.
Thank you.

You could achieve something like this by writing a function that returns a bolt given some input rather than writing the bolt directly. You parameterize the creation of the bolt by writing a function that will return a bolt object with the output fields you want.
Obviously this will have to be done at the time you deploy your topology, so it can't be dynamic on the stream at run time, but it can be dynamic at startup time. Something like
(defn make-bolt [bolt-name input-fields]
(defbolt bolt-name input-fields
...))
....
(topology
{} ;; spouts
{"a-bolt" (bolt-spec {"a-spout":shuffle}
(make-bolt bolt-name ["input" "tuple" "lang"]))))

Related

How to output text from lists of variable sizes using printf?

I'm trying to automate some output using printf but I'm struggling to find a way to pass to it the list of arguments expr_1, ..., expr_n in
printf (dest, string, expr_1, ..., expr_n)
I thought of using something like Javascript's spread operator but I'm not even sure I should need it.
For instace, say I have a list of strings to be output
a:["foo","bar","foobar"];
a string of appropriate format descriptors, say
s: "~a ~a ~a ~%";
and an output stream, say os. How can I invoke printf using these things in such a way that the result will be the same as writing
printf(os,s,a[1],a[2],a[3]);
Then I could generalize it to output lists of variable size.
Any suggestions?
Thanks.
EDIT:
I just learned about apply and, using the conditions I posed in my OP, the following seems to work wonderfully:
apply(printf,append([os,s],a));
Maxima printf implements most or maybe all of the formatting operators from Common Lisp FORMAT, which are quite extensive; see: http://www.lispworks.com/documentation/HyperSpec/Body/22_c.htm See also ? printf in Maxima to get an abbreviated list of formatting operators.
In particular for a list you can do something like:
printf (os, "my list: ~{~a~^, ~}~%", a);
to get the elements of a separated by ,. Here "~{...~}" tells printf to expect a list, and ~a is how to format each element, ~^ means omit the inter-element stuff after the last element, and , means put that between elements. Of course , could be anything.
There are many variations on that; if that's not what you're looking for, maybe I can help you find it.

How to parse and translate DSL using Red or Rebol

I'm trying to see if I can use Red (or Rebol) to implement a simple DSL. I want to compile my DSL to source code for another language, perhaps Red or C# or both - rather than directly interpreting and executing it.
The DSL has only a couple of simple statements, plus an if/else statement.
Statements can be grouped into rules. A rule would get translated into a function definition, with each statement the equivalent statement in the target language.
The parse capability in Red/Rebol is great and lets me implement a parser very easily - in effect it's basically just the definition of the grammar itself.
However I haven't been able to find any examples of how to take the next steps, specifically handling an if statement and translating it to other source code.
Translating an if statement seems a good example of something minimal but still slightly tricky - because in Red having an else means you need to change the if to an either, rather than just an extra optional else.
Traditionally, during parsing I would build an abstract syntax tree, and then have functions to operate on the AST and generate the new source code. Should I be following this same approach or is there some other more idiomatic way in Red ?
I've experimented with using collect/keep in my parse rules to return a block of nested blocks, which in effect forms the AST. Another approach would be to save data into specific objects representing the different statements etc.
I'm still getting to grips with collect/keep, as to when a new block will be created and what will be kept. I'd also like to keep my parser rules as "clean" as possible, with as little other code intertwined in it. So I'm still not sure how best to add in Red code in round brackets in the parse rules. Adding code too early can cause the Red code to get executed, even if the rule eventually fails. Adding code too late means the code may not be executed in the order you expect, especially when dealing with multi-level statements like if, which can contain other statements.
So, specifically, any help on how to translate my example DSL to Red source code would be appreciated. Also any links to implementing DSLs like this in Red or Rebol would be great ! :)
Here are my parse rules :-
Red [
Purpose: example rules for parsing a simple language
]
SimpleLanguageParser: make object! [
Expr: [string! | integer! | block!]
Data: ['Person.AGE | 'Person.INCOME]
WriteMessageToLog: ['write 'message 'to 'log Expr]
SetData: ['set 'data Data '= Expr]
IfStatement: ['if Expr [any Statement] opt ['else [any Statement]] 'endif]
Statement: [WriteMessageToLog | SetData | IfStatement]
Rule: [
'rule word!
[any Statement]
'endrule
]
AnySimpLeLanguage: [Rule | [any Statement]]
]
SL: function [slInput] [
parse slInput SimpleLanguageParser/AnySimpleLanguage
]
An example of some source in the DSL :-
RULE TooYoung
IF [Person.Age < 15]
WRITE MESSAGE TO LOG "too young to earn an income"
SET DATA Person.Income = 0
ELSE
WRITE MESSAGE TO LOG "old enough"
ENDIF
ENDRULE
Translated to Red source code :-
TooYoung: function [] [
either Person.Age < 15 [
WriteMessageToLog "too young to earn an income"
Person.Income: 0
] [
WriteMessageToLog "old enough"
]
]
The data, ie Person.Age, Person.Income, and the function WriteMessageToLog are all things which would have been previously defined.
Note, for simplicity I've left Expr as block! etc, rather than defining Expr in any more detail in the DSL itself. Also, setting Person.Income in the function doesn't work as coded as it sets a local - but that's ok for now :)
Always nice to see someone digging language-oriented programming, keep it up and welcome to Red! ;)
Specifying correct grammar rules is the trickiest part of the job, and you've already nailed that. What's left is to intersperse your PEG (parsing expression grammar) with set, copy, collect/keep combo and paren! expressions in the right places, and then either create an AST from that or, in simplier cases, emit code directly.
Example
Here's a quickly baked (and definitely buggy!) example of how I'd tackled your task. Basically, it's slightly reworked code of yours, where matched patterns are setted, copyed or collected, and then bounded to specific words, which then just pasted into "template" (compose function inside emit-rule) to produce a Red code.
It's not the only way, I believe. #rebolek might come up with more industrial-strength solution, as he has experience with sophisticated parsers, which I'm lacking :P
Followup
As for if/else dilemma, I followed the approach proposed above -- instead of using opt I wrapped rule for else-branch into block and added an alternative match, which just sets false-block to none.
What to use for AST -- anything that allow to express a hierarchical structure, which is either a block! (though for performance gain you might want to use hash! or map!) or an object!. The advantage of object! is that it provides a context to be bound to, but here we're approaching a realm of so-called Bindology ("scoping" rules of Red language), which is another beast :)
Emitting C# code would be harder, but doable -- you'll need to assemble a string instead of Red code. I think, however, that, as a newcomer, you should stick with parsing directly at block-level (the way you done in your example), because it a lot easier and much expressive.
Another interesting (but much hairy) approach would be to re-define all words used in your DSL-block to work as you want.
Resources
We have a wiki entry about Red/Rebol dialects on github, which you might find if not useful, but interesting to read.
Also, two articles (this and this) in Red blog, I think you skimmed over first one already (if not, you should!).
Last, but not least, an exhaustive review of Parse principles and keywords (which has a couple of wrong parts in it though, so, caveat emptor). It's written for Rebol, but you should adapt examples to Red rather easily.
As a relative newcomer to the language, I do agree that there's a lack of examples and tutorials about DSL development, but we're working on that (at least in our heads) :)
Taking 9214's answer as a starting point, I've coded one possible solution. My approach has been :-
try to keep the parse rules as "clean" as possible
use collect and keep to return a block as the result, rather than trying to build a more complex AST
do some minimal translation in the keeps
the resulting block should be valid Red code
which uses predefined functions, where any more complex processing needs to happen
Most simple statements are easily translated to functions eg WRITE MESSAGE TO LOG becomes SL_WriteMessageToLog which can then do whatever it needs to do.
More complicated statements with structure, eg If/Else become functions which take block parameters which can then process the blocks as required.
For the If/Else complication, I've made this into two separate functions, SL_If and SL_Else. SL_If stores the result of the condition in a sequence, and SL_Else checks the latest result and removes it. This allows for nested If/Elses.
The presence of the final endrule can be checked for to ensure the input was correctly parsed. Once this is removed, we should have a valid function definition.
Here's the code :-
Red [
Purpose: example rules for parsing and translating a simple language
]
; some data
Person.AGE: 0
Person.INCOME: 0
; functions to implement some simple SL statements
SL_WriteMessageToLog: function [value] [
print value
]
SL_SetData: function [parmblock] [
field: parmblock/1
value: parmblock/2
if type? value = word! [
value: do value
]
print ["old value" field "=" do field]
set field value
print ["new value" field "=" do field]
]
; hold the If condition results, to be used to determine whether or not to do Else
IfConditionResults: []
SL_If: function [cond stats] [
cond_result: do cond
head insert IfConditionResults cond_result
if cond_result stats
]
SL_Else: function [stats] [
cond_result: first IfConditionResults
remove IfConditionResults
if not cond_result stats
]
; parsing rules
SimpleLanguageParser: make object! [
Expr: [logic! | string! | integer! | block!]
Data: ['Person.AGE | 'Person.INCOME]
WriteMessageToLog: ['write 'message 'to 'log set x Expr keep ('SL_WriteMessageToLog) keep (x)]
SetData: ['set 'data set d Data '= set x Expr keep ('SL_SetData) keep (reduce [d x])]
IfStatement: ['if keep ('SL_If) keep Expr collect [any Statement] opt ['else keep ('SL_Else) collect [any Statement]] 'endif]
Statement: [WriteMessageToLog | SetData | IfStatement]
Rule: [collect [
'rule set fname word! keep (to set-word! fname) keep ('does)
collect [any Statement]
keep 'endrule
]
]
AnySimpLeLanguage: [Rule | [any Statement]]
]
SL: function [slInput] [
parse slInput SimpleLanguageParser/Rule
]
For the example in the original post, the output is :-
TooYoung: does [
SL_If [Person.Age < 15] [
SL_WriteMessageToLog "too young to earn an income"
SL_SetData [Person.Income 0]
]
SL_Else [
SL_WriteMessageToLog "old enough"
]
]
ENDRULE
Thanks for your help to get this far.
Feedback on this approach and solution would be appreciated :)

Erlang regexp matching on Chinese characters

TL;DR:
25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
in function re:run/3
called as re:run([1081,1094,1091,46,97,115,100],
"^(.*\\..*)$",
[{capture,none}])
How to make this work? 'йцу' are characters that don't belong in a latin charset, obviously; is there a way to tell the re module or entire system to run with a different charset for "strings"?
ORIGINAL QUESTION (for the record):
Another "Programming Erlang" question )
in Chapter 16 there's an example about reading tags from the mp3 files. It works, great. But, there seems to be some bug in a provided module, lib_find, which has a function for searching a path for matching files. This is the call that works:
61> lib_find:files("../..", "*.mp3", true).
["../../early/files/Veronique.mp3"]
and this call fails:
62> lib_find:files("../../..", "*.mp3", true).
** exception error: bad argument
in function re:run/3
called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
105,108,100,115,47,50,48,46,49,47,111|...],
"^(.*\\.mp3)$",
[{capture,none}])
in call from lib_find:find_files/6 (lib_find.erl, line 29)
in call from lib_find:find_files/6 (lib_find.erl, line 39)
in call from lib_find:files/3 (lib_find.erl, line 17)
Ironically, the investigation led to finding the culprit in Erlang's own installation:
.kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴
OK, this seems to mean Erlang is using a more restrictive default charset, which doesn't include hànzì. What are the options? Obviously, I can just ignore this and move on with my study, but I feel I can learn more from this one =) Such as - where/how can I fix the default charset? I'm a little surprised it's something other than UTF8 by default - so maybe I'm on a wrong track?
Thanks!
TL;DR:
UTF-8 regexs are accessible by putting the regex pattern into unicode mode with the option unicode. (Note below that the string "^(.*\\..*)$" is the result of your call to xmerl_regexp:sh_to_awk/1.)
1> re:run("なにこれ.txt", "^(.*\\..*)$").
** exception error: bad argument
in function re:run/2
called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
{match,[{0,16},{0,16}]}
And from your exact example:
11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
match
Or
12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
{ok,{re_pattern,1,1,0,
<<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
255,255,255,...>>}}
13> re:run("йцу.asd", Pattern, [{capture, none}]).
match
The docs for re are pretty long and extensive, but that's because regexs are an inherently complex subject. You can find options for compiled regexs in the docs for re:compile/2 and the options for run in the docs for re:run/3.
Discussion
Erlang has settled on the idea that strings, though still a list of codepoints, are all UTF-8 everywhere. As I work in Japan and deal with this all the time, this has come as a big relief to me because I can stop using about half of the conversion libraries I had needed in the past (yay!), but has complicated matters a bit for users of the string module because many operations there now perform under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, so long as those clusters exist on the first level of the list).
Unfortunately, encodings are just not very easy things to deal with and UTF-8 is anything but simple once you step out of the most common representations -- so much of this is a work in progress. I can tell you with confidence, though, that dealing with UTF-8 data in binary, string, deep list, and io_data() forms, whether file names, file data, network data, or user input from WX or web forms works as expected once you read the unicode, regex and string docs.
But that is, of course, a lot of stuff to get familiar with. 99% of the time things will work as expected if you decode everything incoming from outside as UTF-8 using unicode:characters_to_list/1 and unicode:characters_to_binary/1, and specify binary strings as utf8 binary types everywhere:
3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
4> UnicodeString = unicode:characters_to_list(UnicodeBin).
[12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
12290]
5> io:format("~ts~n", [UnicodeString]).
この文書はUTF-8です。
ok
6> re:run(UnicodeString, "UTF-8", [unicode]).
{match,[{15,5}]}
7> re:run(UnicodeBin, "UTF-8", [unicode]).
{match,[{15,5}]}
8> unicode:characters_to_binary(UnicodeString).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
9> unicode:characters_to_binary(UnicodeBin).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>

Writing a text file containing LaTeX code from maxima expressions

Suppose in a (wx)Maxima session I have the following
f:sin(x);
df:diff(f,x);
Now I want to have it output a text file containing something like, for example
If $f(x)=\sin(x)$, then $f^\prime(x)=\cos(x)$.
I found the tex and tex1 functions but I think I need some additional string processing to be able to do what I want.
Any help appreciated.
EDIT: Further clarifications.
Auto Multiple Choice is a software that helps you create and manage questionaires. To declare questions one may use LaTeX syntax. From AMC's documentation, a question looks like this:
\element{geographie}{
\begin{question}{Cameroon}
Which is the capital city of Cameroon?
\begin{choices}
\correctchoice{Yaoundé}
\wrongchoice{Douala}
\wrongchoice{Abou-Dabi}
\end{choices}
\end{question}
}
As can be seen, it is just LaTeX. Now, with a little modification, I can turn this example into a math question
\element{derivatives}{
\begin{question}{trig_fun_diff_1}
If $f(x)=\sin(x)$ then $f^\prime(0)$ is
\begin{choices}
\correctchoice{$1$}
\wrongchoice{$-1$}
\wrongchoice{$0$}
\end{choices}
\end{question}
}
This is the sort of output I want. I'll have, say, a list of functions then execute a loop calculating their derivatives and so on.
OK, in response to your updated question. My advice is to work with questions and answers as expressions -- build up your list of questions first, and then when you have the list in the structure that you want, then output the TeX file as the last step. It is generally much clearer and simpler to work with expressions than with strings.
E.g. Here is a simplistic approach. I'll use defstruct to define a structure so that I can refer to its parts by name.
defstruct (question (name, datum, item, correct, incorrect));
myq1 : new (question);
myq1#name : "trig_fun_diff_1";
myq1#datum : f(x) = sin(x);
myq1#item : 'at ('diff (f(x), x), x = 0);
myq1#correct : 1;
myq1#incorrect : [0, -1];
You can also write
myq1 : question ("trig_fun_diff_1", f(x) = sin(x),
'at ('diff (f(x), x), x = 0), 1, [0, -1]);
I don't know which form is more convenient for you.
Then you can make an output function similar to this:
tex_question (q, output_stream) :=
(printf (output_stream, "\\begin{question}{~a}~%", q#name),
printf (output_stream, "If $~a$, then $~a$ is:~%", tex1 (q#datum), tex1 (q#item)),
printf (output_stream, "\\begin{choices}~%"),
/* make a list comprising correct and incorrect here */
/* shuffle the list (see random_permutation) */
/* output each correct or incorrect here */
printf (output_stream, "\\end{choices}~%"),
printf (output_stream, "\\end{question}~%));
where output_stream is an output stream as returned by openw (which see).
It may take a little bit of trying different stuff to get derivatives to be output in just the format you want. My advice is to put the logic for that into the output function.
A side effect of working with expressions is that it is straightforward to output some representations other than TeX (e.g. plain text, XML, HTML). That might or might not become important for your project.
Well, tex is the TeX output function. It can be customized to some extent via texput (which see).
As to post-processing via string manipulation, I don't recommend it. However, if you want to go down that road, there are regex functions which you can access via load(sregex). Unfortunately it's not yet documented; see the comment header of sregex.lisp (somewhere in your Maxima installation) for examples.

How to write an array into a text file in maxima?

I am relatively new to maxima. I want to know how to write an array into a text file using maxima.
I know it's late in the game for the original post, but I'll leave this here in case someone finds it in a search.
Let A be a Lisp array, Maxima array, matrix, list, or nested list. Then:
write_data (A, "some_file.data");
Let S be an ouput stream (created by openw or opena). Then:
write_data (A, S);
Entering ?? numericalio at the input prompt, or ?? write_ or ?? read_, will show some info about this function and related ones.
I've never used maxima (or even heard of it), but a little Google searching out of curiousity turned up this: http://arachnoid.com/maxima/files_functions.html
From what I can gather, you should be able to do something like this:
stringout("my_new_file.txt",values);
It says the second parameter to the stringout function can be one or more of these:
input: all user entries since the beginning of the session.
values: all user variable and array assignments.
functions: all user-defined functions (including functions defined within any loaded packages).
all: all of the above. Such a list is normally useful only for editing and extraction of useful sections.
So by passing values it should save your array assignments to file.
A bit more necroposting, as google leads here, but I haven't found it useful enough. I've needed to export it as following:
-0.8000,-0.8000,-0.2422,-0.242
-0.7942,-0.7942,-0.2387,-0.239
-0.7776,-0.7776,-0.2285,-0.228
-0.7514,-0.7514,-0.2124,-0.212
-0.7168,-0.7168,-0.1912,-0.191
-0.6750,-0.6750,-0.1655,-0.166
-0.6272,-0.6272,-0.1362,-0.136
-0.5746,-0.5746,-0.1039,-0.104
So I've found how to do this with printf:
with_stdout(filename, for i:1 thru length(z_points) do
printf (true,"~,4f,~,4f,~,4f,~,3f~%",bot_points[i],bot_points[i],top_points[i],top_points[i]));
A bit cleaner variation on the #ProdoElmit's answer:
list : [1,2,3,4,5]$
with_stdout("file.txt", apply(print, list))$
/* 1 2 3 4 5 is then what appears in file.txt */
Here the trick with apply is needed as you probably don't want to have square brackets in your output, as is produced by print(list).
For a matrix to be printed out, I would have done the following:
m : matrix([1,2],[3,4])$
with_stdout("file.txt", for row in args(m) do apply(print, row))$
/* 1 2
3 4
is what you then have in file.txt */
Note that in my solution the values are separated with spaces and the format of your values is fixed to that provided by print. Another caveat is that there is a limit on the number of function parameters: for example, for me (GCL 2.6.12) my method does not work if length(list) > 64.

Resources