Parse arbitary user input [closed] - parsing

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have a database full of messages from a bulletin board. The board uses BB codes as formatting style. I.e.:
I'm not formatted
This is [b]bold[/b] text
Tags can also [i][b]be[/b] nested[/i]
And the [b]nesting [i]can be[/b] rather[/i] ugly
My ultimate goal is to convert these messages to some well formed XML (no discussion here ;) ). I don't want to use regular expression, which will fail at some point (in fact: it does).
First step: parse a message into some kind of internal representation (a graph, a tree, etc.). And I'm stuck at this point. The actual extraction is not that big problem, but the storage is.
How do I represent this kind of markup into some meaningful structure. My problem seems to be similar (or almost identical) to a browser building a DOM from a HTML file. So I think there are some strategies to solve it. I know the solution will not be perfect but im willing to invest a vast amount of time to do build the best possible.
Question: Do you have any tips/hint/comments? Any articles or paper you can recommend? Or a book which discusses these topic? I'm grateful for any input.

And the [b]nesting [i]can be[/b] rather[/i] ugly
I've written a parser very similar to what you are looking to do except that it would throw an error on your fourth example. Something to the effect of "Unexpected end tag [/b] within [i]".
I think that what you want to do is very doable but internally you will want to create a tree as if your original text was:
"And the [b]nesting [i]can be[/i][/b][i] rather[/i] ugly". (I don't think this would be necessary if you didn't need to convert it to XML later. If there were no need to convert to XML you could keep a linked list of text sections where each section is marked with its format combination)
Two possible approaches to this problem come to mind (of course there could be better possibilities). 1) Preprocess and insert the missing end and begin tags where necessary. 2) Build your parse tree and where there are overlapping tags imply the missing ones based on the current context. I think approach number (2) would be simpler and cleaner.
You could model your tree based on a composite pattern where you have an AbstractElement class, a TextElement class that extends AbstractElement, and a Tag class that extends AbstractElement and contains a list of sub-elements of type AbstractElement.
You would start by creating a root Tag instance. You would then call rootTag.parse(text). You would need a scanner that could return 3 types of tokens: text, start-tags, and end-tags. The scanner would allow you to push tokens onto it, which it would return before any normal scanned token. This would allow you to push new start tag tokens on after encountering and dealing with the unexpected end tag. You would also have to know when you are done with input. I'll use a 4th token type for that.
/* methods within class Tag */
public void parse(String text) {
MyScanner scanner = new MyScanner(text);
parse(scanner);
}
/* returns next token */
private Token parse(MyScanner scanner) {
Token firstToken = scanner.getNextToken();
return parse(scanner,firstToken);
}
private Token parse(MyScanner scanner) {
Token firstToken = scanner.getNextToken();
return parse(scanner,firstToken);
}
private Token parse(MyScanner scanner, Token token) {
while (!token.isDone() && !token.isEndTag()) {
if (token.isStartTag()) {
Tag subTag = new Tag(token.getValue());
token = scanner.getNextToken();
token = subTag.parse(scanner,token);
addElement(subTag);
}
else {
TextElement text = new TextElement(token.getValue());
addElement(text);
token = scanner.getNextToken();
}
}
if (token.isEndTag()) {
if (!token.getValue().equals(getName()) {
scanner.push(new Token(Token.START_TAG,token.getValue()));
}
else {
token = scanner.getNextToken();
}
}
return token;
}
So if you were to parse "And the [b]nesting [i]can be[/b] rather[/i] ugly", The following should get created.
rootTag.parse should be adding:
TextElement: "And the "
Tag: "b"
TextElement: "nesting "
Tag: "i"
TextElement: "can be"
(... at this point the odd [/b] is encountered ...)
(... push "i" start tag on the scanner ...)
(... here the [/b] is encountered (again) ...)
Tag: "i" (this was scanned because it had been pushed to the scanner)
TextElement: " rather"
TextElement: " ugly"
Note: Coding within a text area does not lend itself well to testing and debugging. Accept this answer as a hint or a possibility, not as your definate answer.

Related

Is it possible to deobfuscate this string of obfuscated Lua Code? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I have a(nother) string of lua code that's obfuscated. I'm wondering if it's possible to deobfuscate it, or to figure out how it was obfuscated, as I've never encountered anything like it before. This string of code is supposedly the main module for a malicious serverside script executor. Knowing what's inside will help us patch the exploit on our platform. I'm told that it would be easy to decipher by getting the constants, because it's VM based obfuscation, we just need a bit of help getting pointed in the right direction.
The code is rather large, so it's in this pastebin.
pastebin com/dtfzBPZk
Deobfuscating this one looks like it will be a slightly more manual process. As usual, the first things you should do are rename variables to have saner names, and add whitespace and indentation to the code. You can see a start to this at https://pastebin.com/eRTGAbTH. Once you do that, you'll see a pattern of functions like this:
(function(...)
local SynapseXen_116 = "hi xen doesn't work on sk8r please help"
local SynapseXen_092 = SynapseXen_100(38909278, 3932326132)
local SynapseXen_069 = {...}
for SynapseXen_109, SynapseXen_043 in pairs(
SynapseXen_069
) do
local SynapseXen_119
local SynapseXen_097 = type(SynapseXen_043)
if SynapseXen_097 == "number" then
SynapseXen_119 = SynapseXen_043
elseif SynapseXen_097 == "string" then
SynapseXen_119 = SynapseXen_043:len()
elseif SynapseXen_097 == "table" then
SynapseXen_119 = SynapseXen_100(4264903821, 30110892)
end
SynapseXen_092 = SynapseXen_092 + SynapseXen_119
end
SynapseXen_140[1171393165] =
SynapseXen_bit_bxor(
SynapseXen_bit_bxor(2179831066, SynapseXen_092),
SynapseXen_bit_bxor(2132161653, SynapseXen_082)
) -
string.len(SynapseXen_116) -
SynapseXen_139 -
#{
2716917292,
2960928816,
2092744992,
3945961999,
2156388474,
2523828292,
534526172
}
return SynapseXen_140[1171393165]
end)({}, {}, 14275, 107, "iIIllIIlIIilillilI", "i", 5327, 3211, 14382, 14643)
Now you can start to eliminate red herrings. For example, any time you see #{ a bunch of stuff in here }, you can just count the elements in the list, and replace the whole thing with the count. In this case, there's one of those near the end we can replace with the number 7. Next, look at SynapseXen_116. The only place it's used is for its length, so you can substitute that in as well. Now, After that, notice that this is declaring a function and then immediately invoking it, so you can substitute in its arguments. Continue going down that path until you uncover the heart of the Lua-in-Lua VM, and from there, it should be easy to plug in the Base64 at the end, and see what bytecode it decodes to.

How to dynamically set binding type's "formatOptions" and "constraints" in XML with binding?

I have a list of elements (OData set) and use a binding to show this list.
One field is for a quantity value and this value could sometimes need some decimal places.
The requirement is: only show that amount of decimal numbers that is also available in the OData service.
Annotation techniques can't be used.
I 'hacked' something that is misusing a formatter to update the type of a binding. But this is 'a hack' and it is not possible to convert it to XML views. (The reason is a different handling of the scope the formatter will be called).
So I am searching for a working solution for XML views.
The following code would not work but shows the issue:
new sap.m.Input({ // looking for an XML solution with bindings
value: {
path: "Quantity",
type: new sap.ui.model.type.Float({
// formatOptions
maxFractionDigits: "{QuantityDecimals}",
// ...
}, {
// constraints
minimum: 0
}),
// ...
}
});
The maxFractionDigits : "{QuantityDecimals}" should be "dynamic" and not a constant value.
Setting formatOptions and constraints dynamically in XML (via bindings or a declared function) is unfortunately not (yet) supported. But IMHO this is a valid enhancement request that app developers would greatly benefit from, since it encourages declarative programming.
I already asked for the same feature some years ago but in a comment at https://github.com/SAP/openui5/issues/2449#issuecomment-474304965 (See my answer to Frank's question "If XMLViews would allow a way to specify the dynamic constraints as well (e.g. enhanced syntax), would that fix the problem?").
Please create a new issue via https://github.com/SAP/openui5/issues/new with a clear description of what kind of problems the feature would resolve and possibly other use cases (You can add a link to my comment). I'll add my 👍 to your GitHub issue, and hopefully others too.
I'll update this answer as soon as the feature is available.
Get your dynamic number from your model and store it in a JS variable.
var nQuantityDecimals = this.getModel().getProperty("/QuantityDecimals");
new sap.m.Input({
value : {
path : "Quantity",
type : new sap.ui.model.type.Float({
maxFractionDigits : nQuantityDecimals,
source : {
groupingSeparator: ",",
decimalSeparator: ".",
groupingEnabled: false
}
}, {
minimum:0
})
}
}),

Confirming existence of a string in an xml table Lua

Good afternoon everyone,
My problem is that I have 2 XML lists
<List1> <Agency>String</Agency> </List1>
and
<List2><Agency2>String</Agency2><List2>.
In Lua I need to create a program which is parsing this list and when the user inputs a matching string from List 1 or List 2, the program needs to actually confirm to the user if the string belongs to either L1 or L2 or if the string is inexistent. I'm new to Lua and to programming generally speaking and I would be very grateful for you answers. I have LuaExpat as a plugin but I can't seem to be able to actually read from file, I can only do some beginner tricks if the xml list is written in the code. At a later time this small program will be fed by an RSS.
require("lxp")
local stuff = {}
xmldata="<Top><A/> <B a='1'/> <B a='2'/><B a='3'/><C a='3'/></Top>"
function doFunc(parser, name, attr)
if not (name == 'B') then return end
stuff[#stuff+1]= attr
end
local xml = lxp.new{StartElement = doFunc}
xml:parse(xmldata)
xml:close()
print(stuff[3].a)
This code is a tutorial over the web that works, everything is just fine it prints nr. 3. Now I want to know how to do that from an actual file, as if I input io.read:(file, "r" or "rb" ) under xmldata variable and run the same thing it returns either empty space or nil.

Check if entered text is valid in Xtext

lets say we have some grammar like this.
Model:
greeting+=Greeting*;
Greeting:
'Hello' name=ID '!';
I would like to check whether the text written text in name is a valid text.
All the valid words are saved in an array.
Also the array should be filled with words from a given file.
So is it possible to check this at runtime and maybe also use this words as suggestions.
Thanks
For this purpose you can use a validator.
A simple video tutorial about it can be found here
In your case the function in the validator could look like this:
public static val INVALID_NAME = "greeting_InvalidName"
#Check
def nameIsValid(Greeting grt) {
val name = grt.getName() //or just grt.Name
val validNames = NewArrayList
//add all valid names to this list
if (!validNames.contains(name)) {
val errorMsg = "Name is not valid"
error(errorMsg, GreetingsPackage.eINSTANCE.Greeting_name, INVALID_NAME)
}
}
You might have to replace the "GreetingsPackage" if your DSL isn't named Greetings.
The static String passed to the error-method serves for identification of the error. This gets important when you want to implement Quickfixes which is the second thing you have asked for as they provide the possibility to give the programmer a few ideas how to actually fix this particular problem.
Because I don't have any experience with implementing quickfixes myself I can just give you this as a reference.

What do I need to add to use monadUserState with alex when parsing?

I am trying to write a program that will understand a language where embedded comments are allowed. Such as:
/* Here's a comment
/* This comment is further embedded */ second comment is closed
Must close first comment */
This should be recognized as a comment (and as such not stop at the first */ it sees unless it has only seen 1 comment opening prior).
This would be an easy issue to fix in C, I could simply have a counter that incremented when it saw comment opens and decrements when it sees a comment close. If the counter is at 0, we're in "code section".
However, without having state in Haskell, it's a little more challenging.
I've read up on monadUserState which supposedly allows to keep track of a state for this exact type of parsing. However, I can't find very much reading material on it aside from the tutorial page on alex.
When I try to compile it gives the error
templates\wrappers.hs:213:16: Not in scope: `alexEOF`
It should be noted that I directly changed from the "basic" wrapper to the "monadUserState" without changing my code (I don't know what to add in order to use it). It says that this must be initialized in the user code:
data AlexState = AlexState {
alex_pos :: !AlexPosn, -- position at current input location
alex_inp :: String, -- the current input
alex_chr :: !Char, -- the character before the input
alex_bytes :: [Byte], -- rest of the bytes for the current char
alex_scd :: !Int, -- the current startcode
alex_ust :: AlexUserState -- AlexUserState will be defined in the user program
}
I'm a bit of a lexxing noob and I'm not at all sure what I should be adding here to make it at least compile... then I can worry about the logic of the thing.
Update: Working example available here: http://lpaste.net/119212
The file "tiger.x" (link) in the alex github repo contains an example of how to track embedded comments using the monadUserState wrapper.
Well, unfortunately that example doesn't compile but the ideas there should work.
Basically, these lines perform embedded comment processing:
<0> "/*" { enterNewComment `andBegin` state_comment }
<state_comment> "/*" { embedComment }
<state_comment> "*/" { unembedComment }
<state_comment> . ;
<state_comment> \n { skip }
As for alexEOF, the idea is to add an EOF token to your token data type:
data Tokens = ... | EOF
and define alexEOF as:
alexEOF = return EOF
See the file tests/tokens_monadUserState_bytestring.x in the alex repo for an example of this.

Resources