SVG Path Data with multiple dots in one value - parsing

I had to write my own SVG path parser and discovered that I cannot parse some files like Skull_and_crossbones.svg from Wikipedia.
In the second path's data I found -24.57.56, which looks like an invalid value, and I cannot see how to parse it.

If you look at the spec for the grammar of path data, you will find the following explanation:
The processing of the BNF must consume as much of a given BNF production as possible, stopping at the point when a character is encountered which no longer satisfies the production... for the string M 0.6.5, the first coordinate of the "moveto" consumes the characters 0.6 and stops upon encountering the second decimal point because the production of a "coordinate" only allows one decimal point. The result is that the first coordinate will be 0.6 and the second coordinate will be .5.
For your example, the sequence -24.57.56 is equivalent to -24.57, 0.56.
You could also say: leading zeros before a decimal point, commas, and whitespace are always optional. Authors writing path data only need them to avoid ambiguity, i.e. to make sure that the longest sequence parseable as one number matches their intention.

It is not an invalid value. It is two valid values. The first value is -24.57 and the second value is .56.
The path data grammar does not require spaces between coordinate values. Separators are only needed where leaving them out would change the result: for example, 1 0.5 cannot be shortened to 10.5, because that parses as a single number.
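A minimal sketch of this greedy scanning rule in Python (the regex and the scan_numbers function are illustrative, not taken from any SVG library):

    import re

    # Greedy number scanner: optional sign, digits, and at most one
    # decimal point. A second '.' ends the current number, which is
    # exactly the spec's "consume as much as possible" rule.
    NUMBER = re.compile(r'[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?')

    def scan_numbers(data):
        """Split SVG path argument text into numbers, tolerating
        compact forms like '-24.57.56' and '1-2'."""
        numbers = []
        pos = 0
        while pos < len(data):
            # Commas and whitespace are optional separators: skip them.
            if data[pos] in ' \t\r\n,':
                pos += 1
                continue
            match = NUMBER.match(data, pos)
            if not match:
                raise ValueError('unexpected character at %d: %r' % (pos, data[pos]))
            numbers.append(float(match.group()))
            pos = match.end()
        return numbers

    print(scan_numbers('-24.57.56'))  # [-24.57, 0.56]
    print(scan_numbers('1 0.5'))      # [1.0, 0.5] -- but '10.5' is one number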

Related

Antlr: lookahead and lookbehind examples

I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
EDIT:
Per the comments, here's some context. The problem is related to this one: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc., and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered and run a totally different tokenizer on them. The reason is that I need a custom, human-language-specific tokenizer for this content, which means I can't describe in advance what an ID is; each human language is different. I want to combine, in stages, a single query-language tokenizer with a human-language tokenizer applied to what's left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider instead is NLP, as a different approach to processing natural language. It takes more than just skipping things between two known tokens.
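That said, the narrow sub-question, matching AND, OR, and NOT only when delimited without consuming the delimiters, is exactly what zero-width lookaround assertions do in plain regex. A minimal Python sketch for illustration (this is regex syntax, not ANTLR):

    import re

    # Lookbehind/lookahead are zero-width: the spaces and parentheses
    # are required but not consumed, so they stay available as tokens.
    OPERATOR = re.compile(
        r'(?:^|(?<=[\s)]))(?:AND|OR)(?=$|[\s(])'   # AND/OR between delimiters
        r'|(?:^|(?<=\s))NOT(?=[\s(])'              # NOT at start or after a space
    )

    for text in ['x AND y', '(x)AND(y)', 'NOT x', 'NOT(x)', 'xANDy', 'abcNOTdef']:
        print('%-12r -> %s' % (text, [m.group() for m in OPERATOR.finditer(text)]))
    # The first four inputs yield a hit; 'xANDy' and 'abcNOTdef' yield none.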

GREP: can you have 2 positive lookahead token/argument pairs in a legal GREP query?

I'm having trouble finding a sequence that works in InDesign to move full stops (US English: periods) in front of footnotes when they trail the superscripted endnote references to a word/phrase.
So make this: foo<sup>23,25</sup>. into that: foo.<sup>23,25</sup> (the tags aren't literally there; they just indicate to you, the reader, that these numbers are in superscript, since Markdown doesn't do superscript, I think).
Because my positive lookbehind is not working, I'm looking to use a sequence of two or more positive lookbehind tokens, but is this within the rules?
I wrote a GREP token that hits all the endnote references, whatever combination of spaces, commas and digits. But I can't put the found text into Change To in InDesign, because that breaks all the hyperlinks to the endnotes. So I need to use positive lookahead and positive lookbehind to move the full stops: first remove the existing one, then add the new one before the endnote references. But the same token, say this one of many possible to pick up any of
{n, n, n…} —> \d[\d\, \,]+ (and I add \. to catch the period), will not get a single hit as the argument of a positive-lookbehind token,
i.e. (?<=\d[\d\, \,]+)\. doesn't get a hit. I tried various variations too, and lookahead. What about what ID calls "unmarking subexpressions", which TextWrangler, I think, refers to as Perl-style pattern expressions?
I can use negative lookbehind to find periods following digits, i.e. (?<![a-zA-Z])\., but that won't give me the entire endnote reference sequence to mark and put a period in front of.
This GREP is all executed within the Adobe InDesign layout software, so there is no command-line execution. It's okay if I use two operations rather than one Find/Change: first add the preceding period, then remove the trailing one.
I want to remove the period character at the green arrow and add one at the red arrow (in a screenshot, not reproduced here) for any given series of endnote reference numbers and commas. The central problem is that found hits on endnote strings CANNOT be used in the Change To token as found strings, because that will remove their (hidden) indexing as Cross-References linking them to the Endnotes, which is what produces the hyperlink connections in the exported PDF (amongst other reasons). (Ignore the Find token in the screenshot.)
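No answer is recorded here, but the two-operation plan can at least be sketched as plain regex. The Python below is illustrative only: the sample text is invented, superscript formatting cannot be modeled in a flat string, and InDesign's GREP flavor, especially its lookbehind support, would need its own testing.

    import re

    text = 'Lorem ipsum foo23, 25. Dolor sit.'

    # Pass 1: insert a period before an endnote run that trails a word
    # and ends with a period. Both assertions are zero-width, so the
    # reference digits are never part of the found or replaced text --
    # mirroring the constraint that the cross-referenced endnote
    # characters must not be touched.
    text = re.sub(r'(?<=[a-zA-Z])(?=\d[\d, ]*\.)', '.', text)

    # Pass 2: delete the period trailing the run. Python's re module
    # only allows fixed-width lookbehind, so anchor on the one digit
    # before the period; (?!\d) keeps decimals like 3.14 intact.
    text = re.sub(r'(?<=\d)\.(?!\d)', '', text)

    print(text)  # Lorem ipsum foo.23, 25 Dolor sit.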

When to use V instead of a decimal in Cobol Pic Clauses

Studying for a test right now and can't seem to wrap my head around when to use "V" for a decimal instead of an actual decimal in PIC clauses. I've done some research but can't find anything I understand. I've only been learning COBOL for about a week, so is there a rule of thumb here? Thanks for your time.
You use an actual decimal point when you want to "output" a value which has decimal places: a report line, a position on a screen, an item in an output file which is going to a "different" system which doesn't understand the format with an implied decimal place.
That's what the V is, it is an implied decimal place. It tells the compiler where to align results from calculations, MOVEs, whatever. Computer chips, and the machine instructions they support, don't know about actual decimal points for their internal processing.
COBOL is a language with fixed-length fields. The machine instructions don't need to know where the decimal point is (effectively it can deal with everything as integer values) but the compiler does, and the compiler has to do the correct scaling and alignment of results.
When storing data in your own files, use V, the implied decimal place.
For data which is to be "human readable", or read by a system which cannot understand your format and cannot scale what looks like an integer, use an actual decimal point, . (for computer-readable stuff, you can sometimes use a separate scaling factor instead, if that is more convenient for the receiving system).
Basically, V for internal and . for external should be a rule of thumb to get you there.
Which COBOL are you using? I'm surprised it is not covered in your documentation.
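A rough analogy in Python rather than COBOL (illustrative only): a field declared PIC 9(3)V99 stores five digits and no physical point; the scale lives in the declaration, and only display formatting inserts a real decimal point.

    # Rough analogy of PIC 9(3)V99: five stored digits, two of them
    # implied decimal places. The stored value contains no '.' at all;
    # the "compiler" (here, this code) knows where the point belongs.
    SCALE = 2  # digits after the implied decimal point (after the V)

    def to_internal(value):
        """MOVE a value into the field: keep it as a scaled integer."""
        return round(value * 10**SCALE)

    def to_display(internal):
        """Edit the field for output: insert an actual decimal point."""
        return '%.*f' % (SCALE, internal / 10**SCALE)

    stored = to_internal(123.45)
    print(stored)              # 12345   -- what sits in the record
    print(to_display(stored))  # 123.45  -- only for external output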

Huffman code for a single character?

Let's say I have a massive string of just a single character, say x, and I need to use Huffman encoding.
A Huffman encoding is a full binary tree. So how does one create a Huffman code for just a single character, when we don't need two leaves at all?
jbr's answer is fine; this is just a longer version of it.
Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length.
In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without reading anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
If you don't have a length encoded separately, then you must have a symbol to represent End Of Sequence so the decoder knows when to stop reading. Then your Huffman tree will have two leaves, and you won't run into this special case.
If you only have one symbol, then you only need 1 bit per symbol. So you really don't have to do anything except count the number of bits and translate each into your symbol.
You could simply add an edge case in your code.
For example:
Check whether there is only one character in your hash table, in which case building the tree returns only the root, without any leaves. For this case, you can assign a code to the root node in your encoding function, like 0.
The decoding function should handle this edge case too.
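A minimal sketch of that edge case in Python (heap-based code assignment; the single-symbol alphabet gets the arbitrary one-bit code 0):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        freq = Counter(text)
        # Edge case: one distinct symbol means a leaf-only "tree";
        # give that lone root an arbitrary code of '0'.
        if len(freq) == 1:
            return {next(iter(freq)): '0'}
        # Usual case: repeatedly merge the two lightest subtrees,
        # prefixing '0'/'1' to the codes of each side.
        heap = [(count, i, {sym: ''}) for i, (sym, count) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)  # keeps the dicts from ever being compared
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in left.items()}
            merged.update({s: '1' + c for s, c in right.items()})
            heapq.heappush(heap, (w1 + w2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    print(huffman_codes('xxxxxxxx'))  # {'x': '0'}
    print(huffman_codes('aaabbc'))    # e.g. {'a': '0', 'c': '10', 'b': '11'}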

Checking input grammar and deciding a result

Say I have a string "abacabacabadcdcdcd" and I want to apply a simple set of rules:
abaca->a
dcd->d
From left to right, such that the string ends up being "abad". This output will be used to make a decision: after the rules are applied, if the output string does not match a preset string such as "abad", the original string is discarded. E.g. every string should distill down to "abad"; kick it if it doesn't.
I have this hard-coded as regex right now, but there are many instances of these small rule sets. I am looking for something that will take a set of simple rules and compile it (or just a function?) into something I can feed the string to and get a result back. The rule sets are independent of each other.
The input is tightly controlled, and the rules in use will be simple. Speed is the most important aspect.
I've looked at Bison and ANTLR, but I don't think I need anything nearly that powerful...
What am I looking for?
Edit: I should mention that the strings are made up of a couple of letters, usually 5, i.e. "abcde". There are no spaces, etc.; just letters.
If it is going to be fast, you can start out with a map that contains your rules as key-value pairs of strings. You can then compile this map into a sort of state machine: a tree with char keys, where the associated value is either a replacement string or another tree.
You then go char by char through your string. Look up the current char in the tree. If you find another tree, look up the next character in that tree, etc.
At some point, either:
the lookup will fail, and then you know that the string you've seen so far is not the prefix of any rule. You can skip the current character and continue with the next.
or you get a replacement string. In that case, you replace the characters from the one where the match started through the last one you looked up, inclusive, with the replacement string.
The only difficulty is if the replacement can itself be part of a pattern to replace. Example:
ab -> e
cd -> b
The input: acd
acd -> ab (by rule 2)
ab -> e (by rule 1) ????
Now the question is whether you want to reconsider ab to give e.
If so, you must start over from the beginning after each replacement. In addition, it will be hard to tell whether the replacement process ever terminates, unless all of your rules have a right-hand side shorter than the left-hand side; in that case, a finite string gets reduced in a finite amount of time.
But if we don't need to reconsider, the algorithm above will go straight through the string.
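A minimal sketch of the scheme in Python, for illustration. Note that the question's own example actually needs the restart behavior: each abaca replacement exposes a new abaca at the front of the string, so a fixpoint loop (standing in for "start over from the beginning") is included; termination is safe here because every right-hand side is shorter than its pattern.

    def compile_rules(rules):
        """Build a char-keyed trie; the '' key marks a complete
        pattern and holds its replacement string."""
        root = {}
        for pattern, replacement in rules.items():
            node = root
            for ch in pattern:
                node = node.setdefault(ch, {})
            node[''] = replacement
        return root

    def rewrite_once(text, trie):
        """One left-to-right pass, replacing longest matches."""
        out, i = [], 0
        while i < len(text):
            node, j, match = trie, i, None
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if '' in node:
                    match = (j, node[''])  # longest complete rule so far
            if match:
                end, replacement = match
                out.append(replacement)
                i = end
            else:
                # Lookup failed: not the prefix of any rule, so emit
                # the current char and continue with the next.
                out.append(text[i])
                i += 1
        return ''.join(out)

    def reduce_string(text, trie):
        """Repeat passes until nothing changes (the 'reconsider' case)."""
        while True:
            reduced = rewrite_once(text, trie)
            if reduced == text:
                return text
            text = reduced

    trie = compile_rules({'abaca': 'a', 'dcd': 'd'})
    print(reduce_string('abacabacabadcdcdcd', trie))  # abad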
