I'm trying to extract the integers after mrp= and talktime=.
var i=0;
var recharge=[];
var recharge_text=[];
var recharge_String="";
var mrp="";
var talktime="";
var validity="";
var mode="";mrp='1100';
talktime='1200.00';
validity='NA';
mode='E-Recharge';
if(typeof String.prototype.trim !== 'function') {
String.prototype.trim = function() {
return this.replace(/^ +| +$/g, '');
}
}
mrp=mrp.trim();
if(isNaN(mrp))
{
recharge_text.push({MRP:mrp, Talktime:talktime, Validity:validity ,Mode:mode});
}
else
{
mrp=parseInt(mrp);
recharge.push({MRP:mrp, Talktime:talktime, Validity:validity ,Mode:mode});
}
mrp='2200';
talktime='2400.00';
I've extracted the above text from a webpage, but I do not know how to extract that particular part alone.
You can use regular expressions to parse a string and extract parts of it:
my_text = "blablabla" #just imagine that this is your text
regex_mrp = /mrp='(.+?)';/ #extracts whatever is between single quotes after mrp
regex_talktime = /talktime='(.+?)';/ #extracts whatever is between single quotes after talktime
mrp = my_text.match(regex_mrp)[1].to_i #gets the match, and converts to integer
talktime = my_text.match(regex_talktime)[1].to_f #gets the match, and converts to float
Here's a quick reference to the regular expressions syntax : https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
I'd do something like this:
string = <<EOT
var i=0;
var recharge=[];
var recharge_text=[];
var recharge_String="";
var mrp="";
var talktime="";
var validity="";
var mode="";mrp='1100';
talktime='1200.00';
validity='NA';
mode='E-Recharge';
if(typeof String.prototype.trim !== 'function') {
String.prototype.trim = function() {
return this.replace(/^ +| +$/g, '');
}
}
mrp=mrp.trim();
if(isNaN(mrp))
{
recharge_text.push({MRP:mrp, Talktime:talktime, Validity:validity ,Mode:mode});
}
else
{
mrp=parseInt(mrp);
recharge.push({MRP:mrp, Talktime:talktime, Validity:validity ,Mode:mode});
}
mrp='2200';
talktime='2400.00';
EOT
hits = string.scan(/(?:mrp|talktime)='[\d.]+'/)
# => ["mrp='1100'", "talktime='1200.00'", "mrp='2200'", "talktime='2400.00'"]
This gives us an array of hits using scan, where the pattern /(?:mrp|talktime)='[\d.]+'/ matched in the string. Figuring out how the pattern works is left as an exercise for the user, but Ruby's Regexp documentation explains it all.
Cleaning that up to be a bit more useful:
hash = hits.map{ |s|
str, val = s.split('=')
[str, val.delete("'")]
}.each_with_object(Hash.new { |h, k| h[k] = [] }){ |(str, val), h| h[str] << val }
You also need to read about each_with_object and what's happening with Hash.new as those are important concepts to learn in Ruby.
At this point, hash is a hash of arrays:
hash # => {"mrp"=>["1100", "2200"], "talktime"=>["1200.00", "2400.00"]}
You can easily extract a particular variable's values, and can correlate them if need be.
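Since the scraped text is itself JavaScript, here is a rough sketch of the same scan-and-group idea in JavaScript, for readers not using Ruby. The names source, collectAssignments and plans are made up for this illustration, and pairing values by position assumes every mrp is followed by its own talktime, which may not hold for every page.

```javascript
// Stand-in for the text scraped from the page (an assumption for this sketch).
const source = `
mrp='1100';
talktime='1200.00';
validity='NA';
mrp='2200';
talktime='2400.00';
`;

// Collect every numeric mrp/talktime assignment into { name: [values...] }.
function collectAssignments(text) {
  const pattern = /(mrp|talktime)='([\d.]+)'/g;
  const result = {};
  for (const [, name, value] of text.matchAll(pattern)) {
    if (!result[name]) result[name] = [];
    result[name].push(value);
  }
  return result;
}

const values = collectAssignments(source);

// Correlate the two lists by position and convert to numbers.
const plans = values.mrp.map((mrp, i) => ({
  mrp: parseInt(mrp, 10),
  talktime: parseFloat(values.talktime[i]),
}));

console.log(plans);
```

Non-numeric assignments such as validity='NA' are skipped because the character class only accepts digits and dots.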
What if I get a string instead of an integer next to the "=" sign?
...
string.scan(/(?:tariff)='[\p{Print}]+'/)
It's important to understand what the pattern is doing. The regular expression engine has some gotchas that can drastically affect the speed of a search, so indiscriminately throwing in things without understanding what they do can be very costly.
When using (?:...), you're creating a non-capturing group. When you're only matching one item, it's not necessary, nor is it particularly desirable, since it makes the engine do more work. The only time I'd use one is when I need to group alternatives, as in (?:mrp|talktime); since there's only one possible thing to match here, that's a moot point. So, your pattern should be reduced to:
/tariff='[\p{Print}]+'/
Which, when used, results in:
%(tariff='abcdef abc a').scan(/tariff='[\p{Print}]+'/)
# => ["tariff='abcdef abc a'"]
If you want to capture all non-empty occurrences of the string being assigned, it's easier than what you're doing. I'd use something like:
%(tariff='abcdef abc a').scan(/tariff='.+'/)
# => ["tariff='abcdef abc a'"]
%(tariff='abcdef abc a').scan(/tariff='[^']+'/)
# => ["tariff='abcdef abc a'"]
The second is more rigorous, and possibly safer, as it won't be tricked by a line that has multiple single quotes:
%(tariff='abcdef abc a', 'foo').scan(/tariff='.+'/)
# => ["tariff='abcdef abc a', 'foo'"]
%(tariff='abcdef abc a', 'foo').scan(/tariff='[^']+'/)
# => ["tariff='abcdef abc a'"]
Why that works is for you to figure out.
I'm working on a language similar to Ruby called gaiman, and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
      hello
        world
      END
the output should be:
"hello
  world"
I need this because this code doesn't look very nice:
def foo(arg) {
  if arg == "here" then
    return <<<END
xxx
  xxx
END
  end
end
this is a function where the user wants to return:
"xxx
  xxx"
I would prefer the code to look like this:
def foo(arg) {
  if arg == "here" then
    return <<<END
           xxx
             xxx
           END
  end
end
If I trim every line, the user won't be able to write a string with leading spaces when they want one. Does anyone know if PEG.js allows this?
I don't have any code yet for heredocs, just want to be sure if something that I want is possible.
EDIT:
So I've tried to implement heredocs and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
return text.join('');
}
It says that the marker is not defined. As for trimming, I think I can use the location() function.
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
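The scanning and trimming steps described above can be sketched in plain JavaScript. This is only an illustration under stated assumptions: scanHeredoc is a hypothetical helper, not part of PEG.js or any other parser generator, it treats a tab as a single column, and it takes the character offset of the <<< token as input.

```javascript
// Hypothetical helper: scan a heredoc starting at the "<<<" found at index `start`.
// Returns the trimmed text and the index just past the end-delimiter.
function scanHeredoc(source, start) {
  // 0-based column of "<<<" on its line (tabs count as one column here).
  const lineStart = source.lastIndexOf('\n', start - 1) + 1;
  const indent = start - lineStart;

  const headerEnd = source.indexOf('\n', start);
  if (headerEnd === -1) throw new Error('Unterminated heredoc');
  const marker = source.slice(start + 3, headerEnd).trim();
  if (marker === '') throw new Error('Missing heredoc marker');

  const lines = [];
  let pos = headerEnd + 1;
  while (pos <= source.length) {
    let lineEnd = source.indexOf('\n', pos);
    if (lineEnd === -1) lineEnd = source.length;
    const line = source.slice(pos, lineEnd);

    // A line containing only the marker ends the literal.
    if (line.trim() === marker) {
      return { value: lines.join('\n'), end: lineEnd };
    }
    // Non-blank lines must be indented at least as far as "<<<".
    if (line.trim() !== '' && line.slice(0, indent).trim() !== '') {
      throw new Error('Heredoc line is indented less than "<<<"');
    }
    lines.push(line.slice(indent));
    pos = lineEnd + 1;
  }
  throw new Error('Unterminated heredoc, expected ' + marker);
}
```

Because the end-delimiter is compared as a string at run time, no back-reference is needed; the price is that this logic lives in hand-written code rather than in the grammar.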
Here is an implementation of heredocs in Peggy, the successor to PEG.js (which is no longer maintained). This code is based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
    &{ return begin === end; }
  / '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
  const loc = location();
  const min = loc.start.column - 1;
  const re = new RegExp(`\\s{${min}}`);
  return text.map(line => {
    return line[0].replace(re, '');
  }).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
  = [ \t\n\r]* { return []; }
EDIT: the grammar above didn't work when other code followed the heredoc; here is a better grammar:
{ let heredoc_begin = null; }
heredoc = "<<<" beginMarker "\n" text:content endMarker {
  const loc = location();
  const min = loc.start.column - 1;
  const re = new RegExp(`^\\s{${min}}`, 'mg');
  return {
    type: 'Literal',
    value: text.replace(re, '')
  };
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*
I have stumbled upon this line of code and I am not sure what the [ ? ] part represents (my guess is it's a sort of a wildcard but I searched it for a while and couldn't find anything):
['?'] = function() return is_canadian and "eh" or "" end
I understand that RHS is a functional ternary operator. I am curious about the LHS and what it actually is.
Edit: reference (2nd example):
http://lua-users.org/wiki/SwitchStatement
Actually, it is quite simple.
local t = {
a = "aah",
b = "bee",
c = "see",
It maps each letter to its pronunciation. Here, a is pronounced aah, b is pronounced bee, and so on. Some letters are pronounced differently in American English and Canadian English, so not every letter can be mapped to a single sound.
z = function() return is_canadian and "zed" or "zee" end,
['?'] = function() return is_canadian and "eh" or "" end
In the mapping, the letter z and the character ? are pronounced differently in American English and Canadian English, so their values are functions. When the program needs the pronunciation of z, it calls the function, which checks is_canadian and returns either zed or zee; for '?', the function returns either "eh" or an empty string.
Finally, the 2 following notations have the same meaning:
local t1 = {
a = "aah",
b = "bee",
["?"] = "bee"
}
local t2 = {
["a"] = "aah",
["b"] = "bee",
["?"] = "bee"
}
If you look closely at the code linked in the question, you'll see that this line is part of a table constructor (the part inside {}). It is not a full statement on its own. As mentioned in the comments, it would be a syntax error outside of a table constructor. ['?'] is simply a string key.
The other posts already explained what that code does, so let me explain why it needs to be written that way.
['?'] = function() return is_canadian and "eh" or "" end is embedded in {}
It is part of a table constructor and assigns a function value to the string key '?'
local tbl = {a = 1} is syntactic sugar for local tbl = {['a'] = 1} or
local tbl = {}
tbl['a'] = 1
String keys that allow that convenient syntax must follow Lua's lexical conventions and hence may only contain letters, digits and underscore. They must not start with a digit.
So local a = {? = 1} is not possible. It will cause the syntax error "unexpected symbol near '?'". Therefore you have to explicitly provide a string value in square brackets, as in local a = {['?'] = 1}
They gave each table element its own line
local a = {
1,
2,
3
}
This greatly improves readability for long table elements or very long tables and allows you to maintain a maximum line length.
You'll agree that
local tbl = {
z = function() return is_canadian and "zed" or "zee" end,
['?'] = function() return is_canadian and "eh" or "" end
}
looks a lot cleaner than
local tbl = {z = function() return is_canadian and "zed" or "zee" end,['?'] = function() return is_canadian and "eh" or "" end}
I have a program that grabs a list of peripheral types, matches them to see if they're a valid type, and then executes type-specific code if they are valid.
However, some of the types share part of their name, differing only in their tier. I want to match these to the base type listed in a table of valid peripherals, but I can't figure out how to use a pattern to match them without the pattern returning nil for everything that doesn't match.
Here is the code to demonstrate my problem:
connectedPeripherals = {
[1] = "tile_thermalexpansion_cell_basic_name",
[2] = "modem",
[3] = "BigReactors-Turbine",
[4] = "tile_thermalexpansion_cell_resonant_name",
[5] = "monitor",
[6] = "tile_thermalexpansion_cell_hardened_name",
[7] = "tile_thermalexpansion_cell_reinforced_name",
[8] = "tile_blockcapacitorbank_name"
}
validPeripherals = {
["tile_thermalexpansion_cell"]=true,
["tile_blockcapacitorbank_name"]=true,
["monitor"]=true,
["BigReactors-Turbine"]=true,
["BigReactors-Reactor"]=true
}
for i = 1, #connectedPeripherals do
  local periFunctions = {
    ["tile_thermalexpansion_cell"] = function()
      --content
    end,
    ["tile_blockcapacitorbank_name"] = function()
      --content
    end,
    ["monitor"] = function()
      --content
    end,
    ["BigReactors-Turbine"] = function()
      --content
    end,
    ["BigReactors-Reactor"] = function()
      --content
    end
  }
  if validPeripherals[connectedPeripherals[i]] then periFunctions[connectedPeripherals[i]]() end
end
If I try to run it like that, all of the thermalexpansioncells aren't recognized as valid peripherals, and if I add a pattern matching statement, it works for the thermalexpansioncells, but returns nil for everything else and causes an exception.
How do I do a match statement that only returns a shortened string for things that match and returns the original string for things that don't?
Is this possible?
If the short version doesn't contain any of the special characters from Lua patterns you can use the following:
long = "tile_thermalexpansion_cell_basic_name"
result = long:match("tile_thermalexpansion_cell") or long
print(result) -- prints the shorter version
result = long:match("foo") or long
print(result) -- prints the long version
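Based on the comment, you can also use string.find to check whether a peripheral name contains one of the valid types:
-- periFunctions is the table of handlers from the question; define it once, before this loop
for i, v in ipairs(connectedPeripherals) do
  local valid = CheckValidity(v)
  if valid then periFunctions[valid]() end
end
where CheckValidity returns the matching key from validPeripherals, or false if there is none:
function CheckValidity( name )
  for n, b in pairs(validPeripherals) do
    -- plain find (4th argument true), so the "-" in "BigReactors-Turbine" is not treated as a pattern character
    if name:find( n, 1, true ) then return n end
  end
  return false
end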
Hi I've got this function in JavaScript:
function blur(data) {
var trimdata = trim(data);
var dataSplit = trimdata.split(" ");
var lastWord = dataSplit.pop();
var toBlur = dataSplit.join(" ");
}
What this does is take a string such as "Hello my name is bob" and return
toBlur = "Hello my name is" and lastWord = "bob"
Is there a way I can rewrite this in Lua?
You could use Lua's pattern matching facilities:
function blur(data)
  return string.match(data, "^(.*)[ ][^ ]*$")
end
How does the pattern work?
^     # start matching at the beginning of the string
(     # open a capturing group ... what is matched inside will be returned
.*    # as many arbitrary characters as possible
)     # end of capturing group
[ ]   # a single literal space (you could omit the square brackets, but I think
      # they increase readability)
[^ ]* # match anything BUT literal spaces... as many as possible
$     # marks the end of the input string
So [ ][^ ]*$ has to match the last word and the preceding space. Therefore, (.*) will return everything in front of it.
For a more direct translation of your JavaScript, first note that there is no split function in Lua. There is table.concat though, which works like join. Since you have to do the splitting manually, you'll probably use a pattern again:
function blur(data)
  local words = {}
  for m in string.gmatch(data, "[^ ]+") do
    words[#words+1] = m
  end
  words[#words] = nil -- pops the last word
  return table.concat(words, " ")
end
gmatch does not give you a table right away, but an iterator over all matches instead. So you add them to your own temporary table, and call concat on that. words[#words+1] = ... is a Lua idiom to append an element to the end of an array.
I have to parse a document containing groups of variable-value-pairs which is serialized to a string e.g. like this:
4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^
Here are the different elements:
Group IDs: 4 and 1
Length of the string representation of each group: 26 and 14
One of the groups: 1^14^VAR1^6^VALUE1^^
Variables: VAR1 and VAR2 (in group 4), VAR1 (in group 1)
Length of the string representation of the values: 6, 4 and 6
The values themselves: VALUE1, VAL2 and VALUE1
Variables consist only of alphanumeric characters.
No assumption is made about the values, i.e. they may contain any character, including ^.
Is there a name for this kind of grammar? Is there a parsing library that can handle this mess?
So far I am using my own parser, but due to the fact that I need to detect and handle corrupt serializations the code looks rather messy, thus my question for a parser library that could lift the burden.
The simplest way to approach it is to note that there are two nested levels that work the same way. The pattern is extremely simple:
id^length^content^
At the outer level, this produces a set of groups. Within each group, the content follows exactly the same pattern, only here the id is the variable name, and the content is the variable value.
So you only need to write that logic once and you can use it to parse both levels. Just write a function that breaks a string up into a list of id/content pairs. Call it once to get the groups, and then loop through them calling it again for each content to get the variables in that group.
Breaking it down into these steps, first we need a way to get "tokens" from the string. This function returns an object with three methods, to find out if we're at "end of file", and to grab the next delimited or counted substring:
var tokens = function(str) {
  var pos = 0;
  return {
    eof: function() {
      return pos == str.length;
    },
    delimited: function(d) {
      var end = str.indexOf(d, pos);
      if (end == -1) {
        throw new Error('Expected delimiter');
      }
      var result = str.substr(pos, end - pos);
      pos = end + d.length;
      return result;
    },
    counted: function(c) {
      var result = str.substr(pos, c);
      pos += c;
      return result;
    }
  };
};
Now we can conveniently write the reusable parse function:
var parse = function(str) {
  var parts = {};
  var t = tokens(str);
  while (!t.eof()) {
    var id = t.delimited('^');
    var len = t.delimited('^');
    var content = t.counted(parseInt(len, 10));
    var end = t.counted(1);
    if (end !== '^') {
      throw new Error('Expected ^ after counted string, instead found: ' + end);
    }
    parts[id] = content;
  }
  return parts;
};
It builds an object where the keys are the IDs (or variable names). I'm assuming that, as they have names, the order isn't significant.
Then we can use that at both levels to create the function to do the whole job:
var parseGroups = function(str) {
  var groups = parse(str);
  Object.keys(groups).forEach(function(id) {
    groups[id] = parse(groups[id]);
  });
  return groups;
};
For your example, it produces this object:
{
  '1': {
    VAR1: 'VALUE1'
  },
  '4': {
    VAR1: 'VALUE1',
    VAR2: 'VAL2'
  }
}
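Since the question is specifically about detecting corrupt serializations, here is a sketch of a stricter, self-contained variant of the parse step. parseStrict is a hypothetical name, and the checks it makes (the length must be all digits, the content must not run past the end of the input, and a ^ must follow the content) are assumptions about what "corrupt" means for this format.

```javascript
// Stricter variant of the id^length^content^ parser: throws on malformed input.
function parseStrict(str) {
  const parts = {};
  let pos = 0;
  while (pos < str.length) {
    const idEnd = str.indexOf('^', pos);
    if (idEnd === -1) throw new Error('Missing ^ after id at offset ' + pos);
    const id = str.slice(pos, idEnd);

    const lenEnd = str.indexOf('^', idEnd + 1);
    if (lenEnd === -1) throw new Error('Missing ^ after length of "' + id + '"');
    const lenText = str.slice(idEnd + 1, lenEnd);
    if (!/^\d+$/.test(lenText)) {
      throw new Error('Invalid length "' + lenText + '" for "' + id + '"');
    }

    const contentStart = lenEnd + 1;
    const contentEnd = contentStart + Number(lenText);
    // The terminating ^ must sit at contentEnd, so it has to be inside the string.
    if (contentEnd >= str.length) {
      throw new Error('Content of "' + id + '" is truncated');
    }
    if (str[contentEnd] !== '^') {
      throw new Error('Expected ^ after content of "' + id + '"');
    }

    parts[id] = str.slice(contentStart, contentEnd);
    pos = contentEnd + 1;
  }
  return parts;
}
```

It can be applied at both levels in the same way as parse, once for the groups and once per group for the variables.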
I don't think it's a trivial task to create a grammar for this. On the other hand, a simple, straightforward approach is not that hard. You know the string length for every critical string, so you just chop the string apart according to those lengths.
Where do you see problems?