How to remove spaces from a string in Lua? - lua

I want to remove all spaces from a string in Lua. This is what I have tried:
string.gsub(str, "", "")
string.gsub(str, "% ", "")
string.gsub(str, "%s*", "")
This does not seem to work. How can I remove all of the spaces?

It works; you just have to assign the return value, because gsub returns a new string rather than modifying str in place. Use one of the following variations:
str = str:gsub("%s+", "")
str = string.gsub(str, "%s+", "")
I use %s+ because there is no point in replacing an empty match (i.e. a position with no space), so I look for at least one whitespace character using the + quantifier. Note that %s matches any whitespace character (tabs and newlines included), not just the space character.

To trim only leading and trailing whitespace, you can use the following function:
function all_trim(s)
return s:match"^%s*(.*)":match"(.-)%s*$"
end
Or, shorter:
function all_trim(s)
return s:match( "^%s*(.-)%s*$" )
end
usage:
str=" aa "
print(all_trim(str) .. "e")
Output is:
aae

The fastest way is to use trim.so compiled from trim.c:
/* trim.c - based on http://lua-users.org/lists/lua-l/2009-12/msg00951.html
from Sean Conner */
#include <stddef.h>
#include <ctype.h>
#include <lua.h>
#include <lauxlib.h>
int trim(lua_State *L)
{
const char *front;
const char *end;
size_t size;
front = luaL_checklstring(L,1,&size);
end = &front[size - 1];
for ( ; size && isspace((unsigned char)*front) ; size-- , front++)
;
for ( ; size && isspace((unsigned char)*end) ; size-- , end--)
;
lua_pushlstring(L,front,(size_t)(end - front) + 1);
return 1;
}
int luaopen_trim(lua_State *L)
{
lua_register(L,"trim",trim);
return 0;
}
Compile with something like:
gcc -shared -fpic -O -I/usr/local/include/luajit-2.1 trim.c -o trim.so
More detailed (with comparison to the other methods): http://lua-users.org/wiki/StringTrim
Usage:
require("trim") -- at the beginning of the file; registers the global function trim
local tr = trim(" a z z z z z ") -- anywhere in the code

For LuaJIT all methods from Lua wiki (except for, possibly, native C/C++) were awfully slow in my tests. This implementation showed the best performance:
function trim (str)
if str == '' then
return str
else
local startPos = 1
local endPos = #str
while (startPos <= endPos and str:byte(startPos) <= 32) do
startPos = startPos + 1
end
if startPos > endPos then -- the string contains only whitespace
return ''
else
while (endPos > 0 and str:byte(endPos) <= 32) do
endPos = endPos - 1
end
return str:sub(startPos, endPos)
end
end
end -- .function trim

If anyone is looking to remove all spaces from a bunch of strings, including spaces in the middle of the string, this works for me:
function noSpace(str)
local normalisedString = string.gsub(str, "%s+", "")
return normalisedString
end
test = "te st"
print(noSpace(test))
Might be that there is an easier way though, I'm no expert!

Related

Histogram shows incorrect numbers

I'm trying to make a toString method that prints out a histogram showing how often each letter of the alphabet is used in a string. The most frequent letter has to be 60 #s long, with the rest of the letters scaled to match.
My issue is with the equation that scales the rest of the letters to the correct length for the histogram. My current equation is (myArray[i]/max) * 60, but I'm getting really weird results.
If I put in "hello world" to be analyzed, L would be the most commonly occurring letter, seen 3 times. So L should have 60 #s in the histogram, h should have 20, o should have 40, etc. Instead I'm getting results like:
d : 10
e : 10
h : 10
l : 360
o : 20
r : 10
w : 10
Sorry for how sloppy this is right now, I'm just trying to figure out what's going on.
public class LetterCounter {
private static int[] alphabetArray;
private static String input;
/**
* Constructor for objects of class LetterCounter
*/
public LetterCounter()
{
alphabetArray = new int[26];
}
public void countLetters(String input) {
this.input = input;
this.input.toLowerCase();
//String s= input;
//s.toLowerCase();
for ( int i = 0; i < input.length(); i++ ) {
char ch= input.charAt(i);
if (ch >= 97 && ch <= 122){
alphabetArray[ch-'a']++;
}
}
}
public void getTotalCount() {
for (int i = 0; i < alphabetArray.length; i++) {
if(alphabetArray[i]>=0){
char ch = (char) (i+97);
System.out.println(ch +" : "+alphabetArray[i]);
}
}
}
public void reset() {
for (int i =0; i<alphabetArray.length; i++) {
if(alphabetArray[i]>=0){
alphabetArray[i]=0;
char ch = (char) (i+97);
System.out.println(ch +" : "+alphabetArray[i]);
}
}
}
public String toString() {
String s = "";
int max = alphabetArray[0];
int markCounter = 0;
for(int i =0; i<alphabetArray.length; i++) {
//finds the largest number of occurences for any letter in the string
if(alphabetArray[i] > max) {
max = alphabetArray[i];
}
}
for(int i =0; i<alphabetArray.length; i++) {
//trying to scale the rest of the characters down here
if(alphabetArray[i] > 0) {
markCounter = (alphabetArray[i] / max) * 60;
char ch = (char) (i+97);
System.out.println(ch +" : "+alphabetArray[i] + markCounter);
}
}
for (int i = 0; i < alphabetArray.length; i++) {
//prints the whole alphabet, total number of occurences for all chars
if(alphabetArray[i]>=0){
char ch = (char) (i+97);
System.out.println(ch +" : "+alphabetArray[i]);
}
}
return s;
}
}
There are many, many problems with your code, but let's go one by one.
First of all, your print statement is simply misleading. Change it to
System.out.println(ch +" : "+alphabetArray[i] + " " + markCounter);
and you will see
d : 1 0
e : 1 0
h : 1 0
l : 3 60
o : 2 0
r : 1 0
w : 1 0
As you can see, the counters are correct (1, 1, 1, 3, 2, 1, 1), but your scaling doesn't work because of integer division:
1 / 3 --> 0 ... and 0 * 60 ... is still 0
3 / 3 --> 1 ... and 1 * 60 ... is 60
Without a space between the two numbers, 1 and 0 print as 10, and 3 and 60 print as 360.
To get correct scaling, multiply before dividing:
markCounter = alphabetArray[i] * 60 / max;
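To make the truncation visible, here is a minimal sketch in Python rather than Java. The `//` operator stands in for Java's int division, and the counts are the ones printed above for "hello world". The function names are made up for this illustration.

```python
# Letter counts for "hello world", as printed in the answer above.
counts = {"d": 1, "e": 1, "h": 1, "l": 3, "o": 2, "r": 1, "w": 1}
BAR_WIDTH = 60                  # length of the longest bar
largest = max(counts.values())  # 3, the count for 'l'

def scale_divide_first(count):
    # Mirrors (count / max) * 60 with Java ints: the quotient is truncated first.
    return (count // largest) * BAR_WIDTH

def scale_multiply_first(count):
    # Mirrors count * 60 / max: multiply first, truncate once at the end.
    return count * BAR_WIDTH // largest

for letter in sorted(counts):
    c = counts[letter]
    print(letter, c, scale_divide_first(c), scale_multiply_first(c))
```

Dividing first gives 0 for every letter except the most frequent one, while multiplying first gives the 20, 40 and 60 the question expects.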
Other things worth mentioning:
You are overriding toString(), so you should put @Override in front of that method.
toLowerCase() returns a new string in lower case; calling it without assigning the result back to your string just throws away the lower-casing.
toString() shouldn't print to the console. The whole idea is that you put all the information into the string that you return. In other words: in the end you do something like System.out.println(someLetterCounter.toString()).
Your code is extremely low-level. You don't have to iterate over arrays with an index-based for (int i ...) loop; you can use for (int letter : alphabetArray) instead.
You might want to read about Map. If you used a Map<Character, Integer>, where the map key represents a character and the map value is the counter for that character, you could throw out most of your code and come up with a solution that needs only a few lines!
( and seriously: because of all these issues, debugging your code was really much harder than it needed to be )
countLetters seems to have some issues. You cannot convert a String to lowercase by just calling
this.input.toLowerCase();
because String is immutable in Java. You have to assign the result, like:
this.input = input.toLowerCase();
Another problem is that you are using the input parameter instead of this.input, which holds the lower-case string. You can make the countLetters method work this way:
public void countLetters(String input) {
this.input = input.toLowerCase();
for ( int i = 0; i < this.input.length(); i++ ) {
char ch= this.input.charAt(i);
if (ch >= 97 && ch <= 122) {
alphabetArray[ch-'a']++;
}
}
}

associative arrays in awk challenging memory limits

This is related to my recent post in Awk code with associative arrays -- array doesn't seem populated, but no error and also to optimizing loop, passing parameters from external file, naming array arguments within awk
My basic problem here is simply to compute from detailed ancient archival financial market data, daily aggregates of #transactions, #shares, value, BY DATE, FIRM-ID, EXCHANGE, etc. Learnt to use associative arrays in awk for this, and was thrilled to be able to process 129+ million lines in clock time of under 11 minutes. Literally before I finished my coffee.
Became a little more ambitious, and moved from 2 array subscripts to 4, and now I am unable to process more than 6500 lines at a time.
Get error messages of the form:
K:\User Folders\KRISHNANM\PAPERS\FII_Transaction_Data>zcat
RAW_DATA\2003_1.zip | gawk -f CODE\FII_daily_aggregates_v2.awk >
OUTPUT\2003_1.txt&
gawk: CODE\FII_daily_aggregates_v2.awk:33: (FILENAME=- FNR=49300)
fatal: more_nodes: nextfree: can't allocate memory (Not enough space)
On some runs the machine has told me it lacks as little as 52 KB of memory. I have what I think of as a standard configuration with Win-7 and 8MB RAM.
(Economist by training, not computer scientist.) I realize that going from 2 to 4 arrays makes the problem computationally much more complex for the computer, but is there something one can do to improve memory management at least a little bit. I have tried closing everything else I am doing. The error always has to do only with memory, never with disk space or anything else.
Sample INPUT:
49290,C198962542782200306,6/30/2003,433581,F5811773991200306,S5405611832200306,B5086397478200306,NESTLE INDIA LTD.,INE239A01016,6/27/2003,1,E9035083824200306,REG_DL_STLD_02,591.13,5655,3342840.15,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49291,C198962542782200306,6/30/2003,433563,F6292896459200306,S6344227311200306,B6110521493200306,GRASIM INDUSTRIES LTD.,INE047A01013,6/27/2003,1,E9035083824200306,REG_DL_STLD_02,495.33,3700,1832721,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49292,C198962542782200306,6/30/2003,433681,F6513202607200306,S1724027402200306,B6372023178200306,HDFC BANK LTD,INE040A01018,6/26/2003,1,E745964372424200306,REG_DL_STLD_02,242,2600,629200,REG_DL_INSTR_EQ,REG_DL_DLAY_D,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49293,C7885768925200306,6/30/2003,48128,F4406661052200306,S7376401565200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,3,E912851176274200306,REG_DL_STLD_04,125,44600,5575000,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49294,C7885768925200306,6/30/2003,48129,F4500260787200306,S1312094035200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,4,E912851176274200306,REG_DL_STLD_04,125,445600,55700000,REG_DL_INSTR_EQ,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
49295,C7885768925200306,6/30/2003,48130,F6425024637200306,S2872499118200306,B4576522576200306,Maruti Udyog Limited,INE585B01010,6/28/2003,3,E912851176274200306,REG_DL_STLD_04,125,48000,6000000,REG_DL_INSTR_EU,REG_DL_DLAY_P,DL_RPT_TYPE_N,DL_AMDMNT_DEL_00
Code
BEGIN { FS = "," }
# For each array subscript variable -- DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5), after checking for type = EQ, set up counts for each value, and number of unique values.
( $17~/_EQ\>/ ) { if (date[$10]++ == 0) date_list[d++] = $10;
if (isin[$9]++ == 0) isin_list[i++] = $9;
if (exch[$12]++ == 0) exch_list[e++] = $12;
if (fii[$5]++ == 0) fii_list[f++] = $5;
}
# For cash-in, buy (B), or cash-out, sell (S) count NR = no of records, SH = no of shares, RV = rupee-value.
(( $17~/_EQ\>/ ) && ( $11~/1|2|3|5|9|1[24]/ )) {{ ++BNR[$10,$9,$12,$5]} {BSH[$10,$9,$12,$5] += $15} {BRV[$10,$9,$12,$5] += $16} }
(( $17~/_EQ\>/ ) && ( $11~/4|1[13]/ )) {{ ++SNR[$10,$9,$12,$5]} {SSH[$10,$9,$12,$5] += $15} {SRV[$10,$9,$12,$5] += $16} }
END {
{ print NR, "records processed."}
{ print " " }
{ printf("%-11s\t%-13s\t%-20s\t%-19s\t%-7s\t%-7s\t%-14s\t%-14s\t%-18s\t%-18s\n", \
"DATE", "ISIN", "EXCH", "FII", "BNR", "SNR", "BSH", "SSH", "BRV", "SRV") }
{ for (u = 0; u < d; u++)
{
for (v = 0; v < i; v++)
{
for (w = 0; w < e; w++)
{
for (x = 0; x < f; x++)
#check first below for records with zeroes, don't print them
{ if (BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] + SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] > 0)
{ BR = BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
SR = SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
BS = BSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
BV = BRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
SS = SSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
SV = SRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
{ printf("%-11s\t%13s\t%20s\t%19s\t%7d\t%7d\t%14d\t%14d\t%18.2f\t%18.2f\n", \
date_list[u], isin_list[v], exch_list[w], fii_list[x], BR, SR, BS, SS, BV, SV) } }
}
}
}
}
}
}
Expected output
6 records processed.
DATE ISIN EXCH FII BNR SNR BSH SSH BRV SRV
6/27/2003 INE239A01016 E9035083824200306 F5811773991200306 1 0 5655 0 3342840.15 0.00
6/27/2003 INE047A01013 E9035083824200306 F6292896459200306 1 0 3700 0 1832721.00 0.00
6/26/2003 INE040A01018 E745964372424200306 F6513202607200306 1 0 2600 0 629200.00 0.00
6/28/2003 INE585B01010 E912851176274200306 F4406661052200306 1 0 44600 0 5575000.00 0.00
6/28/2003 INE585B01010 E912851176274200306 F4500260787200306 0 1 0 445600 0.00 55700000.00
It is in this case that as the number of input records exceeds 6500, I end up having memory problems. Have about 7 million records in all.
For a 2 array subscript problem, albeit on a different data set, where 129+ million lines were processed in clock time of 11 minutes using the same GNU-AWK on the same machine, see optimizing loop, passing parameters from external file, naming array arguments within awk
Question: is it the case that awk is not very smart with memory management, but that some other more modern tools (say, SQL) would accomplish this task with the same memory resources? Or is this simply a characteristic of associative arrays, which I found magical in enabling me to avoid many passes over the data, many loops and SORT procedures, but which maybe work well up to 2 array subscripts, and then face exponential memory resource costs after that?
Afterword: the super-detailed almost-idiot-proof tutorial along with the code provided by Ed Morton in comments below makes a dramatic difference, especially his GAWK script tst.awk. He taught me about (a) using SUBSEP intelligently (b) tackling needless looping, which is crucial in this problem which tends to have very sparse arrays, with various AWK constructs. Compared to performance with my old code (only up to 6500 lines of input accepted on one machine, another couldn't even get that far), the performance of Ed Morton's tst.awk can be seen from the table below:
filename  start        end          min   in ln   out lines
2008_1    12:08:40 AM  12:27:18 AM  0:18  391438  301160
2008_2    12:27:18 AM  12:52:04 AM  0:24  402016  314177
2009_1    12:52:05 AM  1:05:15 AM   0:13  302081  238204
2009_2    1:05:15 AM   1:22:15 AM   0:17  360072  276768
2010_1    "slept"                         507496  397533
2010_2    3:10:26 AM   3:10:50 AM   0:00  76200   58228
2010_3    3:10:50 AM   3:11:18 AM   0:00  80988   61725
2010_4    3:11:18 AM   3:11:47 AM   0:00  86923   65885
2010_5    3:11:47 AM   3:12:15 AM   0:00  80670   63059
Times were obtained simply from using %time% on lines before and after tst.awk was executed, all put in a simple batch script, "min" is the clock time taken (per whatever rounding EXCEL does by default), "in ln" and "out lines" are lines of input and output, respectively. From processing the entire data that we have, from Jan 2003 to Jan 2014, we find the theoretical max number of output records = #dates*#ISINs*#Exchanges*#FIIs = 2992*2955*567*82268, while the actual number of total output lines is only 5,261,942, which is only 1.275*10^(-8) of the theoretical max -- very sparse indeed. That there was sparseness, we did guess earlier, but that the arrays could be SO sparse -- which matters a lot for memory management -- we had no way of telling till something actually completed, for a real data set. Time taken seems to increase exponentially in input size, but within limits that pose no practical difficulty. Thanks a ton, Ed.
There is no problem with associative arrays in general. In awk (except gawk for true 2D arrays) an associative array with 4 subscripts is identical to one with 2 subscripts since in reality it only has one subscript which is the concatenation of each of the pseudo-subscripts separated by SUBSEP.
Given that you say "I am unable to process more than 6500 lines at a time", the problem is far more likely to be in the way you wrote your code than in any fundamental awk issue. If you'd like more help, post a small script with sample input and expected output that demonstrates your problem and attempted solution, and we can see whether we have suggestions for improving its memory usage.
Given your posted script, I expect the problem is with those nested loops in your END section. When you do:
for (i=1; i<=maxI; i++) {
for (j=1; j<=maxJ; j++) {
if ( arr[i,j] != 0 ) {
print arr[i,j]
}
}
}
you are CREATING arr[i,j] for every possible combination of i and j that didn't exist prior to the loop just by testing for arr[i,j] != 0. If you instead wrote:
for (i=1; i<=maxI; i++) {
for (j=1; j<=maxJ; j++) {
if ( (i,j) in arr ) {
print arr[i,j]
}
}
}
then the loop itself would not create new entries in arr[].
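The same effect can be reproduced outside awk. As an analogy only (this is Python, not awk), a `collections.defaultdict` also creates an entry the moment a missing key is read, while a membership test does not. The variable names below are made up for this sketch.

```python
from collections import defaultdict

# Table that is read with arr[key]: every lookup of a missing key inserts it,
# much like testing arr[i,j] != 0 does in awk.
looked_up = defaultdict(int)
looked_up[(1, 1)] = 5  # the only entry that was really stored

for i in range(1, 4):
    for j in range(1, 4):
        if looked_up[(i, j)] != 0:
            pass  # a report line would be printed here
size_after_lookup = len(looked_up)

# Same data, but tested with "in": nothing new is created.
tested = defaultdict(int)
tested[(1, 1)] = 5

for i in range(1, 4):
    for j in range(1, 4):
        if (i, j) in tested:
            pass  # a report line would be printed here
size_after_membership = len(tested)

print(size_after_lookup, size_after_membership)  # 9 1
```

With four nested loops over thousands of distinct values each, the first variant tries to create one entry per combination, which is where the memory goes.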
So change this block:
if (BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] + SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]] > 0)
{
BR = BNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
SR = SNR[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
BS = BSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
BV = BRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
SS = SSH[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
SV = SRV[date_list[u],isin_list[v],exch_list[w],fii_list[x]]
which is probably unnecessarily turning each of BNR, SNR, BSH, BRV, SSH, and SRV into huge but highly sparse arrays, to something like this:
idx = date_list[u] SUBSEP isin_list[v] SUBSEP exch_list[w] SUBSEP fii_list[x]
BR = (idx in BNR ? BNR[idx] : 0)
SR = (idx in SNR ? SNR[idx] : 0)
if ( (BR + SR) > 0 )
{
BS = (idx in BSH ? BSH[idx] : 0)
BV = (idx in BRV ? BRV[idx] : 0)
SS = (idx in SSH ? SSH[idx] : 0)
SV = (idx in SRV ? SRV[idx] : 0)
and let us know if that helps. Also check your code for other places where you might be doing the same.
The reason you have this problem with 4 subscripts when you didn't with 2 is simply that you now have 4 levels of nesting in the loops, creating much larger and sparser arrays than when you just had 2.
Finally, you have some weird syntax in your script, some of which @MarkSetchell pointed out in a comment. Your script also isn't as efficient as it could be, since you're not using else statements and so you test multiple conditions that can't all be true, and you test the same condition repeatedly. It's also not robust, as you aren't anchoring your REs (e.g. you test /4|1[13]/ instead of /^(4|1[13])$/, so your 4 would match on 14 or 41 etc. instead of just 4 on its own). So change your whole script to this:
$ cat tst.awk
BEGIN { FS = "," }
# For each array subscript variable -- DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5), after checking for type = EQ, set up counts for each value, and number of unique values.
$17 ~ /_EQ\>/ {
if (!seenDate[$10]++) date_list[++d] = $10
if (!seenIsin[$9]++) isin_list[++i] = $9
if (!seenExch[$12]++) exch_list[++e] = $12
if (!seenFii[$5]++) fii_list[++f] = $5
# For cash-in, buy (B), or cash-out, sell (S) count NR = no of records, SH = no of shares, RV = rupee-value.
idx = $10 SUBSEP $9 SUBSEP $12 SUBSEP $5
if ( $11 ~ /^([12359]|1[24])$/ ) {
++BNR[idx]; BSH[idx] += $15; BRV[idx] += $16
}
else if ( $11 ~ /^(4|1[13])$/ ) {
++SNR[idx]; SSH[idx] += $15; SRV[idx] += $16
}
}
END {
print NR, "records processed."
print " "
printf "%-11s\t%-13s\t%-20s\t%-19s\t%-7s\t%-7s\t%-14s\t%-14s\t%-18s\t%-18s\n",
"DATE", "ISIN", "EXCH", "FII", "BNR", "SNR", "BSH", "SSH", "BRV", "SRV"
for (u = 1; u <= d; u++)
{
for (v = 1; v <= i; v++)
{
for (w = 1; w <= e; w++)
{
for (x = 1; x <= f; x++)
{
#check first below for records with zeroes, don't print them
idx = date_list[u] SUBSEP isin_list[v] SUBSEP exch_list[w] SUBSEP fii_list[x]
BR = (idx in BNR ? BNR[idx] : 0)
SR = (idx in SNR ? SNR[idx] : 0)
if ( (BR + SR) > 0 )
{
BS = (idx in BSH ? BSH[idx] : 0)
BV = (idx in BRV ? BRV[idx] : 0)
SS = (idx in SSH ? SSH[idx] : 0)
SV = (idx in SRV ? SRV[idx] : 0)
printf "%-11s\t%13s\t%20s\t%19s\t%7d\t%7d\t%14d\t%14d\t%18.2f\t%18.2f\n",
date_list[u], isin_list[v], exch_list[w], fii_list[x], BR, SR, BS, SS, BV, SV
}
}
}
}
}
}
I added seen in front of 4 array names just because, by convention, arrays that test for the pre-existence of a value are typically named seen. Also, when populating the SNR[] etc. arrays I created an idx variable first instead of repeatedly using the field numbers, both for ease of changing it in future and mostly because string concatenation is relatively slow in awk, and that's what's happening when you use multiple indices in an array, so it is best to do the string concatenation once, explicitly. And I changed your date_list[] etc. arrays to start at 1 instead of zero because all awk-generated arrays, strings and field numbers start at 1. You CAN create an array manually that starts at 0 or -357 or whatever number you want, but it'll save shooting yourself in the foot some day if you always start them at 1.
I expect it could be made more efficient still by restricting the nested loops to only values that could exist for the enclosing loop index combinations (e.g. not every value of u+v+w is possible so there will be times when you shouldn't bother looping on x). For example:
$ cat tst.awk
BEGIN { FS = "," }
# For each array subscript variable -- DATE ($10), firm_ISIN ($9), EXCHANGE ($12), and FII_ID ($5), after checking for type = EQ, set up counts for each value, and number of unique values.
$17 ~ /_EQ\>/ {
if (!seenDate[$10]++) date_list[++d] = $10
if (!seenIsin[$9]++) isin_list[++i] = $9
if (!seenExch[$12]++) exch_list[++e] = $12
if (!seenFii[$5]++) fii_list[++f] = $5
# For cash-in, buy (B), or cash-out, sell (S) count NR = no of records, SH = no of shares, RV = rupee-value.
idx = $10 SUBSEP $9 SUBSEP $12 SUBSEP $5
if ( $11 ~ /^([12359]|1[24])$/ ) {
seen[$10,$9]
seen[$10,$9,$12]
++BNR[idx]; BSH[idx] += $15; BRV[idx] += $16
}
else if ( $11 ~ /^(4|1[13])$/ ) {
seen[$10,$9]
seen[$10,$9,$12]
++SNR[idx]; SSH[idx] += $15; SRV[idx] += $16
}
}
END {
printf "d = %d\n", d | "cat>&2"
printf "i = %d\n", i | "cat>&2"
printf "e = %d\n", e | "cat>&2"
printf "f = %d\n", f | "cat>&2"
print NR, "records processed."
print " "
printf "%-11s\t%-13s\t%-20s\t%-19s\t%-7s\t%-7s\t%-14s\t%-14s\t%-18s\t%-18s\n",
"DATE", "ISIN", "EXCH", "FII", "BNR", "SNR", "BSH", "SSH", "BRV", "SRV"
for (u = 1; u <= d; u++)
{
date = date_list[u]
for (v = 1; v <= i; v++)
{
isin = isin_list[v]
if ( (date,isin) in seen )
{
for (w = 1; w <= e; w++)
{
exch = exch_list[w]
if ( (date,isin,exch) in seen )
{
for (x = 1; x <= f; x++)
{
fii = fii_list[x]
#check first below for records with zeroes, don't print them
idx = date SUBSEP isin SUBSEP exch SUBSEP fii
if ( (idx in BNR) || (idx in SNR) )
{
if (idx in BNR)
{
bnr = BNR[idx]
bsh = BSH[idx]
brv = BRV[idx]
}
else
{
bnr = bsh = brv = 0
}
if (idx in SNR)
{
snr = SNR[idx]
ssh = SSH[idx]
srv = SRV[idx]
}
else
{
snr = ssh = srv = 0
}
printf "%-11s\t%13s\t%20s\t%19s\t%7d\t%7d\t%14d\t%14d\t%18.2f\t%18.2f\n",
date, isin, exch, fii, bnr, snr, bsh, ssh, brv, srv
}
}
}
}
}
}
}
}
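Taking the same idea one step further, the report can be driven by the keys that were actually stored instead of by loops over every possible combination. The sketch below shows that in Python rather than awk. The two small tables are assumptions built from keys in the sample input, and `report_rows` is a name invented for this illustration. In awk the counterpart would be a for (idx in BNR) loop combined with split(idx, parts, SUBSEP), whose iteration order is unspecified.

```python
# Assumed aggregates keyed by (date, isin, exch, fii); values are record counts.
buy_counts = {
    ("6/27/2003", "INE239A01016", "E9035083824200306", "F5811773991200306"): 1,
}
sell_counts = {
    ("6/28/2003", "INE585B01010", "E912851176274200306", "F4500260787200306"): 1,
}

def report_rows(buys, sells):
    """Return one row per key present in either table, never the full cross product."""
    rows = []
    for key in sorted(set(buys) | set(sells)):
        # .get() reads a count without inserting a missing key.
        rows.append(key + (buys.get(key, 0), sells.get(key, 0)))
    return rows

rows = report_rows(buy_counts, sell_counts)
for row in rows:
    print("\t".join(str(field) for field in row))
```

The work done is proportional to the number of output lines, not to #dates * #ISINs * #exchanges * #FIIs. Note that rows come out sorted by key here, not in first-seen order as in the scripts above.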

Huffman's encoding and decoding

I have to build a compressor based on the Huffman algorithm.
So far, I managed to create the tree with the frequencies of each character and generate a representation with a smaller number of bits for each character.
It is something like this, for the phrase "good this sugarplum":
'o' 000, ' ' 001, 't' 0100, 'r' 0101, 'p' 0110, 'm' 0111, 'l' 1000, 'i' 1001, 'h' 1010, 'd' 1011, 'a' 1100, 'u' 1101, 'g' 1110, 's' 1111
The problem I'm having now is finding a way to save the tree in the archive, so I can rebuild it and then decompress the file.
Any suggestions?
I did some research but found it difficult to understand, so if you can explain in detail, I would appreciate it.
The code I used to read the frequencies from file is:
int main (int argc, char *argv[])
{
int i;
TipoSentinela *sentinela;
TipoLista *no = NULL;
Arv *arvore, *arvore2, *arvore3;
int *repete = (int *) calloc (256, sizeof(int));
if(argc == 2)
{
in = load_base(argv[1]);
le_dados_arquivo (repete); //read the frequencies from the file
sentinela = cria_lista (); //create a marker for the tree node list
for (i = 0; i < 256; i++)
{
if(repete[i] > 0 && i != 0)
{
arvore = arv_cria (Cria_info (i, repete[i])); //create a tree node with the character i and the frequence of it in the file
no = inicia_lista (arvore, no, sentinela); //create the list of tree nodes
}
}
Ordena (sentinela); //sort the tree nodes list by the frequencies
for(Seta_primeiro(sentinela); Tamanho_lista(sentinela) != 1; Move_marcador(sentinela))
{
Seta_primeiro(sentinela); //put the marker in the first element of the list
no = Retorna_marcador(sentinela);
arvore2 = Retorna_arvore (no); //return the tree represented by the list marker
Move_marcador(sentinela); //put the marker to the next element
arvore3 = Retorna_arvore (Retorna_marcador (sentinela)); //return the tree represented by the list marker
arvore = Cria_pai (arvore2, arvore3); //create a tree node that will contain the both arvore2 and arvore3
Insere_arvoreFinal (sentinela, arvore); //insert the node at the end of the list
Remove_arvore (sentinela); //remove the node arvore2 from the list
Remove_arvore (sentinela); //remove the node arvore3 from the lsit
Ordena (sentinela); //sort the list again
}
out = load_out(argv[1]); //open the output file
Codificacao (arvore); //generate the code from each node of the tree
rewind(in);
int c; //int, not char, so that EOF can be told apart from a valid byte
while((c = fgetc(in)) != EOF)
{
arvore2 = Procura_info (arvore, c); //search the character c in the tree
if(arvore2 != NULL)
imprimebit(Retorna_codigo(arvore2), out); //write the code in the file
}
fclose(in);
fclose(out);
free(repete);
arvore = arv_libera (arvore);
Libera_Lista(sentinela);
}
return 0;
}
//bit_counter and cur_byte are global variables
void write_bit (unsigned char bit, FILE *f)
{
static int k = 0;
if(k != 0)
{
if(++bit_counter == 8)
{
fwrite(&cur_byte,1,1,f);
bit_counter = 0;
cur_byte = 0;
}
}
k = 1;
cur_byte <<= 1;
cur_byte |= ('0' != bit);
}
//aux is the code of a character in the tree
void imprimebit(char *aux, FILE *f)
{
int i, j;
if(aux == NULL)
return;
for(i = 0; i < strlen(aux); i++)
{
write_bit(aux[i], f); //write the bits of the code in the file
}
}
With this, I can write the code of all characters in the output file, but I can't see a way to store the tree too.
You don't need to send the tree. Just send the code lengths, then use the same algorithm on both ends to convert the lengths to codes. A code built this way is called a "canonical" Huffman code. You sort the symbols by code length and, within each length, by symbol value. Then you assign codes in that order, starting at 0 and counting up, appending a zero bit each time the length increases. So you would end up with (_ means space):
_ 000
o 001
a 0100
d 0101
g 0110
h 0111
i 1000
l 1001
m 1010
p 1011
r 1100
s 1101
t 1110
u 1111
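The assignment can be sketched in a few lines. This is Python rather than C, the function name `canonical_codes` is made up for this illustration, and the code lengths are the ones from the question (3 bits for space and 'o', 4 bits for everything else).

```python
def canonical_codes(lengths):
    """Map each symbol to its canonical code string, given {symbol: code length}."""
    codes = {}
    code = 0
    previous_length = 0
    # Sort by code length first, then by symbol value within each length.
    for symbol, length in sorted(lengths.items(), key=lambda item: (item[1], item[0])):
        code <<= length - previous_length  # append zero bits when the length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        previous_length = length
    return codes

# Code lengths taken from the table in the question.
lengths = {" ": 3, "o": 3}
lengths.update({symbol: 4 for symbol in "trpmlihdaugs"})

codes = canonical_codes(lengths)
for symbol in sorted(codes, key=lambda s: (len(codes[s]), s)):
    print("_" if symbol == " " else symbol, codes[symbol])
```

Because the decoder can run the same function, the archive only needs to store one length per symbol instead of the whole tree.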
I did find a way to store the code of each character.
For example:
I write the tree, starting by the root and going down to the left, then right.
So, if my tree was something like
0
/ \
0 1
/ \ / \
'a' 'b' 'c' 'd'
The header of my file would be something like this:
001[8 bits from 'a']1[8 bits from b]01[8 bits from c]1[8 bits from d]
With this, I would be able to rebuild my tree.
My problem now is reading the header of the file bit by bit, to know in which direction I have to create a new node.
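One way to read such a header is recursively: a 0 bit means "internal node, read the left subtree and then the right subtree", and a 1 bit means "leaf, the next 8 bits are the character". The sketch below is Python, not the C code from the question. All names are invented for the illustration. The tree is modelled as nested tuples with one-character strings as leaves, and bits are taken most significant bit first, which matches how write_bit above shifts bits into cur_byte.

```python
def write_tree(node):
    """Serialize a tree in pre-order as a string of '0'/'1' characters."""
    if isinstance(node, str):  # leaf: marker bit 1 plus 8 bits of the character
        return "1" + format(ord(node), "08b")
    left, right = node         # internal node: marker bit 0, then both subtrees
    return "0" + write_tree(left) + write_tree(right)

def bits_from_bytes(data):
    """Yield the bits of a bytes object one at a time, most significant bit first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield "1" if (byte >> shift) & 1 else "0"

def read_tree(bits):
    """Rebuild the tree from an iterator of '0'/'1' characters."""
    if next(bits) == "1":
        value = int("".join(next(bits) for _ in range(8)), 2)
        return chr(value)
    left = read_tree(bits)
    right = read_tree(bits)
    return (left, right)

tree = (("a", "b"), ("c", "d"))
header = write_tree(tree)

# Pad to a whole number of bytes, as a file would require, then read it back.
padded = header + "0" * (-len(header) % 8)
packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
rebuilt = read_tree(bits_from_bytes(packed))
print(rebuilt == tree)
```

The reader never needs to know the header length in advance: the recursion stops by itself once every internal node has received both children, and any padding bits after that point are simply left unread.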

Case-insensitive Lua pattern-matching

I'm writing a grep utility in Lua for our mobile devices running Windows CE 6/7, but I've run into some issues implementing case-insensitive match patterns. The obvious solution of converting everything to uppercase (or lower) does not work so simply due to the character classes.
The only other thing I can think of is converting the literals in the pattern itself to uppercase.
Here's what I have so far:
function toUpperPattern(instr)
-- Check first character
if string.find(instr, "^%l") then
instr = string.upper(string.sub(instr, 1, 1)) .. string.sub(instr, 2)
end
-- Check the rest of the pattern
while 1 do
local a, b, str = string.find(instr, "[^%%](%l+)")
if not a then break end
if str then
instr = string.sub(instr, 1, a) .. string.upper(string.sub(instr, a+1, b)) .. string.sub(instr, b + 1)
end
end
return instr
end
I hate to admit how long it took to get even that far, and I can still see right away there are going to be problems with things like escaped percent signs '%%'
I figured this must be a fairly common issue, but I can't seem to find much on the topic.
Are there any easier (or at least complete) ways to do this? I'm starting to go crazy here...
Hoping you Lua gurus out there can enlighten me!
Try something like this:
function case_insensitive_pattern(pattern)
-- find an optional '%' (group 1) followed by any character (group 2)
local p = pattern:gsub("(%%?)(.)", function(percent, letter)
if percent ~= "" or not letter:match("%a") then
-- if the '%' matched, or `letter` is not a letter, return "as is"
return percent .. letter
else
-- else, return a case-insensitive character class of the matched letter
return string.format("[%s%s]", letter:lower(), letter:upper())
end
end)
return p
end
print(case_insensitive_pattern("xyz = %d+ or %% end"))
which prints:
[xX][yY][zZ] = %d+ [oO][rR] %% [eE][nN][dD]
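For comparison, here is the same transformation sketched in Python, operating on a Lua pattern held in a Python string. The function name is made up for this illustration. Like the Lua function above, it does not treat letters inside an existing [...] set specially.

```python
import re

def case_insensitive_lua_pattern(pattern):
    """Expand each unescaped letter of a Lua pattern into a two-case character set."""
    def expand(match):
        percent, char = match.group(1), match.group(2)
        if percent or not char.isalpha():
            # Escaped items such as %d or %% and non-letters are kept as they are.
            return percent + char
        return "[{}{}]".format(char.lower(), char.upper())

    # An optional '%' (group 1) followed by any single character (group 2).
    return re.sub(r"(%?)(.)", expand, pattern, flags=re.DOTALL)

print(case_insensitive_lua_pattern("xyz = %d+ or %% end"))
```

Consuming the optional '%' together with the character after it is what keeps %d from being rewritten as %[dD].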
Lua 5.1, LPeg v0.12
local re = require("re") -- the re module that ships with LPeg
do
local p = re.compile([[
pattern <- ( {b} / {escaped} / brackets / other)+
b <- "%b" . .
escaped <- "%" .
brackets <- { "[" ([^]%]+ / escaped)* "]" }
other <- [^[%]+ -> cases
]], {
cases = function(str) return (str:gsub('%a',function(a) return '['..a:lower()..a:upper()..']' end)) end
})
local pb = re.compile([[
pattern <- ( {b} / {escaped} / brackets / other)+
b <- "%b" . .
escaped <- "%" .
brackets <- {: {"["} ({escaped} / bcases)* {"]"} :}
bcases <- [^]%]+ -> bcases
other <- [^[%]+ -> cases
]], {
cases = function(str) return (str:gsub('%a',function(a) return '['..a:lower()..a:upper()..']' end)) end
, bcases = function(str) return (str:gsub('%a',function(a) return a:lower()..a:upper() end)) end
})
function iPattern(pattern,brackets)
('sanity check'):find(pattern)
return table.concat({re.match(pattern, brackets and pb or p)})
end
end
local test = '[ab%c%]d%%]+ o%%r %bnm'
print(iPattern(test)) -- [ab%c%]d%%]+ [oO]%%[rR] %bnm
print(iPattern(test,true)) -- [aAbB%c%]dD%%]+ [oO]%%[rR] %bnm
print(('qwe [%D]% O%r n---m asd'):match(iPattern(test, true))) -- %D]% O%r n---m
Pure Lua version:
It is necessary to analyze every character of the string to convert it into a correct pattern, because Lua patterns do not support alternation the way regexps do, e.g. (abc|something).
function iPattern(pattern, brackets)
('sanity check'):find(pattern)
local tmp = {}
local i=1
while i <= #pattern do -- a numeric 'for' does not allow changing the loop counter
local char = pattern:sub(i,i) -- current char
if char == '%' then
tmp[#tmp+1] = char -- add to tmp table
i=i+1 -- next char position
char = pattern:sub(i,i)
tmp[#tmp+1] = char
if char == 'b' then -- '%bxy' - add next 2 chars
tmp[#tmp+1] = pattern:sub(i+1,i+2)
i=i+2
end
elseif char=='[' then -- brackets
tmp[#tmp+1] = char
i = i+1
while i <= #pattern do
char = pattern:sub(i,i)
if char == '%' then -- no '%bxy' inside brackets
tmp[#tmp+1] = char
tmp[#tmp+1] = pattern:sub(i+1,i+1)
i = i+1
elseif char:match("%a") then -- letter
tmp[#tmp+1] = not brackets and char or char:lower()..char:upper()
else -- something else
tmp[#tmp+1] = char
end
if char==']' then break end -- close bracket
i = i+1
end
elseif char:match("%a") then -- letter
tmp[#tmp+1] = '['..char:lower()..char:upper()..']'
else
tmp[#tmp+1] = char -- something else
end
i=i+1
end
return table.concat(tmp)
end
local test = '[ab%c%]d%%]+ o%%r %bnm'
print(iPattern(test)) -- [ab%c%]d%%]+ [oO]%%[rR] %bnm
print(iPattern(test,true)) -- [aAbB%c%]dD%%]+ [oO]%%[rR] %bnm
print(('qwe [%D]% O%r n---m asd'):match(iPattern(test, true))) -- %D]% O%r n---m

How can I do mod without a mod operator?

This scripting language doesn't have a % or Mod(). I do have a Fix() that chops off the decimal part of a number. I only need positive results, so don't get too robust.
Will
// mod = a % b
c = Fix(a / b)
mod = a - b * c
do? I'm assuming you can at least divide here. All bets are off on negative numbers.
a mod n = a - (n * Fix(a/n))
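A quick way to check the formula is to run it in a language that has both pieces. In this Python sketch, `math.trunc` stands in for Fix() (an assumption about how Fix() behaves, based on the description "chops off the decimal part"), and `mod_via_fix` is a name invented for the illustration.

```python
import math

def mod_via_fix(a, n):
    """Remainder of a divided by n, using only division, truncation, * and -."""
    quotient = math.trunc(a / n)  # stand-in for Fix(a / n)
    return a - n * quotient

print(mod_via_fix(17, 5))  # 2
print(mod_via_fix(10, 5))  # 0
print(mod_via_fix(3, 7))   # 3
print(mod_via_fix(-7, 3))  # -1
```

For negative a the result takes the sign of a, which is why the answers above only promise positive inputs. Because the quotient goes through floating-point division, very large values may lose precision.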
For posterity, BrightScript now has a modulo operator, it looks like this:
c = a mod b
If someone arrives later, here are some more actual algorithms (with errors... read carefully):
https://eprint.iacr.org/2014/755.pdf
There are actually two main kinds of reduction formulae: Barrett and Montgomery. The paper from eprint repeats both in different versions (algorithms 1-3) and gives an "improved" version in algorithm 4.
Overview
Here is an overview of algorithm 4:
1.) Compute "A*B" and store the whole product in "C"; C and the modulus $p$ are the input for the algorithm.
2.) Compute the bit-length of $p$, say: the function "Width(p)" returns exactly that value.
3.) Split the input $C$ into N "blocks" of size "Width(p)" and store each in G. Start in G[0] = lsb(p) and end in G[N-1] = msb(p). (The paper's description of this step is really faulty.)
4.) Start the while loop:
Set N=N-1 (to reach the last element)
precompute $b:=2^{Width(p)} \bmod p$
while N>0 do:
T = G[N]
for(i=0; i<Width(p); i++) do: // note: the counter value is not used, it only limits the loop
T = T << 1 //leftshift by 1 bit
while is_set( bit( T, Width(p) ) ) do // (N+1)-th bit of T is 1
unset( bit( T, Width(p) ) ) // unset the (N+1)-th bit of T (==0)
T += b
endwhile
endfor
G[N-1] += T
while is_set( bit( G[N-1], Width(p) ) ) do
unset( bit( G[N-1], Width(p) ) )
G[N-1] += b
endwhile
N -= 1
endwhile
That does a lot. Now we only need to repeatedly reduce G[0]:
while G[0] >= p do
G[0] -= p
endwhile
return G[0]// = C mod p
The other three algorithms are well defined, but this one lacks some information or presents it really wrongly. But it works for any size ;)
What language is it?
A basic algorithm might be:
hold the modulo in a variable (modulo);
hold the target number in a variable (target);
initialize modulus variable to 0;
while (target > 0) {
if (target >= modulo) {
target -= modulo;
}
else if(target < modulo) {
modulus = target;
break;
}
}
This may not work for you performance-wise, but:
while (num >= mod_limit)
num = num - mod_limit
In javascript:
function modulo(num1, num2) {
if (num2 === 0 || isNaN(num1) || isNaN(num2)) {
return NaN;
}
if (num1 === 0) {
return 0;
}
var remainderIsPositive = num1 >= 0;
num1 = Math.abs(num1);
num2 = Math.abs(num2);
while (num1 >= num2) {
num1 -= num2
}
return remainderIsPositive ? num1 : 0 - num1;
}
