The performance of Erlang numeric calculation

C version:
#include <stdio.h>
#include <stdlib.h>

unsigned int test(unsigned int n_count) {
    unsigned int c = 1;
    unsigned int i;
    for (i = 0; i < n_count; i++) {
        c += 2 * 34 + 1;
        c /= 2;
        c *= 39;
    }
    return c;
}

int main(int argc, char* argv[])
{
    printf("%u\n", test(atoi(argv[1])));
}
Result:
$ gcc p2.c
$ time ./a.out 100000000
563970997
real 0m0.865s
user 0m0.864s
sys 0m0.004s
Erlang version:
-module(test2).
-export([main/1]).
-mode(compile).

calc(Cnt, Total) when Cnt > 0 ->
    if Total >= 4294967296 -> Total2 = Total rem 4294967296;
       true -> Total2 = Total
    end,
    calc(Cnt - 1, trunc((Total2 + 2 * 34 + 1) / 2) * 39);
calc(0, Total) ->
    if Total >= 4294967296 -> Total2 = Total rem 4294967296;
       true -> Total2 = Total
    end,
    io:format("~p ~n", [Total2]),
    ok.

main([A]) ->
    Cnt = list_to_integer(A),
    calc(Cnt, 1).
Result:
$ erlc +native +"{hipe, [to_llvm]}" test2.erl
$ time escript test2.beam 100000000
563970997
real 0m4.940s
user 0m4.892s
sys 0m0.056s
$ erlc +native test2.erl
$ time escript test2.beam 100000000
563970997
real 0m5.381s
user 0m5.320s
sys 0m0.064s
$ erlc test2.erl
$ time escript test2.beam 100000000
563970997
real 0m9.868s
user 0m9.808s
sys 0m0.056s
How can I improve the performance of the Erlang version?
In Erlang I have to simulate the integer overflow myself; is there a better way to do that?
And even with HiPE, the performance is far from C.
Edit:
Python version:
def test(n_count):
    c = 1
    for i in xrange(n_count):
        c += 2 * 34 + 1
        c /= 2
        c *= 39
        if c >= 4294967296:
            c = c % 4294967296
    return c

print test(100000000)
Result:
$ time python p2.py
563970997
real 0m17.813s
user 0m17.808s
sys 0m0.008s
$ time pypy p2.py
563970997
real 0m1.852s
user 0m0.508s
sys 0m0.128s
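
On the overflow question: since 4294967296 is 2^32, the if/rem pair can be replaced by an unconditional band with 16#FFFFFFFF, and trunc(X / 2), which goes through float division, can be replaced by integer division with div. Applying the mask unconditionally also removes one comparison per iteration. A minimal sketch along those lines, keeping the same arithmetic as test2 above (the module name test3 is just for illustration, and this is a sketch rather than a measured result):

-module(test3).
-export([main/1]).
-mode(compile).

%% 16#FFFFFFFF = 2^32 - 1, so X band ?MASK == X rem 4294967296 for X >= 0.
-define(MASK, 16#FFFFFFFF).

calc(0, Total) ->
    io:format("~p~n", [Total]),
    ok;
calc(Cnt, Total) ->
    %% div instead of trunc(... / 2), and an unconditional mask
    %% instead of the if/rem pair in test2.
    calc(Cnt - 1, (((Total + 2 * 34 + 1) div 2) * 39) band ?MASK).

main([A]) ->
    Cnt = list_to_integer(A),
    calc(Cnt, 1).

Even so, a per-iteration function call in BEAM code will not match a native C loop; closing the rest of the gap is what the HiPE/native options and the C-integration answers below are about.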

I think the following link may be especially helpful; it shows how to 'bake' your C code into your Erlang application:
http://www.erlang.org/doc/tutorial/c_port.html
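
For orientation, the Erlang side of such a port is small. A minimal sketch, where the external program ./calc_port and its 4-byte request/reply format are assumptions for illustration (the C side still has to be written against whatever framing you choose):

calc_via_port(Count) ->
    %% {packet, 2} makes the runtime add/strip a 2-byte length header,
    %% so the external program only ever sees whole messages.
    Port = open_port({spawn, "./calc_port"}, [{packet, 2}, binary]),
    Port ! {self(), {command, <<Count:32>>}},
    receive
        {Port, {data, <<Result:32>>}} ->
            port_close(Port),
            Result
    after 5000 ->
        port_close(Port),
        {error, timeout}
    end.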

Erlang is really not good at number-crunching tasks. It is good if you want to take bytes and send them somewhere.
The usual serious Erlang development cycle includes a final optimization pass in which you rewrite some bottleneck modules in native code.
Yes, Erlang can look like it is good at calculation (projects like Wings3D suggest that), but maybe you should choose another tool?
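
One common vehicle for that final rewrite is a NIF: the Erlang module stays a thin stub and the loop body lives in C. A sketch of the stub only, with the module and library names (test2_nif, ./test2_nif.so) and the C implementation itself left as assumptions:

-module(test2_nif).
-export([calc/2]).
-on_load(init/0).

%% erlang:load_nif/2 takes the shared-object path without the .so suffix.
init() ->
    erlang:load_nif("./test2_nif", 0).

%% Replaced by the native implementation once the NIF library is loaded.
calc(_Cnt, _Total) ->
    erlang:nif_error(nif_not_loaded).

Note that a NIF that runs for a long time blocks a scheduler thread, so a loop of 100 million iterations would need to be chunked or run as a dirty NIF.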

Related

Why is GraalVM + native-image slower than GraalVM alone on a while loop?

Just for fun, I'm trying to compare gcc (9.4.0), OpenJDK (11.0.12), GraalVM (22.3.r19) and GraalVM + native-image (22.3.r19) performances on a "while loop n++" use case (see programs below).
Bottom line on Linux is (see results below): native-image is slower than all other options.
So I'm wondering: am I missing something (magic command-line option)? Or is it just life that native-image is slower on this particular program (and that's fine)?
Count.java
public class Count {
    public static void main(String[] args) {
        int n = 0;
        int inc = Math.random() >= 0 ? 1 : 0; // to prevent the optimizer from removing the loop
        while (n < 1000000000) {
            n += inc;
        }
        System.out.println(n);
    }
}
count.c
#include <time.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int n = 0;
    srand(time(NULL));
    int inc = rand() >= 0 ? 1 : 0; // to prevent the optimizer from removing the loop
    while (n < 1000000000) {
        n += inc;
    }
}
gcc:
me@laptop:~/dev/java-count-graalvm$ gcc -O2 -s -DNDEBUG count.c -o count
me@laptop:~/dev/java-count-graalvm$ time ./count
real 0m0,261s
user 0m0,261s
sys 0m0,000s
OpenJDK 11:
me@laptop:~/dev/java-count-graalvm$ time java -classpath target/classes Count
1000000000
real 0m0,632s
user 0m0,612s
sys 0m0,030s
GraalVM:
me@laptop:~/dev/java-count-graalvm$ time java -classpath target/classes Count
1000000000
real 0m0,326s
user 0m0,362s
sys 0m0,013s
GraalVM native-image:
me@laptop:~/dev/java-count-graalvm$ native-image -cp target/classes Count
me@laptop:~/dev/java-count-graalvm$ time ./count
1000000000
real 0m1,283s
user 0m1,271s
sys 0m0,013s
For the sake of sanity, I commented out the while loop, and the native image returns in 3 milliseconds:
bruno@hearne:~/dev/java-count-graalvm$ time ./count
0
real 0m0,003s
user 0m0,000s
sys 0m0,003s
So I would say that the penalty is coming from the while loop and nothing else.

How would one create a bitwise rotation function in dart?

I'm in the process of creating a cryptography package for Dart (https://pub.dev/packages/steel_crypt). Right now, most of what I've done is either exposed from PointyCastle or simple-ish algorithms where bitwise rotations are unnecessary or replaceable by >> and <<.
However, as I move toward more complicated cryptographic solutions, which I can handle mathematically, I'm unsure how to implement bitwise rotation in Dart with maximum efficiency. Because of the nature of cryptography, speed is emphasized and uncompromising: I need the fastest possible implementation.
I've ported a method of bitwise rotation from Java. I'm pretty sure this is correct, but unsure of the efficiency and readability:
My tested implementation is below:
int INT_BITS = 64; // Dart ints are 64 bit

static int leftRotate(int n, int d) {
  // In n << d, the last d bits are 0.
  // To put the first d bits of n at the end,
  // bitwise-or n << d with n >> (INT_BITS - d).
  return (n << d) | (n >> (INT_BITS - d));
}

static int rightRotate(int n, int d) {
  // In n >> d, the first d bits are 0.
  // To put the last d bits of n at the front,
  // bitwise-or n >> d with n << (INT_BITS - d).
  return (n >> d) | (n << (INT_BITS - d));
}
EDIT (for clarity): Dart has no unsigned right shift, meaning that >> is an arithmetic (sign-extending) right shift, which bears more significance than I might have thought. It poses a challenge that other languages don't in terms of devising an answer. The accepted answer below explains this and also shows the correct method of bitwise rotation.
As pointed out, Dart has no >>> (unsigned right shift) operator, so you have to rely on the signed shift operator.
In that case,
int rotateLeft(int n, int count) {
  const bitCount = 64; // make it 32 for JavaScript compilation.
  assert(count >= 0 && count < bitCount);
  if (count == 0) return n;
  return (n << count) |
      ((n >= 0) ? n >> (bitCount - count) : ~(~n >> (bitCount - count)));
}
should work.
This code only works for the native VM. When compiling to JavaScript, numbers are doubles, and bitwise operations are only done on 32-bit numbers.

Integer overflow in Fibonacci number

I was solving this CodeChef problem on Fibonacci numbers. It says the number can have up to 1000 digits, so why does it not cause integer overflow in the tester's solution when the number is read and stored in an unsigned long long int? I can't understand how the solution works. Below are the problem and the tester's solution.
The Head Chef has been playing with Fibonacci numbers for long. He has learnt several tricks related to Fibonacci numbers. Now he wants to test his chefs in the skills.
A Fibonacci number is defined by the recurrence:
f(n) = f(n-1) + f(n-2) for n > 2
and f(1) = 0
and f(2) = 1.
Given a number A, determine if it is a Fibonacci number.
Input
The first line of the input contains an integer T denoting the number of test cases. The description of T test cases follows.
The only line of each test case contains a single integer A denoting the number to be checked.
Output
For each test case, output a single line containing "YES" if the given number is a Fibonacci number, otherwise output a single line containing "NO".
Constraints
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
Example
Input:
3
3
4
5
Output:
YES
NO
YES
Tester's solution:
#include <iostream>
#include <cstdio>
#include <algorithm>
#include <set>
#include <cstring>
using namespace std;

int const mx = 6666;
set <unsigned long long> f;
unsigned long long fib[mx + 10];
char s[mx + 1];

int main(){
    // freopen("input.txt", "r", stdin);
    // freopen("output.txt", "w", stdout);
    fib[0] = 0;
    fib[1] = 1;
    f.insert(1);
    f.insert(0);
    int i;
    for (i = 2; i <= mx; i++){
        fib[i] = fib[i - 1] + fib[i - 2];
        f.insert(fib[i]);
    }
    int tc;
    cin >> tc;
    while (tc--){
        unsigned long long n = 0, ten = 10;
        cin >> s;
        int len = strlen(s);
        for (i = 0; i < len; i++){
            char q = s[i];
            unsigned long long a = q - '0';
            n = n * ten + a;
        }
        if (f.find(n) == f.end()) printf("NO\n");
        else printf("YES\n");
    }
    return 0;
}
From cplusplus.com you will see that:
ULLONG_MAX: Maximum value for an object of type unsigned long long int is 18446744073709551615 (2^64 - 1) or greater.
The actual value depends on the particular system and library implementation, but shall reflect the limits of these types in
the target platform.
The above information is just to let you know that it is a BIG number. However, the reason you are not seeing an overflow is not that limit.
Most probably, the judge's input file simply does not contain any input that would expose the overflow.
It is still possible to construct such an input while satisfying the constraints:
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
For example, 2^64 + 5 = 18446744073709551621 has only 20 digits and is not a Fibonacci number, but reading it digit by digit into an unsigned long long wraps it around to 5, which is in the set, so this solution would wrongly print YES for it.

Seemingly inconsistent results when profiling memory in Go

I have recently been running some numerical codes written in Go on large datasets and have been encountering memory management issues. While attempting to profile the problem, I have measured the memory usage of my program in three different ways: with Go's runtime/pprof package, with the unix time utility, and by manually adding up the size of the data that I allocated. These three methods do not give me consistent results.
Below is a simplified version of the code that I am profiling. It allocates several slices, puts values at every index and places each of them inside of a parent slice:
package main

import (
    "fmt"
    "os"
    "runtime/pprof"
    "unsafe"
    "flag"
)

var mprof = flag.String("mprof", "", "write memory profile to this file")

func main() {
    flag.Parse()
    N := 1 << 15
    psSlice := make([][]int64, N)
    _ = psSlice
    size := 0
    for i := 0; i < N; i++ {
        ps := make([]int64, 1<<10)
        for i := range ps { ps[i] = int64(i) }
        psSlice[i] = ps
        size += int(unsafe.Sizeof(ps[0])) * len(ps)
    }
    if *mprof != "" {
        f, err := os.Create(*mprof)
        if err != nil { panic(err) }
        pprof.WriteHeapProfile(f)
        f.Close()
    }
    fmt.Printf("total allocated: %d MB\n", size >> 20)
}
Running this with the command $ time time -f "%M kB" ./mem_test -mprof=out.mprof results in the output:
total allocated: 256 MB
1141216 kB
real 0m0.150s
user 0m0.031s
sys 0m0.113s
Here the first number, 256 MB, is just the size of the arrays computed from unsafe.Sizeof and the second number, 1055 MB, is what time reports. Running the pprof tool results in
(pprof) top1
Total: 108.2 MB
107.8 99.5% 99.5% 107.8 99.5% main.main
These results scale smoothly in the way you would expect them to for slices of smaller or larger lengths.
Why don't these three numbers line up more closely?
First, you need to provide an error-free example. Let's start with the basic numbers. For example,
package main

import (
    "fmt"
    "runtime"
    "unsafe"
)

func WriteMatrix(nm [][]int64) {
    for n := range nm {
        for m := range nm[n] {
            nm[n][m]++
        }
    }
}

func NewMatrix(n, m int) [][]int64 {
    a := make([]int64, n*m)
    nm := make([][]int64, n)
    lo, hi := 0, m
    for i := range nm {
        nm[i] = a[lo:hi:hi]
        lo, hi = hi, hi+m
    }
    return nm
}

func MatrixSize(nm [][]int64) int64 {
    size := int64(0)
    for i := range nm {
        size += int64(unsafe.Sizeof(nm[i]))
        for j := range nm[i] {
            size += int64(unsafe.Sizeof(nm[i][j]))
        }
    }
    return size
}

var nm [][]int64

func main() {
    n, m := 1<<15, 1<<10
    var ms1, ms2 runtime.MemStats
    runtime.ReadMemStats(&ms1)
    nm = NewMatrix(n, m)
    WriteMatrix(nm)
    runtime.ReadMemStats(&ms2)
    fmt.Println(runtime.GOARCH, runtime.GOOS)
    fmt.Println("Actual: ", ms2.TotalAlloc-ms1.TotalAlloc)
    fmt.Println("Estimate:", n*3*8+n*m*8)
    fmt.Println("Total: ", ms2.TotalAlloc)
    fmt.Println("Size: ", MatrixSize(nm))
    // check top VIRT and RES for COMMAND peter
    for {
        WriteMatrix(nm)
    }
}
Output:
$ go build peter.go && /usr/bin/time -f "%M KiB" ./peter
amd64 linux
Actual: 269221888
Estimate: 269221888
Total: 269240592
Size: 269221888
^C
Command exited with non-zero status 2
265220 KiB
$
$ top
VIRT 284268 RES 265136 COMMAND peter
Is this what you expected?
See MatrixSize for the correct way to calculate the memory size.
The infinite loop at the end keeps the process alive so we can inspect it with the top command, and it pins the matrix as resident by continually updating it.
What results do you get when you run this program?
BUG:
Your result from /usr/bin/time is 1056992 KiB, which is too large by a factor of four. It's a bug in your version of /usr/bin/time: on Linux, ru_maxrss is reported in kilobytes, not pages, but time still converts it as if it were a page count (4096-byte pages / 1024 = a factor of 4). My version of Ubuntu has been patched.
References:
Re: GNU time: incorrect results
time-1.7 counts rusage wrong on Linux
GNU Project Archives: time
“time” 1.7-24 source package in Ubuntu. ru_maxrss is reported in KBytes not pages. (Closes: #649402)
#649402 - [PATCH] time overestimates max RSS by a factor of 4 - Debian Bug report logs
Subject: Fix ru_maxrss reporting
Author: Richard Kettlewell
Bug-Debian: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=649402
--- time-1.7.orig/time.c
+++ time-1.7/time.c
@@ -392,7 +398,7 @@
ptok ((UL) resp->ru.ru_ixrss) / MSEC_TO_TICKS (v));
break;
case 'M': /* Maximum resident set size. */
- fprintf (fp, "%lu", ptok ((UL) resp->ru.ru_maxrss));
+ fprintf (fp, "%lu", (UL) resp->ru.ru_maxrss);
break;
case 'O': /* Outputs. */
fprintf (fp, "%ld", resp->ru.ru_oublock);

How do I create an FCS for PPP packets?

I am trying to create a software simulation on an Ubuntu GNU/Linux machine which will work like PPPoE. I would like this simulator to take outgoing packets, strip off the ethernet header, insert the PPP flags (7E, FF, 03, 00, and 21) and place the IP layer information in the PPP packet. I am having trouble with the FCS that goes after the data. From what I can tell, the cell modem I am using has a 2 byte FCS using the CRC16-CCITT method. I have found several pieces of software that will calculate this checksum, but none of them produce what is coming out the serial line (I have a serial line "sniffer" that shows me everything the modem is being sent).
I have been looking into the source of pppd and the linux kernel itself, and I can see that both of them have a method of adding an FCS to the data. It seems quite difficult to implement, as I have no experience in kernel hacking. Can someone come up with a simple way (preferably in Python) of calculating an FCS that matches the one that the kernel produces?
Thanks.
P.S. If anyone wants, I can add a sample of the data output I am getting to the serial modem.
I used the simple Python library crcmod.
import crcmod  # pip3 install crcmod

fcsData = "A0 19 03 61 DC"
fcsData = ''.join(fcsData.split(' '))
print(fcsData)

# Reflected CRC-16 over the CCITT polynomial 0x1021, with final XOR 0xFFFF
crc16 = crcmod.mkCrcFun(0x11021, rev=True, initCrc=0x0000, xorOut=0xFFFF)
print(hex(crc16(bytes.fromhex(fcsData))))
fcs = hex(crc16(bytes.fromhex(fcsData)))
I recently did something like this while testing code to kill a PPP connection.
This worked for me:
# RFC 1662 Appendix C
def mkfcstab():
    P = 0x8408
    def valiter():
        for b in range(256):
            v = b
            i = 8
            while i:
                v = (v >> 1) ^ P if v & 1 else v >> 1
                i -= 1
            yield v & 0xFFFF
    return tuple(valiter())

fcstab = mkfcstab()

PPPINITFCS16 = 0xffff  # Initial FCS value
PPPGOODFCS16 = 0xf0b8  # Good final FCS value

def pppfcs16(fcs, bytelist):
    for b in bytelist:
        fcs = (fcs >> 8) ^ fcstab[(fcs ^ b) & 0xff]
    return fcs
To get the value:
fcs = pppfcs16(PPPINITFCS16, (ord(c) for c in frame)) ^ 0xFFFF
and swap the bytes (I used chr((fcs & 0xFF00) >> 8), chr(fcs & 0x00FF))
Got this from mbed.org PPP-Blinky:
// http://www.sunshine2k.de/coding/javascript/crc/crc_js.html - Correctly calculates
// the 16-bit FCS (crc) on our frames (Choose CRC16_CCITT_FALSE)
int crc;

void crcReset()
{
    crc = 0xffff; // crc restart
}

void crcDo(int x) // cumulative crc
{
    for (int i = 0; i < 8; i++) {
        crc = ((crc & 1) ^ (x & 1)) ? (crc >> 1) ^ 0x8408 : crc >> 1; // crc calculator
        x >>= 1;
    }
}

int crcBuf(char * buf, int size) // crc on an entire block of memory
{
    crcReset();
    for (int i = 0; i < size; i++) crcDo(*buf++);
    return crc;
}
