I am trying to compile LLVM IR code that contains a few large arrays (400 elements). When I try to compile this with clang (not run, just compile) - it takes longer than 10 minutes.
IR Code
define i32 @main() {
%j = alloca double
%i = alloca double
%foo = alloca [400 x double]
%B = alloca [400 x [400 x double]]
%A = alloca [400 x [400 x double]]
%1 = insertvalue [400 x double] undef, double 0.000000e+00, 0
%2 = insertvalue [400 x [400 x double]] undef, [400 x double] %1, 0
store [400 x [400 x double]] %2, [400 x [400 x double]]* %A
%3 = insertvalue [400 x double] undef, double 0.000000e+00, 0
%4 = insertvalue [400 x double] undef, double 0.000000e+00, 0
%5 = insertvalue [400 x [400 x double]] undef, [400 x double] %4, 0
store [400 x [400 x double]] %5, [400 x [400 x double]]* %B
%6 = insertvalue [400 x double] undef, double 0.000000e+00, 0
store [400 x double] %6, [400 x double]* %foo
store double 0.000000e+00, double* %i
ret i32 0
}
Command to run: clang out.ll -o built
Edit
I think this has to do with clang trying to create the arrays or something. As I make the arrays bigger, clang builds take longer, but running the programs takes about the same amount of time. It does not really make sense to me why this would happen, but it does appear to be what is happening.
Versions
LLVM: Apple LLVM version 9.0.0 (clang-900.0.39.2)
Clang: clang version 6.0.0 (tags/RELEASE_600/final)
(added from comments)
How do I make this take less time? ... It seems weird that it would take this long. I know there is a way to make it take less time because, for example, C is able to make arrays that big and it will compile in no time.
Edit 2
I tried implementing malloc in order to have the arrays allocated on the heap instead of the stack. Here is some new IR Code. My question is where is this being allocated? When I generate multidimensional arrays it is still really slow - in that case how would I speed it up again?
%foo = alloca [400 x [400 x double]]
%calltmp1 = call i8* @malloc(i64 10240000)
%4 = bitcast i8* %calltmp1 to [400 x [400 x double]]*
%5 = getelementptr [400 x [400 x double]], [400 x [400 x double]]* %4, i32 0, i32 0
%calltmp2 = call i8* @malloc(i64 25600)
%6 = bitcast i8* %calltmp2 to [400 x double]*
%7 = getelementptr [400 x double], [400 x double]* %6, i32 0, i32 0
store double 1.000000e+00, double* %7
%8 = getelementptr [400 x double], [400 x double]* %6, i32 0, i32 1
store double 1.000000e+00, double* %8
%9 = getelementptr [400 x double], [400 x double]* %6, i32 0, i32 1
store double 1.000000e+00, double* %9
%initialized_array3 = load [400 x double], [400 x double]* %6
store [400 x double] %initialized_array3, [400 x double]* %5
%initialized_array4 = load [400 x [400 x double]], [400 x [400 x double]]* %4
store [400 x [400 x double]] %initialized_array4, [400 x [400 x double]]* %foo
Edit 3
Sorry for all the edits, but I think the extra info is helpful.
Here is some more IR Code I generated:
%foo = alloca [400 x [400 x double]]
store [400 x [400 x double]] undef, [400 x [400 x double]]* %foo
%0 = getelementptr [400 x [400 x double]], [400 x [400 x double]]* %foo, i32 0, i32 0
%1 = getelementptr [400 x double], [400 x double]* %0, i32 0, i32 0
store double 1.598000e+03, double* %1
This is almost identical to this IR code generated from C:
%1 = alloca [400 x [400 x i32]], align 16
%2 = alloca i32*, align 8
%3 = getelementptr inbounds [400 x [400 x i32]], [400 x [400 x i32]]* %1, i64 0, i64 0
%4 = getelementptr inbounds [400 x i32], [400 x i32]* %3, i64 0, i64 0
store i32 1, i32* %4, align 16
However, the C code compiles in less than a second, while mine takes so long that I cannot even tell how long it needs. This is because of line 2 in the first code snippet (repeated below for reference). Why does that line cause clang to run so slowly?
Line slowing it down:
store [400 x [400 x double]] undef, [400 x [400 x double]]* %foo
Can anyone help me write logic to expand the following expression and list all the possible expressions?
1 and 2 and ((3 and 4) or 5 or (6 and 7))
Expected result:
1 and 2 and 3 and 4
1 and 2 and 5
1 and 2 and 6 and 7
I interpreted the question as follows. Suppose x1, x2,..., x7 are seven variables, each having a value of true or false. List the combinations of values of variables that result in
x1 and x2 and ((x3 and x4) or x5 or (x6 and x7))
evaluating true.
We can obtain the desired results as follows.
arr = [true, false].repeated_permutation(7).
  select { |x1,x2,x3,x4,x5,x6,x7| x1 and x2 and ((x3 and x4) or x5 or (x6 and x7)) }
  #=> [[true, true, true, true, true, true, true],
  #    [true, true, true, true, true, true, false],
  #    ...
  #    [true, true, false, false, false, true, true]]
See Array#repeated_permutation.
To make the results easier to visualize we can display a table that shows which variables are true for each of the 23 elements of arr:
puts "x1 x2 x3 x4 x5 x6 x7"
puts "--------------------"
arr.each { |a| puts a.map { |tf| tf ? "X  " : "   " }.join }
displays
x1 x2 x3 x4 x5 x6 x7
--------------------
X  X  X  X  X  X  X
X  X  X  X  X  X
X  X  X  X  X     X
X  X  X  X  X
X  X  X  X     X  X
X  X  X  X     X
X  X  X  X        X
X  X  X  X
X  X  X     X  X  X
X  X  X     X  X
X  X  X     X     X
X  X  X     X
X  X  X        X  X
X  X     X  X  X  X
X  X     X  X  X
X  X     X  X     X
X  X     X  X
X  X     X     X  X
X  X        X  X  X
X  X        X  X
X  X        X     X
X  X        X
X  X           X  X
Note that x1 and x2 both evaluate true for each element of arr. Also, there are 2**4 #=> 16 combinations for which x5 is true, as x3, x4, x6 and x7 can each be either true or false when x5 is true.
Since all expressions are literals, there are no possible different results, and this expression will always evaluate to 4:
1 and 2 and ((3 and 4) or 5 or (6 and 7))
1 and 2 and ( 4 or 5 or (6 and 7))
1 and 2 and ( 4 or 5 or 7 )
1 and 2 and 4 or 5 or 7
2 and 4 or 5 or 7
4 or 5 or 7
4 or 7
4
When clang++ generates IR code, the code contains align x instructions, and the memory of variables and structs is aligned. For example:
#include <stdio.h>  // for printf

struct CT {
    char c1;
    bool b1;
    int i1;
    double d1; // 8
    char c2;
    int i2;
};

int main() {
    char c1 = 'a';
    int i1 = 2;
    char c2 = 'b';
    char c3 = 'c';
    int i2 = 4;
    printf("address of c1 = %p ;\n", &c1);
    printf("address of i1 = %p ;\n", &i1);
    printf("address of c2 = %p ;\n", &c2);
    printf("address of c3 = %p ;\n", &c3);
    printf("address of i2 = %p ;\n", &i2);
    return 0;
}
It will generate IR code (with align instructions):
source_filename = "oper.cpp"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
@.str = private unnamed_addr constant [22 x i8] c"address of c1 = %p ;\0A\00", align 1
@.str.1 = private unnamed_addr constant [22 x i8] c"address of i1 = %p ;\0A\00", align 1
@.str.2 = private unnamed_addr constant [22 x i8] c"address of c2 = %p ;\0A\00", align 1
@.str.3 = private unnamed_addr constant [22 x i8] c"address of c3 = %p ;\0A\00", align 1
@.str.4 = private unnamed_addr constant [22 x i8] c"address of i2 = %p ;\0A\00", align 1
; Function Attrs: noinline norecurse optnone ssp uwtable
define i32 @main() #0 {
%1 = alloca i32, align 4
%2 = alloca i8, align 1
%3 = alloca i32, align 4
%4 = alloca i8, align 1
%5 = alloca i8, align 1
%6 = alloca i32, align 4
store i32 0, i32* %1, align 4
store i8 97, i8* %2, align 1
store i32 2, i32* %3, align 4
store i8 98, i8* %4, align 1
store i8 99, i8* %5, align 1
store i32 4, i32* %6, align 4
%7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str, i64 0, i64 0), i8* %2)
%8 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.1, i64 0, i64 0), i32* %3)
%9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.2, i64 0, i64 0), i8* %4)
%10 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.3, i64 0, i64 0), i8* %5)
%11 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.4, i64 0, i64 0), i32* %6)
ret i32 0
}
declare i32 @printf(i8*, ...) #1
This will output:
address of c1 = 0x7ffee900328b ;
address of i1 = 0x7ffee9003284 ;
address of c2 = 0x7ffee9003283 ;
address of c3 = 0x7ffee9003282 ;
address of i2 = 0x7ffee900327c ;
We can see that the memory of variables is aligned.
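As a quick arithmetic check of that claim, the sketch below tests whether each printed address is a multiple of the alignment requested in the IR (align 1 for the i8 variables, align 4 for the i32 variables). The names SAMPLES and aligned? are made up for this illustration, and the addresses are simply copied from the output above.

```ruby
# Addresses copied from the program output, paired with the alignment
# requested for that variable in the IR.
SAMPLES = {
  "c1" => [0x7ffee900328b, 1],
  "i1" => [0x7ffee9003284, 4],
  "c2" => [0x7ffee9003283, 1],
  "c3" => [0x7ffee9003282, 1],
  "i2" => [0x7ffee900327c, 4],
}

# An address is aligned to n bytes when it is a multiple of n.
def aligned?(address, alignment)
  (address % alignment).zero?
end

SAMPLES.each do |name, (address, alignment)|
  puts format("%s: 0x%x aligned to %d bytes? %s",
              name, address, alignment, aligned?(address, alignment))
end
```

Both int addresses end in 0x4 and 0xc, so they are multiples of 4; the char addresses only need 1-byte alignment, which any address satisfies.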
But if you delete the align instructions from the IR code, it still runs correctly, and the memory of the variables is still aligned. For example, with the IR code below:
source_filename = "oper.cpp"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"
@.str = private unnamed_addr constant [22 x i8] c"address of c1 = %p ;\0A\00"
@.str.1 = private unnamed_addr constant [22 x i8] c"address of i1 = %p ;\0A\00"
@.str.2 = private unnamed_addr constant [22 x i8] c"address of c2 = %p ;\0A\00"
@.str.3 = private unnamed_addr constant [22 x i8] c"address of c3 = %p ;\0A\00"
@.str.4 = private unnamed_addr constant [22 x i8] c"address of i2 = %p ;\0A\00"
; Function Attrs: noinline norecurse optnone ssp uwtable
define i32 @main() #0 {
%1 = alloca i32
%2 = alloca i8
%3 = alloca i32
%4 = alloca i8
%5 = alloca i8
%6 = alloca i32
store i32 0, i32* %1
store i8 97, i8* %2
store i32 2, i32* %3
store i8 98, i8* %4
store i8 99, i8* %5
store i32 4, i32* %6
%7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str, i64 0, i64 0), i8* %2)
%8 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.1, i64 0, i64 0), i32* %3)
%9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.2, i64 0, i64 0), i8* %4)
%10 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.3, i64 0, i64 0), i8* %5)
%11 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @.str.4, i64 0, i64 0), i32* %6)
ret i32 0
}
declare i32 @printf(i8*, ...) #1
This also prints aligned addresses for the variables, whether you use the clang command or LLJIT (LLVM's JIT API).
My Question:
Can we leave the align instruction out of the IR code? Will the LLVM JIT add it automatically?
The version (calc1) that calls the outer function directly takes about 1 s.
But the version (calc2) that receives the function as a parameter takes about 2 s, which is 2x slower. Why?
open System.Diagnostics
open System.Numerics
let width = 1920
let height = 1200
let xMin = -2.0
let xMax = 1.0
let yMin = -1.0
let yMax = 1.0
let scaleX x = float x * (xMax - xMin) / float width + xMin
let scaleY y = float y * (yMax - yMin) / float height - yMax
let fn (z:Complex) (c:Complex) = z * z + c
let calc1 width height =
    let iterFn z c =
        let rec iterFn' (z:Complex) c n =
            if z.Magnitude > 2.0 || n >= 255 then n
            else iterFn' (fn z c) c (n + 1)
        iterFn' z c 0
    Array.Parallel.init (width * height) (fun i ->
        let x, y = i % width, i / width
        let z, c = Complex.Zero, Complex(scaleX x, scaleY y)
        (x, y, iterFn z c)
    )
let calc2 width height fn =
    let iterFn z c =
        let rec iterFn' (z:Complex) c n =
            if z.Magnitude > 2.0 || n >= 255 then n
            else iterFn' (fn z c) c (n + 1)
        iterFn' z c 0
    Array.Parallel.init (width * height) (fun i ->
        let x, y = i % width, i / width
        let z, c = Complex.Zero, Complex(scaleX x, scaleY y)
        (x, y, iterFn z c)
    )
Executing both in F# Interactive gives the following results:
> calc1 width height |> ignore
Real: 00:00:00.943, CPU: 00:00:03.046, GC gen0: 10, gen1: 8, gen2: 2
val it : unit = ()
> calc2 width height fn |> ignore
Real: 00:00:02.033, CPU: 00:00:07.484, GC gen0: 9, gen1: 8, gen2: 1
val it : unit = ()
F# 4.0.1, .NET 4.6.1
I suspect that in the first case, fn is inlined.
Passing it as a parameter prevents this optimisation from occurring, so it is slower.
I recently started learning how to program in F# and I have an assignment that is giving me some serious headaches.
I have to make a function that takes two arguments, an integer and a five element tuple of integers, and returns true if the sum of any three elements of the tuple is greater than the first argument, else false.
I started designing my code this way
{
let t3 = (1, 2, 3, 4, 5)
let intVal = 1
let check intVal t3 =
for t3
if (*sum of any three elements*) > intVal then true
else false
}
but at this point I am stuck and do not know how to proceed.
An easy way to define it: sort the elements of the tuple in ascending order and compare the sum of the last three elements with the limit:
let inline isAnyThreeGreaterThan2 limit (x1, x2, x3, x4, x5) =
    [x1;x2;x3;x4;x5] |> List.sort |> Seq.skip 2 |> Seq.sum > limit
Example:
isAnyThreeGreaterThan2 15 (1, 2, 5, 5, 5) |> printfn "%A"
isAnyThreeGreaterThan2 14 (1, 2, 5, 5, 5) |> printfn "%A"
isAnyThreeGreaterThan2 15 (1, 2, 5, 5, 6) |> printfn "%A"
isAnyThreeGreaterThan2 15 (1, 2, 3, 4, 5) |> printfn "%A"
isAnyThreeGreaterThan2 12 (1, 2, 3, 4, 5) |> printfn "%A"
isAnyThreeGreaterThan2 11 (1, 2, 3, 4, 5) |> printfn "%A"
Print:
false
true
true
false
false
true
Link:
https://dotnetfiddle.net/7XR1ZA
It could be solved by converting the tuple into an array, generating the possible combinations from it, summing each combination, and then checking whether any of the sums is greater than your parameter:
(1,2,3,4,5)
|> Microsoft.FSharp.Reflection.FSharpValue.GetTupleFields
|> Array.toList
//Implementing this is left as an exercise to the reader
|> combinations 3
//converts the obj list as a int list and then sums the elements
|> List.map (fun x -> x |> List.map unbox<int> |> List.sum)
//Verifies if any sum is greater than intVal
|> List.exists (fun x -> x > intVal)
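The idea of enumerating every 3-element combination can be sketched as follows. This is an illustration in Ruby rather than the F# the assignment asks for, and the method names and CASES are hypothetical. It also compares the brute-force check with the shortcut of summing only the three largest values, under the assumption that both should always agree.

```ruby
# Brute force: test every 3-element combination of the values.
def any_three_greater_brute?(limit, values)
  values.combination(3).any? { |triple| triple.sum > limit }
end

# Shortcut: the three largest values give the largest possible sum,
# so only that one sum needs to be compared with the limit.
def any_three_greater_top3?(limit, values)
  values.max(3).sum > limit
end

# [limit, values] pairs taken from the examples in this thread.
CASES = [
  [15, [1, 2, 5, 5, 5]],
  [14, [1, 2, 5, 5, 5]],
  [15, [1, 2, 5, 5, 6]],
  [15, [1, 2, 3, 4, 5]],
  [12, [1, 2, 3, 4, 5]],
  [11, [1, 2, 3, 4, 5]],
]

CASES.each do |limit, values|
  brute = any_three_greater_brute?(limit, values)
  top3 = any_three_greater_top3?(limit, values)
  puts "limit=#{limit} values=#{values.inspect} brute=#{brute} top3=#{top3}"
end
```

A five-element tuple has only 10 such combinations, so the brute-force version is cheap; the shortcut mainly saves code.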
Something like this ought to do it:
let cross3 l1 l2 l3 =
    [
        for x in l1 do
            for y in l2 do
                for z in l3 do
                    yield x, y, z ]
module Tuple3 =
    let distinct (x, y, z) =
        let l = [x; y; z]
        l |> List.distinct |> List.length = l.Length
    let snd (x, y, z) = snd x, snd y, snd z
    let inline sum (x, y, z) = x + y + z
let inline isAnyThreeGreaterThan limit (x1, x2, x3, x4, x5) =
    let l = [x1; x2; x3; x4; x5] |> List.indexed
    let legalCombinations =
        cross3 l l l
        |> List.filter Tuple3.distinct
        |> List.map Tuple3.snd
    legalCombinations |> List.exists (fun t3 -> Tuple3.sum t3 > limit)
Since this is an assignment, I'll leave it as an exercise to understand what's going on, but here's a sample FSI session:
> isAnyThreeGreaterThan 15 (1, 2, 5, 5, 5);;
val it : bool = false
> isAnyThreeGreaterThan 14 (1, 2, 5, 5, 5);;
val it : bool = true
> isAnyThreeGreaterThan 15 (1, 2, 5, 5, 6);;
val it : bool = true
> isAnyThreeGreaterThan 15 (1, 2, 3, 4, 5);;
val it : bool = false
> isAnyThreeGreaterThan 12 (1, 2, 3, 4, 5);;
val it : bool = false
> isAnyThreeGreaterThan 11 (1, 2, 3, 4, 5);;
val it : bool = true
For example, I have a 2x2 matrix
1 2
3 4
Then I pad it by 2, and it becomes:
x x x x x x
x x x x x x
x x 1 2 x x
x x 3 4 x x
x x x x x x
x x x x x x
Then I use border_replicate to fill in the x values:
x x 1 2 x x
x x 1 2 x x
1 1 1 2 2 2
3 3 3 4 4 4
x x 3 4 x x
x x 3 4 x x
The problem is the x's located at the corners of the new matrix: if I use border_replicate, what will their values be? ....
Thank you very much
Based on your example, it looks like the corners (clockwise from top left) would be 1, 2, 4, and 3.
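One way to see why is to treat replicate padding as clamping each padded coordinate back to the nearest valid row and column of the source matrix. The sketch below assumes border_replicate behaves that way (nearest-edge replication); pad_replicate and MATRIX are hypothetical names used only for this illustration.

```ruby
MATRIX = [[1, 2],
          [3, 4]]

# Pad a matrix by `pad` cells on every side, filling each new cell with
# the value of the nearest cell of the original matrix.
def pad_replicate(matrix, pad)
  rows = matrix.length
  cols = matrix.first.length
  Array.new(rows + 2 * pad) do |r|
    # Clamp the row index into the original matrix.
    src_r = (r - pad).clamp(0, rows - 1)
    Array.new(cols + 2 * pad) do |c|
      # Clamp the column index into the original matrix.
      src_c = (c - pad).clamp(0, cols - 1)
      matrix[src_r][src_c]
    end
  end
end

pad_replicate(MATRIX, 2).each { |row| puts row.join(" ") }
```

Under that assumption the top-left 3x3 block is all 1, the top-right all 2, the bottom-left all 3 and the bottom-right all 4, which gives the corner values listed above.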