How is a numpy array organized in memory?

I am trying to get some data from a module, which is a shared object wrapped with ctypes.
The data is a numeric array, so I used a numpy array to store it. But it turns out that I
don't understand how numpy organizes arrays in memory.
If I had a C function that filled an array like the one below:
int filler(int* a, int length) {
    int i = 0;
    for (i = 0; i < length; i++) {
        a[i] = i;
    }
    return 0;
}
Then I would call this function in Python using ctypes:
import ctypes
import numpy
lib = ctypes.cdll.LoadLibrary("libname")
data = numpy.zeros((1,10),dtype=numpy.int16)
lib.filler(data.ctypes.data, ctypes.c_int(10))
print data
But my output comes out this way:
dtype=numpy.int16
[[0 0 1 0 2 0 3 0 4 0]]
This would make sense if int were 32-bit, but I suppose a C int is 16 bits (GCC on openSUSE, on an x86 Intel machine).
I tried running with a 32-bit dtype and, strangely, I get the result I want:
dtype=numpy.int32
[[0 1 2 3 4 5 6 7 8 9]]
Trying to make sense of what is happening, I ran with int8 and got the following:
dtype=numpy.int8
[[0 0 0 0 1 0 0 0 2 0]]
I did take a look at the numpy docs, but so far I have not found the answer.

This would make sense if int were 32-bit, but I suppose a C int is 16 bits (GCC on openSUSE, on an x86 Intel machine). I tried running with a 32-bit dtype and, strangely, I get the result I want:
Not strange at all: your supposition is wrong. Your machine is 32-bit, with a 32-bit int and a 16-bit short int... unless you're doing some (rather admirable) retrocomputing!
Check sizeof(int) and multiply by 8, or simply store numbers in an int and print them out, to convince yourself.
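For example, a quick check from Python (a minimal sketch; ctypes.sizeof and numpy.intc are standard, nothing here is guessed):
import ctypes
import numpy
# A C int is 32 bits on typical x86 / x86-64 platforms.
print(ctypes.sizeof(ctypes.c_int) * 8)   # -> 32
# numpy.intc always matches the platform's C int, so a buffer with this
# dtype has exactly the element size that filler() writes.
data = numpy.zeros((1, 10), dtype=numpy.intc)
# The int16 output above is the same byte-layout effect in reverse: viewing
# little-endian 32-bit ints as 16-bit ints interleaves the zero halves.
a = numpy.arange(5, dtype=numpy.int32)
print(a.view(numpy.int16))               # -> [0 0 1 0 2 0 3 0 4 0]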

Related

Julia - Suppressing output for Clp optimizer when solving minimization

I'm developing a package in Julia that uses Clp together with JuMP to solve a simplex problem. Here is a sample of the code:
model = JuMP.Model(Clp.Optimizer)
@variable(model, x[1:size(c)[1]])
@constraint(model, A*x .== b)
@constraint(model, x .>= 0)
@objective(model, Min, c'*x)
optimize!(model)
The problem is, when using Clp, the code prints the iteration steps. Here is an example:
Coin0506I Presolve 500 (-62500) rows, 62500 (0) columns and 125000 (-62500) elements
Clp0006I 0 Obj 0 Primal inf 1.9995 (500)
Clp0006I 85 Obj 5.249611e-08 Primal inf 1.9070741 (461)
Clp0006I 170 Obj 1.3219003e-06 Primal inf 1.7932731 (424)
Clp0006I 255 Obj 2.1956446e-06 Primal inf 1.6079534 (387)
Clp0006I 338 Obj 4.6964461e-06 Primal inf 1.3793942 (354)
Clp0006I 423 Obj 5.8976838e-06 Primal inf 1.4504309 (331)
...
My question is: how can I suppress this without resorting to another package such as Suppressor.jl?
Just set the LogLevel:
set_optimizer_attribute(model, "LogLevel", 0)
This will stop the logs from appearing.
There was a bug in Clp.jl v0.8.2 (issue #1883) which has since been fixed;
you just have to update to Clp.jl v0.8.3.
-- Maurice

How to calculate power index in CMake file

I am working on CMake unit test cases that use CTest.
I have one question here.
Some part of my CMake is as below:
set(size_w 32)
set(powerof2_w 5)
foreach(size ${size_w})
    foreach(pwr_of_2 ${powerof2_w})
        ...
        FUNCTION_EXE(${size} ${pwr_of_2})
    endforeach(pwr_of_2)
endforeach(size)

set(size_w 64)
set(powerof2_w 6)
foreach(size ${size_w})
    foreach(pwr_of_2 ${powerof2_w})
        ...
        FUNCTION_EXE(${size} ${pwr_of_2})
    endforeach(pwr_of_2)
endforeach(size)

set(size_w 128)
set(powerof2_w 7)
foreach(size ${size_w})
    foreach(pwr_of_2 ${powerof2_w})
        ...
        FUNCTION_EXE(${size} ${pwr_of_2})
    endforeach(pwr_of_2)
endforeach(size)

set(size_w 256)
set(powerof2_w 8)
foreach(size ${size_w})
    foreach(pwr_of_2 ${powerof2_w})
        ...
        FUNCTION_EXE(${size} ${pwr_of_2})
    endforeach(pwr_of_2)
endforeach(size)
Expectation:
I want to eliminate the inner loop over the powerof2_w parameter:
foreach(pwr_of_2 ${powerof2_w})
Is it possible to calculate the pwr_of_2 parameter from the size_w parameter inside the foreach(size ${size_w}) for-loop itself?
Note: Also, I want to combine all four of these for-loops into one for-loop using an array index.
Is this possible in CMake?
If I understand correctly, you are looking to calculate the exponential component for the powers of two for the given sizes:
32, 64, 128, 256, 512
These are powers of two with corresponding exponents of:
5, 6, 7, 8, 9
which we can calculate.
Unfortunately, CMake's math() function does not support exponential arithmetic. But luckily, powers of two are easy to manipulate using bit-shifting, which is supported in CMake. We can create a simple CMake function to calculate the (power of 2) exponents used to derive the sizes 32, 64, 128, etc.
function(calc_power_of_two_exponent num exponent)
    set(counter 0)
    # Shift right until the number reaches 1.
    while(num GREATER 1)
        # Right shift by 1.
        math(EXPR num "${num} >> 1")
        # Count the number of shifts performed.
        math(EXPR counter "${counter} + 1")
    endwhile()
    # Return the shift count, which is the exponent, to the caller's variable.
    set(${exponent} ${counter} PARENT_SCOPE)
endfunction()
It looks like you want to iterate through these size and exponent values in pairs. We can set a list of sizes to iterate over, and calculate the corresponding exponent as we go.
set(sizes 32 64 128 256 512)
# Iterate through each size.
foreach(size ${sizes})
    # Calculate the corresponding base-2 exponent.
    calc_power_of_two_exponent(${size} exponent)
    message(STATUS "${size} ${exponent}")
    FUNCTION_EXE(${size} ${exponent})
endforeach(size)
The status message can be used to confirm we pass the correct values to the FUNCTION_EXE function. This code prints:
32 5
64 6
128 7
256 8
512 9

Encode a categorical feature with multiple categories per example

I am working on a dataset which has a feature that has multiple categories for a single example.
The feature looks like this:-
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to this question: Encode categorical features with multiple categories per example - sklearn
Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to the similar question above. But there are around 2000 categories, which results in sparse and very high-dimensional encoded data.
Is there any other encoding that can be used, or any other possible solution to this problem? Thanks.
Given an incredibly sparse array, one could use a dimensionality reduction technique such as PCA (principal component analysis) to reduce the feature space to the top k components that best describe the variance.
Assuming the MultiLabelBinarized 2000 features = X:
from sklearn.decomposition import PCA

k = 5
model = PCA(n_components=k, random_state=666)
model.fit(X)
components = model.transform(X)
You can then use the top k components as a smaller-dimensional feature space that explains a large portion of the variance of the original feature space.
If you want to understand how well the new, smaller feature space describes the variance, you can use the following command:
model.explained_variance_
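If you prefer proportions, the companion attribute explained_variance_ratio_ can be cumulated to help choose k (a small sketch, assuming the fitted model from above):
import numpy as np
# Fraction of total variance captured by the first 1..k components.
print(np.cumsum(model.explained_variance_ratio_))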
In many cases where I encountered the problem of too many features being generated from a column with many categories, I opted for binary encoding, and it worked out fine most of the time. It is hence perhaps worth a shot for you.
Imagine you have 9 categories, numbered 1 to 9. Binary encoding them gives:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Given that 2 to the power of 11 is 2048 and you have around 2000 categories, you can reduce your categories to just 11 feature columns instead of many more (for example, 1999 in the case of one-hot encoding)!
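A minimal numpy sketch of this intuition (assuming the categories have already been mapped to integer codes, as in the table above; the 4-bit width is just for this toy example):
import numpy as np
codes = np.array([1, 2, 9])   # integer codes for three example categories
width = 4                     # 4 bits cover codes 1..9; 11 bits cover ~2000
# Extract each bit, most significant first: one output column per bit.
bits = (codes[:, None] >> np.arange(width - 1, -1, -1)) & 1
print(bits)                   # rows match the cat 1, cat 2 and cat 9 rows above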
I also encountered the same problem, but I solved it using CountVectorizer from sklearn.feature_extraction.text, just by setting binary=True, i.e. CountVectorizer(binary=True).
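A sketch of that approach (assuming each example's category list can be joined into a single space-separated string first; lowercase=False just preserves the category names):
from sklearn.feature_extraction.text import CountVectorizer
rows = [["Category1", "Category2"], ["Category2", "Category9"]]
docs = [" ".join(r) for r in rows]             # one "document" per example
vec = CountVectorizer(binary=True, lowercase=False)
X = vec.fit_transform(docs)                    # sparse 0/1 indicator matrix
print(sorted(vec.vocabulary_))                 # the learned category columns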

Is there an SIMD instruction to achieve batch array memory index mapping?

In my RGB to grey case:
Y = (77*R + 150*G + 29*B) >> 8;
I know SIMD (NEON, SSE2) can do like:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = 77*{R0,R1,R2,R3,R4,R5,R6,R7}
{B0,B1,B2,B3,B4,B5,B6,B7} = 150*{G0,G1,G2,G3,G4,G5,G6,G7}
{C0,C1,C2,C3,C4,C5,C6,C7} = 29*{B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
However, the multiply instruction takes at least 2 clock cycles, and since R, G and B are in [0, 255],
we can use three lookup tables (arrays of length 256) to store the partial products:
77*R (call it X), 150*G (call it Y), and 29*B (call it Z).
So I'm looking for instructions that can do the following:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = {X[R0],X[R1],X[R2],X[R3],X[R4],X[R5],X[R6],X[R7]}
{B0,B1,B2,B3,B4,B5,B6,B7} = {Y[G0],Y[G1],Y[G2],Y[G3],Y[G4],Y[G5],Y[G6],Y[G7]}
{C0,C1,C2,C3,C4,C5,C6,C7} = {Z[B0],Z[B1],Z[B2],Z[B3],Z[B4],Z[B5],Z[B6],Z[B7]}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
Any good suggestions?
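(For clarity, the lookup-table formulation is easy to prototype in numpy to confirm it matches the multiply version; this is only a sketch of the intent, not SIMD code:)
import numpy as np
R, G, B = (np.random.randint(0, 256, 8) for _ in range(3))
X = 77 * np.arange(256)    # partial products for every possible R value
Yt = 150 * np.arange(256)  # ... for every possible G value
Z = 29 * np.arange(256)    # ... for every possible B value
grey = (X[R] + Yt[G] + Z[B]) >> 8
assert np.array_equal(grey, (77*R + 150*G + 29*B) >> 8)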
There are no byte or word gather instructions in AVX2 / AVX512, and no gathers at all in NEON. The DWORD gathers that do exist are much slower than a multiply! e.g. one per 5 cycle throughput for vpgatherdd ymm,[reg + scale*ymm], ymm, according to Agner Fog's instruction table for Skylake.
You can use shuffles as a parallel table-lookup. But your table for each lookup is 256 16-bit words. That's 512 bytes. AVX512 has some shuffles that select from the concatenation of 2 registers, but that's "only" 2x 64 bytes, and the byte or word element-size versions of those are multiple uops on current CPUs. (e.g. AVX512BW vpermi2w). They are still fantastically powerful compared to vpshufb, though.
So using a shuffle as a LUT won't work in your case, but it does work very well for some cases, e.g. for popcount you can split bytes into 4-bit nibbles and use vpshufb to do 32 lookups in parallel from a 16-element table of bytes.
Normally for SIMD you want to replace table lookups with computation, because computation is much more SIMD friendly.
Suck it up and use pmullw / _mm_mullo_epi16. You have instruction-level parallelism, and Skylake has 2 per clock throughput for 16-bit SIMD multiply (but 5 cycle latency). For image processing, normally throughput matters more than latency, as long as you keep the latency within reason so out-of-order execution can hide it.
If your multipliers ever have few enough 1 bits in their binary representation, you could consider using shift/add instead of an actual multiply, e.g. B * 29 = B * 32 - B - B * 2, or B<<5 - B<<1 - B. That many instructions probably have a higher throughput cost than a single multiply, though. If you could do it with just 2 terms, it might be worth it. (But then again, still maybe not, depending on the CPU. Total instruction throughput and vector ALU bottlenecks are a big deal.)
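(A quick numpy check of that shift/add decomposition, just to confirm the arithmetic:)
import numpy as np
B = np.arange(256)
assert np.array_equal(29 * B, (B << 5) - (B << 1) - B)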

Out of memory exception for a matrix

I have the "'System.OutOfMemoryException" exception for this simple code (a 10 000 * 10 000 matrix) multiplied by itself:
#time
#r "Microsoft.Office.Interop.Excel"
#r "FSharp.PowerPack.dll"
open System
open System.IO
open Microsoft.FSharp.Math
open System.Collections.Generic
let mutable Matrix1 = Matrix.create 10000 10000 0.
let matrix4 = Matrix1 * Matrix1
I have the following error:
System.OutOfMemoryException: An exception 'System.OutOfMemoryException' has been raised
Microsoft.FSharp.Collections.Array2DModule.ZeroCreate[T](Int32 length1, Int32 length2)
Microsoft.FSharp.Math.DoubleImpl.mulDenseMatrixDS(DenseMatrix`1 a, DenseMatrix`1 b)
Microsoft.FSharp.Math.SpecializedGenericImpl.mulM[a](Matrix`1 a, Matrix`1 b)
<StartupCode$FSI_0004>.$FSI_0004.main#() in C:\Users\XXXXXXX\documents\visual studio 2010\Projects\Library1\Library1\Module1.fs:line 92
Stopped due to error
I have therefore 2 questions:
I have 8 GB of memory on my computer, and according to my calculation a 10 000 * 10 000 matrix should take 381 MB. I computed it this way: 10 000 * 10 000 = 100 000 000 integers in the matrix; 100 000 000 * 4 bytes (32-bit integers) = 400 000 000 bytes; 400 000 000 / (1024*1024) = 381 MB. So I cannot understand why there is an OutOfMemoryException.
More generally (it's not the case here, I think), I have the impression that F# Interactive keeps all the data around and therefore overloads the memory. Do you know of a way to free all the data held by F# Interactive without exiting it?
In summary: fsi is a 32-bit process; at most it can hold 2 GB of data. Run your test as a 64-bit Windows application; you can then increase the size of the matrix, but any single .NET object is still limited to 2 GB.
Let me correct your calculation a little. Matrix1 is a float matrix, so each element occupies 8 bytes in memory. The total size of Matrix1 and matrix4 in memory is at least:
2 * 10000 * 10000 * 8 = 1 600 000 000 bytes ~ 1.6 GB
(ignoring some bookkeeping parts of matrix)
So it's no surprise that fsi*32 runs out of memory in this case.
Executing the test as a 64-bit Windows process, you can create float matrices of size around 15 000 * 15 000, but not much more than that (15 000 * 15 000 * 8 bytes ≈ 1.8 GB, just under the 2 GB single-object cap). Check out this informative article for concrete numbers with different types of matrix elements.
The amount of physical memory on your computer is not the relevant bottleneck - see Eric Lippert's great blog post for more information.
