How to optimize my MIP model to find the best combination of columns and rows with the minimum of NA values? - ompr

I have a dataframe with lots of NA values. My objective is to find the best combination of columns and rows that keeps as much data as possible while minimizing the number of NA values. One solution I found is to use the ompr package and create a MIP model to solve the problem.
Here's the model :
m <- +!is.na(M) # gets logical matrix; 0 if NA else 1
nr <- nrow(m)
nc <- ncol(m)
n_years <- 17
model <- MIPModel() %>%
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i, j], i = 1:nr, j = 1:nc, type = "binary") %>%
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
  # maximize good cells kept
  set_objective(sum_expr(keep[i, j], i = 1:nr, j = 1:nc), "max") %>%
  # a cell can be kept only when its row is not selected for removal
  add_constraint(keep[i, j] <= 1 - rm_row[i], i = 1:nr, j = 1:nc) %>%
  # a cell can be kept only when its column is not selected for removal
  add_constraint(keep[i, j] <= 1 - rm_col[j], i = 1:nr, j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i, j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
  # keep at most n_years columns, i.e. remove at least (nc - n_years) columns
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years)
model
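For completeness, solving the model and extracting the kept submatrix might look like this (a sketch, assuming the ROI GLPK plugin; any ompr-compatible solver should work):

library(ompr.roi)
library(ROI.plugin.glpk)

result <- solve_model(model, with_ROI(solver = "glpk"))
# rm_row / rm_col are 1 for the rows/columns chosen for removal
sol_row <- get_solution(result, rm_row[i])
sol_col <- get_solution(result, rm_col[j])
keep_rows <- sol_row$value[order(sol_row$i)] < 0.5
keep_cols <- sol_col$value[order(sol_col$j)] < 0.5
M_clean <- M[keep_rows, keep_cols]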
This model works just fine as long as the data is not huge, which is unfortunately not the case for mine (3500 rows x 180 columns). The model has a lot of constraints, which is why it takes too long to solve. Is there another way to formulate this model so that it has fewer constraints and solves faster?
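One direction that might help (a sketch, not benchmarked): since keep[i,j] must be 0 for every NA cell anyway, the keep variables and their linking constraints can be created only for non-NA cells using ompr's filter expressions; the m[i,j] + rm_row[i] + rm_col[j] >= 1 covering constraints then disappear entirely.

# Sketch: create keep[i,j] only where m[i,j] == 1 (non-NA). NA cells then
# need no variables and no covering constraints, shrinking the model.
model2 <- MIPModel() %>%
  add_variable(keep[i, j], i = 1:nr, j = 1:nc, m[i, j] == 1, type = "binary") %>%
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
  set_objective(sum_expr(keep[i, j], i = 1:nr, j = 1:nc, m[i, j] == 1), "max") %>%
  add_constraint(keep[i, j] <= 1 - rm_row[i], i = 1:nr, j = 1:nc, m[i, j] == 1) %>%
  add_constraint(keep[i, j] <= 1 - rm_col[j], i = 1:nr, j = 1:nc, m[i, j] == 1) %>%
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years)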

Related

Constrain axis limits in chordDiagram (circlize) when making gifs

I hope somebody will be able to help me with this chordDiagram visualisation I am trying to create. I am well aware that this visualization type may not be suitable for this particular data, but it is how I wanted to visualize the data, and by now I am too curious about how one can fix it to give it up. It is my first real post here, though I am an active user of Stack Overflow and I genuinely admire the audience here.
So I have this data on the change in the size of area in km2 over time (d0) and I am trying to create a GIF out of it using example here: https://guyabel.com/post/animated-directional-chord-diagrams/
The data "d0":
Time <- as.numeric(c(10,10,10,100,100,100,200,200,200,5,5,5,50,50,50,0,0,0))
Year <- as.character(c(2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200))
Area_km2 <- as.numeric(c(4.3075211,7.1672926,17.2780622,5.9099250,8.2909189,16.9748961,6.5400554,8.9036313,16.5627228,3.0765610,6.3929883,18.0708108,5.3520782,8.4503856,16.7938196,0.5565978,1.8415855,12.5089476))
(d0 <- as.data.frame(cbind(Time,Year,Area_km2)))
I also have the color codes stored in a separate dataframe (d1) following the above mentioned example.
The data "d1":
year <- as.numeric(c(2050,2100,2200))
order1 <- as.character(c(1,2,3))
col1 <- c("#40A4D8","#33BEB7","#0C5BCE")
(d1 <- as.data.frame(cbind(year,order1,col1)))
So the idea was to have self-linking flows within each sector increasing in size over time, which would look like growing segments (or growing pie slices) in the final animated GIF. But no matter how hard I try, I can't seem to constrain the axis of each segment to the limits of that particular year in each frame. The axis keeps accumulating over time, which is not what I want.
For example, in the first figure (figure0), the "starting frame", the size of the links matches the data frame well:
figure0

orig_year  Area_km2  .frame
2050       0.557     0
2100       1.84      0
2200       12.5      0
But when one plots the next figure (figure1), the axis seems to have taken the values from the starting frame and added the current values on top (4, 7.4 and 19 respectively), instead of showing the values from the data frame (3.08, 6.39 and 18.1):
figure1

orig_year  Area_km2  .frame
2050       3.08      1
2100       6.39      1
2200       18.1      1
And it keeps on doing so as one loops through the data and creates new plots for the next frames. I wonder whether it is possible to constrain the axis so that the links just gradually increase over time and the axis follows that increase, growing gradually with the data?
Any help is highly appreciated!
Thanks.
My code:
# Sort by Time
library(dplyr)  # for arrange(), mutate(), filter()
library(tidyr)  # for separate()
(d0 <- arrange(d0, Time))
# Copy columns
(d0$Dest_year <- d0$Year)
# Re-arrange data
library(tweenr)
(d2 <- d0 %>%
   mutate(corridor = paste(Year, Dest_year, sep = "->")) %>%
   dplyr::select(Time, corridor, Area_km2) %>%
   mutate(ease = "linear") %>%
   tweenr::tween_elements('Time', 'corridor', 'ease', nframes = 30) %>%
   tibble::as_tibble())
(d2 <- d2 %>%
   separate(col = .group, into = c("orig_year", "dest_year"), sep = "->") %>%
   dplyr::select(orig_year, dest_year, Area_km2, everything()))
d2$Time <- NULL
# Create a directory to store the individual plots
dir.create("./plot-gif/")
# Fixing scales
scale_gap <- function(Area_km2_m, Area_km2_max, gap_at_max = 1, gaps = NULL) {
  p <- Area_km2_m / Area_km2_max
  if (length(gap_at_max) == 1 & !is.null(gaps)) {
    gap_at_max <- rep(gap_at_max, gaps)
  }
  gap_degree <- (360 - sum(gap_at_max)) * (1 - p)
  gap_m <- (gap_degree + sum(gap_at_max)) / gaps
  return(gap_m)
}
# Derive the size of the gaps in each frame for the animated GIF
(d3 <- d2 %>%
   group_by(orig_year) %>%
   mutate(gaps = scale_gap(Area_km2_m = Area_km2, Area_km2_max = max(.$Area_km2),
                           gap_at_max = 4, gaps = 9)))
library(magrittr)
# Get the values for axis limits
(axmax <- d2 %>% group_by(orig_year,.frame) %>% mutate(max=mean(Area_km2)))
# Create a unique chord diagram for each frame
library(circlize)
for (f in unique(d2$.frame)) {
  png(file = paste0("./plot-gif/figure", f, ".png"),
      height = 7, width = 7, units = "in", res = 500)
  circos.clear()
  par(mar = rep(0, 4), cex = 1)
  circos.par(start.degree = 90, track.margin = c(-0.1, 0.1),
             gap.degree = filter(d3, .frame == f)$gaps,
             points.overflow.warning = FALSE)
  chordDiagram(x = filter(d2, .frame == f), directional = 2, order = d1$year,
               grid.col = d1$col1, annotationTrack = c("grid", "name", "axis"),
               transparency = 0.25, annotationTrackHeight = c(0.05, 0.1),
               direction.type = c("diffHeight"),
               diffHeight = -0.04, link.sort = TRUE,
               xmax = axmax$max)
  dev.off()
}
# Now make a GIF
library(magick)
img <- image_read(path="./plot-gif/figure0.png")
for (f in unique(d2$.frame)[-1]) {
  img0 <- image_read(path = paste0("./plot-gif/figure", f, ".png"))
  img <- c(img, img0)
  message(f)
}
img1 <- image_scale(image=img,geometry="720x720")
ani0 <- image_animate(image=img1,fps=10)
image_write(image=ani0,path="./plot-gif/figure.gif")
I will start with your d0 object. I construct d0 without converting everything to character, keeping the columns in their original numeric format, and I reorder it by Time and Year:
Time = c(10,10,10,100,100,100,200,200,200,5,5,5,50,50,50,0,0,0)
Year = c(2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200)
Area_km2 = c(4.3075211,7.1672926,17.2780622,5.9099250,8.2909189,16.9748961,6.5400554,8.9036313,16.5627228,3.0765610,6.3929883,18.0708108,5.3520782,8.4503856,16.7938196,0.5565978,1.8415855,12.5089476)
d0 = data.frame(Time = Time,
                Year = Year,
                Area_km2 = Area_km2,
                Dest_year = Year)
d0 = d0[order(d0$Time, d0$Year), ]
The key thing is to calculate proper values for "gaps" between sectors so that the same unit from data corresponds to the same degree in different plots.
We first calculate the maximal total width of the circular plot:
width = tapply(d0$Area_km2, d0$Time, sum)
max_width = max(width)
We assume there are n sectors (n = 3 in d0). We let the first n-1 gaps be 2 degrees and dynamically adjust the last gap according to the total value in each plot. For the plot with the largest total value, the last gap is also set to 2 degrees.
n = 3
degree_per_unit = (360 - n*2)/max_width
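For the example data, the largest frame is Time = 200 (6.54 + 8.90 + 16.56 ≈ 32.01 km²), so degree_per_unit ≈ (360 - 6)/32.01 ≈ 11.06 degrees per km²; a sector of, say, 5 km² then spans the same ≈ 55 degrees in every frame in which it appears.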
Now degree_per_unit can be shared between the plots. For each frame, we calculate the value of last_gap:
for (t in sort(unique(Time))) {
  l = d0$Time == t
  d0_current = d0[l, c("Year", "Dest_year", "Area_km2")]
  last_gap = 360 - (n - 1)*2 - sum(d0_current$Area_km2)*degree_per_unit
  circos.par(gap.after = c(rep(2, n - 1), last_gap))
  chordDiagram(d0_current, grid.col = c("2050" = "red", "2100" = "blue", "2200" = "green"))
  circos.clear()
  title(paste0("Time = ", t, ", Sum = ", sum(d0_current$Area_km2)))
  Sys.sleep(1)
}
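To turn these frames into the GIF from the question, the same loop can write PNGs instead of drawing to the screen (a sketch reusing the question's ./plot-gif/ directory and PNG settings; the magick assembly step stays exactly as in the question):

dir.create("./plot-gif/", showWarnings = FALSE)
for (t in sort(unique(Time))) {
  png(file = paste0("./plot-gif/figure", t, ".png"),
      height = 7, width = 7, units = "in", res = 500)
  l = d0$Time == t
  d0_current = d0[l, c("Year", "Dest_year", "Area_km2")]
  last_gap = 360 - (n - 1)*2 - sum(d0_current$Area_km2)*degree_per_unit
  circos.par(gap.after = c(rep(2, n - 1), last_gap))
  chordDiagram(d0_current, grid.col = c("2050" = "red", "2100" = "blue", "2200" = "green"))
  circos.clear()
  dev.off()
}
# then assemble with magick::image_animate() as in the question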

Create dummy variable for all neighborhood region depending on another dummy variable (sf object)

I have a spatial dataframe (sf) of all European NUTS2 regions. Within this sf object certain regions have a dummy = 1, and others a dummy = 0.
How can I create a new dummy (let's say "dummy_neighbor") where all the regions neighboring (st_touches?) a region with dummy == 1 get dummy_neighbor == 1, and all those not touching a dummy == 1 region get dummy_neighbor == 0?
For the time being I have this workaround, but I guess there must be an easier solution?
# load packages
library(sf)
library(here)
library(tidyverse)
library(spdep)
library(expp)
# import nuts 2 sf object
nuts2_sf <- st_read(here("Data", "nuts2_data", "final_nuts.shp"))
# take row numbers as extra column for later
nuts2_sf$rownumber = 1:nrow(nuts2_sf)
#neighbouring list
neighbour <- poly2nb(nuts2_sf, row.names="NUTS_ID", queen=TRUE)
# transform nb into data frame
nb_df <- plyr::ldply(neighbour, rbind)
nb_df$rownumber = 1:nrow(nb_df) # get rownbumer as column
# merge neighbour-list-df with sf by rownumber
df <- merge(nuts2_sf, nb_df, by = "rownumber")
# extract all neighbours of tp100_d=1
# (tp100_d is the name of the original dummy variable)
df_dummy <- df %>% filter(tp100_d == 1)
df_dummy$geometry <- NULL
all_neighbours <- as.vector(as.matrix(df_dummy[,c(66:76)])) %>% unique(.)
# create new neighbourhood-dummy for all neighbours (but not if its a neighbour that has a 1 in its original dummy)
nuts2_sf <- nuts2_sf %>% mutate(nb_dummy = ifelse(rownumber %in% all_neighbours & tp100_d == 0, 1, 0))
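One possible shortcut, using st_touches directly as the question hints (a sketch, untested against the actual NUTS2 shapefile; assumes tp100_d is the original dummy as above):

# st_touches() returns, for each region, the row indices of touching regions
touching <- st_touches(nuts2_sf)
nuts2_sf$dummy_neighbor <- as.integer(
  sapply(touching, function(idx) any(nuts2_sf$tp100_d[idx] == 1)) &
    nuts2_sf$tp100_d == 0
)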

Best way to parallelize computation over dask blocks that do not return np arrays?

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found that a combination of da.overlap.overlap and to_delayed().ravel() is able to get the job done, if I pass in the relevant block key and chunk information.
Edit:
Thanks to @AnnaM, who caught bugs in the original post and then generalized the code! Building on her comments, I'm including an updated version of the code. Also, in response to Anna's question about memory usage, I verified that this does not seem to take up more memory than naively expected.
import numpy as np
import pandas as pd
import dask
import dask.array as da
import dask.dataframe as dd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    coordinates = np.stack(np.nonzero(chunk)).T
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]

def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)

data = np.zeros((2, 4, 8, 16, 16))
data[0, 0, 4, 2, 2] = 1
data[0, 1, 4, 6, 2] = 1
data[1, 2, 4, 8, 2] = 1
data[0, 3, 4, 2, 2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr,
                            chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2),
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect'] * 5))
df.compute().reset_index()
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it's relying on low-level information that could change (e.g. block key)?
def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1, -1, -1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data,
                                                 depth=(0, 0, 0, depth_xy, depth_xy),
                                                 boundary={3: 'reflect', 4: 'reflect'})
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)

    # All computation is delayed, so downstream computations need to know the format of
    # the data. If the meta information is not specified, a single computation will be
    # done (which could be expensive) at this point to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect
    # in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']
    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'
    return dd.from_delayed(dfs, meta=df_meta)

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y + offsets[3] - depth_xy,
                       'x': x + offsets[4] - depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df

data = np.zeros((2, 4, 8, 16, 16))  # round, channel, z, y, x
data[0, 0, 4, 2, 2] = 1
data[0, 1, 4, 6, 2] = 1
data[1, 2, 4, 8, 2] = 1
data[0, 3, 4, 2, 2] = 1

arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df
The updated code now returns the indices that were set to 1:
   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2
For comparison, this is the output of the original version:
   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2
It returns rows 2 and 4 of the expected output, twice each.
The reason why this happens is three mistakes in the extract_features function:
You first add the offset and subtract the depth and then filter out the overlapping parts: the order needs to be swapped
df.y > depth_xy should be replaced with df.y >= depth_xy
df.z should be replaced with df.x, since it is the x dimension that has an overlap
To optimize this even further, here is a generalized version of the code that works for an arbitrary number of dimensions:
def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = (rows_to_keep
                        * np.array(df[columns[i]] >= depth[i])
                        * np.array(df[columns[i]] < (chunk.shape[i] - depth[i])))
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]

def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
                                                 boundary=tuple(['reflect'] * len(columns)))
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)

data = np.zeros((2, 4, 8, 16, 16))
data[0, 0, 4, 2, 2] = 1
data[0, 1, 4, 6, 2] = 1
data[1, 2, 4, 8, 2] = 1
data[0, 3, 4, 2, 2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()
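As a side note, the meta trick from the original post can be combined with this generalized version so that dd.from_delayed does not trigger an inference computation (a sketch; it assumes all coordinate columns come out of np.nonzero plus integer offsets, hence int64):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Sketch: build an empty frame with the expected dtypes and pass it as meta.
def make_meta(columns):
    return pd.DataFrame({c: pd.Series(dtype=np.int64) for c in columns})

# ...and in my_overlap_generalized, replace the last line with:
#     return dd.from_delayed(dfs, meta=make_meta(columns))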

VBA SUMIF on Array

OK, so I want to do a SUMIF on a column of an array because I don't want to print to the worksheet just to obtain a Range for WorksheetFunction.SumIf; the idea is to stay entirely in VBA code and write to the worksheet as little as possible. I am trying to optimize for speed: 4814 rows by 40 columns, times 60.
The first column total of 197,321,164 is correct, but the next columns are too low: instead of going out to quarter 40, the Else kicks in and everything after quarter 8 is 0. The first "To_Quarter" value in the array is 9, so with >= I would expect it to reach 9. I tried putting my Next I before the End If, but then it just asks for the For.
image of locals box: https://ibb.co/cgFQhxY
Any help would be much appreciated.
Sub SumifONarray()
    Dim arrQuarters, arrNumber_of_Assets As Variant
    Dim I As Long, J As Long

    arrNumber_of_Assets = Range("Costs_Number_of_Assets")
    arrQuarters = Range("Quarters_1to40")

    Dim MaxRecov_If_result, arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter, arr_Row10_Resolution__Max_Recovery_Grid As Variant
    arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter = Range("_Row10_Resolution_Physical_Possession_Expenses_To_Quarter")
    arr_Row10_Resolution__Max_Recovery_Grid = Range("_Row10_Resolution__Max_Recovery_Grid")

    ReDim arrIf_Max_Recovery_Amount_Sum(1 To 1, 1 To UBound(arrQuarters, 2))

    For J = LBound(arrQuarters, 2) To UBound(arrQuarters, 2)
        For I = LBound(arrNumber_of_Assets, 1) To UBound(arrNumber_of_Assets, 1)
            If arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter(I, 1) >= arrQuarters(1, J) Then
                MaxRecov_If_result = MaxRecov_If_result + arr_Row10_Resolution__Max_Recovery_Grid(I, J)
            Else: MaxRecov_If_result = 0
            End If
        Next I
        arrIf_Max_Recovery_Amount_Sum(1, J) = MaxRecov_If_result
        MaxRecov_If_result = 0
    Next J
End Sub
I've uploaded a sample below, with code and 10 rows of data.
https://easyupload.io/wfixds
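For what it's worth, the Else branch looks like the likely culprit: it resets the running total to zero every time a single row fails the test, so each column ends up holding only the sum of the rows that come after the last non-matching row (zero whenever the final row fails the test). A sketch of the loop without that reset (a hypothetical fix, not tested against the uploaded sample):

For J = LBound(arrQuarters, 2) To UBound(arrQuarters, 2)
    MaxRecov_If_result = 0   ' reset once per column, never inside the row loop
    For I = LBound(arrNumber_of_Assets, 1) To UBound(arrNumber_of_Assets, 1)
        ' Accumulate matching rows; simply skip non-matching rows
        If arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter(I, 1) >= arrQuarters(1, J) Then
            MaxRecov_If_result = MaxRecov_If_result + arr_Row10_Resolution__Max_Recovery_Grid(I, J)
        End If
    Next I
    arrIf_Max_Recovery_Amount_Sum(1, J) = MaxRecov_If_result
Next J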

Codility: Passing cars in Lua

I'm currently practicing programming problems and out of interest, I'm trying a few Codility exercises in Lua. I've been stuck on the Passing Cars problem for a while.
Problem:
A non-empty zero-indexed array A consisting of N integers is given. The consecutive elements of array A represent consecutive cars on a road.
Array A contains only 0s and/or 1s:
0 represents a car traveling east,
1 represents a car traveling west.
The goal is to count passing cars. We say that a pair of cars (P, Q), where 0 ≤ P < Q < N, is passing when P is traveling to the east and Q is traveling to the west.
For example, consider array A such that:
A[0] = 0
A[1] = 1
A[2] = 0
A[3] = 1
A[4] = 1
We have five pairs of passing cars: (0, 1), (0, 3), (0, 4), (2, 3), (2, 4).
Write a function:
function solution(A)
that, given a non-empty zero-indexed array A of N integers, returns the number of pairs of passing cars.
The function should return −1 if the number of pairs of passing cars exceeds 1,000,000,000.
For example, given:
A[0] = 0
A[1] = 1
A[2] = 0
A[3] = 1
A[4] = 1
the function should return 5, as explained above.
Assume that:
N is an integer within the range [1..100,000];
each element of array A is an integer that can have one of the following values: 0, 1.
Complexity:
expected worst-case time complexity is O(N);
expected worst-case space complexity is O(1), beyond input storage (not counting the storage required for input arguments).
Elements of input arrays can be modified.
My attempt in Lua keeps failing but I can't seem to find the issue.
local function solution(A)
  local zeroes = 0
  local pairs = 0
  for i = 1, #A do
    if A[i] == 0 then
      zeroes = zeroes + 1
    else
      pairs = pairs + zeroes
      if pairs > 1e9 then
        return -1
      end
    end
  end
  return pairs
end
In terms of the time-space complexity constraints, I think it should pass. What am I doing wrong? Any advice or tips to make my code more efficient would be appreciated.
FYI: I keep getting a result of 2 when the desired example result is 5.
The problem statement says A is zero-indexed, so if your table really has an element at index 0 and your loop starts at 1, that first element is ignored and the output is 2 instead of 5. Zero-based tables should be avoided in Lua: they go against convention and lead to a lot of off-by-one errors, because for i = 1, #A do will not do what you want.
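A quick way to see the off-by-one (assuming standard Lua semantics, where the length operator counts only the 1-based sequence part):

local A = {[0] = 0, [1] = 1, [2] = 0, [3] = 1, [4] = 1}
print(#A)  --> 4: the [0] entry is not part of the sequence
-- so "for i = 1, #A" visits A[1]..A[4] and skips the eastbound car at A[0]

Both a 1-based and a 0-based variant are shown below.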
function solution1based(A)
  local zeroes = 0
  local pairs = 0
  for i = 1, #A do
    if A[i] == 0 then
      zeroes = zeroes + 1
    else
      pairs = pairs + zeroes
      if pairs > 1e9 then
        return -1
      end
    end
  end
  return pairs
end
print(solution1based{0, 1, 0, 1, 1}) -- prints 5 as you wanted
function solution0based(A)
  local zeroes = 0
  local pairs = 0
  for i = 0, #A do
    if A[i] == 0 then
      zeroes = zeroes + 1
    else
      pairs = pairs + zeroes
      if pairs > 1e9 then
        return -1
      end
    end
  end
  return pairs
end
print(solution0based{[0]=0, [1]=1, [2]=0, [3]=1, [4]=1}) -- prints 5
