Save data in Julia - save

Is there a standard way to easily save any kind of data in Julia? This time I have a tuple with two arrays in it that I want to use again later. Something like this:
a = [1,2,34]
b = [1,2,3,4,5]
save((a,b), "ab")
a, b = load("ab")

You can use JLD2.jl to save and load arbitrary Julia objects. This is more likely to work for long term storage across different Julia versions than the Serialization standard library.
using JLD2
x = (1, "a")
save_object("mytuple.jld2", x)
julia> load_object("mytuple.jld2")
(1, "a")
For additional functionality, check out the docs.
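For the tuple of two arrays in the question, a minimal round trip with JLD2 (assuming the package is installed; the file name is arbitrary) could look like this:
using JLD2
a = [1, 2, 34]
b = [1, 2, 3, 4, 5]
save_object("ab.jld2", (a, b))   # write the tuple to disk
a, b = load_object("ab.jld2")    # read it back and destructure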

There are many options, depending on how long you want to store the data.
Serialization (this is perfect for short-term storage; for longer periods, where packages update their internal data structures, you might have problems reading the file)
julia> using Serialization
julia> serialize("file.dat",a)
24
julia> deserialize("file.dat")
3-element Vector{Int64}:
1
2
34
Delimited files (note that additional processing of output might be needed)
julia> using DelimitedFiles
julia> writedlm("file.csv", a)
julia> readdlm("file.csv", '\t', Int)
3×1 Matrix{Int64}:
1
2
34
JSON (great for long-term storage)
julia> using JSON3
julia> JSON3.write("file.json",a)
"file.json"
julia> open("file.json") do f; JSON3.read(f,Vector{Int}); end
3-element Vector{Int64}:
1
2
34
Other libraries worth mentioning (depending on the data format) include CSV.jl for saving data frames and BSON.jl for binary JSON files.
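As an illustration, a minimal BSON.jl round trip for the same two arrays (a sketch assuming BSON.jl is installed; the file name is arbitrary) might look like:
using BSON: @save, @load
a = [1, 2, 34]
b = [1, 2, 3, 4, 5]
@save "ab.bson" a b   # write both variables to ab.bson
@load "ab.bson" a b   # restore them into the current scope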

Related

Problem with function "from_file()" at z3py

Let's assume that we have the following files:
func.smt
(declare-datatypes (T) ((AVL leafA (nodeA (val T) (alt Int) (izq AVL) (der AVL)))))
espec.smt
(declare-const t (AVL Int))
And the following code:
from z3 import *
s = Solver()
s.from_file("func.smt")
s.from_file("espec.smt")
When the instruction "s.from_file("espec.smt")" is executed, z3 throws the following exception:
z3.z3types.Z3Exception: b'(error "line 1 column 18: unknown sort \'AVL\'")\n'
It seems that the Solver "s" doesn't keep the information about the datatypes (and functions either). That is why it throws the exception (I guess).
The following code is a way to solve it:
s = Solver()
file = open("func.smt", "r")
func = file.read()
file.close()
file = open("espec.smt", "r")
espec = file.read()
file.close()
s.from_string(func + espec)
But this way is not efficient.
Is there another more efficient way to solve this and make z3 save datatypes and functions for future asserts and declarations?
Edit:
In my real case, for example, the file "func.smt" has a total of 54 declarations between functions and datatypes (some quite complex). The "espec.smt" file has a few declarations and asserts. Finally, I have a file with many asserts, which I have to read little by little, generating a model as I go.
That is, imagine that I have 600 lines of asserts and I read them ten at a time, so every 10 lines I generate a model. If we assume that the asserts I have read so far are stored in a variable called "aux", then every time I want to get a new model I will have to do "s.from_string(func + espec + aux)", 60 times in total, and I don't know if that could be made more efficient.
Separately reading the files and loading to solver is your best bet, as you found out.
Unless efficiency is of paramount importance, I would not worry about trying to optimize this any further. For any reasonable problem, your solver time will dominate any time spent in loading the assertions from a few (or a few thousand!) files. Do you have a use case that really requires this part to be significantly faster?
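For what it's worth, here is a minimal sketch of that approach: read the declaration files once and re-feed the solver for each batch. The file name "asserts.smt" and the batch size of 10 are assumptions taken from the question, not part of the original answer.
from z3 import Solver, sat

# read the declaration files once
with open("func.smt") as f:
    func = f.read()
with open("espec.smt") as f:
    espec = f.read()
with open("asserts.smt") as f:          # hypothetical file holding the assert lines
    assert_lines = f.readlines()

aux = ""
for start in range(0, len(assert_lines), 10):
    aux += "".join(assert_lines[start:start + 10])
    s = Solver()
    s.from_string(func + espec + aux)   # reload everything; solving dominates the runtime
    if s.check() == sat:
        print(s.model())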

How to filter list 1 without elements of list 2

What is the best way to create a new list based on List1 without elements of List2?
List1 = ["Candy", "Brandy", "Sandy", "Lady", "Baby", "Shady"].
List2 = ["Sandy", "Shady", "Candy", "Sandy"].
The contents of the new list should be:
List3 = ["Brandy", "Lady", "Baby"].
On Erlang/OTP releases before 22, the best way to do this is to use a module that handles sets, such as ordsets:
> ordsets:subtract(ordsets:from_list(List1), ordsets:from_list(List2)).
["Baby","Brandy","Lady"]
If you're using Erlang/OTP 22 or later (due to be released in June 2019), the best way is to use the -- operator:
> List3 = List1 -- List2.
["Brandy","Lady","Baby"]
The runtime complexity of this operation is O(n log n) starting in Erlang/OTP 22, but in earlier Erlang versions it is O(n*m), so it performs very badly when both lists are long.
See the Retired Myths chapter in the Erlang Efficiency Guide:
12.3 Myth: List subtraction ("--" operator) is slow
List subtraction used to have a run-time complexity proportional to the product of the length of its operands, so it was extremely slow when both lists were long.
As of OTP 22 the run-time complexity is "n log n" and the operation will complete quickly even when both lists are very long. In fact, it is faster and uses less memory than the commonly used workaround to convert both lists to ordered sets before subtracting them with ordsets:subtract/2.
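On releases older than OTP 22, a list comprehension is another common workaround that preserves the order of List1 (it is also O(n*m), so only suitable for short lists):
> List3 = [X || X <- List1, not lists:member(X, List2)].
["Brandy","Lady","Baby"]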

luajit copy table is slow

Within a larger lua-script, I have to copy several tables dt:
for i = 1, dt:nrow() do
  local r = {}
  for j = 1, dt:ncol() do
    r[j] = dt[i][j]
  end
  rslt:append(r)
end
The tables are about 50,000 rows x 25 cols, containing mainly doubles. LuaJIT takes about 10 times as long as "standard" Lua. On all other calculations/operations I do beforehand, LuaJIT is faster (1.5 to 3 times).
As silly as this may sound, try pre-allocating the r table with 25 values:
local r = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Unfortunately the Lua API doesn't allow pre-allocation of tables, so this is the only way to avoid the re-allocations caused by array assignment in the inner loop. My tests show a noticeable improvement, but not close to 10x (although I don't use your methods, so your results may vary).
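In LuaJIT specifically, another option worth trying (my suggestion, not part of the original answer) is the table.new extension, which pre-sizes the array part of a table. A sketch that falls back to plain tables if the extension is unavailable:
local ok, table_new = pcall(require, "table.new")
if not ok then
  table_new = function(narr, nrec) return {} end  -- fall back to ordinary tables
end

for i = 1, dt:nrow() do
  local r = table_new(dt:ncol(), 0)  -- pre-size the array part to ncol slots
  for j = 1, dt:ncol() do
    r[j] = dt[i][j]
  end
  rslt:append(r)
end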

Multiset Partition Using Linear Arithmetic and Z3

I have to partition a multiset into two sets whose sums are equal. For example, given the multiset:
1 3 5 1 3 -1 2 0
I would output the two sets:
1) 1 3 3
2) 5 -1 2 1 0
both of which sum to 7.
I need to do this using Z3 (smt2 input format) and "Linear Arithmetic Logic", which is defined as:
formula : formula /\ formula | (formula) | atom
atom : sum op sum
op : = | <= | <
sum : term | sum + term
term : identifier | constant | constant identifier
I honestly don't know where to begin with this and any advice at all would be appreciated.
Regards.
Here is an idea:
1- Create a 0-1 integer variable c_i for each element. The idea is that c_i is zero if the element is in the first set, and 1 if it is in the second set. You can accomplish that by asserting 0 <= c_i and c_i <= 1.
2- The sum of the elements in the first set can be written as 1*(1 - c_1) + 3*(1 - c_2) + ...
3- The sum of the elements in the second set can be written as 1*c_1 + 3*c_2 + ...
4- Finally, assert that these two sums are equal.
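To make that concrete, here is a minimal SMT-LIB2 sketch of the encoding for the example multiset 1 3 5 1 3 -1 2 0 (written in plain SMT-LIB rather than the restricted grammar above; the variable names are mine):
(set-logic QF_LIA)
; one 0-1 selector per element
(declare-const c1 Int) (declare-const c2 Int) (declare-const c3 Int) (declare-const c4 Int)
(declare-const c5 Int) (declare-const c6 Int) (declare-const c7 Int) (declare-const c8 Int)
(assert (and (<= 0 c1) (<= c1 1) (<= 0 c2) (<= c2 1) (<= 0 c3) (<= c3 1) (<= 0 c4) (<= c4 1)
             (<= 0 c5) (<= c5 1) (<= 0 c6) (<= c6 1) (<= 0 c7) (<= c7 1) (<= 0 c8) (<= c8 1)))
; sum of the first set (c_i = 0) equals sum of the second set (c_i = 1)
(assert (= (+ (* 1 (- 1 c1)) (* 3 (- 1 c2)) (* 5 (- 1 c3)) (* 1 (- 1 c4))
              (* 3 (- 1 c5)) (* (- 1) (- 1 c6)) (* 2 (- 1 c7)) (* 0 (- 1 c8)))
           (+ (* 1 c1) (* 3 c2) (* 5 c3) (* 1 c4) (* 3 c5) (* (- 1) c6) (* 2 c7) (* 0 c8))))
(check-sat)
(get-model)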
While SMT-Lib2 is quite expressive, it's not the easiest language to program in. Unless you have a hard requirement that you have to code directly in SMTLib2, I'd recommend looking into other languages that have higher-level bindings to SMT solvers. For instance, both Haskell and Scala have libraries that allow you to script SMT solvers at a much higher level. Here's how to solve your problem using Haskell: https://gist.github.com/1701881.
The idea is that these libraries allow you to code at a much higher level, and then perform the necessary translation and querying of the SMT solver for you behind the scenes. (If you really need to get your hands onto the SMTLib encoding of your problem, you can use these libraries as well, as they typically come with the necessary API to dump the SMTLib they generate before querying the solver.)
While these libraries may not offer everything that Z3 gives you access to via SMTLib, they are much easier to use for most practical problems of interest.

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this .csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6, , , , 3, 1, 1, 0, 0, 0
I imagine this is a relatively common issue to deal with, but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
processed_q2
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is that it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. Thus my advice is more general – just feel free to clarify and I will try to give a more concrete response.
1) You say that you are coding the survey yourself, which is great because it means you have influence over your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person only checks 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many other things) would be a good reason to think about saving your data to a MySQL database instead of a .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To avoid re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides that, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for the Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame with 0s, so what you need to do is get 1s into the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert the character variable to a numeric one
x <- "3 5 99"   # one multi-select response, e.g. the first Q4_M value from the question
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix of logicals and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame and m is the matrix of logicals.
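To make that concrete, here is a minimal no-loop sketch using matrix indexing on the Q4_M example; resp, codes and idx are names I made up for the illustration:
resp  <- c("3 5 99", "1 2")                           # the two raw Q4_M strings from the question
codes <- lapply(strsplit(resp, " "), as.numeric)      # numeric codes per respondent
dtf_ins <- matrix(0, nrow = length(resp), ncol = 99)
colnames(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
idx <- cbind(rep(seq_along(codes), lengths(codes)),   # row index per selected code
             unlist(codes))                           # column index = the code itself
dtf_ins[idx] <- 1                                     # no for loop needed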
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
  vec <- as.vector(dat[, i])                                 # turn into vector
  b <- lapply(strsplit(vec, " "), as.numeric)                # split up and turn into numeric
  c <- sort(unique(unlist(b)))                               # which values were used
  newcolnames <- paste(colnames(dat[i]), "_", c, sep = "")   # column names
  e <- matrix(nrow = responses, ncol = length(c))            # create new matrix for indicators
  colnames(e) <- newcolnames
  # next loop looks for responses and puts indicators in the correct places
  for (r in 1:responses) {
    e[r, ] <- ifelse(c %in% b[[r]], 1, 0)
  }
  dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.

Resources