Horizontally joining multiple DataFrames in PySpark

I am trying to horizontally join multiple DataFrames (each with the same number of records) in PySpark using monotonically_increasing_id(). However, the result has an inflated number of records:
from pyspark.sql.functions import col, monotonically_increasing_id

for i in range(len(lst)+1):
    if i == 0:
        df[i] = cust_mod.select('key')
        df[i+1] = df[i].withColumn("idx", monotonically_increasing_id())
    else:
        df_tmp = o[i-1].select(col("value").alias(obj_names[i-1]))
        df_tmp = df_tmp.withColumn("idx", monotonically_increasing_id())
        df[i+1] = df[i].join(df_tmp, "idx", "outer")
Expected number of records in df[i+1]: ~60M. Got: ~88M. It seems monotonically_increasing_id() is not generating the same numbers on every DataFrame. How can I resolve this problem?
Other details:
cust_mod - a DataFrame with ~60M records
o[i] - another set of DataFrames, each with the same number of records as cust_mod
lst - a list with 49 components, so 49 loops in total
I tried using zipWithIndex():
for i in range(len(lst)+1):
    if i == 0:
        df[i] = cust_mod.select('key')
        df[i+1] = df[i].rdd.zipWithIndex().toDF()
    else:
        df_tmp = o[i-1].select("value").rdd.zipWithIndex().toDF()
        df_tmp1 = df_tmp.select(col("_1").alias(obj_names[i-1]), col("_2"))
        df[i+1] = df[i].join(df_tmp1, "_2", "inner").drop(df_tmp1._2)
But it's way too slow, roughly 50 times slower.
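The likely cause: monotonically_increasing_id() encodes the partition ID in the upper bits of each ID, so two DataFrames with different partitioning produce different ID sequences, and the join keys stop matching. A minimal sketch of one common workaround, assuming each DataFrame's row order is stable: renumber rows densely with row_number() over a window (with_seq_idx is an illustrative helper, not from the original post; note the unpartitioned window pulls all rows into a single partition, trading speed for correctness):

from pyspark.sql import Window
from pyspark.sql.functions import col, monotonically_increasing_id, row_number

def with_seq_idx(df):
    # row_number() over monotonically_increasing_id() yields a dense 1..n
    # index regardless of how the DataFrame happens to be partitioned
    w = Window.orderBy(monotonically_increasing_id())
    return df.withColumn("idx", row_number().over(w))

left = with_seq_idx(cust_mod.select("key"))
right = with_seq_idx(o[0].select(col("value").alias(obj_names[0])))
joined = left.join(right, "idx", "inner")  # record count stays at ~60M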

Related

Scilab save('-append') doesn't seem to work

I am trying to create a dataset for ML using Scilab, and I need to save during data generation because it's too big for Scilab's max stack. Here is a toy example I made to find out what goes wrong, but I'm not able to figure it out:
datas = [];
labels = [];
for i = 1:10
    for j = 1:100
        if j == 1
            disp(i)
        end
        data = sin(-%pi:0.01:%pi);
        label = rand();
        datas = [datas, data];
        labels = [labels, label];
    end
    save(chemin+'\test.h5', '-append', 'datas', 'labels')
    datas = [];
    labels = [];
end
I am looking for the shape of the saved data to be [1000, 629] at the end, but I get [62900, 0]. Do you have any idea why?
Here is an example of how to incrementally save a big matrix without any memory pressure:
// create a new HDF5 file
a = h5open(TMPDIR + "/test.h5", "w")
// create the dataset
N = 3;      // number of chunks
nrows = 5;  // rows of a single chunk
ncols = 10; // cols of a single chunk
chsize = [nrows, ncols];
maxrows = N*nrows; // final number of rows of concatenated matrix
maxcols = ncols;   // final number of cols of concatenated matrix
for k=1:N
    // warning, x is viewed as a C-matrix (row-major), transpose if applicable
    x = rand(nrows, ncols);
    h5dataset(a, "My_Dataset", ...
        [chsize; 1 1; 1 1; chsize; chsize], ...
        x, ...
        [k*nrows ncols; maxrows maxcols; 1+(k-1)*nrows 1; 1 1; chsize; chsize])
    h5dump(a, "My_Dataset");
end
disp(a.root.My_Dataset.data)
h5close(a)
You have to concatenate vertically (semicolon) instead of horizontally (comma):
datas = [datas; data];
labels = [labels; label];
BTW, this won't solve your memory problem, as the matrices still grow in Scilab's workspace, and "-append" just overwrites the objects in the HDF5 file (you are using the same names each time).
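For readers more comfortable in Python, here is a rough h5py analogue of the same chunked, incremental write pattern (an illustrative sketch, assuming h5py is installed; the file and dataset names are arbitrary):

import numpy as np
import h5py

N, nrows, ncols = 3, 5, 10  # chunks, rows per chunk, cols per chunk
with h5py.File("test.h5", "w") as f:
    # pre-size the dataset so each chunk can be written in place
    dset = f.create_dataset("My_Dataset", shape=(N * nrows, ncols), dtype="f8")
    for k in range(N):
        chunk = np.random.rand(nrows, ncols)
        dset[k * nrows:(k + 1) * nrows, :] = chunk  # no in-memory growth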

Data Frame from RNeo4j cypher() output

I've got a problem with this code:
library(RNeo4j)
library(dplyr)
library(stringr)
library(MASS)
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = ",yT:/9L)8aoi8t")
query = "MATCH (m:MARKETS)
MATCH (n:CBP_NAICS {msa_naics: m.jll_msa} )
WHERE n.naics CONTAINS '----'
match (c1:Category)
WHERE toString(c1.id) = left(n.naics,2)
match (q:JLL_qtr {qtr: 'Q2-2016',mkt: m.mkt,level: 1} )
match (c:BldgClass {qtr: q.qtr,mkt: m.mkt,class: 'Totals'} )
match (N:Neighborhood {qtr: c.qtr,nbrhd: c.nbrhd,nbrhd: q.nbrhd,BldgClass: c.class} )
return m.mkt, n.msa_naics, ... ,N.AvgOverallAskRent"
naics_jll <- cypher(graph, query)
df_corr <- naics_jll[sapply(naics_jll, is.numeric)]
The Neo4j query itself yields expected results when run in the Neo4j shell.
In RStudio, the data frame appears correct: View(naics_jll) and View(df_corr) both "look right".
However, dplyr::summarize() on both data frames gives:
## data frame with 0 columns and 0 rows
On top of that, I get "funny" results from analysis of the data in the data frames.
I searched both Google and SO for "data frame with 0 columns and 0 rows" and for "rneo4j data frame with 0 columns and 0 rows", and found nothing helpful.

How to join DecisionTreeRegressor predict output to the original data

I am developing a model that uses DecisionTreeRegressor. I have built and fit the tree using training data, and predicted the results from more recent data to confirm the model's accuracy.
To build and fit the tree:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.matrix(pre_x)
y = np.matrix(pre_y)
regr_b = DecisionTreeRegressor(max_depth=4)
regr_b.fit(X, y)
To predict new data:
X = np.matrix(pre_test_x)
trial_pred = regr_b.predict(X, check_input=True)
trial_pred is an array of the predicted values. I need to join it back to pre_test_x so I can see how well the prediction matches what actually happened.
I have tried merges:
all_pred = pre_pre_test_x.merge(predictions, left_index=True, right_index=True)
and
all_pred = pd.merge(pre_pre_test_x, predictions, how='left', left_index=True, right_index=True)
and either get no results or a second copy of the columns appended to the bottom of the DataFrame with NaN in all the existing columns.
Turns out it was simple. Leave the predict output as an array, then run:
w_pred = pre_pre_test_x.copy(deep=True)
w_pred['pred_val'] = trial_pred
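A minimal variant of the same idea, assuming pre_pre_test_x is a pandas DataFrame: wrapping the array in a Series that carries the DataFrame's own index makes the row alignment explicit, which also works when the index is not a plain 0..n-1 range.

import pandas as pd

w_pred = pre_pre_test_x.copy(deep=True)
# align on the DataFrame's own index instead of relying on positional order
w_pred['pred_val'] = pd.Series(trial_pred, index=pre_pre_test_x.index)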

Total sum from a set (logic)

I have a logic problem for an iOS app, but I don't want to solve it using brute force.
I have a set of integers whose values are not unique:
[3,4,1,7,1,2,5,6,3,4........]
How can I get a subset from it with these 3 conditions:
I can only pick a defined number of values.
The sum of the picked elements is equal to a given value.
The selection must be random, so if there is more than one solution for the value, it will not always return the same one.
Thanks in advance!
This is the subset sum problem, a known NP-complete problem, so there is no known efficient (polynomial-time) solution to it.
However, if you are dealing only with relatively small integers, there is a pseudo-polynomial time solution using dynamic programming.
The idea is to build a matrix bottom-up following this recurrence:
D(x,i) = false                          if x < 0
D(0,i) = true
D(x,0) = false                          if x != 0
D(x,i) = D(x,i-1) OR D(x-arr[i],i-1)
The idea is to mimic an exhaustive search: at each point you "guess" whether the element is chosen or not.
To get the actual subset, you need to trace back through the matrix. Starting from D(SUM,n) (assuming its value is true), you do the following after the matrix is filled:
if D(x-arr[i-1],i-1) == true:
    add arr[i-1] to the set
    modify x <- x - arr[i-1]
    modify i <- i - 1
else:  // that means D(x,i-1) must be true
    just modify i <- i - 1
To get a random subset each time: if both D(x-arr[i-1],i-1) == true AND D(x,i-1) == true, choose randomly which branch to take.
Python code (if you don't know Python, read it as pseudo-code; it is easy to follow):
from random import randint

arr = [1, 2, 4, 5]
n = len(arr)
SUM = 6

# pre-processing:
D = [[True] * (n+1)]
for x in range(1, SUM+1):
    D.append([False] * (n+1))

# DP solution to populate D:
for x in range(1, SUM+1):
    for i in range(1, n+1):
        D[x][i] = D[x][i-1]
        if x >= arr[i-1]:
            D[x][i] = D[x][i] or D[x-arr[i-1]][i-1]
print(D)

# get a random solution:
if D[SUM][n] == False:
    print('no solution')
else:
    sol = []
    x = SUM
    i = n
    while x != 0:
        possibleVals = []
        if D[x][i-1] == True:
            possibleVals.append(x)
        if x >= arr[i-1] and D[x-arr[i-1]][i-1] == True:
            possibleVals.append(x - arr[i-1])
        # by here possibleVals contains 1 or 2 candidates, depending on how
        # many choices we have; choose randomly between them
        r = possibleVals[randint(0, len(possibleVals)-1)]
        # if we decided to add the element, record its value:
        if r != x:
            sol.append(x - r)
        # modify i and x accordingly
        x = r
        i = i - 1
    print(sol)
P.S. The above gives you a random choice, but NOT a uniform distribution over the possible subsets.
To achieve a uniform distribution, you need to count the number of subsets that build each sum.
The formulas will be:
D(x,i) = 0                              if x < 0
D(0,i) = 1
D(x,0) = 0                              if x != 0
D(x,i) = D(x,i-1) + D(x-arr[i],i-1)
And when generating the subset, you apply the same trace-back logic, but you decide to add element i with probability D(x-arr[i],i-1) / D(x,i).
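A short sketch of that uniform variant, following the counting recurrence above (same arr, n, and SUM as in the earlier snippet):

from random import random

arr = [1, 2, 4, 5]
n = len(arr)
SUM = 6

# D[x][i] = number of subsets of the first i elements summing to x
D = [[0] * (n+1) for _ in range(SUM+1)]
for i in range(n+1):
    D[0][i] = 1
for x in range(1, SUM+1):
    for i in range(1, n+1):
        D[x][i] = D[x][i-1]
        if x >= arr[i-1]:
            D[x][i] += D[x-arr[i-1]][i-1]

if D[SUM][n] == 0:
    print('no solution')
else:
    sol = []
    x, i = SUM, n
    while x != 0:
        # probability that element i-1 is in a uniformly drawn solution
        take = x >= arr[i-1] and random() < D[x-arr[i-1]][i-1] / D[x][i]
        if take:
            sol.append(arr[i-1])
            x -= arr[i-1]
        i -= 1
    print(sol)  # e.g. [5, 1] or [4, 2], each full solution equally likely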

Stata: multiplying each variable of a set of time-series variables with the corresponding variable of another set

Being fairly new to Stata, I'm having difficulty figuring out how to do the following:
I have time-series data on selling price (p) and quantity sold (q) for 10 products in a single data file (i.e., 20 variables, p01-p10 and q01-q10). I am struggling to find the appropriate Stata command that computes a sales revenue (pq) time series for each of these 10 products (i.e., pq01-pq10).
Many thanks for your help.
forval i = 1/10 {
    local j : display %02.0f `i'
    gen pq`j' = p`j' * q`j'
}
A standard loop over 1/10 won't get you the leading zero in 01/09; for that we need an appropriate format. See also:
@article{pr0051,
    author    = "Cox, N. J.",
    title     = "Stata tip 85: Looping over nonintegers",
    journal   = "Stata Journal",
    publisher = "Stata Press",
    address   = "College Station, TX",
    volume    = "10",
    number    = "1",
    year      = "2010",
    pages     = "160-163(4)",
    url       = "http://www.stata-journal.com/article.html?article=pr0051"
}
(added later) Another way to do it is
local j = string(`i', "%02.0f")
That makes it a bit more explicit that you are mapping from numbers 1,...,10 to strings "01",...,"10".
