Use two different RDDs in Scala Spark - join

I have:
RDD1 with pairs of points that I would like to compare:
(2,5), (3,7), ...
and RDD2 with each point's dimensions:
(0, List(5,7)), (1, List(2,4)), ...
How can I take the dimensions from the second RDD in order to compare the pairs from the first RDD?
(Both RDDs are big, so I cannot collect them.)
(A plain join doesn't work because the two RDDs have different schemas.)
https://www.mdpi.com/1999-4893/12/8/166/htm#B28-algorithms-12-00166

I've added a sample for joining the rows; I hope it works for you.
You'll also find the placeholder where you can add or modify the code to plug in your own logic.
import org.apache.spark.sql.functions._
import scala.collection.mutable

object JoinRdds {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // the answer's own helper that builds a SparkSession
    import spark.implicits._

    var df1 = List((2, 5), (3, 7)).toDF("x", "y")                              // 1st DataFrame
    val df2 = List((0, List(5, 7)), (1, List(2, 4))).toDF("id", "coordinates") // 2nd DataFrame

    // add a row id to the 1st DF; note that monotonically_increasing_id() only guarantees
    // increasing, unique ids (not consecutive ones), so this positional join is only
    // reliable when df1 fits in a single partition
    df1 = df1.withColumn("id", monotonically_increasing_id())

    // df2.join(df1, df1("id") === df2("id")) // perform inner join
    //   .drop("id")                          // drop the id column
    //   .show(false)

    val rdd = df2.join(df1, df1("id") === df2("id")).rdd // here's your RDD to work with
    val resultCoordinates: Array[(Int, Int)] = rdd.map(row => { // iterate the result row by row
      // placeholder: you can do all sorts of operations per row and return any type here
      val coordinates = row.getAs[mutable.WrappedArray[Int]]("coordinates")
      val x = row.getAs[Int]("x")
      val y = row.getAs[Int]("y")
      (coordinates(0) - x, coordinates(1) - y)
    }).collect() // the collect call on the output
    resultCoordinates.foreach(r => println(s"(${r._1},${r._2})")) // print the result
  }
}
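If you prefer to stay at the RDD level instead of going through DataFrames (which is what the question asks about), a double join on the point ids is another option. The following is only a minimal sketch, reusing the spark session from the snippet above, under the assumption that RDD1 holds pairs of point ids and RDD2 maps a point id to its dimension list; the variable names and sample values are illustrative, not taken from the original data:

import org.apache.spark.rdd.RDD

// sample data shaped like the question: pairs of point ids, and id -> dimensions
val rdd1: RDD[(Int, Int)] = spark.sparkContext.parallelize(Seq((2, 5), (3, 7)))
val rdd2: RDD[(Int, List[Int])] = spark.sparkContext.parallelize(Seq(
  (2, List(5, 7)), (3, List(2, 4)), (5, List(1, 9)), (7, List(0, 6))))

val pairsWithDims: RDD[((Int, Int), (List[Int], List[Int]))] = rdd1
  .join(rdd2)                                          // key = first point id -> (p2, dims1)
  .map { case (p1, (p2, dims1)) => (p2, (p1, dims1)) } // re-key by the second point id
  .join(rdd2)                                          // -> (p2, ((p1, dims1), dims2))
  .map { case (p2, ((p1, dims1), dims2)) => ((p1, p2), (dims1, dims2)) }

// each pair now carries both points' dimension lists and can be compared
// (e.g. with a distance function) without collecting either RDD

The second join re-keys the intermediate result by the second point id, so each output record ends up with the dimensions of both points of a pair; both joins shuffle data, but nothing is collected to the driver.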

Related

Destructured iteration over variadic arguments like a tuple sequence in D

Let's say I have a variadic function which is alternately passed the start and end values of one or more intervals, and it should return a range of random values drawn from those intervals. You can imagine the input to be a flattened sequence of tuples, with all tuple elements spread over one single range.
import std.meta;   // variadic template predicates
import std.traits : isFloatingPoint;
import std.range;

auto randomIntervals(T = U[0], U...)(U intervals)
if (U.length/2 > 0 && isFloatingPoint!T && NoDuplicates!U.length == 1) {
    import std.random : uniform01;
    T[U.length/2] randomValues;
    // split and iterate over subranges of size 2
    foreach(i, T start, T end; intervals.chunks(2)) { //= intervals.slide(2,2)
        randomValues[i] = uniform01 * (end - start) + start;
    }
    return randomValues.dup;
}
The example itself is not important; I only use it for explanation. The chunk size could be any finite positive size_t, not only 2, and changing the chunk size should only require changing the number of loop variables in the foreach loop.
In the form above it will not compile, since the foreach loop only expects one argument (a range). What I would like is something which automatically infers a sliding window as a tuple from the number of given loop variables and fills the additional variables with the next elements of the range/array, optionally together with an index. According to the documentation, a range of tuples allows destructuring the tuple elements in place into foreach loop variables, so the first thing I thought of was turning a range into a sequence of tuples, but I didn't find a convenience function for this.
Is there a simple way to loop over destructured subranges (with the simplicity shown in my example code) together with the index? Or is there a (standard library) function which does the job of splitting a range into enumerated tuples of equal size? How can I easily turn the range of subranges into a range of tuples?
Is it possible with std.algorithm.iteration.map in this case (EDIT: with a simple function argument to map and without accessing tuple elements)?
EDIT: I want to ignore the last chunk which doesn't fit into an entire tuple; it just is not iterated over.
EDIT: It's not that I couldn't program this myself; I only hope for a simple notation, because this use case of looping over multiple elements is quite useful. If there is something like a "spread" or "rest" operator in D, as in JavaScript, please let me know!
Thank you.
(Added as a separate answer because it's significantly different from my previous answer, and wouldn't fit in a comment)
After reading your comments and the discussion on the answers thus far, it seems to me what you seek is something like the below staticChunks function:
unittest {
    import std.range : enumerate;

    size_t index = 0;
    foreach (i, a, b, c; [1,2,3,1,2,3].staticChunks!3.enumerate) {
        assert(a == 1);
        assert(b == 2);
        assert(c == 3);
        assert(i == index);
        ++index;
    }
}

import std.range : isInputRange;

auto staticChunks(size_t n, R)(R r) if (isInputRange!R) {
    import std.range : chunks;
    import std.algorithm : map, filter;

    return r.chunks(n).filter!(a => a.length == n).map!(a => a.tuplify!n);
}

auto tuplify(size_t n, R)(R r) if (isInputRange!R) {
    import std.meta : Repeat;
    import std.range : ElementType;
    import std.typecons : Tuple;
    import std.array : front, popFront, empty;

    Tuple!(Repeat!(n, ElementType!R)) result;
    static foreach (i; 0..n) {
        result[i] = r.front;
        r.popFront();
    }
    assert(r.empty);
    return result;
}
Note that this also deals with the last chunk being a different size, if only by silently throwing it away. If this behavior is undesirable, remove the filter, and deal with it inside tuplify (or don't, and watch the exceptions roll in).
chunks and slide return ranges, not tuples. Their last element can contain fewer elements than the specified size, whereas tuples have a fixed compile-time size.
If you need destructuring, you have to implement your own chunks/slide that returns tuples. To explicitly add an index to the tuple, use enumerate. Here is an example:
import std.typecons, std.stdio, std.range;

Tuple!(int, int)[] pairs(){
    return [
        tuple(1, 3),
        tuple(2, 4),
        tuple(3, 5)
    ];
}

void main(){
    foreach(size_t i, int start, int end; pairs.enumerate){
        writeln(i, ' ', start, ' ', end);
    }
}
Edit:
As BioTronic said, using map is also possible:
foreach(i, start, end; intervals
        .chunks(2)
        .map!(a => tuple(a[0], a[1]))
        .enumerate){
Your question has me a little confused, so I'm sorry if I've misunderstood. What you're basically asking is if foreach(a, b; [1,2,3,4].chunks(2)) could work, right?
The simple solution here is to, as you say, map from chunk to tuple:
import std.typecons : tuple;
import std.algorithm : map;
import std.range : chunks;
import std.stdio : writeln;
unittest {
    pragma(msg, typeof([1,2].chunks(2).front));
    foreach(a, b; [1,2,3,4].chunks(2).map!(a => tuple(a[0], a[1]))) {
        writeln(a, ", ", b);
    }
}
At the same time as BioTronic, I tried to code my own solution to this problem (tested on DMD). My solution works for slices (BUT NOT for fixed-size arrays) and avoids a call to filter:
import std.range : chunks, isInputRange, enumerate;
import std.range : isRandomAccessRange; // changed from "hasSlicing" to "isRandomAccessRange" thanks to BioTronics
import std.traits : isIterable;

/** turns chunks into tuples */
template byTuples(size_t N, M)
if (isRandomAccessRange!M) { // EDITED
    import std.meta : Repeat;
    import std.typecons : Tuple;
    import std.traits : ForeachType;

    alias VariableGroup = Tuple!(Repeat!(N, ForeachType!M)); // Tuple of N repetitions of M's foreach-iterated type

    /** turns N consecutive array elements into a VariableGroup */
    auto toTuple(Chunk)(Chunk subArray) @nogc @safe pure nothrow
    if (isInputRange!Chunk) { // Chunk must be indexable
        VariableGroup nextLoopVariables; // fill the tuple with a static foreach loop
        static foreach(index; 0 .. N) {
            static if ( isRandomAccessRange!Chunk ) { // add cases for other ranges here
                nextLoopVariables[index] = subArray[index];
            } else {
                nextLoopVariables[index] = subArray.front;
                subArray.popFront(); // popFront returns void, so front must be read first
            }
        }
        return nextLoopVariables;
    }

    /** returns a range of VariableGroups */
    auto byTuples(M array) @safe pure nothrow {
        import std.algorithm.iteration : map;
        static if(!isInputRange!M) {
            static assert(0, "Cannot call map() on fixed-size array.");
            // auto varGroups = array[].chunks(N); // fixed-size arrays aren't slices by default and cannot be treated like ranges
            // WARNING! invoking "map" on a chunk range from a fixed-size array will fail and access wrong memory with no warning or exception despite @safe!
        } else {
            auto varGroups = array.chunks(N);
        }
        // remove the last group if it is incomplete
        if (varGroups.back.length < N) varGroups.popBack();
        // NOTE! I don't know why, but `map!toTuple` DOES NOT COMPILE and causes a template compilation mess.
        return varGroups.map!(chunk => toTuple(chunk)); // don't know if it uses the GC
    }
}
void main() {
    testArrayToTuples([1, 3, 2, 4, 5, 7, 9]);
}

// The order of template parameters is relevant: the implicitly deduced parameters
// must be declared first to be associated with a template specialization.
void testArrayToTuples(U : V[], V)(U arr) {
    double[] randomNumbers = new double[arr.length / 2];
    // generate random numbers
    foreach(i, double x, double y; byTuples!2(arr).enumerate) { // cannot use UFCS with "byTuples"
        import std.random : uniform01;
        randomNumbers[i] = (uniform01 * (y - x) + x);
    }
    foreach(n; randomNumbers) { // 'n' apparently works despite shadowing a template parameter
        import std.stdio : writeln;
        writeln(n);
    }
}
Using elementwise operations with the slice operator would not work here, because uniform01 in uniform01 * (ends[] - starts[]) + starts[] would only be called once, not once per interval.
EDIT: I also tested this code on several online D compilers, and it's odd that they behave differently for the same code. For compiling D I can recommend
https://run.dlang.io/ (I would be very surprised if this one wouldn't work)
https://www.mycompiler.io/new/d (but a bit slow)
https://ideone.com (it works but it makes your code public! Don't use with protected code.)
but those didn't work for me:
https://tio.run/#d2 (didn't finish compilation in one case, otherwise wrong results on execution even when using dynamic array for the test)
https://www.tutorialspoint.com/compile_d_online.php (doesn't compile the static foreach)

Setting a column to equal the negative of a row in Google Sheets

The Google Sheets API seems vague and I'm probably just too tired.
function onEdit(e) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet();
  var positives = sheet.getRange("D3:AG3");
  var negatives = sheet.getRange("C4:C33");
  for (i = 0; i < positives.getLastColumn(); i++) {
    var j = positives[i] * -1;
    negatives[i].setValue(j);
  }
}
I'm sure I'm doing eight things wrong but if someone is more familiar with Google Sheets, please throw a brick at me.
First, positives is a Range; you need to call getValues() on it to get an array that you can manipulate.
Second, it's not recommended to call Sheets API methods inside loops. The best practice is to manipulate arrays in the loop and then use a single getValues / setValues call to read from or write to a range.
Sample Code:
function onEdit(e) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet();
  var positives = sheet.getRange("D3:AG3").getValues();
  var negatives = sheet.getRange("C4:C33");
  var result = [];
  for (var i = 0; i < positives[0].length; i++) {
    result.push([positives[0][i] * -1]);
  }
  negatives.setValues(result);
}
Sample Output: (I only put values in three rows)
Reference:
push()
Avoid using onEdit for this kind of change, as it is resource intensive: you are rewriting the whole negative column from the positive row EVERY TIME you edit the sheet (unless that is actually what you want).
If you really want to use onEdit, be sure to limit it so that it only runs when the specific range is edited.
Code:
function onEdit(e) {
  const row = e.range.getRow();
  const column = e.range.getColumn();
  // if the edited range is within D3:AG3
  if (row == 3 && column >= 4 && column <= 33) {
    // write to the corresponding row (invert col and row)
    e.source.getActiveSheet().getRange(column, row).setValue(e.value * -1);
  }
}
Note:
The behaviour of this onEdit function is that when you edit a cell in the range D3:AG3, it negates the value and writes it into the corresponding destination cell, one edit at a time.
If you edit D3, it will assign the negated value to C4, nothing more.
If you edit outside the positive range, it will not do anything.
Another approach is to copy your positive row into the negative column by transforming your data into the destination's structure and writing it in bulk.
Code:
function rowToColumn() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet();
  var pRange = sheet.getRange("D3:AG3");
  var pValues = pRange.getValues();
  // pValues is a 2D array now
  // row range values = [[1, 2, 3, ...]]
  var negatives = sheet.getRange("C4:C33");
  // column range values = [[1], [2], [3], ...]
  // since the structure of a row is different from a column,
  // one thing we can do is convert the row into column structure,
  // multiply each element by -1, then assign it to negatives
  pValues = pValues.map(function(item) {
    item = item.map(function(col) {
      return [col * -1];
    });
    return item;
  })[0];
  // set the values into the negatives range
  negatives.setValues(pValues);
}
Note:
The behaviour of the rowToColumn function is that it takes all the values of the positive row range and writes them into the negatives range all at once.
Blank cells will yield 0 by default; add a condition on return [col * -1]; if you want blank cells to return other values instead.

scilab save('-append') doesn't seem to work

I am trying to create a dataset for ML using Scilab, and I need to save during data generation because the data is too big for Scilab's max stack.
Here is a toy example I made to find out what goes wrong, but I'm not able to figure it out:
datas = [];
labels = [];
for i = 1:10
    for j = 1:100
        if j == 1
            disp(i)
        end
        data = sin(-%pi:0.01:%pi);
        label = rand();
        datas = [datas, data];
        labels = [labels, label];
    end
    save(chemin+'\test.h5','-append','datas','labels')
    datas = [];
    labels = [];
end
I am expecting the shape of the data to be [1000, 629] at the end, but I get [62900, 0].
Do you have any idea why?
Here is an example of how to incrementally save a big matrix without any memory pressure:
// create a new HDF5 file
a = h5open(TMPDIR + "/test.h5", "w")
// create the dataset
N = 3;      // number of chunks
nrows = 5;  // rows of a single chunk
ncols = 10; // cols of a single chunk
chsize = [nrows, ncols];
maxrows = N*nrows; // final number of rows of the concatenated matrix
maxcols = ncols;   // final number of cols of the concatenated matrix
for k=1:N
    // warning, x is viewed as a C-matrix (row-major), transpose if applicable
    x = rand(nrows,ncols);
    h5dataset(a, "My_Dataset", ...
        [chsize ;1 1 ;1 1 ;chsize ;chsize],...
        x, ...
        [k*nrows ncols; maxrows maxcols; 1+(k-1)*nrows 1 ;1 1 ;chsize; chsize])
    h5dump(a, "My_Dataset");
end
disp(a.root.My_Dataset.data)
h5close(a)
You have to vertically concatenate (semicolon) instead of horizontally (comma):
datas = [datas; data];
labels = [labels; label];
BTW this won't solve your memory problem, as the matrices still grow in Scilab's workspace, and using "-append" just overwrites the objects in the HDF5 file (you are using the same names).

How to compare two columns in a spreadsheet

I have 30 columns and 1000 rows. I would like to compare column 1 with another column; if the values don't match, I would like to colour the cell red. Below is a small sample of my spreadsheet:
A B C D E F ...
1 name sName email
2
3
.
n
Because I have a large dataset, I want to store my columns in an array; the first row is the heading. This is what I have done, however when testing I get an empty result. Can someone point out what I am doing wrong?
var index = [];
var sheet = SpreadsheetApp.getActiveSheet();
function col() {
  var data = sheet.getDataRange().getValues();
  for (var i = 1; i <= data.length; i++) {
    te = index[i] = data[1];
    Logger.log(columnIndex[i])
    if (data[3] != data[7]) {
      // column_id.setFontColor('red'); <--- I can set the background like this
    }
  }
}
From the code you can see I am scanning the whole spreadsheet: data[1] gets the heading, and the if statement (data[3] != data[7]) compares two columns. I still have to work on my colour variable, but that can be done once I get the data I need.
Check whether this tutorial helps you with your problem. The tutorial uses Google Apps Script to compare two columns: if differences are found, the script points them out; if no differences are found at all, it outputs the text "[ id. ]". Just customize this code for your own function.
Here is the code used to achieve this kind of comparison:
function stringComparison(s1, s2) {
  // let's test that both variables are the same object type; if not, throw an error
  if (Object.prototype.toString.call(s1) !== Object.prototype.toString.call(s2)) {
    throw("Both values need to be an array of cells or individual cells");
  }
  // if we are looking at two arrays of cells, make sure the sizes match and each is only one column wide
  if (Object.prototype.toString.call(s1) === '[object Array]') {
    if (s1.length != s2.length || s1[0].length > 1 || s2[0].length > 1) {
      throw("Arrays of cells need to be same size and 1 column wide");
    }
    // since we are working with an array, initialise the return
    var out = [];
    for (r in s1) { // loop over the rows and find differences using the diff sub function
      out.push([diff(s1[r][0], s2[r][0])]);
    }
    return out; // return response
  } else { // we are working with two cells, so return diff
    return diff(s1, s2);
  }
}

function diff(s1, s2) {
  var out = "[ ";
  var notid = false;
  // loop to match each character
  for (var n = 0; n < s1.length; n++) {
    if (s1.charAt(n) == s2.charAt(n)) {
      out += "–";
    } else {
      out += s2.charAt(n);
      notid = true;
    }
    out += " ";
  }
  out += " ]";
  return (notid) ? out : "[ id. ]"; // if not id(entical), return the output, otherwise [ id. ]
}
For more information, just check the tutorial link above and this SO question on how to compare two Spreadsheets.

how to return values between dates and group results in couchdb

I'm having issues grouping date-range results in CouchDB.
Say I have this data:
2010-11-14, Tom
2010-11-15, Tom
2010-11-15, Dick
2010-11-15, Tom
2010-11-20, Harry
and I want to use a view (and possibly a reduce function) to return grouped name counts between 2010-11-14 and 2010-11-16, e.g.
Tom 3
Dick 1
How can this be achieved?
I would suggest the following document structure, and map and reduce functions:
{ date : '2010-11-14', name : 'Tom' }
function(doc) { var r = {}; r[doc.name] = 1; emit (doc.date, r); }
function (keys, values, rereduce) {
  var r = {};
  for (var i in values) {
    for (var k in values[i]) {
      if (k in r) r[k] += values[i][k];
      else r[k] = values[i][k];
    }
  }
  return r;
}
Then you would query the view, asking for a full reduce (no grouping) with startkey and endkey parameters 2010-11-14 and 2010-11-16. You will get back a single value:
{ 'Tom': 3, 'Dick': 1 }
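For illustration only (the database name mydb, design document names, and view by_date are hypothetical, not from this answer), the request could look like:

GET /mydb/_design/names/_view/by_date?startkey="2010-11-14"&endkey="2010-11-16"&group=false

CouchDB view keys are JSON, so the date strings have to be quoted (and URL-encoded) in the query string. group=false and reduce=true are the defaults, so they can be omitted; endkey is inclusive by default, which is fine here because no documents fall on 2010-11-16.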
