Fuzzy join on substring in Dask

I have two data frames. The columns of interest are 'ParseCom', which is the left key of this fuzzy join, and 'REF', which should be a substring of 'ParseCom' for two rows to join.
The code below iterates over the DataFrame, which is not recommended.
How can I implement a fuzzy join in Dask where I am joining on substrings?
for i, com in enumerate(defects['ParseCom']):
    for j, sub in enumerate(repair_matrix['REF']):
        if sub in com:
            print(i, j, com)

Modifying the approach shown in "Merge pandas on string contains", using pandas:
import pandas as pd

def inmerge(sub, sup, sub_on, sup_on, sub_index, sup_index):
    # Rename the join and index columns to fixed names so they can be accessed as attributes
    sub_par = sub.rename(columns={sub_on: 'on', sub_index: 'common'})
    sup_par = sup.rename(columns={sup_on: 'on', sup_index: 'common'})
    print(sub_par.columns.tolist())
    print(sup_par.columns.tolist())
    # For each substring, find the index of a row in sup whose 'on' value contains it
    rhs = (sub_par.on
           .apply(lambda x: sup_par[sup_par.on.str.find(x).ge(0)].common)
           .bfill(axis=1)
           .iloc[:, 0])
    rel = (pd.concat([sub_par.common, rhs], axis=1, ignore_index=True).rename(columns={0: sub_index, 1: sup_index}))[[sub_index, sup_index]]
    print(rel.columns.tolist())
    print(sub_index, sup_index)
    sub = sub.merge(rel, on=sub_index)
    return sub.merge(sup, on=sup_index)
This has limitations: it requires pandas rather than Dask and it is still slow, but it does run faster than the for loop.
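The matching rule from the nested loop can be isolated in a small pure-Python helper that returns index pairs instead of printing them. This is only a sketch: the name substring_join and the sample strings are hypothetical, and it assumes the list of 'REF' substrings is small enough to be scanned once per comment. A helper of this shape could, under that assumption, be applied to each partition of the left frame separately (for instance through Dask's map_partitions), since every comment is matched independently of the others.

```python
from typing import Iterable, List, Tuple


def substring_join(comments: Iterable[str], refs: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where refs[j] occurs inside comments[i]."""
    refs = list(refs)  # materialise once so it can be re-scanned for every comment
    pairs = []
    for i, com in enumerate(comments):
        for j, sub in enumerate(refs):
            if sub in com:
                pairs.append((i, j))
    return pairs


# Hypothetical sample data standing in for defects['ParseCom'] and repair_matrix['REF'].
parse_com = ["crack near REF-12 panel", "no reference", "REF-7 and REF-12 loose"]
ref = ["REF-12", "REF-7"]

matches = substring_join(parse_com, ref)
print(matches)
```

The resulting pairs can then be used to look up the full rows on both sides, which keeps the expensive string scan separate from the actual merge.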


How do I sort a simple Lua table alphabetically?

I have already seen many threads with examples of how to do this. The problem is, I still can't do it.
All the examples have tables with extra data. For example, something like this:
lines = {
  luaH_set = 10,
  luaH_get = 24,
  luaH_present = 48,
}
or this:
obj = {
  { N = 'Green1' },
  { N = 'Green' },
  { N = 'Sky blue99' }
}
I can code in a few languages but I'm very new to Lua, and tables are really confusing to me. I can't seem to work out how to adapt the code in the examples to be able to sort a simple table.
This is my table:
local players = {"barry", "susan", "john", "wendy", "kevin"}
I want to sort these names alphabetically. I understand that Lua tables don't preserve order, and that's what's confusing me. All I essentially care about doing is just printing these names in alphabetical order, but I feel I need to learn this properly and know how to index them in the right order to a new table.
The examples I see are like this:
local function cmp(a, b)
  a = tostring(a.N)
  b = tostring(b.N)
  local patt = '^(.-)%s*(%d+)$'
  local _,_, col1, num1 = a:find(patt)
  local _,_, col2, num2 = b:find(patt)
  if (col1 and col2) and col1 == col2 then
    return tonumber(num1) < tonumber(num2)
  end
  return a < b
end
table.sort(obj, cmp)
for i,v in ipairs(obj) do
  print(i, v.N)
end
or this:
function pairsByKeys (t, f)
  local a = {}
  for n in pairs(t) do table.insert(a, n) end
  table.sort(a, f)
  local i = 0 -- iterator variable
  local iter = function () -- iterator function
    i = i + 1
    if a[i] == nil then return nil
    else return a[i], t[a[i]]
    end
  end
  return iter
end
for name, line in pairsByKeys(lines) do
  print(name, line)
end
and I'm just absolutely thrown by this as to how to do the same thing for a simple 1D table.
Can anyone please help me to understand this? I know if I can understand the most basic example, I'll be able to teach myself these harder examples.
local players = {"barry", "susan", "john", "wendy", "kevin"}
-- sort ascending, which is the default
table.sort(players)
print(table.concat(players, ", "))
-- sort descending
table.sort(players, function(a,b) return a > b end)
print(table.concat(players, ", "))
Here's why:
Your table players is a sequence.
local players = {"barry", "susan", "john", "wendy", "kevin"}
is equivalent to
local players = {
  [1] = "barry",
  [2] = "susan",
  [3] = "john",
  [4] = "wendy",
  [5] = "kevin",
}
If you do not provide keys in the table constructor, Lua will use integer keys automatically.
A table like that can be sorted by its values. Lua simply rearranges the index-value pairs according to the return value of the compare function. By default this is
function (a,b) return a < b end
If you want any other order, you need to provide a function that returns true if element a comes before b.
Read this: https://www.lua.org/manual/5.4/manual.html#pdf-table.sort
table.sort
Sorts the list elements in a given order, in-place, from list[1] to
list[#list]
This example is not a "list" or sequence:
lines = {
  luaH_set = 10,
  luaH_get = 24,
  luaH_present = 48,
}
which is equivalent to
lines = {
  ["luaH_set"] = 10,
  ["luaH_get"] = 24,
  ["luaH_present"] = 48,
}
It only has strings as keys, so it has no order. You need a helper sequence to map some order onto that table's elements.
The second example
obj = {
  { N = 'Green1' },
  { N = 'Green' },
  { N = 'Sky blue99' }
}
which is equivalent to
obj = {
  [1] = { N = 'Green1' },
  [2] = { N = 'Green' },
  [3] = { N = 'Sky blue99' },
}
is a list, so you could sort it. But sorting it by its table values wouldn't make much sense, so you need to provide a function that gives you a reasonable way to order it.
Read this so you understand what a "sequence" or "list" is in this regard. Those names are used for other things as well; don't let that confuse you.
https://www.lua.org/manual/5.4/manual.html#3.4.7
It is basically a table that has consecutive integer keys starting at 1.
Understanding this difference is one of the most important concepts while learning Lua. The length operator, ipairs and many functions of the table library only work with sequences.
This is my table:
local players = {"barry", "susan", "john", "wendy", "kevin"}
I want to sort these names alphabetically.
All you need is table.sort(players)
I understand that LUA tables don't preserve order.
Order of fields in a Lua table (a dictionary with arbitrary keys) is not preserved.
But your Lua table is an array, it is self-ordered by its integer keys 1, 2, 3,....
To clear up the confusion regarding "not preserving order": what is not preserved is the order of the keys in the table, in particular string keys, i.e. when you use the table as a dictionary and not as an array. If you write myTable = {orange="hello", apple="world"}, the fact that you defined key orange to the left of key apple isn't stored. If you enumerate keys/values using for k, v in pairs(myTable) do print(k, v) end, you might get apple world before orange hello or the other way round; the traversal order of pairs is not specified.
You don't have this problem with numeric keys, though, which is what the keys will be by default if you don't specify them: myTable = {"hello", "world", foo="bar"} is the same as myTable = {[1]="hello", [2]="world", foo="bar"}, i.e. it assigns myTable[1] = "hello", myTable[2] = "world" and myTable.foo = "bar" (same as myTable["foo"]). Even if pairs returned the numeric keys in an arbitrary order, it wouldn't matter, since you can still loop through them by incrementing an index, which is what ipairs does.
You can use table.sort which, if no order function is given, sorts the values using <, so numbers end up in ascending order and strings are sorted by ASCII code:
local players = {"barry", "susan", "john", "wendy", "kevin"}
table.sort(players)
-- players is now {"barry", "john", "kevin", "susan", "wendy"}
This will, however, fall apart if you have mixed lowercase and uppercase entries, because uppercase letters sort before lowercase ones due to their lower ASCII codes. It also won't work properly with non-ASCII characters like umlauts (they will go last). In other words, it is not a true alphabetical (dictionary-order) sort.
You can, however, supply your own ordering function, which receives arguments (a, b) and needs to return true if a should come before b. Here is an example that fixes the lowercase/uppercase issue by converting to uppercase before comparing:
table.sort(players, function (a, b)
  return string.upper(a) < string.upper(b)
end)

Joining a dataframe against a filtered version of itself

I have two dataframes, left and right. The latter, right, is a subset of left, such that left contains all the rows right does. I want to use right to remove redundant rows from left by doing a simple "left_anti" join.
I've discovered that the join doesn't work if I use a filtered version of left on the right. It works only if I reconstruct the right dataframe from scratch.
What is going on here?
Is there a workaround that doesn't involve recreating the right dataframe?
from pyspark.sql import Row, SparkSession
import pyspark.sql.types as t

schema = t.StructType(
    [
        t.StructField("street_number", t.IntegerType()),
        t.StructField("street_name", t.StringType()),
        t.StructField("lower_street_number", t.IntegerType()),
        t.StructField("upper_street_number", t.IntegerType()),
    ]
)
data = [
    # Row that conflicts w/ range row, and should be removed
    Row(
        street_number=123,
        street_name="Main St",
        lower_street_number=None,
        upper_street_number=None,
    ),
    # Range row
    Row(
        street_number=None,
        street_name="Main St",
        lower_street_number=120,
        upper_street_number=130,
    ),
]

def join_files(left_side, right_side):
    join_condition = [
        (
            (right_side.lower_street_number.isNotNull())
            & (right_side.upper_street_number.isNotNull())
            & (right_side.lower_street_number <= left_side.street_number)
            & (right_side.upper_street_number >= left_side.street_number)
        )
    ]
    return left_side.join(right_side, join_condition, "left_anti")

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame(data, schema)
right_fail = left.filter("lower_street_number IS NOT NULL")
result = join_files(left, right_fail)
result.count()  # Returns 2 - both rows still present
right_success = spark.createDataFrame([data[1]], schema)
result = join_files(left, right_success)
result.count()  # Returns 1 - the "left_anti" join worked as expected
You could alias the DataFrames:
import pyspark.sql.functions as F

def join_files(left_side, right_side):
    # Refer to columns through the aliases rather than through the DataFrame objects
    join_condition = [
        (
            (F.col("right_side.lower_street_number").isNotNull())
            & (F.col("right_side.upper_street_number").isNotNull())
            & (F.col("right_side.lower_street_number") <= F.col("left_side.street_number"))
            & (F.col("right_side.upper_street_number") >= F.col("left_side.street_number"))
        )
    ]
    return left_side.join(right_side, join_condition, "left_anti")

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame(data, schema).alias("left_side")
right_fail = left.filter("lower_street_number IS NOT NULL").alias("right_side")
result = join_files(left, right_fail)
print(result.count())  # Should now return 1: the aliases let Spark tell the two sides apart
right_success = spark.createDataFrame([data[1]], schema).alias("right_side")
result = join_files(left, right_success)
result.count()  # Returns 1 - the "left_anti" join worked as expected
I don't know which PySpark version you are on, but with pyspark==3.0.1 I get the following explanatory error:
AnalysisException: Column lower_street_number#522, upper_street_number#523, lower_street_number#522, upper_street_number#523 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
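For reference, the result the question is after can be written down in plain Python, independent of how Spark resolves column references. This is only a sketch of left-anti semantics under the question's range condition; the helper names left_anti_join and in_range are hypothetical and the rows are modelled as dictionaries rather than Spark Row objects.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def left_anti_join(
    left: List[Record],
    right: List[Record],
    condition: Callable[[Record, Record], bool],
) -> List[Record]:
    """Keep every left record for which no right record satisfies condition."""
    return [
        left_row
        for left_row in left
        if not any(condition(left_row, right_row) for right_row in right)
    ]


def in_range(left_row: Record, right_row: Record) -> bool:
    # Mirrors the join condition from the question; a missing value never matches.
    number = left_row["street_number"]
    lower = right_row["lower_street_number"]
    upper = right_row["upper_street_number"]
    if number is None or lower is None or upper is None:
        return False
    return lower <= number <= upper


data = [
    {"street_number": 123, "street_name": "Main St",
     "lower_street_number": None, "upper_street_number": None},
    {"street_number": None, "street_name": "Main St",
     "lower_street_number": 120, "upper_street_number": 130},
]

# The right side is a filtered view of the left side, as in the question.
right = [row for row in data if row["lower_street_number"] is not None]
result = left_anti_join(data, right, in_range)
print(len(result))
```

Only the range row survives here, because the row with street number 123 falls inside 120 to 130. That is the outcome the aliased Spark join is meant to reproduce.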

How to order a table of tables based on a single field of each entry?

I am a hobbyist making mods in Tabletop Simulator (TTS) using Lua and have a question that I cannot seem to work out.
I have a number of "objects", each of which is a table in TTS that contains various data for that object. For example, obj.position = {x,y,z}, which can be accessed at the axis level as well.
obj.position = {5,10,15} -- x,y,z
obj.position.x == 5
This is an example. The makers of TTS have made it so you can access all the parts like that, so I can access the object and then its various parts. There are a lot of them, like name, mesh, diffuse and many more: rotation {x,y,z}, etc.
Anyway, I have a table of objects and would like to order those objects based on the positional data of the x axis, from highest to lowest. So if I have a table where obj1 has x=3, obj2 has x=1 and obj3 has x=2, it would be sorted as obj1, obj3, obj2.
Pseudo code:
tableOfObjects = {obj1,obj2,obj3}
--[[
tableOfObjects[1] == obj1
tableOfObjects[2] == obj2
tableOfObjects[3] == obj3
tableOfObjects[1].position.x == 3
tableOfObjects[2].position.x == 1
tableOfObjects[3].position.x == 2
--]]
-- After sorting it would look like this
tableOfObjects = {obj1,obj3,obj2}
--[[
tableOfObjects[1] == obj1
tableOfObjects[2] == obj3
tableOfObjects[3] == obj2
tableOfObjects[1].position.x == 3
tableOfObjects[2].position.x == 2
tableOfObjects[3].position.x == 1
--]]
I hope I am making sense. I am self taught in the last few months!
So basically I have a table of objects and want to sort the objects in that table based on a single value attached to each individual object in the table. In this case the obj.position.x
Thanks!
You need table.sort. The first argument is the table to sort, the second is a function to compare items.
Example:
t = {
  {str = 42, dex = 10, wis = 100},
  {str = 18, dex = 30, wis = 5}
}
table.sort(t, function (k1, k2)
  return k1.str < k2.str
end)
This article has more information
table.sort(tableOfObjects, function(a, b) return a.position.x > b.position.x end)
This line will sort your table tableOfObjects in descending order by the x-coordinate.
To reverse order, replace > by <.
From the Lua reference manual:
table.sort (list [, comp])
Sorts list elements in a given order, in-place, from list[1] to
list[#list]. If comp is given, then it must be a function that
receives two list elements and returns true when the first element
must come before the second in the final order (so that, after the
sort, i < j implies not comp(list[j],list[i])). If comp is not given,
then the standard Lua operator < is used instead.
Note that the comp function must define a strict partial order over
the elements in the list; that is, it must be asymmetric and
transitive. Otherwise, no valid sort may be possible.
The sort algorithm is not stable: elements considered equal by the
given order may have their relative positions changed by the sort.
So in other words, table.sort will by default sort a table in ascending order by its values.
If you want to order descending or by something other than the table value (like the x-coordinate of your table value's position in your case) you have to provide a function that tells Lua which element will come first.
You can create a function that handles this exact thing:
local function fix_table(t)
  local x_data = {};
  local inds = {};
  local rt = {};
  for i = 1, #t do
    x_data[#x_data + 1] = t[i].position.x;
    inds[t[i].position.x] = t[i];
  end
  local min_index = math.min(table.unpack(x_data));
  local max_index = math.max(table.unpack(x_data));
  for i = min_index, max_index do
    if inds[i] ~= nil then
      rt[#rt + 1] = inds[i];
    end
  end
  return rt;
end
local mytable = {obj1, obj2, obj3};
mytable = fix_table(mytable);
fix_table first collects every x value from the given table into x_data, and also stores each object in the table inds with its x value as the key. It then gets the smallest and largest values in x_data, which are used to traverse inds in order. fix_table checks that inds[i] is not nil before appending to the return table rt, so that the values in rt are ordered from least to greatest, starting at index 1 and ending at index #rt. Finally, rt is returned. Note that this only works when the x values are distinct whole numbers: objects with the same x overwrite each other in inds, and fractional values are generally skipped by the loop, which advances in steps of 1.
I hope this helped.

pyspark join rdds by a specific key

I have two RDDs that I need to join together. They look like the following:
RDD1
[(u'2', u'100', 2),
(u'1', u'300', 1),
(u'1', u'200', 1)]
RDD2
[(u'1', u'2'), (u'1', u'3')]
My desired output is:
[(u'1', u'2', u'100', 2)]
So I would like to select the rows where the second value of an RDD2 tuple equals the first value of an RDD1 tuple. I have tried join and also cartesian, but neither works or gets even close to what I am looking for. I am new to Spark and would appreciate any help.
Thanks
DataFrame: if you are allowed to use Spark DataFrames in the solution, you can turn the given RDDs into DataFrames and join on the corresponding column.
df1 = spark.createDataFrame(rdd1, schema=['a', 'b', 'c'])
df2 = spark.createDataFrame(rdd2, schema=['d', 'a'])
rdd_join = df1.join(df2, on='a')
out = rdd_join.rdd.collect()
RDD: just move the key that you want to join on to the first element of each tuple, and simply use join to do the joining.
rdd1_zip = rdd1.map(lambda x: (x[0], (x[1], x[2])))
rdd2_zip = rdd2.map(lambda x: (x[1], x[0]))
rdd_join = rdd1_zip.join(rdd2_zip)
rdd_out = rdd_join.map(lambda x: (x[0], x[1][0][0], x[1][0][1], x[1][1])).collect() # flatten the rdd
print(rdd_out)
To me, your process looks too manual. Here is some sample code:
rdd = sc.parallelize([(u'2', u'100', 2),(u'1', u'300', 1),(u'1', u'200', 1)])
rdd1 = sc.parallelize([(u'1', u'2'), (u'1', u'3')])
newRdd = rdd1.map(lambda x:(x[1],x[0])).join(rdd.map(lambda x:(x[0],(x[1],x[2]))))
newRdd.map(lambda x:(x[1][0], x[0], x[1][1][0], x[1][1][1])).coalesce(1).collect()
OUTPUT:-
[(u'1', u'2', u'100', 2)]
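The re-keying step is easier to follow without Spark in the way. The sketch below reproduces the same idea on plain Python lists: move the join key to the front of each tuple, join on it, then flatten. The helper join_by_key is hypothetical and only imitates the inner-join behaviour of RDD.join on (key, value) pairs; it is not how Spark implements it.

```python
from collections import defaultdict

rdd1 = [("2", "100", 2), ("1", "300", 1), ("1", "200", 1)]
rdd2 = [("1", "2"), ("1", "3")]


def join_by_key(left_pairs, right_pairs):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for key, value in right_pairs:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left_pairs
        for right_value in index.get(key, [])
    ]


# Key rdd2 by its second element and rdd1 by its first element.
left = [(row[1], row[0]) for row in rdd2]
right = [(row[0], (row[1], row[2])) for row in rdd1]

joined = join_by_key(left, right)
# Flatten (key, (d, (b, c))) into the desired (d, key, b, c) layout.
result = [(d, key, b, c) for key, (d, (b, c)) in joined]
print(result)  # [('1', '2', '100', 2)]
```

Only the key "2" appears on both sides, so the tuple keyed by "3" and the two tuples keyed by "1" drop out of the inner join.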

JOIN two data set on the basis of string matching condition in Pig

I am new to Pig and I have two data sets, "highspender" and "feedback".
Highspender:
Price,fname,lname
$50,Jack,Brown
$30,Rovin,Pall
Feedback:
date,Name,rate
2015-01-02,Jack B Brown,5
2015-01-02,Pall,4
Now I have to join these two datasets on the basis of name. My condition is that the fname or lname of Highspender should match the Name of Feedback. How can I join these two datasets? Any ideas?
You can try the script below; all you need to do is replace the names according to your data.
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
out = JOIN highs BY fname, feedback BY Name;
out1 = JOIN highs BY lname, feedback BY Name;
final_out = UNION out,out1;
For further help you can refer to the Pig reference manual.
EDIT
As per the comment, a script for joining the data with a string function is as below:
highs = LOAD 'highs' using PigStorage(',') as (Price:chararray,fname:chararray,lname:chararray);
feedback = LOAD 'feeds' using PigStorage(',') as (date:chararray,Name:chararray,rate:chararray);
crossout = cross highs, feedback;
final_lname = filter crossout by ( REPLACE (feedback::Name,highs::lname ,'') != feedback::Name);
final_fname = filter crossout by ( REPLACE (feedback::Name,highs::fname ,'') != feedback::Name);
final = UNION final_lname, final_fname;
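The cross-then-filter idea can be checked against the sample data with a few lines of plain Python. This is a sketch only, with a hypothetical helper name_match_join; it uses a single "fname or lname" test per pair, so a pair where both names match (such as Jack Brown and "Jack B Brown") is emitted once. In the Pig script the two filtered relations are combined with UNION, which does not remove duplicates, so such a pair would appear twice there unless a DISTINCT step is added.

```python
highspender = [("$50", "Jack", "Brown"), ("$30", "Rovin", "Pall")]
feedback = [("2015-01-02", "Jack B Brown", 5), ("2015-01-02", "Pall", 4)]


def name_match_join(spenders, feedbacks):
    """Pair each spender with every feedback row whose Name contains fname or lname."""
    out = []
    for price, fname, lname in spenders:
        for date, name, rate in feedbacks:
            # Substring test on either name part, evaluated once per pair
            if fname in name or lname in name:
                out.append((price, fname, lname, date, name, rate))
    return out


joined = name_match_join(highspender, feedback)
for row in joined:
    print(row)
```

Like the CROSS in the Pig script, this compares every spender with every feedback row, so the cost grows with the product of the two input sizes.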
