Counting items by substrings in a Multiset in Java

I have the following data in a guava Multiset. Each item is the combined string of 3 items separated by a ':'. I know all the values for each of the slots. I'm using the values to generate a data file for an interactive graph (by stuffing the split values into an object and then using Gson to print the object).
What's the best way to grab the cumulative count for all items that match just one, one:two, or one:two:three of the substrings? I keep going round and round with streams, forEach, maps and filters, but can't seem to write an elegant set of loops. Any suggestions or examples would be helpful.
Executive:Healthcare:United States x 5
Executive:Healthcare:Malaysia x 2
Executive:Financials:United States x 1
FinancialHealth:Technology:Malaysia x 3
FinancialHealth:Technology:United States x 2
FinancialHealth:Energy:United States x 1
Executive = 8
FinancialHealth = 6
Executive:Healthcare = 7
Executive:Financials = 1
FinancialHealth:Technology = 5
FinancialHealth:Energy = 1
Executive:Healthcare:United States = 5
etc.

Streams can help a great deal here, and it is not even difficult.
We need to take three steps in a stream:
allTheStrings.stream()
// First, we will multiply each string "A:B:C" using `flatMap`
// so that the stream contains "A", "A:B", and "A:B:C":
.flatMap(s -> Stream.of(s.substring(0, s.indexOf(":")),
s.substring(0, s.lastIndexOf(":")),
s))
// next, we are going to summarize multiple occurrences
// of the strings using a groupingBy collector:
.collect(Collectors.groupingBy(Function.identity(),
// This would return a Map<String, List<String>> containing each unique
// string mapped to its occurrences. But because you don't need the
// single occurrences, but instead just their number, we add a step
// to the collect which will make it return a Map<String, Long>
Collectors.counting()))
So, as a full example:
Stream.of("Executive:Healthcare:United States", "Executive:Healthcare:United States",
"Executive:Healthcare:United States", "Executive:Healthcare:United States",
"Executive:Healthcare:United States", "Executive:Healthcare:Malaysia",
"Executive:Healthcare:Malaysia", "Executive:Financials:United States",
"FinancialHealth:Technology:Malaysia", "FinancialHealth:Technology:Malaysia",
"FinancialHealth:Technology:Malaysia", "FinancialHealth:Technology:United States",
"FinancialHealth:Technology:United States", "FinancialHealth:Energy:United States")
.flatMap(s -> Stream.of(s.substring(0, s.indexOf(":")), s.substring(0, s.lastIndexOf(":")), s))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
.entrySet()
.forEach(System.out::println);
will output
Executive=8
Executive:Healthcare=7
FinancialHealth:Technology=5
FinancialHealth=6
FinancialHealth:Energy=1
FinancialHealth:Technology:Malaysia=3
FinancialHealth:Energy:United States=1
Executive:Healthcare:United States=5
Executive:Financials:United States=1
FinancialHealth:Technology:United States=2
Executive:Healthcare:Malaysia=2
Executive:Financials=1
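Since a Multiset already stores a count per distinct item, the same totals can be built from the counts directly instead of streaming every single occurrence. The following is only a sketch under some assumptions: a plain Map<String, Integer> stands in for the Multiset (with Guava, the same loop would run over multiset.entrySet() using getElement() and getCount()), the class and method names are made up for illustration, and every item is assumed to contain exactly two ':' separators.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class PrefixCounts {

    // Adds each item's count to the totals of its "A", "A:B" and "A:B:C" prefixes.
    // `counts` is a hypothetical stand-in for the Multiset: item -> occurrences.
    static Map<String, Integer> countPrefixes(Map<String, Integer> counts) {
        Map<String, Integer> totals = new TreeMap<>();
        counts.forEach((item, n) -> {
            // Assumes exactly two ':' in every item; otherwise substring would throw.
            totals.merge(item.substring(0, item.indexOf(':')), n, Integer::sum);
            totals.merge(item.substring(0, item.lastIndexOf(':')), n, Integer::sum);
            totals.merge(item, n, Integer::sum);
        });
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Executive:Healthcare:United States", 5);
        counts.put("Executive:Healthcare:Malaysia", 2);
        counts.put("Executive:Financials:United States", 1);
        // Prints one "prefix=total" entry per line, sorted by prefix.
        countPrefixes(counts).entrySet().forEach(System.out::println);
    }
}
```

The TreeMap is used only to get the entries in sorted order; a HashMap would do if the order does not matter.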


How do I sort a simple Lua table alphabetically?

I have already seen many threads with examples of how to do this, the problem is, I still can't do it.
All the examples have tables with extra data. For example somethings like this
lines = {
luaH_set = 10,
luaH_get = 24,
luaH_present = 48,
}
or this,
obj = {
{ N = 'Green1' },
{ N = 'Green' },
{ N = 'Sky blue99' }
}
I can code in a few languages but I'm very new to Lua, and tables are really confusing to me. I can't seem to work out how to adapt the code in the examples to be able to sort a simple table.
This is my table:
local players = {"barry", "susan", "john", "wendy", "kevin"}
I want to sort these names alphabetically. I understand that Lua tables don't preserve order, and that's what's confusing me. All I essentially care about doing is just printing these names in alphabetical order, but I feel I need to learn this properly and know how to index them in the right order to a new table.
The examples I see are like this:
local function cmp(a, b)
a = tostring(a.N)
b = tostring(b.N)
local patt = '^(.-)%s*(%d+)$'
local _,_, col1, num1 = a:find(patt)
local _,_, col2, num2 = b:find(patt)
if (col1 and col2) and col1 == col2 then
return tonumber(num1) < tonumber(num2)
end
return a < b
end
table.sort(obj, cmp)
for i,v in ipairs(obj) do
print(i, v.N)
end
or this:
function pairsByKeys (t, f)
local a = {}
for n in pairs(t) do table.insert(a, n) end
table.sort(a, f)
local i = 0 -- iterator variable
local iter = function () -- iterator function
i = i + 1
if a[i] == nil then return nil
else return a[i], t[a[i]]
end
end
return iter
end
for name, line in pairsByKeys(lines) do
print(name, line)
end
and I'm just absolutely thrown by this as to how to do the same thing for a simple 1D table.
Can anyone please help me to understand this? I know if I can understand the most basic example, I'll be able to teach myself these harder examples.
local players = {"barry", "susan", "john", "wendy", "kevin"}
-- sort ascending, which is the default
table.sort(players)
print(table.concat(players, ", "))
-- sort descending
table.sort(players, function(a,b) return a > b end)
print(table.concat(players, ", "))
Here's why:
Your table players is a sequence.
local players = {"barry", "susan", "john", "wendy", "kevin"}
Is equivalent to
local players = {
[1] = "barry",
[2] = "susan",
[3] = "john",
[4] = "wendy",
[5] = "kevin",
}
If you do not provide keys in the table constructor, Lua will use integer keys automatically.
A table like that can be sorted by its values. Lua simply rearranges the values among the integer indices according to the return value of the compare function. By default this is
function (a,b) return a < b end
If you want any other order, you need to provide a function that returns true if element a comes before b.
Read this https://www.lua.org/manual/5.4/manual.html#pdf-table.sort
table.sort
Sorts the list elements in a given order, in-place, from list[1] to
list[#list]
This example is not a "list" or sequence:
lines = {
luaH_set = 10,
luaH_get = 24,
luaH_present = 48,
}
Which is equivalent to
lines = {
["luaH_set"] = 10,
["luaH_get"] = 24,
["luaH_present"] = 48,
}
It only has strings as keys, so it has no order. You need a helper sequence to impose some order on that table's elements.
The second example
obj = {
{ N = 'Green1' },
{ N = 'Green' },
{ N = 'Sky blue99' }
}
which is equivalent to
obj = {
[1] = { N = 'Green1' },
[2] = { N = 'Green' },
[3] = { N = 'Sky blue99' },
}
Is a list. So you could sort it. But sorting it by table values wouldn't make too much sense. So you need to provide a function that gives you a reasonable way to order it.
Read this so you understand what a "sequence" or "list" is in this regard. Those names are used for other things as well. Don't let it confuse you.
https://www.lua.org/manual/5.4/manual.html#3.4.7
It is basically a table that has consecutive integer keys starting at 1.
Understanding this difference is one of the most important concepts while learning Lua. The length operator, ipairs and many functions of the table library only work with sequences.
This is my table:
local players = {"barry", "susan", "john", "wendy", "kevin"}
I want to sort these names alphabetically.
All you need is table.sort(players)
I understand that LUA tables don't preserve order.
Order of fields in a Lua table (a dictionary with arbitrary keys) is not preserved.
But your Lua table is an array, it is self-ordered by its integer keys 1, 2, 3,....
To clear up the confusion regarding "not preserving order": what is not preserved is the order of the keys in the table, in particular string keys, i.e. when you use the table as a dictionary and not as an array. If you write myTable = {orange="hello", apple="world"} then the fact that you defined key orange to the left of key apple isn't stored. If you enumerate keys/values using for k, v in pairs(myTable) do print(k, v) end then the order is unspecified: you might get apple world before orange hello or the other way round, so you should not rely on either.
You don't have this problem with numeric keys, which is what the keys will be by default if you don't specify them: myTable = {"hello", "world", foo="bar"} is the same as myTable = {[1]="hello", [2]="world", foo="bar"}, i.e. it will assign myTable[1] = "hello", myTable[2] = "world" and myTable.foo = "bar" (same as myTable["foo"]). Here, even if you got the numeric keys in a random order (which you don't), it wouldn't matter, since you could still loop through them by incrementing an index.
You can use table.sort which, if no order function is given, will sort the values using < so in case of numbers the result is ascending numbers and in case of strings it will sort by ASCII code:
local players = {"barry", "susan", "john", "wendy", "kevin"}
table.sort(players)
-- players is now {"barry", "john", "kevin", "susan", "wendy"}
This will however fall apart if you have mixed lowercase and uppercase entries, because uppercase letters go before lowercase ones due to having lower ASCII codes. It also won't work properly with non-ASCII characters like umlauts (they will go last): by default it is effectively a character-code comparison, not a dictionary-style alphabetical sort.
You can however supply your own ordering function, which receives arguments (a, b) and needs to return true if a should come before b. Here is an example that fixes the lowercase/uppercase issue by converting to uppercase before comparing:
table.sort(players, function (a, b)
return string.upper(a) < string.upper(b)
end)

Confusion on "Immutable" lists in F#

Totally newbie question.
I'm still trying to get a grasp on "Immutable" lists in F#.
Lets assume I have a "Visit" defined as:
module Visit =
type Model =
{
Name: string
}
let init row col cel =
{
Name = sprintf "My Name is row: %d col: %d cel: %d" row col cel
}
Now I define a "Cell" that may or may not have one visit:
module Cell =
type Model =
{
Visit: Visit.Model option
}
let setVisit m =
{ m with Visit = Some( Visit.init 9 9 9) }
and lastly I define a "Column" that has a list of cells:
module Column =
type Model =
{
Appointments: Cell.Model list
}
let updateCell (m:Model) =
let newList = m.Appointments |> List.mapi (fun index cell -> if index = 2 then Cell.setVisit cell else cell)
{m with Appointments = newList }
In the Column module, the "updateCell" function is wired to call Cell.setVisit for the 3rd cell. My intent is to replace the "Name" of the "Visit" held by the 3rd cell. My simple questions are:
Is this the correct way to do this?
If I am replacing the Appointments list, is this not changing the "Column" holding the Appointment List? (The Column is immutable, right? ).
Sorry for my confusion.
TIA
First: yes, this is an acceptable, if inefficient way of doing it for lists. Note that you're rebuilding the whole list on every updateCell call, even though most elements in it are the same.
I don't know how many appointments you expect to have in your model in practice. If it's significantly more than three, and if you're always updating the third one, it would be more efficient to cut the list, then glue it back together:
let newList = List.take 2 m.Appointments @ [Cell.setVisit m.Appointments.[2]] @ List.skip 3 m.Appointments
This way only the first three elements are rebuilt, and the tail of the list is reused.
However, if you need random-access operations, may I suggest using arrays instead? Sure, they're mutable, but they offer much better performance for random-access operations.
Second: no, the syntax { m with ... } does not change the Column. Instead, it creates a new column - a copy of m, but with all fields listed after with updated to new values.

How to filter out Flux<Example> that don't contain some value from Flux<String>

So let's say I have a Flux<String> firstLetters containing "A", "B", "C", "D" and Flux<String> lastLetters containing "X", "Y", "Z"
And I have a Flux containing many:
data class Example(val name: String)
From the whole Flux<Example> I want to split the elements into two variables: one Flux<Example> containing all elements whose name is IN ("A", "B", "C", "D") and a second Flux<Example> whose name is IN ("X", "Y", "Z").
Is it possible to do so in one flow, without applying the same logic first for firstLetters and then for lastLetters?
Is it possible to do so in one flow without doing same logic first for firstLetters and then for lastLetters
As the problem stands I don't believe so, as you'll have to process each element multiple times (one per each value on the list to see if it contains the value you need.) You can call cache() on the Flux though to ensure that the values are only retrieved once, or convert to another data structure entirely.
Given that you have to re-evaluate anyway, and assuming you still want to stick with raw Flux objects, filterWhen() and any() can be used quite nicely here:
Flux<Example> firstNames = names.filterWhen(e -> firstLetters.any(e.name::contains));
Flux<Example> lastNames = names.filterWhen(e -> lastLetters.any(e.name::contains));
You can of course pull the Predicate out into a separate method if you're concerned about code duplication there.
If Flux<String> firstLetters/lastLetters can be replaced with Set<String> firstLetters/lastLetters then you can easily leverage Flux::groupBy method on Flux<Example> to split it into different groups.
enum Group {
FIRST, LAST, UNDEFINED
}
Group toGroup(Example example) {
if (firstLetters.contains(example.name)) return Group.FIRST;
else if (lastLetters.contains(example.name)) return Group.LAST;
else return Group.UNDEFINED;
}
Flux<GroupedFlux<Group, Example>> group(Flux<Example> examples) {
return examples.groupBy(example -> toGroup(example));
}
You can then get the group by calling GroupedFlux<K, V>::key.
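The classification step itself does not depend on Reactor, so it can be tried out with plain java.util.stream first. The sketch below is only an illustration under assumptions: plain String names stand in for Example, the class name and the constants are made up, and Collectors.groupingBy plays the role that Flux::groupBy plays in the reactive version.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class GroupByLetters {

    enum Group { FIRST, LAST, UNDEFINED }

    // Hypothetical stand-ins for firstLetters / lastLetters held as Sets.
    static final Set<String> FIRST_LETTERS = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
    static final Set<String> LAST_LETTERS = new HashSet<>(Arrays.asList("X", "Y", "Z"));

    // Same decision as toGroup(Example) in the answer, applied to a bare name.
    static Group toGroup(String name) {
        if (FIRST_LETTERS.contains(name)) return Group.FIRST;
        if (LAST_LETTERS.contains(name)) return Group.LAST;
        return Group.UNDEFINED;
    }

    // One pass over the input; each name lands in exactly one group.
    static Map<Group, List<String>> split(List<String> names) {
        return names.stream().collect(Collectors.groupingBy(GroupByLetters::toGroup));
    }

    public static void main(String[] args) {
        System.out.println(split(Arrays.asList("A", "X", "Q", "B")));
    }
}
```

Groups that receive no element are simply absent from the resulting map, so callers may want getOrDefault when reading it.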

How to summarize values by key of an Erlang map list?

This is what my list of maps looks like:
Map = [#{votes=>3, likes=>20, views=> 100},#{votes=>0, likes=>1, views=> 70},#{votes=>1, likes=>14, views=> 2000}].
I would like to return a summary of all map entries. I have attempted to solve this with fun()s but the logic does not make sense, and I only got non-executable code.
The problem is that one cannot change variables in Erlang, otherwise this would work:
Summary = #{
votes=>0,
likes=>0,
views=>0,
},
[maps:update(Key, maps:get(Key, MapItem) + maps:get(Key, Summary), Summary) || MapItem <- Map, Key <- [votes, likes, views]].
How ought one go about this and successfully summarize the values of a list of maps?
The functions of fold family are designed to be used in such situations. In your case the following code calculates the map containing totals of entries in maps in the list:
MapsList = [#{votes=>3, likes=>20, views=> 100},
#{votes=>0, likes=>1, views=> 70},
#{votes=>1, likes=>14, views=> 2000}],
Summary = lists:foldl(fun (Map, AccL) ->
maps:fold(fun (Key, Value, Acc) ->
Acc#{Key => Value + maps:get(Key, Acc, 0)}
end, AccL, Map)
end, #{}, MapsList).
Summary value is the map #{votes => 4, likes => 35, views => 2170}.
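For readers comparing languages, the same "fold every entry into an accumulator, adding values that share a key" shape can be sketched in Java. This is an analogy only, not a translation of the Erlang code: the class and method names are made up, and Collectors.toMap with a merge function takes the place of the nested folds.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SumMaps {

    // Flattens all maps into one stream of entries and sums values per key.
    static Map<String, Integer> summarize(List<Map<String, Integer>> maps) {
        return maps.stream()
                .flatMap(m -> m.entrySet().stream())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    // Small helper to build one input map (hypothetical, for the demo only).
    static Map<String, Integer> row(int votes, int likes, int views) {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("votes", votes);
        m.put("likes", likes);
        m.put("views", views);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(summarize(Arrays.asList(row(3, 20, 100), row(0, 1, 70), row(1, 14, 2000))));
    }
}
```

As in the Erlang version, keys that are missing from some of the maps are handled naturally, because a key only contributes where it is present.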

Spark join hangs

I have a table with n columns that I'll call A. In this table there are three columns that I'll need:
vat -> String
tax -> String
card -> String
vat or tax can be null, but not at the same time.
For every unique couple of vat and tax there is at least one card.
I need to alter this table, adding a column count_card in which I put a text based on the number of cards every unique combination of tax and vat has.
So I've done this:
val cardCount = A.groupBy("tax", "vat").count
val sqlCard = udf((count: Int) => {
if (count > 1)
"MULTI"
else
"MONO"
})
val B = cardCount.withColumn(
"card_count",
sqlCard(cardCount.col("count"))
).drop("count")
In the table B I have three columns now:
vat -> String
tax -> String
card_count -> String
and every operation on this DataFrame is smooth.
Now, because I wanted to bring the new column into table A, I performed the following join:
val result = A.join(B,
B.col("tax")<=>A.col("tax") and
B.col("vat")<=>A.col("vat")
).drop(B.col("tax"))
.drop(B.col("vat"))
Expecting to have the original table A with the column card_count.
The problem is that the join hangs, consuming all system resources and freezing the PC.
Additional details:
Table A has ~1.5M elements and is read from parquet file;
Table B has ~1.3M elements.
The system has 8 threads and 30 GB of RAM.
Let me know what I'm doing wrong.
In the end, I didn't find out what the issue was, so I changed approach:
val cardCount = A.groupBy("tax", "vat").count
val cardCountSet = cardCount.filter(cardCount.col("count") > 1)
.rdd.map(r => r(0) + " " + r(1)).collect().toSet
val udfCardCount = udf((tax: String, vat:String) => {
if (cardCountSet.contains(tax + " " + vat))
"MULTI"
else
"MONO"
})
val result = A.withColumn("card_count",
udfCardCount(A.col("tax"), A.col("vat")))
If someone knows a better approach, please let me know.
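The logic of this workaround, taken out of Spark, can be sketched in plain Java. This is an illustration of the counting and labelling only and says nothing about why the join hung. The class name and the row layout (a String[] holding tax, vat, card) are assumptions made for the sketch. One deliberate difference: the key is a two-element list rather than tax + " " + vat, because a concatenated string could in principle collide when the values themselves contain spaces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CardCount {

    // Each row is assumed to be {tax, vat, card}; tax or vat may be null.
    static List<String> label(List<String[]> rows) {
        // Count rows per (tax, vat) pair. Arrays.asList tolerates null elements,
        // and the list itself is never null, so it is a valid grouping key.
        Map<List<String>, Long> counts = rows.stream()
                .collect(Collectors.groupingBy(r -> Arrays.asList(r[0], r[1]), Collectors.counting()));
        // Label every row by looking up the count of its own pair.
        return rows.stream()
                .map(r -> counts.get(Arrays.asList(r[0], r[1])) > 1 ? "MULTI" : "MONO")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"t1", null, "c1"},
                new String[] {"t1", null, "c2"},
                new String[] {null, "v1", "c3"});
        System.out.println(label(rows));
    }
}
```

Like the original groupBy(...).count, this counts rows per pair rather than distinct cards.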
