Ecto multiple streams in 1 transaction - stream

Background
PS: the following describes a hypothetical scenario, where I own a company that sells things to customers.
I have an Ecto query that returns so many results that my machine cannot handle it. With billions of rows returned, there is probably not enough RAM in the world to hold them.
The solution here (or so my research indicates) is to use streams. Streams were made for potentially infinite sets of results, which would fit my use case.
https://hexdocs.pm/ecto/Ecto.Repo.html#c:stream/2
Problem
So let's imagine that I want to delete all users that bought a given item. Maybe that item was not really legal in their country, and now I, the poor guy in IT, have to fix things so the world doesn't come crashing down.
Naive way:
item_id = "123asdasd123"
purchase_ids =
Purchases
|> where([p], p.item_id == ^item_id)
|> select([p], p.id)
|> Repo.all()
Users
|> where([u], u.purchase_id in ^purchase_ids)
|> Repo.delete_all()
This is the naive way. I call it naive because of 2 issues:
1. We have so many purchases that the machine's memory will overflow (see the purchase_ids query).
2. purchase_ids will likely have more than 100K ids, so the second query (where we delete things) will fail as it hits the Postgres parameter limit of 32K: https://stackoverflow.com/a/42251312/1337392
What can I say, our product is highly addictive and very well priced!
Our customers simply can't get enough of it. Don't know why. Nope. No reason comes to mind. None at all.
With these problems in mind, I cannot help my customers and grow my [s]empire[/s], I mean, little home owned business.
I did find this possible solution:
Stream way:
item_id = "123asdasd123"
purchase_ids =
  Purchases
  |> where([p], p.item_id == ^item_id)
  |> select([p], p.id)
stream = Repo.stream(purchase_ids)
Repo.transaction(fn ->
  ids = Enum.to_list(stream)
  Users
  |> where([u], u.purchase_id in ^ids)
  |> Repo.delete_all()
end)
Questions
However, I am not convinced this will work:
I am using Enum.to_list and saving everything into a variable, which places everything into memory again. So I am not gaining any advantage by using Repo.stream.
I still have too many ids for my Repo.delete_all to work without blowing up.
I guess the one advantage here is that this is now a transaction, so either everything goes or nothing goes.
So, the following questions arise:
How do I properly make use of streams in this scenario?
Can I delete items by streaming parameters (ids) or do I have to manually batch them?
Can I stream ids to Repo.delete_all ?

One cannot feed a stream directly to Repo.delete_all/1, but Stream.chunk_every/2 is your friend here. One can do something like the following (500 is the default value for :max_rows, hence we'd use it in chunk_every/2 as well):
Repo.transaction(fn ->
  max_rows = 500
  # purchase_ids is the query from the question, not a list of ids
  purchase_ids
  |> Repo.stream(max_rows: max_rows)
  |> Stream.chunk_every(max_rows)
  |> Stream.each(fn ids ->
    Users
    |> where([u], u.purchase_id in ^ids)
    |> Repo.delete_all()
  end)
  |> Stream.run()
end, timeout: :infinity)
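
To see why chunking keeps memory bounded, here is a minimal, language-neutral sketch in Python of the same idea. It is not Ecto code: `chunk_every`, `delete_in_batches` and the `delete_batch` callback are hypothetical names used only for illustration, and the callback stands in for the per-chunk `Repo.delete_all/1` call above. Only one chunk of ids exists in memory at any time.

```python
from itertools import islice


def chunk_every(iterable, size):
    """Yield lists of at most `size` items, consuming `iterable` lazily."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def delete_in_batches(ids, batch_size, delete_batch):
    """Call `delete_batch` once per chunk and return the number of batches."""
    batches = 0
    for chunk in chunk_every(ids, batch_size):
        delete_batch(chunk)  # hypothetical stand-in for one bounded DELETE
        batches += 1
    return batches
```

With 1201 ids and a batch size of 500, the callback runs three times, on chunks of 500, 500 and 201 ids, so each statement stays well under any parameter limit.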

I know this isn't an answer to your question about using streams, but in this scenario, streams might not be necessary depending on the amount of data you are trying to delete. You should be able to delete all of these matching users with a subquery in one query without passing any variables into memory aside from the item_id:
sq =
Purchases
|> where(item_id: ^item_id)
|> select([p], p.id)
Users
|> where([u], u.purchase_id in subquery(sq))
|> Repo.delete_all(timeout: :infinity)

Related

How to remove an item from index N with FSharpx's PersistentVector?

I notice that the PersistentVector from FSharpX has no remove at index method.
https://fsprojects.github.io/FSharpx.Collections/reference/fsharpx-collections-persistentvector-1.html
It has the ability to modify the item at the nth location but no ability to remove it. This seems like a strange omission. If it is not possible then can somebody suggest a different immutable persistent collection that has this ability.
My current code for removing the item at index id from the vector is brute force:
state
|> Seq.indexed
|> Seq.where ( fun (_id,_)->id<>_id)
|> Seq.map (fun (_,p)->p)
|> PersistentVector.ofSeq
Note that I'm trying to use PersistentVector as a backing store for a UI. I'm experimenting with https://github.com/JaggerJo/Avalonia.FuncUI which is an Elmish port for Avalonia. I got quite far and then wanted to add a delete button on a row and I can't find a way to update my backing store. :(
Example code for the UI is
https://gist.github.com/bradphelan/77f3fcb8e660783790c5610290cd8d97
I do not think any of the collections in FSharpx.Collections support delete, except PersistentHashMap, which has a remove method.
I'm sure there are some collections in FSharpx.Experimental.Collections that also support remove/delete. These may not be useful for you.
You might want to look at this new project for what you are attempting https://github.com/fsprojects/FSharp.Data.Adaptive

Optimizing Group By in Flux

I have a measurement with a few million rows of data containing information about around 20 thousand websites.
show tag keys from site_info:
domain
proxy
http_response_code
show field keys from site_info:
responseTime
uuid
source
What I want to do is count all of the uuid's for each website over a given time frame. I have tried writing a query like this one:
from(bucket: "telegraf/autogen")
|> range($range)
|> filter(fn: (r) =>
    r._measurement == "site_info" and
    r._field == "uuid")
|> group(columns:["domain"])
|> count()
However, this query takes up to 45 minutes to run for a time range of just now()-6h (presumably because I am trying to group the data into 20k+ buckets).
Any suggestions on how to optimize the query to not take such extended amounts of time without altering the data schema?
I think that, for the time being, Flux's integration with the InfluxDB datastore is just not optimized at all. They announced that performance tuning should start in the beta phase.

How do I select a random element from an ets set in Erlang/Elixir?

I have a large number of processes that I need to keep track of in an ets set, and then randomly select single processes. So I created the set like this:
:ets.new(:pid_lookup, [:set, :protected, :named_table])
then for argument's sake let's just stick self() in it 1000 times:
Enum.map 1..1000, fn x -> :ets.insert(:pid_lookup, {x, self()}) end
Now I need to select one at random. I know I could just select a random one using :ets.lookup(:pid_lookup, :rand.uniform(1000)), but what if I don't know the size of the set (in the above case, 1000) in advance?
How do I find out the size of an ets set? And/or is there a better way to choose a random pid from an ets data structure?
If the keys are sequential numbers:
tab = :ets.new(:tab, [])
Enum.each(1..1000, & :ets.insert(tab, {&1, :value}))
size = :ets.info(tab, :size)
# size = 1000
value_picked_randomly = :ets.lookup(tab, Enum.random(1..size))
:ets.info(tab, :size) returns the size of a table, which is the number of records inserted into that table.
If you don't know what the keys are:
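first = :ets.first(tab)
:ets.lookup(tab, first)
# An anonymous function cannot call itself by name, so it receives itself as an argument.
# function_that_may_return_true/0 is a placeholder for your own random decision.
func = fn func, key ->
  if function_that_may_return_true() do
    case :ets.next(tab, key) do
      :"$end_of_table" -> throw(:reached_end_of_table)
      next_key -> func.(func, next_key)
    end
  else
    :ets.lookup(tab, key)
  end
end
func.(func, first)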
func iterates over the ets table and returns the record at which it randomly decides to stop.
This is time consuming, so it is not an ideal solution for tables with a large number of records.
As I understood from the comments, this is an XY Problem.
What you essentially need is to keep track of a changing list and pick one of its elements at random. ETS in general, and :set tables in particular, are by no means intended to be queried for size. They serve different purposes.
Spawn an Agent within your supervision tree, holding the list of PIDs of already started servers and use Kernel.length/1 to query its size, or even use Enum.random/1 if the list is not really huge (the latter traverses the whole enumerable to get a random element.)
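
For picking a uniformly random element when the size is not known up front, one possible approach is single-pass reservoir sampling with a reservoir of one. The Python sketch below only illustrates that technique; `pick_random` is a hypothetical helper, and this is not a claim about how Enum.random/1 or ETS work internally.

```python
import random


def pick_random(iterable, rng=random):
    """Return one uniformly chosen item from `iterable`, or None if it is empty."""
    chosen = None
    for count, item in enumerate(iterable, start=1):
        # Keep the new item with probability 1/count, so every item
        # ends up selected with the same overall probability.
        if rng.randrange(count) == 0:
            chosen = item
    return chosen
```

Like the traversal described above, this still visits every element once, so it trades time for not having to know the size or the keys in advance.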

In a doubly linked list, How many pointers are affected on an insertion operation?

I had an interview yesterday. As it started, the first thing that the interviewer asked was
" In a doubly linked list, How many pointers will be affected on an insertion operation ? "
Since he didn't specifically ask where to insert, I replied that it depends on how many nodes there are in the DLL.
The total number of pointers affected depends on whether the list is empty or not and on where the insertion takes place.
But he didn't say whether I had convinced him or not.
Was I correct, or did I miss something?
I think the answer depends on whether we are inserting the new node in the middle of the list (surrounded by two nodes), or at the head or tail of the list.
For insertions in the middle of the list, we splice in a new node as follows:
A --- B
^^ splice M in here
A.next = M
M.prev = A
B.prev = M
M.next = B
Hence four pointer assignments take place. However, if the insertion is at the head or tail, then only two pointer assignments are needed:
TAIL (insert M afterward)
TAIL.next = M
M.prev = TAIL
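
The two cases can be checked with a small Python sketch that counts pointer writes. `Node` and `insert_after` are hypothetical names for illustration, and the sketch assumes a new node starts with both pointers set to None, so a tail insertion does not need to write M.next. It also ignores any separate head or tail reference the list object itself may keep.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def insert_after(node, new):
    """Splice `new` in right after `node`; return the number of pointers written."""
    successor = node.next
    node.next = new
    new.prev = node
    writes = 2
    if successor is not None:
        # Middle insertion: also link the new node with the old successor.
        new.next = successor
        successor.prev = new
        writes += 2
    return writes
```

Inserting after the tail returns 2, while inserting between two existing nodes returns 4, matching the counts above.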

Getting lots of data from Mnesia - fastest way

I have a record:
-record(bigdata, {mykey,some1,some2}).
Is doing a
mnesia:match_object({bigdata, mykey, some1,'_'})
the fastest way of fetching more than 5000 rows?
Clarification:
Creating "custom" keys is an option (so I can do a read), but is doing 5000 reads faster than one match_object on a single key?
I'm curious as to the problem you are solving, how many rows are in the table, etc., without that information this might not be a relevant answer, but...
If you have a bag, then it might be better to use read/2 on the key and then traverse the list of records being returned. It would be best, if possible, to structure your data to avoid selects and match.
In general select/2 is preferred to match_object as it tends to better avoid full table scans. Also, dirty_select is going to be faster than select/2, assuming you do not need transactional support. And, if you can live with the constraints, Mnesia allows you to go against the underlying ets table directly, which is very fast, but look at the documentation, as it is appropriate only in very rarefied situations.
Mnesia is more of a key-value storage system, and it will traverse all its records to find a match.
To fetch quickly, you should design the storage structure to directly support the query: make some1 the key or an index, then fetch the records with read or index_read.
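
The difference between matching over every record and reading through a key or index can be sketched in plain Python. This is only an analogy under stated assumptions: the tuples mimic the `bigdata` record's three fields, and `scan` and `build_index` are hypothetical helpers, not Mnesia calls.

```python
from collections import defaultdict

# Each tuple stands for one record: (mykey, some1, some2).
records = [("k1", "a", 10), ("k2", "b", 20), ("k3", "a", 30)]


def scan(rows, some1):
    """Match by visiting every record: cost grows with the table size."""
    return [row for row in rows if row[1] == some1]


def build_index(rows):
    """Group records by some1 once, so later lookups skip the full scan."""
    index = defaultdict(list)
    for row in rows:
        index[row[1]].append(row)
    return index


index = build_index(records)
```

Both `scan(records, "a")` and `index["a"]` return the same two records, but the indexed lookup does not touch the unrelated ones, which is the point of structuring the data around the query.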
What counts as the fastest way to return more than 5000 rows depends on the problem in question. What is the database structure? What do we want? What is the record structure? After that, it boils down to how you write your read functions. If we are sure about the primary key, we use mnesia:read/1 or mnesia:read/2; if not, it's better and more beautiful to use Query List Comprehensions (QLC). They are more flexible for searching nested records and for complex conditional queries. See usage below:
-include_lib("stdlib/include/qlc.hrl").
-record(bigdata, {mykey,some1,some2}).
%% query list comprehensions
select(Q)->
%% to prevent against nested transactions
%% to ensure it also works whether table
%% is fragmented or not, we will use
%% mnesia:activity/4
case mnesia:is_transaction() of
false ->
F = fun(QH)-> qlc:e(QH) end,
mnesia:activity(transaction,F,[Q],mnesia_frag);
true -> qlc:e(Q)
end.
%% to read by a given field or even several
%% you use a list comprehension and pass the guards
%% to filter those records accordingly
read_by_field(some2,Value)->
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
X#bigdata.some2 == Value]),
select(QueryHandle).
%% selecting by several conditions
read_by_several()->
%% you can pass as many guard expressions
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
X#bigdata.some2 =< 300,
X#bigdata.some1 > 50
]),
select(QueryHandle).
%% Its possible to pass a 'fun' which will do the
%% record selection in the query list comprehension
auto_reader(ValidatorFun)->
QueryHandle = qlc:q([X || X <- mnesia:table(bigdata),
ValidatorFun(X) == true]),
select(QueryHandle).
read_using_auto()->
F = fun({bigdata,SomeKey,_,Some2}) -> true;
(_) -> false
end,
auto_reader(F).
So I think that if you want the fastest way, we need more clarification and detail about the problem. Speed depends on many factors, my dear!

Resources