How broadcast variables are used in dask parallelization

I have some code that applies a map function to a dask bag. I need a lookup dictionary to apply that function, and it doesn't work with client.scatter.
I don't know if I am doing the right thing, because the workers start but then don't do anything. I have tried different configurations based on different examples, but I can't get it to work. Any support would be appreciated.
I know from Spark that you define a broadcast variable and access its content via variable.value inside the function you want to apply. I don't see the same mechanism in dask.
import json
from io import TextIOWrapper
import dask.bag as db

# Function to map
def transform_contacts_add_to_historic_sin(data, historic_dict):
    raw_buffer = ''
    line = json.loads(data)
    if line['timestamp'] > historic_dict['timestamp']:
        raw_buffer = raw_buffer + line['vid']
    return raw_buffer

# main program
# historic_dict is a dictionary previously filled, which is the lookup variable for the map function
# file_records will be a list of json.dump records read from an S3 file
from distributed import Client
client = Client()
historic_dict_scattered = client.scatter(historic_dict, broadcast=True)

file_records = []
raw_data = s3_procedure.read_raw_file(... S3 file.......)
data = TextIOWrapper(raw_data)
for line in data:
    file_records.append(line)

bag_chunk = db.from_sequence(file_records, npartitions=16)
bag_transform = bag_chunk.map(lambda x: transform_contacts_add_to_historic(x), args=[historic_dict_scattered])
bag_transform.compute()

If your dictionary is small, you can just include it directly:

def func(partition, d):
    return ...

my_dict = {...}
b = b.map(func, d=my_dict)

If it's large, then you might want to wrap it up in Dask delayed first:

import dask
my_dict = dask.delayed(my_dict)
b = b.map(func, d=my_dict)

If it's very large, then yes, you might want to scatter it first (though I would avoid this if things work out with either of the approaches above):

[my_dict] = client.scatter([my_dict])
b = b.map(func, d=my_dict)
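For reference, here is a minimal runnable sketch of the first approach applied to the question's setting; the sample records and the contents of historic_dict are invented for illustration:

# A minimal sketch of passing the lookup dict directly to Bag.map.
# The records and the shape of historic_dict are made up for this example.
import json
import dask.bag as db

def transform(data, historic_dict):
    line = json.loads(data)
    if line['timestamp'] > historic_dict['timestamp']:
        return line['vid']
    return ''

historic_dict = {'timestamp': 100}
records = ['{"timestamp": 150, "vid": "a"}',
           '{"timestamp": 50, "vid": "b"}']

bag = db.from_sequence(records, npartitions=2)
print(bag.map(transform, historic_dict=historic_dict).compute())  # ['a', '']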

Related

Using Insert with a large multi-layered table using Lua

I am working on a script for GTA5 and I need to transfer data over to a JS script. So that I don't need to send multiple arrays to JS, I require a single table; the template for the table should appear as below.
The issue I'm having at the moment is in the second section, where I receive all vehicles and loop through each to add it to said 'vehicleTable'. I haven't been able to find a way to use "table.insert" with a multi-layered table.
So far I've tried the following
table.insert(vehicleTable,vehicleTable[class][i][vehicleName])
This seems to store an 'object' (table), so it does not show up when accessed in the later for loop.
Next,
vehicleTable = vehicleTable + vehicleTable[class][i][vehicleName]
This seemed like it was going nowhere, as I either got an error or nothing happened.
Next,
table.insert(vehicleTable,class)
table.insert(vehicleTable[class],i)
table.insert(vehicleTable[class][i],vehicleName)
This one failed on the second line, and I'm unsure why. It didn't even reach the problem I saw coming later, namely that line 3 had no way to specify the "Name" field.
Lastly the current one,
local test = {[class] = {[i]={["Name"]=vehicleName}}}
table.insert(vehicleTable,test)
It works without errors, but ultimately it doesn't file the entry into the existing table; instead it seems to create its own branch, an object within the object.
After about 3 hours of zero progress on this topic, I turn to Stack Overflow for assistance.
local vehicleTable = {
    ["Sports"] = {
        [1] = {["Name"] = "ASS", ["Hash"] = "Asshole2"},
        [2] = {["Name"] = "ASS2", ["Hash"] = "Asshole1"}
    },
    ["Muscle"] = {
        [1] = {["Name"] = "Sedi", ["Hash"] = "Sedina5"}
    },
    ["Compacts"] = {
        [1] = {["Name"] = "MuscleCar", ["Hash"] = "MCar2"}
    },
    ["Sedan"] = {
        [1] = {["Name"] = "Blowthing", ["Hash"] = "Blowthing887"}
    }
}

local vehicles = GetAllVehicleModels();
for i=1, #vehicles do
    local class = vehicleClasses[GetVehicleClassFromName(vehicles[i])]
    local vehicleName = GetLabelText(GetDisplayNameFromVehicleModel(vehicles[i]))
    print(vehicles[i] .. " " .. class .. " " .. vehicleName)
    local test = {[class] = {[i] = {["Name"] = vehicleName}}}
    table.insert(vehicleTable, test)
end

for k in pairs(vehicleTable) do
    print(k)
    -- for v in pairs(vehicleTable[k]) do
    --     print(v .. " " .. #vehicleTable[k])
    -- end
end
If there is no way to add to a table like this, how would I go about organizing all of this without needing to send a million (hash, name, etc...) requests to JS?
Any recommendations or support would be much appreciated.
Aside from the fact that you do not provide the definitions of several functions and tables used in your code, which would be necessary to give a complete answer without making assumptions, there are many misconceptions here regarding very basic topics in Lua.
The most prominent is that you don't know how to use table.insert and what it can do. It inserts (appends, by default) a value at a numeric index of a table. Given that you have non-numeric keys in your vehicleTable, this doesn't make too much sense.
You also don't know how to use the + operator: it does not make any sense to add a table and a string.
Most of your code seems to be the result of guesswork and trial and error.
Instead of referring to the Lua manual to learn how to use table.insert and how to index tables properly, you spent 3 hours trying all kinds of variations of your incorrect code.
Assuming a vehicle model is a table like {["Name"] = "MyCar", ["Hash"] = "MyCarHash"} you can add it to a vehicle class like so:
table.insert(vehicleTable["Sedan"], {["Name"] = "MyCar", ["Hash"] = "MyCarHash"})
This makes sense because vehicleTable.Sedan has numeric indices. And after that line it would contain 2 cars.
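To make that concrete, here is a sketch of what the collection loop could look like; the natives (GetAllVehicleModels and friends) and vehicleClasses are taken from the question, while the on-demand creation of a class sub-table is an addition:

for i = 1, #vehicles do
    local class = vehicleClasses[GetVehicleClassFromName(vehicles[i])]
    local vehicleName = GetLabelText(GetDisplayNameFromVehicleModel(vehicles[i]))
    -- Create the sub-table for this class on first use (an addition for safety).
    vehicleTable[class] = vehicleTable[class] or {}
    -- Append the vehicle at the next numeric index of the class sub-table.
    table.insert(vehicleTable[class], {["Name"] = vehicleName})
end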
Read the manual. Then revisit your code and fix your errors.

How to get Twitter mentions id using academictwitteR package?

I am trying to create several network analyses from Twitter. To get the data, I used the academictwitteR package and their get_all_tweets command.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)

## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")

## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the ids needed to build a retweet and reply network. However, the tidy format does not provide a tidy column for the mentions; instead it drops them.
They are present in my untidy df tweets_bind_lega, but stored as a list in tweets_bind_afd$entities$mentions. Now I would like to somehow unnest this list and create a tidy df with a column that contains the mentioned Twitter user ids.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!
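One hedged starting point, assuming entities$mentions in the raw data frame is a list-column holding one data frame of mentions per tweet, is tidyr::unnest:

# A sketch, not a tested answer: assumes entities$mentions is a list-column
# with one data frame of mentions (username, id, ...) per tweet.
library(dplyr)
library(tidyr)

mentions_tidy <- tweets_bind_lega %>%
  mutate(mentions = entities$mentions) %>%
  select(id, author_id, mentions) %>%
  unnest(cols = mentions)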

Lua indexed table access via named constants

I am using Lua as an embedded language on a µC project, so resources are limited. To save some cycles and memory, I only ever use index-based table access (table[1]) instead of hash-based access (table.someMeaning = 1). This saves a lot of memory.
The clear drawback of this approach is the magic numbers throughout the code.
A C-preprocessor-like pass would help here to replace the numbers with named constants.
Is there a good way to achieve this?
A preprocessor written in Lua itself, loading the script, editing the chunk and then loading it, would be one variant, but I think that would exhaust the resources in the first place...
So, I found a simple solution: write your own preprocessor in Lua!
It's probably the easiest thing to do.
First, define your symbols globally:

MySymbols = {
    FIELD_1 = 1,
    FIELD_2 = 2,
    FIELD_3 = 3,
}

Then write your preprocessing function, which basically just replaces the names from MySymbols with their values:

function Preprocess (FilenameIn, FilenameOut)
    local FileIn = io.open(FilenameIn, "r")
    local FileString = FileIn:read("*a")
    for Name, Value in pairs(MySymbols) do
        FileString = FileString:gsub(Name, Value)
    end
    FileIn:close()
    local FileOut = io.open(FilenameOut, "w")
    FileOut:write(FileString)
    FileOut:close()
end
Then, if you try with this input file test.txt:
TEST FIELD_1
TEST FIELD_2
TEST FIELD_3
And call the following function:
Preprocess("test.txt", "test-out.lua")
You will get the fantastic output file:
TEST 1
TEST 2
TEST 3
I leave you the joy of integrating it with your scripts/toolchain.
If you want to avoid assigning the numbers manually, you could just add a wonderful closure:

function MakeCounter ()
    local Count = 0
    return function ()
        Count = Count + 1
        return Count
    end
end

NewField = MakeCounter()

MySymbols = {
    FIELD_1 = NewField(),
    FIELD_2 = NewField(),
    FIELD_3 = NewField()
}
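The closure simply counts up on each call, which is what makes the automatic numbering work:

-- Each call advances the captured counter:
local Next = MakeCounter()
print(Next()) -- 1
print(Next()) -- 2
print(Next()) -- 3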

Using index value in method

In my Rails application, in a model, I am trying to use the loop index x in the following method, and I can't figure out how to get the value:
def set_winners ## loops over 4 quarters
  1.upto(4) do |x|
    qtr_[x]_winner.winner = 1
    qtr_[x]_winner.save
  end
end
I'm going to keep searching but any help would be greatly appreciated!
edit: So I guess I can't do that! Here is the original method I was trying to refactor in full by looping four times:
def set_winners
  ## set all 4 quarters' winning squares
  home_qtr_1 = game.home_q1_score.to_s.split('').last.to_i
  away_qtr_1 = game.away_q1_score.to_s.split('').last.to_i
  qtr_1_winner = squares.where(xvalue: home_qtr_1, yvalue: away_qtr_1).first
  qtr_1_winner.winner = 1
  qtr_1_winner.save

  home_qtr_2 = game.home_q2_score.to_s.split('').last.to_i
  away_qtr_2 = game.away_q2_score.to_s.split('').last.to_i
  qtr_2_winner = squares.where(xvalue: home_qtr_2, yvalue: away_qtr_2).first
  qtr_2_winner.winner = 1
  qtr_2_winner.save

  home_qtr_3 = game.home_q3_score.to_s.split('').last.to_i
  away_qtr_3 = game.away_q3_score.to_s.split('').last.to_i
  qtr_3_winner = squares.where(xvalue: home_qtr_3, yvalue: away_qtr_3).first
  qtr_3_winner.winner = 1
  qtr_3_winner.save

  home_qtr_4 = game.home_q4_score.to_s.split('').last.to_i
  away_qtr_4 = game.away_q4_score.to_s.split('').last.to_i
  qtr_4_winner = squares.where(xvalue: home_qtr_4, yvalue: away_qtr_4).first
  qtr_4_winner.winner = 1
  qtr_4_winner.save
end
Is there a better way to do this if it's bad practice to dynamically change attribute names?
It looks like you are trying to do a PHP-like trick in a language that doesn't support it, and where we recommend NOT doing it because it results in code that is very difficult to debug due to the dynamically named variables.
It looks like you want to generate a variable name using:
qtr_[x]_winner
to create something like:
qtr_1_winner
Instead, consider creating an array named qtr_winner containing your objects, and accessing the elements like:
qtr_winner[1]
or
qtr_winner[2]
etc.
You could create a hash to do a similar thing:
qtr_winner = {}
qtr_winner[1] = 5
then later access it using qtr_winner[1] and get 5 back or
qtr_winner[1].winner = 1
The determination of whether to use a hash or an array is whether you need to walk the container, or need random access. If you are always indexing into it using a value, then it's probably a wash about which is faster.
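As a sketch of that advice (Winner here is a hypothetical stand-in for however you actually look the squares up):

# Hypothetical sketch: build the container once, then index into it.
qtr_winner = {}
1.upto(4) do |x|
  qtr_winner[x] = Winner.new   # stand-in for your real lookup
end

1.upto(4) do |x|
  qtr_winner[x].winner = 1
  qtr_winner[x].save
end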
Based on your edit, you don't need dynamic variables. The only thing that changes in your loop is game.home_qN_score, so that's what the focus of your refactoring should be. Given that, here's a viable solution:
1.upto(4) do |i|
  home_qtr = game.send("home_q#{i}_score").to_s.split('').last.to_i
  away_qtr = game.send("away_q#{i}_score").to_s.split('').last.to_i
  winner = squares.where(xvalue: home_qtr, yvalue: away_qtr).first
  winner.winner = 1
  winner.save
end
Original answer:
If qtr_1_winner, etc. are instance methods, you can use Object#send to achieve what you want:
def set_winners ## loops over 4 quarters
  1.upto(4) do |x|
    send("qtr_#{x}_winner").winner = 1
    send("qtr_#{x}_winner").save
  end
end

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but only needs to index the file once, or to define a higher-level object which acts like a dictionary over your multiple files. Here is a simple example:
from Bio import SeqIO

class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)

print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])  # this last lookup raises KeyError
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:", like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
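The same trick in plain Python, for illustration (the table name and schema are made up):

# sqlite3 accepts ":memory:" as a database name and keeps the whole
# database in RAM; it disappears when the connection is closed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE offsets (key TEXT PRIMARY KEY, offset INTEGER)")
con.execute("INSERT INTO offsets VALUES (?, ?)", ("record1", 0))
print(con.execute("SELECT offset FROM offsets WHERE key = ?", ("record1",)).fetchone())
con.close()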
