gensim: pickle or not? - memory

I have a question related to gensim. I would like to know whether it is recommended or necessary to use pickle when saving or loading a model (or multiple models), as I find scripts on GitHub that do either.
from gensim.models.doc2vec import Doc2Vec
mymodel = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
mymodel.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
See here
Variant 1:
import pickle
# Save
mymodel.save("mymodel.pkl")  # Stores *.pkl file
# Load (pickle.load expects an open binary file, not a path string)
with open("mymodel.pkl", "rb") as f:
    mymodel = pickle.load(f)
Variant 2:
# Save
mymodel.save("mymodel.model")  # Stores *.model file
# Load
mymodel = Doc2Vec.load("mymodel.model")
In gensim.utils, it appears to me that there is a pickle function embedded: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
def save
    ...
    try:
        _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
    ...
Goal of my question:
I would be glad to learn 1) whether I need pickle (for better memory management) and 2) if so, why it is better than loading *.model files.
Thank you!

Whenever you store a model using the built-in gensim function save(), pickle is being used regardless of the file extension. The documentation for utils tells us this:
class gensim.utils.SaveLoad
Bases: object
Class which inherit from this class have save/load functions, which un/pickle them to disk.
Warning
This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.
So gensim will use pickle to save any model as long as the model class inherits from the gensim.utils.SaveLoad class. In your case gensim.models.doc2vec.Doc2Vec inherits from gensim.models.base_any2vec.BaseWordEmbeddingsModel which in turn inherits from gensim.utils.SaveLoad which provides the actual save() function.
To answer your questions:
1. Yes, you need pickle unless you want to write your own function for storing your models to disk. Using pickle should not be problematic, though, since it is in the standard library. You won't even notice it.
2. If you use the gensim save() function, you can choose any file extension: *.model, *.pkl, *.p, *.pickle. The saved file will be pickled either way.
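As a rough illustration of the point above, here is a minimal sketch using only the standard library. TinySaveLoad, ToyModel, roundtrip and is_picklable are hypothetical names and this is not gensim's actual code. As a simplification, it pickles the object's attribute dictionary rather than the whole object, but the idea is the same: the file extension is never inspected, and attributes such as lambdas cannot be pickled.

```python
import os
import pickle
import tempfile


class TinySaveLoad:
    """Hypothetical stand-in for a SaveLoad-style base class (not gensim's code)."""

    def save(self, fname):
        # The extension is never looked at: the data is always pickled.
        with open(fname, "wb") as handle:
            pickle.dump(self.__dict__, handle, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        # pickle.load needs an open binary file handle, not a path string.
        with open(fname, "rb") as handle:
            state = pickle.load(handle)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


class ToyModel(TinySaveLoad):
    def __init__(self, vector_size):
        self.vector_size = vector_size


def roundtrip(extension):
    """Save a ToyModel under the given extension and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mymodel" + extension)
        ToyModel(vector_size=100).save(path)
        return ToyModel.load(path).vector_size


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


sizes = [roundtrip(ext) for ext in (".model", ".pkl", ".p")]
print(sizes)
print(is_picklable({"a": 1}), is_picklable(lambda x: x))
```

Every extension round-trips the same way here, which mirrors why the choice between *.model and *.pkl is only a naming convention.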

It depends on your requirements.
If you are going to use the data only with Python and do not need to switch between Python versions (I ran into problems porting pickled models from Python 2 to Python 3), a binary format is a good choice.
If you want interoperability, or the model may be used in other projects or by other programmers, I would use gensim's save method.

Related

How to do BERTopic inference endpoint correctly? (hugging face)

I'm trying to create an endpoint using a custom endpoint handler by loading a trained model. I understand that my code is not very correct, but at least it works locally. Can you tell me how to do it correctly? I spent the whole day, but I couldn't find a single clue. AWS crashes with an error at the launch stage of the inference.
Attempts to load using AutoModel Class ended up with me not knowing where to get config.json.
There is no BERTopic tag and I don't have enough reputation to create it. Therefore, I will use BERT, but it's not the same thing.
from typing import Dict, List, Any
from bertopic import BERTopic
from transformers import pipeline, AutoModel, AutoConfig
from sentence_transformers import SentenceTransformer
class EndpointHandler():
    def __init__(self, path=""):
        # Load the trained BERTopic model from the repository directory
        self.model = BERTopic.load("/repository/model")
        self.model.calculate_probabilities = False
    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        """
        data args:
            text (:obj: `str`)
        Return:
            A :obj:`list` | `dict`: will be serialized and returned
        """
        # get inputs
        text = data.pop("text")
        # run normal prediction
        prediction = self.model.transform([text])
        return prediction
I used this instruction to create my handler. The authors load models through their module classes, but those need PyTorch or TensorFlow models. I ran this code on my computer via test.py; it works and returns a response. This class is needed to run the model from the repository automatically (in a couple of clicks) and to change the business logic, for example, not to return the topic and keywords but to generate an image for each word using a different model and return an array of images.
Repository HuggingFace
AWS log pastebin

How to load .model and .emb files generated from Node2Vec embeddings in python?

I have created node embeddings using Node2Vec. I have saved the model and the node embeddings using the following code:
EMBEDDING_FILENAME = './embeddings.emb'
EMBEDDING_MODEL_FILENAME = './embeddings.model'
# Save embeddings for later use
model.wv.save_word2vec_format(EMBEDDING_FILENAME)
# Save model for later use
model.save(EMBEDDING_MODEL_FILENAME)
I want to use these saved model .model and .emb files to create edge embeddings.
How can I load these files/model/node embeddings?
As stated in this answer from the Node2Vec library's author,
the Node2Vec.fit method returns an instance of gensim.models.Word2Vec, you can see in the documentation how to save and load a model.
There are two options, depending on how you stored your model. See below a snippet for doing that:
from gensim.models import KeyedVectors, Word2Vec
# Load the full model written by model.save(...)
model = Word2Vec.load(PATH_TO_YOUR_SAVED_MODEL)
# Load only the vectors written by model.wv.save_word2vec_format(...)
node_vectors = KeyedVectors.load_word2vec_format(PATH_TO_YOUR_SAVED_WORD2VEC_FORMAT)
Note that calling the Word2Vec.load method with (fname=PATH_TO_YOUR_SAVED_MODEL) (as in the documentation) raises an error, because apparently the right parameter name is fname_or_handle as for Word2Vec.save.
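For illustration only, here is a sketch that reads the .emb file without gensim, assuming it was written with the default text output of save_word2vec_format (a header line with the vector count and dimension, then one node per line). The helper names read_word2vec_text and hadamard_edge are hypothetical, and the element-wise (Hadamard) product is just one possible way to build an edge embedding from two node vectors, assumed here rather than taken from the answer.

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse the plain-text word2vec format into {node: [floats]}."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        # Header line: "<number of vectors> <dimension>"
        count, dim = map(int, handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError("header count does not match number of rows")
    return vectors


def hadamard_edge(vectors, u, v):
    """Element-wise product of two node vectors (one possible edge embedding)."""
    return [a * b for a, b in zip(vectors[u], vectors[v])]


# Tiny hand-written file standing in for './embeddings.emb'.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "embeddings.emb")
    with open(emb_path, "w", encoding="utf-8") as handle:
        handle.write("2 3\n1 0.5 1.0 2.0\n2 2.0 3.0 0.5\n")
    node_vectors = read_word2vec_text(emb_path)

edge = hadamard_edge(node_vectors, "1", "2")
print(edge)
```

Node identifiers come back as strings, so integer node ids need to be converted before lookup. Identifiers containing spaces would not survive this simple parser.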

Ruby Class from .rb file

I would like to read Ruby files from the filesystem and get the actual Ruby class.
Dir["app/controllers/admin/*.rb"].select{ |f|
  require File.expand_path(f)
  #how to turn 'f' into an actual class
}
The problem I have is that both Kernel.load and require just return a boolean. Is there a way to get the actual class? I know that I can use the file path to determine the name, but I would rather not deal with namespaces. How can I do that?
First, I'm going to tell you up front that this is probably a bad idea. Files in Ruby have no relationship to classes whatsoever. A file can define one class, no classes, or many classes, and it can even define classes dynamically based on arbitrary conditions. Additionally, class definitions might be spread across multiple files, and classes can be altered dynamically at runtime. For this reason, determining reliably whether a class is defined in a file is a difficult task, to say the least.
That said, here's one way you might approach the problem. Note that this solution is very hacky, won't work in all cases, and it can load the same file more than once if you're not careful:
module ClassLoader
  def self.load_classes(file)
    context = Module.new
    context.class_eval(File.read(file), file)
    context.constants.map{|constant| [constant, context.const_get(constant)]}.to_h
  end
end
Usage:
./test_file.rb:
if rand < 0.5
  class A
  end
else
  class B
  end
end
class C
end
Your code:
ClassLoader.load_classes('./test_file.rb') #=> {:A=>#<Module:0x9a3c128>::A, :C=>#<Module:0x9a3c128>::C}
Alternately, if you're using Rails, class names can often be inferred from the file name. This is somewhat more dependable, since it relies on the same conventions that Rails does for autoloading constants:
Dir["app/controllers/admin/*.rb"].map { |f|
  # Strip the ".rb" extension and keep the namespace,
  # e.g. "admin/users_controller" -> Admin::UsersController
  "admin/#{File.basename(f, '.rb')}".camelize.constantize
}

Rails Limit Model To 1 Record

I am trying to create a section in my app where a user can update certain site wide attributes. An example is a sales tax percent. Even though this amount is relatively constant, it does change every few years.
Currently I have created a Globals model with attributes I want to keep track of. For example, to access these attributes where needed, I could simply do something like the following snippet.
(1 + Globals.first.sales_tax) * @item.total
What is the best way to handle variables that do not change often and are applied site wide? If I use this method, is there a way to limit the model to one record? A final but more sobering question: am I even on the right track?
Ok, so I've dealt with this before, as a design pattern, it is not the ideal way to do things IMO, but it can sometimes be the only way, especially if you don't have direct disk write access, as you would if deployed on Heroku. Here is the solution.
class Global < ActiveRecord::Base
  # Run only on create so the single existing record can still be updated
  validate :only_one, on: :create
  private
  def only_one
    if Global.count >= 1
      errors.add :base, 'There can only be one global setting/your message here'
    end
  end
end
If you DO have direct disk access, you can create a YAML config file that you can read/write/dump to when a user edits a config variable.
For example, you could have a yaml file in config/locales/globals.yml
When you wanted to edit it, you could write
filepath = "#{Rails.root}/config/locales/globals.yml"
globals = YAML.load(File.read(filepath))
# A string key keeps the file loadable under safe YAML loading
globals.merge!({ "sales_tax" => 0.07 })
# File.write takes the path and the content; it does not yield a file handle
File.write(filepath, YAML.dump(globals))
More on the ruby yaml documentation
You could also use JSON, XML, or whatever markup language you want.
It seems to me like you are pretty close, but depending on the data structure you end up with, I would change it to
(1 + Globals.last.sales_tax) * @item.total
and then build some type of interface that either:
Allows a user to create a new Globals object (perhaps duplicating the existing one) - the use case here being that there is some archive of when these things changed, although you could argue that this should really be a warehousing function (I'm not sure of the scope of your project).
Allows a user to update the existing Globals object using something like paper_trail to track the changes (in which case you might want validations like those presented by @Brian Wheeler).
Alternatively, you could pivot the Global object and instead use something like a kind or type column to delineate different values so that you would have:
(1+ Globals.where(kind: 'Colorado Sales Tax').last) * @item.total
and still build interfaces similar to the ones described above.
You can create a class and dump all your constants in it.
For instance:
class Global
  @sales_tax = 0.9
  # Class-level reader, so that Global.sales_tax works without an instance
  def self.sales_tax
    @sales_tax
  end
end
and access it like:
Global.sales_tax
Or, you can define global variables along the lines of this post

When you say Ruby is reflective, does this mainly refer to "duck typing"?

I was reading a text describing Ruby and it said the following:
Ruby is considered a “reflective” language because it’s possible for a Ruby program to analyze itself (in terms of its make-up), make adjustments to the way it works, and even overwrite its own code with other code.
I'm confused by this term 'reflective' - is this mainly talking about the way Ruby can look at a variable and figure out whether it's an Integer or a String (duck typing), e.g.:
x = 3
x = "three" # Ruby reassigns x to a String type
To say Ruby is "reflective" means that you can, for instance, find out at runtime what methods a class has:
>> Array.methods
=> ["inspect", "private_class_method", "const_missing",
[ ... and many more ... ]
(You can do the same thing with an object of the class.)
Or you can find out what class a given object is...
>> arr = Array.new
=> []
>> arr.class
=> Array
And find out what it is within the class hierarchy...
>> arr.kind_of? Array
=> true
>> arr.kind_of? String
=> false
In the quote where they say "it’s possible for a Ruby program to analyze itself" that's what they're talking about.
Other languages such as Java do that too, but with Ruby it's easier, more convenient, and more of an everyday part of using the language. Hence, Ruby is "reflective."
No, it means that you can issue a Ruby command to get information about, well, just about anything. For example, you can type the command File.methods() to get a listing of all methods belonging to the File class. You can do similar things with classes and objects: listing methods, variables, etc.
Class reopening is a good example of this. Here's a simple example:
class Integer
  def moxy
    if self.zero?
      self - 2
    elsif self.nonzero?
      self + 2
    end
  end
end
puts 10.moxy
By reopening a standard Ruby class - Integer - and defining a new method within it called 'moxy', we can perform a newly defined operation directly on a number. In this case, I've defined this made up 'moxy' method to subtract 2 from the Integer if it's zero and add two if it's nonzero. This makes the moxy method available to all objects of class Integer in Ruby. (Here we use the 'self' keyword to get the content of the integer object).
As you can see, it's a very powerful feature of Ruby.
EDIT: Some commenters have questioned whether this is really reflection. In the English language the word reflection refers to looking in on your own thoughts. And that's certainly an important aspect of reflection in programming also: using Ruby methods like is_a?, kind_of?, and instance_of? to perform runtime self-inspection. But reflection also refers to the ability of a program to modify its own behavior at runtime. Reopening classes is one of the key examples of this. It's also called monkey patching. It's not without its risks, but all I am doing is describing it here in the context of reflection, of which it is an example.
It refers mainly to how easy it is to inspect and modify internal representations at run-time in Ruby programs, such as classes, constants, methods and so on.
Most modern languages offer some kind of reflective capabilities (even statically typed ones such as Java), but in Ruby it is so easy and natural to use these capabilities that it makes a real difference when you need them.
It just makes meta-programming, for example, an almost trivial task, which is not true at all in other languages, even dynamic ones.

Resources