How to set up a BERTopic inference endpoint correctly? (Hugging Face) - machine-learning

I'm trying to create an endpoint with a custom endpoint handler that loads a trained model. I know my code isn't quite right, but at least it works locally. Can you tell me how to do this correctly? I spent the whole day on it and couldn't find a single clue; the endpoint on AWS crashes with an error at the inference launch stage.
Attempts to load the model with the AutoModel class ended with me not knowing where to get a config.json.
There is no BERTopic tag and I don't have enough reputation to create one, so I'm using the BERT tag instead, even though it's not the same thing.
from typing import Dict, List, Any
from bertopic import BERTopic
# Imported during earlier loading attempts; not used by the handler below.
from transformers import pipeline, AutoModel, AutoConfig
from sentence_transformers import SentenceTransformer


class EndpointHandler():
    def __init__(self, path=""):
        # Load the trained BERTopic model from the repository
        self.model = BERTopic.load("/repository/model")
        self.model.calculate_probabilities = False

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        """
        data args:
            text (:obj:`str`)
        Return:
            A :obj:`list` | `dict`: will be serialized and returned
        """
        # get inputs
        text = data.pop("text")
        # run normal prediction
        prediction = self.model.transform([text])
        return prediction
I used this instruction to create my handler. The authors load models through their module classes, but that requires PyTorch or TensorFlow models. I ran this code on my computer via test.py; it works and returns a response. The point of this class is to run the model from the repository automatically (in a couple of clicks) and to change the business logic, for example, instead of returning the topic and keywords, to generate an image for each word using a different model and return an array of images.
Repository HuggingFace
AWS log pastebin
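For reference, a rough sketch of the kind of local test meant by test.py (not the exact script; it assumes the class above lives in handler.py and that the hardcoded model path is available locally):
# Rough local test sketch (not the exact test.py). The handler above hardcodes
# "/repository/model", so that path must exist locally or be adjusted first.
from handler import EndpointHandler

handler = EndpointHandler()
result = handler({"text": "an example document to assign to a topic"})
print(result)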

Related

how to mlflow autolog with custom parameters

I'm trying to log my ML trials with mlflow.keras.autolog and mlflow.log_param simultaneously (mlflow v 1.22.0). However, only autolog's products are recorded, not the parameters passed to log_param.
experiment = mlf_client.get_experiment_by_name(experiment_name)
with mlflow.start_run(experiment_id=experiment.experiment_id):
    mlflow.keras.autolog(log_input_examples=True)
    mlflow.log_param('batch_size', self.batch_size)
    mlflow.log_param('training_set_size', len(kwargs['training_ID_list']))
    mlflow.log_param('testing_set_size', len(kwargs['testing_ID_list']))
    history = self.train_NN_model(**kwargs)
I know I can use log_param with log_model to save the model itself, but then I lose some useful stuff that autolog can record for me automatically (e.g., model summary).
Is it possible to use autolog with custom parameters for logging?
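One arrangement worth trying (an assumption, not verified against this exact setup) is to enable autologging before opening the run and then log the custom parameters inside it, reusing the names from the snippet above:
# Hedged sketch: enable Keras autologging first, then log custom params in the same run.
# mlf_client, self and kwargs are reused from the question's snippet.
mlflow.keras.autolog(log_input_examples=True)

experiment = mlf_client.get_experiment_by_name(experiment_name)
with mlflow.start_run(experiment_id=experiment.experiment_id):
    mlflow.log_param('batch_size', self.batch_size)
    mlflow.log_param('training_set_size', len(kwargs['training_ID_list']))
    mlflow.log_param('testing_set_size', len(kwargs['testing_ID_list']))
    history = self.train_NN_model(**kwargs)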

Creating different types of workers that are accessed using a single client

EDIT:
My question was put horribly, so I deleted it and am rephrasing it entirely here.
I'll give a tl;dr:
I'm trying to assign each computation to a designated worker that fits the computation type.
In long:
I'm trying to run a simulation, so I represent it using a class of the form:
class Simulation:
    def __init__(self, first_client: Client, second_client: Client):
        self.first_client = first_client
        self.second_client = second_client

    def first_calculation(self, input):
        with self.first_client.as_current():
            # ... compute output on the first cluster ...
            return output

    def second_calculation(self, input):
        with self.second_client.as_current():
            # ... compute output on the second cluster ...
            return output

    def run(self, input):
        return self.second_calculation(self.first_calculation(input))
This format has downsides like the fact that this simulation object is not pickleable.
I could edit the Simulation object to contain only addresses and not clients for example, but I feel as if there must be a better solution. For instance, I would like the simulation object to work the following way:
class Simulation:
    def first_calculation(self, input):
        client = dask.distributed.get_client()
        with client.as_current():
            return output
    ...
The thing is, the Dask workers best suited for the first calculation are different from the Dask workers best suited for the second calculation, which is why my Simulation object has two clients connecting to two different schedulers in the first place. Is there any way to have only one client but two types of schedulers, and to make the client send first_calculation to the first scheduler and second_calculation to the second one?
Dask will chop up large computations into smaller tasks that can run in parallel. Those tasks are then submitted by the client to the scheduler, which in turn schedules them on the available workers.
Sending the client object to a Dask scheduler will likely not work due to the serialization issue you mention.
You could try one of two approaches:
Depending on how you actually run those worker machines, you could specify different types of workers for different tasks. If you run on Kubernetes, for example, you could try to leverage the node pool functionality to make different worker types available.
An easier approach using your existing infrastructure would be to bring the results of your first computation back to the machine where you are using the client, with something like .compute(), and then use that data as input for the second computation. In that case you're sending the actual data over the network instead of the client (a rough sketch follows below). If the size of that data becomes an issue, you can always write the intermediary results to something like S3.
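A rough sketch of that second approach, assuming two already-running schedulers (the addresses and input_data are placeholders) and reusing the first_calculation / second_calculation names from the question:
from dask.distributed import Client

first_client = Client("tcp://scheduler-for-first-stage:8786")    # placeholder address
second_client = Client("tcp://scheduler-for-second-stage:8786")  # placeholder address

# Run the first stage on the first cluster and pull the result back to this machine...
intermediate = first_client.submit(first_calculation, input_data).result()

# ...then send the materialized data as input to the second stage on the other cluster.
final = second_client.submit(second_calculation, intermediate).result()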
Dask does support giving specific tasks to specific workers with annotate. Here's an example snippet, where a delayed_sum task was passed to one worker and the doubled task was sent to the other worker. The assert statements check that those workers really were restricted to only those tasks. With annotate you shouldn't need separate clusters. You'll also need the most recent versions of Dask and Distributed for this to work because of a recent bug fix.
import distributed
import dask
from dask import delayed

local_cluster = distributed.LocalCluster(n_workers=2)
client = distributed.Client(local_cluster)

workers = list(client.scheduler_info()['workers'].keys())

with dask.annotate(workers=workers[0]):
    delayed_sum = delayed(sum)([1, 2])

with dask.annotate(workers=workers[1]):
    doubled = delayed_sum * 2

# use persist so the scheduler doesn't clean up;
# wrap in distributed.wait to make sure the tasks are there when we check the scheduler
distributed.wait([doubled.persist(), delayed_sum.persist()])

worker_restrictions = local_cluster.scheduler.worker_restrictions

assert worker_restrictions[delayed_sum.key] == {workers[0]}
assert worker_restrictions[doubled.key] == {workers[1]}

How to load .model and .emb files generated from Node2Vec embeddings in python?

I have created node embeddings using Node2Vec. I have saved the model and the node embeddings using the following code-
EMBEDDING_FILENAME = './embeddings.emb'
EMBEDDING_MODEL_FILENAME = './embeddings.model'
# Save embeddings for later use
model.wv.save_word2vec_format(EMBEDDING_FILENAME)
# Save model for later use
model.save(EMBEDDING_MODEL_FILENAME)
I want to use the saved .model and .emb files to create edge embeddings.
How can I load these files/model/node embeddings?
As stated in this answer from the Node2Vec library's author,
the Node2Vec.fit method returns an instance of gensim.models.Word2Vec; the documentation shows how to save and load such a model.
There are two options, depending on how you stored your model. See below a snippet for doing that:
from gensim.models import Word2Vec, KeyedVectors

# Load the full model saved with model.save(...)
model = Word2Vec.load(PATH_TO_YOUR_SAVED_MODEL)

# Load just the vectors saved with model.wv.save_word2vec_format(...)
node_vectors = KeyedVectors.load_word2vec_format(PATH_TO_YOUR_SAVED_WORD2VEC_FORMAT)
Note that calling Word2Vec.load with the keyword argument fname=PATH_TO_YOUR_SAVED_MODEL (as shown in the documentation) raises an error; apparently the right parameter name is fname_or_handle, as for Word2Vec.save.
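Since the end goal is edge embeddings, here is a small follow-on sketch (my own assumption, using the element-wise product as the edge operator and string node IDs, which is how gensim stores keys):
from gensim.models import KeyedVectors

# Load the vectors written by model.wv.save_word2vec_format(EMBEDDING_FILENAME)
node_vectors = KeyedVectors.load_word2vec_format('./embeddings.emb')

# One common edge operator: the Hadamard (element-wise) product of the endpoint vectors.
def edge_embedding(u, v):
    return node_vectors[str(u)] * node_vectors[str(v)]

emb = edge_embedding(1, 2)  # assumes nodes "1" and "2" exist in the graph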

gensim: pickle or not?

I have a question related to gensim. I'd like to know whether it is recommended or necessary to use pickle when saving or loading a model (or multiple models), as I find scripts on GitHub that do it either way.
mymodel = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
mymodel.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
See here
Variant 1:
import pickle

# Save
mymodel.save("mymodel.pkl")  # stores a *.pkl file

# Load
with open("mymodel.pkl", "rb") as f:
    mymodel = pickle.load(f)
Variant 2:
# Save
mymodel.save("mymodel.model")  # stores a *.model file

# Load
mymodel = Doc2Vec.load("mymodel.model")
In gensim.utils, it appears to me that there is a pickle function embedded: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
def save(...):
    ...
    try:
        _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
    ...
The goal of my question:
I would be glad to learn 1) whether I need pickle (for better memory management) and 2) if so, why it's better than loading *.model files.
Thank you!
Whenever you store a model using the built-in gensim function save(), pickle is being used regardless of the file extension. The documentation for utils tells us this:
class gensim.utils.SaveLoad
Bases: object
Class which inherit from this class have save/load functions, which un/pickle them to disk.
Warning
This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.
So gensim will use pickle to save any model as long as the model class inherits from the gensim.utils.SaveLoad class. In your case gensim.models.doc2vec.Doc2Vec inherits from gensim.models.base_any2vec.BaseWordEmbeddingsModel which in turn inherits from gensim.utils.SaveLoad which provides the actual save() function.
To answer your questions:
Yes, you need pickle unless you want to write your own function for storing your models to disk. Using pickle should not be problematic though, since it is in the standard library; you won't even notice it.
If you use the gensim save() function you can choose any file extension: *.model, *.pkl, *.p, *.pickle. The saved file will be pickled either way (a short sketch follows).
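A minimal sketch of that, assuming a tiny throwaway corpus (note that in recent gensim versions the size parameter from the question's snippet is called vector_size):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; any tagged documents will do.
documents = [TaggedDocument(words=["hello", "world"], tags=[0]),
             TaggedDocument(words=["another", "document"], tags=[1])]

mymodel = Doc2Vec(documents, vector_size=100, window=8, min_count=1, workers=4)

# gensim's save()/load() pickle the model regardless of the extension you pick.
mymodel.save("mymodel.model")
loaded = Doc2Vec.load("mymodel.model")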
It depends on your requirements.
If you are going to use the data only with Python and don't need to move between Python versions (I experienced some problems porting pickled models from Python 2 to Python 3), a binary format is a good choice.
If you want interoperability, or the model could be used in other projects or by other programmers, I would use gensim's save method.

How to create a new gym environment in OpenAI?

I have an assignment to make an AI Agent that will learn to play a video game using ML. I want to create a new environment using OpenAI Gym because I don't want to use an existing environment. How can I create a new, custom Environment?
Also, is there any other way I can start developing an AI agent to play a specific video game without the help of OpenAI Gym?
See my banana-gym for an extremely small environment.
Create new environments
See the main page of the repository:
https://github.com/openai/gym/blob/master/docs/creating_environments.md
The steps are:
Create a new repository with a PIP-package structure
It should look like this:
gym-foo/
  README.md
  setup.py
  gym_foo/
    __init__.py
    envs/
      __init__.py
      foo_env.py
      foo_extrahard_env.py
For the contents of the files, follow the link above. What it does not cover in detail is how some of the functions in foo_env.py should look. Looking at examples and at gym.openai.com/docs/ helps. Here is an example:
class FooEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        pass

    def _step(self, action):
        """
        Parameters
        ----------
        action :

        Returns
        -------
        ob, reward, episode_over, info : tuple
            ob (object) :
                an environment-specific object representing your observation of
                the environment.
            reward (float) :
                amount of reward achieved by the previous action. The scale
                varies between environments, but the goal is always to increase
                your total reward.
            episode_over (bool) :
                whether it's time to reset the environment again. Most (but not
                all) tasks are divided up into well-defined episodes, and done
                being True indicates the episode has terminated. (For example,
                perhaps the pole tipped too far, or you lost your last life.)
            info (dict) :
                diagnostic information useful for debugging. It can sometimes
                be useful for learning (for example, it might contain the raw
                probabilities behind the environment's last state change).
                However, official evaluations of your agent are not allowed to
                use this for learning.
        """
        self._take_action(action)
        self.status = self.env.step()
        reward = self._get_reward()
        ob = self.env.getState()
        episode_over = self.status != hfo_py.IN_GAME
        return ob, reward, episode_over, {}

    def _reset(self):
        pass

    def _render(self, mode='human', close=False):
        pass

    def _take_action(self, action):
        pass

    def _get_reward(self):
        """ Reward is given for XY. """
        if self.status == FOOBAR:
            return 1
        elif self.status == ABC:
            return self.somestate ** 2
        else:
            return 0
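For gym.make to find the environment, the package usually registers it in gym_foo/__init__.py. A minimal sketch (the id and entry point below are illustrative and must match what you actually register):
# gym_foo/__init__.py (illustrative): register the environment so gym.make can build it
from gym.envs.registration import register

register(
    id='MyEnv-v0',                       # the id used with gym.make below
    entry_point='gym_foo.envs:FooEnv',   # module path to the environment class
)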
Use your environment
import gym
import gym_foo
env = gym.make('MyEnv-v0')
Examples
https://github.com/openai/gym-soccer
https://github.com/openai/gym-wikinav
https://github.com/alibaba/gym-starcraft
https://github.com/endgameinc/gym-malware
https://github.com/hackthemarket/gym-trading
https://github.com/tambetm/gym-minecraft
https://github.com/ppaquette/gym-doom
https://github.com/ppaquette/gym-super-mario
https://github.com/tuzzer/gym-maze
It's definitely possible. They say so on the documentation page, close to the end.
https://gym.openai.com/docs
As to how to do it, you should look at the source code of the existing environments for inspiration. It's available on GitHub:
https://github.com/openai/gym#installation
They did not implement most of their environments from scratch; rather, they created wrappers around existing environments and gave them all an interface that is convenient for reinforcement learning.
If you want to make your own, you should probably go in this direction and try to adapt something that already exists to the Gym interface, although there is a good chance that this will be very time consuming.
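A rough sketch of what such a wrapper tends to look like; the game object and its methods below are entirely made up, and only the gym.Env and gym.spaces interfaces are real:
import gym
import numpy as np
from gym import spaces


class WrappedGameEnv(gym.Env):
    """Thin adapter exposing a pre-existing game through the Gym interface."""

    def __init__(self, game):
        self.game = game                      # hypothetical existing game object
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=0, high=255,
                                            shape=(84, 84, 3), dtype=np.uint8)

    def step(self, action):
        self.game.press_button(action)        # made-up game API
        obs = self.game.screen()
        reward = self.game.score_delta()
        done = self.game.is_over()
        return obs, reward, done, {}

    def reset(self):
        self.game.restart()
        return self.game.screen()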
There is another option that may be interesting for your purpose. It's OpenAI's Universe
https://universe.openai.com/
It can integrate with websites so that you train your models on kongregate games, for example. But Universe is not as easy to use as Gym.
If you are a beginner, my recommendation is that you start with a vanilla implementation on a standard environment. After you get past the problems with the basics, go on to increment...
