LDA: Coherence Values using u_mass v c_v - machine-learning

I am currently attempting to record and graph coherence scores for various topic number values in order to determine the number of topics that would be best for my corpus. After several trials using u_mass, the data proved to be inconclusive since the scores don't plateau around a specific topic number. I'm aware that CV ranges from -14 to 14 when using u_mass, however my values range from -2 to -1 and selecting an accurate topic number is not possible. Due to these issues, I attempted to use c_v instead of u_mass but I receive the following error:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
This is my code for computing the coherence value
cm = CoherenceModel(model=ldamodel, texts=texts, dictionary=dictionary,coherence='c_v')
print("THIS IS THE COHERENCE VALUE ")
coherence = cm.get_coherence()
print(coherence)
If anyone could provide assistance in resolving my issues for either c_v or u_mass, it would be greatly appreciated! Thank you!

Related

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?
There is no way to figure out what data was not recorded. Update the max-series-per-database configuration to be more than 1M in order to stop dropping data.
This can be an indication that you are creating a lot of series. i saw some documentation on why that isn't great.
Hope this helps!

K-Mode clustering

I have a dataset of 6 million rows with mixed datatype. k prototype is not scalable and hence I converted all columns to categorical and ran K-mode for 4 clusters on a random sample of 4 M rows. However, k-mode has an initialization problem that will give different clusters every time you run the model. Let's say, I run it once and take the output for my analysis. Is the approach completely wrong for one time analysis? If yes, is there a way to fix initialization problem? May be by setting parameter or something. Any suggestion is deeply appreciated.
I am sure you did this but definitely set the seed. Because once you set the mode variable it selects a random set of rows from your data and proceeds with the algorithm. So seeting the seed is important for reproducible results. I am assuming your code is something like this:
kmodes(data, modes=4, iter.max = 10, weighted = FALSE, fast = TRUE)
I hope by different cluster you don't imply the number of clusters is also changing.

Dask performances: workflow doubts

I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (every one has its own key) and I need to run a function my_fun on every each of them. One way to solve it with pandas involves
df = list(df.groupby("key")) and then apply my_fun
with multiprocessing. The performances, despite the huge usage of RAM, are pretty good on my machine and terrible on google cloud compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
`df.groupby("key").apply(my_fun).to_frame.compute(get=get)
As I didn't set the indices df.known_divisions is False
The resulting graph is
and I don't understand if what I see it is a bottleneck or not.
Questions:
Is it better to have df.npartitions as a multiple of ncpu or it doesn't matter?
From this it seems that is better to set the index as key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.
For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing - you can run with any number of processes/threads you like, but you have much more information available via the diagnostics dashboard.
For your specific questions, if you are grouping over a column that is not nicely split between partitions and applying anything other than the simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as a explicit step, or you get the implicit shuffle apparent in your task graph. This is a many-to-many operation, each aggregation tasks needs input from every original partition, hence the bottle-neck. There is no getting around that.
As for number of partitions, yes you can have sub-optimal conditions like 9 partitions on 8 cores (you will calculate 8 tasks, and then perhaps block for the final task on one core while the others are idle); but in general you can depend on dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases, it will not matter much.

Predicting possible inputs leading to output satisfying certain condition

Suppose there is a data set of statistical data with a number of input columns and one output column. The predictors characterize some particular process that is repeated, so one data row is corresponding to one occasion of that process. And for these process characteristics the order and duration is important. Some of them might be absent at all, some of them are repeated, but with different speed or other parameter.
Let's say that our process is names P and it can have a lot of child parts, that form the process together. Let's say, once the process had N sub processes:
Sub process 1, with: speed = SpdA, duration = DurA, depth = DepA
Right after sub process A next sub process B happened:
Sub process 2, with: speed = SpdB, duration = DurB, depth = DepB
...
... N. Sub process N.
So there might be from 1 to N child processes in each process, that is, in each data row. And the amount of the child processes may vary from one row to another. This is about the input data.
As for the output - the output here in the simplest case is binary - either success or failure, but in reality it will be a positive number starting from 0 to positive infinity. This number represents the time by which the process has finished successfully. If the value for the output is a positive infinity - it means that the process failed to succeed.
Very important note, if we are going with the simplest case where the output is binary - in the statistical data set there will be data rows that mostly have failure in the output. The goal is to find the hypothetical parameters that values of the test predictors should be equal to, to make the process succeed.
For example, after learning we should be able to tell what is the concrete universal input parameters that will most process success. That was the simplest, binary output case.
However, in real life we will have the output that represents time by which the process finished successfully, and +infinity - if failure. So here the goal is the same - make the process succeed or as much close to success as possible. The goal is to generate the test inputs that we might use in future to prevent the output equal to +infinity.
The goal maximum is, having the target time provided, find the exact values for the inputs that will make the process finish successfully as closer to the given time as possible. Here we should expect the enumeration of child processes, their order and the values for each child process to be predicted.
Here in this problem, I guess, the output will play the role of the input and the input will play the role of the output.
What is the approach to solve these problems? How to handle the variable number of characteristics and how to handle the order that might vary in the each data row?
I am a novice in machine learning and would appreciate the concrete suggestions or examples of similar problems solved.
Any help and advice welcome!

Which Improvements can be done to AnyTime Weighted A* Algorithm?

Firstly , For those of your who dont know - Anytime Algorithm is an algorithm that get as input the amount of time it can run and it should give the best solution it can on that time.
Weighted A* is the same as A* with one diffrence in the f function :
(where g is the path cost upto node , and h is the heuristic to the end of path until reaching a goal)
Original = f(node) = g(node) + h(node)
Weighted = f(node) = (1-w)g(node) +h(node)
My anytime algorithm runs Weighted A* with decaring weight from 1 to 0.5 until it reaches the time limit.
My problem is that most of the time , it takes alot time until this it reaches a solution , and if given somthing like 10 seconds it usaully doesnt find solution while other algorithms like anytime beam finds one in 0.0001 seconds.
Any ideas what to do?
If I were you I'd throw the unbounded heuristic away. Admissible heuristics are much better in that given a weight value for a solution you've found, you can say that it is at most 1/weight times the length of an optimal solution.
A big problem when implementing A* derivatives is the data structures. When I implemented a bidirectional search, just changing from array lists to a combination of hash augmented priority queues and array lists on demand, cut the runtime cost by three orders of magnitude - literally.
The main problem is that most of the papers only give pseudo-code for the algorithm using set logic - it's up to you to actually figure out how to represent the sets in your code. Don't be afraid of using multiple ADTs for a single list, i.e. your open list. I'm not 100% sure on Anytime Weighted A*, I've done other derivatives such as Anytime Dynamic A* and Anytime Repairing A*, not AWA* though.
Another issue is when you set the g-value too low, sometimes it can take far longer to find any solution that it would if it were a higher g-value. A common pitfall is forgetting to check your closed list for duplicate states, thus ending up in a (infinite if your g-value gets reduced to 0) loop. I'd try starting with something reasonably higher than 0 if you're getting quick results with a beam search.
Some pseudo-code would likely help here! Anyhow these are just my thoughts on the matter, you may have solved it already - if so good on you :)
Beam search is not complete since it prunes unfavorable states whereas A* search is complete. Depending on what problem you are solving, if incompleteness does not prevent you from finding a solution (usually many correct paths exist from origin to destination), then go for Beam search, otherwise, stay with AWA*. However, you can always run both in parallel if there are sufficient hardware resources.

Resources