What's the most efficient way to split a heterogeneous flux into multiple fluxes? - project-reactor

What's the most efficient way to split a heterogeneous flux into multiple fluxes by pattern matching?
We have a use case where we consume from multiple Kafka topics and then want to split the resulting flux into multiple streams. How can we do that?

Related

How to apply filter feature selection in datasets with various data types?

I have recently been looking into different filter feature selection approaches and have noted that some are better suited for numerical data (Pearson) and some are better suited for categorical data (Chi-Square).
I am working with a dataset with a mixture of both data types and am unsure about what the best practice is in terms of applying the filter methods.
Is it best to split the dataset into categorical and numerical features, apply a different filter method to each subset, and then join the results?
Or should only one filter method be applied to the whole dataset?
You can have a look at Permutation Importance. The idea is to randomly shuffle the values of a feature and observe the change in error. If the feature is important, ideally the error should increase. It does not depend on the data type of the feature, unlike some statistical tests. Also it is very straightforward to implement and analyze. link1, link2
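For illustration, here is a minimal sketch of that idea using scikit-learn's permutation_importance on a made-up mixed-type dataset; the column names, toy data, and model choice are assumptions, not something from the question:

```python
# Sketch: permutation importance on a mixed numerical/categorical dataset.
# The data, column names and model choice are made up for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 23, 38, 44, 29],   # numerical
    "income": [40, 55, 80, 90, 35, 60, 75, 45],   # numerical
    "city":   list("ABABABAB"),                   # categorical
    "label":  [0, 1, 1, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns="label"), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-hot encode the categorical column, pass the numerical ones through.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
        remainder="passthrough")),
    ("clf", RandomForestClassifier(random_state=0)),
])
model.fit(X_train, y_train)

# Shuffle each original column in turn and measure the drop in test score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, mean in zip(X.columns, result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

Because the shuffling happens on the raw columns (before encoding), the same procedure covers numerical and categorical features alike.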

Is it possible to cluster data with grouped rows of data in unsupervised learning?

I am working to set up data for an unsupervised learning algorithm. The goal of the project is to group (cluster) different customers together based on their behavior on the website. Obviously, some sort of clustering algorithm is best for discovering patterns in the data that we can't see as humans.
However, the database contains multiple rows for each customer (in chronological order), one for each action the customer took on the website during that visit. For example, customer ID #123 clicked on page 1 at time X, and that is one row in the database; the same customer then clicked another page at time Y, which makes another row.
My question is what algorithm or approach would you use for clustering in this given scenario? K-means is really popular for this type of problem, but I don't know if it's possible to use in this situation because of the grouping. Is it somehow possible to do cluster analysis around one specific ID that includes multiple rows?
Any help/direction of unsupervised learning I should take is appreciated.
In short:
(1) Learn a fixed-length embedding (representation) of each event;
(2) Learn a way to combine a sequence of such embeddings into a single representation for each customer, then use your favorite unsupervised methods.
For (1), you can do it either manually or with an encoder/decoder.
For (2), there is a range of options, from simply averaging the embeddings of the events to training an encoder-decoder to reconstruct the original sequence of events and taking the intermediate representation (the one the decoder uses to reconstruct the sequence); a sketch of the simpler end is given after the reference below.
A good read on this topic (though a bit old; you now also have the option of Transformer networks):
Representations for Language: From Word Embeddings to Sentence Meanings
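As a minimal sketch of the simpler end of (2), the example below (with made-up column names and hand-crafted per-event features standing in for learned embeddings) averages each customer's event rows into one fixed-length vector and then clusters those vectors with K-means:

```python
# Sketch: one row per event -> one fixed-length vector per customer -> K-means.
# Column names and the hand-crafted "embedding" are assumptions for illustration.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

events = pd.DataFrame({
    "customer_id": [123, 123, 123, 456, 456, 789],
    "page_id":     [1, 2, 5, 1, 1, 3],
    "dwell_secs":  [12, 40, 5, 60, 33, 8],
})

# (1) a trivial per-event representation: the raw numeric columns;
# (2) combine each customer's events by averaging, plus an event count.
profiles = events.groupby("customer_id").agg(
    mean_page=("page_id", "mean"),
    mean_dwell=("dwell_secs", "mean"),
    n_events=("page_id", "size"),
)

X = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(profiles.index, labels)))
```

A learned encoder-decoder would replace both the hand-crafted features and the averaging step, but the clustering part stays the same.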

Grid search vs Best subset selection

I don't understand the difference between these two model selection procedures: grid search and best subset selection.
Both of them pick a subset of parameters and try all possible combinations of them, so I don't see where the difference lies.
Subset selection in machine learning is used for feature selection: it means evaluating a subset of features (out of all potential features) as a group and then selecting the best subset by some criterion.
There are many different algorithms for doing so; one of them is a grid search over all possible subsets.
Grid search is just the simplest way to search through hyperparameters: an exhaustive search over a manually specified subset of the hyperparameter space.
So best subset selection can be done by grid search, but there are also other approaches.
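To make the distinction concrete, here is a minimal sketch on toy data (the estimator, the parameter grid, and the 5-feature dataset are arbitrary choices for illustration): the first loop enumerates feature subsets, the second call enumerates hyperparameter combinations.

```python
# Sketch: best subset selection enumerates feature subsets;
# grid search enumerates hyperparameter combinations. Toy data for illustration.
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Best subset selection: evaluate every non-empty subset of the 5 features.
best_score, best_subset = -1.0, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset
print("best feature subset:", best_subset, "CV accuracy:", round(best_score, 3))

# Grid search: exhaustive search over a hyperparameter grid (all features kept).
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_)
```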

What is the best way to find all possible paths between two nodes in a large-scale network?

I wonder what the best way is to find all possible paths from a source to a destination in a very large network (given as a network matrix), e.g. 5000 nodes. I have used this function, which is implemented using stacks, but it seems limited to about 60 nodes and cannot retrieve the paths for a 200-node network. As another approach, DFS (depth-first search) could be an option, but this algorithm also uses a stack, so I am worried about its scalability. So, is there an efficient way to find all paths between two given nodes in such a large network?
Depth-first search is about the only way to make this workable at the scale you specify, at least until quantum computing gives us effectively unlimited processing power, and even then only because it enumerates paths one at a time instead of storing them all. The number of simple paths between two nodes in a fully connected graph grows factorially with the number of nodes, so for anything like 5000 nodes it vastly exceeds the number of atoms in the observable universe (roughly 10^80).
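For reference, here is a minimal sketch of such a depth-first enumeration (the adjacency-list format and the tiny example graph are assumptions for illustration). It yields paths one at a time, so memory stays proportional to the path length, but the number of paths itself can still be astronomically large on dense graphs:

```python
# Sketch: generate all simple paths between two nodes with a recursive DFS.
# For very deep graphs an explicit stack avoids Python's recursion limit.
def all_simple_paths(adj, source, target, path=None):
    """adj: dict mapping each node to an iterable of its neighbours."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in adj.get(source, ()):
        if nxt not in path:          # keep the path simple (no repeated nodes)
            yield from all_simple_paths(adj, nxt, target, path)

adj = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
for p in all_simple_paths(adj, 0, 3):
    print(p)   # [0, 1, 2, 3], [0, 1, 3], [0, 2, 3]
```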

rails - comparison of arrays of sentences

I have two arrays of sentences. As you can see, I'm trying to match applicant abilities with job requirements.
Array A
-Must be able to use MS Office
-Applicant should be prepared to work 40 to 50 hours a week
-Must know FDA Regulations, FCC Regulations
-Must be willing to work in groups
Array B
-Proficient in MS Office
-Experience with FDA Regulations
-Willing to work long hours
-Has experience with math applications.
Is there any way to compare the two arrays and determine how many similarities there are? Preferably on a sentence-by-sentence basis (not just picking out words that are similar), returning a percentage of similarity.
Any suggestions?
What you are asking for is pretty difficult; it is the buzz of natural language processing today.
NLTK is the toolkit of choice, but it's in Python. There are lots of academic papers in this field. Most use corpora to train a model under the hypothesis that words that are similar tend to appear in similar contexts (i.e. surrounded by similar words). This is very computationally expensive.
You can come up with a rudimentary solution using the nltk library with this plan in mind (a sketch of the plan follows below):
- Remove filler words (a, the, and).
- Use the part-of-speech tagger to label verbs, nouns, etc. (I'd discard anything other than nouns and verbs).
- For, say, any two nouns (or verbs), use the WordNet library to get the synonyms of each word, and count a match if they share one. There are lots of other papers on this that use corpora to build lexicons, which can use word frequencies to measure word similarities. The latter method is preferred because you are likely to relate words that are similar but do not have synonyms in common.
- You can then give a relative measure of sentence similarity based on the word similarities.
Other methods consider the syntactic structure of the sentence, but you don't get much benefit from that. Unfortunately, the above method is not very good, because of the nature of WordNet.
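A rough sketch of that plan using NLTK and WordNet (the matching rule and the scoring are arbitrary illustration choices, not a tuned method; the NLTK data packages must be downloaded once, and their exact names can vary across NLTK versions):

```python
# Rough sketch of the plan above: drop stopwords, keep nouns/verbs,
# then score sentence pairs by WordNet synonym overlap.
import nltk
from nltk.corpus import stopwords, wordnet

# One-time downloads; resource names may differ slightly across NLTK versions.
for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

STOP = set(stopwords.words("english"))

def content_words(sentence):
    """Lowercased nouns and verbs of a sentence, minus stopwords."""
    tokens = [t.lower() for t in nltk.word_tokenize(sentence) if t.isalpha()]
    return [w for w, tag in nltk.pos_tag(tokens)
            if tag.startswith(("NN", "VB")) and w not in STOP]

def words_match(a, b):
    """True if the two words are equal or share a WordNet lemma (synonym)."""
    lemmas_a = {l.name() for s in wordnet.synsets(a) for l in s.lemmas()}
    lemmas_b = {l.name() for s in wordnet.synsets(b) for l in s.lemmas()}
    return a == b or bool(lemmas_a & lemmas_b)

def sentence_similarity(s1, s2):
    """Fraction of s1's content words that have a match in s2."""
    w1, w2 = content_words(s1), content_words(s2)
    if not w1:
        return 0.0
    return sum(any(words_match(a, b) for b in w2) for a in w1) / len(w1)

print(sentence_similarity("Must be able to use MS Office",
                          "Proficient in MS Office"))
```

Averaging these pairwise scores over the best-matching sentence pairs would give the overall percentage the question asks for.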
