I have tried nodevectors and fastnode2vec, but I cannot get vectors for all nodes. Why?
e.g.
The code is
from fastnode2vec import Graph, Node2Vec
graph = Graph(_lst, directed=True, weighted=True)
model = Node2Vec(graph, dim=300, walk_length=100, context=10, p=2.0, q=0.5, workers=-1)
model.train(epochs=epochs)
I have 10,000 nodes. When I check:
model.index_to_key
there are only 502 nodes.
Why is that?
How to set parameters so I can get the vectors of all nodes?
It's possible that your settings are not generating enough appearances of all nodes to meet other requirements for inclusion, such as the min_count=5 used by the related Word2Vec superclass to discard tokens with too few example usages to model well.
See this other related answer for related considerations & possible fixes (though in the context of the nodevectors package rather than the fastnode2vec package you're using):
nodevectors not returning all nodes
If that doesn't help resolve your issue, you should include more details about your graph, such as demonstrating via displayed output that it really has 10,000 nodes, that they're all sufficiently connected, and that the random walks generated by your node2vec library revisit all of them often enough for training purposes.
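As a rough way to run those checks, here is a minimal sketch in plain Python. It does not use the fastnode2vec API: the helper names `node_coverage` and `rare_nodes` are hypothetical, the edge list is assumed to hold `(source, target, weight)` tuples, and `walks` is assumed to be a list of node lists that you have obtained from your library by whatever means it offers.

```python
from collections import Counter

def node_coverage(edges):
    """Return (all nodes, nodes with at least one outgoing edge) for a directed edge list."""
    all_nodes, sources = set(), set()
    for src, dst, *_ in edges:  # tolerate (src, dst) or (src, dst, weight)
        all_nodes.update((src, dst))
        sources.add(src)
    return all_nodes, sources

def rare_nodes(walks, min_count=5):
    """Return nodes that appear fewer than min_count times across all walks."""
    counts = Counter(node for walk in walks for node in walk)
    return {node for node, count in counts.items() if count < min_count}

# Toy data, purely illustrative.
edges = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "a", 2.0), ("a", "d", 1.0)]
all_nodes, sources = node_coverage(edges)
print(len(all_nodes), "distinct nodes in the edge list")
print("nodes with no outgoing edge:", all_nodes - sources)

walks = [["a", "b", "c"], ["a", "d"], ["b", "c", "a"]]
print("nodes below min_count=2:", rare_nodes(walks, min_count=2))
```

If the number of distinct nodes in the edge list is already far below 10,000, the problem lies in the input data rather than in the training parameters.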
I have a list of approximately 500 abstracts and articles in a CSV file; each paragraph contains approximately 800 to 1000 words. Whenever I build the vocab and print it, I get None. How can I improve the results?
import string
import gensim
from nltk.tokenize import word_tokenize
lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
target_data = word_tokenize(lst_doc)
train_data = list(read_data())
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
train_vocab = model.build_vocab(train_data)
print(train_vocab)
train = model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)
Output:
None
A call to build_vocab() only builds the vocabulary inside the model, for further usage. That function call doesn't return anything, so your train_vocab variable will be Python None.
So, the behavior you're seeing is as expected, and you should say more about what your ultimate aims are, and what you'd want to see as steps towards those aims, if you're stuck.
If you want to see reporting of the progress of your calls to build_vocab() or train(), you can set the logging level to INFO. This is usually a good idea when learning a new library: even if the copious info shown is initially hard to understand, by reviewing it you'll start to see the various internal steps, and internal counts/timings/etc, that hint whether things are going well or poorly.
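A minimal sketch of turning on INFO-level logging with Python's standard logging module. The format string is just an illustrative choice, and force=True (Python 3.8+) is an assumption on my part so that the call takes effect even if logging was already configured elsewhere:

```python
import logging

# Configure the root logger; Gensim's module loggers propagate to it.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # replace any handlers that were set up earlier
)
```

Run this before creating the model, so that the later build_vocab() and train() calls report their progress.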
You can also examine the state of the model and its various internal properties after the code has run.
For example, the model.wv property contains, after build_vocab(), a Gensim KeyedVectors structure holding all the untrained ready-for-training vectors. You can ask for its length (len(model.wv)) or examine the discovered active list of words (model.wv.index_to_key).
Other comments:
It's not clear that your first two lines – assigning into lst_doc and target_data – affect anything further, since it's unclear what read_data() might be doing to fill train_data.
Often low min_count values worsen results, by including more words that have so few usage examples that they're little more than noise during training.
Only 500 documents is rather small compared to most published work showing impressive results with this algorithm, which uses tens of thousands of documents (if not millions). So, keep in mind that results on such a small dataset may be unrepresentative of what's possible with a larger corpus, in terms of quality, optimal parameters, etc.
I am trying to find out what indices TDB2 builds. I found from the code that it uses B+ trees to store them on disk, but I don't understand what they contain or how they are used.
So my detailed questions are:
For which collation order of RDF triples (like SPO, SOP, POS, PSO, ... ) does it build indices?
How are RDF Terms encoded and stored?
What strategy is used to load the indices into main memory? (I would expect paging)?
It would also help me if you could point me to a white paper or something similar about TDB2's software design. I searched for it but couldn't find anything.
TDB2 has an "id" for each RDF term (literals, URIs, blank nodes). The id has a fixed length of 64 bits. Another way of saying this is that it keeps a dictionary.
For triples it keeps SPO, POS, and OSP (this is configurable but that's the default). A triple is stored in an index as those ids - so 3 ids per triple. Fixed length.
Indexes are memory-mapped files outside the heap by default. They provide good usability.
That's the current default setup. The code isolates these choices so they can change: for example, the 64-bit ids could be made longer, or different index choices could be made.
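To make the idea of a term dictionary plus fixed-length index entries concrete, here is a toy sketch. It is not TDB2's code or on-disk format: the `NodeTable` class, the sequential id assignment, and the big-endian packing are all assumptions made for illustration, and a sorted Python list stands in for a B+ tree.

```python
import struct

class NodeTable:
    """Toy dictionary mapping RDF terms to integer ids and back."""
    def __init__(self):
        self._ids = {}
        self._terms = []

    def id_for(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def term_for(self, node_id):
        return self._terms[node_id]

# Position of subject (0), predicate (1), object (2) within each index key.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}

def index_key(triple_ids, order):
    """Pack three 64-bit ids into a fixed 24-byte key in the given order."""
    return struct.pack(">3Q", *(triple_ids[i] for i in ORDERS[order]))

def build_indexes(triples, table):
    """Encode triples as id tuples and keep one sorted key list per ordering."""
    encoded = [tuple(table.id_for(term) for term in triple) for triple in triples]
    return {name: sorted(index_key(ids, name) for ids in encoded) for name in ORDERS}

table = NodeTable()
triples = [(":alice", ":knows", ":bob"), (":bob", ":knows", ":carol")]
indexes = build_indexes(triples, table)
first_pos = struct.unpack(">3Q", indexes["POS"][0])
print([table.term_for(i) for i in first_pos])
```

Because every key has the same length and sorts bytewise in id order, a lookup such as "all triples with predicate :knows" becomes a range scan over the POS index.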
I am trying to implement relation extraction between verb pairs. I want to use dependency path from one verb to the other as a feature for my classifier (predicts if relation X exists or not). But I am not sure how to encode the dependency path as a feature. Following are some example dependency paths, as space separated relation annotations from StanfordCoreNLP Collapsed Dependencies:
nsubj acl nmod:from acl nmod:by conj:and
nsubj nmod:into
nsubj acl:relcl advmod nmod:of
It is important to keep in mind that these paths are of variable length and that a relation can reappear without any restriction.
Two compromise encodings of this feature that come to my mind are:
1) Ignore the sequence, and just have one feature for each relation with its value being the number of times it appears in the path
2) Have a sliding window of length n, and have one feature for each possible pair of relations with the value being the number of times those two relations appeared consecutively. I suppose this is how one encodes n-grams. However, the number of possible relations is 50, which means I cannot really go with this approach.
Any suggestions are welcome.
We had a project that built a classifier based off of dependency paths. I asked the group member who developed the system, and he said:
indicator feature for the whole path
So if you have the training data point (verb1 -e1-> w1 -e2-> w2 -e3-> w3 -e4-> verb2, relation1) the feature would be (e1-e2-e3-e4)
And he also did ngram sequences, so for that same data point, you would also have (e1), (e2), (e3), (e4), (e1-e2), (e2-e3), (e3-e4), (e1-e2-e3), (e2-e3-e4)
He also recommended collapsing appositive edges to make the paths smaller.
Also, I should note that he developed a set of high precision rules for each relation, and used this to create a large set of training data.
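The whole-path indicator and the n-gram features described above can be sketched roughly as follows. This is my own illustration, not the group member's code: the function name `path_features`, the `path=` and `ngram=` prefixes, and the choice of n up to 3 are assumptions.

```python
from collections import Counter

def path_features(path, max_n=3):
    """Turn a space-separated dependency path into sparse count features."""
    relations = path.split()
    features = Counter()
    # Indicator feature for the whole path.
    features["path=" + "-".join(relations)] = 1
    # Counts of every contiguous n-gram of relations, for n = 1..max_n.
    for n in range(1, max_n + 1):
        for i in range(len(relations) - n + 1):
            features["ngram=" + "-".join(relations[i:i + n])] += 1
    return features

feats = path_features("nsubj acl nmod:from acl nmod:by conj:and")
print(feats["ngram=acl"])                    # unigram that occurs twice
print(feats["ngram=nsubj-acl-nmod:from"])    # one trigram
```

The resulting dictionaries can be fed to a sparse vectorizer, so only the paths and n-grams actually seen in training take up feature space, rather than all possible combinations of the 50 relations.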
I am using an Apriori algorithm implementation to generate association rules from a transaction set, and I am getting the following association rules. I get the rule 1->8; can I assume 8->1? The rules run from 0 to 9 because there are 10 product classes, but with this algorithm I never get something like 8->2 or 9->1. Can I reverse an association rule such as 2->8 to get 8->2? If not, can someone point me to a better Apriori implementation?
0-->5
0-->9
1-->2
1-->4
1-->5
1-->7
1-->8
1-->9
2-->3
2-->4
2-->5
2-->6
2-->7
2-->8
2-->9
3-->4
3-->5
3-->6
3-->7
3-->8
4-->5
4-->6
4-->7
4-->8
4-->9
5-->6
5-->7
5-->8
5-->9
6-->7
6-->8
6-->9
7-->8
7-->9
8-->9
You can get my favourite apriori implementation here:
http://www.borgelt.net/apriori.html
(Christian Borgelt also has implementations for many other mining algorithms.)
I use it regularly to mine datasets with millions of entries and it's blazingly fast.
And you can configure it to do what you want (frequent item sets vs. association rules).
Of course you can assume so (1=>9 is equal to 9=>1). The items are basically combinations, not permutations.
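One caveat worth checking on your own data: the symmetry holds for the support of an item pair, but the confidence of a directed rule generally depends on the direction. A small sketch with made-up transactions (the helper names `support` and `confidence` are mine, not from any particular Apriori library):

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transactions, purely illustrative.
transactions = [{1, 9}, {1, 9}, {9}, {9}]

print(support(transactions, {1, 9}))       # same value as support of {9, 1}
print(confidence(transactions, {1}, {9}))  # rule 1 -> 9
print(confidence(transactions, {9}, {1}))  # rule 9 -> 1
```

So if the listed pairs are frequent itemsets, reading them in either order is fine; if they are meant as rules with a confidence threshold, each direction has to be evaluated separately.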
FPGrowth is way more efficient than Apriori
If you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
http://www.philippe-fournier-viger.com/spmf/
It also offers implementations of Eclat, FPGrowth, Charm and many other algorithms that can be used for association rule mining, frequent itemset mining, sequential pattern mining and sequential rule mining.
Say I want to build a check-in aggregator that counts visits across platforms, so that I can know for a given place how many people have checked in there on Foursquare, Gowalla, BrightKite, etc. Is there a good library or set of tools I can use out of the box to associate the venue entries in each service with a unique place identifier of my own?
I basically want a function that can map from a pair of (placename, address, lat/long) tuples to [0,1) confidence that they refer to the same real-world location.
Someone must have done this already, but my google-fu is weak.
Yes, you can submit the two addresses using geocoder.net (assuming you're a .Net developer, you didn't say). It provides a common interface for address verification and geocoding, so you can be reasonably sure that one address equals another.
If you can't get them to standardize and match, you can compare their distances and assume they are the same place if they are below a certain threshold away from each other.
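A minimal sketch of that distance check, using the haversine great-circle formula. The 50-metre threshold and the function names are assumptions for illustration; a sensible threshold depends on how precise each service's coordinates are.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def same_place(a, b, threshold_m=50.0):
    """Treat two (lat, lon) pairs as one venue if they are within threshold_m."""
    return haversine_m(*a, *b) <= threshold_m

print(same_place((40.0, -74.0), (40.0001, -74.0)))  # roughly 11 m apart
print(same_place((40.0, -74.0), (40.01, -74.0)))    # roughly 1.1 km apart
```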
I'm pessimistic that such a tool is already available.
A good solution to match pairs based on the entity resolution literature would be to
get the placenames, define and use a good distance function on them (e.g. edit distance),
get the addresses, standardize them (e.g. with the mentioned geocoder.net tools), and also define a distance between them,
get the coordinates and get a distance (this is easy: there are lots of libraries and tools for geographic distance calculations, and that seems to be a good metric),
turn the distances to probabilities ("what is the probability of such a distance, if we suppose these are the same places")(not straightforward),
and combine the probabilities (also not straightforward).
Then maybe a closure-like algorithm (closing the set by merging pairs above a given probability threshold) can also help to find all the matches (for example, when different names accumulate for a given venue).
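A rough sketch of the name-similarity and closure steps, under stated assumptions: `difflib.SequenceMatcher` from the standard library stands in for a true edit distance, the 0.8 threshold is arbitrary, and the pair scores passed to `merge_matches` are assumed to be the already-combined probabilities (how to combine them is the hard part and is not solved here). The closure is a plain union-find.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two place names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def merge_matches(n_items, scored_pairs, threshold=0.8):
    """Group item indices into clusters by merging pairs scored at or above threshold."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

print(name_similarity("Joe's Pizza", "joe's pizza "))
# Venues 0-1 and 1-2 match strongly, so 0, 1 and 2 end up in one cluster.
print(merge_matches(4, [(0, 1, 0.9), (1, 2, 0.85), (2, 3, 0.1)]))
```

Note that the closure merges 0 and 2 even though that pair was never scored directly, which is how several name variants of one venue accumulate into a single entry.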
It wouldn't be a bad tool or service however.