I am a newbie in machine learning, learning the basic concepts of regression. My confusion is best explained with an example of input samples and their target values. (Please note that this example is a simplified general case; I observed the actual performance and predicted values on a large custom dataset of images. Also note that the target values are not floats.) For example, I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
and
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
As you can see, every three samples (every two in the test set) share the same target value. Suppose I have a multi-layer perceptron network with one Flatten() and two Dense() layers. The network, after training, predicts the same target value for all test samples:
yPredicted = [40, 40, 40, 40]
Because the predicted values are all the same, the correlation between ytest and yPredicted is undefined (null) and raises an error.
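For reference, here is a minimal sketch (my assumption: Pearson correlation computed with NumPy) of why a constant prediction vector makes the correlation undefined:

import numpy as np

ytest = [25, 25, 35, 35]
yPredicted = [40, 40, 40, 40]

# Pearson correlation divides by the standard deviation of each vector;
# a constant vector has zero standard deviation, so the result is
# undefined (NaN) and NumPy emits a runtime warning
print(np.corrcoef(ytest, yPredicted))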
But when I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [332, 433, 456, 675, 234, 879, 242, 634, 789, 432, 897, 982]
And:
xtest = [13, 14, 15, 16]
ytest = [985, 341, 354, 326]
The predicted values are:
yPredicted = [987, 345, 435, 232]
which gives very good correlations.
My question is: what is it in a machine learning algorithm that makes learning better when each input has a distinct target value? Why does the network not work when a large number of inputs share repeated target values?
Why does the network not work when a large number of inputs share repeated target values?
Most certainly, this is not the reason why your network does not perform well in the first dataset shown.
(You have not provided any code, so inevitably this will be a qualitative answer)
Looking closely at your first dataset:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
it's not difficult to conclude that we have a monotonic (increasing) function y(x) (it is not strictly monotonic, but it is monotonic nevertheless over the whole x range provided).
Given that, your model has absolutely no way of "knowing" that, for x > 12, the qualitative nature of the function changes significantly (and rather abruptly), as apparent from your test set:
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
and you should not expect it to know or "guess" it in any way (despite what many people seem to believe, NNs are not magic).
Looking closely at your second dataset, you will realize that this is not the case there, hence the network is unsurprisingly able to perform better here; when doing such experiments, it is very important to be sure that we are comparing apples to apples, and not apples to oranges.
Another general issue with your attempts here and your question is the following: neural nets are not good at extrapolation, i.e. predicting numerical functions outside the numeric domain on which they have been trained. For details, please see my own answer to Is deep learning bad at fitting simple non-linear functions outside the training scope?
A last unusual thing here is your use of correlation; I am not sure why you chose to do this, but you may be interested to know that, in practice, we never assess model performance using a correlation measure between predicted outcomes and ground truth - we use measures such as the mean squared error (MSE) instead (for regression problems, such as yours here).
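For illustration, a minimal sketch of the usual approach (assuming scikit-learn is available; mean_squared_error is the standard scikit-learn metric):

from sklearn.metrics import mean_squared_error

ytest = [25, 25, 35, 35]
yPredicted = [40, 40, 40, 40]

# MSE is well-defined even when all predictions are identical,
# unlike a correlation coefficient
print(mean_squared_error(ytest, yPredicted))  # 125.0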
I'm working through Head First Python, and there's an example:
from datetime import datetime
odds = [ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19,
21, 23, 25, 27, 29, 31, 33, 35, 37, 39,
41, 43, 45, 47, 49, 51, 53, 55, 57, 59 ]
right_this_minute = datetime.today().minute
if right_this_minute in odds:
    print("This minute seems a little odd.")
else:
    print("Not an odd minute.")
Now if I substitute "import datetime" for "from datetime import datetime", the interpreter gives me an error:
right_this_minute = datetime.today().minute
AttributeError: module 'datetime' has no attribute 'today'
I don't understand why the "from datetime import datetime" works, but "import datetime" does not. I've gone through a number of stackoverflow Q&A's about this, but I'm obviously missing something.
Any suggestions would be greatly appreciated.
First of all, there are two "things" called datetime: the module and a class defined by the module.
The two import options you use have different behaviours.
When you run:
from datetime import datetime
the first datetime is the module and the second is the class: Python imports only the class (datetime) from the module. From then on, Python will understand the name datetime to refer to the class.
When you run:
import datetime
you import the whole module, so Python will understand datetime to be the module. To access class datetime, you need to use datetime.datetime.
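For illustration, a minimal sketch contrasting the two imports (standard library only):

# Option 1: import the module; the class must be qualified with the module name
import datetime
print(datetime.datetime.today().minute)

# Option 2: import the class directly; the bare name now refers to the class
from datetime import datetime
print(datetime.today().minute)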
I am trying to build a little file and email search engine. I'd also like to use more advanced search queries for the full-text search, hence I am looking at Lucene indexes. From what I have seen, there are two approaches - node_auto_index and apoc.index.addNode.
Setting the index up works fine, and indexing nodes with small properties works. When trying to index nodes with properties that are larger than 32k, neo4j fails (and gets into an unusable state).
The error message boils down to:
WARNING: Failed to invoke procedure apoc.index.addNode: Caused by:
java.lang.IllegalArgumentException: Document contains at least one
immense term in field="text_e" (whose UTF8 encoding is longer than the
max length 32766), all of which were skipped. Please correct the
analyzer to not produce such terms. The prefix of the first immense
term is: '[110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32,
110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101,
111, 32, 110, 101]...', original message: bytes can be at most 32766
in length; got 40000
I have checked this on 3.1.2 and 3.1.0+ apoc 3.1.0.3
A much longer description of the problem can be found at https://baach.de/Members/jhb/neo4j-full-text-indexing.
Is there any way to fix this? E.g. have I done anything wrong, or is there something to configure?
Thx a lot!
Neo4j does not support index values longer than ~32k because of an underlying Lucene limitation.
For some details around that area, you can look at:
https://github.com/neo4j/neo4j/pull/6213 and https://github.com/neo4j/neo4j/pull/8404.
You need to split such long values into multiple terms.
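For illustration, a minimal sketch of one way to do the split client-side before indexing (my assumptions: the 32766-byte limit from the error message; chunk_text is a hypothetical helper, not part of neo4j or APOC):

def chunk_text(text, max_bytes=32000):
    # Split text on whitespace into chunks whose UTF-8 encoding stays
    # safely under Lucene's 32766-byte single-term limit
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode('utf-8')) + 1  # +1 for the joining space
        if size + word_size > max_bytes and current:
            chunks.append(' '.join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(' '.join(current))
    return chunks

# Each chunk can then be indexed as a separate term/property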
Algorithms in scikit-learn can have parameters with a default value and a fixed range of options, e.g.
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
and the algorithm parameter has the default value 'auto', with the following options: {'auto', 'ball_tree', 'kd_tree', 'brute'}.
My question is: when using GridSearchCV to find the best set of values for the parameters of an algorithm, will GridSearchCV go through all the available options of a parameter even though I don't add it to the parameter_list?
For example, I want to use GridSearchCV to find the best parameter values for kNN, and I need to examine the n_neighbors and algorithm parameters. Is it enough to pass only the n_neighbors values, as below (because the algorithm parameter has a default set of options),
parameter_list = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
or do I have to specify all the options that I want to examine?
parameter_list = {
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
Thanks.
No. You are misunderstanding the difference between a parameter's default value and its available options.
Looking at the documentation of KNeighborsClassifier, the parameter algorithm is an optional parameter (i.e. you may and may not specify it during constructor of KneighborsClassifier).
But if you decide to specify it, then only the listed options are available: {'auto', 'ball_tree', 'kd_tree', 'brute'}. That means you can give algorithm only a value from these options and cannot use any other string. The default option is 'auto', which means that if you don't supply any value, it will internally use 'auto'.
Case 1: KNeighborsClassifier(n_neighbors=3)
Since no value for algorithm has been specified here, it will use the default algorithm='auto'.
Case 2: KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
Since algorithm has been specified here, it will use 'kd_tree'.
Now, GridSearchCV will only pass those parameters to the estimator that are specified in the param_grid. So in your case, when you use the first parameter_list from the question, it will give only n_neighbors to the estimator, and algorithm will keep its default value ('auto').
If you use the second parameter_list, then both n_neighbors and algorithm will be passed on to the estimator.
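For illustration, a minimal sketch of both setups (assuming the standard scikit-learn imports; X and y stand in for your training data):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# First grid: only n_neighbors is searched; algorithm keeps its default 'auto'
grid1 = GridSearchCV(KNeighborsClassifier(),
                     {'n_neighbors': list(range(1, 31))})

# Second grid: both parameters are searched; 30 * 4 = 120 candidate combinations
grid2 = GridSearchCV(KNeighborsClassifier(),
                     {'n_neighbors': list(range(1, 31)),
                      'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']})

# grid1.fit(X, y)  # X, y: your training data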
I'm new to using NetworkX library with Python.
Let's say that I import a Pajek-formatted file:
import networkx as nx
G=nx.read_pajek("pajek_network_file.net")
G=nx.Graph(G)
The contents of my file are (In Pajek, nodes are called "Vertices"):
*Network
*Vertices 6
123 Author1
456 Author2
789 Author3
111 Author4
222 Author5
333 Author6
*Edges
123 333
333 789
789 222
222 111
111 456
Now, I want to calculate all the shortest path lengths between the nodes in my network, and I'm using this function, per the library documentation
path = nx.all_pairs_shortest_path_length(G)
Returns: lengths – Dictionary of shortest path lengths keyed by source and target.
The return I'm getting:
print path
{u'Author4': {u'Author4': 0, u'Author5': 1, u'Author6': 3, u'Author1': 4, u'Author2': 1, u'Author3': 2}, u'Author5': {u'Author4': 1, u'Author5': 0, u'Author6': 2, u'Author1': 3, u'Author2': 2, u'Author3': 1}, u'Author6': {u'Author4': 3, u'Author5': 2, u'Author6': 0, u'Author1': 1, u'Author2': 4, u'Author3': 1}, u'Author1': {u'Author4': 4, u'Author5': 3, u'Author6': 1, u'Author1': 0, u'Author2': 5, u'Author3': 2}, u'Author2': {u'Author4': 1, u'Author5': 2, u'Author6': 4, u'Author1': 5, u'Author2': 0, u'Author3': 3}, u'Author3': {u'Author4': 2, u'Author5': 1, u'Author6': 1, u'Author1': 2, u'Author2': 3, u'Author3': 0}}
As you can see, it's really hard to read and to put to later use...
Ideally, what I'd like is a return with a format similar to the below:
source_node_id, target_node_id, path_length
123, 456, 5
123, 789, 2
123, 111, 4
In short, I need a return that uses only (or at least includes) the node ids, instead of just showing the node labels. And I'd like every possible pair on a single line with its corresponding shortest path...
Is this possible in NetworkX?
Function Reference: https://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.shortest_paths.unweighted.all_pairs_shortest_path_length.html
In the end, I only needed to calculate the shortest path for a subset of the whole network (my actual network is huge, with 600K nodes and 6M edges), so I wrote a script that reads source/target node pairs from a CSV file, stores them in a numpy array, passes each pair to nx.shortest_path_length, and finally saves the results to a CSV file.
The code is below, I'm posting it just in case it can be useful for someone out there:
print "Importing libraries..."
import networkx as nx
import csv
import numpy as np
#Import network in Pajek format .net
myG=nx.read_pajek("MyNetwork_0711_onlylabel.net")
print "Finished importing Network Pajek file"
#Simplify graph into networkx format
G=nx.Graph(myG)
print "Finished converting to Networkx format"
#Network info
print "Nodes found: ",G.number_of_nodes()
print "Edges found: ",G.number_of_edges()
#Reading file and storing to array
with open('paired_nodes.csv','rb') as csvfile:
reader = csv.reader(csvfile, delimiter = ',', quoting=csv.QUOTE_MINIMAL)#, quotechar = '"')
data = [data for data in reader]
paired_nodes = np.asarray(data)
paired_nodes.astype(int)
print "Finished reading paired nodes file"
#Add extra column in array to store shortest path value
paired_nodes = np.append(paired_nodes,np.zeros([len(paired_nodes),1],dtype=np.int),1)
print "Just appended new column to paired nodes array"
#Get shortest path for every pair of nodes
for index in range(len(paired_nodes)):
try:
shortest=nx.shortest_path_length(G,paired_nodes[index,0],paired_nodes[index,1])
#print shortest
paired_nodes[index,2] = shortest
except nx.NetworkXNoPath:
#print '99999' #Value to print when no path is found
paired_nodes[index,2] = 99999
print "Finished calculating shortest path for paired nodes"
#Store results to csv file
f = open('shortest_path_results.csv','w')
for item in paired_nodes:
f.write(','.join(map(str,item)))
f.write('\n')
f.close()
print "Done writing file with results, bye!"
How about something like this?
import networkx as nx
G=nx.read_pajek("pajek_network_file.net")
G=nx.Graph(G)
# first get all the lengths
path_lengths = nx.all_pairs_shortest_path_length(G)
# now iterate over all pairs of nodes
for src in G.nodes():
    # look up the id as desired
    id_src = G.node[src].get('id')
    for dest in G.nodes():
        if src != dest:  # ignore self-self paths
            id_dest = G.node[dest].get('id')
            l = path_lengths.get(src).get(dest)
            print "{}, {}, {}".format(id_src, id_dest, l)
This yields output like:
111, 222, 1
111, 333, 3
111, 123, 4
111, 456, 1
111, 789, 2
...
If you need to do further processing (e.g. sorting) then store the l values rather than just printing them.
(You could loop through the pairs more cleanly with something like itertools.combinations(G.nodes(), 2), as sketched below, but the method above is a bit more explicit in case you aren't familiar with it.)
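For completeness, a minimal sketch of that itertools variant (it reuses G and path_lengths from the snippet above; note that combinations yields each unordered pair once, so each path length is printed once rather than in both directions):

from itertools import combinations

for src, dest in combinations(G.nodes(), 2):
    id_src = G.node[src].get('id')
    id_dest = G.node[dest].get('id')
    l = path_lengths.get(src).get(dest)
    print "{}, {}, {}".format(id_src, id_dest, l)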