I'm trying out RAPIDS cudf and cuspatial, and I'm wondering about better ways to cross join two dataframes when the result is 27 billion rows.
I've got two datasets: one is New York City taxi trip data (14.7 million rows) containing the longitude/latitude of pick-up locations; the other contains the longitude/latitude of 1.8k metro stations. For each trip I want to cross join with all station locations, then calculate the Haversine distance for every trip/station pair.
I don't think cudf allows cross joins, so I created a new key column in both datasets:
cutaxi['key'] = 0
cumetro['key'] = 0
cutaxi_metro = cutaxi.merge(cumetro, on = 'key', how = 'outer')
cutaxi_metro['hdist_km'] = cuspatial.haversine_distance(
    cutaxi_metro['EntranceLongitude'], cutaxi_metro['EntranceLatitude'],
    cutaxi_metro['taxi_pickup'], cutaxi_metro['taxi_dropoff'])
I was running the code on an Nvidia V100 with 4 virtual CPUs, but I still ran into out-of-memory issues. I'm guessing that I need to process the merge in batches, but I'm not sure how to approach it. Any suggestions are appreciated!
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
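For reference, the kind of batching I had in mind looks roughly like the sketch below (untested; it assumes haversine_distance accepts the four coordinate columns the way my snippet above uses them, and it reduces each batch to the nearest station per trip so the full 27-billion-row result never has to sit in GPU memory at once):

import cudf
import cuspatial

def batched_min_station_distance(cutaxi, cumetro, batch_size=1_000_000):
    # Cross join one slice of trips at a time so the full cross join
    # never has to exist on the GPU all at once.
    cumetro = cumetro.copy()
    cumetro['key'] = 0
    parts = []
    for start in range(0, len(cutaxi), batch_size):
        chunk = cutaxi.iloc[start:start + batch_size].reset_index()  # 'index' column keeps the trip id
        chunk['key'] = 0
        merged = chunk.merge(cumetro, on='key', how='outer')
        merged['hdist_km'] = cuspatial.haversine_distance(
            merged['EntranceLongitude'], merged['EntranceLatitude'],
            merged['taxi_pickup'], merged['taxi_dropoff'])
        # reduce per trip before moving on, otherwise memory still blows up
        parts.append(merged.groupby('index')['hdist_km'].min())
        del merged
    return cudf.concat(parts)

If all 27 billion pairwise distances really are needed, I suppose each batch would have to be written out to disk (e.g. Parquet) instead of being reduced or kept in memory.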
I have an ML.net project and so far everything has gone great. I have a motor that collects a power reading 256 times around each rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time, so I have been spending several rotations collecting the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the motor's state every rotation. If I just evenly space the samples down to 38, my model degrades by a lot. I know I am not feeding the model the features it considers most important; I'm just making a guess and selecting data for the model more or less at random.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out things such as the weight for each column, and if so, how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting Single but got String". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later or someone else here can. Below is the code example.
// Usings this snippet relies on
using System;
using System.Collections.Immutable;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Create MLContext
MLContext mlContext = new MLContext();

// Load data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column names of the input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label")
        .ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define the Stochastic Dual Coordinate Ascent (SDCA) estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train the model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features matter most by changing each feature in isolation and measuring how much that change affects the model's performance metrics.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use an open-source set of packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local-instance explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX
Recently I have been implementing an algorithm from a paper that I will be using in my master's work, but I've come across some problems regarding the time it takes to perform some operations.
Before I get into details, I just want to add that my data set comprises roughly 4 million data points.
I have two lists of tuples that I get from a framework (Annoy) that calculates the cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two such lists with the same names (the first value of each tuple) but two different cosine similarities. What I have to do is sum the cosines from both lists, then order the result and take the top-N highest cosine values.
My issue is: it's taking too long. My current implementation is as follows:
def topN(self, user, session):
    upref = self.m2vTN.get_user_preference(user)
    spref = self.sm2vTN.get_user_preference(session)
    # list of tuples 1
    most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
    # list of tuples 2
    most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
    # concat both lists and add into a dict
    d = defaultdict(int)
    for l, v in (most_ss + most_su):
        d[l] += v
    # convert the dict into a list, and then sort it
    _list = list(d.items())
    _list.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads, but I'm not sure that will help. Getting the lists is not the problem here; the concatenation and sorting are.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine similarity with all vectors in the model, the Annoy indexer isn't helping any. (Its purpose is to get a small subset of nearest neighbours much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.)
Further, if you're going to combine all of the distances from two such calculations, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector models, you can supply a False-ish topn value to get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
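For instance, a rough sketch of that ranking step (it assumes combined_dists is a plain NumPy array, which is what most_similar(..., topn=False) returns, and that self.N is the number of results wanted, as in the topN() method above):

import numpy as np

# indices of the self.N largest combined similarities, then sort just those
top = np.argpartition(-combined_dists, self.N)[:self.N]
top = top[np.argsort(-combined_dists[top])]
top_names = [self.m2v.index2entity[i] for i in top]

Using argpartition avoids fully sorting millions of values when you only need the top few.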
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)
My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration of the brute-force approach: iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
  for each (RATING that PERSON has made) do
    if (USER has rated the item that RATING refers to) do
      MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
    end
  end
end
lowest values in MATCHES are the best matches for USER
The problem here is that as the number of items, ratings, and people increases, this code will take a very long time to run. And, ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
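For a rough illustration of the idea (Python with SciPy here, purely as a sketch; it assumes every person has a dense rating vector, i.e. missing ratings have already been filled in somehow):

import numpy as np
from scipy.spatial import cKDTree

# one row per person, one column per item, ratings 0-5 (dense for this sketch)
ratings = np.random.randint(0, 6, size=(10000, 20)).astype(float)

tree = cKDTree(ratings)
user = ratings[0]
# the 6 nearest rows by Euclidean distance; the first hit is the user itself
dist, idx = tree.query(user, k=6)
print(idx[1:], dist[1:])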
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
Technically, your task is matching long strings made out of characters of a 5-letter alphabet. This kind of thing is researched extensively in computational biology (typically with 4-letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching/scoring of sequences.
Is your data sparse? With ratings, most of the time not every user rates every object.
Naively comparing each object to every other one is O(n*n*d), where d is the number of dimensions (rated items). However, a key trick of all the Hadoop solutions is to transpose the matrix and work only on the non-zero values in the columns. Assuming your sparsity is s = 0.01, this reduces the runtime to O(n*s * n*s * d), i.e. by a factor of s*s. So if only 1 rating in 100 is present, the computation will theoretically be 10,000 times faster.
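As a rough sketch of that trick (Python, with a made-up sparse layout): transpose the ratings so each item lists the users who rated it, and only accumulate rating differences for user pairs that co-occur under some item.

from collections import defaultdict
from itertools import combinations

def pairwise_differences(ratings):
    # ratings: {user_id: {item_id: rating}}, sparse - most items unrated
    by_item = defaultdict(list)          # transpose: item -> [(user, rating), ...]
    for user, items in ratings.items():
        for item, value in items.items():
            by_item[item].append((user, value))

    diffs = defaultdict(int)             # (user_a, user_b) -> summed rating difference
    for rated in by_item.values():
        for (u1, v1), (u2, v2) in combinations(rated, 2):
            diffs[(u1, u2)] += abs(v1 - v2)
    return diffs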
Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem remains quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical/discrete data and missing values. Indexing such data is not very well researched, AFAICT.
I am trying to find the best way to solve the problem below:
Problem
I have (up to) 100,000 Lat/Lng points in Set A
I have (up to) 2000 Lat/Lng points in Set B
For each point in Set B, I need to find its nearest neighbour in Set A.
Once they have been paired, I then need to calculate their distances, which will be:
2000 Set A points to 2000 Set B points.
These points are "in memory"; they do not come from a database - they are the result of other calculations done in the system.
Current Solution
Using a KDTree implementation in Ruby I can create a KDTree lookup that will match the points I have. I then use a haversine method in Ruby to calculate the distance of the points when they are paired.
KDtree code: Ruby KDTree Code
haversine Code: Haversine Code
Platform
I am running jruby - with rails as the web framework.
Issue
It's slow! Like 30 to 40 seconds slow... I think the main bottleneck is in the KDTree, but the point lookup takes a long time too (I think). With smaller numbers in Set B it's quick, but the higher the number of points in Set B, the slower it gets.
The Question
Would anyone think of doing this differently? Is there something I am missing? I think a Java library might be a lot quicker, but how would I implement this, and which one would I use? (I'm not strong in Java - I use JRuby for multithreading Ruby code in the JVM.)
Is it possible to persist the information to a database? Because then you can use GeoKit, which leverages a geo-aware database (MySQL, Postgres > 8.1, etc) so that you can do this:
Location.find(:all, :origin =>[37.792,-122.393], :within=>10, :order=>"distance asc")
Also, you can find the distance between two points, etc. The response time will be more on par with a DB query, and much faster than what you're seeing.
Just an idea: if you round your lat/longs to two decimal places, then all the points within about 1.11 km will collapse to the same value. See this for more details. I'm not 100% sure about it, but maybe it works for you. Of course this will not work for areas near the poles, as longitude shrinks there.
To speed up the distance calculation between two lat/longs, you can calculate the Euclidean distance using the simple distance formula rather than the geographical distance. This distance will not be accurate, of course, but it will speed up the process.
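A rough sketch of the trade-off (Python here just for illustration; the function names are mine): the equirectangular approximation treats the small patch of the globe as flat and skips most of the trigonometry the haversine formula needs.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # exact great-circle distance on a spherical Earth
    r = 6371.0
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + \
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def approx_km(lat1, lon1, lat2, lon2):
    # equirectangular approximation: fine over short distances, much cheaper
    r = 6371.0
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return r * math.hypot(x, y)

print(haversine_km(40.75, -73.99, 40.76, -73.98))  # ~1.4 km
print(approx_km(40.75, -73.99, 40.76, -73.98))     # nearly the same at this scale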
I have a model Transaction for which I need to display the results of many calculations on many fields for a subset of transactions.
I've seen two ways to do it, but am not sure which is best. I'm after the one that will have the least performance impact as the data set grows and the number of concurrent users increases.
data[:total_before] = Transaction.where(xxx).sum(:amount_before)
data[:total_after] = Transaction.where(xxx).sum(:amount_after)
...
or
transactions = Transaction.where(xxx)
data[:total_before]= transactions.inject(0) {|s, e| s + e.amount_before }
data[:total_after]= transactions.inject(0) {|s, e| s + e.amount_after }
...
edit: the where clause is always the same.
Which one should I choose? (or is there a 3rd, better way?)
Thanks, P.
Not to nag, but what about
transactions = Transaction.where(xxx)
data[:total_before] = transactions.sum(:amount_before)
data[:total_after] = transactions.sum(:amount_after)
? This looks like the union of the strengths of methods 1 and 2 :) You reuse the search and employ the cleaner, Rails-specific sum aggregator.
PS If you were asking whether it's possible to rely on Rails caching the results of the Transaction.where(xxx) query, that I don't know. And when I don't know, I prefer to play it safe.
Really you're talking about scalability.
If you're talking about millions of rows and needing to do calculations on them, then which do you think would be faster?
Asking the DBM to summarize millions of rows and return you two numbers.
Returning millions of query results across the network which you iterate over twice.
In the first scenario you can scale up your DB host with faster CPUs, more RAM, faster drives or pre-compute your values at regular intervals. The calculations you want done in the DBM are exactly the sort of things it's written to do.
In the second scenario you have to scale up your computing host, and maybe the switch connecting the DBM and computing host, plus maybe the database host because it will have to retrieve and push the data. Imagine the impact on the network as it's handling the data, and the impact on the computing host's CPU as it's doing everything.
I'd do the first one as it seems a lot more scalable to me.