Large datasets and route analysis - mapping

I have an extremely large dataset that is growing by around 7 GB every fortnight. The dataset contains a number of vehicles and their movements throughout the day. I need to convert the points into routes, and I am currently struggling to find any software package that can handle 10+ million data points.
If anyone has any solutions or suggestions, I would greatly appreciate it.
Thanks
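One possible way to turn points into routes, sketched here under assumptions that are not in the question: the data is CSV with hypothetical columns `vehicle_id`, `timestamp` (ISO 8601), `lat` and `lon`, and a new route starts whenever a vehicle is silent for more than 30 minutes. This version holds all points in memory, so for 10+ million rows it would have to be run per file or per vehicle, or moved to a database.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta


def build_routes(rows, max_gap=timedelta(minutes=30)):
    """Group point rows into per-vehicle routes.

    A new route starts whenever two consecutive points of the same
    vehicle are more than max_gap apart in time (an assumed rule).
    """
    by_vehicle = defaultdict(list)
    for row in rows:
        ts = datetime.fromisoformat(row["timestamp"])
        by_vehicle[row["vehicle_id"]].append((ts, float(row["lat"]), float(row["lon"])))

    routes = defaultdict(list)
    for vehicle, points in by_vehicle.items():
        points.sort()  # tuples sort by timestamp first
        current = [points[0]]
        for prev, point in zip(points, points[1:]):
            if point[0] - prev[0] > max_gap:
                routes[vehicle].append(current)  # close the finished route
                current = []
            current.append(point)
        routes[vehicle].append(current)
    return dict(routes)


# Made-up sample data, for illustration only.
sample = io.StringIO(
    "vehicle_id,timestamp,lat,lon\n"
    "A,2024-01-01T08:00:00,51.50,-0.12\n"
    "B,2024-01-01T08:01:00,48.85,2.35\n"
    "A,2024-01-01T08:05:00,51.51,-0.11\n"
    "A,2024-01-01T12:00:00,51.52,-0.10\n"
)
routes = build_routes(csv.DictReader(sample))
```

Each route is a time-ordered list of `(timestamp, lat, lon)` tuples, which can then be written out as a line geometry in whatever format the mapping tool expects.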

Related

Fastest free geocoding coordinates library for large datasets

I know the title might be ambitious.
I'm using a large dataset, and using Nominatim to determine latitude and longitude from the given city and country took many hours...
I looked for free libraries that can handle large datasets quickly but couldn't find a specific one.
I read about QGIS, but it's a plugin, right? Not something I can use from code or a script without installing anything.
The good ones are paid.
Coordinates geolocation library to be used in large datasets
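Whichever geocoder is used, one thing that may cut the run time is looking up each distinct city/country pair only once. The sketch below assumes the dataset contains many repeated pairs; `geocode_unique` and `fake_geocode` are hypothetical names, and `fake_geocode` is only a stand-in for the real (slow) lookup.

```python
def geocode_unique(records, geocode):
    """Geocode each distinct (city, country) pair once and reuse the result."""
    cache = {}
    results = []
    for city, country in records:
        # Normalise so "Paris" and "paris " share one cache entry.
        key = (city.strip().lower(), country.strip().lower())
        if key not in cache:
            cache[key] = geocode(city, country)  # the only slow call
        results.append(cache[key])
    return results


# Stand-in geocoder for illustration; a real one would query a service.
calls = []

def fake_geocode(city, country):
    calls.append((city, country))
    table = {"paris": (48.8566, 2.3522), "lima": (-12.0464, -77.0428)}
    return table[city.strip().lower()]


records = [("Paris", "France"), ("Lima", "Peru"), ("paris", "france"), ("Paris", "France")]
coords = geocode_unique(records, fake_geocode)
```

With four input rows, the stand-in geocoder is called only twice, once per distinct pair.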

Where I can find BIG dataset

I'm looking for huge text classification datasets to apply what I learn in a machine learning course. I'm looking for both wide data and tall data. What I have found so far are datasets between 200 MB and 500 MB. Is there any repo or URL where I can find datasets of 2 GB or more?
You can find a good list of publicly available datasets here:
https://github.com/awesomedata/awesome-public-datasets
For example, have a look at the Common Crawl dataset (https://commoncrawl.org/), which has been crawled from 25 billion web pages.
An index with the list of archives can be found here: http://index.commoncrawl.org/

Is LIBSVM suitable for many categories and samples?

I'm building a text classifier, which should be able to give probabilities that a document belongs to certain categories (e.g. 80% fiction, 30% marketing, etc.).
I believe LIBSVM does this via the "predict" method, but the problem is that I have approximately 20 categories to test for. Also, I have several hundred documents that can be used for the training.
The problem is that the training file grows to 1-2 GB, and this makes LIBSVM super slow.
How can this issue be solved? Should I go for LIBLINEAR instead, or are there better options?
Regarding this specific question, I had to use LIBLINEAR, as LIBSVM kept running forever.
But if anyone wants to know how it eventually turned out:
I switched from PHP / C++ to Python, which was tremendously easier, and I did not encounter any memory issues.
My case was "multi-labelling". This article put me in the right direction, and the magpie project helped me accomplish the task.

What do you do when you have an ML model that works, but does not have good results?

Sorry if this has been asked before. I have tried looking online, but maybe I don't know the proper terminology, because I mostly find results that try to address overfitting by splitting the data set.
When my model gets stuck at around 30% accuracy on the validation data and refuses to improve, my strategy tends to be changing the number of nodes per layer, the batch size, or the number of epochs. Sometimes this is helpful, but other times it doesn't seem to do much at all.
What do people usually do in this situation?
I'd like to help with your question. You are probably working on a classification task. Could you please specify the following properties of your dataset: the number of samples, the number of features, and the types of features (numerical, categorical, etc.)?

pattern recognition in a time series

I understand that the question I am asking seems to be somewhat related to other questions which have already been asked here and here.
But I feel that this is an entirely different question. (I have also submitted this question on dsp.stackexchange.)
I have a huge time series (over 100K data points) of the position (x, y coordinates) of an element in space. This element is vibrating randomly, and both the amplitude and the frequency of vibration are random. I want to look at the events which are similar and see if there is any pattern in those events: are they periodic, or related somehow?
I am working on a biological problem and have very little knowledge about signal processing. I can provide more details. Any help would be really appreciated.
One area of research on time series patterns is called Motif Detection or Motif Discovery. Some approaches use association strategies, others are probabilistic.
Some links are here: http://dl.acm.org/results.cfm?query=motif+discovery&Go.x=5&Go.y=12
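To give a rough idea of what motif discovery looks for, here is a minimal brute-force sketch, not any particular published algorithm: it finds the two most similar non-overlapping windows in a one-dimensional series after z-normalising each window. The function names, the window length `m`, and the toy series are assumptions for illustration. For (x, y) positions it could be applied to one coordinate or to the displacement. Its cost is quadratic in the series length, so for 100K points a dedicated, optimised implementation would be needed.

```python
import math


def znorm(window):
    """Scale a window to zero mean and unit variance (flat windows become zeros)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def find_motif(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping
    windows of length m, where i and j are the window start indices."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(windows)):
        for j in range(i + m, len(windows)):  # j >= i + m avoids overlap
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best


# Toy series with the same bump planted at positions 0 and 10.
series = [0, 0, 1, 3, 1, 0, 0, 5, 2, 7, 0, 0, 1, 3, 1, 0, 0]
distance, first, second = find_motif(series, m=5)
```

Once recurring windows are located, the gaps between their start indices can be inspected to see whether the events are periodic.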
