I'm looking for huge text classification datasets to apply what I've learned in a machine learning course, both wide data and tall data. What I've found so far are datasets between 200 MB and 500 MB. Is there any repo/URL where I can find datasets of 2 GB or more?
You can find a good list of some publicly available datasets here:
https://github.com/awesomedata/awesome-public-datasets
For example, have a look at the Common Crawl dataset (https://commoncrawl.org/), which has been crawled from 25 billion web pages.
An index with the list of archives can be found here: http://index.commoncrawl.org/
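If you want to pull a slice of Common Crawl programmatically, the index exposes a CDX-style query API. A minimal sketch in Python (the crawl ID here is an assumption; pick a current one from the index page):

```python
import json
import requests

# Hypothetical crawl ID; look up a current one at http://index.commoncrawl.org/
CRAWL_ID = "CC-MAIN-2023-50"

# Query the CDX index for archive records matching a URL pattern.
resp = requests.get(
    f"http://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": "example.com/*", "output": "json"},
    timeout=60,
)
resp.raise_for_status()

# The index API returns one JSON object per line.
records = [json.loads(line) for line in resp.text.splitlines()]
for rec in records[:5]:
    # Each record points at a byte range inside a WARC archive file.
    print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```

From there you can fetch just the WARC byte ranges you need instead of downloading whole multi-terabyte crawls.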
Coordinates geolocation library to be used in large datasets
I know the title might be ambiguous. I'm working with a large dataset, and using Nominatim to determine latitude and longitude from a given city and country takes many hours. I've looked for free libraries that can geocode large datasets quickly, but I can't find a suitable one. I read about QGIS, but that's a plugin, right? Not a code/script I can run without installing anything. The good ones are paid.
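For what it's worth, the usual bottleneck with Nominatim is its usage policy of at most one request per second, so the biggest win is to geocode each unique (city, country) pair only once and cache the result. A minimal sketch with geopy and pandas (the CSV path and column names are assumptions):

```python
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

# Hypothetical input: a CSV with "city" and "country" columns.
df = pd.read_csv("places.csv")

geolocator = Nominatim(user_agent="my-geocoding-script")  # use your own agent string
# Respect Nominatim's usage policy of at most one request per second.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Geocode each unique (city, country) pair once instead of once per row.
cache = {}
for city, country in df[["city", "country"]].drop_duplicates().itertuples(index=False):
    location = geocode(f"{city}, {country}")
    cache[(city, country)] = (
        (location.latitude, location.longitude) if location else (None, None)
    )

# Map the cached coordinates back onto the full dataset.
coords = [cache[(c, k)] for c, k in zip(df["city"], df["country"])]
df["latitude"] = [lat for lat, _ in coords]
df["longitude"] = [lon for _, lon in coords]
```

If even one request per second is still too slow for your dataset, the remaining options are offline geocoding data or a self-hosted Nominatim instance, which removes the rate limit.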
I uploaded more samples into AutoML, but this did not yield better results. How can I improve the model's performance?
There are many factors that affect a model's performance. More training data does not necessarily lead to better results. Please make sure the number of training documents per label meets the minimum required on the data import page. If you're not happy with the quality levels, you can go back to earlier steps to improve the quality:
Consider adding more documents to any labels with low quality.
You may need to add different types of documents. For example, longer or shorter documents, documents by different authors that use different wording or style.
You can clean up labels.
Consider removing labels altogether if you don't have enough training documents.
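As a quick sanity check before re-importing, you can count how many training documents each label has. A small sketch, assuming your training data exports to a CSV with text and label columns (the file name and threshold are placeholders):

```python
import pandas as pd

# Hypothetical file and column names; adjust to your export format.
df = pd.read_csv("training_data.csv")

# Count training documents per label to spot under-represented classes.
counts = df["label"].value_counts()
print(counts)

# Flag labels below a chosen minimum (AutoML states its own minimum on
# the data import page; 100 here is just an illustrative threshold).
MIN_DOCS = 100
print("Labels needing more documents:", counts[counts < MIN_DOCS].index.tolist())
```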
I'm building a text classifier that should be able to give the probability that a document belongs to each of several categories (e.g. 80% fiction, 30% marketing, etc.).
I believe LibSVM does this via the "predict" method, but the problem is that I have approximately 20 categories to test for. Also, I have several hundred documents that can be used for training.
The problem is that the training file grows to 1-2 GB, and this makes LibSVM extremely slow.
How can this issue be solved? Should I go for Liblinear instead, or are there better options?
Regarding this specific question, I had to use Liblinear, as LibSVM kept running forever.
But if anyone wants to know how it eventually turned out: I switched from PHP/C++ to Python, which was tremendously easier, and I ran into no memory issues.
My case was multi-label classification. This article pointed me in the right direction, and the Magpie project helped me accomplish the task.
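For anyone hitting the same problem today: scikit-learn wraps Liblinear, and a one-vs-rest logistic regression gives an independent probability per category, which matches scores like 80% fiction and 30% marketing that need not sum to 1. A minimal sketch with toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus; in practice this would be your several hundred documents.
docs = [
    "a thrilling tale of dragons and heroes",
    "buy now and save with our limited offer",
    "the detective followed the clues at night",
    "grow your brand with our targeted story campaigns",
]
labels = [["fiction"], ["marketing"], ["fiction"], ["marketing", "fiction"]]

# One binary column per category, so each category gets its own classifier.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Sparse TF-IDF features keep memory use far below a dense 1-2 GB file.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# A Liblinear-backed logistic regression per category; unlike LinearSVC,
# LogisticRegression supports predict_proba out of the box.
clf = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
clf.fit(X, y)

# Independent probability per category (they need not sum to 1).
probs = clf.predict_proba(vectorizer.transform(["an epic story about a marketing wizard"]))
print(dict(zip(mlb.classes_, probs[0])))
```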
I'm trying to use a pre-trained model like Inception v3 (trained on the 2012 ImageNet dataset) and extend it with several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and examples like transfer learning on flowers work great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means the model can now identify 5 species of flowers but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that a test image can be classified into any of 1,005 categories? In other words, so the model can identify both pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set along with the flowers example set and train from scratch, but given my current computing power, that would take a very long time and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories, so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and I'm not sure whether that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class dataset) is to perform transfer learning on a pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine-tuning by following one of the myriad online tutorials for transfer learning, including the official one for TensorFlow.
While the resulting model can potentially perform well, keep in mind that the tutorial classifier code is highly unoptimized (perhaps intentionally), and you can increase performance several times over by optimizing it for production or simply improving the code.
However, if you're trying to build a general-purpose classifier that includes the default LSVRC data set (1,000 categories of everyday images) and expands it with your own additional categories, you'll need access to the existing 1,000 LSVRC images and will have to append your own data set to them. You can download the ImageNet dataset online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have that LSVRC dataset, perform transfer learning as above, but include the 1,000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions. Be aware that distortions will dramatically increase retraining time, especially without a GPU, because the bottleneck files cannot be reused for each distortion. (Personally, I think this is pretty lame; there's no reason the distorted bottlenecks couldn't also be cached, but that's a different discussion, and you can add it to your code manually.)
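As a rough illustration of that combined retraining step, here is a minimal sketch using the modern tf.keras API (not the retrain script from the era of the question); the directory name, layout, and epoch count are assumptions:

```python
import tensorflow as tf

NUM_CLASSES = 1005  # 1,000 LSVRC categories + 5 flower categories

# Pre-trained convolutional base; its weights stay frozen and only
# the new 1,005-way classification head is trained.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical layout: combined_dataset/ holds 1,005 subdirectories,
# one per class, mixing the LSVRC images with the new flower images.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "combined_dataset", image_size=(299, 299), batch_size=32)

# InceptionV3 expects inputs scaled to [-1, 1].
preprocess = tf.keras.applications.inception_v3.preprocess_input
train_ds = train_ds.map(lambda images, labels: (preprocess(images), labels))

model.fit(train_ds, epochs=5)  # the epoch count is a placeholder
```

Unfreezing some of the top convolutional layers afterwards (true fine-tuning, at a lower learning rate) usually squeezes out additional accuracy.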
Using these methods and incorporating error analysis, we've trained general purpose classifiers on 4000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome existing model limitations, but transfer learning is a highly legitimate way to get good results and has even made its way to natural language processing via BERT and other designs.
Hopefully, this helps.
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.
I can't find the image files for the WDRef dataset. Where can I get them?
In the publication, the authors wrote:
To address this issue, we introduce a new dataset, Wide and Deep Reference dataset (WDRef), which is both wide (around 3,000 subjects) and deep (2,000+ subjects with over 15 images, 1,000+ subjects with more than 40 images). To facilitate further research and evaluation on supervised methods on the same test bed, we also share two kinds of extracted low-level features of this dataset. The whole dataset can be downloaded from our project website http://home.ustc.edu.cn/~chendong/JointBayesian/.
However, there are only LE and LBP features on their website.
Only the extracted features are available for the WDRef dataset; the images themselves are not public.
Another large face dataset is available here:
http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
That webpage also confirms that WDRef is public as features only.