Dataset visualization - machine-learning

I am using Weka for traffic classification. I have an .arff dataset with multiple rows and columns, where each row is an instance and each column is a feature. Is there any software that can visualize my dataset across more than two features?
I have noticed that Weka can only visualize two features at a time; however, I need to visualize up to 8 features.
Thanks in advance.

You can check out the so-called parallel coordinates technique, which can visualize any number of features. There are many existing implementations, some of which are available from Prof. Inselberg's page.
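If you work in Python, a minimal sketch of a parallel-coordinates plot with pandas and matplotlib could look like the following. The file name "traffic.arff" and the class attribute name "class" are placeholders for your actual file and label column, and it assumes the plotted attributes are numeric.

```python
# A sketch only: "traffic.arff" and the class attribute "class" are placeholders.
from scipy.io import arff
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

data, meta = arff.loadarff("traffic.arff")       # load the Weka dataset
df = pd.DataFrame(data)
df["class"] = df["class"].str.decode("utf-8")    # nominal attributes load as bytes

# Plot up to 8 numeric feature columns, one vertical axis per feature,
# with lines coloured by class label.
features = [c for c in df.columns if c != "class"][:8]
parallel_coordinates(df[features + ["class"]], class_column="class")
plt.show()
```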

Related

Feature Selection for Mass spectrometry data

I have cancer patient data from mass spectrometry that consists of more than half a million features, and my task is to apply a feature selection algorithm to extract the most relevant features. My question is: which feature selection model would be more appropriate in this case? Any suggestions would be much appreciated.
Use a support vector machine, a type of machine learning model. Perhaps Python and TensorFlow would be useful here!
All the best,
OL
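As a rough illustration of the SVM suggestion above (not OL's own code), an L1-penalized linear SVM in scikit-learn can double as a feature selector, since most of its weights shrink to exactly zero. The data below is a random stand-in for the real mass-spectrometry matrix.

```python
# Sketch of L1-based SVM feature selection; X and y are toy stand-ins for
# the ~500k-feature mass-spectrometry matrix and the patient labels.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy signal in the first two features

# The L1 penalty drives most SVM weights to exactly zero; SelectFromModel
# keeps only the features whose weights survive.
svm = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000)
selector = SelectFromModel(svm).fit(X, y)

print("features kept:", selector.get_support().sum())
X_reduced = selector.transform(X)         # matrix restricted to the kept features
```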

How do I combine text and numerical features in training set for machine learning?

I am trying to predict the number of likes on a post in a social network based on both numerical features and text features. I now have a dataframe with the required features, but I don't know what to do with the posts' text data. Should I vectorize it or do something else to get a suitable training matrix? I am going to use LinearSVC from sklearn for the analysis.
There are a lot of different ways you can transform your text features into numerical ones.
One of the most common is the bag-of-words approach, where you transform your text into an array of the occurrences of each word.
If you are using scikit-learn, I recommend reading their Text Feature Extraction user guide.
Also look at the NLTK toolkit for more complex ways to process your text data.
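A minimal sketch of how this can fit together in scikit-learn, assuming made-up column names ("text", "followers", "hour") and toy data: a ColumnTransformer vectorizes the text column and scales the numeric ones, and the combined matrix feeds LinearSVC.

```python
# Column names and toy data are made up; adapt them to your dataframe.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

df = pd.DataFrame({
    "text": ["great new product", "boring post", "amazing photo from our trip"],
    "followers": [1200, 300, 4500],
    "hour": [9, 23, 18],
    "liked_a_lot": [1, 0, 1],   # example binary target for LinearSVC
})

preprocess = ColumnTransformer([
    ("bow", TfidfVectorizer(), "text"),                # text -> sparse TF-IDF matrix
    ("num", StandardScaler(), ["followers", "hour"]),  # scale the numeric columns
])

model = Pipeline([("features", preprocess), ("clf", LinearSVC())])
model.fit(df[["text", "followers", "hour"]], df["liked_a_lot"])
print(model.predict(df[["text", "followers", "hour"]]))
```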

Training and Testing Data set for classification text file

Suppose we have 10,000 text files and we would like to classify them as political, health, weather, sports, science, education, and so on.
I need a training data set for classifying text documents, and I am using the Naive Bayes classification algorithm. Can anyone help me find data sets?
OR
Is there any other way to get the classification done? I am new to machine learning, so please explain your answer completely.
Example:
**Sentence** -> **Output**
1) Obama won election. -> political
2) India won by 10 wickets -> sports
3) Tobacco is more dangerous -> health
4) Newton's laws of motion can be applied to cars -> science
Is there any way to classify these sentences into their respective categories?
Have you tried googling it? There are tons of datasets for text categorization. The classical one is Reuters-21578 (https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection); another famous one, mentioned in almost every ML book, is 20 Newsgroups: http://web.ist.utl.pt/acardoso/datasets/
But there are lots of others, one Google query away. Just load them, adjust them slightly if needed, and train your classifier on those datasets.
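As a hedged sketch of that workflow, here is a Naive Bayes text classifier trained on the 20 Newsgroups data via the loader bundled with scikit-learn (rather than a manual download); the category subset is just for illustration.

```python
# Illustration only; 20 Newsgroups has 20 classes, four are used here.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

categories = ["talk.politics.misc", "sci.med", "rec.sport.baseball", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
clf = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
clf.fit(train.data, train.target)
print("test accuracy:", clf.score(test.data, test.target))

# Try the example sentences from the question.
sentences = ["Obama won election.", "India won by 10 wickets"]
print([train.target_names[i] for i in clf.predict(sentences)])
```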

Good training data for text classification by LDA?

I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, Science
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science, etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document to see how well the classifier is trained.
My questions are:
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should I be looking for "pure" topical content to fill each of these files? Sources which are not too large (text data > 1GB).
Classification is only on "generic" topics such as the above.
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest using a bag-of-words (BoW) representation for each class you are targeting, or vectors where each column is the frequency of important keywords related to the class you want to target.
Regarding dictionaries, you have DBpedia, as yves mentioned, or WordNet.
a.) The simplest solution is surely the k-nearest neighbors algorithm (kNN). It will classify new texts with categorical content using an overlap metric.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
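A quick sketch of that kNN idea in scikit-learn (my own illustration, using TF-IDF vectors with cosine distance rather than a raw overlap metric; the tiny corpus is made up):

```python
# Replace the four training documents and labels with your own labelled corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

train_docs = ["guitar and piano concert", "new smartphone chip released",
              "oil painting exhibition", "physics of black holes"]
train_labels = ["Music", "Technology", "Arts", "Science"]

# TF-IDF vectors with cosine distance as the similarity measure.
knn = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier(n_neighbors=1, metric="cosine")),
])
knn.fit(train_docs, train_labels)
print(knn.predict(["quantum physics lecture"]))
```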
Dataset issue:
If you are dealing with classifying live user feeds, then I guess no single dataset will satisfy your requirements.
For example, if a new movie X is released, your classifier might miss it, because the training dataset has already become outdated.
To stay up to date with the latest data, I suggest using Twitter training datasets. Develop a dynamic pipeline that updates the classifier with the latest tweets. You could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset for each category.
Classifier:
Most classifiers use a bag-of-words model; you can try out various classifiers and see which gives the best results. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html
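For example, a hedged sketch of "try several classifiers over bag-of-words" with scikit-learn might look like this; the documents and labels are placeholders for your Music/Technology/Arts/Science corpus.

```python
# Placeholder corpus; same bag-of-words front end, three different classifiers.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["guitar riff and drum solo", "new GPU benchmark results",
        "impressionist paintings on display", "gene editing breakthrough"]
labels = ["Music", "Technology", "Arts", "Science"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(docs, labels)
    # With a real corpus you would score on a held-out test split instead.
    print(type(clf).__name__, model.predict(["violin concerto tonight"]))
```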

Linear Regression Real Life Example

I am learning machine learning (linear regression) from Prof. Andrew's lectures. While listening to when to use the normal equation vs. gradient descent, he says that when the number of features is very high (like 10E6), we should use gradient descent. Everything is clear to me, but I wonder: can someone give me real-life examples where we use such a huge number of features?
For example, in text classification (e.g., email spam filtering), we can use unigrams (bag of words), bigrams, and trigrams as features. Depending on the size of the dataset, the number of features can be very large.
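As a rough illustration (not from the lecture), counting the unigram-to-trigram vocabulary that scikit-learn's CountVectorizer extracts from a slice of the 20 Newsgroups training set shows how quickly the feature count grows:

```python
# Illustration only: count the n-gram vocabulary of part of 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset="train").data[:5000]   # a slice to keep this quick
vectorizer = CountVectorizer(ngram_range=(1, 3))        # unigrams + bigrams + trigrams
X = vectorizer.fit_transform(docs)

# Even on this modest slice the n-gram feature count runs into the hundreds of
# thousands or more -- the regime where gradient descent is preferred over the
# normal equation.
print("documents:", X.shape[0], "n-gram features:", X.shape[1])
```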
A list of data sets with a large number of attributes:
1. Daily and Sports Activities Data Set link
2. Farm Ads Data Set link
3. Arcene Data Set link
4. Bag of Words Data Set link
The above are real-life examples of data sets with a large number of attributes.
