Trying to understand the difference between unlabeled and unstructured data. Are they synonyms?
From my understanding, unlabeled data is data that does not highlight the target variable. Unstructured data is just raw data.
Unstructured data means that it is not organized in a table-like form. Some examples of unstructured data are images, text, and audio.
Unlabeled data means that you don't have labels, so you would typically use unsupervised methods to deal with the problem.
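To make the distinction concrete, here is a minimal sketch showing that "structured" and "labeled" are independent properties. The records and the helper names (is_structured, is_labeled) are hypothetical and only serve the illustration; a dict of named fields stands in for a table row.

```python
# Hypothetical toy samples: "x" is the input, "y" is the target (None = no label).

# Structured (table-like fields) but unlabeled.
tabular_unlabeled = [
    {"x": {"height_cm": 172, "weight_kg": 68}, "y": None},
    {"x": {"height_cm": 181, "weight_kg": 80}, "y": None},
]

# Unstructured (free text) but labeled.
text_labeled = [
    {"x": "The battery died after a week.", "y": "negative"},
    {"x": "Works exactly as described!", "y": "positive"},
]


def is_structured(sample):
    """Treat a sample as structured when its input has named fields."""
    return isinstance(sample["x"], dict)


def is_labeled(sample):
    """A sample is labeled when a target value is attached to it."""
    return sample["y"] is not None


for name, data in [("tabular", tabular_unlabeled), ("text", text_labeled)]:
    print(name, "structured:", is_structured(data[0]), "labeled:", is_labeled(data[0]))
```

So a dataset can be structured and unlabeled, or unstructured and labeled; the two terms answer different questions.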
I use my own custom block of code to format my multivariate data to fit an LSTM model.
Now I have too much data to fit in my GPU memory, so I want to take a chunk of my data, apply all the formatting as usual, and feed it to my model, while efficiently preparing the next chunk during the time my GPU works on the first one, and so on.
I have seen examples using tf.data.Dataset, like this one: Using a Windowed Dataset for Time Series Prediction
Is this a good approach for multivariate time series?
Can I use my custom code to format the data and convert it to a tf.data-compatible form at the end?
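One possible shape for this, as a sketch only: keep the custom formatting in a plain Python generator that produces windows chunk by chunk. The function names, the window length, and the chunk size below are all assumptions made for illustration. A generator like this could then be handed to tf.data.Dataset.from_generator, and Dataset.prefetch can overlap preparation with GPU work, but that wiring is not shown here.

```python
from typing import Iterator, List, Sequence, Tuple

Row = Sequence[float]  # one time step holding several features


def make_windows(rows: Sequence[Row], window: int, target_col: int):
    """Pair each run of `window` rows with the next value of `target_col`."""
    pairs = []
    for start in range(len(rows) - window):
        x = [list(r) for r in rows[start:start + window]]
        y = rows[start + window][target_col]
        pairs.append((x, y))
    return pairs


def chunked_windows(
    rows: Sequence[Row], window: int, target_col: int, chunk_size: int
) -> Iterator[Tuple[List[List[float]], float]]:
    """Yield windows one chunk at a time.

    Each chunk reads `window` extra rows past its end, so windows that
    straddle a chunk boundary are not lost.
    """
    start = 0
    while start + window < len(rows):
        chunk = rows[start:start + chunk_size + window]
        yield from make_windows(chunk, window, target_col)
        start += chunk_size


# Hypothetical data: 10 time steps, 2 features each.
rows = [[float(i), float(i * 10)] for i in range(10)]
windows = list(chunked_windows(rows, window=3, target_col=1, chunk_size=4))
print(len(windows), windows[0])
```

The chunked generator yields the same windows as formatting everything at once, while only one chunk is formatted at a time.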
I have learned that the test set of image data can be augmented by a method called Test Time Augmentation,
and after researching it I am wondering whether the test set of structured or non-image data can be augmented too.
If it cannot, why can such a method be performed on image data only?
Thank you in advance
If you are referring to data augmentation in general, then yes, you can apply it to non-image datasets.
Data augmentation means increasing the number of data points.
One example is generating synthetic samples for the minority class.
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that can be applied to your data through the imblearn package for Python. It works by creating synthetic samples from the minority class instead of creating copies, and you can apply it to any numerical data, not only images (actually, I've never seen this method applied to an image dataset).
You can go here and here for more detail.
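To show the idea behind SMOTE, here is a simplified sketch of the interpolation step in plain Python. It is not the imblearn implementation; the function name smote_like, the parameter k, and the sample points are assumptions for illustration only.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two numerical features.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [3.0, 3.5]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on a line segment between two real minority samples, which is why the method needs numerical features rather than images specifically.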
I am a beginner in machine learning. I have seen videos that teach machine learning, but my question is: how can we model our data?
Mostly we get unstructured data. What is the best way to convert that unstructured data into a structured format, so that we can find the most useful information in the data?
Any help in the form of books or links would be much appreciated.
As a machine learning engineer, you will be responsible for preprocessing your data in such a way that it is acceptable to the model.
There is no single best way to do this; moreover, it depends on what type of data you have, such as 1. CSV datasets, 2. text datasets, or files (image and audio).
In the real world, not all data will be in a structured form. When we get the data, the very first things to find out are:
1. What the data is all about.
2. What its features are and what its output is.
Ex: for a dataset to predict the height of a person, you may have information such as country, weight, gender, hair color, etc. These are what we usually call features in machine learning.
3. Then we need to see what kind of features the data has, such as text or numerical data. We need to pre-process the data before we do any analysis of it. For example, if one feature in your data is a review, then you need to remove all the special characters and build a corpus from your data.
4. You need to understand how the model accepts the data, which parameters the model has, and how the data can be improved. (We can do some feature engineering to improve the model, etc.)
There is no hard and fast rule that says you must do it in exactly this way.
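As an illustration of step 3, here is a minimal sketch of cleaning a review feature and building a small corpus from it. The reviews, the function name clean_review, and the choice to keep only lowercase letters and digits are assumptions; real data may need different rules.

```python
import re
from collections import Counter


def clean_review(text):
    """Lowercase the text, replace special characters with spaces, and tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


# Hypothetical raw reviews.
reviews = [
    "Great phone!!! Battery lasts 2 days :)",
    "Terrible... the screen broke #refund",
]

# The corpus is the list of tokenized documents.
corpus = [clean_review(r) for r in reviews]

# Word counts give a first structured view of the text.
word_counts = Counter(token for doc in corpus for token in doc)
print(corpus)
print(word_counts.most_common(3))
```

From here the tokens can be turned into numerical features, which is the form most models accept.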
First, you need to learn about preprocessing and feature extraction. If you build a model in Python, then libraries like pandas or scikit-learn are very useful. As a first step, try to create sentences like "when x occurs then my output y becomes ...".
Before modeling, the data has to be cleaned. There are several methods to clean the data. Go through the link below on how to convert unstructured data to structured data.
https://www.geeksforgeeks.org/how-to-convert-unstructured-data-to-structured-data-using-python/
I have an encrypted text dataset and I want to classify it using a neural network. I know that there is a pattern in the encrypted data.
example of my input data :
diss%^ghghE(t dffd$#KL*vb xod##:n>did ....
My question is: should I treat the encrypted data as if it were normal text, create a vocabulary, and transform my data into sequences of indices?
Should I first clean my data of all the special characters?
What I tried: I cleaned all the special characters from the data, then created a vocabulary and transformed my data into sequences. However, I am getting very low accuracy, although my model works well when my data is in natural language.
Any help is appreciated.
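For reference, the vocabulary-and-indices step described above might look like the sketch below when done per character, so that special characters are kept instead of being cleaned away. This only illustrates the encoding step; the function names, the padding index 0, and max_len are assumptions, and it says nothing about whether the encrypted data is learnable.

```python
def build_vocab(texts):
    """Map every distinct character to an index; 0 is reserved for padding."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Convert text to a fixed-length index sequence.

    Unknown characters also map to 0, the same value used for padding.
    """
    ids = [vocab.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Two fragments taken from the sample input above.
samples = ["diss%^ghghE(t", "dffd$#KL*vb"]
vocab = build_vocab(samples)
encoded = [encode(s, vocab, max_len=16) for s in samples]
print(encoded)
```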
By definition, a good encryption algorithm will not allow you to learn anything[*] from the encrypted data.
So, unless you suspect that the encryption algorithm is weak, I suggest you abandon this idea.
[*] apart from the approximate size of the original text
When creating a Bag of Words, you need to create a Vocabulary to give to the BOWImgDescriptorExtractor, which you then use on the images you wish to input. This creates the Testing Data.
So where does the Training Data come from, and where do you use it?
What's the difference between Vocabulary and Training Data?
Isn't the Vocabulary the same thing as the Training Data?
Training data is a set of images you collected for your application as the input of BOWTrainer, and vocabulary is the output of the BOWTrainer. Once you have the vocabulary, you can extract features of images using BOWImgDescriptorExtractor with the words defined in the vocabulary.
An image can be described by tons of features (words), however only some of them are important. The first job to do is to find those important words, that is, to train a vocabulary. After the vocabulary is obtained, images can be described more precisely.
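The flow described above can be sketched with a toy stand-in in plain Python. This is not the OpenCV API: train_vocabulary plays the role that BOWTrainer plays (clustering training descriptors into words), describe plays the role of BOWImgDescriptorExtractor, and the 2-D descriptors are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda i: dist2(point, centres[i]))


def train_vocabulary(descriptors, k, iters=10, seed=0):
    """Plain k-means: the k cluster centres are the visual words."""
    rng = random.Random(seed)
    centres = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(d, centres)].append(d)
        for i, group in enumerate(groups):
            if group:  # keep the old centre if a cluster ends up empty
                centres[i] = [sum(col) / len(group) for col in zip(*group)]
    return centres


def describe(image_descriptors, vocabulary):
    """Normalized histogram of the word each descriptor is closest to."""
    hist = [0.0] * len(vocabulary)
    for d in image_descriptors:
        hist[nearest(d, vocabulary)] += 1
    total = sum(hist)
    return [h / total for h in hist]


# Training data: raw descriptors collected from the training images.
training_descriptors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
vocabulary = train_vocabulary(training_descriptors, k=2)

# A new image is described in terms of the trained words.
histogram = describe([[0.2, 0.1], [10.5, 10.2], [10, 10]], vocabulary)
print(vocabulary, histogram)
```

The training data goes in once to produce the vocabulary; afterwards only the vocabulary is needed to describe new images.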
So where does the Training Data come from, and where do you use it?
You should provide the Training data yourself and use it to train the vocabulary with BOWTrainer. The Training data is a set of images (or rather, their descriptors), and which images you collect depends on your application domain.
What's the difference between Vocabulary and Training Data?
Vocabulary is cooked, while training data is raw, unorganized.
Isn't the Vocabulary the same thing as the Training Data?
No.
There is an add function that is used to specify the training data; see the docs on the OpenCV BOW module.