For a long time I've been trying to find a model capable of both paraphrasing and summarization, but so far I haven't found one that can take a large amount of text and process it.
Does anybody have any suggestions?
I used the PEGASUS, BART, and PARROT models for my requirements. They worked reasonably well for a limited amount of text, but as soon as I tried to pass in a large amount of text, they threw errors.
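The only workaround I can think of is to chunk the input so each piece fits within the model's input limit and summarize the pieces separately. A minimal sketch with the Hugging Face transformers pipeline (the BART checkpoint and the 500-word chunk size are just assumptions, not tuned values):

```python
from transformers import pipeline

# Assumption: facebook/bart-large-cnn; any seq2seq summarization
# checkpoint with a ~1024-token input limit behaves similarly.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long_text(text, chunk_words=500):
    words = text.split()
    # Naive word-based chunking so each piece stays under the token limit.
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partial = [summarizer(c, max_length=120, min_length=30)[0]["summary_text"]
               for c in chunks]
    # Join the partial summaries (optionally summarize the result again).
    return " ".join(partial)
```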
Related
Is it necessary to repeat similar template data, where the meaning and context are the same but the smaller details vary? If I remove these redundancies the dataset is very small (in the hundreds), but if such data is included it easily crosses into the thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage: do you want to detect any kind of spam, or only spam similar to what your sample data shows? If the former, your dataset is simply not suited to the problem, since there are not enough varied samples. When you think about it, every one of those sentences is essentially the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information content is almost identical. And since every input sample runs through the network multiple times (once per epoch), it doesn't really help to have nearly the same sample many times.
So you shouldn't have one kind of almost identical sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." messages, you can try it with this dataset, though you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!
I built a character-level LSTM model on text data, but ultimately I'm looking to apply this model to very long text documents (such as a novel) where it's important to understand contextual information, such as where in the novel the passage occurs.
For these large-scale NLP tasks, is the data usually cut into smaller pieces and concatenated with metadata - such as position within the document, detected topic, etc. - to be fed into the model? Or are there more elegant techniques?
Personally, I have not used LSTMs at the level of depth you are trying to attain, but I do have some suggestions.
One solution to your problem, which you mentioned above, is to simply split the document into smaller pieces and analyze them separately. You'll probably have to be creative.
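A rough sketch of that idea, assuming overlapping character windows with a simple positional feature attached (the window and stride sizes are arbitrary):

```python
def chunk_document(text, window=1000, stride=500):
    """Split a long document into overlapping character windows and
    attach simple positional metadata to each chunk."""
    chunks = []
    n = len(text)
    for start in range(0, n, stride):
        piece = text[start:start + window]
        chunks.append({
            "text": piece,
            "relative_position": start / n,  # where in the novel this chunk sits
        })
        if start + window >= n:
            break
    return chunks
```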
Another solution that might interest you is to use a Tree-LSTM model to get that level of depth. Here's the link to the paper. Using the tree model, you could feed in individual characters or words at the lowest level and then feed them upward to higher levels of abstraction. Again, I am not completely familiar with the model, so don't take my word for it, but it could be a possible solution.
Adding a few more ideas to the answer posted by bhaskar, which are used to handle this problem.
You can use an attention mechanism, which is designed to deal with long-term dependencies. For a long sequence, the network will certainly forget information, or its next prediction may not depend on all the sequence information it holds in its cell. An attention mechanism helps find reasonable weights for the characters the prediction depends on. For more info you can check this link.
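As a rough illustration of the idea, not tied to any particular library: attention computes a softmax-weighted sum over the sequence, so the next prediction can focus on the positions it actually depends on.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices of query/key/value vectors.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # similarity between positions
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights
    return weights @ V, weights                     # weighted sum of values
```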
There is a lot of research on this problem. This is a very recent paper on it.
You can also break up the sequence and use a seq2seq model, which encodes the features into a low-dimensional space before the decoder extracts them. This is a short article on the topic.
My personal advice is to break up the sequence and then train on it, because a sliding window over the complete sequence is generally able to find the correlations between the subsequences.
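For example, a minimal sketch of turning a long character sequence into fixed-length training windows (the window length of 100 is an arbitrary choice):

```python
def make_char_windows(text, window=100, step=1):
    """Slide a fixed-length window over the text and pair each window
    with the character that follows it (the prediction target)."""
    inputs, targets = [], []
    for i in range(0, len(text) - window, step):
        inputs.append(text[i:i + window])
        targets.append(text[i + window])
    return inputs, targets
```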
My task is to calculate clashes between an alert time schedule and the user's calendar schedule, in order to generate an alert schedule with fewer clashes.
How should I represent the chromosome for this problem?
How should I represent the time slots? (Binary or number?)
Thank You
(Please consider that I'm a beginner in genetic algorithm studies.)
My questions would be: what have you tried so far, and how good are your results? Also, your problem is stated quite unspecifically. So here is what I can give:
The chromosome should probably be the start times of the alerts in your schedule (if I understood your problem correctly).
Just as important is to think about how you want to evaluate and calculate the fitness of your individuals (here, clashes, e.g. the number of overlaps or the amount of time overlap between appointments; you might well find better heuristics that yield better solutions or faster convergence).
Binary or continuous numbers might both work: I usually go for numbers whenever there is no strong reason not to (since they are easier to interpret, debug, etc.). Binary encoding comes with some nice opportunities with respect to mutation and recombination.
I strongly recommend playing around with and reading about these things. It might look like a lot of extra work to implement, but you should come to see them as hyperparameters that need to be tuned in order to get the best outcome.
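A minimal sketch of both points, assuming alerts and calendar entries are simple (start, end) minute intervals and fitness is the total number of clashing minutes (lower is better); these representation choices are assumptions, not the only valid ones:

```python
import random

# Assumption: the user's calendar is a list of busy (start, end) intervals in minutes.
CALENDAR = [(540, 600), (720, 780)]    # e.g. 9:00-10:00 and 12:00-13:00
ALERT_LENGTH = 15                      # each alert blocks 15 minutes
N_ALERTS = 4

def random_chromosome():
    # Chromosome = list of alert start times (integer minutes within a day).
    return [random.randint(0, 24 * 60 - ALERT_LENGTH) for _ in range(N_ALERTS)]

def overlap(a_start, a_end, b_start, b_end):
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def fitness(chromosome):
    # Lower is better: total minutes of clash between alerts and calendar entries.
    return sum(overlap(s, s + ALERT_LENGTH, b0, b1)
               for s in chromosome for (b0, b1) in CALENDAR)
```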
I'm a newbie in the field of Machine Learning and Supervised learning.
My task is the following: from the name of a movie file on a disk, I'd like to retrieve some metadata about the file. I have no control over how the file is named, but it contains a title and one or more additional pieces of information, like a release year, a resolution, actor names, and so on.
Currently I have developed a heuristic rule-based system, where I split the name into tokens and try to understand what each word could represent, either alone or together with adjacent ones. For detecting people's names, for example, I use a dataset of English first names and score a word as a potential first name if I find it in the dataset. If the adjacent word has been scored as a potential surname, I score the two words together as an actor's name. And so on. It works with decent accuracy, but manually changing heuristic scores to "teach" the system is tedious and unpredictable.
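For illustration, a stripped-down sketch of that heuristic (the name lists and the filename are made up):

```python
FIRST_NAMES = {"tom", "emma", "brad"}   # stand-in for the real name dataset
SURNAMES = {"cruise", "stone", "pitt"}

def score_tokens(filename):
    tokens = filename.lower().replace(".", " ").replace("_", " ").split()
    labels = []
    for i, tok in enumerate(tokens):
        # Adjacent first name + surname -> score the pair as an actor.
        if tok in FIRST_NAMES and i + 1 < len(tokens) and tokens[i + 1] in SURNAMES:
            labels.append((f"{tok} {tokens[i + 1]}", "actor"))
        # Four-digit token -> score it as a release year.
        elif tok.isdigit() and len(tok) == 4:
            labels.append((tok, "year"))
    return labels

print(score_tokens("Some.Movie.2019.Brad.Pitt.mkv"))
# [('2019', 'year'), ('brad pitt', 'actor')]
```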
Such a rule-based system is hard to maintain or develop further, so, out of curiosity, I was exploring the field of machine learning. What I would like to know is:
Is there some kind of public literature about these kinds of problems?
Is ML a good way to approach the problem, given the limited data set available?
How would I proceed to debug or try to understand the results of such a system? I already have problems with the "simplistic" heuristic engine I have developed.
Thanks, any advice would be appreciated.
You need to look into NLP (natural language processing). NLP deals with text processing among other things; for example, entity recognition and tagging.
Here is an example of using Spacy library: https://spacy.io/usage/linguistic-features.
Some time ago I did a similar thing, you can see it here: https://github.com/Erlemar/Erlemar.github.io/blob/master/Notebooks/Fate_Zero_explore.ipynb
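For example, a minimal sketch of entity recognition with spaCy (this assumes the small English model en_core_web_sm is installed, and that the filename has already been split into plain words):

```python
import spacy

# One-time download: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Pretend the movie filename has already been cleaned up into plain text.
doc = nlp("Inception 2010 1080p Leonardo DiCaprio")

for ent in doc.ents:
    print(ent.text, ent.label_)   # entity labels such as PERSON, DATE, CARDINAL
```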
I'm starting to work with CRF++ and CRFsuite (both use a very similar file format). I want to do things related to images (segmentation, activity recognition, etc.). My main problem is how to build the training file. Has anybody worked with CRFs and images? Can anybody explain it to me or share a sample file to learn from?
Thanks in advance.
CRFsuite is faster than CRF++ and can deal with huge amounts of training data. I tried both of them. They work perfectly on a reasonable amount of data, but when my dataset grew to more than 100,000 sentences, CRF++ could not cope with it and suddenly stopped working.
Look at the following link:
CRFsuite - CRF Benchmark test
There is a comparison of several CRF packages on a number of criteria.
I used CRF++ before and it worked very well.
But my field is natural language processing, and I use CRF++ for named entity recognition and POS tagging. CRF++ is easy to install on Linux but has some minor issues when compiling on Windows.
You can just follow its documentation for the training data format: each row represents a data sample and each column represents a feature type.
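For illustration, a rough sketch of writing data in that column format (the tokens and features here are made up; CRF++ expects the label in the last column and a blank line between sentences):

```python
# Each token becomes one row: the word, extra feature columns, and the label last.
sentence = [("Barack", "NNP", "B-PER"),
            ("Obama",  "NNP", "I-PER"),
            ("spoke",  "VBD", "O")]

with open("train.data", "w") as f:
    for word, pos, label in sentence:
        f.write(f"{word}\t{pos}\t{label}\n")
    f.write("\n")   # blank line marks the end of the sentence
```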
Alternatively, you can also consider Mallet, which has a CRF component.
You should probably start with the DGM library (https://github.com/Project-10/DGM), which is the best choice for those who have never worked with CRFs before. It includes a number of ready-to-go demo projects that will classify/segment your images out of the box. It is also well documented.
I have just come across this one for Windows:
http://crfsharp.codeplex.com/
Maybe you also want to try the CRF component in the Mallet package.