I am using RapidMiner because I want to perform sentiment analysis.
The thing is, I have 7 queries that I need to analyze together (companies' names that I need to analyze to obtain insights about their customers).
My idea was to extract the data with the Twitter developer API and then load it into RapidMiner to analyze.
When I open this data in RapidMiner, it reports problems with the dataset:
Error: file syntax error
Message: Value quotes not closed at position 346.
Last characters read: ght.
How do I fix this?
Also, once I import my spreadsheet data (.csv file), it shows me the error:
Cause: Unparseable number: "FALSE"
I've already searched here for answers, but none helped me solve this error.
Is it possible to analyze all this data together, or do I have to do it separately?
I'm not sure whether combining it is feasible; I suppose it could interfere with the overall analysis?
I'm quite new to RapidMiner, so I appreciate everyone's help.
Thanks in advance.
I decided to ignore the problem, so I just selected the option to replace errors with missing values and analyzed all the data together.
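If replacing errors with missing values drops too much of the text, another option is to pre-clean the export before RapidMiner ever sees it. A minimal Python sketch, assuming a hypothetical three-column export where only the middle (text) column can contain stray quotes or commas:

```python
import csv
import io

# Hypothetical three-column export: id, tweet text, retweeted flag.
# The text column contains a stray double quote, which is what makes
# RapidMiner report "Value quotes not closed".
raw = 'id,text,retweeted\n1,all right.,FALSE\n2,she said "hi,FALSE\n'

cleaned_rows = []
for line in raw.splitlines()[1:]:
    # Split on the first and the last comma only, so commas and stray
    # quotes inside the tweet text cannot break the row apart.
    tweet_id, rest = line.split(",", 1)
    text, retweeted = rest.rsplit(",", 1)
    cleaned_rows.append([tweet_id, text.replace('"', ""), retweeted])

# Re-export with every value quoted, so the CSV reader sees consistent
# quoting throughout the file.
out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
writer.writerow(["id", "text", "retweeted"])
writer.writerows(cleaned_rows)
print(out.getvalue())
```

Re-exporting the TRUE/FALSE column as a quoted string also sidesteps the `Unparseable number: "FALSE"` error: in RapidMiner's import wizard you can then set that attribute's type to polynominal instead of letting it be guessed as numeric.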
I am dealing with tons of PDF documents (petitions data) filled with text that includes numbers, tabular data, etc. The client's objective is to summarize any such document, to reduce the manpower spent reading it in full. I have tried conventional methods like LSA, the Gensim summarizer, the BERT extractive summarizer, and PySummarizer.
The results are not good at all. Please suggest any industry-level summarizer (extractive/abstractive) that would give me a good start on solving this issue.
First, you will need to know exactly what data the company wants extracted from the documents. After that, you may be able to convert the documents to raw text using OCR or some other PDF tool, and then extract the data you need. If the company isn't being clear about how they want the data summarized, that is something to talk to them about. It might be as simple as setting a title for the document, or classifying it. If it's document classification, I can help you with that; I made a repo for that purpose a little while ago.
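If the summary can be extractive, it is also worth sanity-checking a simple frequency-based baseline on the raw text before tuning heavier models. A minimal sketch in pure Python (the sentence splitting and the sample text are simplistic assumptions, not production logic):

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Score each sentence by the average document-wide frequency of its
    words, then return the top-scoring sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = []
    for i, sent in enumerate(sentences):
        tokens = re.findall(r"[a-z']+", sent.lower())
        if tokens:
            score = sum(freq[t] for t in tokens) / len(tokens)
            scored.append((score, i, sent))
    top = sorted(scored, reverse=True)[:n_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

# Hypothetical petition-like text, just to exercise the function.
doc = ("The petition asks the city council for a new crossing. "
       "Residents signed the petition last week. "
       "The council will review the petition next month.")
print(summarize(doc, n_sentences=2))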
Our company has a lot of issue records stored in a database. We want to create a search engine so that people can check how issues were previously dealt with. We cannot use any third-party API, as there is sensitive data and we want to keep everything in-house. Right now the approach is as follows:
Clean up the data and then use Doc2Vec to represent each issue as a vector.
Find the closest 5 issues using some distance metric.
The problem is that the results are not useful at all. Most of the data consists of one-liners plus some issue description, and there are spelling mistakes, stack traces, and other noise.
Is this the right approach, or should we switch to something else?
Right now we are testing on 200K records.
Thanks for the help.
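Doc2Vec often struggles on very short, noisy one-liners, so a plain TF-IDF bag-of-words baseline is worth comparing against before abandoning the vector approach entirely. A minimal in-house sketch in pure Python (the sample issues are hypothetical):

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (dicts of term -> weight) for each doc."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    df = Counter()                       # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    return [{t: (c / len(tokens)) * math.log(n / df[t])
             for t, c in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, vectors, k=5):
    """Indices of the k vectors closest to the query by cosine similarity."""
    scores = sorted(enumerate(cosine(query_vec, v) for v in vectors),
                    key=lambda pair: -pair[1])
    return [i for i, _ in scores[:k]]

issues = [                              # hypothetical one-liner issues
    "database connection timeout",
    "null pointer exception in parser",
    "connection refused by database server",
]
vecs = tfidf_vectors(issues + ["database timeout error"])
print(top_k(vecs[-1], vecs[:-1], k=2))  # → [0, 2]
```

On top of a baseline like this, spell-normalization and stripping stack traces before vectorizing usually matter more than the choice of embedding model.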
I am working with the Persee 3D camera from Orbbec. I am currently having problems with processing the bag files. Someone has provided some code here: https://github.com/JaouadROS/bagtodepth. I tried looking it over and I can't quite make heads or tails of it. I really only have two main questions:
1: Where is the output being saved to? Will it be saved into one of his directories or will it be output somewhere else?
2: Is the output a sort of stream or will it just convert the data to a certain point?
I have successfully downloaded the code (into my catkin_ws directory) and run the program with the Persee, but that doesn't help if I can't access the output. I am looking to access this matrix in real time and was hoping I could just adapt his code to my project. He does mention something about information being stored in depthclean. Sadly, the person who posted this has not replied to any of the messages I have sent. Thanks!
I am stuck finding a way to resolve the issue below:
I am using a web server to get XML data. If I open the link in a browser, the data appears in Korean; when I print that data in Xcode's console, I get some other, unknown representation of it; and when I run the application on an iPad, I get yet another representation.
Can anyone please suggest how I can overcome this problem? I spent a lot of time searching but haven't found any solution yet.
@Raj, thanks for your suggestion; the problem was with the encoding only. I had used two different encoding types, and when I changed it, the data came through as desired.
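For anyone hitting the same thing, the symptom is easy to reproduce: the same bytes decoded with two different encodings look completely different. A small Python illustration (EUC-KR is only an example Korean encoding here; the server's Content-Type header or the XML declaration names the real one):

```python
# The same Korean bytes look completely different depending on the
# encoding used to decode them.
payload = "안녕하세요".encode("euc-kr")   # bytes as a Korean server might send them

wrong = payload.decode("latin-1")   # mojibake: each byte becomes a Latin-1 char
right = payload.decode("euc-kr")    # decoding with the matching encoding

print(wrong)
print(right)   # 안녕하세요
```

The fix is always the same: find out which encoding the server actually uses, and use that single encoding consistently when decoding the response.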
Hello Oracles of StackOverflow,
First time I managed to ask a question on stack overflow, so feel free to throw your cabbages at me. (or correct the way I should be asking my question)
I have this problem. I'm using HDF5 to store massive quantities of cookie information.
My Data is structured in the following way:
CookieID -> Event -> key-value pair
There are multiple events for each CookieID, but only one key-value pair per event.
I'd like to know the best way to store this in HDF5.
Currently, I'm storing each cookie as a separate table within a group in the HDF5 file, using the CookieID as the name of the table. Unfortunately for me, with 10,000,000 cookies, HDF5 (or specifically PyTables) doesn't approve of this type of storage.
Specifically throwing this error:
``/CookieData`` is exceeding the recommended maximum number of children (16384)
I'm wondering if you could recommend the best way of storing this information.
Should I create a flat table? Should I keep this method? Is there something else I can do?
Help is appreciated. Thanks for reading.
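For context, the usual alternative to one table per cookie is a single flat table with the CookieID repeated on every row and an index on that column. The access pattern can be sketched in plain Python (the sample rows are hypothetical):

```python
from bisect import bisect_left, bisect_right

# Flat-table layout: one row per event, with the CookieID repeated on
# every row instead of one table per cookie.
rows = [
    (1001, "click", "page", "/home"),
    (1001, "view", "ms", "350"),
    (1002, "click", "page", "/cart"),
    (1003, "view", "ms", "120"),
]
rows.sort(key=lambda r: r[0])      # sort once by CookieID
keys = [r[0] for r in rows]        # sorted key column for binary search

def events_for(cookie_id):
    """Range scan over the sorted flat table -- the same access pattern
    an index on the CookieID column gives you in a real store."""
    lo = bisect_left(keys, cookie_id)
    hi = bisect_right(keys, cookie_id)
    return rows[lo:hi]
```

In PyTables terms this would be one large Table whose CookieID column is indexed, with a `table.where(...)` condition on CookieID doing the range lookup; the cost is that the CookieID is duplicated on every event row.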
Several hours of research later, I've discovered that what I was attempting to do was categorically impossible.
The following link gives details as to the impossibility of using HDF5 with variable-length nested children.
I've decided to go with a flat table for the time being and hope that it is more efficient than a database store. The drawback of the flat table is that I have to replicate values in the file that otherwise would not need to exist.
If anyone else has any better ideas it would be appreciated.