I have many CSV files and I want to create clients and give one CSV file to each. Is there a way to do this?
I need a tutorial on how to convert a dataset into a federated dataset, or useful links or examples on the same.
I am trying to study federated machine learning on time series data. The data is collected from multiple clients. How do I convert this data into federated data?
In TensorFlow Federated, federated data is generally treated as a dataset pivoted on the clients. It sounds like here it would be useful to pivot on clients while retaining the time series ordering of each client's data.
jpgard gives a great answer in "How to create federated dataset from a CSV file?" that can be used as an example for other file formats.
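If it helps, here is a minimal sketch of the first step, assuming one CSV per client with hypothetical column names ("feature1", "feature2", "label" -- adjust to your own schema). It just turns each client's file into a tf.data.Dataset, keeping the row (time) order; most TFF simulation APIs then accept a list of these per-client datasets.

```python
import glob
import os

import pandas as pd
import tensorflow as tf

def load_client_dataset(csv_path):
    """Read one client's CSV and turn it into a tf.data.Dataset,
    preserving the row (time) order instead of shuffling."""
    df = pd.read_csv(csv_path)
    # "feature1", "feature2", "label" are placeholder column names.
    features = df[["feature1", "feature2"]].values.astype("float32")
    labels = df["label"].values.astype("float32")
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)

# One dataset per client, keyed by file name.
client_datasets = {
    os.path.splitext(os.path.basename(path))[0]: load_client_dataset(path)
    for path in glob.glob("data/*.csv")
}

# e.g. federated_train_data = list(client_datasets.values())
# can then be passed to a TFF iterative process in simulation.
```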
I have many different formats of scanned PDFs with many different fields. Think of it as an invoice that has been scanned. I need to extract the information from the scanned PDF and output the fields and the text contained in each field.
I have an OCR tool that does a good job of extracting all the text in raw form. Using NLP, I somehow have to extract the fields and their values from that raw text. As there are many invoice formats, relying on OCR alone is not an option in this case. How could NLP help me solve this problem?
Most NLP tools are designed to extract data from full statements. If you don't have punctuation, it might not work out well. If you are using an NLU service like https://mynlu.com, you will also need to provide examples of common phrases and the locations of the relevant data contained in them (entities). If you can split the text into statements, something like myNLU or another NLU service (LUIS, Watson, etc.) can get you out the door in under 10 minutes.
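If a full NLU service feels like overkill, a simple first pass is plain pattern matching over the OCR output. Here is a rough sketch; the field names and regular expressions are made-up examples (not from any specific library) and would need tailoring to your actual invoice layouts.

```python
import re

# Raw text as it might come out of the OCR tool (hypothetical example).
ocr_text = """
Invoice No: INV-2041
Date: 12/03/2021
Total Due: $1,245.00
"""

# Assumed field patterns; real invoices will need more variants per field.
patterns = {
    "invoice_number": r"Invoice\s*No[:.]?\s*([A-Z0-9-]+)",
    "date": r"Date[:.]?\s*([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})",
    "total": r"Total\s*Due[:.]?\s*\$?([\d,]+\.\d{2})",
}

fields = {}
for name, pattern in patterns.items():
    match = re.search(pattern, ocr_text, flags=re.IGNORECASE)
    fields[name] = match.group(1) if match else None

print(fields)
# {'invoice_number': 'INV-2041', 'date': '12/03/2021', 'total': '1,245.00'}
```

Once the formats get too varied for regexes, the same labelled examples can be used to train a sequence-labelling (NER-style) model instead.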
We have a requirement to do topic modelling on tweets from the Twitter live stream. The input goes to Spark Streaming, which stores the data in HDFS, and a batch job then runs over the collected data to find the underlying topics in the tweets. For this we are using the Latent Dirichlet Allocation (LDA) algorithm. Each tweet is at most 140 characters and is stored as one row in HDFS.
I'm new to the LDA algorithm and have only a basic understanding of it: topic models are derived from word co-occurrences across n documents.
I see two options for feeding the data to LDA:
Option 1: Use each tweet (one row) as a single document for LDA.
Option 2: Group the rows into larger documents and pass those documents to LDA.
I want to understand how the distribution of the vocabulary (words) over topics is affected by each option, and which option should be preferred for better topic modelling.
Also, please let me know if there is a better way to do topic modelling on Twitter data other than these two options.
Note: when I ran both options and displayed the results as word clouds, the distribution of words over the (3) topics was different for each.
Any help appreciated.
Thanks in advance.
Using LDA with short documents is a bit tricky, since LDA assigns a topic to each word and allows multiple topics per document. With short text this means only a few words end up in the same topic, even though a tweet mostly contains just one topic, which usually yields a garbage topic distribution. (This is your option 1.)
I know there is a paper and a Java tool for topic modelling on short text, but I have never used it myself. Here's the link to the GitHub repo.
For option 2, I think it is possible to use LDA and get coherent topics, but you need to find some semantic structure for grouping, e.g. per source, date, keyword, or hashtag.
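As a minimal sketch of option 2, here is what the grouping could look like with gensim (my choice for the sketch, not part of your Spark pipeline), assuming the tweets are already cleaned and that grouping by hashtag is a sensible pseudo-document; the tweets below are toy examples.

```python
from gensim import corpora
from gensim.models import LdaModel

# Hypothetical (hashtag, tweet) pairs after cleaning.
tweets = [
    ("#ml", "training a new model on tweet data"),
    ("#ml", "topic models need longer documents"),
    ("#spark", "streaming tweets into hdfs with spark"),
    ("#spark", "batch job over the collected stream"),
]

# Group tweets sharing a hashtag into one pseudo-document.
groups = {}
for tag, text in tweets:
    groups.setdefault(tag, []).extend(text.split())

documents = list(groups.values())
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train LDA on the grouped documents instead of individual tweets.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```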
I would be really interested in the results you get if you apply either of the proposed options.
I run a download portal, and what I basically want to do is recommend other related categories after a user downloads a file. I'm thinking of using Google Predict for this, but I'm not sure how to structure the training data. I'm thinking something like this:
category of the file downloaded (label), geo, gender, age
However, that seems incomplete because the data doesn't contain any information about the file that was downloaded. I would appreciate some advice; I'm new to ML.
Here is a suggestion that might work...
For your training data, assuming you have the logs of downloads per user, create the following dataset:
download2 (serves as label), download1, features of users, features of download1
Then train a classifier to predict the class given a download and a user; the output classes and their corresponding scores represent the downloads to recommend.
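Here is a rough sketch of how that training set could be assembled with pandas, assuming a download log with columns user_id, timestamp, category plus user attributes such as geo, gender, and age (the column names are assumptions, not a fixed schema).

```python
import pandas as pd

# Download log sorted so each user's downloads are in time order.
log = pd.read_csv("downloads.csv").sort_values(["user_id", "timestamp"])

# For each user, pair every download with the *next* one: the next
# category becomes the label, the current one (plus user features) the input.
log["next_category"] = log.groupby("user_id")["category"].shift(-1)

train = log.dropna(subset=["next_category"])[
    ["next_category", "category", "geo", "gender", "age"]
]
# next_category = label (download2); remaining columns = features
# (download1 + user features), ready to feed to a classifier.
```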
I am new to RapidMiner. I have many XML files and I want to classify these files manually based on keywords. Then I would like to train classifiers such as Naive Bayes and SVM on these data and evaluate their performance using cross-validation.
Could you please let me know the different steps for this?
Do I need to use text processing operations like tokenising, TF-IDF, etc.?
The steps would go something like this:
Loop over files - i.e. iterate over all files in a folder and read each one in turn.
For each file, read it in as a document and tokenize it using operators like Extract Information or Cut Document containing suitable XPath queries, to output a row corresponding to the information extracted from the document.
Create a document vector from all the rows. This is where TF-IDF or other weighting approaches are used; the choice depends on the problem at hand, with TF-IDF being the usual choice when it is important to give more weight to tokens that appear often in a relatively small number of documents.
Build the model and use cross validation to get an estimate of the performance on unseen data.
I have included a link to a process that you could use as the basis for this. It reads the RapidMiner repository, which contains XML files, so it is a good example of processing XML documents with text processing techniques. Obviously, you would have to make substantial modifications for your case.
Hope it helps.
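If you also want to prototype the same pipeline outside RapidMiner, here is a rough scikit-learn equivalent, assuming the XML files have already been reduced to plain-text strings with manual keyword-based labels (the texts and labels below are toy placeholders).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder documents and manually assigned labels.
texts = [
    "invoice order payment total",
    "payment total amount due",
    "meeting schedule agenda notes",
    "agenda notes minutes meeting",
]
labels = ["finance", "finance", "meetings", "meetings"]

# Tokenise + TF-IDF weight, then train Naive Bayes; swap MultinomialNB
# for sklearn.svm.LinearSVC to compare with an SVM.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# cv=2 only because the toy data is tiny; use cv=10 on a real corpus
# to estimate performance on unseen data.
scores = cross_val_score(model, texts, labels, cv=2)
print(scores.mean())
```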
Probably it is too late to reply, but it could help other people. There is an extension called the Text Mining extension (I am using version 6.1.0). Go to RapidMiner > Help > Update and install this extension. It can read all the files from one directory and provides various text mining algorithms that you can use.
Also, I found this tutorial video, which could be of some help to you as well:
https://www.youtube.com/watch?v=oXrUz5CWM4E