Weka ARFF generation - machine-learning

I am trying to generate an .arff file from a CSV data file I have. I am totally new to Weka and only started using it a day ago. I am trying out a simple Twitter sentiment analysis with it for starters. I have generated training data in CSV; the contents of the CSV file are as follows:
tweet,affinScore,polarity
ATAUTHORcfoblog is giving away a $25 Amex gift card (enter to win over $600 in prizes!) http://t.co/JD8EP14c ,4,4
"American Express has always been my dark horse acquirer of ATAUTHORFoursquare. Bundle in Square-like payments & its a lite-retailer platform, no? ",0,1
African-American Demos Express Ethnic Identity Differently http://t.co/gInv4bKj via ATAUTHORmediapost ,0,3
Google ???????? Visa ? American Express http://t.co/eEZTSiHY ,0,4
Secrets to Success from Small-Business Owners : Lifestyle :: American Express OPEN Forum http://t.co/b85F8JX0 via ATAUTHOROpenForum ,2,1
RT ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1
Winning Surveys $1500 american express Huggies Sweeps http://t.co/WoaTFowp ,4,1
I root for Square mostly because a small business that takes Square is also one that takes American Express. ,0,1
I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Uh oh... RT ATAUTHORBlackArrowBella: I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Just got another credit card. A Blue Sky card with American Express. Its gonna help pay for the honeymoon! ATAUTHORAmericanExpress ,-1,1
Follow ATAUTHORShaveMagazine and ReTweet this msg to be entered to #Win an American Express Gift card. Winners contacted bi-weekly by direct msg! ,2,4
American Express Gold zakelijk aanvragen: http://t.co/xheZwmbt ,0,3
RT ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1
Here the first attribute is the actual tweet, the second is the AFINN score, and the third is the actual classification class (1 - Positive, 2 - Negative, 3 - Neutral, 4 - Spam).
Now I try to generate the .arff file from it using this code:
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;

public class CSV2Arff {
    /**
     * takes 2 arguments:
     * - CSV input file
     * - ARFF output file
     */
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
            System.exit(1);
        }

        // load CSV
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File(args[0]));
        Instances data = loader.getDataSet();

        // save ARFF
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File(args[1]));
        saver.setDestination(new File(args[1]));
        saver.writeBatch();
    }
}
This generates an .arff file that looks somewhat like this:
@relation file
@attribute tweet {_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._
@data
_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,4,4
'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',0,1
African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,0,3
Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,0,4
Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,2,1
RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._,0,1
I am new to Weka, but from what I have read I suspect that this ARFF is not correctly formed. Can anyone comment on it?
Also, if it is wrong, can someone point me to where exactly I am going wrong?

Make sure to set the type of the tweet attribute to string (arbitrary text), not a categorical (nominal) attribute, which seems to be the default. Otherwise this doesn't scale well, because it puts a copy of every tweet into the attribute definition.
Note that for actual analysis of the tweet contents you will likely need to preprocess them much further; you will probably want a sparse vector representation of the text instead of one long string.
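As a rough sketch of both suggestions (assuming a reasonably recent Weka 3.x release; the class name TweetVectorizer and the file name tweets.csv are placeholders, not from the original post), you could force the first CSV column to be loaded as a STRING attribute and then run it through the StringToWordVector filter:

import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
import java.io.File;

public class TweetVectorizer {
    public static void main(String[] args) throws Exception {
        // Load the CSV, forcing the first column (the tweet text) to STRING
        // instead of the default nominal type.
        CSVLoader loader = new CSVLoader();
        loader.setStringAttributes("first");       // same effect as the -S command-line option
        loader.setSource(new File("tweets.csv"));  // placeholder file name
        Instances data = loader.getDataSet();

        // The polarity class is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // Turn the tweet string into a sparse bag-of-words representation.
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, s2wv);

        System.out.println(vectorized.numAttributes() + " attributes after vectorizing");
    }
}

With the string attribute in place, the generated ARFF declares the tweet column as "@attribute tweet string" instead of enumerating every tweet in a nominal list.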

If you are using the UI, as mentioned previously, you can just load the CSV file directly into Weka.
If you just want to generate an ARFF file from the CSV file, you can do the following. It is taken from the CSV2Arff tool that comes as part of Weka.
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;

public class CSV2Arff {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
            System.exit(1);
        }

        // load CSV
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File(args[0]));
        Instances data = loader.getDataSet();

        // save ARFF
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File(args[1]));
        saver.setDestination(new File(args[1]));
        saver.writeBatch();
    }
}
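For completeness, compiling and running the converter from the command line might look something like this (the jar location and file names are illustrative; on Windows use ; instead of : as the classpath separator):

javac -cp weka.jar CSV2Arff.java
java -cp weka.jar:. CSV2Arff tweets.csv tweets.arff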

Related

How to create context for a chatbot using Twilio

I am trying to create a bot where, whenever someone sends a message about a type of food, the bot responds with a location that serves that food. However, I am trying to establish context so that the conversation can flow more naturally.
I have tried nesting the if statements, and that gets the message to display, but it relies on the preceding if statement being true before testing the ones that come after.
from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse
from intents import fallback_intent, getLocation
import random

app = Flask(__name__)

location_fallback = ['What kind of restaurant are you seeking?', 'What kind? Nearby, Cheap or The best?']
welcome = ['hello', 'what\'s up', 'hey', 'hi', 'what\'s happening?']
near = ['near', 'nearby']
cheap = ['cheap', 'good for my pockets']
good = ['good', 'top rated']
intro_resp = ['''Hey! Welcome to Crave! This interactive platform connects you to the top foodies in the world! We provide you with the best food places where ever you are. The instructions are simple:
1. Save our number in your Phone as Crave.
2. Text us and tell us what type of food you are craving!
This is from python''', '''
Welcome to Crave! Are you ready to get some food for today?
1. Save our number in your Phone as Crave.
2. Text us and tell us what type of food you are craving!
''']

@app.route('/sms', methods=['GET', 'POST'])
def sms():
    num = request.form['From']
    msg = request.form['Body'].lower()
    resp = MessagingResponse()
    # welcome intent
    if any(word in msg for word in welcome):
        if any(near_word in msg for near_word in near):
            resp.message('These are the location of places near you!')
            print(str(msg.split()))
            return str(resp)
        elif any(cheap_word in msg for cheap_word in cheap):
            resp.message('These are the location of places that are low cost to you!')
            return str(resp)
        elif any(good_word in msg for good_word in good):
            resp.message('These are the best places in town!')
            return str(resp)
        else:
            location_fallback[random.randint(0, 1)]
            resp.message(intro_resp[random.randint(0, 1)])
            print(str(msg.split()))
            return str(resp)
    else:
        resp.message(fallback_intent())
        print(str(msg))
        return str(resp)

if __name__ == '__main__':
    app.run(debug=True)
I want the user to say 'hi' or something similar to initiate the bot; then I want the bot to prompt the user for what kind of food they would like. The bot should then ask what parameters they want for the restaurant (e.g. close, cheap, or good). The user will answer accordingly, and the bot then needs to use these parameters to search for a nearby restaurant with those attributes.
Twilio developer evangelist here.
You could store this in many places: in cookies as part of the conversation with Twilio, in a database where you use the user's number as a key to look up previous messages, or even just in memory.
If you're looking for a more robust way to achieve this, with better natural language processing, have you checked out Twilio Autopilot? It stores the context of a conversation for you and is built to collect information before giving a response based on the complete set, as you are doing here.

Which settings should be used for TokensregexNER

When I try regexner, it works as expected with the following settings and data:
props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");
Bachelor of Laws DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE
What I would like to do is the same thing using TokensRegex. For example:
Bachelor of Laws DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}]) DEGREE
I read that to do this, I should use TokensregexNERAnnotator.
I tried to use it as follows, but it did not work.
Pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));
I also tried setting the annotator another way:
props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");
I tried different TokensRegex formats, but either the annotator could not find the expression or I got a SyntaxException.
What is the proper way to use TokensRegex (querying on tokens with tags) in an NER data file?
BTW, I just noticed a comment in the TokensRegexNERAnnotator.java file; I am not sure if it is related, i.e. whether POS tag patterns simply do not work with TokensRegexNERAnnotator:
if (entry.tokensRegex != null) {
    // TODO: posTagPatterns...
    pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}
First you need to make a TokensRegex rule file (sample_degree.rules). Here is an example:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }
To explain the rule a bit: the pattern field specifies what pattern to match. The action field says to annotate every token in the overall match ($0 represents the overall match), to set the ner field (note that we defined ner = ... at the top of the rule file), and the third parameter says to set that field to the String "DEGREE".
Then make this .props file (degree_example.props) for the command:
customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator
tokensregex.rules = sample_degree.rules
annotators = tokenize,ssplit,pos,lemma,ner,tokensregex
Then run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text
You should see that the three tokens you wanted tagged as "DEGREE" will be tagged.
I think I will push a change to the code to make tokensregex link to the TokensRegexAnnotator so you won't have to specify it as a custom annotator.
But for now you need to add that line in the .props file.
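If you prefer to configure the pipeline in Java rather than through a .props file, the equivalent of degree_example.props would look roughly like this (the sample sentence is only an illustration):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class DegreeTagger {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Same settings as degree_example.props, set programmatically.
        props.setProperty("customAnnotatorClass.tokensregex",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        props.setProperty("tokensregex.rules", "sample_degree.rules");
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("She holds a Bachelor of Laws from Oxford.");
        pipeline.annotate(doc);
        // Inspect each token's NamedEntityTagAnnotation to see the DEGREE tags.
    }
}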
This example should help in implementing this. Here are some more resources if you want to learn more:
http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html

I have 30 or so items in a .txt file which I would like to display in a listbox in wp7

The .txt file looks like this:
Euro
US Dollar
Australian Dollar
Pounds Sterling
Swiss Franc
and so on.
I have tried things like XDocument and ObservableCollection but can't seem to get them to work.
I would rather not hard-code so much into the XAML.
Thanks,
Assuming you're shipping the file with the project (file Build action set to "Content"), here's what you need:
First, add a ListBox called CurrenciesListBox to the page, then add this code in the page's Loaded event or constructor:
var xapResolver = new System.Xml.XmlXapResolver();

using (var currenciesStream = (Stream)xapResolver.GetEntity(new Uri("Currencies.txt", UriKind.RelativeOrAbsolute), "", typeof(Stream)))
{
    using (var streamReader = new StreamReader(currenciesStream))
    {
        while (!streamReader.EndOfStream)
        {
            CurrenciesListBox.Items.Add(streamReader.ReadLine());
        }
    }
}
Remember to change the filename above to match your file!
This is just a starter; there are better ways to do the job (using MVVM)!

Where can I find a large tabbed hierarchical data set for parser testing?

First, apologies as I realize this is only tangentially related to parser programming.
I've spent hours looking for a text file containing something like the following, but with hundreds (hopefully thousands) of sub-entries. A complete biological classification file would be perfect. A massive version of the following would be great, as my parser parses simple tabbed files:
TL;DR - I need a massive single-file hierarchical data set, something like the following:
Kingdoms
    Monera
    Protista
    Fungi
    Plants
    Animals
        Porifera
            Sponges
        Coelenterates
            Hydra
            Coral
            Jellyfish
        Platyhelminthes
            Flatworms
            Flukes
        Nematodes
            Roundworms
            Tapeworms
        Chordates
            Urochordataes
            Cephalochordates
            Vertebrates
                Fish
                Amphibians
                Reptiles
                Birds
                Mammals
The best I've been able to find are tree-of-life images (from which I transcribed the sample data set above). A single file with a TON of real data would be awesome. It doesn't have to be a biological classification data set, but I would really like the data to reflect something in the real world. (My parser feeds a menu - it would be great if the remainder of my testing was with a data set that actually meant something!) Even if the file is not tabbed, as long as the data could fairly easily be regexed into a tabbed format, that would be great.
Any ideas? Thanks!
It is possible that the XML layout has changed since the other answer was posted, but the code in that answer is no longer accurate. The resulting dump contains extraneous entries: some of the nodes have aliases (denoted as 'othername') that get reported as distinct nodes themselves.
I used the script below to generate the correct dump.
<?php
$reader = new XMLReader();
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=1');
$set = -1;
while ($reader->read()) {
    switch ($reader->nodeType) {
        case (XMLREADER::ELEMENT):
            if ($reader->name == "OTHERNAMES") {
                $set = 1;
            }
            if ($reader->name == "NODES") {
                $set = -1;
            }
            if ($reader->name == "NODE") {
                $set = -1;
            }
            if ($reader->name == "NAME" AND $set == -1) {
                echo str_repeat("\t", $reader->depth - 2); // repeat tabs for depth
                $node = $reader->expand();
                echo $node->textContent . "\n";
            }
            break;
    }
}
?>
This turned out to be such a pain in the ass. I finally tracked down a data feed from "The Tree of Life Web Project" at tolweb.org. I made the php script below to provide the basic functionality my post was looking for.
Change the node_id to have it print a tabbed representation of any of tolweb.org's data - just take the id from the page you're browsing on their site and change the node_id below.
Be aware though - their data feeds serve up large files, so definitely download the file to your own server (and change the "open" method below to point to the local file) if you're going to hit it more than once or twice.
More info on tolweb.org data feeds can be found here:
http://tolweb.org/tree/home.pages/downloadtree.html
<?php
$reader = new XMLReader();
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=15963'); // 15963 is the primates index
while ($reader->read()) {
    switch ($reader->nodeType) {
        case (XMLREADER::ELEMENT):
            if ($reader->name == "NAME") {
                echo str_repeat("\t", $reader->depth - 2); // repeat tabs for depth
                $node = $reader->expand();
                echo $node->textContent . "\n";
            }
            break;
    }
}
?>

DBF Large Char Field

I have a database file that I believe was created with Clipper, but I can't say for sure (I have .ntx files for the indexes, which I understand is what Clipper uses). I am trying to create a C# application that will read this database using the System.Data.OleDb namespace.
For the most part I can successfully read the contents of the tables, but there is one field that I cannot: a field called CTRLNUMS that is defined as CHAR(750). I have read various articles found through Google searches suggesting that fields larger than 255 chars have to be read through a different process than the normal assignment to a string variable, but so far none of the approaches I have found has been successful.
The following is a sample code snippet I am using to read the table; it includes the two options I used to read the CTRLNUMS field. Both options resulted in 238 characters being returned even though there are 750 characters stored in the field.
Here is my connection string:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\datadir;Extended Properties=DBASE IV;
Can anyone tell me the secret to reading larger fields from a DBF file?
using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();

    using (OleDbCommand cmd = new OleDbCommand())
    {
        cmd.Connection = conn;
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = string.Format("SELECT ITEM,CTRLNUMS FROM STUFF WHERE ITEM = '{0}'", stuffId);

        using (OleDbDataReader dr = cmd.ExecuteReader())
        {
            if (dr.Read())
            {
                stuff.StuffId = dr["ITEM"].ToString();

                // OPTION 1
                string ctrlNums = dr["CTRLNUMS"].ToString();

                // OPTION 2
                char[] buffer = new char[750];
                int index = 0;
                int readSize = 5;
                while (index < 750)
                {
                    long charsRead = dr.GetChars(dr.GetOrdinal("CTRLNUMS"), index, buffer, index, readSize);
                    index += (int)charsRead;
                    if (charsRead < readSize)
                    {
                        break;
                    }
                }
            }
        }
    }
}
You can find a description of the DBF structure here: http://www.dbf2002.com/dbf-file-format.html
What I think Clipper used to do was modify the Field structure so that, in Character fields, the Decimal Places held the high-order byte of the size, so Character field sizes were really 256*Decimals+Size.
I may have a C# class that reads DBFs (natively, not through ADO/DAO); it could be modified to handle this case. Let me know if you're interested.
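To make the size arithmetic concrete, here is a rough sketch (written in Java, since the C# class mentioned above isn't shown; the byte offsets follow the standard dBASE field-descriptor layout from the dbf2002.com link) of how the effective length of such a field could be recovered:

public class DbfFieldLength {
    // Sketch only: compute the effective length of a field from its 32-byte
    // field descriptor, applying the Clipper "high byte in the decimal count" hack.
    // Standard layout: byte 11 = field type, byte 16 = length, byte 17 = decimal count.
    static int effectiveFieldLength(byte[] descriptor) {
        char type = (char) descriptor[11];
        int size = descriptor[16] & 0xFF;      // read as an unsigned byte
        int decimals = descriptor[17] & 0xFF;
        if (type == 'C') {
            // Clipper stores CHAR(750) as size = 750 % 256 = 238 and decimals = 750 / 256 = 2.
            return 256 * decimals + size;
        }
        return size;
    }
}

Note that 750 mod 256 is exactly 238, which lines up with the 238 characters you are getting back from the provider.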
Are you still looking for an answer? Is this a one-off job or something that needs doing regularly?
I have a Python module that is primarily intended to extract data from all kinds of DBF files ... it doesn't yet handle the length_high_byte = decimal_places hack, but it's a trivial change. I'd be quite happy to (a) share this with you and/or (b) get a copy of such a DBF file for testing.
Added later: Extended-length feature added, and tested against files I've created myself. Offer to share code with anyone who would like to test it still stands. Still interested in getting some "real" files myself for testing.
3 suggestions that might be worth a shot...
1 - Use Access to create a linked table to the DBF file, then use .NET to hit the table in the Access database instead of going directly to the DBF.
2 - Try the FoxPro OLE DB provider.
3 - Parse the DBF file by hand. Example is here.
My guess is that #1 will be the easiest, and #3 will give you the opportunity to fine-tune your cussing skills. :)
