R package twitteR to analyze tweet text - twitter

I'm using the twitteR package (specifically, the searchTwitter function) to export to CSV format all the tweets containing a specific hashtag.
I would like to analyze their text and discover how many of them contain a specific list of words that I have saved in a file called importantwords.txt.
How can I create a function that returns a score of how many tweets contain the words written in importantwords.txt?

Pseudocode:
for (every word in importantwords.txt):
    i = 0
    for (every line in tweets.csv):
        if (line contains word):
            i = i + 1
    print(word: i)
Is that along the lines of what you wanted?
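In R, a minimal sketch along those lines might look like this (assuming tweets.csv has a text column, as a data frame built with twListToDF and exported via write.csv would, and that importantwords.txt holds one word per line):
# Read the exported tweets and the word list (one word per line).
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
words  <- readLines("importantwords.txt")

# For each word, count how many tweets contain it, matching
# case-insensitively and treating words literally (not as regexes).
scores <- sapply(words, function(w) {
  sum(grepl(tolower(w), tolower(tweets$text), fixed = TRUE))
})
print(scores)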

I think your best bet is to use the tm package.
http://cran.r-project.org/web/packages/tm/index.html
This fellow uses it to create word clouds from tweet data. Looking through his code will probably help you out too.
http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/
If your important-words list is just there to weed out stopwords like "the" and "a", this will work fine. If it's for something more specific, you'll need to loop over the corpus with your list of words, retrieving the counts.
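A rough sketch of that loop with tm (assuming a data frame tweets with a text column, as in the question) could look like:
library(tm)

# Build a corpus from the tweet text and lower-case it.
corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))

# Restrict the document-term matrix to the important words only.
words <- tolower(readLines("importantwords.txt"))
dtm <- DocumentTermMatrix(corpus, control = list(dictionary = words))

# Number of tweets containing each word at least once.
colSums(as.matrix(dtm) > 0)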
Hope it helps
Nathan

Related

Check for matching rows in a CSV file - ruby

I am very new to Ruby and I want to check for rows with the same phone number in a CSV file.
What I am trying to do is go through the input CSV file and copy each row to the output file, adding another column called "duplicate". While copying data from input to output, I check whether the same phone number is already in the output file; if it already exists, I add "dupl" to that row's duplicate column.
This is what I have.
require 'csv'

file = CSV.read('input_file.csv')
output_file = File.open('output2.csv', 'w')
for row in file
  output_file.write(row.join(','))
  output_file.write("\n")
end
output_file.close
Example input file:
Phone
(202) 221-1323
(201) 321-0243
(202) 221-1323
(310) 343-4923
Output file:
Phone,Duplicate
(202) 221-1323,
(201) 321-0243,
(202) 221-1323,dupl
(310) 343-4923,
So basically you want to write the input to output and append a "dupl" on the second occurrence of a duplicate?
Your input-to-output copying seems fine. To get the "dupl" flag, count the occurrences of each number in the list; if there is more than one, it's a duplicate. But since you only want the flag shown on the second occurrence, just count how often the number appeared up until that point:
require 'csv'

lines = CSV.read('input_file.csv')
output_file = File.open('output2.csv', 'w')
lines.each_with_index do |l, i|
  output_file.write(l.join(',') + ',')
  # take(i) is every line before (but not including) the current one;
  # if the same row already appeared there, flag it as a duplicate.
  if lines.take(i).count(l) >= 1
    output_file.write('dupl')
  end
  output_file.write("\n")
end
output_file.close
l is the current line. take(i) is all lines before, but not including, the current line, and count(l) counts how often that line appeared among them; if it appeared at least once before, print a "dupl".
There is probably a more efficient answer than this; it's just a quick and easy-to-understand version.
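If you want the output to look exactly like the example above, with a header row and a separate Duplicate column, a slightly fuller sketch using the CSV writer might be (the file names are placeholders, and I'm assuming the input has a Phone header):
require 'csv'

rows = CSV.read('input_file.csv', headers: true)

CSV.open('output2.csv', 'w') do |out|
  out << ['Phone', 'Duplicate']
  seen = Hash.new(0)
  rows.each do |row|
    phone = row['Phone']
    seen[phone] += 1
    # Flag every occurrence after the first with "dupl".
    out << [phone, seen[phone] > 1 ? 'dupl' : nil]
  end
end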

HDFql Get Size of Group

I am wondering how to get the number of datasets within a group using C++ and HDFql. Currently I have tried something like this (inspired by the HDFql manual):
char script[1024];
uint64_t group_size = 0;
sprintf(script, "SHOW my_group SIZE INTO MEMORY %d", HDFql::variableTransientRegister(&group_size));
HDFql::execute(script);
But unfortunately this doesn't work at all.
Many thanks!
One possible way to solve your issue is to retrieve all the datasets stored in, e.g., group my_group like this:
HDFql::execute("SHOW DATASET my_group/");
And then, get the number of datasets found using HDFql function cursorGetCount (which returns the number of elements in the cursor). Example:
std::cout << "Number of datasets: " << HDFql::cursorGetCount();
As a side note, if you wish to retrieve all the datasets stored in group my_group and in its sub-groups, do the following (the LIKE option activates recursive search in HDFql):
HDFql::execute("SHOW DATASET my_group/ LIKE **");
For more information, please refer to HDFql reference manual and quick start.
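Putting the pieces together, a minimal sketch might look like this (example.h5 is a placeholder file name, and group my_group is assumed to exist in it):
#include <iostream>
#include "HDFql.hpp"

int main()
{
    // Open the HDF5 file (placeholder name) and list the datasets
    // stored directly inside group "my_group".
    HDFql::execute("USE FILE example.h5");
    HDFql::execute("SHOW DATASET my_group/");

    // The cursor now holds one element per dataset found.
    std::cout << "Number of datasets: " << HDFql::cursorGetCount() << std::endl;

    return 0;
}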

Reading "Awkward" CSV Files with FSharp CsvParser

I have a large file (200K - 300K lines of text). It's almost, but not quite, a CSV file.
The column headers are on the second row; there's a row of dummy text before that.
There are rows interspersed with the actual data rows. They have commas, but most of the columns are blank. They aren't relevant to me.
I need to read this file efficiently, and parse the lines that actually are valid, as CSV data.
My first idea was to write a cleaning procedure that strips out the first line and the blank lines, leaving only the headers and details that I want, in a CSV file that the CsvParser can read.
This is easy enough: just ReadLine from a StreamReader; I can keep or disregard each line just by looking at it as a string.
Now though I have a new issue.
There is a column in the valid data that I can use to disregard a whole lot more rows.
If I read the Cleaned file using the CsvParser it's easy to filter by that column.
But, I don't really want to waste writing the rows I don't need to the Clean file.
I'd like to be able to check that Column, while Cleaning the File. But, at that point I'm working with strings representing entire lines. It's not easy to get at the specific column I want.
I can't Split on ','; there may be commas in the text of other columns.
I'm ending up writing the CSV parsing logic that I was using CsvParser for in the first place.
Ideally, I'd like to read in the existing file, clean out the lines that I can based on strings, then somehow parse the resulting seq using the CsvParser.
I see CsvFile can Load from Streams and Readers, but I'm not sure that's much help.
Any suggestions or am I just asking too much? Should I just deal with the extra filtering on loading the Cleaned File?
You can avoid doing most of the work of parsing by using the CsvFile class directly.
The F# Data documentation has some extended examples that show how to do this in some detail.
Skipping over lines at the start of a file is handled by the skipRows parameter. Passing the ignoreErrors parameter will also ignore rows that fail to parse.
open FSharp.Data

let csv = CsvFile.Load(file, skipRows = 1, ignoreErrors = true)
for row in csv.Rows do
    printfn "%s" (row.GetColumn "Name")
If you have to do more complex filtering of rows, a simple approach that doesn't require temporary files is to filter the results of File.ReadLines and pass that to CsvFile.Parse.
The example below skips a six-line prelude, reads in lines until it hits a blank line, uses CsvFile to parse the data, and finally filters the resulting rows to those of interest.
open System.IO
open FSharp.Data
open FSharp.Data.CsvExtensions

let tableA =
    File.ReadLines(file)
    |> Seq.skip 6
    |> Seq.takeWhile (fun l -> String.length l > 0)
    |> String.concat "\n"

let csv = CsvFile.Parse(tableA)
for row in csv.Filter(fun row -> row?Close.AsFloat() > row?Open.AsFloat()).Rows do
    printfn "%s" (row.GetColumn "Name")

Categorizing Hashtags based on similarities

I have different documents with a list of hashtags in each. I would like to group them under the most relevant hashtag (which would be present in the document itself).
E.g.: if there are #Eco, #Ecofriendly, #GoingGreen - I would like to group all these under the most relevant and representative hashtag (say #Eco). How should I approach this, and what techniques and algorithms should I be looking at?
I would create a bipartite graph of documents-hashtags and use clustering on a bipartite graph:
http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bipartite.pdf
This way I am not using the content of the document, but just clustering the hashtags, which is what you wanted.
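As a hedged sketch of that idea: scikit-learn's SpectralCoclustering implements the bipartite spectral co-clustering algorithm from the paper linked above, so you can feed it the document-hashtag incidence matrix directly (the documents and hashtags below are made up for illustration):
import numpy as np
from sklearn.cluster import SpectralCoclustering

documents = [
    ["#Eco", "#EcoFriendly", "#GoingGreen"],
    ["#Eco", "#GoingGreen"],
    ["#Foodie", "#Vegan"],
    ["#Vegan", "#EcoFriendly"],
]

# Build the document-hashtag incidence matrix of the bipartite graph.
hashtags = sorted({h for doc in documents for h in doc})
index = {h: j for j, h in enumerate(hashtags)}
A = np.zeros((len(documents), len(hashtags)))
for i, doc in enumerate(documents):
    for h in doc:
        A[i, index[h]] = 1

# Co-cluster documents and hashtags simultaneously.
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A)
for c in range(2):
    members = [hashtags[j] for j in np.where(model.column_labels_ == c)[0]]
    print("hashtag cluster", c, ":", members)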
Your question is not very strict, and as such may have multiple answers. However, if we assume that you literally want to "group all these under the most common hashtag", then simply loop through all hashtags, compute how often they come up, and then for each document select the one with the highest number of occurrences.
Something like
# Count how often each hashtag occurs across all documents.
N = {}
for D in documents:
    for h in D.hashtags:
        N[h] = N.get(h, 0) + 1

# Tag each document with its most frequent hashtag.
for D in documents:
    best = None
    for h in D.hashtags:
        if best is None or N[best] < N[h]:
            best = h
    print('Document', D, 'should be tagged with', best)

detect if a combination of string objects from an array matches against any commands

Please be patient and read my current scenario. My question is below.
My application takes in speech input and is successfully able to group words that match together to form either one word or a group of words - called phrases; be it a name, an action, a pet, or a time frame.
I have a master list of the phrases that are allowed, stored in their respective arrays. So I have the following arrays: validNamesArray, validActionsArray, validPetsArray, and validTimeFramesArray.
A new array of phrases is returned each and every time the user stops speaking.
NSArray *phrasesBeingFedIn = @[@"CHARLIE", @"EAT", @"AT TEN O CLOCK",
                               @"CAT",
                               @"DOG", @"URINATE",
                               @"CHILDREN", @"ITS TIME TO", @"PLAY"];
Knowing that it's OK to have the following combinations to create a command:
COMMAND 1: NAME + ACTION + TIME FRAME
COMMAND 2: PET + ACTION
COMMAND n: n + n, .. + n
// In the example above, only the groups of phrases 'Charlie eat at ten o clock'
// and 'dog urinate' would be valid commands; the phrase 'cat' would not qualify
// for any of the commands and will therefore be ignored.
Question
What is the best way for me to parse through the phrases being fed in and determine which combination phrases will satisfy my list of commands?
POSSIBLE solution I've come up with
One way is to step through the array with if and else statements that check the phrases ahead and see whether they satisfy any valid command patterns from the list. However, this solution is not dynamic: I would have to add a new set of if and else statements for every single new command permutation I create.
My solution is not efficient. Any ideas on how I could go about creating something like this that works and stays dynamic, no matter whether I add a new command sequence or phrase combination?
I think what I would do is make an array for each category of speech (pet, command, etc). Those arrays would obviously have strings as elements. You could then test each word against each simple array using
[simpleWordListOfPets containsObject:word]
Which would return a BOOL result. You could do that in a case statement. The logic after that is up to you, but I would keep scanning the sentence using NSScanner until you have finished evaluating each section.
I've used some similar concepts to analyze a paragraph... it starts off like this:
while ([scanner scanUpToString:@"," intoString:&word]) {
    // Skip past the comma so the next scan starts on the next word.
    [scanner scanString:@"," intoString:NULL];

    processedWordCount++;
    NSLog(@"%i total words processed", processedWordCount);

    // Does the word exist in the simple list?
    if ([simpleWordList containsObject:word]) {
        //NSLog(@"Word already exists: %@", word);
    }
}
You would continue it with whatever logic you want (and you would scan for a space rather than a ",").
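To make the command matching data-driven rather than a pile of if/else branches, one hedged sketch (assuming the valid*Array lists and phrasesBeingFedIn from the question) is to describe each command as an array of category names and slide every pattern across the phrase list:
// Map category names to the master lists from the question.
NSDictionary *categories = @{
    @"NAME"       : validNamesArray,
    @"ACTION"     : validActionsArray,
    @"PET"        : validPetsArray,
    @"TIME FRAME" : validTimeFramesArray,
};

// Adding a new command is now just adding one array here.
NSArray *commandPatterns = @[
    @[@"NAME", @"ACTION", @"TIME FRAME"],   // COMMAND 1
    @[@"PET", @"ACTION"],                   // COMMAND 2
];

for (NSArray *pattern in commandPatterns) {
    for (NSUInteger start = 0; start + pattern.count <= phrasesBeingFedIn.count; start++) {
        BOOL matches = YES;
        for (NSUInteger k = 0; k < pattern.count; k++) {
            NSArray *allowed = categories[pattern[k]];
            if (![allowed containsObject:phrasesBeingFedIn[start + k]]) {
                matches = NO;
                break;
            }
        }
        if (matches) {
            NSRange range = NSMakeRange(start, pattern.count);
            NSArray *command = [phrasesBeingFedIn subarrayWithRange:range];
            NSLog(@"Matched command: %@", [command componentsJoinedByString:@" "]);
        }
    }
}
Assuming the example phrases fall into the categories the question implies, this should report 'CHARLIE EAT AT TEN O CLOCK' and 'DOG URINATE' while skipping 'CAT'.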
