Dataset for sentiment analysis of diary entries - machine-learning

I was wondering if there is a dataset of sentiment-labelled diary entries? What I am looking for is a table of diary entries and a label indicating at least whether the entry is "positive" or "negative" (or even classified into more categories).
Example (completely arbitrary):
"Today the floor was icy, I slipped and fell. I hate ice." => label: "negative"
"I love my friends! They organised a surprise party for me!" => label: "positive"

I found this dataset on Kaggle, which I hope is helpful: https://www.kaggle.com/c/si650winter11/data (in this dataset, 1 represents positive and 0 represents negative).
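In case it helps, here is a minimal sketch of reading such a file into (label, text) pairs. It assumes each line has the layout label, TAB, sentence; that layout is my assumption, so check it against the downloaded file. The function name and the sample lines are made up for illustration.

```python
def parse_labelled_lines(lines):
    """Yield (label, text) pairs from lines shaped like 'label<TAB>text'."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        label, text = line.split("\t", 1)
        yield int(label), text


# Made-up sample in the assumed layout (1 = positive, 0 = negative).
sample_lines = [
    "1\tI love my friends! They organised a surprise party for me!\n",
    "0\tToday the floor was icy, I slipped and fell. I hate ice.\n",
    "\n",
]
rows = list(parse_labelled_lines(sample_lines))
positive_count = sum(label for label, _ in rows)
```

With a real file you would pass an open file object instead of `sample_lines`.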

Related

Google Sheets pulling out specific data from multiple rows and columns to put into a logistic regression function

I have a spreadsheet of multiple years of student annual writing assessment scores.
Each row is the data for one test (Test Year, Student ID, Test score with subsections, etc.).
I need to fill in each student’s data into a logistic regression model with the following variables:
SUMPRODUCT FUNCTION where I need the selected data to appear:
Spreadsheet and corresponding cells needed in logistic regression function
B Constant Y3 -16.266 [Generate a number ‘1’ to balance the sumproduct function.]
B T1AvgScore 0.911 [Student’s first year test average score] I need a function to put the data here
B T3AvgScore 2.399 [Student’s third year test average score] I need a function to put the data here
B T3SF2 0.434 [Student’s third year subsection ‘Sentence Fluency (SF)’ score] I need a function to put the data here
B T3Conv2 0.251 [Student’s third year subsection ‘Conventions (Conv)’ score] I need a function to put the data here
y* = ln(p/(1-p)) [Calculated from the above sumproduct function]
p = exp(y*)/(exp(y*)+1) [calculation for the prediction percent]
Thanks in advance for any assistance!
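For reference, the calculation described above can be sketched outside the spreadsheet like this. The coefficients are the ones listed in the question; the function names and the sample scores are hypothetical.

```python
import math

# Coefficients as listed in the question.
COEFFICIENTS = {
    "constant": -16.266,
    "t1_avg_score": 0.911,
    "t3_avg_score": 2.399,
    "t3_sf": 0.434,
    "t3_conv": 0.251,
}


def linear_predictor(t1_avg_score, t3_avg_score, t3_sf, t3_conv):
    """y* = sum of coefficient * value, with 1 standing in for the constant."""
    values = {
        "constant": 1.0,
        "t1_avg_score": t1_avg_score,
        "t3_avg_score": t3_avg_score,
        "t3_sf": t3_sf,
        "t3_conv": t3_conv,
    }
    return sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


def predicted_probability(y_star):
    """p = exp(y*) / (exp(y*) + 1), as written in the question."""
    return math.exp(y_star) / (math.exp(y_star) + 1)


# Hypothetical scores for one student.
y_star = linear_predictor(t1_avg_score=3, t3_avg_score=4, t3_sf=4, t3_conv=4)
p = predicted_probability(y_star)
```

The dictionary plays the role of the SUMPRODUCT ranges: one column of coefficients multiplied by one column of student values.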
Well I'm not clear if I'm answering what you're looking for, but I have the formulas that pull the Average Score values from the AWA sheet for a given student number. See the tab I added to your sheet, Example-GK.
The query formula is simply:
=query(AWA,"select F where A = "&E$15&" and B = '"&D19&"' ",0)
where E15 holds the specified StudentID (a numeric value, so no single quotes are used), and D19 holds the specific year.
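If it helps to see the same lookup outside Sheets, here is a rough sketch of what the query does: filter by student and year, then return the score column. The mapping of column A to StudentID, B to year and F to the average score follows the formula above; the dictionary keys and sample rows are made up.

```python
def lookup_average_scores(rows, student_id, test_year):
    """Return the average-score values for one student in one year."""
    return [
        row["avg_score"]
        for row in rows
        if row["student_id"] == student_id and row["test_year"] == test_year
    ]


# Hypothetical rows; the year is text here because the query quotes it.
awa_rows = [
    {"student_id": 101, "test_year": "2018", "avg_score": 3.5},
    {"student_id": 101, "test_year": "2020", "avg_score": 4.0},
    {"student_id": 102, "test_year": "2018", "avg_score": 2.5},
]
matches = lookup_average_scores(awa_rows, 101, "2018")
```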
I also added the ability to select the StudentID number from a dropdown list, in E15 on that sheet. Or the StudentName could be used for the selection criteria, instead of the StudentID, if that was available and easier for you to use. For now, the StudentName is ignored, since it wasn't available in the data.
Let me know if this is what you're looking for. One issue is there might be more years of data for some students. There are other ways of listing the years, which might help you. I'll see if I can add that function.
Update Sept 9, 2020:
If I've understood your comments correctly, and for each model there is a set of constants that apply to all students (see below for the Model 3 constants), then I may have a generic set of formulae that calculate the probabilities for each student, using all three models, provided there is sufficient data for that student.
See my updated Example-GK in your sheet.
And let me know if I still haven't understood how your final probabilities are calculated from the individual student data values.

Google Sheets: Dense Ranking from sorted values

I have a simple table with 3 columns:
[Name] [Score] [Rank]
For the 3rd column, I'm using the following formula to rank each row according to the score:
=RANK(C9,$C$9:$C$28,0)
The problem is that the formula isn't returning the values I'd expect. For example on the last row it returns 19 when it should be 5.
I found other formulas for ranking (RANK.EQ, etc.) but same issue happens.
Here is the Google Sheet to see it in context:
https://docs.google.com/spreadsheets/d/1P1m7UHPPIcQLQkzpnk-SI1y7-0mhKytCWDjA6FJzFrM/edit?usp=sharing
Any guidance appreciated
The results you want can be achieved with a simple MATCH formula:
=match(round(C9,0),NamedRange1,0)
Provided an array (named NamedRange1 for above) is created, say with:
=sort(unique(round(C9:C28,0)),1,0)
I think the result is as intended. Check this Ranking Wikipedia page (called 'standard competition ranking'). It says:
Standard competition ranking ("1224" ranking)
In competition ranking, items that compare equal receive the same
ranking number, and then a gap is left in the ranking numbers. The
number of ranking numbers that are left out in this gap is one less
than the number of items that compared equal. Equivalently, each
item's ranking number is 1 plus the number of items ranked above it.
This ranking strategy is frequently adopted for competitions, as it
means that if two (or more) competitors tie for a position in the
ranking, the position of all those ranked below them is unaffected
(i.e., a competitor only comes second if exactly one person scores
better than them, third if exactly two people score better than them,
fourth if exactly three people score better than them, etc.).
Thus if A ranks ahead of B and C (which compare equal) which are both
ranked ahead of D, then A gets ranking number 1 ("first"), B gets
ranking number 2 ("joint second"), C also gets ranking number 2
("joint second") and D gets ranking number 4 ("fourth").
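The rule quoted above ("1 plus the number of items ranked above it") can be sketched in a few lines. This is only an illustration of the definition, not how Sheets implements RANK; the function name and scores are made up.

```python
def competition_rank(scores):
    """Rank each score as 1 plus the number of strictly higher scores."""
    return [1 + sum(other > score for other in scores) for score in scores]


# A ranks ahead of B and C (tied), which both rank ahead of D.
scores = [90, 85, 85, 70]
ranks = competition_rank(scores)  # [1, 2, 2, 4]
```

The gap after the tie (no rank 3) is exactly what makes the last rows of a long table end up with large rank numbers.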
What you want is 'dense ranking' and it can be achieved by pnuts's answer or something like this:
set G9 to 1
set G10 to =if(round(C10,0)<round(C9,0), G9+1, G9)
copy G10 and paste it into G11:G28
Sample sheet is here.
Thanks to @pnuts and @sangboklee for your solutions. I think I have a good solution now. It is pnuts's solution, just simplified:
=match(round($C9,0),sort(unique(round($C$9:$C$28,0)),1,false),0)
This essentially "embeds" the created array within a single formula, that can be applied to all rows. And as a bonus, the values don't even have to be sorted.
Please check for correctness folks, but I think this works. I've updated the linked Google Sheet from the original question description (it's "Solution 2b").
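As a cross-check of the logic, the same round / unique / sort / match steps can be sketched in Python. The names are hypothetical, and the rounding helper is my assumption for mirroring ROUND (ties away from zero), since Python's built-in round() treats ties differently.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to the nearest integer, with ties going away from zero."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def dense_rank(scores):
    """Rank by position in the descending list of distinct rounded scores."""
    rounded = [round_half_up(score) for score in scores]
    distinct_desc = sorted(set(rounded), reverse=True)  # sort(unique(round(...)),1,false)
    position = {value: index + 1 for index, value in enumerate(distinct_desc)}
    return [position[value] for value in rounded]  # match(..., 0)


# Made-up, deliberately unsorted scores.
ranks = dense_rank([90.4, 85.5, 86.2, 70, 90])
```

Because the rank is looked up in the list of distinct values, ties never leave gaps and the input order does not matter.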

Analyzing raw data from a Google Sheet into a 'dashboard'

Situation:
Every time I visit a member of my field team, I put their 'score' into a Google Form, which then puts the raw data into this Sheet.
I'd like a second Dashboard sheet that has:
* all of the raw data, plus a calculation of their "overall score" (an average of parts A, B, C, and D)
* a way to easily see the average score of each person per quarter (Q4 2016, Q1 2017)
* a way to easily see the average score for each type of Observation (Live vs scenario #1, 2, 3)
* a drop-down where I can select a user and see their scores on a chart compared to the rest of the group's averages
I've done some of the work [here], but would love some help figuring out the best way to do this (keeping in mind the performance of the sheet, considering I'd actually have thousands of rows of raw data).
*Things to note:*
* I might score one person twice in a row before getting to the next person
* I might score one person twice in a month (I'm not sure how to show that in the Dashboard)
Thank you in advance for your help. I'm trying to learn as much as I can, but it's all still pretty new to me.
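To make the calculations I'm after concrete, here is a rough sketch in Python. The field names and sample rows are made up; the real columns come from the Form responses.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def quarter_label(day):
    """Turn a date into a label such as 'Q4 2016'."""
    return f"Q{(day.month - 1) // 3 + 1} {day.year}"


def overall_score(row):
    """Overall score = average of parts A, B, C and D."""
    return mean([row["a"], row["b"], row["c"], row["d"]])


def average_by(rows, key_func):
    """Average the overall score within each group defined by key_func."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_func(row)].append(overall_score(row))
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical raw data; the same person can appear several times.
raw = [
    {"person": "Alex", "date": date(2016, 11, 3), "type": "Live", "a": 4, "b": 3, "c": 5, "d": 4},
    {"person": "Alex", "date": date(2016, 12, 1), "type": "Scenario 1", "a": 2, "b": 3, "c": 3, "d": 4},
    {"person": "Sam", "date": date(2017, 1, 15), "type": "Live", "a": 5, "b": 5, "c": 4, "d": 4},
]
per_person_quarter = average_by(raw, lambda r: (r["person"], quarter_label(r["date"])))
per_type = average_by(raw, lambda r: r["type"])
```

Scoring one person twice in a month just adds another value to that person's group for the quarter.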

Should PAX be in the Flight dimension or the Fact Sales table?

I need to build a data mart using Power Pivot for a duty-free shop at an airport.
The sales manager analyses sales data by flight number and by PAX (the number of people per flight).
So I don't know where to put PAX: in DimFlight or in FactSales. It is additive, right?
Please explain why and how I should put PAX into whichever table it belongs in. DimFlight may include airline, flight_no, date, PAX. A flight may also land at the airport more than once a day.
PAX is a fact describing a measurable value of a specific flight event. It should be in the fact table, not in the flight dimension. I would expect total capacity to be an attribute of the plane dimension associated with the flight event. (Flight number would likely be a degenerate dimension, as it doesn't really own any attributes.) However, the PAX itself should be a measure in the fact table.
You can generate a junk dimension that has the banding mentioned by @Luis Leal to do some capacity analytics. You can even create a numbers dimension with an attribute for each group level so you can do more detailed banding. For example, an attribute for 1s, 10s, 100s, 1000s, etc. You can also calculate the filled capacity of the flight and point to the numbers dimension so you can group flights by 80% full, 90% full, etc.
Nothing stops you from modeling it as both dimension and measure, so you can store it both on a dimension table and as a measure on a fact table. If you store it as a measure on the fact table, you can perform several analyses by the other possible dimensions and get insights such as averages, max, min, and totals by x or y dimension, which would be very difficult if you stored it only on the dimension table.
On the other hand, storing it in the dimension table enables additional "perspectives" of analysis. For example, a common approach is to store "interval" columns in the dimension table, with values like: from 1 to 1000 pax, from 1001 to 2000. This column is calculated at ETL time depending on the value of the PAX. So why not use both?
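A minimal sketch of how such band columns could be derived at ETL time is below. The function names, the band width of 1000 and the 10% step are assumptions for illustration, not part of any particular tool.

```python
def pax_band(pax, width=1000):
    """Label the interval a PAX value falls into, e.g. 'from 1 to 1000 pax'."""
    if pax < 1:
        raise ValueError("pax must be at least 1")
    lower = ((pax - 1) // width) * width + 1
    upper = lower + width - 1
    return f"from {lower} to {upper} pax"


def fill_band(pax, capacity, step=10):
    """Floor the filled-capacity percentage to a band, e.g. 87% -> '80% full'."""
    percent = (100 * pax) // capacity
    return f"{(percent // step) * step}% full"


# Hypothetical flight event: PAX stays a measure, the bands become attributes.
flight_event = {"flight_no": "XX123", "pax": 174, "capacity": 200}
flight_event["pax_band"] = pax_band(flight_event["pax"])
flight_event["fill_band"] = fill_band(flight_event["pax"], flight_event["capacity"])
```

The numeric PAX value is kept alongside the bands, so it can still be aggregated as a measure while the bands are used for grouping.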

Weka - StringToWordVector filter not working

I am practicing Weka using the Reuters data. The StringToWordVector filter normally works for converting my string data (shown below), so I can analyze the articles to understand which words predict the article type. The original dataset marked the article type as TRUE/FALSE, but I converted it to 0/1. However, the filter refuses to work for this one ARFF file when applied to the "review" string attribute.
I used the following StringToWordVector filter while ONLY checking the review attribute:
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""
I get this error:
"Problem filtering instances: attribute names are not unique. Cause: sentiment" when only review is checked for the filter.
Here is the header of my dataset/formatting for a few of the cases:
@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"cocoa the the cocoa the early the levels its the the this the ended the mln against at the that cocoa the to crop cocoa to crop around mln sales at mln the to this cocoa export the their cocoa prices to to per to offer sales at to dlrs per to to crop sales to at dlrs at dlrs at dlrs per sales at at at at to dlrs at at dlrs the currency sales at to dlrs dlrs dlrs the currency sales at at dlrs at at dlrs at at sales at mln against the crop mln against the the to to the cocoa commission reuter", 0
"prices reserve the agriculture department reported the reserve price loan call price price wheat corn 1986 loan call price price reserves grain wheat per reuter", 0
"grain crop their products to to wheat export the export wheat oil oil reuter", 0
"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0
"products stock split products inc its stock split its common shares shareholders the company its to to shareholders at the the stock mln to mln reuter", 0
Anyone have any ideas on why this is happening? I was thinking there might be a conflict with the fact the data might contain 0 and 1s as part of the words occurring naturally in the text. I'm also thinking I might need an additional space before the quote for the string after the previous string.
Hi, the problem is that the filter converts every term in a string into an attribute. There must be a term "review" or "sentiment" somewhere in your data section, so the attribute names end up duplicated.
So, rename these two attributes to something like "myreview" and "mysentiment", or to anything that is unlikely to occur in your data. It should work.
I also encountered the same problem because the word "domain" appeared in the data, which clashed with an attribute of the same name. My solution was to remove every "domain" from the data and keep only the "domain" in @attribute.
The easiest way to avoid these attribute name clashes is to use a prefix for the generated attributes.
The prefix can be supplied via the -P command-line option, the attributeNamePrefix option in the GenericObjectEditor or the setAttributeNamePrefix method from Java code.
See Javadoc of StringToWordVector filter.
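To illustrate why a prefix helps, here is a small Python simulation of the naming clash. It is not Weka's code, just a sketch of the idea: every word becomes an attribute name, and a word that equals an existing attribute name collides unless a prefix is added. The function name, the sample documents and the "w_" prefix are made up.

```python
def build_attribute_names(other_attributes, documents, prefix=""):
    """Simulate word-vector naming: one attribute per distinct word, plus the rest."""
    vocabulary = sorted({token for doc in documents for token in doc.split()})
    names = list(other_attributes) + [prefix + token for token in vocabulary]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"attribute names are not unique: {duplicates}")
    return names


# Made-up documents; one of them happens to contain the word "sentiment".
documents = ["market sentiment improved", "cocoa prices fell"]

try:
    build_attribute_names(["sentiment"], documents)
    clash_detected = False
except ValueError:
    clash_detected = True  # the word "sentiment" collides with the class attribute

# With a prefix, generated names can no longer equal "sentiment".
prefixed_names = build_attribute_names(["sentiment"], documents, prefix="w_")
```

Renaming the original attributes, as suggested above, avoids the same collision from the other side.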
