I'm doing some text processing and I'm interested in finding and scoring paragraphs of text based on frequency of words and/or phrases, using Ruby ideally.
An example of the problem would be: I have "apple", "banana", "fruit salad", and "orange". This list is likely to be several thousand words and/or phrases long.
I have a body of text to search:
I have a set of apples, and apple computer, and an account on Apple.com but never a fruit salad. Why they never released an Apple Computer that doubled as an orange was beyond me.
This would spit out an array that said:
Apple 4
Orange 1
Banana 0
Fruit salad 1
Ideally, I'd be able to apply different weights, like the domain "apple.com" gets two points, etc.
Is there a library that is particularly useful for doing this?
text = <<_.downcase
I have a set of apples, and apple computer, and an account on Apple.com. Why they never released an Apple Computer that doubled as an orange was beyond me.
_
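terms = ["apple", "banana", "fruit salad", "orange"]
# Regexp.escape keeps terms that contain regex metacharacters (e.g. "apple.com") literal.
terms.map { |w| [w, text.scan(/\b#{Regexp.escape(w)}\b/).length] }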
# => [
# ["apple", 3],
# ["banana", 0],
# ["fruit salad", 0],
# ["orange", 1]
# ]
A very easy way to do this is to keep a hash of counts, where the key is the word and the value is incremented on each occurrence of that word.
Once you have built the hash, you can easily print out the count of each word, such as Apple, Orange, Banana. If case doesn't matter, make sure you convert each word to lower case before using it as the key.
It looks like you are trying to count term frequency; try this package: https://github.com/reddavis/TF-IDF
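A minimal Ruby sketch of the hash-of-counts idea, extended to phrases and weights. It assumes terms are matched case-insensitively on word boundaries and that overlapping terms are credited to the longest match only (so "Apple.com" counts toward "apple.com", not "apple"). The helper name score_terms and the weights table are hypothetical and not part of any library.

```ruby
def score_terms(text, terms, weights = {})
  # Longest terms first, so "apple.com" is tried before "apple".
  ordered = terms.map(&:downcase).sort_by { |term| -term.length }
  pattern = /\b(?:#{Regexp.union(ordered).source})\b/

  # One pass over the text: each match increments its term's count.
  counts = Hash.new(0)
  text.downcase.scan(pattern) { |match| counts[match] += 1 }

  # Terms without an explicit weight score 1 point per occurrence.
  terms.map do |term|
    key = term.downcase
    [term, counts[key] * weights.fetch(key, 1)]
  end.to_h
end

text = "I have a set of apples, and apple computer, and an account on " \
       "Apple.com but never a fruit salad. Why they never released an " \
       "Apple Computer that doubled as an orange was beyond me."
terms = ["apple", "banana", "fruit salad", "orange", "apple.com"]

# Assumed weighting: a hit on the domain is worth two points.
scores = score_terms(text, terms, { "apple.com" => 2 })
scores.each { |term, score| puts "#{term} #{score}" }
```

Because of the word boundaries, "apples" is not counted as "apple"; a stemmer or an optional plural suffix in the pattern would be needed for that.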
I am using Google Forms to create a survey with weighted answers. I've been able to make things work when there is just one possible correct answer - I made a separate tab with a set of answer tables with point values assigned, then used vlookup to call back and match the given response to the answer table and fetch the assigned point value.
=VLOOKUP(P2, Sheet2!$A$49:$B$50, 2, FALSE)
P2 is a value pulled from the "Form Responses" tab - in this case, a yes/no answer. Sheet 2 has a table for each question with the possible answers and the point values for each answer (A49=yes, A50=no)
However, for some of the questions, multiple answers are valid and I want to add up the total number of points for that given question. So for example:
"What are your hobbies?", where folks can choose from:
Riding your bike
Playing football
Swimming
Going fishing
Painting
And the respective point values are 2, 2, 3, 4, 4
So then, if someone chose the "Swimming" and "Going fishing" checkboxes in the form, I'd get "7", and if someone chose "Riding your bike", "Playing football", and "Painting", I'd get "8".
I realize that the output from the Google form will list the chosen answers all in one cell (Playing football, Going fishing), so I'm not sure how to make it count each answer (especially since some of them are multi-word answers) and output the sum of the values.
VLOOKUP on its own is not suitable in this case. Try FILTER, like:
=FILTER(Sheet2!B49:B50, Sheet2!A49:A50=P2)
then VLOOKUP it like:
=SUMPRODUCT(IFNA(VLOOKUP(FILTER(Sheet2!B49:B50, Sheet2!A49:A50=P2), sheetx!A:B, 2, 0)))
where sheetx!A:B is like:
Riding your bike    2
Playing football    2
Swimming            3
Going fishing       4
Painting            4
and if Sheet2!B49:B50 contains multiple comma+space separated values you will need to split them like:
=SUMPRODUCT(IFNA(VLOOKUP(FILTER(
IFERROR(SPLIT(Sheet2!B49:B50, ", ")), Sheet2!A49:A50=P2), sheetx!A:B, 2, 0)))
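For reference, the computation the question asks for can be sketched outside the spreadsheet. This Ruby sketch is an illustration of the logic, not a translation of the formulas above: it assumes the form writes the chosen checkboxes into one cell separated by ", " and that unrecognised answers score 0 (the role IFNA plays around VLOOKUP). The method name score_response is hypothetical.

```ruby
# Point values copied from the hobby list in the question.
points = {
  "Riding your bike" => 2,
  "Playing football" => 2,
  "Swimming" => 3,
  "Going fishing" => 4,
  "Painting" => 4
}

# Split the cell text on ", " and add up the points of each recognised answer.
def score_response(cell, points)
  cell.split(", ").sum { |answer| points.fetch(answer.strip, 0) }
end

puts score_response("Swimming, Going fishing", points)
puts score_response("Riding your bike, Playing football, Painting", points)
```

Splitting on the full ", " separator rather than on spaces is what keeps multi-word answers such as "Going fishing" intact.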
My apologies for asking an incomplete question previously.
This is what I'm trying to accomplish.
I'm building a TTRPG sheet that automatically combines dice rolls, bonuses (additive) and penalties (subtractive) from a variety of sources. All of this data is expressed as either dice notation (D4, D6, D8, D10, D12, D20, and D100) or an Integer (1, 2, 4, 6), or both (combined). These also include negative values (-1D4, -1D6, -2, etc.). The goal isn't to generate the random numbers, but instead combine like dice together for the player to roll manually (I tried the automatic random numbers... Players were not happy about it.)
So, the goal is to combine likes, so something like:
"1D6+1D6" would become "2D6". However, because penalties could outweigh the bonus, you can't combine "1D6+1D6+-1D6" into "1D6". (Since each of the rolls could be a different number, such as "6+6-1" compared to "1+1-6").
Additionally, Integers (2, 4, 6, 8, etc.) are by necessity handled in a different part of the sheet, so the goal is to strip the integers out from the output. (The reason for stripping them out has nothing to do with formula complexity, but other game factors that require it to be viewed separately.)
Here are some examples of typical inputs and expected outputs:
1D6+1D4+1D8+-1D4+1D6+2 = 1D4+-1D4+2D6+1D8 (Notice the integer is removed)
1D6+2+0+1+8 = 1D6 (Because all integers have been stripped out)
1D20+-1D4+2D6+0+1D6+-1D6 = +-1D4+3D6+-1D6+1D20
(Yes, negative numbers will have the "+-" in front of them).
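The rules implied by these examples can be written down as a small reference implementation, which may help when checking a spreadsheet formula against them. This is a Ruby sketch under assumptions read off the examples: terms are separated by "+", plain integers are dropped, bonuses and penalties for the same die are totalled separately, output is ordered by die size with bonuses first, and a result that starts with a penalty gets a leading "+". The method name combine_dice is hypothetical.

```ruby
def combine_dice(expression)
  # Tally dice per [sides, sign] so bonuses and penalties stay separate.
  totals = Hash.new(0)
  expression.split("+").each do |term|
    parsed = term.strip.match(/\A(-?\d+)D(\d+)\z/i)
    next unless parsed # plain integers such as "2" or "0" are dropped

    count = parsed[1].to_i
    next if count.zero?

    sides = parsed[2].to_i
    totals[[sides, count.positive? ? 0 : 1]] += count
  end

  # Sort by die size, bonuses before penalties, then rebuild the notation.
  combined = totals.sort.map { |(sides, _sign), count| "#{count}D#{sides}" }.join("+")
  combined.start_with?("-") ? "+#{combined}" : combined
end

puts combine_dice("1D6+1D4+1D8+-1D4+1D6+2")
puts combine_dice("1D6+2+0+1+8")
puts combine_dice("1D20+-1D4+2D6+0+1D6+-1D6")
```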
My original "mostly working" formula was 2 solid pages long when copied/pasted into MS Word. This formula will be repeated THOUSANDS of times, so smaller/faster makes a huge difference in the overall scheme of things. Two previous amazing Spreadsheet Wizards (Player0 and TheMaster) gave great answers, but I failed to disclose the integer as a part of the overall process.
The table below shows the formula that works for the first example, but not the second (gives "2D" in the output).
For original explanation, see Google Sheets Formula for combining dice rolls
After the first split by +, check if the result is a TEXT and if not, FILTER it out:
=JOIN("+",BYROW(QUERY(REDUCE({"",""},SEQUENCE(2),LAMBDA(a,c,{a;QUERY({ARRAYFORMULA(SPLIT(TRANSPOSE(LAMBDA(ar,FILTER(ar,ISTEXT(ar)))(SPLIT(B1,"+"))),"D"))}," select sum(Col1),Col2 where Col1"&IF(c=1,">","<")&"0 group by Col2 label sum(Col1) ''")})),"order by Col2"),LAMBDA(r,JOIN("D",r))))
When there are no negative values, add an empty array {"",""} to stand in for the resulting #N/A:
=JOIN("+",BYROW(QUERY(REDUCE({"",""},SEQUENCE(2),LAMBDA(a,c,{a;IFNA(QUERY({ARRAYFORMULA(SPLIT(TRANSPOSE(LAMBDA(ar,FILTER(ar,ISTEXT(ar)))(SPLIT(B1,"+"))),"D"))},"select sum(Col1),Col2 where Col1"&IF(c=1,">","<")&"0 group by Col2 label sum(Col1) ''"),{"",""})})),"where Col1 is not null order by Col2"),LAMBDA(r,JOIN("D",r))))
try:
=INDEX(REGEXREPLACE(TEXTJOIN("+", 1, FLATTEN(QUERY(TRANSPOSE(QUERY(QUERY(IFERROR(IFNA(TRANSPOSE({
REGEXEXTRACT(SPLIT(C5, "+"), "^\d+")*1; REGEXEXTRACT(SPLIT(C5, "+"), "D\d+");
REGEXEXTRACT(SPLIT(C5, "+"), "^-\d+")*1; REGEXEXTRACT(SPLIT(C5, "+"), "D\d+");
REGEXEXTRACT(SPLIT(C5, "+"), "D(\d+)")*1}), 0)),
"select sum(Col1),Col2,'+',sum(Col3),Col4,Col5
where Col2 is not null group by Col2,Col4,Col5 order by Col5"),
"select Col1,Col2,Col3,Col4,Col5 offset 1", )),,9^9))), " |\+ 0 D\d+", ))
Here is a formula:
=sum(filter(somerange,someotherrange=index(lookups!$E$3:$F$50,match($A34,lookups!$F$3:$F$50,0),1)))
So, if the range "someotherrange" contains say apples, pears and oranges, and if I were to lookup apples, pears and oranges in my index match and it returns a single value of "fruit", I would like to sum "somerange" where it is a fruit.
I can hear you yelling "just add a column to your source data table with a lookup for each and base it on that", but the particular sheet I have makes that pretty complicated (just trust me on this).
Continuing this example, is there a way to sum(filter()) on "somerange" where the values in "someotherrange" correspond to a fruit?
Example sheet here.
Try this formula:
=SUM(FILTER(C5:C13,ARRAYFORMULA(VLOOKUP(A5:A13,A17:B25,2,0))="fruit"))
for fruits.
I am not certain if Excel can do this, but I am trying to simplify the data dump that I get from Twitter.
Basically what I would like to do is this:
If the tweet (in Column A) contains apple OR orange OR pear then it can be classified (in Column B) as "fruit" BUT if it has carrot OR squash OR lettuce it will be classified as "vegetable". If it has none of these then can be classified as "none"
Is this possible?
Thanks in advance.
Here is a version using an array constant, followed by one using a range.
=IF(SUMPRODUCT(IF(ISERROR(SEARCH({"apple","orange","pear"},A1)),0,1))>0,"Fruit",IF(SUMPRODUCT(IF(ISERROR(SEARCH({"carrot","squash","lettuce"},A1)),0,1))>0,"Vegetable","None"))
Note that if both a fruit and a vegetable are present in a string, the formula always tests for fruit first, because that is the order in which it is arranged (e.g. "more apple on salad than lettuce" will return "Fruit").
You can also use a range that contains your list instead of the array constant.
For example, you can put your fruit list in Column C (C1:C3) and your vegetable list in Column D (D1:D3). Your formula would then be:
=IF(SUMPRODUCT(IF(ISERROR(SEARCH(C$1:C$3,A1)),0,1))>0,"Fruit",IF(SUMPRODUCT(IF(ISERROR(SEARCH(D$1:D$3,A1)),0,1))>0,"Vegetable","None"))
But you need to enter it as an array formula using Ctrl+Shift+Enter.
The same results and rule apply when both a fruit and a vegetable appear in a string. HTH.
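The first-match-wins behaviour described above can be made explicit in a short sketch. This Ruby version is only an illustration of the logic and assumes case-insensitive substring matching, as SEARCH does. The method name classify is hypothetical, and the keyword lists are the ones from the question.

```ruby
# Categories are checked in insertion order, so "Fruit" wins over "Vegetable".
categories = {
  "Fruit" => %w[apple orange pear],
  "Vegetable" => %w[carrot squash lettuce]
}

def classify(tweet, categories)
  lowered = tweet.downcase
  # Find the first category with at least one keyword present in the tweet.
  label, _keywords = categories.find do |_label, keywords|
    keywords.any? { |keyword| lowered.include?(keyword) }
  end
  label || "None"
end

puts classify("more apple on salad than lettuce", categories)
puts classify("Roasted squash tonight", categories)
puts classify("Nothing edible here", categories)
```

Reordering the hash entries changes which category wins when a tweet mentions both kinds of keyword.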
Sure.
Try this formula
=IF(
  OR(
    NOT(ISERROR(SEARCH("apple",A1))),
    NOT(ISERROR(SEARCH("pear",A1))),
    NOT(ISERROR(SEARCH("orange",A1)))
  ),
  "fruit",
  IF(
    OR(
      NOT(ISERROR(SEARCH("carrot",A1))),
      NOT(ISERROR(SEARCH("squash",A1))),
      NOT(ISERROR(SEARCH("lettuce",A1)))
    ),
    "veggie",
    "none"
  )
)
I am using Highcharts, trying to make a chart that shows high and low values for the number of people occupying various rooms. I have data like this:
[[roomName, low, high], [roomName, low, high] ...]
For example:
[["XRay", 12, 45], ["Waiting Room", 8, 22], ["Admitting", 22, 56]]
What I want is for the x axis to use the room names as the values on the category axis, but I can't seem to get this to happen. It uses them as the names of the points instead.
If I am just doing a column chart, I can set x and y properties for the points:
[{x: "XRay", y: 12}, {x: "Waiting Room", y: 8}, {x: "Admitting", y: 56}]
But I don't know how I can do this with column ranges.
I can of course manually parse the data and set the categories of the xAxis myself, but I am wondering if there is a better way.
Thanks!
It can be done in two ways: with categories, as suggested, or with a label formatter. Both are shown here: http://jsbin.com/oyudan/17/edit
Point formats like
[ {x: 'string' } ... ]
are not valid.