The output of Mahout spark-itemsimilarity and its indicators - mahout

The output of Mahout (0.11.1) spark-itemsimilarity looks like:
3705021559 3705021558:241.35418715327978 3705021546:163.6168323904276
As I understand it, the format is:
(item)tab(item1:score)tab(item2:score)
where item1, item2, ... itemx are the so-called indicators.
My question is how to use the indicators.
In some examples, such as
https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/recommender-tutorial and https://www.mapr.com/blog/mahout-spark-whats-new-recommenders%E2%80%94part-2,
the indicators are indexed, and recommendations are obtained by querying the indicator field. As I read it: we take the list of what a person bought, query Elasticsearch/Solr with that list against the indicator field, and get back the recommended (similar) items.
Why is it not simply the other way around: given the list of what a person bought, query the ID field and take the indicators as the result? In other words, hasn't the output of spark-itemsimilarity already told us which items (indicators) are similar to an item?
Maybe I have misunderstood the meaning of indicators. Could anybody please clarify?

3705021559 3705021558:241.35418715327978 3705021546:163.6168323904276
is exactly the format (item)tab(item1:score)tab(item2:score).
The first item is the item being compared to all the rest. So this is saying that compared with 3705021559, 3705021558 has a Log-likelihood ratio of 241.35418715327978 and so on.
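To make the row layout concrete, here is a minimal Python sketch that splits one output row into the item and its (indicator, score) pairs. The function name is made up for illustration, and splitting on any whitespace is an assumption so that both tabs and spaces between fields are accepted.

```python
def parse_similarity_line(line):
    """Split one output row into (item_id, [(indicator_id, score), ...])."""
    fields = line.split()  # assumption: tolerate tabs or spaces between fields
    item_id, pairs = fields[0], fields[1:]
    indicators = []
    for pair in pairs:
        # Each remaining field looks like "<indicator id>:<score>".
        indicator_id, score = pair.rsplit(":", 1)
        indicators.append((indicator_id, float(score)))
    return item_id, indicators


row = "3705021559\t3705021558:241.35418715327978\t3705021546:163.6168323904276"
item, indicators = parse_similarity_line(row)
print(item, indicators)
```

The first element of the result is the item being compared, and each pair is another item together with its score against it.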
The output matches your input, so if 3705021558 is not an item ID, check how you specified the location of the item ID in the input. Run spark-itemsimilarity with no parameters to see the help output. You can specify where in the input TSV your item ID is, and where the user ID and the indicator name are.
BTW if you are planning to use this in a recommender try out the Universal Recommender, which has event capture and a recommendation server all integrated.
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
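As a rough picture of the indexing approach described in the question, the sketch below scores every item by how many of its indicators appear in a user's history. This is only a toy stand-in: a search engine such as Elasticsearch or Solr applies its own relevance scoring, and the plain overlap count, the function name, and the sample index here are all assumptions made for illustration.

```python
def recommend(indicator_index, history, top_n=5):
    """Rank items by how many of their indicators appear in the user's history."""
    seen = set(history)
    scores = {}
    for item, indicators in indicator_index.items():
        if item in seen:
            continue  # do not recommend what the user already has
        overlap = seen.intersection(indicators)
        if overlap:
            scores[item] = len(overlap)
    # Highest overlap first; ties broken by item id for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical toy index: item -> indicator item IDs.
index = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["C"],
}
print(recommend(index, history=["A", "D"]))
```

Querying the indicator field with the whole history in one request lets the engine combine evidence from every item the user touched, which is what the overlap count imitates here.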

Related

Automatically moving data to another tab based on criteria

I have a Google Sheet I'm trying to build. This is not the exact workbook, and it contains only fake data, but the gist is the same:
https://docs.google.com/spreadsheets/d/12ebgFNCcRbJfgz6MS5XxcLCEv9vcEGw-0aUJrYmIqec/edit#gid=0
You'll notice that the first tab is called "Master". I want Master to be populated whenever someone in one of the three sheets has a Grade of "Negative" or has the "Follow Up" checkbox checked. This should cover the data currently in the workbook as well as any data added in the future.
So the end result would be the "Master" tab looking like the picture below.
Is there a way I can do this?
After checking the sheet you provided, it seems you already have the correct formula for pulling the data from the other sheets into "Master". The only difference between its output and your expected output is the checkboxes.
If you want checkboxes instead of TRUE/FALSE in the queried data, first create blank checkboxes in the Follow-up? and Resolved? columns; their values will then follow the queried values. Please see screenshot below:
Please let me know if you have other concerns aside from the ones mentioned.

How do I target multiple XML elements with the same name and without any id?

I am trying to scrape a website for the financials of Indian companies as a side project and put them in Google Sheets using XPath.
Link: https://ticker.finology.in/company/AFFLE
I am able to extract data from elements that have a specific id, such as cash, net debt, etc. However, I am stuck extracting data for labels like Sales Growth.
What I tried:
Copying the full XPath from the console, //*[@id="mainContent_updAddRatios"]/div[13]/p/span. This works, but I am reliant on the index of the div (13), which may change for different companies, so I cannot automate it.
Please assist with a scalable solution.
PS: I am a Product Manager with basic coding expertise, as I was a developer a few years ago.
At some point you need to "hardcode" something unless you have some other means of mapping the content of the page to your spreadsheet. In your example you appear to be targeting "Sales Growth" percentage. If you are not comfortable hardcoding the index of the div (13), you could identify it by the id of the "Sales Growth" label which is mainContent_lblSalesGrowthorCasa.
For example, change your
//*[@id="mainContent_updAddRatios"]/div[13]/p/span
to:
//*[@id = "mainContent_updAddRatios"]/div[.//span/@id = "mainContent_lblSalesGrowthorCasa"]/p/span
This selects the div that contains a span with id="mainContent_lblSalesGrowthorCasa". Ultimately, whether you "hardcode" the exact index of the div or "hardcode" the ids of the nodes, you are still embedding assumptions about the structure of the page.
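The same selection logic (find the container by id, pick the div that holds the labelled span, then read p/span) can be sketched in Python with the standard library. The HTML fragment, the second label id, and the values below are invented for illustration and will not match the real page exactly; ElementTree also supports only a subset of XPath, so the predicate is written as a loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the page layout; the real page may differ.
SAMPLE = """<body>
  <div id="mainContent_updAddRatios">
    <div><small><span id="mainContent_lblOther">Other</span></small><p><span>1.0 %</span></p></div>
    <div><small><span id="mainContent_lblSalesGrowthorCasa">Sales Growth</span></small><p><span>12.5 %</span></p></div>
  </div>
</body>"""


def value_for_label(root, container_id, label_id):
    """Return the text of p/span in the div that contains the labelled span."""
    container = root.find(f".//*[@id='{container_id}']")
    if container is None:
        return None
    for div in container.findall("div"):
        # Keep the div only if some descendant span carries the label id.
        if div.find(f".//span[@id='{label_id}']") is not None:
            value = div.find("p/span")
            return value.text if value is not None else None
    return None


root = ET.fromstring(SAMPLE)
print(value_for_label(root, "mainContent_updAddRatios", "mainContent_lblSalesGrowthorCasa"))
```

Because the div is chosen by the label id instead of by position, reordering the divs in the sample does not change the result.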
Thanks @david, that helped.
Two questions:
What if the structure of the page changes? For example, if the website removed the p tag, would my sheet fail? How do we avoid failure in such cases?
Also, since every id is unique, the probability of an id changing is lower than that of the index changing. Correct me if I am wrong.
What do we do when the elements don't have an id, like Profit Growth, RoE, RoCE, etc.?

Transform comma separated google form answers to multiple lines in spreadsheet

I have made a Google Form for which some answers are stored as comma-separated strings in the automatically populated Google spreadsheet. I would like to read from this sheet into another sheet and reformat the answers so that each comma-separated answer is shown on a new row. I have tried an ARRAYFORMULA that reads from the original sheet, combined with a solution that uses SPLIT and TRANSPOSE on the cell content, but together with the ARRAYFORMULA this fails because it would overwrite the contents of other cells.
Here is an example spreadsheet with the responses, a solution sheet, and a desired results sheet. https://docs.google.com/spreadsheets/d/1r_l5fVJ9lGfpubO2o3pXicV7JlZWmANjwSgNi7_DL0A
Any suggestions for how I can achieve the end result?
Okay, I assume this isn't really what you want, but visually it looks okay...
Try this formula:
={{'Form responses'!A2:A3},ArrayFormula(regexreplace('Form responses'!B2:E3,", ",CHAR(10)))}
Then format the cells so that the cell contents are TOP-aligned, instead of the default BOTTOM-aligned.
Realistically, I imagine that you want each question answer split into multiple cells. But if your data responses really contain letter values separated by commas, as you've indicated, you can still search through those cells to find whether an answer contains a certain value. It all depends on why you want the results structured the way you do.
If you can clarify what you want to do with the form results, instead of just appearing vertically for each question, perhaps we can provide a full solution for that requirement?
UPDATE1:
Okay, I may be getting close. I can get your data transformed to look like the following:
This would let you do the analysis that you want, by searching for Q.1 (question 1 responses) in the first column, and then all the answers in the third column, along with the owner in column 2. And from this, it will also definitely be possible to put the results in the exact form you want. It just may take an intermediate step.
UPDATE2:
Okay, I think I have something you can use. I can convert your data to either of the following two layouts.
The one on the right is closest to what you asked for, with the exception that the answers on the right are bottom aligned, with blanks above. But you can still process them for analysis, with queries. I honestly think having the user identifier (email address) on each row would make things simpler, but I can provide it either way.
The layout on the left is more of a traditional database layout, and would make analysis very simple. Each row has the date and email identifiers, the question number, and the answer (or one of the answers) to that question, from that user.
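The database-style layout can be illustrated outside Sheets with a short Python sketch. It is not the formula used in the sample sheet; the field names, the question labels, and the sample response are assumptions chosen only to show the reshaping.

```python
def to_long_rows(responses, id_fields=("timestamp", "email")):
    """Turn one dict per form response into one row per individual answer."""
    rows = []
    for response in responses:
        ids = tuple(response[field] for field in id_fields)
        for question, answer in response.items():
            if question in id_fields:
                continue
            # A cell such as "a, c" becomes two rows, one per answer.
            for part in answer.split(","):
                part = part.strip()
                if part:
                    rows.append(ids + (question, part))
    return rows


# Hypothetical response imitating the form output.
responses = [
    {"timestamp": "2020-01-01", "email": "a@example.com", "Q.1": "a, c", "Q.2": "b"},
]
for row in to_long_rows(responses):
    print(row)
```

Each output row carries the identifiers, the question, and exactly one answer, so filtering by question or by answer becomes a simple row filter.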
If this is helpful, it might be best if you enabled your sample sheet to allow us to edit it, to enable me to implement it in your sheet. But here is my sample sheet, in case anyone wants to look through it. Note that the main formula to reformat the data, in Solution!B3, could benefit from a lot of cleanup, and is probably nowhere near the best way to achieve this. Just throwing up one possible solution...
I'll try to add some explanation of the formula at some point, but ask if you have any questions.

Want to name the formulas in a drop-down list

In my original workbook I want to display values from another sheet for multiple actions, so I added a drop-down list entry for each action. But as the number of actions grew, I could no longer tell the formulas apart (all of them are IMPORTRANGE formulas). If I could name each formula in the drop-down, I could quickly recognize which action will be performed. A sample sheet is attached; please take a look. In the drop-down list I included (=a2+b2); it would be helpful if it were displayed as "addition" in the drop-down list instead.
Sorry for my English.
Any type of help would be appreciated.
https://docs.google.com/spreadsheets/d/1mpIWyQASMlxRVdlTkv9K1e4oihsrckjT6sD1mLDxvEc/edit#gid=0
If I understand correctly, you want to have a dropdown list menu (from Data Validation) that displays the operation name, but when you click it, it displays just the result.
This is very hacky, but here's a way to create some "labels" in your criteria box:
=IF(;"ADDITION";A2+B2),
=IF(;"SUBTRACTION";A2-B2),
=IF(;"DIVISION";A2/B2),
=IF(;"MULTIPLICATION";A2*B2),
How?(!)
After kicking around some no-op ideas, I finally settled on this as the cleanest and most flexible approach. (By some freak coincidence, it also makes some semantic sense too.) It works because when the first argument to IF is omitted, it defaults to 0 -> FALSE. This effectively makes the second argument to IF a comment/no-op, and always just selects the formula.
Yes, the semicolons are intentional or the parser will think of the args as list items.
Productivity Tip/Footnotes
Sheets will remove any line breaks in your validation criteria, so the formula will be hard to read when you have to edit it. If you anticipate that you'll be adding a bunch of functions later, save the above block in a text file and edit that. Then you can copy+paste it into the validation field.
It will also always show up as "INVALID" because the value will of course never match the formula text.

Google Sheet function to Split the Data Accordingly

I have a worksheet being fed by a Google Form. I want the responses on the Google Form to populate two fields in the next tab. The B column in the second tab is the one beyond my skillset. I have written out how the field should display, based on the form responses for reference. I also have used comments on the sheet to explain the rules for each field.
I know the SPLIT function can be used, but it won't adjust to these rules. Is there any possible solution?
Here is the sheet link:
https://docs.google.com/spreadsheets/d/1ueKCNdcn1xmJHYtrzKKKkj_FSraRfpvJS4Oi3BHNUvk/edit?usp=sharing
I've added an answer on your sheet. Since the data is all delimited by semicolons, this formula seems to match what you want.
=SPLIT('Import Data'!B1,";",0,0)
Let us know if it doesn't do what you want, or if this helps.
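For readers less familiar with SPLIT: the third argument (split_by_each) set to 0 treats the delimiter as one whole string, and the fourth (remove_empty_text) set to 0 keeps empty pieces. A rough Python equivalent, using a made-up sample value, behaves the same way:

```python
def split_fields(cell_text, delimiter=";"):
    """Split on the whole delimiter string and keep empty pieces."""
    return cell_text.split(delimiter)


# Hypothetical cell content; the empty field between the two semicolons is kept.
print(split_fields("Dress;Red;;Large"))
```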
Updated: After checking with you, I realise that you want only some of the data split, and some kept concatenated. But since all of the data "fields" look the same, separated by semi-colons, and since there could be various numbers of fields in each response category, I don't think there is a simple logic that can tell where to split, and where to keep things like dress styles or sizes concatenated. So I understand that this is not your desired answer.
