How to Order Dataset in Ascending Order by Sample Size - mean

I'm currently working on a dataset of lead concentrations in food products in which there are 800+ rows of data. One of my tasks is to arrange the data in ascending order and include the sample size or observations to find the food categories that have the most lead concentration. The task is to 'Identify the top 10 most contaminated products based on their mean concentration, where the sample size used to calculate the mean exceeds 5 (aka 5 observations).' My dataset doesn't have a column of sample sizes.
How would I go about this?
Dataset Snippet
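One possible approach, sketched in plain Python: the sample size is just the number of rows per product, so it can be computed by grouping rather than read from a column. The field names `product` and `lead_ppm` and all the values below are hypothetical, since the dataset snippet is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

def top_contaminated(rows, min_n=5, top=10):
    """Return (product, mean, n) tuples for products with more than min_n samples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["product"]].append(row["lead_ppm"])
    # The sample size is the number of observations per product.
    stats = [(p, mean(v), len(v)) for p, v in groups.items() if len(v) > min_n]
    # Highest mean first, then keep only the top entries.
    stats.sort(key=lambda s: s[1], reverse=True)
    return stats[:top]

# Hypothetical data: "candy" has the highest mean but only 2 observations.
rows = (
    [{"product": "spices", "lead_ppm": x} for x in (0.9, 1.1, 1.0, 1.2, 0.8, 1.0)]
    + [{"product": "rice", "lead_ppm": x} for x in (0.1, 0.2, 0.3, 0.2, 0.1, 0.3)]
    + [{"product": "candy", "lead_ppm": x} for x in (5.0, 6.0)]
)
result = top_contaminated(rows)
```

Here `candy` is excluded because its mean rests on fewer than 6 observations, even though that mean is the largest.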

How to generate a distribution based bar chart on row_numbers

I have a SQL query that acts as a data source in Tableau Desktop:
SELECT
    row_number() OVER (ORDER BY sales) AS rn,
    article_number,
    country,
    SUM(sold_items) AS si,
    SUM(sales) AS sales
FROM data.sales
WHERE sales.order_date BETWEEN '2021-01-01' AND '2021-12-31'
GROUP BY 2, 3
In Tableau I dragged rn to Columns and sales to Rows to generate a bar chart. The following is the output:
I want to convert this into a 0-100% distribution chart so that I can get the following result:
How can I achieve this? Also, I want the user to be able to filter by country, so even if the number of records increases or decreases, the distribution should always be consistent with the filtered data.
You can do this with nested table calcs.
For example, the following uses the Superstore sample data set, and then first computes a running total of SUM(Sales) per day, then converts that to a percent of total. Notice the edit table calc dialog box - applying two back to back calculations in this case.
The x-axis in this example is Order Date, whereas in your question the x-axis is a percentage, so it's not exactly what you requested, but it still shows that table calcs are an easy way to do these types of operations.
Also, realize you can just connect to the sales table directly. The custom SQL isn't adding any value, and in fact can defeat query optimizations that Tableau normally makes.
The Tableau help docs explain table calculations. Pay attention to the discussion on partitioning and addressing.
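The two back-to-back table calculations described above (a running total, then a percent of total) amount to the following arithmetic. This is a sketch in Python rather than Tableau, and the sales figures are made up.

```python
from itertools import accumulate

def running_percent_of_total(values):
    """Running sum of values, expressed as a fraction of the grand total."""
    running = list(accumulate(values))
    total = running[-1] if running else 0
    return [r / total if total else 0.0 for r in running]

# Hypothetical sales per row, ordered ascending as rn would order them.
sales = [10, 30, 20, 40]
pct = running_percent_of_total(sorted(sales))
```

Because the total is recomputed from whatever values are passed in, filtering the input first (for example by country) still produces a curve that ends at 100%.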

Trying to get the size (in %) of subcategories in a pivot

I am keeping track of my (stock) portfolio in Google Sheets, as follows:
category     subcategory  company  amount
-------------------------------------------
health care  diagnostics  AA       100
health care  diagnostics  AB       50
materials    mining       BA       75
financials   banks        CA       30
financials   insurers     CB       35
financials   banks        CC       10
financials   banks        CD       40
financials   hedge fund   CE       5
health care  equipment    DA       50
But now I want to extract some statistics from this, and I'm using a Pivot. Specifically, I want to see:
the relative size of each category in the portfolio
the relative size of each subcategory in the portfolio
the relative size of each company in their subcategory
The first two I get done:
For instance, I can see that:
category financials has a relative size of 30% in the portfolio
subcategory diagnostics has a relative size of 37.97% in the portfolio
What is missing however, is the third column, see mockup below:
I can now see in the last column what the relative size of each company is in its subcategory:
Company CD is 50% of the subcategory banks
Company AB is 33.33% of the subcategory diagnostics
That last column however, is not calculated but added manually to show what output I am trying to get but am unable to.
Does anyone know how to have this last column as part of the Pivot?
Here is a link to the Google Sheet used: Pivot
Currently, there is no % by subtotal option in Google Sheets to show percentages against subtotals for the whole Pivot Table:
As a workaround, you can display the subtotal percentages by filtering the Pivot Table to display only one subcategory, for example:
Add Filter here and select subcategory:
Then select the desired field, for example, banks, and click OK:
This should display the subtotal percentages like this:
Note: You can delete category field in Rows to make the Pivot Table cleaner.
You can also submit a feature idea request to Google Workspace, please see instructions on this site: Submit ideas for Google Workspace and Cloud Identity
So I found out there are two possible solutions using a calculated column, and another using ARRAYFORMULA.
Option 1: Use a duplicate pivot table. One issue with the GETPIVOTDATA function is that it runs during the table calculation stage, so it returns an empty result. So the solution was to have two clones of the pivot table. One will be used to retrieve the data and the other to retrieve the subtotal.
The formula to use would be
=amount/getpivotdata("amount",'Copy of Pivot'!$F$10, "category", category, "subcategory", subcategory)
Here amount, category, and subcategory without the double quotes are values (references) filled in by Google Sheets, while the double-quoted ones are column names.
You also need to set "Summarize by" to Custom for it to work.
The only issue I found here is that I couldn't find a way to identify subtotal or total rows; they have a company value equal to the first record in the group. If anyone knows how to tell whether a subtotal is being rendered, please let me know.
The second option is to use one pivot table only but reference original data.
=amount/SUM(QUERY($A$1:$D,"Select D Where A='"&category&"' AND B='"&subcategory&"'",false))
It still shares the same issue with total rows, though.
The last option will be the least robust. It would use ARRAYFORMULA, but I don't have that ready at the moment. Its main issue is that changing the column order or adding new columns will require keeping the positions of the referenced columns and moving the ARRAYFORMULA columns. Not to mention handling the empty cells in a category, since the data is grouped.
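For reference, the quantity both formulas compute is amount divided by the total of that row's (category, subcategory) group. A small Python sketch using the sample data from the question, outside of Sheets, only to show the expected numbers:

```python
from collections import defaultdict

# Sample portfolio from the question: (category, subcategory, company, amount)
rows = [
    ("health care", "diagnostics", "AA", 100),
    ("health care", "diagnostics", "AB", 50),
    ("materials", "mining", "BA", 75),
    ("financials", "banks", "CA", 30),
    ("financials", "insurers", "CB", 35),
    ("financials", "banks", "CC", 10),
    ("financials", "banks", "CD", 40),
    ("financials", "hedge fund", "CE", 5),
    ("health care", "equipment", "DA", 50),
]

def share_in_subcategory(rows):
    """Map each company to its fraction of its (category, subcategory) total."""
    totals = defaultdict(float)
    for category, subcategory, _company, amount in rows:
        totals[(category, subcategory)] += amount
    return {
        company: amount / totals[(category, subcategory)]
        for category, subcategory, company, amount in rows
    }

shares = share_in_subcategory(rows)
```

With this data, CD comes out at 50% of banks and AB at one third of diagnostics, matching the mockup column.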

VLOOKUP with wildcard and find Nth occurrence?

I'm setting up a Google Sheet that will calculate the most effective purchase size of specific agricultural inputs (fertilizer, chemical, etc). I set up the price data in its own tab with a separate row for each input name + size.
To keep it easy for the user I'd like to require only the input name, # of gallons per acre, and acres and then have a formula spit out the total cost and most effective purchase (bulk if > X gallons, X # of 250 gallon containers + X 55 drums, etc). How can I use the input name plus a wildcard to find the appropriate purchase size?
https://docs.google.com/spreadsheets/d/1bMOPuk2qhmVuJT7vE_ni3KFxfcgKvwTwkM4p50xQF_0/edit?usp=sharing
I tried:
=ArrayFormula(iferror(INDEX('Data (Current)'!H2:H,SMALL(IF($A2&"*"='Data (Current)'!A2:A,ROW('Data (Current)'!A2:A)-1),1))))
...but it returns blank so I'm guessing the reference $A2&"*" to the input name isn't working properly. When I replace it with a string found in the 'Data (Current)' tab then it works fine.
I expected the output to be the smallest value (in this case I think it's 5). Then when I change the last number to 2 or 3 it will find the next smallest value, in this case, 55 or 250. Then I can use simple formulas to interact with that and finish the spreadsheet.
Unfortunately, the actual output is nothing, or "".
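A likely cause is that the = comparison treats "*" as a literal character rather than a wildcard, so $A2&"*" never equals any cell in the data tab. The intended logic (prefix match on the input name, then take the Nth smallest size) can be sketched in Python as follows; the labels and sizes are hypothetical and not taken from the linked sheet.

```python
def nth_smallest_size(rows, name, n):
    """rows: (label, container_gallons) pairs; n is 1-based.

    Returns the n-th smallest size among labels starting with name,
    or None when there are fewer than n matches.
    """
    sizes = sorted(size for label, size in rows if label.startswith(name))
    return sizes[n - 1] if 0 < n <= len(sizes) else None

# Hypothetical price rows: one row per input name + size
rows = [
    ("Input A 250 GAL", 250),
    ("Input A 5 GAL", 5),
    ("Input B 30 GAL", 30),
    ("Input A 55 GAL", 55),
]
smallest = nth_smallest_size(rows, "Input A", 1)
second = nth_smallest_size(rows, "Input A", 2)
```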
Sorry if this isn't what you're looking for, as I had some trouble understanding your question.
Presuming what you want is essentially this:
I want to buy Y quantity of item.
I can buy item at cheaper prices if I buy in higher quantities, although sometimes they have a minimum order quantity.
What is the most optimal combination of the options I have to minimize the price I pay?
I'm unsure if there's a simple solution for this within Google Sheets alone. This might be treading more into Apps Script territory.
However, that's not to say it's impossible. I've "brute-forced" a solution with an iterative-like approach, for the "Chelated Calcium" product: https://docs.google.com/spreadsheets/d/1YSBiSx0IMr4T0R11Dqb-tqOhH4AOTTAWeH2yQfT4X5w
First, list the data in a standardized manner. This includes giving every row of the same product a common key that is easy to look up. For example, on the Data (Current) tab, I've added 3 columns:
Product Common Name - This is used so that all items of different quantities can be found easily, without needing wildcards.
Gallons - Much easier to parse the data if it's explicitly laid out.
Minimum Order Gallons - This is your threshold for Bulk. I've set it at an arbitrary 20,000 gallons for Chelated Calcium.
The data here is ordered least-effective first. How you do this will be up to you. In this case, I sorted by the Retail Cost Per Ounce parameter from your sheet, highest first. This eliminates any guesswork about which of the options are most effective, since you can just traverse your options in order. Note: The way I've laid out the formulas will work if and only if the rows for the same product are directly next to each other. It won't work if there are other products between them.
On the Field Level Tool tab, standardize your inputs to the Gallons unit. I do this in the Total Gallons Needed column (I multiply anything with a "GAL" by 1, and "QUART" by 0.25).
For each item, determine the row numbers where the product begins and ends. This is marked by columns L (Least Efficient Index) and M (Most Efficient Index). I got these results by using the MATCH function.
Set up the iterations, from 0 to N-1. On this sheet, I've set up N=5 iterations, which means that it can traverse 5 different options of the same product only. Since Chelated Calcium only has 4 different options (5 Gal, 30 Gal, 250 Gal, Bulk), 5 is more than enough for this product. If you have products with more options, you may want to have more iterations.
The iterations are on the right side of the Field Level Tool tab.
In your case, you might want to put it on a different tab since the place I put it makes the file look very messy.
In each iteration, I perform the following steps:
To Fulfill - How many gallons still need to be purchased by this iteration?
ThisIndex - What is the row number of this iteration? This is determined by Most Efficient Index - Iteration Number. Remember that since we sorted in order of ascending efficiency, this means that the iteration starts with the most efficient option it can find first. There is a check to make sure that it only outputs a value if it is between the range [Least Efficient Index, Most Efficient Index]. Otherwise, it will be blank to avoid miscalculations by intruding into another product in the Data (Current) tab.
Retail Price, Minimum Gals, Gallons per Order - Simple data extraction for easy usage in the iteration, using INDEX (and indirectly, MATCH by virtue of ThisIndex).
Order - This formula does a couple of things, outlined below:
It checks whether there still remains a valid choice of product at this iteration. It does this by checking whether ThisIndex still exists. If the product doesn't exist, then it will be nulled. This is accomplished by using the IF function.
It will determine if there is a minimum threshold that must be met to purchase this choice. You can see in the 0th iteration, for example, that there is a minimum quantity of 20,000 gallons. If To Fulfill quantity is greater than or equal to the threshold OR there is no threshold, then a purchase is quantified by this column. The mathematics are simply to divide the To Fulfill amount by the Gallons per Order amount to determine the number of orders of this particular product choice. If there is a threshold but the To Fulfill amount doesn't meet it, then this iteration is skipped with a 0 order value.
If the item is already on its least efficient choice (ThisIndex == Least Efficient Index), it will do a CEILING function to ensure that the order is fulfilled. If not, it will do a FLOOR function instead. This is because you cannot order 3.5 units of an item, so they have to be rounded either up or down.
Expenditure - This is simply Order multiplied by the Retail Price, or how much money you spend in this iteration.
Remaining - How much of the product is left unfulfilled at the end of this iteration, to be used as To Fulfill for the next iteration.
Note: If you see formulas that are of the form =IF(ThisIndex, [calculations_here],), that is simply a check to nullify that calculation if ThisIndex is invalid.
Copy the iterations as many times as you want to the right. Something nice to do is to force the iterations to do a CEILING on the very last one to ensure that you never under-buy.
Generate a user-readable string for the purchase suggestion. You can see this on the Suggested Purchase column.
Calculate the Gallons Bought with a simple SUMPRODUCT over all the iterations.
Calculate the total expenditure with a simple SUM over all the iterations.
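The iteration described above can also be written as a short loop. This is only a sketch of the same greedy idea in Python, not the formulas from the linked sheet; the `Option` class, the prices, and the sizes are all made up for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Option:
    label: str
    gallons: int          # gallons per order unit
    price: int            # retail price per order unit
    min_gallons: int = 0  # minimum quantity needed to use this option (0 = none)

def plan_purchase(options, gallons_needed):
    """options must be ordered least efficient first, as on the data tab."""
    plan = []
    remaining = gallons_needed
    # Traverse from the most efficient option down to the least efficient one.
    for i in range(len(options) - 1, -1, -1):
        opt = options[i]
        if opt.min_gallons and remaining < opt.min_gallons:
            units = 0  # threshold not met: skip this option
        elif i == 0:
            units = math.ceil(remaining / opt.gallons)  # last resort: never under-buy
        else:
            units = math.floor(remaining / opt.gallons)
        plan.append((opt.label, units, units * opt.price))
        remaining = max(remaining - units * opt.gallons, 0)
    return plan

# Hypothetical options, least efficient first
options = [
    Option("5 gal", 5, 100),
    Option("30 gal", 30, 540),
    Option("250 gal", 250, 4000),
    Option("Bulk", 1, 14, min_gallons=20000),
]
plan = plan_purchase(options, 317)
total_cost = sum(cost for _label, _units, cost in plan)
```

For 317 gallons this skips Bulk, buys one 250 gal container and two 30 gal containers, and rounds the final 7 gallons up to two 5 gal containers. Like the sheet, it is a greedy pass and is not guaranteed to find the cheapest possible combination.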
I hope this is what you were looking for. Regardless, it's at least a fun exercise on how much you can abuse Sheets. ;)

Alteryx: Creating multiple Histograms from one dataset

I have a data set that contains the following information: date, item #, and the unit price for that item on that date. What I would like to create is one histogram per item (my dataset has 17 unique items), charting the frequency of the unit prices. Is this possible in Alteryx?
What you really want is the ability to group by items within your data set. I think the closest thing to this for your specific use case is the summarize tool. You can group by item and then use the percentile operation to generate several points within the data range to add to a histogram.
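The grouping idea itself, independent of Alteryx, looks like this in Python: group the records by item, then count how many unit prices fall into each bin. This is not Alteryx tool configuration, and the records and bin width are hypothetical.

```python
from collections import Counter, defaultdict

def histograms_by_item(records, bin_width):
    """records: (date, item, unit_price) tuples.

    Returns {item: Counter({bin_start: frequency})}, one histogram per item.
    """
    hists = defaultdict(Counter)
    for _date, item, price in records:
        bin_start = (price // bin_width) * bin_width
        hists[item][bin_start] += 1
    return hists

# Hypothetical records
records = [
    ("2023-01-01", "A", 12),
    ("2023-01-02", "A", 14),
    ("2023-01-03", "A", 27),
    ("2023-01-01", "B", 5),
]
hists = histograms_by_item(records, bin_width=10)
```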

Should PAX be in Flight Dimension or Fact Sales table?

I need to build a data mart using Power Pivot for a duty-free shop at an airport.
The sales manager is analyzing sales data by flight number and by PAX (the number of people per flight).
So, I don't know where to put PAX: in DimFlight or FactSales. It is additive, right?
Please explain why and how I should put PAX into one table or the other. DimFlight may include airline, flight_no, date, and PAX. A flight may also land at the airport more than once a day.
PAX is a fact describing a measurable value of a specific flight event. It should be in the fact table, not in the flight dimension. I would expect total capacity to be an attribute of the plane dimension associated with the flight event. (Flight number would likely be a degenerate dimension as it doesn't really own any attributes.) However, the PAX itself should be a measure in the fact table.
You can generate a junk dimension that has the banding mentioned by @Luis Leal to do some capacity analytics. You can even create a numbers dimension with an attribute for each group level so you can do more detailed banding. For example, an attribute for 1s, 10s, 100s, 1000s, etc. You can also calculate the filled capacity of the flight and point to the numbers dimension so you can group flights by 80% full, 90% full, etc.
Nothing stops you from modeling it as both a dimension and a measure, so you can store it both in a dimension table and as a measure in a fact table. If you store it as a measure in the fact table, you can perform several analyses by the other possible dimensions and get insights such as averages, max, min, and totals by x or y dimension, which would be very difficult if you stored it only in the dimension table.
On the other hand, storing it in the dimension table enables additional "perspectives" of analysis. For example, a common approach is to store "interval" columns in the dimension table with values like "from 1 to 1000 pax" and "from 1001 to 2000". This column is calculated at ETL time depending on the value of PAX. So why not use both?
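As an illustration of the interval column, a band label could be derived at ETL time along these lines. The function name `pax_band` and the band width of 1000 are assumptions based on the example values above, not part of any particular ETL tool.

```python
def pax_band(pax, width=1000):
    """Return a label such as '1-1000' or '1001-2000' for a positive PAX count."""
    if pax < 1:
        raise ValueError("pax must be at least 1")
    low = ((pax - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

# Hypothetical PAX values for three flights
bands = [pax_band(p) for p in (1, 1000, 1001)]
```

The raw PAX value would still be stored as a measure in the fact table, so sums and averages remain available alongside the band.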
