Problem: Compute the natural join of R and S. Which of the following tuples is in the result? Assume each tuple has schema (A,B,C,D).
Relation R
| A | C |
|---|---|
| 3 | 3 |
| 6 | 4 |
| 2 | 3 |
| 3 | 5 |
| 7 | 1 |
Relation S
| B | C | D |
|---|---|---|
| 5 | 1 | 6 |
| 1 | 5 | 8 |
| 4 | 3 | 9 |
I'm not quite sure what it means by "assume each tuple has a schema of A,B,C,D". Does this mean the R relation has a schema of ABCD although it only lists A and C? Should I assume there are also B and D, but columns B and D are blank?
Operating under that assumption, I got the answer wrong. The explanation says there's no (7,5) in R which there clearly is under column A. Could someone explain to me what I'm doing wrong or if I'm missing something? Thank you!
The answer feedback is misleading and wrong; that would be the feedback if you had chosen (7,1,5,8).
Your answer is right.
For thoroughness: in a natural join you combine tuples that agree on the common attributes; here, C is the attribute R and S have in common.
Your return tuples are:
R        S          natural join
(A,C)    (B,C,D)    (A,B,C,D)
(7,1) & (5,1,6) = (7,5,1,6)
(3,5) & (1,5,8) = (3,1,5,8)
(2,3) & (4,3,9) = (2,4,3,9)
(3,3) & (4,3,9) = (3,4,3,9) --Your answer, correct
I even found a Stanford doc defining a natural join, just in case they lived in a different universe than the rest of us, but they don't. It's just a bug in the quiz.
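If you want to check the result mechanically, here is a minimal Python sketch of the natural join on the shared attribute C, using the relations from the question:

```python
# R has schema (A, C); S has schema (B, C, D).
R = [(3, 3), (6, 4), (2, 3), (3, 5), (7, 1)]
S = [(5, 1, 6), (1, 5, 8), (4, 3, 9)]

# Natural join on the common attribute C: pair every R tuple with every
# S tuple that has the same C value, and emit (A, B, C, D).
join = [(a, b, c, d) for (a, c) in R for (b, c2, d) in S if c == c2]
print(sorted(join))
# [(2, 4, 3, 9), (3, 1, 5, 8), (3, 4, 3, 9), (7, 5, 1, 6)]
```

These are exactly the four tuples listed above; (6,4) joins with nothing because no S tuple has C = 4.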
The question doesn't say R has that schema. It says the natural join of R & S has that schema.
(There are many variations on what a relation is, what relational operators are available, how they work & what their symbols are. They are telling you to expect that the schema for the join of those two relations has columns A, B, C & D. You should already know that from the definitions in the course, but since they give it nobody should get that part wrong.)
You seem to be saying that your choice of a row in the natural join was 2. That's correct. The explanation says that a wrong choice can't be right because tuple (7,5) is not in R. They do not mean that (7,5) is a list of values "under column A". But that feedback is for choice 3, not choice 2. So the answer checking seems to have a bug. Let them know.
I am using the StarSpace embedding framework for the first time and am unclear on the "modes" that it provides for training and the differences between them.
The options are:
wordspace
sentencespace
articlespace
tagspace
docspace
pagespace
entityrelationspace/graphspace
Let's say I have a dataset that looks like this:
| Author | City | Tweet_ID | Tweet_contents |
|:-------|:-------|:----------|:-----------------------------------|
| A | NYC | 1 | "This is usually a short sentence" |
| A | LONDON | 2 | "Another short sentence" |
| B | PARIS | 3 | "Check out this cool track" |
| B | BERLIN | 4 | "I like turtles" |
| C | PARIS | 5 | "It was a dark and stormy night" |
| ... | ... | ... | ... |
(In reality, my dataset is not language data and looks nothing like this, but this example demonstrates the point well enough.)
I would like to simultaneously create embeddings from scratch (not using pre-existing embeddings at any point) for each of the following:
Authors
Cities
Tweets/sentences/documents (e.g. 1, 2, 3, 4, 5, etc.)
Words (e.g. 'This', 'is', 'usually', ..., 'stormy', 'night', etc.)
Even after reading the documentation, it doesn't seem clear which 'mode' of StarSpace training I should be using.
If anyone could help me understand how to interpret the modes to help select the appropriate one, that would be much appreciated.
I would also like to know if there are conditions under which the embeddings generated using one of the modes above would in some way be equivalent to the embeddings built using a different mode (ignoring the fact that the embeddings would differ because of the non-deterministic nature of the process).
Thank you
I am working with a spreadsheet where I store the books I read. The format is as follows:
A | B | C | D | E | F
year | book | author | mark | language | country of the author
With entries like:
A | B | C | D | E | F
-------------------------------------------------------------
2004 | Hamlet | Shakespeare | 8 | ES | UK
2005 | Crimen y punishment | Dostoevsky | 9 | CAT | Russia
2007 | El mundo es ansí | Baroja | 8 | ES | Spain
2011 | Dersu Uzala | Arsenyev | 8 | EN | Russia
2015 | Brothers Karamazov | Dostoevsky | 8 | ES | Russia
2019 | ... Shanti Andía | Baroja | 7 | ES | Spain
I have several pivot tables to get different data, such as top countries, top books, etc. In one of them I want to group by author and order by the number of books I have read by each one.
So I defined:
ROWS
author (column C) with
order: Desc for COUNT of author
VALUES
author
summation by: COUNT
show as Default
mark
summation by: AVERAGE
show as Default
This way, the data above show like this:
author | COUNT of author | AVERAGE of mark
-------------------------------------------------------------
Baroja | 2 | 7,5
Dostoevsky | 2 | 8,5
Shakespeare | 1 | 8
Arsenyev | 1 | 8
It is fine, since it puts the most-read authors on top. However, I would also like to order by AVERAGE of mark, so that when COUNT of author ties, AVERAGE of mark breaks the tie and puts the author with the better average on top.
On my sample data, Dostoevsky would go above Baroja (8,5 > 7,5).
I have been looking for different options, but I could not find any without including an extra column in the pivot table.
How can I use a second criterion to break ties when the first criterion gives the same value?
You can achieve a customized sort order on a pivot table without any extra columns in the source range. However, you'd definitely need an extra field added to the pivot.
In the Pivot table editor go to Values and add a Calculated Field.
Use any formula that describes the sort order you want. E.g. let's multiply the count by 100 so it acts as the first criterion (the factor just has to be larger than any possible average):
=COUNTA(author) * 100 + AVERAGE(mark)
Do notice it is important to select Summarize by: Custom formula.
Now, just add this new calculated field as your row's Sort by field, and you're done!
Notice though, you do get an extra column added to the pivot.
Of course, you could hide it.
Translated from my answer to the cross-posted question on es.SO.
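As a sanity check on the composite key, a small Python sketch with the sample rows shows that sorting by `count * 100 + average` agrees with sorting by count first and average second, as long as every average stays below 100:

```python
# (author, count of books, average mark) from the sample data.
rows = [("Baroja", 2, 7.5), ("Dostoevsky", 2, 8.5),
        ("Shakespeare", 1, 8.0), ("Arsenyev", 1, 8.0)]

# Two-key sort: count descending, then average descending.
by_keys = sorted(rows, key=lambda r: (-r[1], -r[2]))

# Composite key: works because every average is < 100, so the average
# can never outweigh a difference of one in the count.
by_formula = sorted(rows, key=lambda r: -(r[1] * 100 + r[2]))

print(by_keys == by_formula)   # the two orderings agree
print(by_keys[0][0])           # Dostoevsky goes above Baroja
```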
try:
=QUERY(A2:F,
 "select C,count(C),avg(D)
  where A is not null
  group by C
  order by count(C) desc, avg(D) desc
  label C'author',count(C)'COUNT of author',avg(D)'AVERAGE of mark'")
I am writing a sheet where I am trying to create a multi-level INDEX that searches through 5 different columns with 3 pieces of data. So for example:
x = 40
y = 5000
z = 20000
Column1 | Column2 | Column3 | Column4 | Column5 | Column6
13 | 29 | 0 | 0 | 0 | Yes
30 | 870 | 0 | 0 | 0 | No
10 | 870 | 0 | 30000 | 1 | Blue
10 | 870 | 30001 | 100000 | 1 | Yes
10 | 870 | 100001 | 300000 | 1 | Unknown
Here's a sample set of my data. What I need is to compare:
the variable x to columns 1 and 2 (x must fall between these values)
the variable y to columns 3 and 4 (y must fall between these values)
and finally z to column 5 (z must be above this value)
In each of these cases I need to know whether the variable is lower or higher than the column value. Finally, I need the matching data from column 6 to be returned as a result in my sheet. At the moment I have a simply IMMENSE list of nested IF statements which consider all of these criteria separately, but it doesn't lend itself very well to editing when changes need to be made to the values.
I've looked at every single page on the internet (every... single... page...) and can't seem to find the solution to my issue. Most solutions I have found are either using a single data point, using multiple data points against a single range or simply don't seem to work. The latest iteration I have tried is:
=INDEX('LTV Data'!$N$3:$N$10, MATCH($D$5 & $G$8 & $G$12, ARRAYFORMULA($D$5 <= 'LTV Data'!$H$3:$H$10 & $D$5 >= 'LTV Data'!$I$3:$I$10 & $G$12 <= 'LTV Data'!$J$3:$J$10 & $G$12 >= 'LTV Data'!$K$3:$K$10 & $G$8 <= 'LTV Data'!$L$3:$L$10), 0), 7)
But this only produces an error as the separate values I want to test against are concatenated and the Match can't find that string. I'm also unsure about the greater than and less than symbols as to how valid that syntax is. Is anyone able to shed some light on how I can achieve the result I need in a more elegant way than the mass of IFS, ANDS + ORs I have right now? Like I said, it works but it sure ain't pretty ;)
Thanks a bunch in advance!
ETA: So with the above variables the result I would like would be 'Blue'. This is because x falls between columns 1 and 2, y falls between columns 3 and 4 and z is higher than column 5 on the third row. This is all contained in the MATCH statement in the example code above. Please see the MATCH statement to see the comparisons I am trying to make.
You need to put the different criteria together using multiplication if you want to get the effect of an AND in an array:
=INDEX(F2:F10,MATCH(1,(A2:A10<x)*(B2:B10>x)*(C2:C10<y)*(D2:D10>y)*(E2:E10<z),0))
or
=INDEX(F2:F10,MATCH(1,(A2:A10<=x)*(B2:B10>=x)*(C2:C10<=y)*(D2:D10>=y)*(E2:E10<=z),0))
to include the equality (I have used named ranges for x, y and z).
Works in Google Sheets and (if entered as an array formula) Excel.
In Google Sheets you also have the option of using a filter
=filter(F2:F10,A2:A10<=x,B2:B10>=x,C2:C10<=y,D2:D10>=y,E2:E10<=z)
but then you aren't guaranteed to get just one row.
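To see the same AND-of-ranges logic outside a formula, here is a minimal Python sketch over the sample rows from the question (the `lookup` helper is made up for illustration):

```python
# (col1, col2, col3, col4, col5, col6) rows from the question.
rows = [
    (13, 29,      0,      0, 0, "Yes"),
    (30, 870,     0,      0, 0, "No"),
    (10, 870,     0,  30000, 1, "Blue"),
    (10, 870, 30001, 100000, 1, "Yes"),
    (10, 870, 100001, 300000, 1, "Unknown"),
]

def lookup(x, y, z):
    # Mirror of (col1<=x)*(col2>=x)*(col3<=y)*(col4>=y)*(col5<=z):
    # return column 6 of the first row where every condition holds.
    for c1, c2, c3, c4, c5, result in rows:
        if c1 <= x <= c2 and c3 <= y <= c4 and c5 <= z:
            return result
    return None  # no matching row

print(lookup(40, 5000, 20000))  # Blue, matching the expected result
```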
I have a graph with about 1.2 million nodes, and roughly 3.85 million relationships between them. I need to find the top x weighted edges - that is to say, the top x tuples of a, b, n where a and b are unique pairs of vertices and n is the number of connections between them. I am currently getting this data via the following Cypher query:
MATCH (n)-[r]->(x)
WITH n, x, count(r) as weight
ORDER BY weight DESC
LIMIT 50
RETURN n.screen_name, x.screen_name, weight;
This query takes about 25 seconds to run, which is far too slow for my needs. I ran the query with the profiler, which returned the following:
ColumnFilter(0)
|
+Extract
|
+ColumnFilter(1)
|
+Top
|
+EagerAggregation
|
+TraversalMatcher
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 50 | 0 | | keep columns n.twitter, x.twitter, weight |
| Extract | 50 | 200 | | n.twitter, x.twitter |
| ColumnFilter(1) | 50 | 0 | | keep columns n, x, weight |
| Top | 50 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATE01a74d75-74df-42f8-adc9-9a58163257d4 of type Integer) |
| EagerAggregation | 3292734 | 0 | | n, x |
| TraversalMatcher | 3843717 | 6245164 | | x, r, x |
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
My questions, then, are these:
1. Are my expectations way off, and is this the sort of thing that's just going to be slow? It's functionally a map/reduce problem, but this isn't that large of a data set - and it's just test data. The real thing will have a lot more (but be filterable by relationship properties; I'm working on a representative sample).
2. What the heck can I do to make this run faster? I've considered using a start statement but that doesn't seem to help. Actually, it seems to make it worse.
3. What don't I know here, and where can I go to find out that I don't know it?
Thanks,
Chris
You're hitting the database 6,245,164 times in the first statement: MATCH (n)-[r]->(x).
It looks like what you're attempting is a graph-global query -- that is, you're running the query over the entire graph. By doing this you're not taking advantage of indexing that would reduce the number of hits to the database.
To do this at the level of performance you need, an unmanaged extension may be required.
http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html
Also, a good community resource to learn about unmanaged extensions: http://www.maxdemarzi.com/
The approach here would be to create a REST API method that extends the Neo4j server. This requires some experience programming with Java.
Alternatively you may be able to run an iterative Cypher query that updates an aggregate count on each [r] between (n) and (x).
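For a rough sense of the cost of the aggregation itself outside the database, here is a minimal Python sketch of the count-and-top-k step, assuming the relationships can be streamed out as (source, target) pairs (the function name and toy data are made up for illustration):

```python
from collections import Counter
from heapq import nlargest

def top_weighted_edges(edges, k=50):
    # Count how many parallel relationships each (a, b) pair has,
    # then keep only the k heaviest pairs.
    weights = Counter(edges)
    return nlargest(k, weights.items(), key=lambda item: item[1])

# Toy example: three a->b relationships, one b->c relationship.
edges = [("a", "b"), ("a", "b"), ("a", "b"), ("b", "c")]
print(top_weighted_edges(edges, k=1))  # [(('a', 'b'), 3)]
```

On a few million edges this counting pass is fast in client code; the expensive part of the original query is scanning the whole graph inside the database, not the arithmetic.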
A couple of years ago I had occasion to work a bit with the Ruby "nested_set" gem. With some helpful explanation from the chief technologist where I worked, I was able to appreciate how it works, with its columns:
parent
left
right
I've had no occasion to reconsider it in the past couple of years however since I don't work with Rails regularly, but now I'd like to implement it myself on another platform, to structure some data as a tree. So I am seeking a cogent explanation of how it works, be that with a link or links, or with a fleshed out answer.
Thanks in advance
Nested sets are similar to adjacency lists, but support additional operations that can't easily be performed when parents and children only know about their immediate relatives via a parent column.
For instance, if we were given the following data model:
  Graph          Table
    A            node, parent
   / \           A,
  B   E          B, A
 / \             C, B
C   D            D, B
                 E, A
We could easily retrieve node A's immediate children, but it gets tricky if we want to determine whether node C is in node A's hierarchy, or to retrieve node A's entire tree and not just its immediate children. It's tricky because node C is not an immediate child of node A, and without knowing the depth of the tree, a recursive query (not an option in some databases), or some kind of SQL voodoo, we're pretty much out of luck. Another example that might become problematic is if we wanted to destroy or update every record in the node A tree.
Nested sets introduce "left" and "right" attributes in addition to our initial parent attribute. Each node is now numbered twice, according to when it is visited during a tree traversal, as the record is inserted or modified. The previous example as a nested set would look something like this:
+---------------------------+ id, text, lft, rgt
| A | 1, A, 1, 10
| | 2, B, 2, 7
| +----------------+ +----+ | 3, C, 3, 4
| | B | | E | | 4, D, 5, 6
| | | | | | 5, E, 8, 9
| | +----+ +----+ | +----+ |
| | | C | | D | | |
| | | | | | | |
| | +----+ +----+ | |
| +----------------+ |
+---------------------------+
1 2 3 4 5 6 7 8 9 10
With the above example we can determine that node A has left and right values of 1 and 10 respectively, so anything within its hierarchy will have left and right values somewhere between those two. With that said, querying for node A's entire tree now becomes trivial:
SELECT c.id, c.text, c.lft, c.rgt
FROM nodes c, nodes p
WHERE p.lft < c.lft AND p.rgt > c.rgt AND p.id = 1;
Gives us:
id, text, lft, rgt
2, B, 2, 7
3, C, 3, 4
4, D, 5, 6
5, E, 8, 9
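The left/right numbers in the table above can be produced by a single depth-first traversal; here is a minimal Python sketch using the A/B/C/D/E tree from this answer (the `number_tree` helper is made up for illustration):

```python
def number_tree(tree, root):
    """Assign nested-set (lft, rgt) pairs via depth-first traversal."""
    bounds, counter = {}, [0]

    def visit(node):
        counter[0] += 1
        lft = counter[0]                   # number on the way in
        for child in tree.get(node, []):
            visit(child)
        counter[0] += 1
        bounds[node] = (lft, counter[0])   # number on the way out

    visit(root)
    return bounds

tree = {"A": ["B", "E"], "B": ["C", "D"]}
print(number_tree(tree, "A"))
# {'C': (3, 4), 'D': (5, 6), 'B': (2, 7), 'E': (8, 9), 'A': (1, 10)}
```

This reproduces the lft/rgt column from the diagram: A gets (1, 10), B (2, 7), C (3, 4), D (5, 6), and E (8, 9).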
See Recursive data structures with rails for source. As discussed in the question's comments, there might be better/more efficient solutions depending on your requirements - the linked article covers this in more detail.