Join 2 tables in Hive using a phone number and a prefix (variable length)

I'm trying to match phone numbers to an area using Hive.
I've got a table (prefmap) that maps a number prefix (prefix) to an area (area) and another table (users) with a list of phone numbers (nb).
There is only 1 match per phone number (no sub-area).
The problem is that the prefixes vary in length, so I cannot use substr(nb, "prefix's length") in the JOIN's ON() condition to match the substring of a number to a prefix.
And when I try to use instr() to find if a number has a matching prefix:
SELECT users.nb, prefmap.area
FROM users
LEFT OUTER JOIN prefmap
ON (instr(prefmap.prefix,users.nb)=1)
I get an error on line 4: "Both left and right aliases encountered in Join '1'"
How could I get this to work?
I'm using hive 0.9
Thanks for any advice.

Probably not the best solution, but at least it does the job:
Use WHERE to define the matching condition instead of ON() (which is now forced to TRUE).
SELECT users.nb, prefmap.area
FROM users
LEFT OUTER JOIN prefmap
ON (true)
WHERE instr(users.nb, prefmap.prefix) = 1
It's not perfect, as it's a bit slow: the join pairs every number with every row of the prefix table, and only then does the WHERE condition keep the single matching row. So it is best used only when the prefix table is not too long.
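The cross-join-then-filter idea above can be tried on a small scale. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is runnable); the sample numbers and areas are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prefmap (prefix TEXT, area TEXT);
    CREATE TABLE users (nb TEXT);
    -- Hypothetical sample data with prefixes of different lengths.
    INSERT INTO prefmap VALUES ('01', 'Paris'), ('0491', 'Marseille');
    INSERT INTO users VALUES ('0145678901'), ('0491234567'), ('0999999999');
""")

# Pair every number with every prefix, then keep only the pairs where the
# prefix sits at position 1 of the number.
rows = conn.execute("""
    SELECT users.nb, prefmap.area
    FROM users
    CROSS JOIN prefmap
    WHERE instr(users.nb, prefmap.prefix) = 1
    ORDER BY users.nb
""").fetchall()

for nb, area in rows:
    print(nb, area)
```

In this sketch the number without a matching prefix does not appear in the result at all, because the filter runs after the join. The intermediate result has one row per (number, prefix) pair, which is where the slowness described above comes from.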
Can anyone think of a better way to do this?

Hive cannot convert (instr(prefmap.prefix,users.nb)=1) into a MapReduce job,
because Hive joins only support equality expressions. See the Hive joins wiki for more information.


How to concatenate three columns into one and obtain count of unique entries among them using Cypher neo4j?

Using Cypher in Neo4j, I can query the Panama database for the countries of three types of identity holders (a term I define myself), namely Entities (companies), Officers (shareholders) and Intermediaries (middle companies), as three attributes/columns. Each column has single or double entries separated by a semicolon (e.g. British Virgin Islands;Russia). We want to concatenate the countries in these columns into a unique set of countries and then obtain the number of countries as a new attribute.
For this, I tried the following code from my understanding of Cypher:
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND
NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND BEZ3.countries="Belize") OR
(BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved", "Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
SET BEZ4.countries= (BEZ1.countries+","+BEZ2.countries+","+BEZ3.countries)
RETURN BEZ3.countries AS IntermediaryCountries, BEZ3.name AS
Intermediaryname, BEZ2.countries AS OfficerCountries , BEZ2.name AS
Officername, BEZ1.countries as EntityCountries, BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress,DISTINCT count(BEZ4.countries) AS NoofConnections
The relevant parts are the SET statement in the 7th line and the DISTINCT count in the last line. The query fails with an error that makes no sense to me: Invalid input 'u': expected 'n/N'. I guess it means I should use COLLECT, but we tried that as well and got the same error with 'u' and 'n' swapped. Please help us obtain the output we want; it would make our job a whole lot easier. Thanks in advance!
EDIT: Since I had not defined the variable, as pointed out by @Cybersam, I tried CREATE as follows, but it shows the error "Invalid input 'R':" at the RETURN clause. This is unfathomable to me. Help is really needed, thank you.
CODE 2:
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-
[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND
NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND
BEZ3.countries="Belize") OR
(BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved",
"Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
CREATE (p:Connections{countries:
split((BEZ1.countries+";"+BEZ2.countries+";"+BEZ3.countries),";")
RETURN BEZ3.countries AS IntermediaryCountries, BEZ3.name AS
Intermediaryname, BEZ2.countries AS OfficerCountries , BEZ2.name AS
Officername, BEZ1.countries as EntityCountries, BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress, AS TOTAL, collect (DISTINCT
COUNT(p.countries)) AS NumberofConnections
Lines 8 and 9 are the new ones under examination.
First Query
You never defined the identifier BEZ4, so you cannot set a property on it.
Second Query (which should have been posted in a separate question):
You have several typos and a syntax error.
This query should not get an error (but you will have to determine if it does what you want):
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND BEZ3.countries="Belize") OR (BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved", "Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
CREATE (p:Connections {countries: split((BEZ1.countries+";"+BEZ2.countries+";"+BEZ3.countries), ";")})
RETURN BEZ3.countries AS IntermediaryCountries,
BEZ3.name AS Intermediaryname,
BEZ2.countries AS OfficerCountries ,
BEZ2.name AS Officername,
BEZ1.countries as EntityCountries,
BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress,
SIZE(p.countries) AS NumberofConnections;
Problems with the original:
The CREATE clause was missing a closing } and also a closing ).
The RETURN clause had a dangling AS TOTAL term.
collect (DISTINCT COUNT(p.countries)) was attempting to perform nested aggregation, which is not supported. In any case, even if it had worked, it probably would not have returned what you wanted. I suspect that you actually wanted the size of the p.countries collection, so that is what I used in my query.
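Whether the size of the split list is really the number wanted depends on whether a country repeated across the three properties should be counted once. The following Python sketch (not Cypher, and using hypothetical property values) shows the difference between counting every split entry and counting distinct countries; the helper name unique_countries is made up for illustration.

```python
def unique_countries(*fields, sep=";"):
    """Split each 'A;B' style field and return the distinct names in first-seen order."""
    seen = []
    for field in fields:
        for name in (field or "").split(sep):
            name = name.strip()
            if name and name not in seen:
                seen.append(name)
    return seen

# Hypothetical property values for one Entity / Officer / Intermediary match.
entity = "Belize"
officer = "British Virgin Islands;Russia"
intermediary = "Russia"

# Equivalent of splitting the concatenated string: duplicates are kept.
all_parts = ";".join([entity, officer, intermediary]).split(";")
# Distinct countries only.
distinct = unique_countries(entity, officer, intermediary)

print(len(all_parts), all_parts)
print(len(distinct), distinct)
```

If the distinct count is the goal, the list would need to be deduplicated on the Cypher side as well before taking its size.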

Auto-assigning objects to users based on priority in Postgres/Ruby on Rails

I'm building a rails app for managing a queue of work items. I have several types of users ("access levels") to whom I want to auto-assign these work items.
The end goal is an "Auto-assign" button on one of my views that will automatically grab the next work item based on a priority, which is defined by the user's access level.
I'm trying to set up a class method in my work_item model to automatically sort work items by type based on the user's access level. I am looking at something like this:
def self.auto_assign_next(access_level)
  case
  # Compare with ==; a single = would assign 2 and always be truthy.
  when access_level == 2
    where("completed = 'f'").order("requested_time ASC").limit(1)
  when access_level > 2
    where("completed = 'f'").order("CASE WHEN form='supervisor' THEN 1 WHEN form='installer' THEN 2 WHEN form='repair' THEN 3 WHEN form='mail' THEN 4 WHEN form='hp' THEN 5 ELSE 6 END").limit(1)
  end
end
This isn't very DRY, though. Ideally I'd like the sort order to be configurable by administrators, so maybe setting up a separate table on which the sort order is kept would be best. The problem with that idea is that I have no idea how to pass the priority order on that table to the [postgre]SQL query. I'm new to SQL in general and somewhat lost with this one. Does anybody have any suggestions as to how this should be handled?
One fairly simple approach starts with turning your case statement into a new table, listing form values versus what precedence value they should be sorted by:
id | form | precedence
-----------------------------------
1 | supervisor | 1
2 | installer | 2
(etc)
Create a model for this, say, FormPrecedence (not a great name, but I don't totally grok your data model, so pick one that better describes it). Then your query can look like this (note: I'm assuming your current model is called WorkItem):
when access_level > 2
joins("LEFT JOIN form_precedences ON form_precedences.form = work_items.form")
.where("completed = 'f'")
.order("COALESCE(form_precedences.precedence, 6)")
.limit(1)
The way this works isn't as complicated as it looks. A "left join" in SQL simply takes all the rows of the table on the left (in this case, work_items) and, for each row, finds all the matching rows from the table on the right (form_precedences, where "matching" is defined by the bit after the "ON" keyword: form_precedences.form = work_items.form), and emits one combined row. If no match is found, a LEFT JOIN will still emit a row, but with all the right-hand values being NULL. A normal join would skip any rows with no right-hand match found.
Anyway, with the precedence data joined on to our work items, we can just sort by the precedence value. But, in case no match was found during the join above, that value will be NULL -- so, I use COALESCE (which returns the first of its arguments that's not NULL) to default to a precedence of 6.
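To see the LEFT JOIN and COALESCE behaviour described above on actual rows, here is a small sketch using Python's built-in sqlite3 module rather than Postgres and Rails (an assumption made to keep it self-contained). The table and column names follow the answer; the sample rows, including the 'mystery' form, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE work_items (id INTEGER PRIMARY KEY, form TEXT, completed TEXT);
    CREATE TABLE form_precedences (form TEXT, precedence INTEGER);
    INSERT INTO form_precedences VALUES ('supervisor', 1), ('installer', 2);
    INSERT INTO work_items (id, form, completed) VALUES
        (1, 'mystery', 'f'),     -- no precedence row, so it falls back to 6
        (2, 'installer', 'f'),
        (3, 'supervisor', 't'),  -- already completed, filtered out
        (4, 'supervisor', 'f');
""")

# Unmatched work items still come back, with precedence NULL (None in Python);
# COALESCE turns that NULL into the default 6 for sorting purposes.
rows = conn.execute("""
    SELECT work_items.id, work_items.form, form_precedences.precedence
    FROM work_items
    LEFT JOIN form_precedences ON form_precedences.form = work_items.form
    WHERE completed = 'f'
    ORDER BY COALESCE(form_precedences.precedence, 6), work_items.id
""").fetchall()

for row in rows:
    print(row)
```

The first row of the result is the item an "auto-assign" would pick. Changing the priority order then only requires updating rows in form_precedences, not the query.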
Hope that helps!

Ascending sort order Index versus descending sort order index when performing OrderBy

I am working on an asp.net mvc web application, and I am using Sql server 2008 R2 + Entity framework.
Now on the SQL Server I have added a unique index on any column that might be ordered by. For example, I have created a unique index on the Tag column and defined the sort order for the index to be ascending. Now I have some queries inside my application that order the Tag ascending while other queries order the Tag descending, as follows:
LatestTechnology = tms.Technologies.Where(a=> !a.IsDeleted && a.IsCompleted).OrderByDescending(a => a.Tag).Take(pagesize).ToList();
TechnologyList = tms.Technologies.Where(a=> !a.IsDeleted && a.IsCompleted).OrderBy(a => a.Tag).Take(pagesize).ToList();
So my question is whether both OrderByDescending(a => a.Tag) and OrderBy(a => a.Tag) can benefit from the ascending unique index on the Tag column on the SQL Server, or whether I should define two unique indexes, one with ascending sort order and the other with descending sort order?
Thanks
EDIT
the following query :-
LatestTechnology = tms.Technologies.Where(a=> !a.IsDeleted && a.IsCompleted).OrderByDescending(a => a.Tag).Take(pagesize).ToList();
will generate the following sql statement as mentioned by the sql server profiler :-
SELECT TOP (15)
[Extent1].[TechnologyID] AS [TechnologyID],
[Extent1].[Tag] AS [Tag],
[Extent1].[IsDeleted] AS [IsDeleted],
[Extent1].[timestamp] AS [timestamp],
[Extent1].[TypeID] AS [TypeID],
[Extent1].[StartDate] AS [StartDate],
[Extent1].[IT360ID] AS [IT360ID],
[Extent1].[IsCompleted] AS [IsCompleted]
FROM [dbo].[Technology] AS [Extent1]
WHERE ([Extent1].[IsDeleted] <> cast(1 as bit)) AND ([Extent1].[IsCompleted] = 1)
ORDER BY [Extent1].[Tag] DESC
To answer your question:
So my question is whether the two OrderByDescending(a => a.Tag). &
OrderBy(a => a.Tag), can benefit from the ascending unique index on the
sql server on the Tag column ?
Yes, SQL Server can read an index in both directions: in the order of the index definition or in the exact opposite direction.
However, from your intro I suspect that you still have the wrong impression of how indexing works for order by. If you have both a where clause and an order by clause, you must make sure to have a single index that covers both clauses! It does not help to have one index for the where clause (like on isDeleted and isCompleted, whatever that is in your example) and another index on tag. You need to have a single index that first has the columns of the where clause followed by the columns of the order by clause (multi-column index).
It can be tricky to make it work correctly, but it's worth the effort, especially if you are only fetching the first few rows (like in your example).
If it doesn't work out right away, please have a look at this:
http://use-the-index-luke.com/sql/sorting-grouping/indexed-order-by
It is generally best to show the actual SQL query—not the .NET source code—when asking for performance advice. Then I could tell you which index to create exactly. At the moment I'm unsure about isDeleted and isCompleted — are these table columns or expressions that evaluate upon other columns?
EDIT (after you added the SQL query)
There are two ways to make your query work as indexed top-n query:
http://sqlfiddle.com/#!6/260fb/4
The first option is a regular index on the columns from the where clause followed by those from the order by clause. However, as your query uses the filter IsDeleted <> cast(1 as bit), it cannot use the index in an order-preserving way. If, however, you rephrase the query so that it reads IsDeleted = cast(0 as bit), then it works. Please look at the fiddle, I've prepared everything there. Yes, SQL Server could be smart enough to know that, but it seems like it isn't.
I don't know how to tweak EF to produce the query in the above described way, sorry.
However, there is a second option using a so called filtered index — that is an index that only contains a sub-set of the table rows. It's also in the SQL Fiddle. Here it is important that you add the where clause to the index definition in the very same way as it appears in your query.
In both ways it still works if you change DESC to ASC.
The important part is that the execution plan doesn't show a sort operation. You can also verify this in SQL Fiddle (click on 'View execution plan').
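The "no sort operation in the plan" check can be illustrated outside SQL Server as well. The sketch below uses SQLite through Python's sqlite3 module, which is an assumption for the sake of a runnable example: its planner and plan wording differ from SQL Server's, and the snake_case table, column and index names are hypothetical. It compares the plan before and after adding one index that lists the where-clause columns first and the order-by column last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE technology (
        technology_id INTEGER PRIMARY KEY,
        tag TEXT,
        is_deleted INTEGER,
        is_completed INTEGER
    )
""")

# Equality filters on both flags, then a top-n ORDER BY on tag.
QUERY = """
    SELECT technology_id, tag FROM technology
    WHERE is_deleted = 0 AND is_completed = 1
    ORDER BY tag {direction} LIMIT 15
"""

def plan(direction):
    """Return SQLite's query plan for the top-n query as one string."""
    sql = "EXPLAIN QUERY PLAN " + QUERY.format(direction=direction)
    return " | ".join(row[-1] for row in conn.execute(sql).fetchall())

plan_without_index = plan("DESC")

# One index: where-clause columns first, order-by column last.
conn.execute(
    "CREATE INDEX ix_filter_then_tag ON technology (is_deleted, is_completed, tag)"
)
plan_desc = plan("DESC")
plan_asc = plan("ASC")

print(plan_without_index)
print(plan_desc)
print(plan_asc)
```

In SQLite, a separate sort step shows up in the plan as a temporary B-tree for ORDER BY. With the combined index that step should be absent for both DESC and ASC, which mirrors the point that one index can serve both directions.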

Informix: UPDATE with SELECT - syntax?

I want to update my table for all persons whose activity lasted too long. The update should correct one time, and for the subsequent rows I need to deal with the new result. So I thought about something like
UPDATE summary_table st
SET st.screen_on=newScreenOnValue
st.active_screen_on=st.active_screen_on-(st.screen_on-newScreenOnValue) --old-value minus thedifference
FROM (
SUB-SELECT with rowid, newScreenOnValue ... JOIN ... WHERE....
) nv
WHERE (st.rowid=nv.rowid)
I know that I can update the first and the second value directly by rerunning the same query. But my problem is that the cost of the subselect seems quite high, and therefore I want to avoid a double update, i.e. a double run of the same query.
The above statement is just an informal way of writing what I think I would like to get. I know that the st doesn't work, but I left it here for better understanding. When I try the above statement I always get a syntax error at the position where the FROM ends.
This can be achieved as follows:
UPDATE summary_table st
SET (st.screen_on, st.active_screen_on) =
((SELECT newScreenOnValue, st.active_screen_on-(st.screen_on-newScreenOnValue)
FROM ...
JOIN...
WHERE..))
[WHERE if any additional condition required];
The above query is tried and tested on Informix and works perfectly well, as long as there are no errors in the FROM, JOIN, and WHERE clauses.
Cheers !
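The shape of that statement, setting two columns from a single correlated subquery so the expensive select runs once per row, can be tried with SQLite's row-value UPDATE through Python's sqlite3 module. This is a sketch under assumptions: it is SQLite (3.15 or newer), not Informix, and the new_values table, its columns and the sample numbers are hypothetical. A primary key is used instead of ROWID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE summary_table (
        id INTEGER PRIMARY KEY, screen_on INTEGER, active_screen_on INTEGER
    );
    -- Hypothetical stand-in for the expensive sub-select's result.
    CREATE TABLE new_values (id INTEGER PRIMARY KEY, new_screen_on INTEGER);
    INSERT INTO summary_table VALUES (1, 100, 80), (2, 50, 40);
    INSERT INTO new_values VALUES (1, 60);
""")

# Both columns are assigned from one subquery; the right-hand side sees the
# old column values, so active_screen_on is reduced by the same difference.
conn.execute("""
    UPDATE summary_table
    SET (screen_on, active_screen_on) = (
        SELECT nv.new_screen_on,
               summary_table.active_screen_on
                   - (summary_table.screen_on - nv.new_screen_on)
        FROM new_values AS nv
        WHERE nv.id = summary_table.id
    )
    WHERE id IN (SELECT id FROM new_values)
""")

rows = conn.execute(
    "SELECT id, screen_on, active_screen_on FROM summary_table ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

The outer WHERE restricts the update to rows that have a new value. Without it, rows with no match in the subquery would be set to NULL in this sketch.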
You get a syntax error because a comma is missing between the first and second columns you're updating.
Never use ROWIDs; they're volatile and also not used by default with IDS, unless you specify so.
Why are you using a subquery?

Rails: select unique values from a column

I already have a working solution, but I would really like to know why this doesn't work:
ratings = Model.select(:rating).uniq
ratings.each { |r| puts r.rating }
It selects, but it doesn't print unique values; it prints all values, including the duplicates. And it's in the documentation: http://guides.rubyonrails.org/active_record_querying.html#selecting-specific-fields
Model.select(:rating)
The result of this is a collection of Model objects. Not plain ratings. And from uniq's point of view, they are completely different. You can use this:
Model.select(:rating).map(&:rating).uniq
or this (most efficient):
Model.uniq.pluck(:rating)
Rails 5+
Model.distinct.pluck(:rating)
Update
Apparently, as of rails 5.0.0.1, it works only on "top level" queries, like above. Doesn't work on collection proxies ("has_many" relations, for example).
Address.distinct.pluck(:city) # => ['Moscow']
user.addresses.distinct.pluck(:city) # => ['Moscow', 'Moscow', 'Moscow']
In this case, deduplicate after the query
user.addresses.pluck(:city).uniq # => ['Moscow']
If you're going to use Model.select, then you might as well just use DISTINCT, as it will return only the unique values. This is better because it returns fewer rows and should be slightly faster than returning all the rows and then telling Rails to pick the unique values.
Model.select('DISTINCT rating')
Of course, this is provided your database understands the DISTINCT keyword, and most should.
This works too.
Model.pluck("DISTINCT rating")
If you want to also select extra fields:
Model.select('DISTINCT ON (models.ratings) models.ratings, models.id').map { |m| [m.id, m.ratings] }
Model.uniq.pluck(:rating)
# SELECT DISTINCT "models"."rating" FROM "models"
This has the advantages of not using sql strings and not instantiating models
Model.select(:rating).uniq
This code works as 'DISTINCT' (not as Array#uniq) since rails 3.2
Model.select(:rating).distinct
Another way to collect uniq columns with sql:
Model.group(:rating).pluck(:rating)
If I understand correctly:
Current query
Model.select(:rating)
returns an array of objects, and you have written the query
Model.select(:rating).uniq
uniq is applied to the array of objects, and each object has a unique id. uniq is doing its job correctly because each object in the array is unique.
There are many ways to select distinct ratings:
Model.select('distinct rating').map(&:rating)
or
Model.select('distinct rating').collect(&:rating)
or
Model.select(:rating).map(&:rating).uniq
or
Model.select(:rating).collect(&:rating).uniq
One more thing. The first and second queries find distinct data through the SQL query.
In databases whose collation ignores trailing spaces (such as MySQL's PAD SPACE collations), these queries treat "london" and "london " as the same value, so 'london' appears only once in the query result.
The third and fourth queries:
fetch the data through SQL and apply Ruby's uniq method to deduplicate it.
These queries treat "london" and "london " as different values, so both 'london' and 'london ' appear in the query result.
Please refer to the attached image for more understanding and have a look at "Toured / Awaiting RFP".
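The trailing-space difference between deduplicating in SQL and deduplicating in application code depends on the column's collation. The sketch below uses SQLite through Python's sqlite3 module (not Rails, and not the database from the answer); the table names are hypothetical, and SQLite's built-in RTRIM collation stands in for a collation that ignores trailing spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE strict_cities (city TEXT);                -- default BINARY collation
    CREATE TABLE padded_cities (city TEXT COLLATE RTRIM);  -- ignores trailing spaces
    INSERT INTO strict_cities VALUES ('london'), ('london '), ('london');
    INSERT INTO padded_cities VALUES ('london'), ('london '), ('london');
""")

# SQL-side deduplication follows the column's collation.
sql_strict = [r[0] for r in conn.execute("SELECT DISTINCT city FROM strict_cities")]
sql_padded = [r[0] for r in conn.execute("SELECT DISTINCT city FROM padded_cities")]

# Application-side deduplication, comparable to calling uniq on fetched values:
# plain string comparison, so the trailing space makes a different value.
fetched = [r[0] for r in conn.execute("SELECT city FROM padded_cities")]
app_side = sorted(set(fetched))

print(len(sql_strict), len(sql_padded), app_side)
```

So whether the SQL variants merge 'london' and 'london ' is a property of the database and collation in use, while the in-memory variants always keep them apart.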
If anyone is looking for the same with Mongoid, that is
Model.distinct(:rating)
Some answers don't take into account that the OP wants an array of values.
Other answers don't work well if your Model has thousands of records.
That said, I think a good answer is:
Model.uniq.select(:ratings).map(&:ratings)
=> "SELECT DISTINCT ratings FROM `models` "
Because first you generate an array of Model objects (of reduced size because of the select), then you extract the only attribute those selected models have (ratings).
You can use the following Gem: active_record_distinct_on
Model.distinct_on(:rating)
Yields the following query:
SELECT DISTINCT ON ( "models"."rating" ) "models".* FROM "models"
In my scenario, I wanted a list of distinct names after ordering them by their creation date and applying offset and limit. Basically a combination of ORDER BY and DISTINCT ON.
All you need to do is put DISTINCT ON inside the pluck method, as follows:
Model.order("name, created_at DESC").offset(0).limit(10).pluck("DISTINCT ON (name) name")
This returns an array of distinct names.
Model.pluck("DISTINCT column_name")
