Unable to use bucket map join in Hive 0.12

I was trying out some Hive optimization features and ran into the following problem:
I cannot use bucket map join in Hive 0.12. With all the settings I tried below, only one hashtable file is generated and the join turns out to be just a plain map join.
I have two tables, both in RCFile format and both bucketed into 10 buckets. They are created and populated as follows (the original data was generated from TPC-H):
hive> create table lsm (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipstruct string, l_shipmode string, l_comment string) clustered by (l_orderkey) into 10 buckets stored as rcfile;
hive> create table osm (o_orderkey int, o_custkey int) clustered by (o_orderkey) into 10 buckets stored as rcfile;
hive> set hive.enforce.bucketing=true;
hive> insert overwrite table lsm select * from orili;
hive> insert overwrite table osm select o_orderkey, o_custkey from orior;
I can query both tables' data normally; lsm is 790 MB and osm is 11 MB, each stored as 10 bucket files. Then I try a bucket map join:
hive> set hive.auto.convert.join=true;
hive> set hive.optimize.bucketmapjoin=true;
hive> set hive.enforce.bucketmapjoin=true;
hive> set hive.auto.convert.join.noconditionaltask=true;
hive> set hive.auto.convert.join.noconditionaltask.size=1000000000000000;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
and my query is as follows:
hive> select /*+ Mapjoin(osm) */ osm.o_orderkey, lsm.* from osm join lsm on osm.o_orderkey = lsm.l_orderkey;
This query only generates one hashtable of osm and falls back to a plain map join, which really confuses me. Do I have all the hints and settings needed to enable the bucket map join feature, or is there a problem in my query?

Short version:
Setting
hive> set hive.ignore.mapjoin.hint=false;
makes the bucket map join work as expected: each of the small table's 10 bucket files is built into a hashtable and hash-joined with the corresponding bucket file of the big table.
Longer version:
I dug into the Hive 0.12 code and found hive.ignore.mapjoin.hint in HiveConf.java; it defaults to true, which means the /*+ MAPJOIN */ hint is deliberately ignored. Hive has two optimization phases, logical optimization and physical optimization, and both are rule based.
Logical optimization: the map join optimization runs first and is followed by the bucket map join optimization, which takes the MapJoin operator tree as input and converts it into a BucketMapJoin. A hinted query is therefore transformed first into a map join and then into a bucket map join. With the hint ignored, the logical optimizer performs no join optimization on the join tree at all.
Physical optimization: here hive.auto.convert.join is checked and MapJoinResolver simply converts the reduce-side join into a MapJoin. There are no further BucketMapJoin optimization rules in this phase, which is why I only got a plain map join in my question.
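Putting it together: the working session is just the settings already listed in the question, with the hint flag from the short version added. Nothing else changes, and the query stays the same.
hive> set hive.ignore.mapjoin.hint=false;
hive> set hive.auto.convert.join=true;
hive> set hive.optimize.bucketmapjoin=true;
hive> set hive.enforce.bucketmapjoin=true;
hive> set hive.auto.convert.join.noconditionaltask=true;
hive> set hive.auto.convert.join.noconditionaltask.size=1000000000000000;
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> select /*+ MAPJOIN(osm) */ osm.o_orderkey, lsm.* from osm join lsm on osm.o_orderkey = lsm.l_orderkey;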

Related

Deleting columns from InfluxDB using the flux command line

Is there any way to delete columns of an InfluxDB time series? We accidentally ingested data using the wrong data type (int instead of float).
Alternatively, is there a way to change the data type instead?
Unfortunately, there is no way to delete a "column" (i.e. a tag or a field) from an Influx measurement so far. Here's the feature request for that but there is no ETA yet.
Three workarounds:
use SELECT INTO to copy the desirable data into a different measurement, excluding the undesirable "columns". e.g.:
SELECT desirableTag1, desirableTag2, desirableField1, desirableField2 INTO new_measurement FROM measurement
use CAST operations to "change the data type" from float to int. e.g.:
SELECT desirableTag1, desirableTag2, desirableField1, desirableField2, undesiredableTag3::integer, undesiredableField3::integer INTO new_measurement FROM measurement
"Update" the data with insert statement, which will overwrite the data with the same timestamp, same tags, same field keys. Keep all other things equal, except that the "columns" that you would like to update. To make the data in integer data type, remember to put a trailing i on the number. Example: 42i. e.g.:
insert measurement,desirableTag1=v1 desirableField1=fv1,desirableField2=fv2,undesirableField1=someValueA-i 1505799797664800000
insert measurement,desirableTag1=v21 desirableField1=fv21,desirableField2=fv22,undesirableField1=someValueB-i 1505799797664800000
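To make the trailing i syntax concrete, here is what such an overwrite looks like in the influx CLI, using the same placeholder tag and field names as above (the values and timestamp are purely illustrative):
insert measurement,desirableTag1=v1 desirableField1=fv1,desirableField2=fv2,undesirableField1=42i 1505799797664800000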

Neo4j imports zero records from csv

I am new to Neo4j and graph databases. While trying to import a few relationships from a CSV file, I see that no records are created, even though the file contains plenty of data.
LOAD CSV with headers FROM 'file:/graphdata.csv' as row WITH row
WHERE row.pName is NOT NULL
MERGE(transId:TransactionId)
MERGE(refId:RefNo)
MERGE(kewd:Keyword)
MERGE(accNo:AccountNumber {bName:row.Bank_Name, pAmt:row.Amount, pName:row.Name})
Followed by:
LOAD CSV with headers FROM 'file/graphdata.csv' as row WITH row
WHERE row.pName is NOT NULL
MATCH(transId:TransactionId)
MATCH(refId:RefNo)
MATCH(kewd:Keyword)
MATCH(accNo:AccountNumber {bName:row.Bank_Name, pAmt:row.Amount, pName:row.Name})
MERGE(transId)-[:REFERENCE]->(refId)-[:USED_FOR]->(kewd)-[:AGAINST]->(accNo)
RETURN *
Edit (table replica):
TransactionId Bank_Name RefNo Keyword Amount AccountNumber AccountName
12345 ABC 78 X 1000 5421 WE
23456 DEF X 2000 5471
34567 ABC 32 Y 3000 4759 HE
Is it the case that the nodes and relationships are not being created at all? How do I get all the desired relationships?
Neither file:/graphdata.csv nor file/graphdata.csv is a legal URL. You should use file:///graphdata.csv instead.
By default, LOAD CSV expects a "csv" file to consist of comma separated values. You are instead using a variable number of spaces as a separator (and sometimes as a trailer). You need to either:
use a single space as the separator (and specify an appropriate FIELDTERMINATOR option). But this is not a good idea for your data, since some bank names will likely also contain spaces.
use a comma separator (or some other character that will not occur in your data).
For example, this file format would work better:
TransactionId,Bank_Name,RefNo,Keyword,Amount,AccountNumber,AccountName
12345,ABC,78,X,1000,5421,WE
23456,DEF,,X,2000,5471
34567,ABC,32,Y,3000,4759,HE
Your Cypher query is attempting to use row properties that do not exist (since the file has no corresponding column headers). For example, your file has no pName or Name headers.
Your usage of the MERGE clause is probably not doing what you want, generally. You should carefully read the documentation, and this answer may also be helpful.
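To make the last two points concrete, here is a minimal sketch of a LOAD CSV statement against the comma-separated file shown above. The label and property names (Transaction, Account, txId, number, bank) are illustrative assumptions, not taken from the original post:
// one node per transaction and per account, linked by an AGAINST relationship
LOAD CSV WITH HEADERS FROM 'file:///graphdata.csv' AS row
WITH row WHERE row.TransactionId IS NOT NULL
MERGE (t:Transaction {txId: row.TransactionId})
MERGE (a:Account {number: row.AccountNumber, bank: row.Bank_Name})
MERGE (t)-[:AGAINST]->(a)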

Converting GeoJSON to DbGeometry

I am working with Bing Maps but am fairly new to spatial data types. I have managed to get the GeoJSON data for a shape from Bing Maps, for example:
{"type":"MultiPolygon","coordinates":[[[[30.86202,-17.85882],[30.93311,-17.89084],[30.90701,-17.92073],[30.87112,-17.90048],[30.86202,-17.85882],[30.86202,-17.85882],[30.86202,-17.85882]]]]}
However, I need to save this as DbGeometry in SQL. How can I convert the GeoJSON to DbGeometry?
Option 1.
Convert the GeoJSON to WKT and then use stgeomfromtext to create the Db object.
Option 2.
Deserialize the GeoJSON using GeoJSON.Net and then use the NuGet package GeoJSON.Net.Contrib.MsSqlSpatial to convert to a Db object, e.g.:
DbGeography dbGeographyPoint = point.ToDbGeography();
Option 3.
For some types of GeoJSON data, modifications based on this approach can be used:
drop table if exists BikeShare
create table BikeShare(
id int identity primary key,
position Geography,
ObjectId int,
Address nvarchar(200),
Bikes int,
Docks int )
declare @bikeShares nvarchar(max) =
'{"type":"FeatureCollection",
"features":[{"type":"Feature",
"id":"56679924",
"geometry":{"type":"Point",
"coordinates":[-77.0592213018017,38.90222845310455]},
"properties":{"OBJECTID":56679924,"ID":72,
"ADDRESS":"Georgetown Harbor / 30th St NW",
"TERMINAL_NUMBER":"31215",
"LATITUDE":38.902221,"LONGITUDE":-77.059219,
"INSTALLED":"YES","LOCKED":"NO",
"INSTALL_DATE":"2010-10-05T13:43:00.000Z",
"REMOVAL_DATE":null,
"TEMPORARY_INSTALL":"NO",
"NUMBER_OF_BIKES":15,
"NUMBER_OF_EMPTY_DOCKS":4,
"X":394863.27537199,"Y":137153.4794371,
"SE_ANNO_CAD_DATA":null}
},
......'
-- NOTE: This GeoJSON is truncated.
-- Copy full example from https://github.com/Esri/geojson-layer-js/blob/master/data/dc-bike-share.json
INSERT INTO BikeShare(position, ObjectId, Address, Bikes, Docks)
SELECT geography::STGeomFromText('POINT ('+long + ' ' + lat + ')', 4326),
ObjectId, Address, Bikes, Docks
from OPENJSON(@bikeShares, '$.features')
WITH (
long varchar(100) '$.geometry.coordinates[0]',
lat varchar(100) '$.geometry.coordinates[1]',
ObjectId int '$.properties.OBJECTID',
Address nvarchar(200) '$.properties.ADDRESS',
Bikes int '$.properties.NUMBER_OF_BIKES',
Docks int '$.properties.NUMBER_OF_EMPTY_DOCKS' )
I suggest trying Option 2 first.
Note: consider Geography instead of Geometry if you are using the GCS_WGS_1984 coordinate system, as Bing Maps does.
Instead of retrieving GeoJSON from Bing Maps, retrieve Well Known Text: https://www.bing.com/api/maps/sdk/mapcontrol/isdk/wktwritetowkt
Send this to your backend and then use geometry::STGeomFromText: https://learn.microsoft.com/en-us/sql/t-sql/spatial-geometry/stgeomfromtext-geometry-data-type?view=sql-server-ver15
Note that DbGeometry will not support spatially accurate calculations. Consider storing the data as DbGeography instead, using geography::STGeomFromText (https://learn.microsoft.com/en-us/sql/t-sql/spatial-geography/stgeomfromtext-geography-data-type?view=sql-server-ver15) and passing in 4326 as the SRID.
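As a minimal sketch of that last suggestion, the WKT here is just a single vertex taken from the multipolygon in the question, standing in for whatever WKT string the front end sends:
DECLARE @wkt nvarchar(max) = 'POINT (30.86202 -17.85882)';       -- WKT received from the client
DECLARE @geo geography = geography::STGeomFromText(@wkt, 4326);  -- 4326 = WGS 84 SRID
SELECT @geo.STAsText();                                          -- round-trip check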

Join 2 tables in Hive using a phone number and a prefix (variable length)

I'm trying to match phone numbers to an area using Hive.
I've got a table (prefmap) that maps a number prefix (prefix) to an area (area) and another table (users) with a list of phone numbers (nb).
There is only 1 match per phone number (no sub-area)
The problem is that the prefixes do not have a fixed length, so I cannot use the UDF substr(nb, <prefix length>) in the JOIN's ON() condition to match the start of a number against a prefix.
And when I try to use instr() to find if a number has a matching prefix:
SELECT users.nb,prefix.area
FROM users
LEFT OUTER JOIN prefix
ON (instr(prefmap.prefix,users.nb)=1)
I get an error on line 4: "Both left and right aliases encountered in Join '1'".
How could I get this to work?
I'm using hive 0.9
Thanks for any advice.
Probably not the best solution, but at least it does the job: use WHERE to define the matching condition instead of ON(), which is now forced to TRUE.
select users.nb, prefmap.area
from users
LEFT OUTER JOIN prefmap
ON (true)
WHERE instr(users.nb, prefmap.prefix) = 1
It's not perfect, as it's a bit slow: the cross join creates as many temporary (useless) rows per phone number as there are entries in the matching table, and only then does the WHERE condition keep the single right one. So it's better to use this only if the prefix table is not too long.
Can anyone think of a better way to do this?
Hive cannot convert (instr(prefmap.prefix,users.nb)=1) into a MapReduce job; Hive joins only support equality expressions in the ON clause. See the Hive joins wiki for more information.

erlang - how can I match tuple contents with qlc and mnesia?

I have an mnesia table for this record:
-record(peer, {
peer_key, %% key is the tuple {FileId, PeerId}
last_seen,
last_event,
uploaded = 0,
downloaded = 0,
left = 0,
ip_port,
key
}).
peer_key is a tuple {FileId, ClientId}. Now I need to extract the ip_port field from all peers that have a specific FileId.
I came up with a workable solution, but I'm not sure if this is a good approach:
qlc:q([IpPort || #peer{peer_key={FileId,_}, ip_port=IpPort} <- mnesia:table(peer), FileId=:=RequiredFileId])
Thanks.
Using an ordered_set table type with a tuple primary key like {FileId, PeerId}, and then partially binding a prefix of the tuple like {RequiredFileId, _}, will be very efficient, as only the range of keys with that prefix is examined rather than the full table. You can use qlc:info/1 to examine the query plan and make sure that any selects being generated bind the key prefix.
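A minimal sketch of that check, reusing the query from the question; the function name is illustrative, and it assumes the module includes qlc.hrl and defines the #peer{} record shown above:
peer_plan(RequiredFileId) ->
    mnesia:transaction(fun() ->
        QH = qlc:q([IpPort || #peer{peer_key = {FileId, _}, ip_port = IpPort}
                                  <- mnesia:table(peer),
                              FileId =:= RequiredFileId]),
        %% qlc:info/1 returns a readable description of the query plan
        io:format("~s~n", [qlc:info(QH)])
    end).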
Your query time will grow linearly with the table size, as it requires scanning through all rows. So benchmark it with realistic table data to see if it really is workable.
If you need to speed it up, you should focus on being able to quickly find all peers that carry the file id. This could be done with a bag-type table with [fileid, peerid] as attributes: given a file id you would get all peer ids, and with those you could construct the peer-table keys to look up.
Of course, you would also need to maintain that bag-type table inside every transaction that changes the peer table.
Another option would be to repeat the fileid in its own column and add a mnesia index on that column (see the sketch below). I am just not that keen on mnesia's own secondary indexes.
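A minimal sketch of that second option; the extra file_id field, the function names, and the index are assumptions for illustration, not part of the original schema:
%% record extended with a file_id field that duplicates the first element of
%% peer_key, so that it can carry a secondary index
-record(peer, {peer_key, file_id, last_seen, last_event,
               uploaded = 0, downloaded = 0, left = 0, ip_port, key}).

%% create the secondary index once, e.g. right after creating the table
add_file_id_index() ->
    mnesia:add_table_index(peer, file_id).

%% fetch the ip_port of every peer for a given file without a full table scan
ip_ports_for_file(FileId) ->
    mnesia:transaction(fun() ->
        [P#peer.ip_port || P <- mnesia:index_read(peer, FileId, #peer.file_id)]
    end).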
