Accuracy of the geolite_city_bq_b2 dataset - geolocation

I believe there are inaccuracies in the BigQuery fh-bigquery.geocode.geolite_city_bq_b2 dataset, and am curious if others have noticed this too.
Background: I have the BigQuery code from Ramtin M. Seraj running, and his/my logic appears to be sound. However, there are IP addresses known to represent certain places, e.g. Tokyo (150.249.199.17), which Ramtin's query places in Rochester, NY, USA or Ottawa, ON, Canada. If the query logic is sound, then the only conclusion is that the underlying GeoLite dataset is not.
To verify, look at the results of this query:
SELECT *
FROM `fh-bigquery.geocode.geolite_city_bq_b2b`
WHERE classB = 38649
Note from these results that startIp = 150.245.0.0 and endIp = 150.249.255.255; address 150.249.199.17 therefore falls within this IP range.
Now compare with the results from https://ipinfo.io/150.249.199.17, and also with the results of the following BigQuery query. Notice that all computed values, such as NET.IPV4_TO_INT64() of the IP address, fall within the ranges returned by the query above.
SELECT '150.249.199.17' as ipAddress
, NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17')) AS clientIpNum_int
, TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17'))/(256*256)) AS classB
, CAST(TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17'))/(256*256)) as INT64) as client_classB_int
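For reference, here is a minimal sketch (not Ramtin's exact query) of the classB join logic described above. It assumes the lookup table exposes the classB, startIp and endIp columns shown earlier, with startIp/endIp stored as dotted-quad strings, and it simply returns whichever row the classB join resolves the IP to:
#standardSQL
WITH test_ip AS (
  SELECT '150.249.199.17' AS ipAddress
)
SELECT
  t.ipAddress,
  geo.*  -- inspect which GeoLite row the classB join picks up for this IP
FROM test_ip t
JOIN `fh-bigquery.geocode.geolite_city_bq_b2b` geo
  ON geo.classB = CAST(TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(t.ipAddress)) / (256*256)) AS INT64)
WHERE NET.IPV4_TO_INT64(NET.IP_FROM_STRING(t.ipAddress))
  BETWEEN NET.IPV4_TO_INT64(NET.IP_FROM_STRING(geo.startIp))
      AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING(geo.endIp))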
p.s. I would upvote the first answer, or add a comment, but I don't have enough Reputons yet!

2019, much improved answer:
https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2
#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC
I'm about to publish a much improved version of GeoLite in BigQuery. Stay tuned to https://twitter.com/felipehoffa and https://medium.com/@hoffa. And I'll update this answer then too.
With that said, to answer the accuracy question in the title, MaxMind says:
GeoLite2 databases are free IP geolocation databases comparable to, but less accurate than, MaxMind’s GeoIP2 databases
https://dev.maxmind.com/geoip/geoip2/geolite2/
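As a quick sanity check, you can also resolve the single IP from the question against the newer table, reusing the same network_bin/mask join pattern as in the query above. This is only a sketch: country_name is the one location column confirmed by that query; any other columns would need to be checked against the table's schema.
#standardSQL
WITH single_ip AS (
  SELECT '150.249.199.17' AS ip
)
SELECT ip, country_name
FROM (
  SELECT ip, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin, mask
  FROM single_ip, UNNEST(GENERATE_ARRAY(9,32)) mask
  WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
)
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)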

Related

Downsampling: get constant value (e.g. sensor name) from GROUP BY

I created a continuous query to downsample readings from temperature sensors in my InfluxDB and store hourly means for a longer time. There are readings from multiple sensors in one measurement. Upon executing the query, the sensor's IP is missing.
Basic data looks like this:
> SELECT ip,tC FROM ht LIMIT 5
name: ht
time                 ip            tC
----                 --            --
1671057540000000000  192.168.0.83  21
1671057570000000000  192.168.0.83  21
1671057750000000000  192.168.0.17  21.38
The continuous query (simplified without CREATE ... END):
SELECT last(ip), mean("tC") AS "mean_temp" INTO "downsampled"."ht_downsampled" FROM "ht" GROUP BY time(1h),ip
The issue is that 'ip' is only a tag, not a field, and consequently it is missing from the measurement the query inserts into:
name: ht
tags: ip=192.168.0.17
time                 ip  mean_temp           mean_hum
----                 --  ---------           --------
1671055200000000000      21.47               42.75
1671058800000000000      21.39428571428571   48.785714285714285
1671062400000000000      21.314999999999998  51.625
Why is last(ip) not producing any value?
Can I get the value from the 'tags' into the table?
Is there a different approach to group data with a constant value?
Could you try querying ip instead of last(ip), since you are already grouping by ip in the statement?
Sample code:
SELECT ip, mean("tC") AS "mean_temp" INTO "downsampled"."ht_downsampled" FROM "ht" GROUP BY time(1h), ip
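A related note: because the continuous query groups by ip, the tag is already preserved on the destination measurement (see the tags: ip=... line in the output above), so when reading the downsampled data back you can surface it by grouping on it again. A minimal sketch, assuming the measurement names from the question:
SELECT "mean_temp" FROM "downsampled"."ht_downsampled" GROUP BY "ip"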

Cypher Performance with Multiple Relations

I am tasked with prototyping Neo4j as a replacement for our existing data mart, which stores data in Redshift/Postgres schemas. I have loaded an instance of Neo4j running on an m5.xlarge EC2 server to model a marketing campaign, and am trying to get simple counts of who saw my spots in a given demographic. The numbers produced are exactly the same as in my existing data mart, but I am surprised to see that performance is much slower: the same query to get a simple count of impressions by a television network returns in 48 seconds, compared to 1.5 seconds in Redshift. The question is: am I doing something wrong in my Cypher (i.e. too many joins), or is this expected behavior? Here is a diagram of the model:
Here is my Cypher to get the results in 48s:
match (c:Campaign{campaign_id:98})<-[:PART_OF]-(sa)
, (sa)-[:AIRED_ON]->(n)
, (n)-[:BELONGS_TO]->(ng:NetworkGroup{network_group_id:2})
, (sa)<-[:EXPOSED_WITH]-(e)
, (e)<-[se:CONTAINS_ENTITY]-(s:Sample{sample_id:2000005})
, (e)-[:MEMBER_OF]->(a:DemographicAudience{audience_id:2})
return c.campaign_id as `campaign_id`
, a.audience_id as `audience_id`
, a.audience_name as `audience_name`
, s.sample_id as `sample_id`
, n.network_id as `network_id`
, n.network_name as `network_name`
, n.network_call_sign as `network_call_sign`
, count(distinct sa.spot_airing_id) as `spot_airings`
, sum(se.weight) as `spot_impressions`
In addition, I believe all necessary constraints are added to optimize:
Indexes
ON :DemographicAudience(audience_id) ONLINE (for uniqueness constraint)
ON :Campaign(campaign_id) ONLINE (for uniqueness constraint)
ON :Entity(entity_id) ONLINE (for uniqueness constraint)
ON :Network(network_id) ONLINE (for uniqueness constraint)
ON :NetworkGroup(network_group_id) ONLINE (for uniqueness constraint)
ON :Sample(sample_id) ONLINE (for uniqueness constraint)
ON :SpotAiring(spot_airing_id) ONLINE (for uniqueness constraint)
Constraints
ON ( audience:DemographicAudience ) ASSERT audience.audience_id IS UNIQUE
ON ( campaign:Campaign ) ASSERT campaign.campaign_id IS UNIQUE
ON ( entity:Entity ) ASSERT entity.entity_id IS UNIQUE
ON ( network:Network ) ASSERT network.network_id IS UNIQUE
ON ( networkgroup:NetworkGroup ) ASSERT networkgroup.network_group_id IS UNIQUE
ON ( sample:Sample ) ASSERT sample.sample_id IS UNIQUE
ON ( spotairing:SpotAiring ) ASSERT spotairing.spot_airing_id IS UNIQUE
Running Neo4J 3.3.1 Community on AWS: https://aws.amazon.com/marketplace/pp/B073S5MDPV/?ref_=_ptnr_intuz_ami_neoj4
I should also mention that quite a bit of data is loaded: 24,154,440 nodes and 33,220,694 relationships. Most of these are relationships to entities.
My understanding was that Neo4j should go toe to toe with any RDBMS and even outperform it as data grows. I'm hoping I am just being naive with my rookie Cypher skills. Any help would be appreciated.
Thanks.
Keep in mind that in Neo4j indexes are used to find starting points in the graph, and once those starting points are found, relationship traversal is used to expand out to find the paths which fit the pattern.
In this case we have several unique nodes, so we can save on some operations by ensuring that we match to all of these nodes first, and we should see Expand(Into) in our query plan instead of Expand(All) and then Filter. I have a hunch that the planner is only using index lookup on a single node, and the rest of them aren't using the index but using property access and filtering, which is less efficient.
If the planner doesn't look up all of your unique nodes first, before expansion, we can force it by introducing a LIMIT 1 after the initial match.
Lastly, it's a good idea to aggregate using the node itself rather than its properties if the properties in question are unique. This will use the underlying graph id for the node for comparison purposes rather than having to do more expensive property access.
Give this a try and see how it compares:
MATCH (c:Campaign{campaign_id:98}), (s:Sample{sample_id:2000005}), (a:DemographicAudience{audience_id:2}), (ng:NetworkGroup{network_group_id:2})
WITH c,s,a,ng
LIMIT 1
MATCH (c)<-[:PART_OF]-(sa)
, (sa)-[:AIRED_ON]->(n)
, (n)-[:BELONGS_TO]->(ng)
, (sa)<-[:EXPOSED_WITH]-(e)
, (e)<-[se:CONTAINS_ENTITY]-(s)
, (e)-[:MEMBER_OF]->(a)
WITH c, a, s, n, count(distinct sa) as `spot_airings`, sum(se.weight) as `spot_impressions`
RETURN c.campaign_id as `campaign_id`
, a.audience_id as `audience_id`
, a.audience_name as `audience_name`
, s.sample_id as `sample_id`
, n.network_id as `network_id`
, n.network_name as `network_name`
, n.network_call_sign as `network_call_sign`
, `spot_airings`
, `spot_impressions`
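To check whether the plan actually changed, you can prefix a query with PROFILE (or EXPLAIN) and inspect the operators; with the unique constraints in place, each of the four starting nodes should be found via an index seek rather than a label scan plus filter. A minimal sketch of just that starting match:
PROFILE
MATCH (c:Campaign{campaign_id:98}), (s:Sample{sample_id:2000005}),
      (a:DemographicAudience{audience_id:2}), (ng:NetworkGroup{network_group_id:2})
RETURN c, s, a, ng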

InfluxDB mixing aggregation function with non-aggregate fields/values

I have the following issue:
I need to calculate the difference between consecutive points that share an arbitrary ID. The following:
SELECT difference(value_field) FROM mesurementName WHERE "IdField" = '10'
works and returns the difference between consecutive points with that IdField, BUT IdField is lost (only time is propagated to the query result). In my case time is not unique (i.e. the measurement may contain many points with the same timestamp but different IdField values). So I tried:
SELECT difference(value_field), IdField FROM mesurementName WHERE "IdField" = '10'
which yields:
error parsing query: mixing aggregate and non-aggregate queries is not supported!!
My next attempt was using sub-query:
SELECT IdField, diff
FROM (
  SELECT difference(flow_val) AS diff
  FROM mesurementA
  WHERE "IdField" = '10'
)
This resulted in an always-null value for IdField.
I'd like to ask for help or a suggestion on how to solve this issue. By the way, we are using InfluxDB 1.3, which no longer supports JOIN.
In case anyone else gets stuck as I was, the solution is the following:
SELECT difference(value_field) FROM mesurementName GROUP BY "IdField"
The above implicitly adds "IdField" to the result series, and the tag is propagated to the resulting measurement when combined with an INTO clause.
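For example, a minimal sketch of the INTO variant (the destination measurement name here is made up):
SELECT difference("value_field") AS "diff" INTO "value_field_diffs" FROM "mesurementName" GROUP BY "IdField"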

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So, can someone give me advice on how to improve the performance of my Cypher query, or on how to tune Neo4j the right way?
I am using neo4j-enterprise-2.3.1, which is running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users - LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties like countryCode, birthday and gender.
I imported all our users and relationships from an RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties, and each reference is a relationship.
The report from neo4j-import tool said that :
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in Neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
[execution plan image]
When I executed this query for a list of users, I got the following result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest query is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1: plan with the expanded boxes:
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what we get via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and non-intuitive) cause of inefficiency in your query is that when you specify the USING INDEX similar:User(birthday) hint, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
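For example, here is a hedged sketch of just the first stage of the query with the :User label and the USING INDEX hint removed from similar (parameter names are taken from the original query). It returns the similar users directly so you can time this stage in isolation; the rest of the query would then continue from the WITH as before:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar)
USING INDEX me:User(userId)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
RETURN similar, weight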

Neo4j performance - is it really fast?

I want to run some tests on Neo4j and compare its performance with other databases, in this case PostgreSQL.
This Postgres database has about 2,000,000 'contents' distributed across about 3,000 'categories'. (This means there is a 'content' table, a 'category' table, and a 'content-to-category' relation table, since one content can be in more than one category.)
So, mapping this to a Neo4j DB, I'm creating 'content' and 'category' nodes and their relationships (content to category, and content to content, because contents can have related contents).
category -> category ( categories can have sub-categories )
content -> category
content -> content (related)
Do you think this 'schema' is OK for this type of domain?
Migrating all data from PostgreSQL to Neo4j is taking forever (about 4-5 days). This is just searching for nodes and creating/updating them accordingly. (The search uses indexes, and the insert/update takes about 500 ms per node.)
Am I doing something wrong?
Migration is done, so I went on to try some querying...
I ended up with about 2,000,000 content nodes, 3,000 category nodes, and more than 4,000,000 relationships.
(Please note that I'm new to the whole Neo4j world, so I have no idea how to optimize Cypher queries...)
One of the queries I wanted to test is: get the 10 latest published contents of a given 'definition' in a given category (including contents that are in subcategories of the given category).
Experimenting a little, I ended up with something like this:
START
c = node : node_auto_index( 'type: category AND code: category_code' ),
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
MATCH (c) <- [ r:BELONGS_TO * ] - (n)
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
This takes around 3 seconds, excluding the first run, which takes a lot longer... is this normal?
What am I doing wrong?
Please note that I'm using Neo4j 1.9.2 and auto-indexing some node properties (type, code, state, definitionCode and published_stamp included; title is not auto-indexed).
Also, returning 'c' from the previous query (START c = node:node_auto_index('type: category AND code: category_code') RETURN c;) is fast (again excluding the first run), taking around 20-30 ms.
Also, I'm not sure if this is the right way to use indexes...
Thank you in advance (sorry if something doesn't make sense; ask me and I'll try to explain better).
Have you looked at the batch import facilities: http://www.neo4j.org/develop/import? You really should look at that for the initial import - it will take minutes instead of days.
I will ask some of our technical folks to get back to you on some of the other stuff. You really should not be seeing this.
Rik
How many nodes are returned by this?
START
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
RETURN count(*)
I would try to let the graph do the work.
How deep are your hierarchies usually?
Usually you limit arbitrary-length relationships to avoid combinatorial explosion:
I would also use a different relationship-type between content and category than for the category tree (see the sketch after the query below).
Can you point out your current relationship-types?
START c = node:node_auto_index('type: category AND code: category_code')
MATCH (c)<-[:BELONGS_TO*..5]-(n)
WHERE n.type = 'content' AND n.state = 'published' AND n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
Can you try that?
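And here is a small hedged sketch of the separate relationship-type idea; the relationship names IN_CATEGORY and SUBCATEGORY_OF are made up for illustration. Content attaches to its category with one type, the category tree uses another, and the traversal only walks the tree before fanning out to content:
START c = node:node_auto_index('type: category AND code: category_code')
// *0..5 includes content attached directly to the given category as well as
// content in subcategories up to 5 levels deep
MATCH (c)<-[:SUBCATEGORY_OF*0..5]-(sub)<-[:IN_CATEGORY]-(n)
WHERE n.state = 'published' AND n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6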
For import it is easiest to generate CSV from your SQL and import that using http://github.com/jexp/batch-import
Are you running Linux, maybe on an ext4 filesystem?
You might want to set the barrier=0 mount option, as described here: http://structr.org/blog/neo4j-performance-on-ext4
Further discussion of this topic: https://groups.google.com/forum/#!topic/neo4j/nflUyBsRKyY
