My team is debating internally whether or not we should create a separate dimension for address information. The use case is a warehouse for a mail marketing agency, so address is quite important for a multitude of reasons.
We have several pieces of address information flowing in: Bank Address, Customer Address (our clients' customers), Mailing List Address (or Manifests), and Client Address. We might also get bits and pieces of information from other sources that we would need to tie to a specific customer based on address comparisons.
We also geocode our addresses to augment, standardize, and validate the addresses as they come in.
In total, we are storing the following fields for any given address:
DeliveryLine1
DeliveryLine2
LastLine
DeliveryPointBarcode
StreetNumber
ApartmentNumber
ApartmentUnitType
StreetName
StreetSuffix
Locality
Region
ZipCode
ZipCodePlusFour
DeliveryPoint
DeliveryPointCheckpointDigit
Latitude
Longitude
RecordType
ZipType
CountyFIPS
CarrierRoute
ResidentialDeliveryIndicator
Precision
DPV
Vacant
Active
EWS
That's 27 fields in total.
My colleague is of the opinion that address should go into each dimension (Customer, Bank, Client, Manifest). While I agree that this would make sense in simple cases where we store Address1, Address2, City, State, Zip, we store a significant amount of additional information about an address, with more bits and pieces potentially being added later on. My contention is that something like this would be better suited to a separate dimension. Any thoughts?
Looking at it from a dimensional modelling point of view, your fact tables should answer this question. If your mail-marketing facts relate to addresses, then go ahead and make Address a separate dimension. That is, if you do mail marketing to Banks, to Customers, to Mailing List Addresses and to Clients, and want to analyze information based on geographic information (that is, on Address), then it should be created as a separate dimension. However, if you usually mail market only to your Clients and use an address for other purposes, e.g. to find nearby Customers, Banks, etc., then I don't see much value in making Address a dimension. In essence, if your facts relate to Addresses at the same level as your targets (Banks, Customers, Mailing List Addresses, Clients), then it should be a dimension. If an address is nothing but an attribute of a Bank, Customer, Mailing List Address or Client, then there is no need for a dimension.
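For illustration only (none of this is from the original posts; the field subset and function names are assumptions): if Address does become its own conformed dimension, one common approach is to deduplicate the standardized records under a surrogate key, for example a hash of the standardized delivery lines, and have each of the other dimensions carry that key. A minimal Python sketch:

import hashlib

def address_key(addr: dict) -> str:
    # Derive a stable surrogate key from the standardized fields.
    # Which fields to hash is an assumption; tune to your data.
    basis = "|".join([
        addr.get("DeliveryLine1", ""),
        addr.get("DeliveryLine2", ""),
        addr.get("ZipCode", ""),
        addr.get("ZipCodePlusFour", ""),
    ]).upper()
    return hashlib.sha1(basis.encode("utf-8")).hexdigest()

dim_address = {}  # surrogate key -> full 27-field address record

def load_address(addr: dict) -> str:
    # Insert the standardized record once; every source row
    # (Customer, Bank, Client, Manifest) just stores the key.
    key = address_key(addr)
    dim_address.setdefault(key, addr)
    return key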
Let's say we have a resource structure like the one below:
GUID                                 | Region   | Country | State           | StateDetails
a120c850-e296-4563-8fb9-31d0192aef75 | EMEA     | FR      | Normandy        | Statedetails
6f4b3ca6-c992-42dd-b1e3-8c8f8ba62886 | APAC     | AU      | New South Wales | Statedetails
d202b255-5fe1-4203-b4ad-3cc74f4f6986 | AMERICAS | US      | California      | Statedetails
...etc                               | ...etc   | ...etc  | ...etc          | ...etc
where GUID is a unique identifier for a resource record, and Region is the parent of Country, which in turn is the parent of State. Each such state has one record in the table (the region/country/state combination is unique and can act as an alternate key).
In order to display the state details JSON, which URL naming scheme would be more appropriate? Are there pros and cons to each approach?
Option 1:
http://www.xyzsamplecompany.com/insertGUIDhere
eg: http://www.xyzsamplecompany.com/6f4b3ca6-c992-42dd-b1e3-8c8f8ba62886
(or)
Option 2:
http://www.xyzsamplecompany.com/region/country/state/resource
eg: http://www.xyzsamplecompany.com/EMEA/FR/Normandy/statedetails
Listing my thoughts here.
In Option 1,
the link is immutable and relatively static, so the same URL will return the same data over a long period of time. A client (requestor) may be given the static link, which they can bookmark for future use. If the data changes tomorrow (say France decides to rename Normandy to Normândy), that complexity is hidden from the requestor.
However, there is less transparency as to which resource is being requested, precisely because the complexity is hidden.
In Option 2, the hierarchy is very clearly laid out. The complexity of arriving at the correct URL is left to the client (requestor), so they need to keep track of underlying data changes.
It is transparent for any system inspecting the resource, such as a monitoring tool or WAF. However, with that extra transparency, if an unwanted third party knows the list of states, countries, and regions (which is common knowledge), there is a risk of scraping, which could prove resource-intensive.
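Purely as a sketch of the trade-off (the data structures and function names below are illustrative, not part of the question): Option 1 resolves a single immutable key, while Option 2 needs a secondary index over mutable names that must be maintained as the hierarchy changes.

# Illustrative lookup structures for the two URL schemes.
records = {
    "6f4b3ca6-c992-42dd-b1e3-8c8f8ba62886":
        {"region": "APAC", "country": "AU", "state": "New South Wales"},
}

# Secondary index needed for Option 2; must be rebuilt on renames.
by_path = {(r["region"], r["country"], r["state"]): guid
           for guid, r in records.items()}

def option1(guid):
    # GUIDs never change, so bookmarked URLs keep working.
    return records[guid]

def option2(region, country, state):
    # Renaming "Normandy" breaks every saved hierarchical URL.
    return records[by_path[(region, country, state)]]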
Currently, I'm working on dimensional modeling and have a question in regards to an outrigger dimension.
The company is trading and acts as a broker between customer and supplier.
For a fact table, "Fact Trades", we include dimCustomer and dimSupplier.
Each of these dimensions have an address.
My question is whether it is correct to create outrigger dimensions that refer to geography. That way we can measure how much we have delivered from an origin city and how much has been delivered to a destination city.
[diagram: dimensional model]
I am curious as to what best practice is. I hope you can explain how this should be modelled correctly, and why.
I hope my question was clear and that I have posted it in the correct place.
Thanks in advance.
I can think of at least 3 possible options; your particular circumstances will determine which is best for you:
If you often filter your fact by geography but without needing company/person information (i.e. how many trades were between London and New York?) then I would create a standalone geography dimension and link it directly to your fact (twice: once for customer and once for supplier; see the sketch after this list). This doesn't stop you also having geographic attributes in your customer/supplier Dims, as a dimensional model is not normalised
If geographic attributes change at a significantly more frequent rate than the customer/supplier attributes, and the customer/supplier Dims have a lot of attributes, then it may be worth creating an outrigger dim for the geographical attributes - as this reduces the maintenance required for the customer/supplier Dims. However, given that most companies/people rarely change their address, this is probably unlikely
Keep the geographical attributes in the customer/supplier Dims. I would probably do this anyway even if also picking option 1 above
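A minimal sketch of option 1 (all table and column names below are assumptions): the same geography dimension plays two roles against the fact, so "how many trades were between London and New York?" becomes two joins.

import pandas as pd

dim_geography = pd.DataFrame({
    "geo_key": [1, 2],
    "city": ["London", "New York"],
})
fact_trades = pd.DataFrame({
    "supplier_geo_key": [1],  # role: origin of the delivery
    "customer_geo_key": [2],  # role: destination city
    "quantity": [100],
})

# Join the single geography dimension twice, once per role.
trades = (fact_trades
          .merge(dim_geography.add_prefix("origin_"),
                 left_on="supplier_geo_key", right_on="origin_geo_key")
          .merge(dim_geography.add_prefix("dest_"),
                 left_on="customer_geo_key", right_on="dest_geo_key"))
print(trades[["origin_city", "dest_city", "quantity"]])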
Just out of interest - do customer and supplier have significantly different sets of attributes (I assume they are both companies or people)? Is it necessary to create separate Dims for them?
I have a schema which looks like below:
A customer is linked to another customer with a relationship SIMILAR having similarity score.
Example: (c1:Customer)-->(c2:Customer)
An Email node is connected to each customer by a MAIL_AT relationship, with the following node properties:
{
"active_email_address": "a#mail.com",
"cibil_email_addresses": [
"b#mail.com", "c#mail.com"
]
}
Example: (e1:Email)<-[:MAIL_AT]-(c1:Customer)-[:SIMILAR]->(c2:Customer)-[:MAIL_AT]->(e2:Email)
A Risk node has some risk-related properties (below) and is related to the customer by a HAS_RISK relationship:
{
"f0_score": 870.0,
"pta_score": 430.0
}
A Fraud node has some fraud-related properties (below) and is related to the customer by an IS_FRAUD relationship:
{
"has_commited_fraud": true
}
My Objectives:
Find the customers with common email addresses (regardless of whether the address is active or secondary).
My tentative solution:
MATCH (email:Email)
WITH email.cibil_email_addresses + email.active_email_address AS emailAddress, email
UNWIND emailAddress AS eaddr
WITH DISTINCT eaddr AS deaddr, email
UNWIND deaddr AS eaddress
MATCH (customer:Customer)-[]->(someEmail:Email)
WHERE eaddress IN someEmail.cibil_email_addresses + someEmail.active_email_address
WITH eaddress, COLLECT(customer.customer_id) AS customers
RETURN eaddress, customers
Problem: It is taking forever to execute. I understand that working with lists takes time; however, I'm open to changing the schema if that would help. Should I break the email addresses out into separate nodes? If so, how do I break cibil_email_addresses into different nodes, given that they can vary? Should I create a node for each cibil email address and connect them all to the customer with a HAS_CIBIL_EMAIL relationship? (Is that a valid schema design?) Also, it is possible that one customer's active_email_address is present in another customer's cibil_email_addresses. I'm trying to find a synthetic identity attack. PS: If there is some APOC procedure that can help achieve this and the below, please suggest it with an example.
In production, for a given customer with email addresses, risk values, and a similarity score, and given that other customers may or may not be tagged with a fraud status, I want to check whether this new person will fall into a fraud ring or not. PS: If I need to use any GDS algorithm to solve this, please suggest it with examples.
If I were to do this same exercise with some other node, such as Address, which may be partially matching and which keeps the same list of historical addresses, what should my ideal approach be?
I know, I'm tagging someone in my question, but that person only seems to be active with respect to Cypher on StackOverflow. #cybersam any help?
Thanks.
This should work:
MATCH (e:Email)
UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
WITH address, COLLECT(e) AS es
UNWIND es AS email
MATCH (email)<-[:MAIL_AT]-(cust)
RETURN address, COLLECT(cust) AS customers
The WITH clause takes advantage of the aggregating function COLLECT to automatically collect all the Email nodes containing the same address, using address as the grouping key.
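As a usage note (not part of the answer above): if you only care about addresses shared by more than one customer, a SIZE filter can be added, and the query can be driven from Python with the official neo4j driver. The bolt URL, credentials, and the customer_id property below are placeholders:

from neo4j import GraphDatabase

SHARED_ADDRESSES = """
MATCH (e:Email)
UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
WITH address, COLLECT(e) AS es
UNWIND es AS email
MATCH (email)<-[:MAIL_AT]-(cust:Customer)
WITH address, COLLECT(DISTINCT cust.customer_id) AS customers
WHERE SIZE(customers) > 1  // only addresses shared across customers
RETURN address, customers
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(SHARED_ADDRESSES):
        print(record["address"], record["customers"])
driver.close()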
You should only ask one question at a time. You have a couple of other questions at the bottom. If you continue to need help with them, please create new questions.
Is there a (web) service which offers a Geo Location API?
For example:
I have the German state "Baden-Württemberg"; now I want to get a result listing its biggest cities (for example, ordered by population).
My problem is a little bit abstract, but I hope someone can understand it.
This is not exactly what you are looking for, but I think it is a step in the right direction if you are willing to set up your own database to query. The United Nations Statistics Division (UNSD) keeps a dataset of the cities with populations over 100,000. You can find it at the link below. Note that it does not show which state (1st-level administrative division) a city is within, just the country.
http://unstats.un.org/unsd/demographic/products/dyb/dyb2011/Table08.xls
I have created a CSV version of the data (using semicolons as delimiters) that you can use as well:
http://www.opengeocode.org/cude1.1/UN/UNSD/dyd2011-pop100k.zip
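A hedged sketch of how you might query the CSV once downloaded and extracted (the file name and the column names below are guesses; check the actual header row of the file):

import csv

# Assumed file name after unzipping; the real name may differ.
with open("dyd2011-pop100k.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter=";"))

# "Country", "City", "Population" are assumed column names.
german = [r for r in rows if r["Country"] == "Germany"]
german.sort(key=lambda r: int(r["Population"]), reverse=True)
for r in german[:10]:
    print(r["City"], r["Population"])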
OpenGeoCode.Org is an open data project where we take national and international publicly available datasets and convert them into a common CSV format.
Andrew
I'm looking for a way to search through a database and find close similarities between email addresses. The only solution I can think of is O(N^2): a nested loop that grabs an email address and checks it against the rest of the addresses, over and over. This will be extremely time-consuming, as I'm dealing with 100,000 email addresses in a database. If it makes a difference, this will be implemented as a background job for a Ruby on Rails app.
Is there a better way to do this?
I'm really only looking for basic similarities. An example would be:
docjohnson@gmail.com
docjohnson1@gmail.com
docjohnson333@gmail.com
docjohnson@hotmail.com
I would want those all marked similar to each other.
Thanks for the help!
EDIT: I'm using a Mongo database connected to ROR via Mongoid, if that changes the game at all.
Compute a "signature" for each email address; for instance, a signature might be the first five characters of the username part of the address. Sort all email addresses to bring together those with identical signatures; if your signature algorithm does a good job, each set of signatures should refer to the same person. You'll have to tune the signature algorithm based on your data and your definition of similarity.
I suggest that you start with "canonicalizing" the e-mails:
strip trailing digits from the username part, e.g., john123 -> john.
maybe drop some punctuation from the username, e.g., john.smith -> johnsmith.
drop some hosts from the domain part, e.g., mail.foo.com -> foo.com; but not math.mit.edu -> mit.edu.
after you do 1 & 2, you should collect the original emails into a hash table mapping the canonical usernames to the original ones, so that when you are done, you only need to iterate over the canonical usernames.