Comparing datafeeds from different networks (Affiliate Marketing) - parsing

I am working on integrating affiliate sales into a few existing sites. We are using a few merchants who work via different networks (cj, shareasale, linkshare, avantlink).
My observation is that all these networks provide data feeds in different formats, but that's not a big problem. My main concern is merchants using different titles for the same products. I don't want to run into these situations:
a) two listings of the SAME product from N merchants (if titles are just a bit different)
b) one listing of N different products from merchants (if we don't use a strict comparison algorithm)
We want to automate as much as possible and avoid having operators scan questionable listings all the time.
How is this problem typically handled?

We have a similar issue with trying to collapse products from multiple merchant feeds. What we do is collapse products based on their brand (or manufacturer) + sku combo.
Our data is pretty messy so we have to do some work to normalize both the brand and the sku so the products collapse nicely. We have a list of brands that we care about and do some work to map brands from the merchant feed into our brand. e.g. If we have an "ACME" brand in our system we might map the following to that brand:
A.C.M.E => ACME
ACME Inc. => ACME
Acme Incorporated => ACME
For skus we usually just strip any non-alphanumeric characters for matching purposes. e.g. all the following would map to the same sku:
abc-123 => abc123
abc.123 => abc123
abc 123 => abc123
ab.c1.23 => abc123
So if we see brand "ACME Inc." and sku "abc-123" in one feed that will collapse with brand "A.C.M.E" and sku "abc 123" from another feed.
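The brand and sku normalization described above could be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the `BRAND_ALIASES` table, the function names and the row layout (dicts with `brand` and `sku` keys) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical alias table: squashed merchant spelling -> brand in our system.
BRAND_ALIASES = {
    "acme": "ACME",
    "acmeinc": "ACME",
    "acmeincorporated": "ACME",
}


def squash(text):
    """Lowercase the text and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def normalize_brand(raw_brand):
    """Return our canonical brand, or None if the brand is not one we track."""
    return BRAND_ALIASES.get(squash(raw_brand))


def normalize_sku(raw_sku):
    """Strip non-alphanumeric characters so 'abc-123' and 'abc 123' match."""
    return squash(raw_sku)


def collapse_key(raw_brand, raw_sku):
    """Build the (brand, sku) key used to collapse products across feeds."""
    brand = normalize_brand(raw_brand)
    if brand is None:
        return None
    return (brand, normalize_sku(raw_sku))


def collapse(rows):
    """Group feed rows (dicts with 'brand' and 'sku') by their collapse key."""
    groups = defaultdict(list)
    for row in rows:
        key = collapse_key(row["brand"], row["sku"])
        if key is not None:  # rows with unknown brands are skipped
            groups[key].append(row)
    return dict(groups)
```

With this sketch, `collapse_key("ACME Inc.", "abc-123")` and `collapse_key("A.C.M.E", "abc 123")` both produce `("ACME", "abc123")`, so the two rows land in the same group. Note that `squash` also drops non-ASCII letters, which may or may not be acceptable for a given set of feeds.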
As part of the collapsing process we end up with multiple names/images/descriptions/categories/etc... for each collapsed part and need to choose the "best" one to show on the website.
That's a very high level overview of how we handle it.

Look for merchants who provide UPC codes in their feeds. They are universal. Plus in AvantLink you can customize your own feed output so that's nice.

I was actually looking at 2 sample data feeds from AvantLink a minute ago. Here's the list of fields they provide (not filtered, so I assume it's everything):
SKU
Manufacturer
Id
Brand Name
Product Name
Long Description
Short Description
Category
SubCategory
Product Group
Thumb URL
Image URL
Buy Link
Keywords
Reviews
Retail Price
Sale Price
Brand
Page Link
Brand Logo Image
Product Page View Tracking
Product Content Widget
I was thinking that yes, having UPCs would be (almost) ideal, but neither of the stores I was looking at (one of them is REI) provides UPCs.
I also checked a few large merchants on Commission Junction and ShareASale; they don't include UPCs either.

How is this problem typically handled?
Such scenarios are typically covered by data warehouse systems such as those provided by Oracle, HP, Microsoft, IBM, Netezza or Teradata.

Related

In Google Sheets, how can I remove similar (but not duplicate) strings from the same column?

In Google Sheets, I have a column which is a list of company names, all within the same column. The problem is that some of them are repeated with slight differences. For example:
Company 1 Limited
Company 1 Ltd
Company 2
Company 3 Group
Company 3
Company 4
Company 5 p.l.c
Company 5 plc
Is there a way I can delete the similar entries (e.g. Company 1 "Ltd" vs Company 1 "Limited") to end up with a list like this?
Company 1 Ltd
Company 2
Company 3 Group
Company 4
Company 5 plc
I don't have a preference between words like 'Ltd' or 'Limited', or whether 'group' is present or not. I would just like to reduce, as much as possible, these similar double entries. I've come across fuzzylookup but my understanding is that it only works between two ranges.
The easiest thing for me to do would be to strip down the company names to a standard form by removing "Ltd" and "limited" with find and replace, and then remove the duplicates, but I would rather not go down this path as I would like to retain something following the company name. Keep in mind that the column contains company names which vary in string length. "Company X" is used in this case for demonstration purposes.
Sample here: https://docs.google.com/spreadsheets/d/1u2XDzKR09Ri_hR9FXxs9OwRswEKDvhlXCaLb1w-rhSc/edit?usp=sharing

Rails & Postgres: Best practice for many boolean relationship entries

I'm looking for input on what's best practice for my problem. In abstract terms, I need to store a lot of relationships and need to prioritise between database size, query speed and “ease of maintenance”. The setup is Ruby on Rails with PostgreSQL.
More specific description: Imagine a website that's essentially a searchable database of Products sold by Vendors. Vendors may not ship worldwide, and I want to filter out Products by Vendors who won't ship to a user based on their GeoIP country. To make matters slightly more complex, a Product page can have two different features (let’s call them F1 and F2) that are separately geo-IP-dependent.
Example: A Vendor may want their Product pages to have feature F1 for all countries worldwide except a few e.g. because of embargos; but feature F2 may only be available for countries within e.g. Europe.
Country filters will always be set at the Vendor level.
The “search” function of the website is a basic SQL search in which I want Products to show up if at least one of the features is available for the current user’s country.
The website will allow in the range of 1,000 to 3,000 Vendors (this is a hard limit), and a total of around 10,000 to 50,000 Products. Let's assume that in the beginning, filtering is only relevant to around 100 Vendors.
I had the following ideas and hope that others have feedback on these, or additional approaches:
One relation model CountryVendor with two boolean columns (in which case optionally, a Product could still be shown if the respective country_vendor does not yet exist; i.e. show if !country_vendor&.allow).
Assuming ca. 200 countries, this would imply ca. 20,000 rows in the beginning (100 Vendors × 200 countries), and around 600k rows if filters were in place for every one of the 3,000 potential Vendors.
(Theoretically, if non-existence is treated as true, I could also set up a rake task that removes rows that are false for both features, thus reining in the table size.)
Two relation models CountryVendorF1 and CountryVendorF2, each with just one boolean column. Not sure if this will effectively be much different, but I imagine it closer to how I think of the UI for setting up the country filters (without going into detail here).
Two JSON columns in the Vendors table that would store true/false for each Country. (Maybe with an ISO code string as the index for simplicity.) There wouldn’t be thousands of new rows, although the DB would still grow in size, but querying might become slow.

Unit Price and Discounts - Fact or Dimension Table

I'm working on a datamart for our sales and marketing departments, and I've come across a modeling challenge. Our ERP stores pricing data in a few different ways:
List pricing for each item
A discount percentage from list pricing for a product line, either for groups of customers or for a specific account
A custom price for an item, either for groups of customers or for a specific account
The Pricing department primarily uses this data operationally, not analytically. For example, they generate reports for customers ("What special pricing / discount %s do I have?") and identify which items / item groups need to be changed when they engage in a new pricing strategy.
Pricing changes happen somewhat regularly on a small scale, usually on a customer-by-customer or item-by-item basis. Infrequently, there are large-scale adjustments to list pricing and group pricing (discounts and individual items) in addition to the customer-level discounts.
My thinking so far has been to create one or more fact tables to represent this process. Unfortunately, there's no pre-existing business key for pricing. There's also no specific "transaction date," since the ERP doesn't (accurately) maintain records of when pricing is changed. Essentially, a "pricing event" is going to be a combination of:
Effective date
End date
Item OR product line
(Not required for list price) customer or customer group
A price amount OR discount percentage
A single fact table seems problematic in that I'm going to have to deal with a lot of invalid combinations of dimensions and facts. First, a record will never have both a non-NULL price amount and a non-NULL discount percentage; pricing events are either-or. Second, only certain combinations of dimensions are valid for each fact. For example, a discount percentage will only ever have a product line, not an individual item.
Does it make sense to model pricing as a fact table in the first place? If so, how many tables should I be considering? My intuition is to use at least two, one for the percentages and one for the price amounts, but this still leaves a problem where each record will either have a valid customer group OR a valid customer (or neither, for list prices), since we need to maintain customer-specific pricing separate from any group pricing that customer might have.
You may need to keep them both as attributes and as facts.
The price a certain item was sold for is a fact. When you multiply it by the quantity sold it's actually an additive measure. So, keep it in the fact table. Total discount applied is also additive, I'd keep it. You can later query "how much was discounted in 2019 per customer", which would be much harder to achieve without those facts.
But if you also need to query things like "what discount is customer X on", then you should also keep that as an attribute of the customer dimension, and treat it as a type II (slowly changing) dimension, so as to keep discount history. If you know when a certain discount was applied, great; if not, take the first sale as the start date and you won't be too far off.
Maybe the list price can also be kept as an attribute of product or product line in a dimension, but only if they don't change too often; but if most customers get discounts anyway that would be of limited use.

What are methods for comparing documents

Recently I started doing some research into standardising product data.
Supermarkets often sell the same products at different prices, and it is useful to compare these prices. To do this, we need to know we are matching the same products from each supermarket. The problem is, supermarkets will often have small differences in how they name their products and list them on their websites. We need a tool that can standardise product names, recognising two differently-named products as the same product, while successfully recognising different but similarly-named products as well as differences in quantity. For example, if I want to buy rashers and search for "rasher", all rashers should be coded as the same product, even when they are named differently, and mapped to HS codes. What technologies are behind this process?
Additionally, we need these product prices converted to standard units and the products aligned with the groups defined by the World trade HS codes. Let's say the price of rashers is 2.99 euro per 180 g; I want to convert it to 16.61 euro per kg (2.99 / 0.18 ≈ 16.61). I am looking for an examination of the appropriate natural language techniques to determine which method best fulfils these goals.
It depends on the type of difference in the titles, but for now I will explain two different types:
1- If the product name is separated from the words before and after it, you can "index" product titles with a tool like Apache Lucene and just search for the product name; you will get all products with that title.
2- If the product name comes with prefixes and postfixes attached, you can use edit distance algorithms to find similar products.
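As a rough illustration of the second option, here is a minimal sketch of Levenshtein edit distance with a similarity score built on top of it. The function names and the 0.8 threshold are assumptions chosen for the example; a real threshold would have to be tuned on actual product titles.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score from 0.0 to 1.0, where 1.0 means identical ignoring case."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_probable_match(a, b, threshold=0.8):
    """Hypothetical helper: treat titles above the threshold as the same."""
    return similarity(a, b) >= threshold
```

One caveat: edit distance alone cannot tell a naming variation from a quantity difference, since "180g" and "480g" are only one edit apart. Quantities are probably best extracted and compared separately before the names are matched.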
For the second part of your question, you should define patterns that catch all the differences in weight, value, etc. and then unify them.
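A pattern-based approach to unifying quantities might look like the sketch below. It assumes the quantity appears in the title as a number followed by one of a few metric units; the `UNIT_TO_BASE` table, the regular expression and the function names are illustrative assumptions, and real listings would need more patterns (multipacks, "per 100g", and so on).

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical unit table: unit -> (factor to the base unit, base unit).
UNIT_TO_BASE = {
    "g": (Decimal("0.001"), "kg"),
    "kg": (Decimal("1"), "kg"),
    "ml": (Decimal("0.001"), "l"),
    "l": (Decimal("1"), "l"),
}

# A number, optional whitespace, then a unit that ends at a word boundary.
QUANTITY_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|g|ml|l)\b", re.IGNORECASE)


def parse_quantity(text):
    """Return (quantity in base units, base unit) or None if nothing matches."""
    match = QUANTITY_PATTERN.search(text)
    if match is None:
        return None
    amount = Decimal(match.group(1))
    factor, base_unit = UNIT_TO_BASE[match.group(2).lower()]
    return amount * factor, base_unit


def price_per_base_unit(price, text):
    """Convert a pack price (given as a string) to a price per kg or litre."""
    parsed = parse_quantity(text)
    if parsed is None:
        return None
    quantity, base_unit = parsed
    if quantity == 0:
        return None  # avoid dividing by a zero quantity
    unit_price = (Decimal(price) / quantity).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return unit_price, base_unit
```

For the example from the question, `price_per_base_unit("2.99", "Rashers 180g")` gives `(Decimal("16.61"), "kg")`. `Decimal` is used instead of floats so that rounding to cents behaves predictably.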

What is the canonical way to implement a global search (that searches through different resources) in rails?

For my app, searching is a big deal. I have many resources, and each search result should pick bits from here and there.
Should I make a search controller?
What's the best architecture in that situation?
An example use case:
The user searches for "Eos D5".
The app should reply with a field containing the full product name, which is constructed from the manufacturer's name and the product model. Also, if the product is available in shops near the user, the user is told that there is a nearby shop where the product can be bought. The manufacturer's name for product EOS D5 is "Canon".
That makes three models used: product, manufacturer, shop.
The output is something like
Canon EOS D5
For complex search functionality I'd recommend using a full-text search engine such as Solr.
There's a great gem called sunspot that makes integrating with Solr a breeze.
