What are methods for comparing documents?

I recently started doing some research into standardising product data.
Supermarkets often sell the same products at different prices, and it is useful to compare these prices. To do this, we need to know that we are matching the same products from each supermarket. The problem is that supermarkets often have small differences in how they name their products and list them on their websites. We need a tool that can standardise product names, recognising two differently named products as the same product, while still telling apart different but similarly named products and differences in quantity. For example, if I search for rashers, I want every rasher product to be found, however it is named, and mapped to its HS code. What technologies are behind this kind of process?
Additionally, we need the product prices converted to standard units and the products aligned with the groups defined by the world trade HS codes. Say the price of rashers is 2.99 euro per 180g; I want to convert it to 16.61 euro per kg (2.99 / 0.18). I am looking for an examination of the appropriate natural language processing techniques to determine which method best fulfils these goals.

It depends on the type of difference in the titles, but for now I will explain two different types:
1- If the product name can be separated from the words before and after it, you can index the product titles with a tool like Apache Lucene, then search for the product name to get all products with that name in their title.
2- If the product name comes with prefixes and suffixes attached, you can use edit distance algorithms to find similar products.
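As a rough illustration of the second approach, here is a minimal Python sketch of the Levenshtein edit distance and a similarity score derived from it. The function names and the idea of scaling the distance by the longer title are assumptions made for this example; a real matcher would need a threshold tuned on actual product data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Titles whose similarity is above a chosen cut-off can then be treated as candidates for the same product and checked further, for example on quantity.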
For the second part of your question, you should define patterns that catch the differences in weight, value, etc., and then unify them.
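A minimal sketch of such a pattern in Python, assuming quantities appear in the title as a number followed by g, kg, ml or l. The unit table, the regular expression and the function names are all hypothetical; real listings will need more patterns (multipacks, "per 100g", and so on).

```python
import re

# Hypothetical conversion table: unit -> (base unit, factor to the base unit).
UNIT_FACTORS = {
    "g": ("kg", 0.001),
    "kg": ("kg", 1.0),
    "ml": ("l", 0.001),
    "l": ("l", 1.0),
}

# A number (with "." or "," as decimal mark), optional space, then a unit.
QUANTITY_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(kg|g|ml|l)\b", re.IGNORECASE)


def parse_quantity(text: str):
    """Return (amount in base unit, base unit), or None if nothing matches."""
    match = QUANTITY_RE.search(text)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", "."))
    base_unit, factor = UNIT_FACTORS[match.group(2).lower()]
    return amount * factor, base_unit


def unit_price(price: float, text: str):
    """Return (price per base unit rounded to 2 places, base unit), or None."""
    parsed = parse_quantity(text)
    if parsed is None or parsed[0] == 0:
        return None
    amount, base_unit = parsed
    return round(price / amount, 2), base_unit


# 2.99 euro for 180g is 2.99 / 0.18, about 16.61 euro per kg.
print(unit_price(2.99, "Rashers 180g"))
```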

Related

Rails & Postgres: Best practice for many boolean relationship entries

I'm looking for input on what's best practice for my problem. In abstract terms, I need to store a lot of relationships and need to prioritise between database size, query speed and “ease of maintenance”. The setup is Ruby on Rails with PostgreSQL.
More specific description: Imagine a website that's essentially a searchable database of Products sold by Vendors. Vendors may not ship worldwide, and I want to filter out Products by Vendors who won't ship to a user based on their GeoIP country. To make matters slightly more complex, a Product page can have two different features (let’s call them F1 and F2) that are separately geo-IP-dependent.
Example: A Vendor may want their Product pages to have feature F1 for all countries worldwide except a few e.g. because of embargos; but feature F2 may only be available for countries within e.g. Europe.
Country filters will always be set at the Vendor level.
The “search” function of the website is a basic SQL search in which I want Products to show up if at least one of the features is available for the current user’s country.
The website will allow in the range of 1,000 to 3,000 Vendors (this is a hard limit), and a total of around 10,000 to 50,000 Products. Let's assume that in the beginning, filtering is only relevant to around 100 Vendors.
I had the following ideas and hope that others have feedback on these, or additional approaches:
1- One relation model CountryVendor with two boolean columns (in which case, optionally, a Product could still be shown if the respective country_vendor does not yet exist; i.e. show if !country_vendor&.allow). Assuming ca. 200 countries, this would imply ca. 20,000 rows in the beginning (100 Vendors), and around 600k rows if filters were in place for every one of the 3,000 potential Vendors. (Theoretically, if non-existence is treated as true, I could also set up a rake task that removes rows that are false for both features, thus reining in the table size.)
2- Two relation models CountryVendorF1 and CountryVendorF2, each with just one boolean column. Not sure if this will effectively be much different, but I imagine it closer to how I think of the UI for setting up the country filters (without going into detail here).
3- Two JSON columns in the Vendors table that would store true/false for each Country. (Maybe with an ISO code string as the index for simplicity.) There wouldn’t be thousands of new rows, although the DB would still grow in size, but querying might become slow.
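To make the first idea concrete, here is a rough sketch of the search filter, written in Python against an in-memory SQLite database rather than Rails and PostgreSQL. The table and column names (country_vendors, allow_f1, allow_f2) are assumptions for illustration, and it follows the "missing row means allowed" reading, which is only one of the options mentioned.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, vendor_id INTEGER, name TEXT);
        CREATE TABLE country_vendors (
            vendor_id INTEGER,
            country_code TEXT,
            allow_f1 INTEGER,
            allow_f2 INTEGER,
            PRIMARY KEY (vendor_id, country_code)
        );
        INSERT INTO products VALUES (1, 1, 'Lamp'), (2, 2, 'Desk');
        -- Vendor 1 blocks both features in US and allows only F2 in DE.
        INSERT INTO country_vendors VALUES (1, 'US', 0, 0), (1, 'DE', 0, 1);
    """)
    return conn


def visible_products(conn, country_code):
    """Products for which at least one feature is available in the country."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM products p
        LEFT JOIN country_vendors cv
          ON cv.vendor_id = p.vendor_id AND cv.country_code = ?
        WHERE cv.vendor_id IS NULL   -- no filter row: treated as allowed
           OR cv.allow_f1 = 1
           OR cv.allow_f2 = 1
        ORDER BY p.name
        """,
        (country_code,),
    ).fetchall()
    return [name for (name,) in rows]
```

With a unique key on (vendor_id, country_code) the join returns at most one filter row per product, so the query stays a single indexed lookup per Vendor.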

Unit Price and Discounts - Fact or Dimension Table

I'm working on a datamart for our sales and marketing departments, and I've come across a modeling challenge. Our ERP stores pricing data in a few different ways:
List pricing for each item
A discount percentage from list pricing for a product line, either for groups of customers or for a specific account
A custom price for an item, either for groups of customers or for a specific account
The Pricing department primarily uses this data operationally, not analytically. For example, they generate reports for customers ("What special pricing / discount %s do I have?") and identify which items / item groups need to be changed when they engage in a new pricing strategy.
Pricing changes happen somewhat regularly on a small scale, usually on a customer-by-customer or item-by-item basis. Infrequently, there are large-scale adjustments to list pricing and group pricing (discounts and individual items) in addition to the customer-level discounts.
I have been thinking about creating one or more fact tables to represent this process. Unfortunately, there's no pre-existing business key for pricing. There's also no specific "transaction date," since the ERP doesn't (accurately) maintain records of when pricing is changed. Essentially, a "pricing event" is going to be a combination of:
Effective date
End date
Item OR product line
(Not required for list price) customer or customer group
A price amount OR discount percentage
A single fact table seems problematic in that I'm going to have to deal with a lot of invalid combinations of dimensions and facts. First, a record will never have both a non-NULL price amount and a non-NULL discount percentage; pricing events are either-or. Second, only certain combinations of dimensions are valid for each fact. For example, a discount percentage will only ever have a product line, not an individual item.
Does it make sense to model pricing as a fact table in the first place? If so, how many tables should I be considering? My intuition is to use at least two, one for the percentages and one for the price amounts, but this still leaves a problem where each record will either have a valid customer group OR a valid customer (or neither, for list prices), since we need to maintain customer-specific pricing separate from any group pricing that customer might have.

You may need to keep them both as attributes and as facts.
The price a certain item was sold for is a fact. When you multiply it by the quantity sold it's actually an additive measure. So, keep it in the fact table. Total discount applied is also additive, I'd keep it. You can later query "how much was discounted in 2019 per customer", which would be much harder to achieve without those facts.
But if you also need to query things like "what discount is customer X on", then you should also keep that as an attribute of the customer dimension, and treat it as a Type 2 slowly changing dimension so that the discount history is kept. If you know when a certain discount was applied, great; if not, take the first sale as the start date and you won't be too far off.
Maybe the list price can also be kept as an attribute of product or product line in a dimension, but only if they don't change too often; but if most customers get discounts anyway that would be of limited use.
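A minimal Python sketch of how a Type 2 history for the customer discount could behave: each change closes the current row and opens a new one. The class, the field names (valid_from, valid_to) and the half-open date convention are assumptions for this example, not something taken from the ERP.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerDiscountRow:
    customer_id: str
    discount_pct: float
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current row


def apply_discount_change(history: List[CustomerDiscountRow],
                          customer_id: str, new_pct: float,
                          effective: date) -> None:
    """Close the customer's current row and append a new current row."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.discount_pct == new_pct:
                return  # nothing changed, keep the current row open
            row.valid_to = effective
    history.append(CustomerDiscountRow(customer_id, new_pct, effective))


def discount_on(history: List[CustomerDiscountRow],
                customer_id: str, day: date) -> Optional[float]:
    """Discount in force on a day; rows cover [valid_from, valid_to)."""
    for row in history:
        if (row.customer_id == customer_id and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.discount_pct
    return None
```

Because old rows are never overwritten, a question like "what discount was customer X on in March 2019" can still be answered after later changes.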

Dimensional Modeling - How to deal with a single fact table with facts with inconsistent dimensions?

I want to set up a fact table for restaurant sales transactions. Adding up the entire fact table will give the entire sales across the restaurant(s). The restaurant has two main sources of revenue - food and beverage. The dimensions for each are very different.
For example, for food, I might want to track whether it's dairy free, gluten free, etc. Or I might want to see whether the dish is Italian, French, etc. For wine, I might be interested in the vintage, where the wine is from, what grape the wine is.
How do I accomplish this with one fact table? Should I simply have a Wine dimension that is NULL if the item is a food, and a Food dimension that is NULL if the item is a wine?

Your fact probably looks something like this?
SALES_LINE_ITEM_FACT
    TRAN_DATE
    TRAN_HOUR (or other time buckets if needed)
    SERVER_KEY
    TABLE_KEY
    SEAT_KEY
    PROMOTION_KEY
    PRODUCT_KEY
    REGULAR_PRICE
    NET_SALE_PRICE
    PRODUCT_COST
Your "product" dimension is where you need to focus your attention if you want to report from a sales fact how many people ordered a specific wine.
To start, it might just look something like:
PRODUCT_DIM
    PRODUCT_KEY
    PRODUCT_NAME
    PRODUCT_CATEGORY (food / beverage)
    PRODUCT_SUBCATEGORY (wine / beer / dairy / french / italian etc)
    CURRENT_AVERAGE_PRODUCT_COST
You could either add the detail information as another level on the category hierarchy, or if you want to do more detailed analysis, create specific snowflakes for certain product types and connect them to the product dim.
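The snowflake option can be sketched as follows, using Python with an in-memory SQLite database purely for illustration. The WINE_DETAIL table and its columns (vintage, region, grape) are hypothetical names based on the attributes mentioned in the question, and the fact and dimension tables are cut down to the columns the example needs.

```python
import sqlite3


def build_demo():
    """One fact table, one product dimension, one wine-only snowflake table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product_dim (
            product_key INTEGER PRIMARY KEY,
            product_name TEXT,
            product_category TEXT,
            product_subcategory TEXT
        );
        CREATE TABLE wine_detail (
            product_key INTEGER PRIMARY KEY REFERENCES product_dim(product_key),
            vintage INTEGER,
            region TEXT,
            grape TEXT
        );
        CREATE TABLE sales_line_item_fact (
            tran_date TEXT,
            product_key INTEGER REFERENCES product_dim(product_key),
            net_sale_price REAL
        );
        INSERT INTO product_dim VALUES
            (1, 'Lasagne', 'food', 'italian'),
            (2, 'Chianti Classico', 'beverage', 'wine'),
            (3, 'Rioja Reserva', 'beverage', 'wine');
        INSERT INTO wine_detail VALUES
            (2, 2018, 'Tuscany', 'Sangiovese'),
            (3, 2016, 'Rioja', 'Tempranillo');
        INSERT INTO sales_line_item_fact VALUES
            ('2024-01-05', 1, 14.0),
            ('2024-01-05', 2, 32.0),
            ('2024-01-06', 2, 32.0),
            ('2024-01-06', 3, 40.0);
    """)
    return conn


def total_sales(conn):
    """Summing the whole fact table still gives total restaurant sales."""
    return conn.execute(
        "SELECT SUM(net_sale_price) FROM sales_line_item_fact").fetchone()[0]


def sales_by_grape(conn):
    """Wine-specific analysis goes through the snowflaked detail table."""
    return conn.execute("""
        SELECT w.grape, SUM(f.net_sale_price)
        FROM sales_line_item_fact f
        JOIN wine_detail w ON w.product_key = f.product_key
        GROUP BY w.grape
        ORDER BY w.grape
    """).fetchall()
```

The fact table carries only PRODUCT_KEY, so food rows need no NULL wine columns; the wine attributes are reached only when a query joins to the detail table.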

Choosing categories in Rails

I hope a good Rails developer here can give me a correct answer. I have been waiting two days for a valid answer to my question.
I will explain with a very simple example.
A customer is offering a product. When he clicks "create", he gets a form and chooses a category. Once he has chosen, another form pops up.
Depending on the category, the form should have totally different attributes. I can't use Product.new for every category, because the categories have different attributes (which is logical). So do I have to create 100 models for 100 categories?
The categories are: cars, apartments, coupons, books and many more.
If you can give just one example I will be grateful and call you an expert.
Thanks

It sounds like you're getting there. However, I wouldn't have a bunch of models like you're indicating in your question. I would say that you need a Product model and a Category model. The Category model will belong_to Product. The Product model would have many Categories. The Category model can use the acts_as_tree gem so that you can have categories and subcategories. Use JavaScript or jQuery (there was a recent Railscasts episode on this) to dynamically change and post a different field with a set of choices based on what was chosen.
EDIT:
I would have three models: Product, Category and Specification.
Product has many Categories
Product has many Specifications through Categories
Category belongs to Product
Category has many Specifications
Specification belongs to Category
This way I can create a product that has several categories. I can create several categories that have several specifications. Specifications are linked to the respective category. This will allow you to have three models and limited number of classes. Once your project is complete, new categories and specifications can be maintained by a web admin instead of a programmer.
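To show the idea of specifications hanging off categories, here is a small sketch in Python rather than Rails. The class and field names, and the list of core fields, are hypothetical; in a Rails app these would be ActiveRecord models and the specifications would be rows maintained by an admin.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Specification:
    name: str
    field_type: str = "text"  # hypothetical hint for rendering the input


@dataclass
class Category:
    name: str
    specifications: List[Specification] = field(default_factory=list)


# Hypothetical fields that every product form shows, whatever the category.
CORE_FIELDS = ["title", "price"]


def form_fields(category: Category) -> List[str]:
    """Fields to render once the user has chosen a category."""
    return CORE_FIELDS + [spec.name for spec in category.specifications]


cars = Category("cars", [Specification("make"), Specification("mileage", "number")])
books = Category("books", [Specification("author"), Specification("publisher")])
print(form_fields(cars))
print(form_fields(books))
```

Adding a new category then means adding data (a Category with its Specifications), not a new class.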

This isn't the answer you want, but you're going to need a lot of models.
The attributes associated with an apartment (square meters, utilities, floor of building) are completely different from the attributes associated with a car (make, model, mileage, condition) which are completely different from a book (title, author, publisher, edition, etc). These items are so fundamentally different that there is no way to manage them in a single model.
That being said, there may be a core collection of attributes that might be associated with a product that is for sale (seller, price, terms). You have basically two paths forward:
You could decide to use Single Table Inheritance. In this case, you'd create an abstract class that defines the attributes that are common to all products that you are selling (seller, price, item). You'd then add a "type" column to your database that would be used to determine what type of product it is (mapped to your categories), and define all of the possible attributes in a single table.
You could choose a core set of attributes, and use these as a part of any other object that is considered a product. You'd have multiple tables that would have the full record for any given object.
Without knowing a lot of details about your application, it's hard to make a specific recommendation about which approach is right for you. Your best bet at this point is to spend some time searching Google for "single table inheritance rails" and "multi table inheritance rails" and work out which one fits (though my gut says multi table).

Comparing datafeeds from different networks (Affiliate Marketing)

I am working on integrating affiliate sales into a few existing sites. We are using a few merchants who work via different networks (cj, shareasale, linkshare, avantlink).
Now my observation is that all these networks provide data feeds in different formats. But that's not a big problem. My main concern is actually merchants using different titles on same products. I don't want to run into these situations:
a) two listings of the SAME product from N merchants (if titles are just a bit different)
b) one listing of N different products from merchants (if we don't use strict comparison algorithm)
We want to automate as much as possible and avoid having operators constantly review the listings in question.
How is this problem typically handled?

We have a similar issue with trying to collapse products from multiple merchant feeds. What we do is collapse products based on their brand (or manufacturer) + sku combo.
Our data is pretty messy so we have to do some work to normalize both the brand and the sku so the products collapse nicely. We have a list of brands that we care about and do some work to map brands from the merchant feed into our brand. e.g. If we have an "ACME" brand in our system we might map the following to that brand:
A.C.M.E => ACME
ACME Inc. => ACME
Acme Incorporated => ACME
For skus we usually just strip any non-alphanumeric characters for matching purposes. e.g. all the following would map to the same sku:
abc-123 => abc123
abc.123 => abc123
abc 123 => abc123
ab.c1.23 => abc123
So if we see brand "ACME Inc." and sku "abc-123" in one feed that will collapse with brand "A.C.M.E" and sku "abc 123" from another feed.
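The normalization described above could look roughly like this in Python. The alias table and function names are hypothetical and only cover the ACME examples; in practice the brand list would be maintained by hand for the brands that matter.

```python
import re


def _squash(text: str) -> str:
    """Lowercase and strip every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


# Hypothetical alias table: squashed feed brand -> canonical brand.
BRAND_ALIASES = {
    "acme": "ACME",
    "acmeinc": "ACME",
    "acmeincorporated": "ACME",
}


def normalize_brand(raw: str):
    """Map a feed brand to a canonical brand, or None if it is unknown."""
    return BRAND_ALIASES.get(_squash(raw))


def normalize_sku(raw: str) -> str:
    return _squash(raw)


def collapse_key(brand: str, sku: str):
    """Products with the same key collapse into one listing."""
    canonical = normalize_brand(brand)
    if canonical is None:
        return None  # unknown brand: leave for manual review
    return canonical, normalize_sku(sku)


print(collapse_key("ACME Inc.", "abc-123"))
print(collapse_key("A.C.M.E", "abc 123"))
```

Returning None for unknown brands keeps the matching strict, which avoids merging different products at the cost of leaving some duplicates uncollapsed.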
As part of the collapsing process we end up with multiple names/images/descriptions/categories/etc... for each collapsed part and need to choose the "best" one to show on the website.
That's a very high level overview of how we handle it.

Look for merchants who provide UPC codes in their feeds. They are universal. Plus in AvantLink you can customize your own feed output so that's nice.

I was actually looking at 2 sample data feeds from AvantLink a minute ago. Here's the list of fields they provide (not filtered, so I assume it's everything):
SKU
Manufacturer
Id
Brand Name
Product Name
Long Description
Short Description
Category
SubCategory
Product Group
Thumb URL
Image URL
Buy Link
Keywords
Reviews
Retail Price
Sale Price
Brand
Page Link
Brand Logo Image
Product Page View Tracking
Product Content Widget
I was thinking that yes, having UPCs would be (almost) ideal, but neither of the stores I was looking at (one of them is REI) provides UPCs.
I also checked a few large merchants on Commission Junction and ShareASale; they don't include UPCs either.

How is this problem typically handled?
Such scenarios are typically covered by data warehouse systems such as those provided by Oracle, HP, Microsoft, IBM, Netezza or Teradata.
