I have several tables whose column names differ but are mapped to one another through ETL. There are around 200 tables and 500 attributes in total, so the set is not massive.
Some column mappings are as follows:
startDate EFT_DATE
startDate START_DATE
startDate entryDate
As you can see, the same column name can be mapped to different names across different tables.
I'm trying to solve the following problem:
Given two schemas, I want to find matches between attribute names.
I was wondering if there is a way to leverage gensim to solve this problem, similar to the source-words example from Google. The challenge I'm facing is deciding which dataset to use to train the model. I'm also wondering whether there is another approach to this problem.
You can apply basic text analyzers to this by pre-processing each term (a short sketch of these steps follows the list):
Split by any non-alphabetic character.
e.g. EFT_DATE becomes [eft,date]
Split by camelCase.
e.g. startDate becomes [start,date]
Lowercase each term
Apply a fuzzy dictionary lookup to each token
e.g. startt -> start (typo correction)
Apply stemming
e.g. starting -> start
Maybe apply a synonym conversion.
e.g. begin -> start
Optionally, sort the terms:
dateStarted -> [date,start]
startingDate -> [start,date] -> [date,start]
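To make the steps above concrete, here is a minimal Python sketch, assuming NLTK's PorterStemmer for the stemming step and toy, hand-built typo/synonym dictionaries (the specific mappings are only illustrative guesses):

import re
from nltk.stem import PorterStemmer  # stemming: starting -> start

# Toy dictionaries -- in practice you would build these from your own vocabulary.
TYPO_FIXES = {"startt": "start"}
SYNONYMS = {"begin": "start", "entry": "start"}  # guess: entryDate ~ startDate

stemmer = PorterStemmer()

def tokenize(name):
    """Normalize a column name into a set of lowercase, stemmed tokens."""
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)   # split camelCase
    tokens = re.split(r"[^A-Za-z]+", name)             # split on non-alphabetic chars
    tokens = [t.lower() for t in tokens if t]
    tokens = [TYPO_FIXES.get(t, t) for t in tokens]    # fuzzy/typo lookup (toy)
    tokens = [SYNONYMS.get(t, t) for t in tokens]      # synonym conversion (toy)
    return frozenset(stemmer.stem(t) for t in tokens)  # set => order-independent

def jaccard(a, b):
    """Simple set similarity between two token sets."""
    return len(a & b) / len(a | b)

print(tokenize("EFT_DATE"))                                      # frozenset({'eft', 'date'})
print(jaccard(tokenize("START_DATE"), tokenize("dateStarted")))  # 1.0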
You can now apply set-distance operations, which is O(n^2) in the number of terms. Given your moderate cardinality, that is fine. If you had a larger set of terms, then scalable set-comparison approaches like the following can help reduce the complexity (a hedged sketch using one of them follows the links).
LSH Forests
theory http://infolab.stanford.edu/~bawa/Pub/similarity.pdf
python/sklearn http://lijiancheng0614.github.io/scikit-learn/modules/generated/sklearn.neighbors.LSHForest.html
SimHash / MinHash
theory https://stackoverflow.com/a/46415603/1056563
python
simhash https://github.com/leonsim/simhash
minhash/simhash/others https://github.com/ekzhu/datasketch
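As a concrete illustration of the scalable route, here is a hedged sketch using the datasketch library linked above; the token sets, threshold, and num_perm values are just illustrative:

from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

# Token sets as produced by the pre-processing above (illustrative)
columns = {
    "EFT_DATE":   {"eft", "date"},
    "START_DATE": {"start", "date"},
    "entryDate":  {"entry", "date"},
    "startDate":  {"start", "date"},
}

# Index every column; queries return candidates with estimated Jaccard >= threshold
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for name, tokens in columns.items():
    lsh.insert(name, minhash(tokens))

print(lsh.query(minhash({"start", "date"})))  # e.g. ['START_DATE', 'startDate']

This avoids comparing every pair of columns directly, at the cost of approximate (probabilistic) candidate matches.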
Related
I have a complex data model consisting of around hundred tables containing business data. Some tables are very wide, up to four hundred columns. Columns can have various data types - integers, decimals, text, dates etc. I'm looking for a way to identify relevant / important information stored in these tables.
I fully understand that business knowledge is essential to correctly process a data model. What I'm looking for are some strategies to pre-process tables and identify columns that should be taken to a later stage where analysts will actually look into them. For example, I could use data profiling and statistics to find and exclude columns that don't have any data at all, or where all records have the same value. This way I could potentially eliminate 30% of fields. However, I'm interested in exploring how AI and Machine Learning techniques could be used to identify important columns, hoping I could identify around 80% of relevant data. I'm aware that relevant information will depend on the questions I want to ask. But even then, I hope I could narrow down the columns to simplify the manual assessment taking place in the next stage.
Could anyone provide some guidance on how to use AI and Machine Learning to identify relevant columns in such wide tables? What strategies and techniques can be used to pre-process tables and identify columns that should be taken to the next stage?
Any help or guidance would be greatly appreciated. Thank you.
F.
The most common approach I've seen to evaluate the analytical utility of columns is the correlation method. This tells you whether there is a relationship (positive or negative) between specific column pairs. In my experience you'll be able to more easily build analysis outputs when columns are correlated, although these analyses may not always be the most accurate.
However, before you even do that, as you indicate, you would probably need to narrow down your list of columns using much simpler methods. For example, you could surely eliminate a whole bunch of columns based on data type and basic count statistics.
Less common analytic data types (IDs, blobs, binary, etc.) can probably be excluded first, followed by running a simple COUNT(DISTINCT ColName) and COUNT(*) WHERE ColName IS NULL. This will help eliminate unique IDs, keys, and similar columns: if all the rows are distinct, the field is not a good candidate for analysis. The same goes for NULLs: if the percentage of nulls is greater than some threshold, then you can eliminate those columns as well.
To automate this, depending on your database, you could create a fairly simple stored procedure or function that loops through all the tables and columns and computes the data type, distinct count, and null percentage for each field.
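As an alternative to a stored procedure, a hedged Python sketch of that profiling loop might look like this, assuming a Postgres database accessed through psycopg2 (the connection details, schema name, and 90% null threshold are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection
cur = conn.cursor()

cur.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
""")
for table, column, data_type in cur.fetchall():
    # Skip data types that rarely carry analytic value
    if data_type in ("bytea", "uuid"):
        continue
    cur.execute(
        f'SELECT COUNT(*), COUNT(DISTINCT "{column}"), COUNT("{column}") '
        f'FROM "{table}"'
    )
    total, distinct, non_null = cur.fetchone()
    null_pct = 1 - non_null / total if total else 1.0
    # Flag keys/IDs (all values distinct) and mostly-empty columns
    if total and (distinct == total or null_pct > 0.9):
        print(f"exclude {table}.{column} (distinct={distinct}, null%={null_pct:.0%})")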
Once you've narrowed down the list of columns, you can consider a .corr() function to run the analysis against all the remaining columns in something like a Python script.
If you wanted to keep everything in the database, Postgres also supports a corr() aggregate function, but you'll only be able to run this on 2 columns at a time, like this:
SELECT corr(column1,column2) FROM table;
so you'll need to build a procedure that evaluates multiple columns at once.
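Back in Python, the .corr() step mentioned above might look like this minimal pandas sketch (the connection string, table, and column names are placeholders):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://me@localhost/mydb")  # placeholder
df = pd.read_sql("SELECT col_a, col_b, col_c FROM some_table", engine)

# Pairwise Pearson correlation of all remaining numeric columns at once
corr_matrix = df.select_dtypes(include="number").corr()
print(corr_matrix)

# Pull out strongly correlated pairs (the 0.7 threshold is arbitrary)
strong = corr_matrix.abs().stack()
print(strong[(strong > 0.7) & (strong < 1.0)])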
I've thought about this technical challenge for some time. In general it's an AI-solvable problem, since there are easy features to extract such as unique values, clustering, distribution, etc.
We want to bake this ability into https://columns.ai. Obviously we haven't gotten there yet; the first step we have taken is to collect statistics for all columns when a data connection is made, identify columns that have a similar range of unique values, and generate a set of query templates for users to explore their dataset.
If you're interested, please take a look. As we keep advancing this part, it will get closer to an AI model that finds relevant columns. Cheers!
Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
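For example, a minimal gensim sketch along these lines might look like the following (gensim 4.x API; the records and all parameters are just illustrative starting points):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Each record's fields concatenated into one text (illustrative data)
records = {
    "doc1": "The Art of Computer Programming Donald Knuth Addison-Wesley",
    "doc2": "Clean Code Robert Martin Prentice Hall",
}

corpus = [TaggedDocument(simple_preprocess(text), [doc_id])
          for doc_id, text in records.items()]

# PV-DBOW mode (dm=0); parameters here are just starting points
model = Doc2Vec(corpus, dm=0, vector_size=100, epochs=40, min_count=1)

# Treat the query like a document: infer a vector with the trained model
query_vec = model.infer_vector(simple_preprocess("books by Donald Knuth"))
print(model.dv.most_similar([query_vec], topn=2))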
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
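A tiny sketch of that field-prefixing preprocessing (the field names and helper function are hypothetical):

from gensim.utils import simple_preprocess

def prefix_tokens(field_name, text):
    """'Robert Martin' in the author field -> ['author:robert', 'author:martin']."""
    return [f"{field_name}:{tok}" for tok in simple_preprocess(text)]

doc_tokens = (prefix_tokens("title", "Clean Code")
              + prefix_tokens("author", "Robert Martin"))
# ['title:clean', 'title:code', 'author:robert', 'author:martin']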
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.
I am reading some legacy code and I found that all the inserts into InfluxDB are done this way (simplified code here):
influxDB.write(Point.measurement(myeasurement)
        .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
        .addField("myfield", 123)
        .tag("rnd", String.valueOf(Math.random() * 100000))
        .build());
As you can guess, the value of the "rnd" tag is different for each point, which means we can have up to 100k different tag values. Actually, for now we have fewer points than that, so we end up with a different tag value for every point...
I am not an expert in InfluxDB, but my understanding is that tags are used by InfluxDB to group related values together, like partitions or shards in other tools. 100k tag values seems like a lot...
Is that as horrible as I think it is? Or is there any possibility that this kind of insert may be useful for something?
EDIT: I just realized that Math.random() returns a double, so multiplying by 100000 is just useless, as is the String.valueOf(); the tag value is essentially unique for every point. So there is actually one series in the database per value. I can't imagine how that could be a good thing :(
It is bad and unnecessary.
Unnecessary because each point that you write to InfluxDB is uniquely identified by its timestamp plus the set of applied tag values.
Bad because each set of tag values creates a separate series, and InfluxDB keeps an index over the series. Having a unique tag value for each data point will grow your system resource requirements and slow down the database. (Unless you don't have that many data points, but then you don't really need a time-series database, or you just don't care.)
As the OP said, tags are meant for grouping and filtering.
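For illustration only (the OP's code is Java, but the idea is the same), here is a hedged sketch with the InfluxDB 1.x Python client showing the usual split: low-cardinality identifiers go in tags, the measured values go in fields. The connection details, measurement, and tag/field names are placeholders:

from influxdb import InfluxDBClient  # InfluxDB 1.x client, as an illustration

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

client.write_points([{
    "measurement": "mymeasurement",
    "tags": {"host": "server01"},      # few distinct values -> few series
    "fields": {"myfield": 123},        # fields are not indexed; any values are fine
    "time": "2019-01-01T00:00:00Z",
}])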
Here are some good reads on the topic
https://docs.influxdata.com/influxdb/v1.7/concepts/tsi-details/
https://www.influxdata.com/blog/path-1-billion-time-series-influxdb-high-cardinality-indexing-ready-testing/
According to the documentation, the upper bound [for series or unique tag values] is usually somewhere between 1 and 4 million series, depending on the machine used, which is easily a day's worth of high-resolution data.
I'm a software engineering student and new to data mining. I want to implement a solution to find similar users based on their interests and skills (sets of strings).
I think I cannot use k-nearest neighbors with an edit distance (Levenshtein or similar).
Could someone help me with that, please?
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skill and interest strings into a dictionary D. Now, for each user u, construct a vector v(u) with n components, whose i-th component is set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially, we represent each user by a characteristic vector of her interests/skills.
Now you can compare users with the Jaccard index (it's just an example; you'll have to figure out what works best for you). With a notion of distance in hand, you can start trying out various approaches (a short sketch follows the list below). Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
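Here is a minimal sketch of the characteristic-vector + Jaccard + hierarchical-clustering route using NumPy/SciPy; the users, skills, and the 0.8 distance cut-off are purely illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative user -> skills/interests sets
users = {
    "alice": {"python", "machine learning", "statistics"},
    "bob":   {"python", "statistics", "data mining"},
    "carol": {"java", "android", "kotlin"},
}

# Dictionary D of all distinct properties, and one characteristic vector per user
vocab = sorted(set().union(*users.values()))
index = {skill: i for i, skill in enumerate(vocab)}
vectors = np.zeros((len(users), len(vocab)), dtype=bool)
for row, skills in enumerate(users.values()):
    for s in skills:
        vectors[row, index[s]] = True

# Jaccard distance between all user pairs, then hierarchical clustering
dist = pdist(vectors, metric="jaccard")
clusters = fcluster(linkage(dist, method="average"), t=0.8, criterion="distance")
print(dict(zip(users, clusters)))   # e.g. {'alice': 1, 'bob': 1, 'carol': 2}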
Suppose I have a bunch of User nodes, each of which has a property named gender that can be male or female. Now, in order to group users by gender, I have two choices of structure:
1) Add an index to the gender property, and use a WHERE clause to select users of a given gender.
2) Create a Male node and a Female node, with edges linking them to the relevant users. Then, whenever querying by gender, I use a pattern such as (:Male)-[]->(:User).
My question is, which one is better?
Indices should never be a replacement for putting things in the graph.
Indexing is great for looking up unique values and, in some cases, groups of values; however, with the caching that Neo4j can do (and the flexibility you get from modeling your domain in the graph), it is often better to express such categories directly in the graph.
Indexing a property that has only two (give or take) distinct values is not the best use of an index and likely won't net much of a performance boost, given the number of results per property value.
That said, going with option #2 can create supernodes, a bottlenecking issue which can become a major headache depending on your model.
Maybe consider using labels (:Male and :Female, for example), as they are essentially "schema indexes". Also keep in mind that you can use multiple labels per node, so you could have (user:User:Male), etc. This also helps you avoid supernodes while not creating a classic or "legacy" index.
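To illustrate the difference, here is a hedged sketch using the official Neo4j Python driver (connection details are placeholders) that compares the property filter with the label-based query:

from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Option 1: gender stored as a property -- filtered with WHERE
    prop_count = session.run(
        "MATCH (u:User) WHERE u.gender = 'male' RETURN count(u) AS c"
    ).single()["c"]

    # Label-based alternative: the :Male label itself acts as a "schema index"
    label_count = session.run(
        "MATCH (u:User:Male) RETURN count(u) AS c"
    ).single()["c"]

    print(prop_count, label_count)

driver.close()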
HTH