Duplicate records in data for Myrrix? - mahout

Can someone help me understand how Myrrix handles duplicated records in the input data?
What would the result be for implicit as well as explicit feedback data? What if duplicated explicit-feedback records have different ratings?

Data is always additive. So "user,item,X" and "user,item,Y" is (essentially) the same as "user,item,X+Y". Input without a value is considered to have value 1.
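To make the additive behaviour concrete, here is a minimal Python sketch of the semantics described above; it is an illustration only, not Myrrix code, and the user/item IDs are made up:

```python
from collections import defaultdict

# Illustration of additive handling of duplicate (user, item) records.
raw_input = [
    ("u1", "i1", 2.0),   # explicit rating
    ("u1", "i1", 3.0),   # duplicate record for the same pair
    ("u2", "i1", None),  # implicit feedback: no value given
    ("u2", "i1", None),  # duplicate implicit record
]

totals = defaultdict(float)
for user, item, value in raw_input:
    # A record without a value counts as 1
    totals[(user, item)] += 1.0 if value is None else value

print(dict(totals))
# {('u1', 'i1'): 5.0, ('u2', 'i1'): 2.0}
```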

Related

How can I one-hot encode data that has the same values spread across multiple properties?

I have data containing candidates who are looking for a job. The original data I got was a complete mess, but I managed to clean it up. Now I am facing an issue which I am not able to resolve.
One candidate record looks like
https://i.imgur.com/LAPAIbX.png
Since most ML algorithms cannot work directly with categorical data, I want to encode this. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need is to add a new column for each distinct value that exists in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1, and Tag2 of the original data, without repetition. The way I tried gives me far more attributes than I need, which results in an inaccurate model: I end up with newly created attributes Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3, and so on, for each possible option.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem based on your explanation, so I will elaborate on how I would approach it. If that doesn't solve it, I may need more detail. Let's get started.
- For all the candidate data you have, collect a master skill/knowledge list.
- This list becomes your columns.
- For each candidate, if they have a given skill, that column becomes 1 for their record; otherwise it stays 0.
This is the essence of one-hot encoding; however, since the same skill is scattered across multiple columns, you are struggling to encode it automatically.
An alternative approach could be:
- For each candidate, collect all the knowledge skills as a list and assign it to one Knowledge column, and the tags as another list in a second column, instead of the current 4 (Knowledge) + 2 (Tag) columns.
- Sort the knowledge (and tag) list alphabetically within this column.
- Automatic one-hot encoding after this may yield fewer columns than before (a sketch of the master-list approach follows below).
Hope this helps!
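As a concrete illustration of the master-list idea, here is a minimal pandas sketch; the column names mirror the question, but the candidate values are hypothetical:

```python
import pandas as pd

# Hypothetical candidate data mirroring the Knowledge1..Knowledge4 and
# Tag1..Tag2 layout from the question.
df = pd.DataFrame({
    "Candidate": ["A", "B"],
    "Knowledge1": ["JScript", "Python"],
    "Knowledge2": ["SQL", "JScript"],
    "Knowledge3": ["Python", None],
    "Knowledge4": [None, "SQL"],
})

skill_cols = ["Knowledge1", "Knowledge2", "Knowledge3", "Knowledge4"]

# Master list: every distinct skill, regardless of which KnowledgeN
# column it originally appeared in.
skills_per_candidate = df[skill_cols].apply(lambda row: set(row.dropna()), axis=1)
master_skills = sorted(set().union(*skills_per_candidate))

# One column per skill: 1 if the candidate has it, else 0.
for skill in master_skills:
    df[skill] = skills_per_candidate.apply(lambda s: int(skill in s))

df = df.drop(columns=skill_cols)  # Tag1/Tag2 would be handled the same way
print(df)
```

This yields one JScript column, one Python column, and one SQL column, rather than a Jscript_Knowledge1, Jscript_Knowledge2, ... column per source attribute.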

How do I fix inconsistent types in InfluxDB?

In InfluxDB (1.5), I have a table where the fields have become inconsistently typed. Most rows in the table are integers; however, some have become strings.
How is this possible? I thought that once a field's type was set (upon first insert), any insert into the table with an incorrect type would fail.
What do I do now? If I go back and attempt to overwrite the data in the inconsistent rows, I get errors saying the field is a string.
After some more research, here's what I've discovered:
Answer to Part 1:
InfluxDB uses a system they refer to as 'sharding' - while I don't know the specifics, I do know that data from the same measurement/table can be stored across multiple, different 'shards'.
According to the InfluxDB documentation, field types can differ between these shards, within the same field, on the same table.
Answer to Part 2:
In order to fix this, the currently suggested approach is to make a new table, download all the data, and re-insert it while ensuring everything that gets inserted has the proper type.
If you had a tag which changed type and became a field, this can be especially difficult to fix; the link above does not address that case. To select only a tag or only a field, you can use tag_name::tag or field_name::field within a SELECT statement.
The GROUP BY * clause suggested in the link is required in order to preserve tags, but seemed to cause issues when I used it.
My current solution is a PHP script that uses curl to download the points, chunks them, and re-inserts them into the new table, ensuring each point is cast to the new, uniform type before insertion.
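For reference, a minimal Python sketch of the same download-and-recast loop, using the influxdb-python client for InfluxDB 1.x; the measurement and field names here are hypothetical, and the int cast stands in for whatever uniform type you choose:

```python
from influxdb import InfluxDBClient  # influxdb-python client, InfluxDB 1.x

# Hypothetical names throughout; "value" is the field whose type drifted.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# '::field' pins the selector to the field (not a same-named tag), and
# GROUP BY * keeps each point's tags attached to its series.
result = client.query('SELECT "value"::field FROM "old_measurement" GROUP BY *')

points = []
for (_, tags), rows in result.items():
    for row in rows:
        points.append({
            "measurement": "new_measurement",
            "tags": tags or {},
            "time": row["time"],
            # Cast every value to the single type the new table should use.
            "fields": {"value": int(float(row["value"]))},
        })

client.write_points(points, batch_size=5000)
```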
The best way to stop future issues is simply not to have them. I went looking for a way to lock field types in all cases, across all shards, for a particular measurement table.
Unfortunately, it seems impossible to guarantee 100% type consistency across all current and future shards. "Don't make mistakes because it's really difficult to clean up" seems to be InfluxDB's modus operandi.

Assign Key Field Value Only If Corresponding Lookup Result Value Exists

I have ten master tables and one transaction table. In my transaction table (a memory table, much like a ClientDataSet) there are ten lookup fields pointing to my ten master tables.
Now I am trying to dynamically assign key field values to all the lookup key fields of the transaction table from a different server (the data arrives as SOAP XML). Before assigning these values, I need to check whether the corresponding result value is valid in the master tables. I am using a filter (e.g. status = 1) to check whether it is valid.
Currently, before assigning each key field value, we filter the master table using this filter and use the Locate function to check whether the value is there; if it is located, we assign its key field value.
This works fine if there are only a few records in my master tables. But consider master tables with fifty thousand records each (yes, the customer has that much data); this leads to a big performance issue.
Could you please help me handle this situation?
Thanks
Basil
The only way to know if it is slow, why, where, and what solution works best is to profile.
Don't make a priori assumptions.
That being said, minimizing round trips to the server and the amount of data transferred is often a good thing to try.
For instance, if your master tables are on the server (not 100% clear from your question), sending only one query (or stored procedure call), passing all the values to check at once as parameters, doing a bunch of "IF EXISTS..." checks, and returning all the answers at once (either as output parameters or as a one-record dataset) would be a good start.
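A minimal sketch of that single-round-trip idea, in Python with an in-memory SQLite table standing in for one master table (all names are made up):

```python
import sqlite3

# One round trip: check all incoming key values against a master table
# at once, instead of filtering + Locate() once per value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO master VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])

keys_to_check = [1, 2, 3, 99]
placeholders = ",".join("?" * len(keys_to_check))
valid = {row[0] for row in conn.execute(
    f"SELECT id FROM master WHERE status = 1 AND id IN ({placeholders})",
    keys_to_check)}

for key in keys_to_check:
    print(key, "assign" if key in valid else "skip")
# 1 assign, 2 skip (status 0), 3 assign, 99 skip (not found)
```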
And 50,000 records is not much, so, as I said initially, you may not even have a performance problem. Check it first!

Rails 3: Compare unique codes

What's the best way to guarantee that a code is unique? The code has the form XXX-XXXXX, where each X is a digit.
Other than searching for the code in a database table, is there a way to make the process faster and cleaner?
Regards.
The normal approach is to use the :uniqueness validation, which handles the database search for you.
More bulletproof is to combine that validation with a unique index on the field. If the save fails without validation errors, you can generate a new code and try again.
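Here is a minimal sketch of that generate-and-retry pattern, in Python with SQLite standing in for the Rails model and migration (the table and helper names are hypothetical):

```python
import random
import sqlite3

# "Uniqueness validation + unique index + retry": in Rails the index
# would come from a migration and the INSERT from the model's save.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coupons (code TEXT NOT NULL UNIQUE)")

def generate_code():
    # XXX-XXXXX where each X is a digit
    return f"{random.randint(0, 999):03d}-{random.randint(0, 99999):05d}"

def create_unique_code(max_attempts=5):
    for _ in range(max_attempts):
        code = generate_code()
        try:
            conn.execute("INSERT INTO coupons (code) VALUES (?)", (code,))
            return code  # the unique index guarantees no duplicate slipped in
        except sqlite3.IntegrityError:
            continue  # collision: generate a new code and try again
    raise RuntimeError("could not generate a unique code")

print(create_unique_code())
```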
Since no two times are the same, using some kind of hash based on time is the easiest way to guarantee uniqueness. If you are storing XXX-XXXXX, though, you are limiting yourself. You may also use a unique auto-incrementing value: store the next number to be assigned server-side and increment it whenever you issue a new unique ID.
Both are acceptable options without knowing additional information.
A hash based on time is actually not "guaranteed" to be unique. A hash is just a way to create a digest from a larger source; since every input is reduced to 128 bits (using MD5), it is possible to encounter hash collisions.
The validates :uniqueness check runs a query to determine whether the field's value has been used before. You can use this, but it should not be your only safeguard. If the field is intended to be unique, you should place a unique index on the column in the database. If you rely only on the Rails validation, you run the risk of a race condition on insertion: two concurrent writes can each pass the validation and both end up in the table.
Are you generating the value or is it user input?

How to remove duplicate records in grid?

Good morning!
What is the best way to remove duplicate records from a grid control? I use Delphi 2009 and the DevExpress QuantumGrid component.
I tried looping through all the records, adding each duplicate found to a list, and then applying a filter on the grid. I found this logic time-consuming, and there are two other downsides to this approach:
[1] When there are considerably many duplicate records, say 10K, applying the filter takes a lot of time because of the many entries to filter out.
[2] Looping through all the records is itself time-consuming for a big result set, such as 1M rows.
The SQL query returns distinct rows, but when the user hides a column in the grid, it looks as if there are duplicate records (internally they are still distinct).
Is there any other way of doing this?
Any ideas on this would be greatly appreciated!
Thanks & Regards,
Pavan.
Can you alter your dataset so it does not return duplicate records in the first place? I would normally return only the records I want displayed, instead of fetching unwanted records from the database and then using a database grid to try to suppress them.
With thousands of rows, I would add an additional field to the DB called, say, Sum or Hash; or, if you can't change the DB and it is a ClientDataSet, add a calculated field (though this carries overhead at display time).
Calculate the contents of that field with something fast and simple, like a sum of all the chars in your text field. Candidate dupes are then easily identified (see the sketch below). Add this field to your unique/distinct query parameters, or filter on it.
Just an Idea.
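A minimal Python sketch of that digest idea (the rows are made up; note that a plain character sum collides on anagrams, so matching digests only mark candidates that still need a full comparison):

```python
from collections import defaultdict

# Bucket rows by a cheap digest; only rows sharing a digest need a full
# comparison. "alpha" and "ahpla" collide, hence the confirmation step.
rows = ["alpha", "beta", "alpha", "ahpla"]

buckets = defaultdict(list)
for i, text in enumerate(rows):
    digest = sum(ord(c) for c in text)  # the "Sum" helper field
    buckets[digest].append(i)

for digest, indexes in buckets.items():
    if len(indexes) > 1:
        confirmed = [i for i in indexes if rows[i] == rows[indexes[0]]]
        print(f"digest {digest}: candidates {indexes}, true dupes {confirmed}")
# digest 518: candidates [0, 2, 3], true dupes [0, 2]
```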
Checking for duplicates is always a bit tricky, for the reasons you just mentioned. The best way to do it in this particular case is probably to filter before the data reaches the grid.
If this grid is getting its records from a database, try tweaking your SQL query to not return any duplicate records. (The "distinct" keyword can be useful here.) The database server can usually do a much better job of it than you can.
If not, then you're probably loading your result set from some sort of object list. Try filtering the list and culling duplicate objects before you load it into the grid. Then it's over with and you don't have to filter the grid itself. This is a lot less time-consuming.
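For the object-list case, here is a minimal Python sketch of culling duplicates before the grid load, keyed on the columns that are actually visible (the field names follow the ID/TXT example quoted further below):

```python
# Cull duplicates before the grid sees them, keeping the first occurrence
# and preserving order.
records = [
    {"ID": 1, "TXT": "apple"},
    {"ID": 2, "TXT": "banana"},
    {"ID": 3, "TXT": "apple"},   # looks duplicated once ID is hidden
]

seen = set()
visible = []
for rec in records:
    key = rec["TXT"]  # in general: a tuple of the currently visible columns
    if key not in seen:
        seen.add(key)
        visible.append(rec)

print(visible)  # [{'ID': 1, 'TXT': 'apple'}, {'ID': 2, 'TXT': 'banana'}]
```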
I have worked with DevExpress's QuantumGrid for some time, and their support forum http://www.devexpress.com/Support/Center/ is excellent. When you post questions, the DevExpress staff will answer you directly. With that said, I did a quick search for you and found some relevant articles:
How to hide duplicate row values: http://www.devexpress.com/Support/Center/p/Q142379.aspx?searchtext=Duplicate+Rows&p=T1|P0|83
Highlight duplicate records in a grid: http://www.devexpress.com/Support/Center/p/Q98776.aspx
Unfortunately, it looks like you will have to iterate through the table in order to hide duplicate values. I would suggest that you try to clean the data up before it reaches the grid. Ideally you would update the code/SQL that produces the data. If that is not possible, you could write a TcxCustomDataSource that scrubs the data when it is first loaded. This should perform better because you will not be using the grid's API to access the data.
Edit
ExpressQuantumGrid will not automatically hide rows that look like duplicates because the user hid a column. See: http://www.devexpress.com/Support/Center/p/Q205956.aspx?searchtext=Duplicate+Rows&p=T1|P0|83.
Poster
For example, I have a dataset which contains two fields, ID and TXT. ID is a unique field and the TXT field may contain duplicate values. So when the dataset is connected to the grid with all columns visible, the records are unique (see image1.bmp). But if I hide the ID column, the grid shows duplicate rows (see image2.bmp).
DevExpress Team
I'm sorry, but our ExpressQuantumGrid Suite doesn't support such functionality, because this task is very specific. However, you can implement it manually.
