How should browser and version be one-hot encoded? - machine-learning

I need to output browser and version data for one-hot encoding. We have come up with a few options (outlined below). I did some searching but couldn't find any existing examples of someone with similar data (I searched Kaggle Datasets and DuckDuckGo).
Option 1: One column with browser name and version joined together
e.g. "browser_version" column values: "Safari-1.2.3", "Chrome-4.5.6", "Firefox-7.8.9"
| order_id | browser_version |
| 1 | Safari-1.2.3 |
| 2 | Chrome-4.5.6 |
| 3 | Firefox-7.8.9 |
Option 2: Two columns: one with browser name, another with browser version
e.g. "browser" (column 1) values: "Safari", "Chrome", "Firefox"
e.g. "version" (column 2) values: "1.2.3", "4.5.6", "7.8.9"
| order_id | browser | version |
| 1 | Safari | 1.2.3 |
| 2 | Chrome | 4.5.6 |
| 3 | Firefox | 7.8.9 |
Option 3: Two columns: one with browser name, another with browser name and version joined together
e.g. "browser" (column 1) values: "Safari", "Chrome", "Firefox"
e.g. "browser_version" (column 2) values: "Safari-1.2.3", "Chrome-4.5.6", "Firefox-7.8.9"
| order_id | browser | browser_version |
| 1 | Safari | Safari-1.2.3 |
| 2 | Chrome | Chrome-4.5.6 |
| 3 | Firefox | Firefox-7.8.9 |
What is the most beneficial way to set up the data values (assuming a CSV file with columns) for one-hot encoding?
I suppose the correct answer might be to test each option and check the results, but this is likely something that has been done before, so I figured it's worth asking.

I would use the first option. It gives one index per (browser, version) pair.
The second option puts version numbers of different browsers in the same column, even though these numbers are not comparable: you can compare a Chrome version number with another Chrome version number, but not with a Firefox one.
And the third option contains the first one, plus additional redundant data.
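For what it's worth, a minimal sketch of Option 1 with pandas (column names taken from the example above; get_dummies is just one of several ways to one-hot encode):
import pandas as pd

# Option 1: one combined browser-version column, one-hot encoded
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "browser_version": ["Safari-1.2.3", "Chrome-4.5.6", "Firefox-7.8.9"],
})

encoded = pd.get_dummies(df, columns=["browser_version"], prefix="bv")
print(encoded.columns.tolist())
# ['order_id', 'bv_Chrome-4.5.6', 'bv_Firefox-7.8.9', 'bv_Safari-1.2.3']
Each distinct browser-version pair becomes its own 0/1 column, which is exactly the "one index per pair" behaviour described above.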

How to achieve conditional formatting of names between pages?

I have a Google Sheet with one page where team leaders list their desired team members (first names, last names, emails) by row (each TL fills in one row), and a second page listing the team members who have actually registered with my program.
Page 1
+------------------------+------------+---------------+------------+--------------------+
| Team Leader First Name | First Name | Email Address | First Name | Email Address |
+------------------------+------------+---------------+------------+--------------------+
| Danielle | Elizabeth | XXX#tamu.edu | Matthew | XXX#tamu.edu |
| Stoian | William | XXX#tamu.edu | Victoria | XXX#email.tamu.edu |
| Christa | Olivia | XXX#tamu.edu | | |
+------------------------+------------+---------------+------------+--------------------+
Page 2
+--------------------+-------------------------+
| Scholar First Name | Scholar Preferred Email |
+--------------------+-------------------------+
| elizabeth | xxx#gmail.com |
| william | xxx#tamu.edu |
+--------------------+-------------------------+
I want to be able to see at a glance which of the names listed by the TL on pg 1 have not registered and thus don't appear on pg 2.
In the example above, I want Olivia, Matthew, and Victoria's names to appear red because they do not show up on pg 2 (which means they still need to register). Everyone else should appear normally.
At first I tried IMPORTRANGE from pg 1 to get a clean list of the team members, then conditional formatting to match against pg 2, the idea being that a name shows up red if it is not found.
Import the scholar first names from page 2 into page 1 with IMPORTRANGE, e.g. into F12:F14.
Conditional Formatting: Apply to range B2:B999 (the first-name list on page 1)
=NOT(OR(ISNUMBER(MATCH(TRIM(B2),$F$12:$F$13,0)),ISBLANK(B2)))
Conditional Formatting 2: Apply to range D2:D999 (the second first-name list)
=NOT(OR(ISNUMBER(MATCH(TRIM(D2),$F$12:$F$13,0)),ISBLANK(D2)))
Note: instead of importing, you could also reference the second sheet using INDIRECT.

Losing a column when running the dask to_parquet method with the partition_on option

I have data that I need to optimize in order to perform a groupby.
Currently I have the data in several parquet files (over 2.5B rows), which look as follows:
ID1 | ID2 | Location |
AERPLORDRVA | AOAAATDRLVA | None
ASDFGHJHASA | QWEFRFASEEW | home
I'm adding another column so I can resave the file with partitions (and also append to it), which will hopefully help with the groupby:
df['ID4'] = df.ID1.apply(lambda x: x[:2])
When I view the df I see the column like this
ID1 | ID2 | Location | ID4
AERPLORDRVA | AOAAATDRLVA | None | AE
ASDFGHJHASA | QWEFRFASEEW | home | AS
....
But when I run the following code, the ID4 column changes:
dd.to_parquet(path2newfile, df, compression='SNAPPY', partition_on=['ID4'], has_nulls=['Location'], fixed_text={'ID1': 11, 'ID2': 11, 'ID4': 2})
into this, after reading the data back:
df2 = dd.read_parquet(path2newfile)
ID1 | ID2 | Location | dir0
AERPLORDRVA | AOAAATDRLVA | None | ID4=AE
ASDFGHJHASA | QWEFRFASEEW | home | ID4=AS
....
Any ideas?
I was planning to include ID4 in the groupby and thus improve the efficiency of the query:
dfc = df.groupby(['ID4','ID1','ID2']).count()
I'm working on a single workstation with 24 cores and 190 GB of RAM (although the dask cluster only recognizes 123.65 GB).
This was a bug in how directory names were parsed: apparently you are the first to use a field name containing numbers since the addition of the option for "drill"-style directory partitioning.
The fix is here: https://github.com/dask/fastparquet/pull/190 and was merged into master on 30-Jul-2017, and will eventually be released.
For the time being, you could rename your column so that it does not include numbers.
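For example, a minimal sketch of that workaround, reusing the call from the question (df and path2newfile as above); the digit-free column name id_prefix is just an illustration:
import dask.dataframe as dd

# Same logic as before, but the partition column name contains no digits,
# which sidesteps the directory-name parsing bug described above.
df['id_prefix'] = df.ID1.apply(lambda x: x[:2])   # was df['ID4']

dd.to_parquet(path2newfile, df,
              compression='SNAPPY',
              partition_on=['id_prefix'],
              has_nulls=['Location'],
              fixed_text={'ID1': 11, 'ID2': 11, 'id_prefix': 2})

df2 = dd.read_parquet(path2newfile)   # 'id_prefix' should now survive the round trip
dfc = df2.groupby(['id_prefix', 'ID1', 'ID2']).count()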

How do I count all cells in a column that have emoji?

I have a problem with emoji in my production database. Since it's in production, all I get out of it is an auto-generated Excel spreadsheet (.xls) every so often, with tens of thousands of rows. I use Google Sheets to parse this so I can easily share the results.
What formula can I use to get a count of all cells in column n that contain emoji?
For instance:
Data
+----+-----------------+
| ID | Name |
+----+-----------------+
| 1 | Chad |
+----+-----------------+
| 2 | ✨Darla✨ |
+----+-----------------+
| 3 | John Smith |
+----+-----------------+
| 4 | Austin ⚠️ Powers |
+----+-----------------+
| 5 | Missus 🎂 |
+----+-----------------+
Totals
+----------------------------------+---+
| People named Chad | 1 |
+----------------------------------+---+
| People with emoji in their names | 3 |
+----------------------------------+---+
Edit by Ben C. R. Leggiero:
=COUNTA(FILTER(A2:A6;REGEXMATCH(A2:A6;"[^\x{0}-\x{F7}]")))
This should work:
=arrayformula(countif(REGEXMATCH(A2:A6,"[^a-zA-Z\d\s:]"),true))
You cannot extract emoji with a regular formula because Google Sheets uses the lightweight RE2 regex engine, which lacks many features, including those needed to match emoji.
What you need to do is create a custom function. Select the Tools menu, then Script editor.... In the script editor, add the following:
function find_emoji(s) {
  // Returns the first emoji found in a cell (as a match array), or an array of
  // per-cell matches when given a range. The surrogate-pair range at the end
  // ([\uD83C-\uDBFF\uDC00-\uDFFF]+) is what matches most emoji.
  var re = /[\u1F60-\u1F64]|[\u2702-\u27B0]|[\u1F68-\u1F6C]|[\u1F30-\u1F70]|[\u2600-\u26ff]|[\uD83C-\uDBFF\uDC00-\uDFFF]+/i;
  if (s instanceof Array) {
    return s.map(function (el) { return el.toString().match(re); });
  } else {
    return s.toString().match(re);
  }
}
Save the script. Go back to your spreadsheet, then test your formula =find_emoji(A1)
My test yields the following:
| Missus 🎂 | 🎂 |
| Austin ⚠️ Powers | ⚠ |
| ✨Darla✨ | ✨ |
| joke 😆😆 | 😆😆 |
And, to count the entries that contain emoji, you can use this formula:
=countif( arrayformula(isblank( find_emoji(filter(F2:F,not(isblank(F2:F)))))), FALSE)
EDIT
I was wrong. You can use a regular formula to extract emoji. The regex syntax is: [\x{1F300}-\x{1F64F}]|[\x{2702}-\x{27B0}]|[\x{1F68}-\x{1F6C}]|[\x{1F30}-\x{1F70}]|[\x{2600}-\x{26ff}]|[\x{D83C}-\x{DBFF}\x{DC00}-\x{DFFF}]
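If you'd rather do the count outside Sheets, here is a minimal sketch in Python (the file name export.xls and the column name "Name" are assumptions based on the question; the ranges mirror the ones above):
import re
import pandas as pd

# Regex covering the common emoji blocks used in the answer above.
emoji_re = re.compile(
    "[\U0001F300-\U0001F64F]"   # symbols, pictographs, emoticons
    "|[\U0001F680-\U0001F6FF]"  # transport and map symbols
    "|[\u2600-\u26FF]"          # miscellaneous symbols (e.g. ⚠)
    "|[\u2702-\u27B0]"          # dingbats (e.g. ✨)
)

df = pd.read_excel("export.xls")                            # reading .xls needs xlrd
count = df["Name"].astype(str).str.contains(emoji_re).sum()
print(count)                                                # 3 for the sample data above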

SpecFlow Feature-level Templates

I'm trying to execute an entire SpecFlow Feature using three different UserID/Password combinations. I'm struggling to find a way to do this in the .feature file without having to introduce any loops in the MSTest.
On the Scenario level I'm doing this:
Scenario Template: Verify the addition functionality
Given the value <x>
And the value <y>
When I add the values together
Then the result should be <z>
Examples:
|x|y|z|
|1|2|3|
|2|2|4|
|2|3|5|
Is there a way to do a similar table at the feature level that will cause the entire feature to be executed for each row in the table?
Is there other functionality available to do the same thing?
I don't think the snippet you have is working, is it? I've updated it below with the corrections I think you need (as Fresh also points out) and a couple of possible improvements.
With this snippet, you'll see that the scenario is run for each line in the table of examples. So, the first test will connect with 'Bob' and 'password', ask your tool to add 1 and 2 and check that the answer is 3.
I've also added an ID column - that is optional but I find it much easier to read the results with an ID number.
Scenario Outline: Verify the addition functionality
Given I am connecting with <username> and <password>
When I add <x> and <y> together
Then the result should be <total>
Examples:
| ID | username | password | x | y | total |
| 1 | Bob | password | 1 | 2 | 3 |
| 2 | Helen | Hello123 | 1 | 2 | 3 |
| 3 | Dave | pa£sword | 1 | 2 | 3 |
| 4 | Bob | password | 2 | 3 | 5 |
| 5 | Helen | Hello123 | 2 | 3 | 5 |
| 6 | Dave | pa£sword | 2 | 3 | 5 |
| 7 | Bob | password | 2 | 2 | 4 |
| 8 | Helen | Hello123 | 2 | 2 | 4 |
| 9 | Dave | pa£sword | 2 | 2 | 4 |
"Is there a way to do a similar table at the feature level that will
cause the entire feature to be executed for each row in the table?"
No, SpecFlow (and indeed the Gherkin language) doesn't have a concept of a "Feature Outline", i.e. a way of specifying a collection of features which should be run in their entirety.
You could possibly achieve what you are looking for by making use of SpecFlow tags to tag related scenarios. You could then use your test runner to trigger the testing of all the scenarios with that tag, e.g.
@related
Scenario: A
Given ...etc...
@related
Scenario: B
Given ...etc.
SpecFlow+ Runner (aka SpecRun, http://www.specflow.org/plus/) provides infrastructure (called test targets) to run the same test suite (or selected scenarios) with different settings. With this you can solve problems like the one you have mentioned. It can also be used to run the same web tests with different browsers, etc. Check this screencast for details: http://www.youtube.com/watch?v=RZYV4Dvhw3w

How does revision control work on Quora? Database design

Well, I've seen some plugins that create a versions table to keep track of modifications on specific models, but I can't easily do what Quora shows.
What I have so far is a table like this:
id
item_type: specifies which model the revision refers to, e.g. "Topic"
item_id
event: what happened: "edited", "added", "reverted", "removed"
who: who triggered the event
column: which column in "Topic" changed, e.g. "topic.photo_url"
new: the new value, e.g. "http://s3.amazonaws.../pic.png"
old: the old value, e.g. "http://s3.amazonaws.../oldpic.png"
revision_rel: points to the previous revision
timestamp
Could someone give me some help and guidelines with this design? I'm worried about performance, wrong columns, missing columns, etc.
id | item_type | item_id | event | who | column | new | old | revision_rel | date
________________________________________________________________________________________________________
1 | Topic | 2 | edit | Luccas | photo | pic.png | oldpic.png | null | m:d:y
2 | Topic | 2 | revert | Chris | photo | oldpic.png | pic.png | 1 | m:d:y
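For illustration only, a minimal Python sketch of the row described above and of how a "revert" entry could be derived from an earlier revision (field names mirror the question's table; this is not Quora's actual schema):
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Revision:
    id: int
    item_type: str                 # e.g. "Topic"
    item_id: int
    event: str                     # "edited", "added", "reverted", "removed"
    who: str
    column: str                    # e.g. "topic.photo_url"
    new: Optional[str]
    old: Optional[str]
    revision_rel: Optional[int]    # id of the revision being reverted, if any
    timestamp: datetime

def revert(target: Revision, next_id: int, who: str) -> Revision:
    # Undo `target` by swapping its old/new values and pointing back at it,
    # mirroring the revert row in the example table above.
    return Revision(
        id=next_id,
        item_type=target.item_type,
        item_id=target.item_id,
        event="reverted",
        who=who,
        column=target.column,
        new=target.old,
        old=target.new,
        revision_rel=target.id,
        timestamp=datetime.utcnow(),
    )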
There are some gems available that already do what you are looking for. Take a look at the existing gems listed here:
https://www.ruby-toolbox.com/categories/Active_Record_Versioning
I am using audited (previously acts_as_audited) for something very similar:
https://github.com/collectiveidea/audited
