Interaction effect or split sample - interaction

I have a dataset where I want to test the effect of a change in regulation, and whether this effect differs between Big 4 and non-Big 4 auditors. So I have a dummy = 1 if the firm is audited by a Big 4 auditor, and a dummy for the regulation change (reg = 1 if the firm complies with the new regulation).
First, I ran this regression in 2 different subsamples: one for companies audited by a Big 4 auditor and one for companies audited by non-Big 4 auditors. Next, I ran the same regression on the full sample with an interaction between the 2 dummies.
Now, my results change depending on which method I choose. The interaction big4*reg is significant (using the full sample), but the variable reg is not significant in the big 4 sample. Which method is correct?
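For concreteness, here is a minimal sketch of the two setups (in Python with statsmodels, using made-up data and hypothetical column names y, reg and big4; this is only to make the comparison concrete, not an answer):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data, purely for illustration: y is the outcome, reg the regulation dummy,
# big4 the auditor dummy. Column names are hypothetical, not from the original post.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "reg": rng.integers(0, 2, n),
    "big4": rng.integers(0, 2, n),
})
df["y"] = 1.0 * df["reg"] * df["big4"] + rng.normal(size=n)  # effect only for Big 4 firms

# (1) Split-sample approach: one regression per auditor type
m_big4 = smf.ols("y ~ reg", data=df[df["big4"] == 1]).fit()
m_non = smf.ols("y ~ reg", data=df[df["big4"] == 0]).fit()

# (2) Full-sample approach with an interaction term (reg + big4 + reg:big4)
m_full = smf.ols("y ~ reg * big4", data=df).fit()

print(m_big4.params["reg"], m_non.params["reg"])   # reg effect in each subsample
print(m_full.params["reg:big4"])                   # difference in the reg effect between groups

Note that the interaction coefficient in (2) estimates the difference between the two subsample slopes, while the split-sample regressions also let every other coefficient differ between groups, which is one reason the significance conclusions can diverge.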

Related

How to create a generalized dataset to detect all display digits with Roboflow

I want to detect digits on a display. For doing that I am using a custom 19-class dataset. The chosen model is YOLOv5-X. The resolution is 640x640. Some of the objects are:
0-9 digits
Some text as objects
Total --> 17 classes
I am having problems detecting all the digits when I want to detect 23, 28, or 22, for example. If the digits are very close to each other, the model struggles.
I am using Roboflow to create different folders in which I apply some preprocessing steps, so I have full control over what I am feeding into the model. All of them are checked and collected in a new folder called TRAIN_BASE. In total I have 3500 images with digits, and most of the variance is in hue and brightness.
Any advice on how to make the model catch all the digits despite them being so close to each other?
Here are the steps I follow:
First of all, using a mosaic dataset was not a good choice for the purpose of detecting digits on a display, because in a real scenario I was never going to find pieces of digits. That made the model fail to recognize some digits when it was not sure.
[Image: example of the digits problem concept]
Another big improvement was to change the anchor boxes of the YOLO model to adapt them to small objects. To find out which anchor boxes I needed, just adding this argument to train.py in the script provided by Ultralytics is enough to print custom anchors, which you can then add to your custom architecture.
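If it is useful, here is a rough sketch of the idea behind custom anchors (this is not the Ultralytics autoanchor script itself, and the box data below is made up): cluster the widths and heights of your labelled digit boxes at the training resolution and use the cluster centres as anchors.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: (width, height) in pixels of every labelled digit box at 640x640.
wh = np.random.default_rng(0).uniform(8, 60, size=(500, 2))

km = KMeans(n_clusters=9, n_init=10, random_state=0).fit(wh)  # 9 anchors, as in the default YOLOv5 head
anchors = km.cluster_centers_[np.argsort(km.cluster_centers_.prod(axis=1))]  # sort by box area
print(np.round(anchors))  # candidate anchors to place in the model config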
To check which augmentations can be helpful and which cannot, the following article explains it quite visually.
P.S.: Thanks for the quick responses and the help the community gave me.

Which reinforcement algorithm to use for binary classification

I am new to machine learning, but I've read a lot about reinforcement learning in the past 2 days. I have an application that fetches a list of projects (e.g. from Upwork). A moderator manually accepts or rejects each project (based on some parameters explained below). If a project is accepted, I want to send a project proposal, and if it is rejected, I'll ignore it. I am looking to replace that moderator with AI (among other reasons), so I want to know which reinforcement learning algorithm I should use for this.
Parameters:
Some of the parameters that should decide whether the agent accepts or rejects the project are listed below. Assuming I only want to accept projects related to web development (specifically backend/server-side), here is how the parameters should influence the agent.
Sector: If the project is related to the IT sector, it should have a higher chance of being accepted.
Category: If the project is in the Web Development category, it should have a higher chance of being accepted.
Employer Rating: Employers with a rating of over 4 (out of 5) should have a higher chance of being accepted.
I thought Q-Learning or SARSA would be able to help me out, but most of the examples that I saw were related to the Cliff Walking problem, where the states depend on each other. That is not applicable in my case, since each project is independent of the previous one.
Note: I want the agent to be self-learning so that if in the future I start rewarding it for front-end projects too, it should learn that behavior. Therefore, suggesting a "pure" supervised learning algorithm won't work.
Edit 1: I would like to add that I have data (sector, category, title, employer rating etc.) of 3000 projects along with whether that project was accepted or rejected by my moderator.
Your problem can easily be solved using Q-learning. It just depends on how you design your problem. Reinforcement learning itself is a very robust framework that allows an agent to receive states from an environment and then perform actions given those states. Depending on those actions, it gets rewarded accordingly. For your problem, the structure will look like this:
State
States: 3 x 1 matrix. [Sector, Category, Employer Rating]
The sector state is an integer, where each integer represents a different sector. For example, 1 = IT Sector, 2 = Energy, 3 = Pharmaceuticals, 4 = Automotive, etc.
The category state is likewise an integer, where each integer represents a different category. For example, 1 = Web Development, 2 = Hardware, etc.
The employer rating state is again an integer between 1 and 5, representing the rating.
Action
Action: Output is an integer.
The action space would be binary. 1 or 0. 1 = Take the project, 0 = Don't take the project.
Reward
The reward provides feedback to your system. In your case, you would only evaluate the reward if the action = 1, i.e., you took the project. This will then allow your RL to learn how good of a job it did taking the project.
Reward would be a function that looks something like this:
def reward(states):
    sector, category, emp_rating = states
    rewards = 0
    if sector == 1:        # The IT sector
        rewards += 1
    if category == 1:      # The web development category
        rewards += 1
    if emp_rating == 5:    # Highest rating
        rewards += 2
    elif emp_rating == 4:  # 2nd highest rating
        rewards += 1
    return rewards
To enhance this reward function, you can also give some sectors negative rewards, so the agent receives a negative reward if it takes projects from those sectors. I avoided that here to keep things simple.
You can also edit the reward function in the future to allow your agent to learn new things, such as valuing some sectors more than others.
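For example, one possible variant of the reward function above with negative rewards (the sector IDs and reward values here are purely illustrative assumptions):

def reward_with_penalties(states):
    sector, category, emp_rating = states
    rewards = 0
    if sector == 1:         # The IT sector
        rewards += 1
    elif sector in (3, 4):  # Hypothetical sectors you never want
        rewards -= 2        # negative reward discourages taking these projects
    if category == 1:       # The web development category
        rewards += 1
    if emp_rating >= 4:
        rewards += emp_rating - 3  # +1 for a rating of 4, +2 for a rating of 5
    return rewards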
Edit: yes, with regard to lejlot's comment, this basically is a multi-armed bandit problem, where there is no sequential decision making. The setup of the bandit problem is essentially the same as Q-learning minus the sequential part. All you're concerned with is: you have a project proposal (state), you make a decision (action), and then you get a reward. It does not matter what happens next in your case.
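For illustration only, here is a minimal epsilon-greedy sketch of that bandit-style setup, reusing the reward function above; the exploration rate and the example state are assumptions, not part of the original question:

import random
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> estimated reward
counts = defaultdict(int)
epsilon = 0.1            # exploration rate (assumed)

def choose_action(state):
    # With probability epsilon explore, otherwise pick the action with the highest estimate.
    if random.random() < epsilon:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(state, a)])

def update(state, action, r):
    # Incremental mean of the rewards observed for this (state, action) pair.
    counts[(state, action)] += 1
    Q[(state, action)] += (r - Q[(state, action)]) / counts[(state, action)]

# Example round: IT sector, web development, employer rating 5 (hypothetical project).
state = (1, 1, 5)
action = choose_action(state)
r = reward(state) if action == 1 else 0  # only evaluate the reward if the project is taken
update(state, action, r)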

Weka Attribute selection - justifying different outcomes of different methods

I would like to use attribute selection for a numeric data-set.
My goal is to find the best attributes that I will later use in Linear Regression to predict numeric values.
For testing, I used the autoPrice.arff file that I obtained from here (datasets-numeric.jar).
Using ReliefFAttributeEval I get the following outcome:
Ranked attributes:
**0.05793 8 engine-size**
**0.04976 5 width**
0.0456 7 curb-weight
0.04073 12 horsepower
0.03787 2 normalized-losses
0.03728 3 wheel-base
0.0323 10 stroke
0.03229 9 bore
0.02801 13 peak-rpm
0.02209 15 highway-mpg
0.01555 6 height
0.01488 4 length
0.01356 11 compression-ratio
0.01337 14 city-mpg
0.00739 1 symboling
while using InfoGainAttributeEval (after applying a numeric-to-nominal filter) gives me the following results:
Ranked attributes:
6.8914 7 curb-weight
5.2409 4 length
5.228 2 normalized-losses
5.0422 12 horsepower
4.7762 6 height
4.6694 3 wheel-base
4.4347 10 stroke
4.3891 9 bore
**4.3388 8 engine-size**
**4.2756 5 width**
4.1509 15 highway-mpg
3.9387 14 city-mpg
3.9011 11 compression-ratio
3.4599 13 peak-rpm
2.2038 1 symboling
My question is:
How can I justify the contradiction between the 2 results? If the 2 methods use different algorithms to achieve the same goal (revealing the relevance of the attributes to the class), why does one say that e.g. engine-size is important while the other says it is not so important?
There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.
IG looks at the difference in entropies between not having the attribute and conditioning on it; hence, highly informative attributes (with respect to the class variable) will be the most highly ranked.
RELIEF, however, looks at random data instances and measures how well the feature discriminates classes by comparing to "nearby" data instances.
Note that RELIEF is a more heuristic (i.e., more stochastic) method, and the values and ordering you get depend on several parameters, unlike in IG.
So we would not expect algorithms optimizing different quantities to give the same results, especially when one is parameter-dependent.
However, I'd say that actually your results are pretty similar: e.g. curb-weight and horsepower are pretty close to the top in both methods.
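To see why the two scores need not agree, here is a toy sketch (plain NumPy, not the actual Weka implementations; the data and the single-neighbour Relief-style score are deliberately simplified) showing that an information-gain-style score and a Relief-style score measure different things:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                  # two hypothetical attributes
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)     # class driven mainly by attribute 0

def info_gain(x, y, bins=5):
    # Entropy of y minus entropy of y conditioned on a discretized attribute.
    def entropy(labels):
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    binned = np.digitize(x, np.histogram_bin_edges(x, bins))
    cond = sum((binned == b).mean() * entropy(y[binned == b]) for b in np.unique(binned))
    return entropy(y) - cond

def relief_score(X, y, attr, n_samples=100):
    # Simplified Relief idea: reward separation from the nearest miss,
    # penalize distance to the nearest hit (one neighbour, no weighting).
    score = 0.0
    for i in rng.choice(len(y), n_samples, replace=False):
        d = np.abs(X - X[i]).sum(axis=1)
        d[i] = np.inf
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        score += abs(X[i, attr] - X[miss, attr]) - abs(X[i, attr] - X[hit, attr])
    return score / n_samples

print([round(info_gain(X[:, j], y), 3) for j in range(2)])
print([round(relief_score(X, y, j), 3) for j in range(2)])

The two rankings will usually agree on the most informative attribute here, but the numbers are on different scales and can reorder middling attributes, which matches what you see in the Weka output.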

Within subjects main effect in an ANOVA changes after a between subjects variable is included in the model

Yesterday my colleagues (all researchers) and I were bewildered by something. My colleague was asked by the reviewers of a journal to revise her article. She had done 3 repeated-measures ANOVAs, one to answer each of the research questions, and the reviewers asked her to check for differences on the between-subjects measure (in this case gender). She therefore set out to repeat her ANOVA (in SPSS) as a mixed model, with the repeated measures staying the same but adding one between-subjects measure (gender). She found no significant effect of gender and one significant interaction of a within-subjects measure with gender (not problematic for this paper).
Here is the baffling part though: her main within-subjects effects had suddenly changed; different F and p values were shown.
This defies my knowledge and understanding of what a mixed model ANOVA calculates. As I understand it:
F(within subjects main effect) = MS(within) / MS error(within)
F(within subjects * between subjects interaction) = MS(within * between) / MS error(within)
F(between subjects main effect) = MS(between) / MS error(between)
The F value of the within subjects main effect should not change, whether you are including a between subjects variable or not. This calculation is based on the same numbers either way and should therefore yield the same results either way. What are we missing here? What are we not understanding?
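For reference, the standard partition in a mixed design (textbook notation, not anything specific to SPSS or to this data set) is:

SS error(within) in the purely repeated-measures model = SS(within * subjects)
SS(within * subjects) = SS(within * between) + SS(within * subjects-within-groups)

and the mixed model uses MS(within * subjects-within-groups) as the error term for the within-subjects main effect. So the denominator of F(within), and its degrees of freedom, changes once the between-subjects factor is added, even though MS(within) itself stays the same.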

Applying machine learning to biological text data

I am trying to solve the following question - Given a text file containing a bunch of biological information, find out the one gene which is {up/down}regulated. Now, for this I have many such (60K) files and have annotated some (1000) of them as to which gene is {up/down}regulated.
Conditions -
Many sentences in the file have some gene name mention and some of them also have neighboring text that can help one decide if this is indeed the gene being modulated.
Some files also have NO gene modulated. But these still have gene mentions.
Given this, I wanted to ask (having absolutely no background in ML): what sequence-learning algorithm/tool should I use that can take in my annotated (training) data (after probably converting the text to vectors somehow!) and build a good model on which I can then test more files?
Example data -
Title: Assessment of Thermotolerance in preshocked hsp70(-/-) and
(+/+) cells
Organism: Mus musculus
Experiment type: Expression profiling by array
Summary: From preliminary experiments, HSP70 deficient MEF cells display moderate thermotolerance to a severe heatshock of 45.5 degrees after a mild preshock at 43 degrees, even in the absence of hsp70 protein. We would like to determine which genes in these cells are being activated to account for this thermotolerance. AQP has also been reported to be important.
Keywords: thermal stress, heat shock response, knockout, cell culture, hsp70
Overall design: Two cell lines are analyzed - hsp70 knockout and hsp70 rescue cells. 6 microarrays from the (-/-)knockout cells are analyzed (3 Pretreated vs 3 unheated controls). For the (+/+) rescue cells, 4 microarrays are used (2 pretreated and 2 unheated controls). Cells were plated at 3k/well in a 96 well plate, covered with a gas permeable sealer and heat shocked at 43degrees for 30 minutes at the 20 hr time point. The RNA was harvested at 3hrs after heat treatment
Here my main gene is hsp70 and it is down-regulated (deducible from hsp(-/-) or "HSP70 deficient"). Many other gene names are also present, like AQP.
There could be another file with no gene modified at all. In fact, more files have no actual gene modulation than those that do, and all of them contain gene name mentions.
Any idea would be great!!
If you have no background in ML, I suggest buying a product like this one, this one or this one. These products were in development for decades, with team budgets in the millions.
What you are trying to do is not that simple. For example, a lot of papers contain negative statements, first citing the original statement from another paper and then negating it. In your example, how are you going to handle this:
AQP has also been reported to be important by Doe et al. However, this study suggest that this might not be the case.
Also, if you are looking into a large corpus of biomedical research papers, or for that matter any corpus of research papers, you will find tons of papers that suggest something, for example that a gene is or is not up-regulated, and then there is one paper published in the journal Cell arguing that all previous research has been mistaken.
To make matters worse, gene/protein names are not that stable. Besides a few famous ones like P53, there are a bunch of run-of-the-mill ones that are initially thought to be one gene, but later turn out to be two different things. When this happens, there are two ways the community handles it: either both genes get new names (usually with some designator at the end), or, if the split is uneven, the larger class retains the original name and the second one gets a new name. To compound this problem, after such a split not all researchers get the memo instantly, so there is still a stream of publications using the old name.
These are just two simple problems; there are hundreds of these.
If you are doing this for personal enrichment, here are some suggestions:
Build a language model on biomedical papers. Existing language models are usually built from newswire sources or from social media data. All three of these corpora claim to be written in the English language, but in reality they are three different languages with their own grammar and vocabulary.
Look into things like embeddings and word2vec (a small sketch follows this list).
Look into Kaggle competitions; this is a somewhat popular topic there.
Subscribe to the KDD and BIBM proceedings or find them in a nearby library. There are hundreds of papers on this subject.
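A minimal, hedged sketch of the embeddings idea from the list above, using gensim's Word2Vec and scikit-learn; the tokenized documents, labels and hyperparameters are all made up, and it solves a deliberately simplified binary version of the task (is any gene modulated at all?), just to show the workflow. Identifying which gene is modulated would need something closer to per-mention classification or sequence tagging.

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy corpus: each document is a tokenized file, each label says whether some gene is modulated.
docs = [["hsp70", "deficient", "mef", "cells", "display", "thermotolerance"],
        ["aqp", "has", "been", "reported", "to", "be", "important"]]
labels = [1, 0]

w2v = Word2Vec(docs, vector_size=50, window=5, min_count=1, epochs=20)

def doc_vector(tokens):
    # Average the word vectors of the tokens that are in the vocabulary.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))

With the roughly 1000 annotated files from the question, docs would be the tokenized file contents, labels would come from the annotations, and evaluation would be on held-out files.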
