Is it appropriate to use Fama and French 3 and 5 factor models on individual stocks (not portfolios) that are all large companies?

I have daily data for 7 large companies. I am initially analyzing the returns with regressions - the CAPM and the Fama-French models. I am concerned about the interpretation of the SMB coefficient, as there are no small companies in this study. Thank you
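Running the factor regression on a single stock is mechanically the same as on a portfolio, and the SMB coefficient is then read as the stock's exposure to the size factor, so a negative loading for a large cap is a perfectly interpretable outcome. A minimal sketch, using synthetic data in place of real daily excess returns and invented "true" loadings (with real data, the factor series would come from Ken French's data library):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
mkt = rng.normal(0.0005, 0.01, n)   # market excess return
smb = rng.normal(0.0, 0.005, n)     # small-minus-big factor
hml = rng.normal(0.0, 0.005, n)     # high-minus-low factor

# Simulate a large-cap stock: positive market beta, *negative* SMB loading.
true_betas = np.array([0.0001, 1.1, -0.4, 0.2])  # alpha, b_mkt, b_smb, b_hml
X = np.column_stack([np.ones(n), mkt, smb, hml])
r_excess = X @ true_betas + rng.normal(0.0, 0.002, n)

# OLS via least squares: regress the stock's excess return on the factors.
coef, *_ = np.linalg.lstsq(X, r_excess, rcond=None)
alpha, b_mkt, b_smb, b_hml = coef
```

A negative `b_smb` here just says the stock co-moves with big firms; the estimate does not require small companies to be in your own sample, only that the SMB factor series itself was built from the full cross-section.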

Related

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data in description-code pairs but, obviously, it didn't work since AutoML needed more text for each code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find that the website contains multiple texts and descriptions associated with each code that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found on these pages and assigned them to their codes (labels), would that be enough for AutoML training? Each label would then have 2 or more texts instead of just one, but still far fewer than the 100 per code seen in the demos/tutorials.
From what I can see here, the disease codes have a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time, L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90,000 examples for 90,000 different independent labels, but a decision tree (you take several decisions, each conditioned on the previous one: the first step would be choosing which of the roughly 15 most general categories fits best, then choosing which of the subcategories, etc.)
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed decision-tree model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each decision and then combine the different models. This would easily work for the first layer of decisions but would be exponentially time-consuming (the number of models to train in order to predict more accurately grows exponentially with the level of accuracy; by "accurate" I mean affirming it is L00-L08 instead of L00-L99).
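The per-level idea can be sketched as follows. The keyword-routing "classifiers" below are toy stand-ins for trained AutoML models, and only two branches of the hierarchy are stubbed out; the point is purely the architecture of chaining one model per decision level:

```python
def make_stub(mapping, default):
    """Return a toy classifier that routes on keywords; real trained models
    would replace these stand-ins."""
    def classify(text):
        for keyword, label in mapping.items():
            if keyword in text.lower():
                return label
        return default
    return classify

# Level 1: chapter; level 2: block within the chapter.
chapter_model = make_stub({"skin": "L00-L99", "tubercul": "A15-A19"}, "unknown")
block_models = {
    "L00-L99": make_stub({"infection": "L00-L08"}, "L10-L14"),
    "A15-A19": make_stub({"lung": "A15"}, "A17"),
}

def predict(text):
    # Walk down the tree: the chapter decision selects the next model.
    chapter = chapter_model(text)
    block_model = block_models.get(chapter)
    return (chapter, block_model(text) if block_model else None)

print(predict("bacterial infection of the skin"))  # ('L00-L99', 'L00-L08')
```

Each level only has to discriminate among its own children, which is what keeps the individual models small; the cost, as noted above, is the number of models you must train as you go deeper.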
I hope this helps you understand better the problem and the different approaches you can give to it!

What features should I use for predicting the performance of soccer players?

I want to build a model to help me build a team in fantasy premier league. There are two parts to the problem:
1) Predicting the players' performances next week, given the data for the last week and for the last season.
2) Using the result of the predictive model to build a team within a budget of 100 million euros.
For part 2), I was thinking of using either a 6D knapsack algorithm (2D for weight and number of items and the other 4 dimensions to make sure the appropriate number of players are picked from each category) or to use min cost max flow (not sure how I can add categories or restrict the number of players from each category).
For part 1), the only examples and papers I have come across either use models to predict whether or not a team will win, or just classify the players as "good" or "bad". The second part of my problem requires that I predict a specific value for each player. At the moment I am thinking of using regression, but I am not sure what kind of features I should use for it.
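To make part 2) concrete, here is a deliberately tiny sketch of team selection under a budget with per-position quotas. All player names, prices, and predicted points are invented, and the exhaustive search over per-position combinations is only viable for a toy roster; a knapsack DP, an ILP solver, or the min-cost-flow formulation mentioned above would replace it for a real 600-player pool:

```python
from itertools import combinations

players = [
    # (name, position, price, predicted_points) -- all values invented
    ("GK1", "GK", 5.0, 4.0), ("GK2", "GK", 4.5, 3.5),
    ("DF1", "DF", 6.0, 5.0), ("DF2", "DF", 5.5, 4.5), ("DF3", "DF", 4.0, 3.0),
    ("FW1", "FW", 9.0, 8.0), ("FW2", "FW", 7.0, 6.5), ("FW3", "FW", 6.0, 5.0),
]
quota = {"GK": 1, "DF": 2, "FW": 2}   # players required per position
budget = 30.0

def best_team(players, quota, budget):
    by_pos = {p: [x for x in players if x[1] == p] for p in quota}
    best, best_pts = None, -1.0
    # Enumerating per-position combinations keeps the quotas satisfied
    # by construction; only the budget needs an explicit check.
    for gk in combinations(by_pos["GK"], quota["GK"]):
        for df in combinations(by_pos["DF"], quota["DF"]):
            for fw in combinations(by_pos["FW"], quota["FW"]):
                team = gk + df + fw
                cost = sum(x[2] for x in team)
                pts = sum(x[3] for x in team)
                if cost <= budget and pts > best_pts:
                    best, best_pts = team, pts
    return best, best_pts

team, pts = best_team(players, quota, budget)
```

Note how the position quotas fall out of the structure rather than needing extra knapsack dimensions; the same trick (one "bucket" per category) is a common way to bolt category constraints onto an otherwise standard knapsack.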

Use Machine learning to predict the prices of used car

I have a large table of used cars.
The header looks like this:
maker | model | year | kilometers | transmission | gas_type | price
I made a prediction model that works like this: every time I want to know the price of a car, I filter the data by maker and model, and then I run a quadratic regression using year and kilometers as predictors.
The results are OK, but not for every car.
The problem is that there are different "versions" for the same maker and model.
(It is not the same a FULL version than a simple version, or 4WD, or Leather Seats, etc. )
How can I identify the differences? Can I use some kind of clustering to identify different versions among cars with the same model and maker?
Any help will be appreciated
That's not a clustering problem, just a sub-model feature. Also, you might want to differentiate a sub-model (standard, Luxury Edition, hatchback, etc.) from model-independent features (4WD, leather seats, premium sound system, sun roof, etc.). The sub-model would likely be a single feature (a text column), while the options would be individual features (Boolean columns).
UPDATE AFTER OP CLARIFICATION
I see: those features are output, not input.
Yes, you can use clustering. However, it may or may not identify sub-models (your "versions"). If you cluster only observations that have very similar use (kilometers) and all other features equal, you will find some useful clusters. However, this will work only to the extent that the version is a major factor in the remaining price variation. You may find that your clustering is also affected by geographic region and other factors.
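One concrete way to try this is to fit the usual quadratic on kilometers for a single maker/model and then cluster the *residuals*: if a hidden "FULL vs. simple" version drives a price gap, the residuals split into two groups whose separation estimates the version premium. The sketch below uses synthetic data (the 1,500-euro premium and the depreciation curve are assumptions) and a tiny hand-rolled 1-D 2-means:

```python
import numpy as np

rng = np.random.default_rng(1)
km = rng.uniform(10_000, 150_000, 200)
base = 20_000 - 0.12 * km + 2e-7 * km**2      # shared depreciation curve
version = rng.integers(0, 2, 200)             # hidden trim level (unobserved)
price = base + 1_500 * version + rng.normal(0, 200, 200)

# Usual quadratic regression of price on kilometers for one maker/model.
coeffs = np.polyfit(km, price, deg=2)
residuals = price - np.polyval(coeffs, km)

# Tiny 1-D 2-means on the residuals: start from the range midpoint,
# then alternate assignment and centre updates until convergence.
lo, hi = residuals.min(), residuals.max()
for _ in range(20):
    labels = residuals > (lo + hi) / 2
    lo, hi = residuals[~labels].mean(), residuals[labels].mean()

# The gap between cluster centres roughly recovers the version premium.
premium = hi - lo
```

As the answer above warns, this only works when the version really is the dominant source of leftover variation; with region effects or condition differences of similar size, the two clusters smear together.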

Can you use sampling to normalize data?

Context
I have a retail data set that contains sales for a large number of customers. Some of these customers received a marketing treatment (i.e. saw a TV ad or similar) while others did not. The data is very messy with most customers having $0 in sales, some having negative, some positive, a lot of outliers/influential cases etc. Ultimately I am trying to "normalize" the data so that assumptions of the General Linear Model (GLM) are met and I can thus use various well-known statistical tools (regression, t-Test, etc.). Transformations have failed to normalize the data.
Question
Is it appropriate to sample groups of these customers so that the data starts to become more normal? Would doing so violate any assumptions for the GLM? Are you aware of any literature on this subject?
Clarification
For example, instead of looking at 20,000 individual customers (20,000 groups of 1), I could pool customers into groups of 10 (2,000 groups of 10) and calculate their mean sales. Theoretically, the data should begin to normalize as all of these random draws from the population cluster around the population mean with some standard error. I could keep pooling into larger groups (e.g. 200 groups of 100) until the data is relatively normal and then proceed with my analysis.
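The pooling effect itself is easy to demonstrate. The sketch below simulates heavily skewed "sales" (mostly zeros, a few large positives, some negative refunds; all numbers invented) and compares the spread of the raw values with that of group means, which shrinks roughly by the square root of the group size, as the central limit theorem predicts:

```python
import random
import statistics

random.seed(42)
# Skewed synthetic sales: many zeros, some exponential purchases, some refunds.
sales = [0.0] * 15_000 \
      + [random.expovariate(1 / 50) for _ in range(4_000)] \
      + [-random.expovariate(1 / 20) for _ in range(1_000)]
random.shuffle(sales)

def group_means(data, size):
    """Mean sale per group of `size` consecutive (randomly shuffled) customers."""
    return [statistics.mean(data[i:i + size]) for i in range(0, len(data), size)]

means_10 = group_means(sales, 10)
means_100 = group_means(sales, 100)

# Group means concentrate around the population mean; spread shrinks ~ sqrt(size).
print(statistics.stdev(sales), statistics.stdev(means_10), statistics.stdev(means_100))
```

One caveat worth flagging: if treated and untreated customers are pooled into the same groups, the treatment indicator is averaged away, so in your setting the grouping would have to be done within each treatment arm.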

Product Categorization?

There are several data sets of automobile manufacturers and models. Each contains several hundred data entries like the following:
Mercedes GLK 350 W2
Prius Plug-in Hybrid Advanced Toyota
General Motors Buick Regal 2012 GS 2.4L
How can the above entries be automatically divided into manufacturer (e.g. Toyota) and model (e.g. Prius Plug-in Hybrid Advanced) by using only those files?
Thanks in advance.
Machine Learning (ML) typically relies on training data which allows the ML logic to produce and validate a model of the underlying data. With this model, it is then in a position to infer the class of new data presented to it (in the classifier application, as the one at hand) or to infer the value of some variable (in the regression case, as would be, say, an ML application predicting the amount of rain a particular region will receive next month).
The situation presented in the question is a bit puzzling, at several levels.
Firstly, the number of automobile manufacturers is finite and relatively small. It would therefore be easy to manually compile a list of these manufacturers and then simply use this lexicon to parse the manufacturers out of the entries, using plain string-parsing techniques, i.e. no ML is needed or even desired here. (Alas, the requirement that one use "...only those files" seems to preclude this option.)
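The lexicon approach is a few lines of code. The sketch below covers only the three example entries from the question (a real lexicon would list every manufacturer); it finds the maker anywhere in the entry and treats whatever remains as the model:

```python
# Manual lexicon of manufacturers; deliberately tiny, covering only the
# question's three example entries.
MAKERS = ["Mercedes", "Toyota", "General Motors", "Buick"]

def split_entry(entry):
    # Try longer names first so "General Motors" wins over "Buick" etc.
    for maker in sorted(MAKERS, key=len, reverse=True):
        i = entry.lower().find(maker.lower())
        if i != -1:
            model = (entry[:i] + entry[i + len(maker):]).strip()
            return maker, model
    return None, entry.strip()   # no known maker found

print(split_entry("Prius Plug-in Hybrid Advanced Toyota"))
# → ('Toyota', 'Prius Plug-in Hybrid Advanced')
```

This handles the maker appearing at the start, the end, or (as in the "General Motors Buick" entry) before a marque that is itself in the lexicon, which is exactly why the longest-match ordering matters.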
Secondly, one can think of a few patterns or heuristics that could be used to produce the desired classifier (tentatively a relatively weak one, as the patterns/heuristics that come to mind seem rather unreliable). Furthermore, such an approach is also not quite an ML approach in the common understanding of the term.