Transfer Learning completely different domain and task - machine-learning

I've been reading some papers and blog posts on Transfer Learning. What I realized is that some say "different" and others say "different but similar". Now, I'm confused.
Say that,
D is domain. T is task.
a is source, hence source domain is Da and source task is Ta.
a' is different but similar to a.
b is different from a.
Ms is the source model learned from Da and Ta.
Mt is the target model.
In which target domain-task combination, Ms can transfer its knowledge to Mt?
1. (Da, Tb)
2. (Db, Ta)
3. (Da', Tb)
4. (Db, Ta')
5. (Da', Ta)
6. (Da, Ta')
7. (Da', Ta')
8. (Db, Tb)
Honestly, I know 5, 6, 7 are possible since the paper said so. And I doubt 8 will work (won't it?).
But what about 1, 2, 3, 4 where either domain or task is b?

It depends how different the source and target domains are. If the source and target domains have no similarities then you can’t improve your model for a task in the target domain by pre-training on the task domain. However if there are similarities e.g. any image domain to almost any other image domain, and your source domain dataset is large, transferring your model from the source domain to the target domain is likely to help regularise your model and improve generalisation in the target domain. Especially if the target domain dataset is small.
In deep learning you want to reinitialise (retrain from random weights) more layers (from the top down) and do more fine tuning depending on how different your source and target domains and source and target tasks are.

What's a "domain"?
In the field of natural language processing (NLP) there's a lot of research for domain adaptation, and you can get some benefit in all the cases (#1-#8) you describe regarding for what NLP would call "different domains" - e.g. newswire text vs tweet text vs clinical radiology report text; yes, even in #8.
However, you can have more different "domains" (which wouldn't be really called domains IMHO) - for example, english text vs chinese text; or english text vs english audio recordings. It is possible do do some transfer learning even in those cases, but much more limited; so it really depends on where you draw the line between "that's a different domain" vs "that's a completely different type of input data".

Related

Dynamic Process parameter adjustment in Semiconductor manufacturing data

I have process parameter data from semiconductor manufacturing.and requirement is to suggest what could be the best parameter adjustment to be made to process parameter to get better yield ie best path for high yield. what machine learning /Statistical models best suits this requirement
Note:I have thought of using decision tree which can give us best path for high yield.
Would like to know it any other methods that can be more efficient
data is like
lotno x1 x2 x3 x4 x5 yield(%)
<95% yield is considered as 0 and >95% as 1
I'm not really sure of the question here, but as a former semiconductor process engineer, here is how I look at the yield improvement approach - perspective.
Process Development.
DOE: Typically, I would run structured DOEs to understand my process (#4). I would first identify "potential" 'factors', and run various "screening" experiments to identify statistical significance. With the goal basically here to identify the most statistically significant (and for that matter, least significant) factors. So these are inherently simple experiments, low # of "levels" which don't target understanding of the curvature of the response surface, they just look for magnitude change of response vs factor. Generally, I am most concerned with 'Process' factors, but it is important to recognize that the influence of variable inputs can come from more than just "machine knobs' as example. Variable can arise from 1) People, 2) Environment (moisture, temp, etc), 3) consumables (used in the process), 4) Equipment (is 40 psi on this tool really 40 psi and the same as 40 psi on a different tool) 4) Process variable settings.
With the most statistically significant factors, I would run more elaborate DOE using the major factors and analyze this data to develop a model. There are generally more 'levels' used here to allow for curvature insight of the response surface via the analysis. There are many types of well known standard experimental design structures here. And there is software such as JMP that is specifically set up to do this analysis.
From here, the idea would be to generate a model in the form of Response = F (Factors). That allows you to essentially optimize the response based upon these factors where the response is a reflection of your yield criteria.
From here, the engineer would typically execute confirmation runs with optimized factors to confirm optimized response.
Note that the software analysis typically allows for the engineer to illuminate any run order dependence. The execution of the DOE is typically performed in a randomized cell fashion. (Each 'cell' is a set of conditions for the experiment). Similarly the experiments include some level of repetition to gauge 'repeatability' of the 'system'. This inclusion can be explicit (run the same cell twice), but there is also some level of repeatability inherent in the design as well since you are running multiple cells, albeit at difference settings. But generally, the experiment includes explicitly repeated cells.
And finally there is the concept of manufacturability, which includes constraints of time, cost, physical limits, equipment capability, etc. (The ideal process works great, but it takes 10 years, costs 1 million dollars and requires projected settings outsides the capability of the tool.)
Since you have manufacturing data, hopefully, you have the data that captures the other types of factors as well (1,2,3), so you should specifically analyze the data to try to identify such effects. This is typically done as A vs B comparisons. Person A vs B, Tool A vs B, Consumable A vs B, Consumable lot A vs B, Summer vs Winter, etc.
Basically, there are all sorts of comparisons you could envision here and check for statistically differences across two sets of populations.
A comment on response: What is the yield criteria? You should know this in order to formulate the model. For semiconductors, we have both line yield (process yield) but there is also device yield. I assume for your work, you are primarily concerned with line yield. So minimizing variability in the factors (from 1,2,3,4) to achieve the desired response (target response(s) with minimal variability) is the primary goal.
APC (Advanced Process Control).
In many cases, there is significant trending that results from whatever reason; crappy tool control (the tool heats up), crappy consumable (the target material wears, the polishing pad wears, the chemical bath gets loaded, whatever), and so the idea here is how to adjust the next batch/lot/wafer based upon the history of what came prior. Either improve the manufacturing to avoid/minimize this trending (run order dependence) or adjust process to accommodate it to achieve the desired response.
Time for lunch, hope this helps...if you post on the specific process module type, and even equipment and consumables, I might be able to provide more insight.

Applying MACHINE learning in biological text data

I am trying to solve the following question - Given a text file containing a bunch of biological information, find out the one gene which is {up/down}regulated. Now, for this I have many such (60K) files and have annotated some (1000) of them as to which gene is {up/down}regulated.
Conditions -
Many sentences in the file have some gene name mention and some of them also have neighboring text that can help one decide if this is indeed the gene being modulated.
Some files also have NO gene modulated. But these still have gene mentions.
Given this, I wanted to ask (having absolutely no background in ML), what sequence learning algorithm/tool do I use that can take in my annotated (training) data (after probably converting the text to vectors somehow!) and can build a good model on which I can then test more files?
Example data -
Title: Assessment of Thermotolerance in preshocked hsp70(-/-) and
(+/+) cells
Organism: Mus musculus
Experiment type: Expression profiling by array
Summary: From preliminary experiments, HSP70 deficient MEF cells display moderate thermotolerance to a severe heatshock of 45.5 degrees after a mild preshock at 43 degrees, even in the absence of hsp70 protein. We would like to determine which genes in these cells are being activated to account for this thermotolerance. AQP has also been reported to be important.
Keywords: thermal stress, heat shock response, knockout, cell culture, hsp70
Overall design: Two cell lines are analyzed - hsp70 knockout and hsp70 rescue cells. 6 microarrays from the (-/-)knockout cells are analyzed (3 Pretreated vs 3 unheated controls). For the (+/+) rescue cells, 4 microarrays are used (2 pretreated and 2 unheated controls). Cells were plated at 3k/well in a 96 well plate, covered with a gas permeable sealer and heat shocked at 43degrees for 30 minutes at the 20 hr time point. The RNA was harvested at 3hrs after heat treatment
Here my main gene is hsp70 and it is down-regulated (deducible from hsp(-/-) or HSP70 deficient). Many other gene names are also there like AQP.
There could be another file with no gene modified at all. In fact, more files have no actual gene modulation than those who do, and all contain gene name mentions.
Any idea would be great!!
If you have no background in ML I suggest buying a product like this one, this one or this one. These products where in development for decades with team budgets in millions.
What you are trying to do is not that simple. For example a lot of papers contain negative statements by first citing the original statement from another paper and then negating it. In your example how are you going to handle this:
AQP has also been reported to be important by Doe et al. However, this study suggest that this might not be the case.
Also, if you are looking into large corpus of biomedical research papers, or for this matter any corpus of research papers. You will find tons of papers that suggest something for example gene being up-regulated or not, and then there is one paper published in Cell magazine that all previous research has been mistaken.
To make matters worse, gene/protein names are not that stable. Besides few famous ones like P53. There is a bunch of run of the mill ones that are initially thought that they are one gene, but later it turns out that these are two different things. When this happen there are two ways community handles it. Either both of the genes get new names (usually with some designator at the end) or if the split is uneven the larger class retains original name and the second one gets the new name. To compound this problem, after this split happens not all researchers get the memo at instantly, so there is still stream of publications using old publication.
These are just two simple problems, there are 100s of these.
If you are doing this for personal enrichment. Here are some suggestions:
Build a language model on biomedical papers. Existing language models are usually built from news-wire sources or from social media data. All three of the corpora claim to be written in English language. But in reality these are three different languages with their own grammar and vocabulary
Look into things like embeddings and word2vec.
Look into Kaggle competitions, this is somewhat popular topic there.
Subscribe to KDD and BIBM magazines or find them in nearby library. There are 100s of papers on this subject.

In Weka make a arff file from text file

In naive byes classifier i want to find out the accuracy from my train and test. But my train set is like
Happy: absolution abundance abundant accolade accompaniment accomplish accomplished achieve achievement acrobat admirable admiration adorable adoration adore advance advent advocacy aesthetics affection affluence alive allure aloha
Sad: abandon abandoned abandonment abduction abortion abortive abscess absence absent absentee abuse abysmal abyss accident accursed ache aching adder adrift adultery adverse adversity afflict affliction affront aftermath aggravating
Angry: abandoned abandonment abhor abhorrent abolish abomination abuse accursed accusation accused accuser accusing actionable adder adversary adverse adversity advocacy affront aftermath aggravated aggravating aggravation aggression aggressive aggressor agitated agitation agony alcoholism alienate alienation
For test set
data: Dec 7, 2014 ... This well-known nursery rhyme helps children practice emotions, like happy, sad, scared, tired and angry. If You're Happy and You Know It is ...
Now the problem is how do i convert them into arff file
Your training set is not appropriate for training a model for Weka however these information can be used in feature extraction.
Your Test set can be converted into an arff file. From every message extract these basics features like
1. Any form of the word 'Happy' is present or not
2. Any form of the word 'Sad' is present or not
3. Any form of the word 'Angry' is present or not
4. TF-IDF
etc.
then for some messages (say 70%) you should assign one class {Happy, Sad, Angry} manually and for remaining 30% you can test through your model.
More about arff file is given here:
http://www.cs.waikato.ac.nz/ml/weka/arff.html
Where to start ;).
As written before your "training data" is not real training data. Training data should be texts similar to the data you are using for Testing. However, in your example it is merely a list of words. My gut feeling is that you would be better of to avoid using weka, count the number of occurrence in each category, and the take the one with most matches.
In case you want use Weka I'd recommend to use the toolbox https://www.knime.org which nicely integrates with weka.
You should then convert your data into a bag of words representation. This is basically you have the number of times each word occurs in each of the texts as features.
Also for this Knime has nice package. http://www.tech.knime.org/files/KNIME-TextProcessing-HowTo.pdf

Text clustering within a log file

I am working on a problem of finding similar content in a log file. Let's say I have a log file which looks like this:
show version
Operating System (OS) Software
Software
BIOS: version 1.0.10
loader: version N/A
kickstart: version 4.2(7b)
system: version 4.2(7b)
BIOS compile time: 01/08/09
kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
kickstart compile time: 8/16/2010 13:00:00 [09/29/2010 23:10:48]
system image file is: bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
system compile time: 8/16/2010 13:00:00 [09/30/2010 00:46:36]`
Hardware
xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
xxxxxxx, xxxx with 1033100 kB of memory.
Processor Board ID xxxx
Device name: xxx-xxx-1
bootflash: 1000440 kB
slot0: 0 kB (expansion flash)
For a human eye, it can easily be understood that "Software" and the data below is a section and "Hardware" and the data below is another section. Is there a way I can model using machine learning or some other technique to cluster similar sections based on a pattern? Also, I have shown 2 similar kinds of pattern but the patterns between sections might vary and hence should identify as different section. I have tried to find similarity using cosine similarity but it doesn't help much because the words aren't similar but the pattern is.
I see actually two separate machine learning problems:
1) If I understood you correctly the first problem you want to solve is the problem to split each log into distinct section, so one for Hardware, one for Software etc.
In order to achieve this one approach could be try to extract heading which mark the beginning of a new section. In order to do so you could manually label a set of different logs and label each row as heading=true, heading= false
No you could try to train a classifier which takes your labeled data as an input and the result could be a model.
2) Now that you have this different sections, you can split each log into those section and treat each section as a separate document.
Now I would first try a straigt-forward document clustering using a standard nlp pipeline:
Tokenize your document to get the tokens
Normalize them (maybe stemming is not the best idea for logs)
Create for each document a tf-idf vector
Start with a simple clustering algorithm like k-means to try to cluster the different section
After the clustering you should have the section similar to each other in the same cluster
I hope this helped, I think especially the first task is quit hard and maybe hand-tailored patterns will perform better.

How to tag text based on its category using OpenNLP?

I want to tag text based on the category it belongs to ...
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How to do this using openNLP or other NLP engines.
MY WORKS
I tried NER model , but It needs large number of training corpus which I don't have ?
My Need
Do any ready made training corpus available for NER or classification (it must contains scientific and engineering words).. ?
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib. With Doccat you would get a prob distribution for each chunk of text.
with doccat your sample would produce something like this:
"Clutch and gear is monitored using microchip " -> mechanical 0.85847568, electronic 0.374658
with doocat you will lose the keyword->classlabel mapping, so if you really need it doccat might not cut it.
as for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite the creation of NER model building. You can create a file/list of as many of the terms for each category as you can think of, then create a file of a bunch of sentences, then use the addon to create an NER model using the seed terms and the file of sentences. see this post where I described it before with code example. You will have to pull down the addon from SVN.
OpenNLP: foreign names does not get recognized

Resources