How is a context diagram different from a level 0 diagram? - entity-relationship

What is the difference, if any, between a context diagram and a level 0 diagram?

There are some conflicts in the literature about these two terms.
Refer to page 54 of this book, for example. It is highly rated on Google Books and is a standard text in many schools. It says that a context diagram is the same as a Level 0 DFD. This one disagrees on page 210.
I'll first address the notion of "levels".
As we know, the whole system is initially represented by one big block, and interactions with the system are clearly depicted. At this stage, we are seeing the system with the naked eye.
Now, think of yourself holding something like a microscope. You place the lens above the system block and zoom in. This "zooming in" takes you to the next level in the hierarchy. So now, you see that the system is made up of a number of blocks.
You pick up any of the sub-blocks, and then zoom in again, thus going to the next level and so on.
So we see that there is a hierarchy of diagrams, with each level taking us to the next level of detail. The only bone of contention that remains is the name of the first level (the naked-eye view).
As you can see, the question is not very objective, hence the ambiguity.
We can have:
Context Diagram -> Level 0 DFD -> ... -> Level n DFD
OR
Context Diagram/Level 0 DFD -> Level 1 DFD -> ... -> Level n DFD
It boils down to which one looks better. In my personal opinion, the first hierarchy is more apt, because initially all we see is the system and the context within which it operates. I feel that anyone who understands the explanation shouldn't worry much about the nomenclature.
Refer to this for more.

A very difficult discussion.
My thoughts:
A context diagram has only one process, while a level 0 DFD can have more.

The context diagram establishes the context of the system to be developed; that is, it represents the interaction of the system with various external entities.
A data flow diagram, on the other hand, is a simple graphical notation that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output generated by the system. It is simple to understand and use.

Related

Replicating trees between ACID RDBs using CRDTs

I'm interested in replicating "hierarchies" of data, say, similar to addresses:
Area
District
Sector
Unit
but you may have different pieces of data associated with each layer, so you may know the area of sectors but not of units, and you may know the population of a unit; basically it's not a homogeneous tree.
I know little about data replication beyond brushing against Brewer's theorem (CAP) and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB into other ACID RDBs. The system as a whole needs to eventually converge; obviously each RDB will enforce its own locally consistent view, but any two nodes may not match at any given time (except 'eventually').
The simplest approach is to simply put all the data in a single message from some designated leader and distribute it, like an overnight dump-and-load process, but that's too big.
So the next simplest thing (I thought) was: if something inside an area changes, I can export the complete set of data inside that area and load it into the nodes. That's still quite a coarse algorithm.
The next step was: if, say, an 'object' at any level changed, send all the data on the path to that 'object', i.e. if something in a sector is amended, you would send the data associated with the sector, its parent the district, and the district's parent the area (with some sort of version stamp and, let's say, last update wins). What I wanted was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which would be created if it didn't exist).
Then I stumbled on CRDTs and thought: ah, I'm reinventing the wheel here, and the algorithms are allegedly easy in principle but tricky to get correct in practice.
Are there standard, accepted patterns for doing this sort of thing?
In my use case the hierarchies are quite shallow and there is only a single designated leader (at this time). I'm quite attracted to state-based CRDTs because then I can ignore ordering.
Simplicity is the key requirement.
Actually, it appears I've reinvented (in a very crude, naive way) the SHELF algorithm.
I'll write some code, see if I can get it to work, and try to understand what's going on.
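A minimal sketch of the state-based, last-write-wins idea might look like the following (the struct names, the versioning scheme and the path-keyed map are all illustrative assumptions, not taken from any particular library):

#include <iostream>
#include <map>
#include <string>

// One entry per node in the hierarchy, keyed by its full path
// (e.g. "area1/district2/sector3"), with a last-write-wins version stamp.
struct Versioned {
    long long version;    // timestamp or counter stamped by the designated leader
    std::string payload;  // whatever attributes are known at this level
};

using LwwMap = std::map<std::string, Versioned>;

// Merge another replica's state into ours: per path, the higher version wins.
// The merge is commutative, associative and idempotent, so replicas converge
// no matter how often or in what order states are exchanged.
void merge(LwwMap& local, const LwwMap& remote) {
    for (const auto& [path, incoming] : remote) {
        auto it = local.find(path);
        if (it == local.end() || incoming.version > it->second.version) {
            local[path] = incoming;
        }
    }
}

int main() {
    LwwMap a{{"area1/district2", {1, "pop=100"}}};
    LwwMap b{{"area1/district2", {2, "pop=120"}},
             {"area1/district2/sector3", {1, "area=5km2"}}};
    merge(a, b);  // a now holds the newer district payload plus the new sector entry
    for (const auto& [path, v] : a)
        std::cout << path << "  v" << v.version << "  " << v.payload << "\n";
}

Because the merge keeps the highest version per path and re-applying old state is a no-op, shipping a whole subtree (or the whole path up to the root) is always safe, which matches the "send the path and create it if missing" idea above.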

Get some information from HEVC reference software

I am new to HEVC and I am studying the reference software now (looking at intra prediction at the moment).
I need to get the following information after encoding:
the CU structure for a given CTU
for each CU during the calculations, its information (e.g. QP value, selected mode for luma, selected mode for chroma, whether the CU is in the final CU structure of the CTU split decision, etc.)
I know CTU decisions are made when m_pcCuEncoder->compressCtu( pCtu ) is called in TEncSlice.cpp. But where exactly can I get this specific information? Can someone help me with this?
p.s. I am learning C++ too (I have a Java background).
EDIT: This post is a solution for the encoder side. However, the decoder side solution is far less complex.
Getting CTU information (partitioning, etc.) is a bit tricky at the encoder side if you are new to the code, but I'll try to help you with it.
Everything I am going to tell you is based on the JEM code and not HM, but I am pretty sure you can apply it to HM too.
As you might have noticed, there are two completely separate phases for compression/encoding of each CTU:
The RDO phase: first there is the Rate-Distortion Optimization loop that "makes the decisions". In this phase, literally all possible combinations of the parameters are tested (e.g. different partitionings, intra modes, filters, etc.). At the end of this phase the RDO determines the best combination and passes it to the second phase.
The encoding phase: Here the encoder does the actual final encoding step. This includes writing all the bins into the bitstream, based on the parameters determined during the RDO phase.
At the CTU level, these two phases are performed by the m_pcCuEncoder->compressCtu( pCtu ) and m_pcCuEncoder->encodeCtu( pCtu ) calls, respectively, both in TEncSlice.cpp (in its compressSlice() and encodeSlice() functions).
Given the above, you should look for the information you want in the second phase and not the first (you may already know these things, but I suspected that you might be looking at the first phase).
So here is my suggestion for getting your information. It's not the best way to do it, but it is the easiest to explain here.
You first go to this point in your HM code:
compressGOP() -> encodeSlice() -> encodeCtu() -> xEncodeCU()
Then you find the line where the prediction mode (intra/inter) is encoded:
m_pcEntropyCoder->encodePredMode()
At this point, you have access to the pcCU object, which contains all the final decisions made during the first phase, including the information you are looking for. At this point in the code, you are dealing with a single CU and not the entire CTU. But if you want the information for the entire CTU, you can go back to
compressGOP() -> encodeSlice() -> encodeCtu()
and find the line where the xEncodeCU() function is called for the first time. There, you will have access to the pCtu object.
Reminder: each TComDataCU object (pcCU if you are at the CU level, or pCtu if you are at the CTU level) of size WxH is split into NumPartition = (W/4)x(H/4) partitions of size 4x4. Each partition is accessible by an index (uiAbsPartIdx) which indicates its Z-scan order. For example, the uiAbsPartIdx of the partition at <x=8, y=0> is 4.
Now, you do the following steps:
Get the number of partitions (NumPartition) within your pCtu by calling pCtu->getTotalNumPart().
Loop over all NumPartition partitions and call the functions pCtu->getWidth(idx), pCtu->getHeight(idx), pCtu->getCUPelX(idx) and pCtu->getCUPelY(idx), where idx is your loop iterator. These functions return the following information for the CU coinciding with the 4x4 partition at idx: width, height, positionX, positionY (both positions are relative to the pixel <0,0> of the frame).
The above information is enough for deriving the CTU partitioning of the current pCtu! So the last step is to write a piece of code to do that.
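As a rough sketch (using only the accessor names mentioned above; exact signatures may differ between HM and JEM versions, so treat this as a starting point to adapt), that piece of code could look like this, placed somewhere where pCtu is available during the encoding phase:

// Sketch: dump the CU partitioning of one CTU (to be adapted to your HM/JEM version).
void dumpCtuPartitioning( TComDataCU* pCtu )
{
  UInt numPart = pCtu->getTotalNumPart();     // number of 4x4 partitions in this CTU
  for ( UInt idx = 0; idx < numPart; )
  {
    UInt width  = pCtu->getWidth ( idx );     // size of the CU covering partition idx
    UInt height = pCtu->getHeight( idx );
    UInt posX   = pCtu->getCUPelX( idx );     // top-left luma sample, relative to the frame
    UInt posY   = pCtu->getCUPelY( idx );
    printf( "CU at (%u,%u), size %ux%u\n", posX, posY, width, height );
    // A WxH CU covers (W/4)x(H/4) consecutive 4x4 partitions in Z-scan order,
    // so jump straight to the first partition of the next CU.
    idx += ( width >> 2 ) * ( height >> 2 );
  }
}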
This was an example of how to extract the CTU partitioning information during the second phase (i.e. the encoding phase). You may call the appropriate functions to get the other information in your second question in the same way. For example, to get the selected luma intra mode, you may call pCtu->getIntraDir(CHANNEL_TYPE_LUMA, idx) instead of the getWidth()/getHeight() functions, or pCtu->getQP(CHANNEL_TYPE_LUMA, idx) to get the QP value.
You can always find a list of functions that provide useful information at the pCtu level, in the TComDataCU class (TComDataCU.cpp).
I hope this helps you. If not, let me know!
Good luck,

Dynamic Process parameter adjustment in Semiconductor manufacturing data

I have process parameter data from semiconductor manufacturing, and the requirement is to suggest the best adjustments to make to the process parameters to get better yield, i.e. the best path to high yield. What machine learning/statistical models best suit this requirement?
Note: I have thought of using a decision tree, which can give us the best path to high yield.
I would like to know if there are any other methods that could be more efficient.
The data looks like:
lotno  x1  x2  x3  x4  x5  yield(%)
(yield < 95% is coded as 0 and yield > 95% as 1)
I'm not really sure of the question here, but as a former semiconductor process engineer, here is my perspective on the yield-improvement approach.
Process Development.
DOE: Typically, I would run structured DOEs to understand my process (factor category 5 below). I would first identify "potential" factors and run various "screening" experiments to establish statistical significance, the goal being to identify the most (and, for that matter, the least) statistically significant factors. These are inherently simple experiments with a low number of "levels"; they don't target understanding of the curvature of the response surface, they just look for the magnitude of the response change versus each factor. Generally I am most concerned with 'process' factors, but it is important to recognize that variability in the inputs can come from more than just "machine knobs". It can arise from 1) people, 2) environment (moisture, temperature, etc.), 3) consumables (used in the process), 4) equipment (is 40 psi on this tool really 40 psi, and the same as 40 psi on a different tool?), and 5) process variable settings.
With the most statistically significant factors, I would run a more elaborate DOE using those major factors and analyze the data to develop a model. Generally more 'levels' are used here so the analysis can give insight into the curvature of the response surface. There are many well-known standard experimental design structures, and there is software such as JMP that is specifically set up to do this analysis.
From here, the idea is to generate a model of the form Response = F(Factors). That allows you to essentially optimize the response based upon these factors, where the response is a reflection of your yield criteria.
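For illustration (the actual form depends on your design and data), such a fitted model is often a second-order response-surface model:

yield = β0 + Σi βi·xi + Σi βii·xi² + Σ(i<j) βij·xi·xj + ε

where the xi are the coded factor settings; the signs and magnitudes of the fitted coefficients tell you which knobs to move, and in which direction, to push the response toward the target.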
From here, the engineer would typically execute confirmation runs with optimized factors to confirm optimized response.
Note that the software analysis typically allows the engineer to identify any run-order dependence. The execution of the DOE is typically performed in a randomized cell fashion (each 'cell' is a set of conditions for the experiment). Similarly, the experiments include some level of repetition to gauge the 'repeatability' of the 'system'. This inclusion can be explicit (run the same cell twice), but there is also some repeatability inherent in the design, since you are running multiple cells, albeit at different settings. Generally, though, the experiment includes explicitly repeated cells.
And finally there is the concept of manufacturability, which includes constraints of time, cost, physical limits, equipment capability, etc. (The ideal process works great, but it takes 10 years, costs a million dollars and requires projected settings outside the capability of the tool.)
Since you have manufacturing data, hopefully you also have data that captures the other categories of factors (1, 2, 3), so you should specifically analyze the data to try to identify such effects. This is typically done as A vs B comparisons: person A vs B, tool A vs B, consumable A vs B, consumable lot A vs B, summer vs winter, etc.
Basically, there are all sorts of comparisons you could envision here, checking for statistical differences between two populations.
A comment on the response: what is the yield criterion? You should know this in order to formulate the model. For semiconductors, we have line yield (process yield) as well as device yield. I assume for your work you are primarily concerned with line yield. So minimizing variability in the factors (from all the categories above) to achieve the desired response (target response(s) with minimal variability) is the primary goal.
APC (Advanced Process Control).
In many cases, there is significant trending that results from whatever reason: crappy tool control (the tool heats up), a crappy consumable (the target material wears, the polishing pad wears, the chemical bath gets loaded, whatever). The idea here is how to adjust the next batch/lot/wafer based upon the history of what came before: either improve the manufacturing to avoid/minimize this trending (run-order dependence) or adjust the process to accommodate it and achieve the desired response.
Time for lunch, hope this helps...if you post on the specific process module type, and even equipment and consumables, I might be able to provide more insight.

NeuroEvolution of Augmenting Topologies (NEAT) and global innovation number

I was not able to find why we should have a global innovation number for every new connection gene in NEAT.
From my little knowledge of NEAT, every innovation number corresponds directly to a (node_in, node_out) pair, so why not just use this pair of IDs instead of the innovation number? What new information is there in the innovation number? Chronology?
Update
Is it an algorithm optimization?
Note: this is more of an extended comment than an answer.
You have encountered a problem I also just ran into whilst developing a NEAT implementation for JavaScript. The original paper, published around 2002, is very unclear about this.
The original paper contains the following:
Whenever a new gene appears (through structural mutation), a global innovation number is incremented and assigned to that gene. The innovation numbers thus represent a chronology of the appearance of every gene in the system. [...] innovation numbers are never changed. Thus, the historical origin of every gene in the system is known throughout evolution.
But the paper is very unclear about the following case. Say we have two 'identical' (same structure) networks:
The networks above were initial networks; both have the same innovation IDs, namely [0, 1]. Now each network randomly mutates an extra connection.
Boom! By chance, they mutated to the same new structure. However, the connection IDs are completely different, namely [0, 2, 3] for parent 1 and [0, 4, 5] for parent 2, because the ID is counted globally.
But the NEAT algorithm fails to determine that these structures are the same. When one of the parents scores higher than the other, it's not a problem. But when the parents have the same fitness, we have a problem.
Because the paper states:
In composing the offspring, genes are randomly chosen from either parent at matching genes, whereas all excess or disjoint genes are always included from the more fit parent, or if they are equally fit, from both parents.
So if the parents are equally fit, the offspring will have connections [0, 2, 3, 4, 5], which means that some nodes have double connections. By removing global innovation counters and just assigning IDs by looking at node_in and node_out, you avoid this problem.
So when you have equally fit parents, yes, you have optimized the algorithm; but this is almost never the case.
Quite interesting: in the newer version of the paper, they actually removed that bolded line! Older version here.
By the way, you can solve this problem by assigning IDs based on node_in and node_out using pairing functions, instead of assigning global innovation IDs. This creates quite interesting neural networks when fitness is equal.
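As a minimal sketch of that idea (the Cantor pairing function is just one possible choice; the names here are illustrative):

#include <cstdint>
#include <iostream>

// Derive a deterministic innovation ID from (node_in, node_out), so two genomes
// that independently grow the same connection end up with the same gene ID.
uint64_t pairId(uint64_t in, uint64_t out) {
    // Cantor pairing function: maps each (in, out) pair to a unique integer.
    return (in + out) * (in + out + 1) / 2 + out;
}

int main() {
    // Both parents mutate the same new connection 2 -> 4 and agree on its ID.
    std::cout << pairId(2, 4) << "\n";  // prints 25 in both genomes
}

The trade-off is the one discussed elsewhere on this page: with endpoint-derived IDs, two structurally identical connections are always treated as matching genes, even if they appeared under completely different circumstances.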
I can't provide a detailed answer, but the innovation number enables certain functionality within the NEAT model to be optimal (like calculating the species of a genome), as well as allowing crossover between variable-length genomes. Crossover is not necessary in NEAT, but it can be done thanks to the innovation number.
I got all my answers from here:
http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
It's a good read
During crossover, we have to consider two genomes that share a connection between the same two nodes in their respective neural networks. How do we detect this collision without iterating over both genomes' connection genes again and again for each step of crossover? Easy: if both connections being examined during crossover share an innovation number, they connect the same two nodes, because they received that connection from the same common ancestor.
Easy Example:
If I am a genome with a specific connection gene with innovation number 'i', my children that take gene 'i' from me may eventually cross over with each other in 100 generations. We have to detect when these two evolved versions (alleles) of my gene 'i' are in collision to prevent taking both. Taking two of the same gene would cause the phenotype to probably loop and crash, killing the genotype.
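To make the "no repeated iteration" point concrete, here is a rough sketch (illustrative, not the reference implementation) of how keeping each parent's connection genes sorted by innovation number lets crossover classify matching, disjoint and excess genes in a single merge-style pass:

#include <iostream>
#include <vector>

struct ConnGene { int innovation; int in, out; double weight; };

// Both gene lists are assumed to be sorted by innovation number.
void align(const std::vector<ConnGene>& a, const std::vector<ConnGene>& b) {
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].innovation == b[j].innovation) {
            std::cout << "matching gene " << a[i].innovation << "\n";  // inherit from either parent
            ++i; ++j;
        } else if (a[i].innovation < b[j].innovation) {
            std::cout << "disjoint gene " << a[i].innovation << " (parent A)\n";
            ++i;
        } else {
            std::cout << "disjoint gene " << b[j].innovation << " (parent B)\n";
            ++j;
        }
    }
    // Whatever remains in the longer genome is an excess gene.
    for (; i < a.size(); ++i) std::cout << "excess gene " << a[i].innovation << " (parent A)\n";
    for (; j < b.size(); ++j) std::cout << "excess gene " << b[j].innovation << " (parent B)\n";
}

int main() {
    align({{1, 0, 2, 0.5}, {3, 2, 4, -0.1}},
          {{1, 0, 2, 0.7}, {2, 1, 2, 0.3}, {4, 0, 4, 0.9}});
}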
When I created my first implementation of NEAT I thought the same: why would you keep an innovation number tracker, and why would you use it only for one generation? Wouldn't it be better not to keep it at all and use a key-value pair with the nodes connected?
Now that I am implementing my third revision I can see what Kenneth Stanley tried to do with them and why he wanted to keep them only for one generation.
When a connection is created, it starts its optimization at that moment; the innovation number marks its origin. If the same connection pops up in another generation, it will start its optimization then. Per-generation innovation numbers try to single out the genes that come from a common ancestor, so that a gene that has been optimized for many generations is not aligned with one that was just generated. If the same innovation number is found in two genomes, that means the gene comes from the same origin and can thus be aligned.
Imagine then that you have your generation champion. Some of its genes will have a 50 percent chance of being lost, because aligned genes are treated equally.
What is better...? I haven't seen any experiments comparing the two approaches.
Kenneth Stanley also addressed this issue in the NEAT users page: https://www.cs.ucf.edu/~kstanley/neat.html
Should a record of innovations be kept around forever, or only for the current
generation?
In my implementation of NEAT, the record is only kept for a generation, but there
is nothing wrong with keeping them around forever. In fact, it may work better.
Here is the long explanation:
The reason I didn't keep the record around for the entire run in my
implementation of NEAT was because I felt that calling something the same
mutation that happened under completely different circumstances was not
intuitive. That is, it is likely that several generations down the line, the
"meaning" or contribution of the same connection relative to all the other
connections in a network is different than it would have been if it had appeared
generations ago. I used a single generation as a yardstick for this kind of
situation, although that is admittedly ad hoc.
That said, functionally speaking, I don't think there is anything wrong with
keeping innovations around forever. The main effect is to generate fewer species.
Conversely, not keeping them around leads to more species, some of them
representing the same thing but separated nonetheless. It is not currently clear
which method produces better results under what circumstances.
Note that as species diverge, calling a connection that appeared in one species a
different name than one that appeared earlier in another just increases the
incompatibility of the species. This doesn't change things much since they were
incompatible to begin with. On the other hand, if the same species adds a
connection that it added in an earlier generation, that must mean some members of
the species had not adopted that connection yet...so now it is likely that the
first "version" of that connection that starts being helpful will win out, and
the other will die away. The third case is where a connection has already been
generally adopted by a species. In that case, there can be no mutation creating
the same connection in that species since it is already taken. The main point is,
you don't really expect too many truly similar structures with different markings
to emerge, even with only keeping the record around for 1 generation.
Which way works best is a good question. If you have any interesting experimental
results on this question, please let me know.
My third revision will allow both options. I will add more information to this answer when I have results about it.

How can I remember which data structures are used by DFS and BFS?

I always mix up whether I use a stack or a queue for DFS or BFS. Can someone please provide some intuition about how to remember which algorithm uses which data structure?
A queue can generally be thought of as horizontal in structure, i.e. breadth/width can be attributed to it (BFS), whereas
a stack is visualized as a vertical structure and hence has depth (DFS).
Draw a small graph on a piece of paper and think about the order in which nodes are processed in each implementation. How does the order in which you encounter the nodes and the order in which you process the nodes differ between the searches?
One of them uses a stack (depth-first) and the other uses a queue (breadth-first) (for non-recursive implementations, at least).
I remember it by keeping Barbecue in my mind. Barbecue starts with a 'B' and ends with a sound like 'q', hence BFS -> queue, and the remaining one, DFS -> stack.
BFS explores/processes the closest vertices first and then moves outwards, away from the source. Given this, you want a data structure that, when queried, gives you the oldest element, based on the order of insertion. A queue is what you need in this case, since it is first-in-first-out (FIFO).
A DFS, on the other hand, explores as far as possible along each branch first and then backtracks. For this, a stack works better, since it is last-in-first-out (LIFO).
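If code helps it stick, here is a minimal sketch showing that the two traversals differ only in the container that holds the frontier (the adjacency list and names are illustrative):

#include <iostream>
#include <queue>
#include <stack>
#include <vector>

void bfs(const std::vector<std::vector<int>>& adj, int start) {
    std::vector<bool> seen(adj.size(), false);
    std::queue<int> frontier;              // FIFO: the oldest discovered node is expanded first
    frontier.push(start); seen[start] = true;
    while (!frontier.empty()) {
        int v = frontier.front(); frontier.pop();
        std::cout << v << " ";
        for (int w : adj[v]) if (!seen[w]) { seen[w] = true; frontier.push(w); }
    }
}

void dfs(const std::vector<std::vector<int>>& adj, int start) {
    std::vector<bool> seen(adj.size(), false);
    std::stack<int> frontier;              // LIFO: the newest discovered node is expanded first
    frontier.push(start);
    while (!frontier.empty()) {
        int v = frontier.top(); frontier.pop();
        if (seen[v]) continue;
        seen[v] = true;
        std::cout << v << " ";
        for (int w : adj[v]) if (!seen[w]) frontier.push(w);
    }
}

int main() {
    std::vector<std::vector<int>> adj = {{1, 2}, {3}, {3}, {}};
    bfs(adj, 0); std::cout << "\n";   // prints: 0 1 2 3
    dfs(adj, 0); std::cout << "\n";   // prints: 0 2 3 1
}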
Take it in alphabetical order:
B (BFS) ... C ... D (DFS)
Q (Queue) ... R ... S (Stack)
BFS always uses a queue; DFS uses a stack. As the earlier explanations point out, DFS relies on backtracking, and remember that backtracking proceeds via a stack.
BFS --> B --> Barbecue --> Queue
DFS --> S --> Stack
Don't remember anything.
Assuming the data structure used for the search is X:
Breadth first = nodes that entered X earlier have to be generated on the tree first: X is a queue.
Depth first = nodes that entered X later must be generated on the tree first: X is a stack.
In brief: Stack is Last-In-First-Out, which is DFS. Queue is First-In-First-Out, which is BFS.
BFS: breadth => queue
DFS: depth => stack
Refer to their structure.
The depth-first search uses a Stack to remember where it should go when it reaches a dead end.
DFSS
Stack (last in, first out, LIFO): for DFS, we go from the root towards the farthest node as far as possible; this is the same idea as LIFO.
Queue (first in, first out, FIFO): for BFS, we visit the nodes level by level; after visiting the upper level of nodes, we visit the lower level, which is the same idea as FIFO.
An easier way to remember, especially for young students, is to use a similar acronym:
BFS => Boy FriendS in queue (for popular ladies apparently).
DFS is otherwise (stack).
I would like to share this answer:
https://stackoverflow.com/a/20429574/3221630
Taking BFS and replacing the queue with a stack reproduces the same visiting order as DFS, but it uses more space than the actual DFS algorithm.
You can remember by making an acronym:
BQDS
Beautiful Queen has Done Sins.
In Hindi:
बहुरानी क्यु दर्द सहा ("Bahurani kyu dard saha", roughly: "Daughter-in-law, why did you endure the pain?")
Here is a simple analogy to remember: in BFS you go one level at a time, but in DFS you go as deep as possible down the left before visiting the others. You are basically stacking up a big pile of stuff and then analyzing it one item at a time, so if that is the STACK, then the other one is the queue.
Remember it as "piling up", "stacking up", as big as possible (DFS).
If you visually rotate the 'q' symbol (as in queue) by 180 degrees, you get a 'b' (as in BFS).
Otherwise it is stack and DFS.
