PIG error message: Projected field doesn't exist - join

I'm taking an on-line course. My current assignment asks me to compute an average for one of the fields.
I got this working when I was using a single relation, but the current task involves computing a value from a relation created by joining two others.
When I try a function based on the previously successful approach, I get this error message which has me confused.
Invalid field projection. Projected field [join2::Y_rate2::wtd_stars] does not exist in schema:
The code I have entered in the PIG shell is:
avg = FOREACH groupedJoin2 GENERATE AVG(join2::Y_rate2::wtd_stars);
When I enter
grunt> describe groupedJoin2
this is my output:
groupedJoin2: {group: chararray,join2: {(Y_rate2::business_id: chararray,Y_rate2::stars: int,Y_rate2::useful_clipped: int,Y_rate2::wtd_stars: double,Y_m2::business_idgroup: chararray,Y_m2::num_ratings: long,Y_m2::avg_stars: double,Y_m2::avg_useful: double,Y_m2::avg_wtdstars: double)}}
I believe that my problem is that I don't know how to reference the field I want to average, but hours of searching over several days have not enlightened me.
Can anyone point out how to reference the field, if that is my problem? If that isn't my problem, I'll be grateful for your pointing me in the right direction.

I think you want to say: AVG(join2.Y_rate2::wtd_stars)
Dot is used to dereference a bag or tuple (in this case, the join2 bag).
Colons are used after a join or group to disambiguate field names.
Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.
http://pig.apache.org/docs/r0.15.0/basic.html#deref


Handling a missing value in machine learning

I was analyzing a dataset with the following columns: [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows. So I thought I would extract the location from the tweet column of the same row (if the tweet contains a location) and put that value in the location column for that row.
Now I have some questions about this approach.
Is this a good way to do it? Can we fill in missing values by using the training data itself? Will not this be considered as a redundant feature (because we are deriving the values of this feature using some other feature)?
Can you please clarify your dataset a little bit more?
First, if the location column records where the tweet was posted from, then your method (filling in the location column for the rows where it is missing) is wrong, because a place mentioned in a tweet is not necessarily the place it was posted from.
Secondly, if we assume that the tweet text contains the location information correctly, then you can fill in the missing rows using the tweets' location information.
If the second assumption is correct, then this is a good approach because you are adding correct information to your dataset. In other words, you are giving the model more detailed information, so it can predict more accurately at test time.
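A minimal sketch of the fill-in step under the second assumption, written in plain Python. The function name `fill_missing_location` and the `known_places` list are hypothetical and not part of the question; a real dataset would need a proper gazetteer or a location tagger instead of a hand-written list.

```python
import re


def fill_missing_location(rows, known_places):
    """Return copies of rows with an empty 'location' filled from the tweet text.

    rows: list of dicts with 'location' and 'tweet' keys (assumed layout).
    known_places: hypothetical list of place names to look for.
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        if not row.get("location"):
            for place in known_places:
                # Whole-word, case-insensitive match of the place name
                pattern = r"\b" + re.escape(place) + r"\b"
                if re.search(pattern, row["tweet"], re.IGNORECASE):
                    row["location"] = place
                    break
        filled.append(row)
    return filled


rows = [
    {"id": 1, "location": "", "tweet": "Flooding in Paris tonight", "target_value": 1},
    {"id": 2, "location": "Berlin", "tweet": "Hello from London", "target_value": 0},
    {"id": 3, "location": None, "tweet": "Nothing to see", "target_value": 0},
]
filled = fill_missing_location(rows, ["Paris", "London"])
```

Rows that already have a location are left alone, and rows whose tweet mentions no known place stay missing.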
Regarding your question "Will not this be considered as a redundant feature (because we are deriving the values of this feature using some other feature)":
You can remove the location column and train your model on the remaining 3 columns. Then check the performance of the new model using different metrics (accuracy etc.) and compare it with the results of the model trained on all 4 columns. If there is no meaningful difference, or the model that uses the location column performs worse, then you can say the column is redundant. You can also use Principal Component Analysis (PCA) to detect correlated columns.
Finally, please NEVER put training data in your test dataset. The test scores will look better than they really are, and when you use your model in a real-world environment it will most probably fail.

Custom names detection

This is a project in a really early phase and I'm trying to find ideas on where to start.
Any help or pointers would be greatly appreciated!
My problem:
I have text on one side, and a list of named GraphDB elements on the other (usually the name is either an acronym or a multi-word expression). My texts are not annotated.
I want to detect whenever a name is explicitly used in the text. The trick is that it will not necessarily be a perfect string match (for example an acronym can be used to shorten a multi-word expression, or a small part can be left out). So a simple string search will not have a 100% recall (even though it can be used as a starter).
If I just had an input and I wanted it to match it to one of the names, I would do a simple edit distance computation and that's it. What bugs me is that I have to do this for a whole text, and I don't know how to approach/break down the problem.
I cannot break down everything in N-grams because my named entities can be a single word or up to seven words long... Or can I?
I have thousands of Graph elements so I don't think NER can be applied here... Or can it?
An example could be:
My list of names is ['Graph Database', 'Manager', 'Employee Number 1']
The text is:
Every morning, the Manager browse through the Graph Database to look for updates. Every evening, Employee 1 updates the GraphDB.
I want in this block of text to map the 4 highlighted portions to their corresponding item in the list.
I have a small background in Machine Learning but I haven't really ever done NLP. To be clear, I do not care about the meaning of these words, I just want to be able to detect them.
Thanks
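One way to explore the n-gram idea raised above, as a rough sketch only: slide a window of 1 to 7 words over the text, score each window against each name with a string-similarity ratio, and keep the best non-overlapping hits. The function names, the 0.7 threshold and the use of `difflib.SequenceMatcher` as a stand-in for edit distance are all assumptions for illustration. Acronyms and strong abbreviations such as "GraphDB" score below this threshold and would need separate handling.

```python
import re
from difflib import SequenceMatcher


def candidate_spans(tokens, max_words=7):
    """Yield (start, end, text) for every run of 1..max_words tokens."""
    for start in range(len(tokens)):
        last = min(start + max_words, len(tokens))
        for end in range(start + 1, last + 1):
            yield start, end, " ".join(tokens[start:end])


def find_mentions(text, names, threshold=0.7, max_words=7):
    """Return (span_text, matched_name) pairs, best scores first."""
    tokens = re.findall(r"\w+", text)
    scored = []
    for start, end, span in candidate_spans(tokens, max_words):
        for name in names:
            score = SequenceMatcher(None, span.lower(), name.lower()).ratio()
            if score >= threshold:
                scored.append((score, start, end, span, name))

    # Greedy selection: take the highest-scoring spans first and drop any
    # span that overlaps a token position that is already claimed.
    scored.sort(key=lambda item: -item[0])
    taken = set()
    mentions = []
    for score, start, end, span, name in scored:
        positions = set(range(start, end))
        if positions & taken:
            continue
        taken |= positions
        mentions.append((span, name))
    return mentions


names = ["Graph Database", "Manager", "Employee Number 1"]
text = ("Every morning, the Manager browse through the Graph Database to look "
        "for updates. Every evening, Employee 1 updates the GraphDB.")
mentions = find_mentions(text, names)
```

Comparing every window with every name is quadratic, so with thousands of names a first-pass filter (for example, only scoring windows that share a word with a name) would likely be needed.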

Unknown symbol in ERD

I was searching for the meaning of the symbol marked in red in the image below, but I didn't get anything.
So, do you guys know what it means?
It indicates a supertype/subtype relationship. The notation is used in Microsoft Visio, where it's called a category. A double horizontal line is used to indicate a complete category.
In your image, Jurusan records information about the parent/supertype while Animasi, TKJ, RPL and Otomotif describe children/subtypes.
Here's a video on the topic.
Supertype/subtype hierarchies
It does seem to represent supertypes and subtypes. Often the circle will have either a d or an o inside. The d stands for "disjoint" or non-overlapping and the o would stand for "overlapping". These would be constraints. A d would mean that an instance can only appear in one subtype. An o would imply that an instance can be in multiple subtypes. In your example, I'm assuming the supertype instance in question would come from your Colon Siswa table.
Since no information is included in the circle, we do not know. As for the line beneath the circle, there are two options: 1 line for partial completeness or 2 lines for total completeness. The completeness constraint identifies whether an instance in the supertype has to be in one of the subtypes. A single line implies that a supertype instance does not have to be in a subtype, while a double line (2 lines) implies that an instance HAS to be in a subtype.
Supertypes and subtypes are sometimes organized in a specialization hierarchy. The ER model is sometimes called an "enhanced" or "extended" ER Model or EER in that case.
The model you provided is also missing the subtype discriminator. The subtype discriminator would let us know which attribute (field) determines which subtype the instance belongs to. Since you're familiar with the language your model was created in, it may be obvious how it will be determined which instances from the supertype (parent entity) fall into which subtypes (child entities).
We sometimes call the relationships shown in a specialization hierarchy an "is a" relationship.
You can find examples of these relationships in the textbook below:
Coronel, C. (2017). Database Systems: Design, Implementation, & Management, 12th Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781337509596/
or the following reading room...not my own, but it is an example from the text above.
Reading Room Notes
It is a cardinality symbol; I think it is the one-to-many symbol.

Delphi - What Structure allows for SAVING inverted index type of information?

Delphi XE6. I am looking to implement a limited style of search, specifically an edit field where the user enters a business name to look up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for a business "First Bank of Kansas", the user should be able to enter "Fir Kan" and get a match. This means an inverted index type of structure: I have some type of list of each unique word, each with a list of IDs (document ID, primary key ID, etc., which is an integer). I am struggling with WHAT type of structure to use for this... I have approximately 250,000 business names, which contain 43,500 unique words. The count per word varies from 1 occurrence to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a balanced binary tree because searching normally takes O(log n) time, which is quite fast. Especially if business names change at runtime, an AVL tree should do well, although it's quite some work to implement it yourself. There should be many ready-to-use units for binary trees all over the internet.
For each entered word, collect the ID lists of every tree entry it matches and merge them into one list of IDs per entered word.
As the last step, take all those per-word lists of IDs and intersect them.
Only the IDs that match all entered words are left. Those IDs reference the business names being searched for.
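The lookup, merge and intersect steps above can be sketched as follows. This is an illustration in Python rather than Delphi, and it swaps the tree for a sorted word list with binary search, which gives the same ordered "find BAN, then walk forward" behaviour. All function names are hypothetical, and JSON is just one possible way to save and reload the structure.

```python
import bisect
import json


def build_index(names):
    """names: dict of business ID -> business name. Returns (sorted words, postings)."""
    postings = {}
    for doc_id, name in names.items():
        for word in name.upper().split():
            postings.setdefault(word, set()).add(doc_id)
    return sorted(postings), postings


def ids_for_prefix(words, postings, prefix):
    """Union of the ID sets of every indexed word that starts with prefix."""
    prefix = prefix.upper()
    ids = set()
    # Binary search for the first word >= prefix, then walk forward
    # until a word no longer starts with the prefix.
    for word in words[bisect.bisect_left(words, prefix):]:
        if not word.startswith(prefix):
            break
        ids |= postings[word]
    return ids


def search(words, postings, query):
    """IDs of the names that contain a word for every prefix in the query."""
    result = None
    for part in query.split():
        ids = ids_for_prefix(words, postings, part)
        result = ids if result is None else result & ids
    return result or set()


def dumps_index(postings):
    """Serialize the postings to a JSON string (sets become sorted lists)."""
    return json.dumps({word: sorted(ids) for word, ids in postings.items()})


def loads_index(text):
    """Rebuild (sorted words, postings) from a JSON string."""
    postings = {word: set(ids) for word, ids in json.loads(text).items()}
    return sorted(postings), postings


names = {1: "First Bank of Kansas", 2: "First Bank of Texas", 3: "Kansas Bankers Corp"}
words, postings = build_index(names)
hits = search(words, postings, "Fir Kan")
```

A sorted array suits an index that is built once and then only read; if names are added or removed at runtime, a balanced tree as suggested above avoids re-sorting.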

Bipartite graph maximum matching

I am new to graphs. I have two sets in a bipartite graph. I need to find unique matching of all the possible combinations. So I thought I would use Hopcroft-Karp to find a maximum matching. Being a newbie, I thought I would get the resulting matching graph, but all it tells me is 42. Ahhh, that really helps. I don't need to know how many matchings there are; I need to know the unique matchings themselves.
Am I missing something? How do I get the resulting matching?
I didn't check the data structures generated by the Hopcroft-Karp match function, only the return value. The return value is the number of matched pairs. However, there was also a self.pair dictionary in the Python code; the pair dictionary contains the matchings from "both" sides, which answers my question.
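To illustrate the difference between the size of a matching and the matching itself, here is a small sketch that returns the pairs directly. It is not the Hopcroft-Karp code referred to above. It uses the simpler augmenting-path method (often called Kuhn's algorithm), which finds a maximum matching too but is slower on large graphs. The function name and the graph layout (a dict from left vertices to their right neighbours) are assumptions.

```python
def maximum_matching(graph):
    """Return a maximum matching as a dict of left vertex -> right vertex.

    graph: dict mapping each left vertex to an iterable of right neighbours.
    """
    match_right = {}  # right vertex -> left vertex currently matched to it

    def try_augment(u, seen):
        # Try to match u, re-routing earlier matches when necessary.
        for v in graph[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in graph:
        try_augment(u, set())

    # Flip the mapping so the result reads left -> right
    return {u: v for v, u in match_right.items()}


graph = {"a": ["x", "y"], "b": ["x"], "c": ["y", "z"]}
matching = maximum_matching(graph)
size = len(matching)
```

Here `size` is the single number a count-only function would report, while `matching` holds the actual pairs.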
