Question about Scala on spark-shell (Docker)

I am just starting to learn Spark/Scala and I have a text file. For about five hours I've been trying to do the following 5 operations:
1. Create an RDD of key-value pairs of the form (line_number, line_text)
2. Create an RDD of key-value pairs of the form (line_number, array_of_line_words)
3. Create an RDD of key-value pairs of the form (word, (line_number, frequency))
4. Create an RDD of key-value pairs of the form (word, list_of_(line_number, frequency)). That is, for each word, gather all line-frequency pairs in a list
5. Create an RDD of key-value pairs of the form (word, list_of_lines). That is, for each word, gather all lines in a list
Honestly, whatever I tried failed miserably, because I am like a two-day-old Spark/Scala user trying to learn on his own.
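The five steps map almost one-to-one onto standard combinators. Below is a minimal sketch using plain Scala collections (with made-up sample lines) so it runs without a cluster; on an RDD the same pipeline would use sc.textFile(...).zipWithIndex for step 1 and groupByKey for steps 4-5, since the same method names exist on RDDs.

```scala
// Stand-in for sc.textFile(...): a few sample lines.
val lines = Seq("the cat sat", "the dog sat", "a cat")

// 1. (line_number, line_text)
val numbered: Seq[(Long, String)] =
  lines.zipWithIndex.map { case (text, i) => (i.toLong, text) }

// 2. (line_number, array_of_line_words)
val words: Seq[(Long, Array[String])] =
  numbered.map { case (n, text) => (n, text.split("\\s+")) }

// 3. (word, (line_number, frequency)) -- frequency of the word within that line
val wordLineFreq: Seq[(String, (Long, Int))] =
  words.flatMap { case (n, ws) =>
    ws.groupBy(identity).map { case (w, occ) => (w, (n, occ.length)) }
  }

// 4. (word, list_of_(line_number, frequency)) -- groupByKey on an RDD
val grouped: Map[String, Seq[(Long, Int)]] =
  wordLineFreq.groupBy(_._1).map { case (w, xs) => (w, xs.map(_._2)) }

// 5. (word, list_of_lines) -- keep only the line numbers
val wordLines: Map[String, Seq[Long]] =
  grouped.map { case (w, xs) => (w, xs.map(_._1)) }
```

Each step only reshapes the output of the previous one, so you can check them one at a time in the spark-shell.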

Related

I'm creating a dropdown from multiple ranges in Google Sheets & having a problem

https://docs.google.com/spreadsheets/d/1TbPKehggzCGWy6LyEuzHZc5RouBHhyju35bqIGpInhM/edit#gid=2100307022
In this sheet, I have List 1 (col A) and List 2 (col B). I am able to merge the data from these two lists. I want to use that merged data for the data validation options in col C. It needs to be dynamic, as List 1 and List 2 will have more names added.
You will see I have concatenated those lists in col H. While I know I can use "List from a range" to incorporate the data in H, I'd really like to do this without storing the merged data in a separate column. Is this possible? Is there a formula I can write that I can input directly into the data validation form to render this data?

Objective C: reading a string from CSV file

I am parsing a CSV file, and I have this row:
45,12,bruine verbinding,mechelse heide,bruin,"276,201,836,338,468",01050000208A7A000001000000010200000002000000443BFF11CF720D41F296072200BD0641189D9C0D026D0D417A50F15264C30641,"MULTILINESTRING((241241.883787597 186272.016616039,241056.256646373 186476.540499333))"
When I convert this string into an array with the method
NSArray *arrObjects = [strObjects componentsSeparatedByString:@","];
I get 13 array objects rather than the 8 I want. The object at index 5 splits into five more objects instead of one (because that field contains further commas), and the one at index 7 splits into two objects.
I want the full string at index 5 and at index 7, instead of five and two objects respectively. I know this happens because of the method componentsSeparatedByString:@",".
Since the CSV standard allows commas to appear inside a field, you can't blindly use componentsSeparatedByString:@"," to separate the fields.
It is actually a rather fussy problem to write a CSV parser that can handle line breaks, commas, and quotation marks as field data.
I would suggest either:
Dictating that the data for a field NOT contain commas, line breaks, or quotes (percent escape each field before saving it to the CSV)
or, if you must deal with data in that format, use an existing CSV library.
A quick Google search for "objective-c csv parser" turns up this on GitHub:
CHCSVParser
Since it claims to be a "proper CSV parser" it should handle fields containing commas.
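To see why the naive split fails and what a quote-aware splitter has to do, here is a minimal sketch (in Scala rather than Objective-C, but the state machine is identical): walk the line once, toggle a flag on each double quote, and only split on commas outside quotes. Note this deliberately ignores escaped quotes and embedded line breaks, which is exactly why a real library is the safer choice.

```scala
// Minimal quote-aware CSV field splitter: commas inside double-quoted
// fields do not split. Does NOT handle "" escapes or multi-line fields.
def splitCsvLine(line: String): Vector[String] = {
  val fields   = Vector.newBuilder[String]
  val current  = new StringBuilder
  var inQuotes = false
  for (c <- line) c match {
    case '"'              => inQuotes = !inQuotes; current += c
    case ',' if !inQuotes => fields += current.result(); current.clear()
    case other            => current += other
  }
  fields += current.result() // the last field has no trailing comma
  fields.result()
}
```

On a row shaped like the one in the question, this yields 8 fields, with the quoted coordinate list and the MULTILINESTRING each kept intact as a single field.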

Dataflow GroupByKey transform splits input rows

I run a Dataflow job to read data from files stored in GCS. Each record has an "event type", and my goal is to split the data per event type and write each output to a BigQuery table. Right now I'm using a filter to do this, but I'd like to try the GroupByKey transform, which hopefully can make the process dynamic, since new event types will flow in over time and can't be predicted at development time. So now my challenge is: I don't know if it's possible to construct a write transform per key (the key from the GroupByKey output). It would be ideal if that's doable; any other way to achieve this, or any advice, would be appreciated.
You don't need to write a transform for each value of event type; you just need to write a transform that can handle all values for event type.
A GroupByKey will produce a PCollection<KV<EventType, Iterable<ValueType>>>. Each record of this PCollection is a key-value pair: the key is an EventType and the value is an iterable of the values with that key. You can then apply a transform that converts each of these pairs into a TableRow representing the row you want to create in BigQuery. You can do this by defining a:
ParDo<KV<EventType, Iterable<ValueType>>, TableRow>
For example, if your EventType is a string and your ValueType is a string, then you might emit a row with two columns for each key-value pair. The first column might just be a string corresponding to the EventType, and the second column could be a comma-separated list of the values.
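The Dataflow SDK here is Java, but the shape of the transform is easy to sketch with plain Scala collections (hypothetical event data standing in for the PCollection): groupBy plays the role of GroupByKey, and the map afterwards plays the role of the ParDo that turns each (key, iterable) pair into a table row.

```scala
// A stand-in for TableRow: one row per event type, values joined by commas.
case class Row(eventType: String, values: String)

val events = Seq(("click", "a"), ("view", "b"), ("click", "c"))

val rows: Map[String, Row] =
  events
    .groupBy(_._1)  // GroupByKey: (eventType, Seq of (key, value) pairs)
    .map { case (k, kvs) =>
      // ParDo body: one output element per key-value-group
      (k, Row(k, kvs.map(_._2).mkString(",")))
    }
```

Because the event type travels inside each output element, a single downstream write handles every type, including ones that first appear after deployment.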

Simple way to analyze data based on common key

What would be the simplest way to process all the records that were mapped to a specific key, and output multiple records for that data?
For example (a synthetic one): assuming my key is a date and the values are intra-day timestamps with measured temperatures, I'd like to classify the temperatures into high/average/low within the day (i.e., more than one standard deviation above/below the day's average).
The output would be the original temperatures with their new classifications.
Using Combine.perKey(CombineFn) allows only one output per key, via the #extractOutput() method.
Thanks
CombineFns are restricted to a single output value because that allows the system to do additional parallelization: it can combine different subsets of the values separately, then combine their intermediate results in an arbitrary tree-reduction pattern, until a single result value is produced for each key.
If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:
(1) Use Combine.perKey() to calculate the stats per day
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics that can be used to do lookups in processElement.
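The four steps above can be sketched with plain collections (made-up readings; the per-key Map of statistics stands in for the side input, and in a real pipeline pass 1 would be a streaming Combine.perKey rather than an in-memory groupBy):

```scala
val readings = Seq((1, 10.0), (1, 20.0), (2, 30.0), (1, 15.0))

// Pass 1: per-key statistic (here just the mean) -- the "side input".
val mean: Map[Int, Double] =
  readings.groupBy(_._1).map { case (k, xs) =>
    (k, xs.map(_._2).sum / xs.length)
  }

// Pass 2: reprocess the ORIGINAL input one record at a time, classifying
// each element via a lookup -- no key's full value set is ever held.
val labeled: Seq[(Int, Double, String)] =
  readings.map { case (k, v) =>
    (k, v, if (v >= mean(k)) "high" else "low")
  }
```

The point of the two-pass shape is that pass 2 is a plain element-wise ParDo, so it parallelizes freely even when one key has far more values than fit in memory.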
Why not use a GroupByKey operation followed by a ParDo? The GroupBy would group all the values with a given key. Applying a ParDo then allows you to process all the values with a given key. Using a ParDo you can output multiple values for a given key.
In your temperature example, the output of the GroupByKey would be a PCollection of KV<Integer, Iterable<Float>> (I'm assuming you use an Integer to represent the day and a Float for the temperature). You could then apply a ParDo to process each of these KVs. For each KV you could iterate over the Floats representing the temperatures and compute the high/average/low statistics. You could then classify each temperature reading using those stats and output a record representing the classification. This assumes the number of measurements for each day is small enough to easily fit in memory.
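Sketching that GroupByKey-then-ParDo shape with plain Scala collections (made-up temperatures; groupBy stands in for the GroupByKey, and the per-group body is what the DoFn would do):

```scala
val temps = Seq((1, 10.0f), (1, 20.0f), (1, 30.0f), (2, 5.0f))

// (day, Seq of (temperature, classification)) per day
val classified: Map[Int, Seq[(Float, String)]] =
  temps.groupBy(_._1).map { case (day, kvs) =>
    val vs   = kvs.map(_._2)
    val mean = vs.sum / vs.length
    val sd   = math.sqrt(vs.map(v => math.pow(v - mean, 2)).sum / vs.length)
    // Classify each reading against mean +/- one standard deviation,
    // emitting one output record per input reading.
    val labeled = vs.map { v =>
      val label =
        if (v > mean + sd) "high"
        else if (v < mean - sd) "low"
        else "average"
      (v, label)
    }
    (day, labeled)
  }
```

Note that the whole day's readings are iterated twice (once for the stats, once to classify), which is exactly why this pattern needs each key's values to fit in memory.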

Script to delete columns based on list of possible values in row 1

I would like a script to delete columns in a Google Spreadsheet if the contents match a list of approximately 30 possible text strings. e.g. Custom Variable 1, Custom Variable 3, Custom Variable 9, etc.
I'm new to Google Scripts. I've searched this forum but haven't found a starting point that handles my specific situation -- deleting columns based on a list of string values rather than a single value or value input from a dialogue box.
Any help would be greatly appreciated.
Scott C
If I'm understanding what you're asking, you wish to read across row 1 and compare each cell's value against your list of roughly 30 values; if a cell matches any one of those values, delete that column.
One way you could do it:
Hard-code an array that contains all of the values you have in mind. Loop over row 1, storing each header value into a second array (so you have two arrays total when you're done: one with the hard-coded values, the other with the header values). Then take each value from Array 2 and see if it matches any of the words in Array 1. If it matches one of the values in your list, delete that column.
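The thread is about Apps Script, but the core logic is easy to sketch with an in-memory table (hypothetical headers and data below): collect the column indexes whose row-1 value is in the blacklist, and keep only the others. In Apps Script you would instead call sheet.deleteColumn(i) on the matching indexes, iterating from right to left so earlier deletions don't shift the remaining positions.

```scala
// Roughly 30 strings in practice; two shown here for the sketch.
val toDelete = Set("Custom Variable 1", "Custom Variable 3")

val table = Vector(
  Vector("Name", "Custom Variable 1", "City"), // row 1: headers
  Vector("Ann",  "x",                 "Ghent")
)

// Indexes of the columns whose header is NOT blacklisted...
val keep = table.head.indices.filterNot(i => toDelete(table.head(i)))

// ...then rebuild every row from just those columns.
val cleaned = table.map(row => keep.map(row).toVector)
```

Matching on row 1 only (rather than scanning whole columns) keeps the script a single pass over the header row, which matters once the sheet grows.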
