Shift in the columns of spool file - sqlplus

I am using a shell script to extract data from the 'extr' table. extr is a very big table with 410 columns and 61,047 rows of data; one record is around 5 KB.
The script is as follows:
#!/usr/bin/ksh
sqlplus -s \/ << rbb
set pages 0
set head on
set feed off
set num 20
set linesize 32767
set colsep |
set trimspool on
spool extr.csv
select * from extr;
/
spool off
rbb
#-------- END ---------
One fine day the extr.csv file had 2 records with an incorrect number of columns (one record with more columns and the other with fewer). Upon investigation I found that two duplicate records were repeated in the file. The primary key should be unique in the file, but in this case 2 records were repeated. Also, the shift in the columns was abrupt.
Small example of the output file:
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5003|A7A|AAB|249.67|AAB|153.33|205|R
5004|A8A|269|F
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
Here the primary-key records for 5003 and 5004 have reappeared in place of 5007 and 5008. The duplicate records have also shifted the records of 5007 and 5008 by appending/cutting down their columns.
I need your help in analysing why this happened. Why were the 2 rows extracted multiple times? Why were the other 2 rows missing from the file? And why were the records shifted?
Note: This script has been working fine for the last two years and has never failed except this once (mentioned above). It ran successfully on the next run. Recently we added one more program which accesses the extr table with a cursor (select only).

I reproduced a similar behaviour.
;-> cat input
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
Think of the input file as your database.
Now I write a script that accesses "the database" and simulates some random stalls.
;-> cat writeout.sh
# Start this script twice
while IFS=\| read a b c d e f; do
  # echo would need \c (or printf) to suppress the newline;
  # here I strip it with tr instead
  echo "$a|$b|$c|$d|" | tr -d "\n"
  (( sleeptime = RANDOM % 5 ))
  sleep ${sleeptime}
  echo "$e|$f"
done < input >> output
EDIT: Removed "cat input |" in the script above and replaced it with "< input".
Start this script twice in the background
;-> ./writeout.sh &
;-> ./writeout.sh &
Wait until both jobs are finished and see the result
;-> cat output
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|200|F
5003|A3A|AAB|153.33|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|215|F
154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
When I change the last line of writeout.sh into done > output I do not see the problem, but that might be due to buffering and the small amount of data.
I still don't know exactly what happened in your case, but it really looks like 2 programs writing simultaneously to the same file.
A job in TWS could have been restarted manually, 2 scripts in your masterscript might write to the same file, or something else.
Preventing this in the future can be done with some locking / checks (when the output file exists, quit and return an error code to TWS).
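For example, a minimal (untested) ksh sketch of such a check, using an atomic lock directory rather than just testing the output file; the lock path and the exit code are assumptions, not part of the original script:
#!/usr/bin/ksh
# mkdir is atomic, so only one instance can create the lock directory.
if ! mkdir /tmp/extr_spool.lock 2>/dev/null; then
    echo "extr extract already running - aborting" >&2
    exit 8      # non-zero exit code signals the failure to TWS
fi
trap 'rmdir /tmp/extr_spool.lock' EXIT

# ... run the sqlplus spool here ...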

Related

Splitting file processing by initial keys

Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
📂state-data/
|-📜AL.zip
|-📜AK.zip
|-📜...
|-📜WY.zip
📂county-data/
|-📂AL/
| |-📜COUNTY1.csv
| |-📜COUNTY2.csv
| |-📜...
| |-📜COUNTY68.csv
|-📂AK/
| |-📜...
|-📂.../
|-📂WY/
| |-📜...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
uid    | census_plot  | flood_zone
abc121 | ACVB-1249575 | R50
abc122 | ACVB-1249575 | R50
abc123 | ACVB-1249575 | R51
abc124 | ACVB-1249599 | R51
abc125 | ACVB-1249599 | R50
...    | ...          | ...
County Level Data - thousands of these (~300 cols wide)
uid    | county | subdivision    | tax_id
abc121 | 04021  | Roland Heights | 3t4g
abc122 | 04021  | Roland Heights | 3g444
abc123 | 04021  | Roland Heights | 09udd
...    | ...    | ...            | ...
So we join many county-level to a single state level, and thus have an aggregated, more-complete state-level data set.
Then we aggregate all the states, and we have a national level data set.
Desired Outcome
I can successfully merge one state at a time (many county to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)

    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keyed = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)

    merged = ({"state": state_rows_keyed, "county": county_rows_keyed}
              | beam.CoGroupByKey()
              | beam.Map(merge_logic))
    merged | WriteToParquet()
1. Start with a list of states
2. By state, generate filepatterns to the source data
3. By state, load and merge the filenames
4. Flatten the output from each state into a US data set
5. Write to Parquet file
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
I have two filepatterns in a dictionary, per state. To access those I need to use a DoFn to get at the element.
To express the way the data flows, I need access to the Pipeline in order to apply read_csv, which is a PTransform. Ex: df = p | read_csv(...)
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    df = p | read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
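For instance, assuming a county_dataframe built the same way and that both frames carry the shared uid column, the join itself might look like this (a sketch; deferred Beam dataframes support a pandas-like merge):
# Hypothetical join on the shared uid column.
merged = state_dataframe.merge(county_dataframe, on="uid")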
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes the county data in as a filename as an input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
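A rough sketch of that approach (untested; the paths, the uid join column, and the assumption that one state's data fits in memory are all illustrative):
import apache_beam as beam
import pandas as pd

class MergeCountyWithState(beam.DoFn):
    # Opens one county CSV with plain pandas and joins it to the
    # state data, which arrives as a side input.
    def process(self, county_path, state_df):
        county_df = pd.read_csv(county_path)
        merged = county_df.merge(state_df, on="uid", how="left")
        for row in merged.to_dict(orient="records"):
            yield row

with beam.Pipeline() as p:
    # State dataframe read up front and passed as a singleton side input.
    state_df = pd.read_csv("state-data/AL.zip", compression="zip")
    state_side = beam.pvalue.AsSingleton(p | "StateDF" >> beam.Create([state_df]))
    county_files = p | "CountyFiles" >> beam.Create(
        ["county-data/AL/COUNTY1.csv", "county-data/AL/COUNTY2.csv"])
    merged = county_files | beam.ParDo(MergeCountyWithState(), state_side)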

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step does some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds with a 3-second interval (this might change later so that we have overlapping windows). As the last step we have the SOG prediction, which takes about 15ish seconds to process and which is clearly our bottleneck transform.
So, the issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time, seemingly on 1 worker, while we have 50 available.
The logs show us that the SOG prediction step has an output once every 15ish seconds, which should not be the case if the windows were processed over more workers; this builds up huge latency over time, which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. When distribution works, this should only be around 15 sec (the SOG prediction time). So at this point we are clueless.
Does anyone see if there is something wrong with our code or how to prevent/circumvent this?
It seems like this is something happening in the internals of Google Cloud Dataflow. Does this also occur in Java streaming pipelines?
In batch mode, everything works fine. There, one could try to do a reshuffle to make sure no fusion etc. occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
                                        job_name='XX',
                                        num_workers=args.workers,
                                        max_num_workers=MAX_NUM_WORKERS,
                                        disk_size_gb=DISK_SIZE_GB,
                                        local=args.local,
                                        streaming=args.streaming)

pipeline = beam.Pipeline(options=pipeline_options)

# Build pipeline
# pylint: disable=C0330
if args.streaming:
    frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
        subscription=SUBSCRIPTION_PATH,
        with_attributes=True,
        timestamp_attribute='timestamp'
    ))

    frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(create_frame_tuples_fn)

    crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
    bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
        pred_bbox_tfserv_fn, SERVER_URL)

    sliding_windows = bboxs | 'Window' >> beam.WindowInto(
        beam.window.SlidingWindows(
            FEATURE_WINDOWS['goal']['window_size'],
            FEATURE_WINDOWS['goal']['window_interval']),
        trigger=AfterCount(30),
        accumulation_mode=AccumulationMode.DISCARDING)

    # GROUPBYKEY (per match)
    group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()
    _ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
        "window per match per timewindow: # %s, %s",
        str(len(x[1])), x[1][0]['timestamp']))

    sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
                                                      SERVER_URL_INCEPTION,
                                                      SERVER_URL_SOG)

pipeline.run().wait_until_finish()
In Beam the unit of parallelism is the key: all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
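For reference, a sketch of that workaround using the names from the question's code (untested; whether the trigger behaves the same after re-windowing is something to verify):
# Re-window into the global window, reshuffle to redistribute elements,
# then restore the sliding windows before the expensive step.
redistributed = (sliding_windows
                 | 'GlobalWindow' >> beam.WindowInto(beam.window.GlobalWindows())
                 | 'Reshuffle' >> beam.Reshuffle()
                 | 'RestoreWindows' >> beam.WindowInto(
                     beam.window.SlidingWindows(
                         FEATURE_WINDOWS['goal']['window_size'],
                         FEATURE_WINDOWS['goal']['window_interval'])))
sog = redistributed | 'Predict SOG' >> beam.Map(predict_sog_fn,
                                                SERVER_URL_INCEPTION,
                                                SERVER_URL_SOG)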
It looks like you do not necessarily need GroupByKey, because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window instead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally, but since it does not process key/value pairs I would expect another mechanism to do the load distribution.
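The answer leaves append_fn undefined; one plausible shape for it (hypothetical, written to cope with CombineGlobally also applying the function to partial results) is:
def append_fn(elements):
    # Called on raw window elements and on already-combined partial lists,
    # so flatten the lists and append everything else as-is.
    combined = []
    for e in elements:
        if isinstance(e, list):
            combined.extend(e)
        else:
            combined.append(e)
    return combined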

Asterisk PBX - Infinite Loop when user disconnects while using 'Read' application from LUA

I'm configuring interactive dial plans for Asterisk at the moment, and because I already know some Lua I thought it'd be easier to go that route.
I have a start extension like this:
["h"] = function(c,e)
app.verbose("Hung Up")
end;
["s"] = function(c, e)
local d = 0
while d == 0 do
say:hello()
app.read("read_result", nil, 1)
d = channel["read_result"].value;
if d == 1 then
say:goodbye()
elseif d == 2 then
call:forward('front desk')
end
d = 0
end
say:goodbye()
end;
As you can see, I want to repeat the say:hello() instructions whenever the user gives an invalid answer. However, if the user hangs up while app.read is waiting for their answer, Asterisk ends up in an infinite loop, since d will always be nil.
I would check for d == nil to detect disconnection, but nil also shows up when the user just presses the # pound sign during app.read.
So far I've taken to using for loops instead of while to limit the maximum number of iterations that way, but I'd rather find out how to detect a disconnected channel. I can't find any documentation on that, though.
I also tried setting up an h extension, but the program won't go to it when the user hangs up.
Asterisk Verbose Output:
[...]
-- Executing [s@test-call:1] read("PJSIP/2300-00000004", "read_result,,1")
-- Accepting a maximum of 1 digit.
-- User disconnected
-- Executing [s@test-call:1] read("PJSIP/2300-00000004", "read_result,,1")
-- Accepting a maximum of 1 digit.
-- User disconnected
-- Executing [s@test-call:1] read("PJSIP/2300-00000004", "read_result,,1")
-- Accepting a maximum of 1 digit.
-- User disconnected
-- Executing [s@test-call:1] read("PJSIP/2300-00000004", "read_result,,1")
[...]
Thanks for any help you might be able to offer.
First of all, you can see in the app_read docs (and any other doc) that it returns different values for incorrect execution (when the channel is down).
Also, this exact app offers a simplified way of determining the result:
core show application Read
-= Info about application 'Read' =-
[Synopsis]
Read a variable.
[Description]
Reads a #-terminated string of digits a certain number of times from the user
in to the given <variable>.
This application sets the following channel variable upon completion:
${READSTATUS}: This is the status of the read operation.
OK
ERROR
HANGUP
INTERRUPTED
SKIPPED
TIMEOUT
If that still doesn't suit you, you can ask Asterisk directly about CHANNEL(state).
PS: You should NEVER write a dialplan (or any other program) with an infinite loop. Count your loops and exit after 10 or so; this will save a LOT of money for the client.
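For instance, a minimal (untested) sketch in the question's Lua style, using the same channel-variable access convention; the status strings come from the Read docs above:
app.read("read_result", nil, 1)
-- READSTATUS is set by Read; HANGUP means the caller is gone.
local status = channel["READSTATUS"].value
if status == "HANGUP" or status == "ERROR" then
    return   -- leave the extension instead of re-prompting
end
local d = channel["read_result"].value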

AWK (or similar) - change 2 lines below the matching pattern

I have a problem that I think is easiest to solve with awk, but I can't wrap my head around it.
Inside a file I have repeating output like this:
....
Name="BgpIpv4RouteConfig_XXX">
<Ipv4NetworkBlock id="13726"
StartIpList="x.y.z.t"
PrefixLength="30"
NetworkCount="10000"
... other output
then this block will repeat.
a) I want to match on BgpIpv4Route.*, then skip 2 lines (the "n" keyword of awk), then, on reaching PrefixLength:
- either replace it with a random value in 25..30, or
- better, but I guess harder (no idea came to mind for keeping track of what was used and cycling among /25../30): first occurrence /25, second one /26, ... till /30, then roll back to /25
b) then, on the next line, recalculate NetworkCount from the new value of PrefixLength as 65536 / 2^(32 - PrefixLength)
e.g. if PrefixLength in this occurrence was replaced with /25, then NetworkCount on the line following it = 65536 / 2^7 = 65536 / 128 = 512
I found some examples of inserting/changing a line after one that matched (or, with a counter variable, X lines below the match), but I got a bit confused with the value-generation part, and also with changing two lines where one depends on the other.
Not sure I made any sense... my head is a bit overwhelmed with everything I'm finding right now.
Thanks in advance!
This should do:
$ awk 'BEGIN {q="\""; FS=OFS="="; n=split("25=26=27=28=29=30",ps)}
/BgpIpv4Route/ {c=c%n+1}
/PrefixLength/ {$2=q ps[c] q}
/NetworkCount/ {$2=q 65536/2^(32-ps[c]) q}1' file
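Tracing the script on the sample block above: the first match lands on ps[1] = 25, so PrefixLength and NetworkCount would be rewritten as follows (65536 / 2^(32-25) = 512):
Name="BgpIpv4RouteConfig_XXX">
<Ipv4NetworkBlock id="13726"
StartIpList="x.y.z.t"
PrefixLength="25"
NetworkCount="512"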
Perhaps minimize computation by changing 65536/2^(32-ps[c]) to 2^(ps[c]-16).
If there are free-standing PrefixLength and NetworkCount attributes, perhaps you need to qualify them to apply only within each BgpIpv4Route context.

Cobol RELEASE statement?

I am working on my first project with a Cobol SORT of VSAM files. I ran into the keyword RELEASE.
The book I have says: "The RELEASE statement transfers records from the INPUT PROCEDURE to the input phase of the sort operation."
My question is: does that take whatever I have in my SORT-REC (or whatever I called it) and send it directly into the OUTPUT PROCEDURE part of my SORT?
It seems confusing to me.
Cobol Code:
SORT SORT-FILE ASCENDING KEY SORT-PROVIDER
INPUT PROCEDURE IS PROC-THE-REC THRU PTR-X
OUTPUT PROCEDURE IS WRITE-THE-RPT THRU WTR-X.
MOVE CC-CERT-NO TO SAVE-CERT-NO.
MOVE CC-CERT-STATUS TO SAVE-CERT-STATUS.
MOVE CC-CERT-BEGIN-DATE TO SAVE-CERT-BEGIN-DATE.
MOVE CC-CERT-END-DATE TO SAVE-CERT-END-DATE.
MOVE CC-CERT-FUNDING TO SAVE-CERT-FUNDING.
MOVE CC-PROV-NUMB TO SAVE-PROV-NUMB.
MOVE CC-PROV-RES-CNTY TO SAVE-PROV-RES-CNTY.
MOVE CC-PROV-TYPE TO SAVE-PROV-TYPE.
MOVE CC-WORKER-USERID TO SAVE-WORKER-USERID.
MOVE CC-WORKER-NAME TO SAVE-WORKER-NAME.
RELEASE SORT-REC.
Following on from Bill:
Using a sort in Cobol is a bit like having two or three separate programs in the one program.
Pre-Sort (PROC-THE-REC in your case)
    |
    V
Sort
    |
    V
Post-sort (WRITE-THE-RPT in your case)
The RELEASE statement "Writes" a record to the "Sort" step.
In Unix you could achieve the same thing by:
- writing 2 separate programs (pre- and post-sort)
- replacing the RELEASE with WRITEs
- piping the output from the pre-sort program to the sort
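For example, a hypothetical Unix equivalent (pre_sort and post_sort are stand-in names; the RELEASE becomes a plain write to stdout, and sort(1) does the sorting):
./pre_sort < input.dat | sort | ./post_sort > report.txt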
On the mainframe you would use 3 JCL steps and temporary files.
On the mainframe, most sites I worked at discouraged (banned) the use of the Cobol SORT verb; you would write 2 programs and use the utility sort instead.
The release statement transfers records from the INPUT PROCEDURE to
the input Phase of the sort operation.
The input Phase of a sort is where SORT gets its data from, in this case.
COBOL Program
    Loop-construct
        Some COBOL code
        RELEASE
    Next

(The sort is actually an external program. In the case of the mainframe, it is the installed SORT product, usually DFSORT or SyncSort.)

Input Phase of SORT
SORT
Output Phase of SORT

    Another-Loop-construct
        Some COBOL code
        RETURN
    Next
COBOL Program
Your input procedure will process, release, and then continue. When all the data has been released, the sort will take place. The sorted records will be presented back to your program at the point of the RETURN statement you will have coded, and this will continue (return, stuff after return, another return, repeat until finished) until the whole sorted file is processed (assuming that nothing goes wrong and you don't want to stop early).
Normally, COBOL sort procedures look like this:
SORT SORTFILE ON SORT-ID, SORT-NAME, SORT-PHONE
    INPUT PROCEDURE IS READ-IN
    OUTPUT PROCEDURE IS PRINT-SORTED.

READ-IN SECTION.
loop:
    READ INPUTFILE.
    MOVE IN-ID TO SORT-ID.
    MOVE IN-NAME TO SORT-NAME.
    MOVE IN-PHONE TO SORT-PHONE.
    RELEASE SORT-REC.

PRINT-SORTED SECTION.
loop:
    RETURN SORT-REC.
    DISPLAY "id#: " SORT-ID.
    DISPLAY "Name: " SORT-NAME.
    DISPLAY "Phone: " SORT-PHONE.
An INPUT PROCEDURE supplies records to the sort process by writing them to the work file declared in the SORT's SD entry. But to write the records to the work file a special verb - the RELEASE verb - is used.
An operational template for an INPUT PROCEDURE, which gets records from an input file and RELEASEs them to the work file, is shown below.
OPEN INPUT InFileName
READ InFileName
PERFORM UNTIL TerminatingCondition
    Process input record to create SDWorkRec
    RELEASE SDWorkRec
    READ InFileName
END-PERFORM
CLOSE InFileName
For further details on the SORT see my tutorial at http://www.csis.ul.ie/cobol/course/SortMerge.htm
