Thank you for reading and for your willingness to help me.
I am basically a Java developer with little knowledge of the Oracle analytics side.
There is a materialized view that generates daily holdings of customer units; over the years it has slowed down a lot and on some days it fails outright. When I analysed it, I felt it could be rewritten in a better and simpler way.
My environment is Windows 2003 running inside Hyper-V, with 3000 MB of memory allotted.
Oracle Version :
Oracle9i Enterprise Edition Release 9.2.0.1.0 - Production
PL/SQL Release 9.2.0.1.0 - Production
"CORE 9.2.0.1.0 Production"
TNS for 32-bit Windows: Version 9.2.0.1.0 - Production
NLSRTL Version 9.2.0.1.0 - Production
Problem statement: I am working on a previously written materialized view that runs every night. As the years go by it fails more often and takes longer and longer. I am looking for ways to simplify it: the requirement is simply to generate one row for every date that falls within each row's date range. I don't want to use all_objects as a row generator, because I suspect it would be scanned once for each test_hold row (select count(*) from all_objects returns 30410 rows).
These are our steps.
1.
create table test_hold(
  act_id   varchar2(15),
  fm       varchar2(10),
  fund     varchar2(10),
  start_dt date,
  end_date date,
  units    number(15,4),
  holding  number(15,4)
);
2.
select * from test_hold;
ACT_ID   FM    FUND      START_DT     END_DATE     UNITS   HOLDING
A0001    FM1   ABER001   10/03/2004   11/10/2015     100       100
A0001    FM1   ABER001   12/10/2015   20/10/2015    -100         0
A0002    FM2   FSTA001   14/05/2012   03/03/2013     250       250
A0002    FM2   FSTA001   04/03/2013   19/03/2014     300       550
A0002    FM2   FSTA001   20/03/2014   19/10/2015    -550         0
3. Expected output.
ACT_ID   FM    FUND      TRAN_DATE    HOLDING
A0001    FM1   ABER001   10/03/2004       100
A0001    FM1   ABER001   11/03/2004       100
A0001    FM1   ABER001   12/03/2004       100
A0001    FM1   ABER001   …
A0001    FM1   ABER001   …
A0001    FM1   ABER001   11/10/2015       100
A0001    FM1   ABER001   12/10/2015         0
A0002    FM2   FSTA001   14/05/2012       250
I tried two approaches: CONNECT BY LEVEL and a pipelined function.
I found CONNECT BY LEVEL was not suited to my case. The pipelined function seemed suitable, but when the full customer set was generated it ended with the error shown below. I also noticed that when I try to create a table with CREATE TABLE ... AS SELECT * FROM the pipelined function, the oracle.exe memory shown in Windows Task Manager keeps growing from about 200,000 K to more than 1,000,000 K and is never released unless I restart the Oracle service.
SQL>set timing on;
SQL> set autotrace traceonly statistics;
SQL> /
ERROR:
ORA-00600: internal error code, arguments: [kohdtf048], [], [], [], [], [], [],
[]
17536980 rows selected.
Elapsed: 00:17:34.06
Statistics
----------------------------------------------------------
20 recursive calls
0 db block gets
64632 consistent gets
2295 physical reads
0 redo size
389043748 bytes sent via SQL*Net to client
12860951 bytes received via SQL*Net from client
1169134 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
17536980 rows processed
Kindly advise whether I am proceeding correctly, or whether there is a simpler alternative approach. Thank you for your help.
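For what it's worth, here is a minimal sketch of the kind of per-day expansion described above. It assumes a small helper table (day_numbers is a hypothetical name) holding the offsets 0, 1, 2, ... up to the longest start_dt..end_date span expected; it is built once and reused every night, so neither all_objects nor a pipelined function is touched per test_hold row.

-- One-off setup: one row per day offset (20000 is an assumed upper bound, roughly 54 years).
create table day_numbers(n number primary key);

begin
  for i in 0 .. 19999 loop
    insert into day_numbers values (i);
  end loop;
  commit;
end;
/

-- Expand each test_hold row into one row per calendar day in its range.
select t.act_id,
       t.fm,
       t.fund,
       t.start_dt + d.n as tran_date,
       t.holding
from   test_hold t,
       day_numbers d
where  d.n <= t.end_date - t.start_dt
order  by t.act_id, t.fund, tran_date;

This is only a sketch under those assumptions, not a drop-in replacement for the existing materialized view; the same join could feed a CREATE TABLE ... AS SELECT or a materialized view definition.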
Related
I'm getting unexpected results streaming in the cloud.
My pipeline looks like:
SlidingWindow(60min).every(1min)
.triggering(Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(30)))
)
)
.withAllowedLateness(15sec)
.accumulatingFiredPanes()
.apply("Get UniqueCounts", ApproximateUnique.perKey(.05))
.apply("Window hack filter", ParDo(
if(window.maxTimestamp.isBeforeNow())
c.output(element)
)
)
.toJSON()
.toPubSub()
If that filter isn't there, I get 60 windows per output, apparently because the Pub/Sub sink isn't window-aware.
So in the examples below, if each time period is a minute, I'd expect to see the unique count grow until 60 minutes when the sliding window closes.
Using DirectRunner, I get expected results:
t1: 5
t2: 10
t3: 15
...
tx: growing unique count
In Dataflow, I get weird results:
t1: 5
t2: 10
t3: 0
t4: 0
t5: 2
t6: 0
...
tx: wrong unique count
However, if my unbounded source has older data, I get normal-looking results until it catches up, at which point I start getting the wrong results.
I was thinking it had to do with my window filter, but removing that didn't change the results.
If I do a Distinct() then Count().perKey(), it works, but that slows my pipeline considerably.
What am I overlooking?
[Update from the comments]
ApproximateUnique inadvertently resets its accumulated value when the result is extracted. This is incorrect when the value is read more than once, as happens with windows that fire multiple times. Fix (will be in version 2.4): https://github.com/apache/beam/pull/4688
I've tried to use Ignite to store events, but I'm facing a problem of excessive RAM usage while inserting new data.
I'm running an Ignite node with a 1 GB heap and the default configuration.
curs.execute("""CREATE TABLE trololo (id LONG PRIMARY KEY, user_id LONG, event_type INT, timestamp TIMESTAMP) WITH "template=replicated" """);
n = 10000
for i in range(200):
values = []
for j in range(n):
id_ = i * n + j
event_type = random.randint(1, 5)
user_id = random.randint(1000, 5000)
timestamp = datetime.datetime.utcnow() - timedelta(hours=random.randint(1, 100))
values.append("({id}, {user_id}, {event_type}, '{timestamp}')".format(
id=id_, user_id=user_id, event_type=event_type, uid=uid, timestamp=timestamp.strftime('%Y-%m-%dT%H:%M:%S-00:00')
))
query = "INSERT INTO trololo (id, user_id, event_type, TIMESTAMP) VALUES %s;" % ",".join(values)
curs.execute(query)
But after loading about 10^6 events, I get 100% CPU usage because the whole heap is taken and GC is trying (unsuccessfully) to free some space.
If I then pause for about 10 minutes, GC eventually frees some space and I can continue loading new data.
Then the heap fills up again and the whole cycle repeats.
It's really strange behaviour, and I couldn't find a way to load 10^7 events without these problems.
Approximately, each event should take:
8 + 8 + 4 + 10 (timestamp size?) is about 30 bytes
30 bytes x 3 (overhead), so it should be less than 100 bytes per record
So 10^7 * 10^2 = 10^9 bytes = 1 GB
So it seems that 10^7 events should fit into 1 GB of RAM, doesn't it?
Actually, since version 2.0, Ignite stores all data off-heap with the default settings.
The main problem here is that you generate a very big query string containing 10,000 inserts, which has to be parsed and, of course, is held on the heap. If you decrease the number of rows per query, you will get better results.
Also, as you can see in the capacity-planning documentation, Ignite adds around 200 bytes of overhead for each entry. In addition, allow around 200-300 MB per node for internal memory, plus a reasonable amount of memory for the JVM and GC to operate efficiently.
If you really want to use only a 1 GB heap you can try to tune GC, but I would recommend increasing the heap size.
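As a rough illustration of the "smaller queries" point (the batch size of 1,000 is an assumption to tune, not an Ignite recommendation; curs is the cursor from the original snippet), the loading loop could be restructured like this:

import datetime
import random
from datetime import timedelta

BATCH = 1000            # assumed batch size: roughly 10x less SQL text per statement
TOTAL_BATCHES = 2000    # 2000 x 1000 = 2,000,000 rows, same volume as the original loop

for i in range(TOTAL_BATCHES):
    values = []
    for j in range(BATCH):
        id_ = i * BATCH + j
        event_type = random.randint(1, 5)
        user_id = random.randint(1000, 5000)
        ts = datetime.datetime.utcnow() - timedelta(hours=random.randint(1, 100))
        values.append("({0}, {1}, {2}, '{3}')".format(
            id_, user_id, event_type, ts.strftime('%Y-%m-%dT%H:%M:%S-00:00')))
    # Each INSERT statement now carries far less SQL text to parse and hold on the heap.
    curs.execute("INSERT INTO trololo (id, user_id, event_type, timestamp) VALUES %s;"
                 % ",".join(values))

The total number of rows loaded is unchanged; only the size of each parsed statement (and therefore the temporary strings on the heap) shrinks.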
My computing-cluster monitoring data is stored in InfluxDB with the following shape (minus a few columns):
time number parti user
---- ------ ----- ----
2017-06-02T06:58:52.854866584Z 59 gr01 user01
2017-06-02T06:58:52.854866584Z 6 gr01 user02
2017-06-02T06:58:52.854866584Z 295 gr02 user03
2017-06-02T06:58:52.854866584Z 904 gr03 user04
Data points arrive every 10 minutes. Right now I am plotting the sum for each "parti" with:
select sum(number) from status_logs where time > now() - 1h group by time(10m), parti
However, this becomes very slow when I show more than a few days, because of the time(10m) grouping. I cannot just use a coarser (variable) time window, because the sum() would no longer make sense.
My question: is there a way to take the average of the sum over a (variable) time window?
Thanks!
I'm trying to run the following, and I receive an error saying "ERROR: The SAS System stopped processing this step because of insufficient memory."
The dataset has about 1170 (rows) * 90 (columns) records. What are my alternatives here?
The error information is below:
332 proc assoc data=want1 dmdbcat=dbcat pctsup=0.5 out=frequentItems;
333 id tid;
334 target item_new;
335 run;
----- Potential 1 item sets = 188 -----
Counting items, records read: 19082
Number of customers: 203
Support level for item sets: 1
Maximum count for a set: 136
Sets meeting support level: 188
Megs of memory used: 0.51
----- Potential 2 item sets = 17578 -----
Counting items, records read: 19082
Maximum count for a set: 119
Sets meeting support level: 17484
Megs of memory used: 1.54
----- Potential 3 item sets = 1072352 -----
Counting items, records read: 19082
Maximum count for a set: 111
Sets meeting support level: 1072016
Megs of memory used: 70.14
Error: Out of memory. Memory used=2111.5 meg.
Item Set 4 is null.
ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set WORK.FREQUENTITEMS may be incomplete. When this step was stopped there were
1089689 observations and 8 variables.
From the documentation (http://support.sas.com/documentation/onlinedoc/miner/em43/assoc.pdf):
Caution: The theoretical potential number of item sets can grow very
quickly. For example, with 50 different items, you have 1225 potential
2-item sets and 19,600 3-item sets. With 5,000 items, you have over 12
million of the 2-item sets, and a correspondingly large number of
3-item sets.
Processing an extremely large number of sets could cause your system
to run out of disk and/or memory resources. However, by using a higher
support level, you can reduce the item sets to a more manageable
number.
So, provide a support= option and make sure it's sufficiently high, e.g.:
proc assoc data=want1 dmdbcat=dbcat pctsup=0.5 out=frequentItems support=20;
id tid;
target item_new;
run;
Is there a way to frame the data mining task so that it requires less memory for storage or operations? In other words, do you need all 90 columns or can you eliminate some? Is there some clear division within the data set such that PROC ASSOC wouldn't be expected to use those rows for its findings?
You may very well be up against software memory allocation limits here.
I have several ~50 GB text files that I need to parse for specific content. The file contents are organized in 4-line blocks. To perform this analysis, I read in subsections of the file using file.read(chunk_size), split them into blocks of 4 lines, and analyze them.
Because I run this script often, I've been optimizing it and have tried varying the chunk size. I run 64-bit Python 2.7.1 on OS X Lion on a machine with 16 GB of RAM, and I noticed that when I request chunks >= 2^31 bytes, instead of the expected text I get large runs of '\x00' repeated. As far as my testing has shown, this continues all the way up to, and including, 2^32, after which I get text again. However, above 2^32 (4 GB) it seems to return only as many real characters as the number of bytes requested beyond that point.
My test code:
for i in range((2**31)-3, (2**31)+3) + range((2**32)-3, (2**32)+10):
    with open('mybigtextfile.txt', 'rU') as inf:
        print '%s\t%r' % (i, inf.read(i)[0:10])
My output:
2147483645 '#HWI-ST550'
2147483646 '#HWI-ST550'
2147483647 '#HWI-ST550'
2147483648 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
2147483649 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
2147483650 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967293 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967294 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967295 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967296 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967297 '#\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967298 '#H\x00\x00\x00\x00\x00\x00\x00\x00'
4294967299 '#HW\x00\x00\x00\x00\x00\x00\x00'
4294967300 '#HWI\x00\x00\x00\x00\x00\x00'
4294967301 '#HWI-\x00\x00\x00\x00\x00'
4294967302 '#HWI-S\x00\x00\x00\x00'
4294967303 '#HWI-ST\x00\x00\x00'
4294967304 '#HWI-ST5\x00\x00'
4294967305 '#HWI-ST55\x00'
What exactly is going on?
Yes, this is a known issue, according to a comment in CPython's source code; you can check it in Modules/_io/fileio.c. The code adds a workaround for this on 64-bit Microsoft Windows only.
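Until that is addressed on your platform, one user-level workaround is to cap each individual read below 2^31 bytes and accumulate the pieces yourself. This is only a sketch under that assumption; read_capped is a hypothetical helper, not part of the standard library:

# Sketch: never ask the underlying C read() for 2 GiB or more in one call.
def read_capped(f, size, cap=2**31 - 1):
    parts = []
    remaining = size
    while remaining > 0:
        chunk = f.read(min(remaining, cap))  # each call stays below the 2^31 limit
        if not chunk:                        # EOF reached early
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return ''.join(parts)

with open('mybigtextfile.txt', 'rU') as inf:
    data = read_capped(inf, 2**32)
    print '%r' % data[:10]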