Data keeps on growing TokuMx no repairDatabase - tokumx

TokuMx though has benefits, we are running into issues. Recently we migrated to this engine and in process our clean up scripts are useless. We have transient data that we used clean every night and then reclaim disk via db.repairDatabase . However that command is not supported by TokuMX and as a result we are not able to reclaim the disk.
Is there an alternate way ?

It sounds like partitioned collections are the right abstraction for your application. Normal collections will suffer from the accumulation of MVCC garbage if you have a pattern of deleting large swaths of old data. With partitioned collections, you can drop a partition and reclaim all the space instantaneously.

Related

dask scheduler OOM opening a large number of avro files

I'm trying to run a pipeline via dask on a cluster on gcp. The pipeline loads a lot of avro files from cloud storage (~5300 files with around 300MB each) like this
bag = db.read_avro(
'gcs://mybucket/myfiles-*.avro',
blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point it runs out of memory (I have tried scaling my master node running the scheduler up to 64GB of RAM but that still does not suffice), while the workers are idling. I assume that the problem is that it has to create an excessive amount of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see if things are getting serialised there that are large, or if simply a massive number of tasks are being generated. If the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for completion.

Prune / Vacuum Neo4j indexes(3.0.6 enterprise)

We use Neo4J 3.0.6 Enterprise and experience that our queries become very slow.
What we found out hat index files have become very large and probably need to pruned/vacuumed (like postgresql does).
The data we use changes very quickly over time and the index files have a lot of stale or marked deleted data.
Is there a way to do clean index files and if yes, how?

Is it advisable to restart informatica application monthly to improve performance?

We have around 32 datamarts loading around 200+ tables out of which 50% of tables are on 11g Oracle database and 30% on 10g and rest 20 are flat files.
Lately we are facing performance issues while loading the datamarts.
Database parameters as well are network parameters are looking and as throughput is decreasing drastically we are of the opinion now that it is informatica which has problem.
Recently when through put had gone down and server was utilized to its 90% informatica application was restarted and the performance there after was little better than previous performance.
So my question is should we have Informatica restart as a scheduled activity ? Does restart actually improves the performance of the application or there are some other things which can play a role in the same?
What you have here is a systemic problem, but you have not established which component(s) of the system are the cause.
Are all jobs showing exactly the same degradation in performance? If not, what is the common characteristic of those that are? Not all jobs will have the same reliance on the Informatica server -- some will be dependent more on the performance of their target system(s), some on their source system(s), so I would be amazed if all showed exactly the same level of degradation.
What you have here is an exercise in data gathering, and then turning that data into useful information.
If you can isolate the problem to only certain jobs then I would take a log file from a time when the system is performing well, and from a time when it is not, and compare them directly, looking for differences in the performance of their components. You can also look at any database monitoring tools for changes in execution plan.
Rebooting servers? Maybe, but that is not necessarily the solution -- the real problem is the lack of data you have to diagnose your system.
Yes, It is good to do a restart every quarter.
It will refresh the Integration service cache.
Delete files from Cache and storage before you restart.
Since you said you have recently seen some reduced performance recently it might be due various reasons.
Some tips that may help:
Ensure all Indexes are in valid and compiled state.
If you are calling a procedure via worflow check the EXPLAIN plan and Cost ensure it is not doing a full table scan(cost should be less).
3.Gather stats on the source or target tables (especially which have deletes )) - This will help in de fragmentation - deleting the un allocated space. DBMS_STATS
Always good to have an house keeping scheduled weekly to do the above checks on indexes,remove temp/unnecessary files and gather stats (analyze indexes and tables).
Some best practices here performance tips

Slow Query Performance

I am running some very large databases (500 MB and 300 MB) in my application on several different machines.
From a hardware perspective, the machines have been identically configured.
I am using SQL Server CE 4.0 as my DBMS.
The performance critical query has been indexed to improve its performance.
The problem is that on [only] one of the machines, I am observing egregiously slow query performance. This usually happens after a long period of time of inactivity (from a query perspective). After I do several (about 7-8) queries, the slow performance disappears.
The weird thing is that this initial slow query performance does not happen on the other machine.
The only difference between the two machines is the data contained inside the databases.
I suspect that the distribution of data on the slow machine is somehow reducing the effectiveness of the indexing and that SQL Server CE has to rebalance the indexing in a much more significant way than on the other faster machine.
One thing I notice is that when the query is very slow, the disk activity increases significantly and the process corresponding to reading the database shows a spike in the read bytes.
This does not happen on the other machine.
Does anyone know how I might go about root causing this issue?
My code is written in C++ and uses the ATL/OLEDB API to manipulate the database.
UPDATE: My performance profiling activities indicate that it's not the query itself that is slow - it is the processing of the returned rowset that takes a while. For each row returned, I query another database for related data. I understand that this is not the right way to do it but the performance problem only happens on one machine. One thing I noticed is that when I have other unrelated queries happening on the same database in other threads, the unrelated queries will stall the query that is exhibiting the performance problem.

Data Warehouse: One Database or many?

At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.
I've been a database developer for a number of years now and this consolidation of function and form seems counter to everything I know.
It seems to make security, backup/restore and performance management issues more manually intensive.
Is this something that is done in the industry? Are there substantial reasons for doing or not doing it?
The platform is Netezza. The size is in terabytes, hundreds of millions of rows.
What I'm looking to get from answers to this question is a solid understanding of how right or wrong this path is. From your experience, what are the issues I should be focused on arguing if this is a path that will cause trouble for us down the road. If it is no big deal, then I'd like to know that as well.
In general I would recommend using separate databases. This is the configuration I have always seen used in production and it really makes a lot of sense since - as you mentioned - both databases have fundamentally different purposes / usage patterns / etc.
Edit
If you're using one physical server, the fewer instances on that server the simpler the management and the more efficient the process.
If you put TWO instances on the same Physical Server you get:
Negatives:
Half the memory to use
Twice the count of database process
Positives:
You could take the entire staging db down without affecting the DW
So which is more precious to you, outage windows or CPU and Memory?
On the same the physical server multiple instances make performance management issues MUCH more manual to solve. If you look at the health of one of the instances, it might look fine but users are reporting poor performance, so you have to look at the next instance to see if the problem may be coming from there... and so on per instance.
Security is also harder with more than one instance. At best it's just as hard as a single instance but it's never easier. You'll have two admin accounts (SYS or something), Duplicate process accounts, etc.
Tell us why you think it's better to have more than one instance.
ORIGINAL POST
Can we be clear on terms. When you say "in the same Database" do you mean to say the same instance, or the same physical server. If you did move the staging to a new instance would it reside on the same physical hardware?
I think people get a little too hung up on instances. If you're going to put two instances on the same piece of hardware, you're only doubling the number of everything to very little advantage. All the server processes will be running twice... all the memory pools will be cut in half.
so let's say you really did mean two separate physical boxes...
Let's say you buy 2 12-way boxes (just say). When you're staging db server is done for the day, those 12 CPU's are wasting away. When your users pack up and go home, your prod DW CPUs are wasting away. CPU cycles are perishable, you can't get them back. BUT, if you had one 24 way box... then the staging DB COULD use 20 CPUs at night for some excellent Parallel Execution for building summary tables and your users will have double the capacity for processes during the day.
so let's say you meant the same hardware.
"It seems to make security, backup/restore and performance management issues more manually intensive."
Guaranteed that performance issues are harder to solve the more instances that share the same hardware. Guaranteed.
Security
What security do you do at the instance level?
Backup
What DW are you backing up at the instance level? You're not backing up tablespaces, but rather whole instances? Seems like that pattern will fail at a certain size.
PLATFORM: NETEZZA
Not familiar with the tool specifically. So if it's a single instance on a single box, then the division would seem more logical than physical and therefore the reasons they exist is for management, not performance. You don't increase your CPUs or memory by adding a database, right? So it doesn't seem like there's no performance upside to it. Each DB may be adding separate processes (performance hit), or it might be completely logical like schemas in Oracle. If each database is managed by new processes than data going between them will mean IPC.
Maybe the addition of the Netezza tag will get some traction.
We use databases for every segment (INVENTORY, CRM, BILLING...). There are no performance downsides and maintenance and overview is much better.
Better late than never, but for Netezza:
There are no performance hits while querying cross database. Netezza allows only SELECT operations cross database, no INSERT, UPDATE or DELETEstatements allowed.
This means you cannot do:
THISDB(ADMIN)=>INSERT INTO OTHERDB..TBL SELECT * FROM THISDBTABLE;
but you can do \c OTHERDB then
OTHERDB(ADMIN)=>INSERT INTO TBL SELECT * FROM THISDB..THISDBTABLE;
You are also not able to create a materialized view on a cross-database object, for example:
OTHERDB(ADMIN)=>CREATE MATERIALIZED VIEW BLAH AS SELECT * FROM THISDB..THISDBTABLE;
Administration might be where you will decide (though you probably already did long ago) on what kind of database(s) you'll create. Depending on your infrastructure, you might have a TEST/QA system and a PROD system on the same box, or on separate boxes.
You will gain speed in the load and the output if the tables are in the same schema (database). Obvious...but hey, I said it.
There is more overhead the more tables you put into one schema. Backups time, size of backups, ease of use.
Where I am, we have many multiple TB databases within one data-warehouse. Our rule of thumb is that a single loading process or a single report query should NOT have to span database. This keeps "like" tables together but gives some allowances for our backups and contingency processes. It also makes it a bit easier to "find" data.
For those processes that need to break this rule, we will either move data from one database to the other or allow the process to join across schemas.
I'm not as familiar with Netezza, so I'm not 100% sure what your options might be.
Few points for you to consider
a) If the data in one or more staging, audit, dimension and fact table has to be joined, you are better off keeping them in one database
b) Typically you will retain dimension tables and fact tables in the same database and distribute on most frequently joined columns to leverage "co-located join" functionality of Netezza
c) You should be able to use SQL grant permission to manage access to all objects (DB, tables, views etc)

Resources