I need to load around 10,000 entities into Jena SDB every 3 hours,
but the first upload alone is taking a long time, around 2 hours.
I am using the Jena API to upload the data, and my SDB store is backed by MySQL.
I am sure I am doing something wrong.
Please tell me how to increase the upload speed.
I am using the following calls:
store.startBulkUpdate();
store.getDataset().getDefaultModel().add(modelWithData);
store.finishBulkUpdate();
Do I need to specify a chunk size, etc.?
Also, what is an optimal configuration for MySQL?
I have a requirement to build an interactive chatbot to answer queries from users.
We receive source files from different source systems, and we maintain a log of when each file arrived, when it was processed, etc., in a CSV file on Google Cloud Storage. Every 30 minutes a CSV is generated with a log of any new files that arrived, and it is stored on GCS.
Users keep asking via email whether files have arrived, which files are yet to come, etc.
If we could build a chatbot that reads the CSV data on GCS and answers user queries, it would be a great help in terms of response times.
Can this be achieved with a chatbot?
If so, please suggest the most suitable tools/coding language to achieve this.
You can achieve what you want in several ways. It all depends on your requirements for response time and CSV size.
Use BigQuery and an external table (also called a federated table). When you define it, you can choose a file (or a file pattern) in GCS, such as a CSV. Then you can query your data with a simple SQL query (see the sketch after this list). This solution is cheap and easy to deploy, but BigQuery has some latency (it depends on your file size, but queries can take several seconds).
Use a Cloud Function and Cloud SQL. When the new CSV file is generated, trigger a function on this event. The function parses the file and inserts the data into Cloud SQL. Be careful: a function can run for at most 9 minutes and can be assigned at most 2 GB of memory. If your file is too large, you can hit these limits (time and/or memory). The main advantage is latency (set the correct indexes and your query is answered in milliseconds).
Use nothing! In the fulfillment endpoint, fetch your CSV file, parse it, and find what you want, then discard it. Here you deploy nothing, but the latency is terrible and the processing heavy: you have to repeat the file download and parsing on every request. An ugly solution, but it can work if your file is small enough to fit in memory.
We could also imagine a more complex solution with Dataflow, but I feel that isn't your target.
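For option 1, a minimal sketch (my own illustration, not part of the original answer) using the google-cloud-bigquery Python client could look like the following; the project, dataset, table, bucket, file pattern, and column names are all invented placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Define an external (federated) table backed by the CSV file(s) in GCS.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-file-log-bucket/file_log_*.csv"]  # hypothetical bucket/pattern
external_config.autodetect = True                  # let BigQuery infer the schema
external_config.options.skip_leading_rows = 1      # skip the CSV header row

table = bigquery.Table("my-project.file_logs.arrivals")  # hypothetical dataset.table
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# The chatbot fulfillment can then answer "has file X arrived?" with plain SQL.
query = """
    SELECT file_name, arrived_at, processed_at
    FROM `my-project.file_logs.arrivals`
    WHERE file_name = @file_name
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("file_name", "STRING", "sales_20200101.csv")]
)
for row in client.query(query, job_config=job_config):
    print(row.file_name, row.arrived_at, row.processed_at)

Since an external table reads the GCS files at query time, each question is answered against the latest 30-minute CSV without any extra loading step.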
I have a 200 GB RDF file in .nt format. I want to load it into Virtuoso (using Virtuoso Open-Source Edition 6.1.6). I used the Virtuoso bulk loader from the command line, but the load hangs after a couple of hours of running. Do you have any idea how I can load this large file into Virtuoso efficiently? I want to load it fast.
I also tried to query my 200 GB RDF graph from Apache Jena. However, after running for 30 minutes it gives me a heap-space error. If you have a solution for either problem, kindly let me know.
Jena TDB has a bulk loader which has been used on large data inputs (hundreds of millions of triples).
What is the actual dataset you are loading? Is it actually just one file? We would recommend splitting into files of about 1GB max, and loading multiple files at a time with the bulk loader.
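As an illustration of the splitting step (my own sketch, not part of the original answer), here is a small Python script that breaks a line-oriented N-Triples file into roughly 1 GB chunks; the input and output paths are placeholders. The resulting files can then be handed to the TDB bulk loader a few at a time.

import os

SRC = "data.nt"              # placeholder path to the large .nt file
OUT_DIR = "chunks"           # where the ~1 GB chunk files are written
CHUNK_BYTES = 1 * 1024**3    # target chunk size (~1 GB)

os.makedirs(OUT_DIR, exist_ok=True)

def open_chunk(idx):
    # Each chunk is itself a valid N-Triples file because we only
    # split on line boundaries (one triple per line).
    return open(os.path.join(OUT_DIR, "part-%04d.nt" % idx), "wb")

idx, written = 0, 0
out = open_chunk(idx)
with open(SRC, "rb") as src:
    for line in src:
        if written >= CHUNK_BYTES:
            out.close()
            idx, written = idx + 1, 0
            out = open_chunk(idx)
        out.write(line)
        written += len(line)
out.close()
print("wrote %d chunk files to %s/" % (idx + 1, OUT_DIR))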
Have you done any performance tuning of the Virtuoso Server for the resources available on the machine in use, as detailed in the RDF Performance Tuning guide?
Please check with the status(''); command how many buffers are in use; if you run out during a load, you will be swapping to disk continuously, which will lead to the sort of apparent hangs you report.
Note you can also load the Virtuoso LD Meter functions to monitor the progress of the dataset loads.
I've been having this issue for some time now. On fillim.com (indie film distribution, so large files) we're using this fork of the s3_swf_upload gem for Rails. Almost everyone is complaining that an upload will fail 3-4 times before the file fully uploads.
We're on Heroku, so of course we need to do direct uploads to S3.
We're not getting any errors in our logs or in the browser, and we just cannot for the life of us find the cause.
Has anyone had these issues before? Does anyone know of alternatives? If anyone knows of an alternative that supports files larger than 2GB, that would be even better.
If you are trying to upload files to Amazon S3, then use AWS::S3, a Ruby library for uploading files.
http://amazon.rubyforge.org/
I think the default size limit is
:fileSizeLimit (integer = 524288000)
Individual file size limit in bytes (default is 512 MB)
You need to increase your fileSizeLimit.
The repeated failures are unsurprising. If you're going to upload files that large, you want to leverage S3's "multipart upload" support. Essentially, the file is broken up into pieces, sent in parts, then reassembled on the S3 side.
The official AWS SDK for Ruby supports this feature, but you'd have to wire it into your gem. I don't know whether or not that's outside the scope of what you were looking for.
Also, am I correct in understanding that you're wanting to allow users to upload files > 2GB from their web browsers?
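As a sketch of the idea (in Python with boto3 rather than the Ruby SDK mentioned above, purely for illustration), a multipart upload looks roughly like this; the bucket name, key, local path, and part sizes are placeholders:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above the threshold are split into parts, uploaded concurrently,
# and reassembled by S3; failed parts are retried without restarting the
# whole transfer.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # use multipart above 100 MB
    multipart_chunksize=100 * 1024 * 1024,  # 100 MB parts
    max_concurrency=4,                      # upload 4 parts in parallel
)

s3.upload_file(
    Filename="/path/to/large_film_file.mov",  # placeholder local file
    Bucket="my-upload-bucket",                # placeholder bucket
    Key="uploads/large_film_file.mov",
    Config=config,
)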
We are using Magento 1.4.1 for our store, with 30+ categories and 2000+ products. Every time I try to reindex, the "Catalog URL Rewrites" index takes a long time to complete. Please suggest how we can improve its speed.
Unfortunately, catalog_url_rewrites is the slowest index in Magento when you have a large number of SKUs, and the time is multiplied if you have a large number of store views. If you still have the default French/German store views, be sure to delete them; this will speed things up by a factor of 3.
There are no means to speed up the re-index other than beefing up hardware (or optimising server configuration).
Running the re-index via the command line will relieve the burden of HTTP, but if the php.ini is the same, it's going to take the same amount of time.
You can compare by running
php -i | grep php.ini
and comparing it to the output of a script accessed via HTTP that calls
phpinfo();
Otherwise, server tuning is everything, improving PHP and MySQL performance (which is a bit beyond the scope of this reply).
I don't know of a way to make this process faster. What I would suggest you do is:
Set up a cron job which runs something like this:
php (mageroot)/shell/indexer.php reindexall
php (mageroot)/shell/indexer.php --reindex catalog_url
I am sure about the first one, but not sure about the second one.
The cron job should run every night, for example.
I'd like to mock large (>100 MB) and slow file downloads locally with a Ruby service: Rails, Sinatra, Rack, or something else.
After starting the server and requesting something like http://localhost:3000/large_file.rar, I'd like to download the file slooowly (for testing purposes).
My question is: how do I throttle the local webserver to a certain maximum speed? Because if the file is stored locally, it will download very fast by default.
You should use curl for this, which allows you to specify a maximum transfer speed with the --limit-rate option. The following would download a file at about 10KB per second:
curl --limit-rate 10K http://localhost:3000/large_file.rar
From the documentation:
The given speed is measured in bytes/second, unless a suffix is
appended. Appending ‘k’ or ‘K’ will count the number as kilobytes, ‘m’
or ‘M’ makes it megabytes, while ‘g’ or ‘G’ makes it gigabytes.
Examples: 200K, 3m and 1G.
The given rate is the average speed counted during the entire
transfer. It means that curl might use higher transfer speeds in short
bursts, but over time it uses no more than the given rate.
More examples here (search for "speed limit"): http://www.cs.sunysb.edu/documentation/curl/index.html