Jsoup mobile data consumption - html-parsing

I'm using Jsoup to parse the HTML of 3 different webpages in the background every 10 minutes or so. However, I found that in 2 days I consumed 18 MB of network data... Is there some way to reduce this huge data consumption? I don't need the whole HTML page; is there a way to download only a part of the website's HTML?

One way out would be to create a webservice that does the scraping and parsing and offers the results back to you in a condensed form. Maybe create something on OpenShift?
18 MB is actually not much over 2 days. Are you sure you can't afford this?
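To make the webservice idea above concrete, here is a rough sketch (in Java) of what the app side could look like once the scraping moves server-side: instead of downloading three full HTML pages every 10 minutes, the device polls a small endpoint that returns only the values you extract. The URL and response layout are hypothetical, and asking for a gzipped response saves a bit more data on top of that.

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class CondensedPoller {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint: a small service that runs Jsoup server-side
            // and returns only the handful of values the app actually needs.
            URL endpoint = new URL("https://my-scraper.example.com/summary.json");
            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
            conn.setRequestProperty("Accept-Encoding", "gzip"); // ask for a compressed response

            InputStream raw = conn.getInputStream();
            InputStream in = "gzip".equalsIgnoreCase(conn.getContentEncoding())
                    ? new GZIPInputStream(raw)  // decompress only if the server actually sent gzip
                    : raw;

            StringBuilder body = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line);
                }
            }
            conn.disconnect();

            // A few hundred bytes of JSON instead of three full HTML pages.
            System.out.println(body);
        }
    }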

Related

Rails 4: metainspector gem slowing app down

In my Rails 4 app, I am using the metainspector gem to allow users to display metadata from a URL they post to a form.
Since I installed this gem, each time I visit a page of my own app where metadata is pulled from another website, the load time increases significantly.
The increase ranges from an imperceptible delay for small, local websites to almost freezing the app for larger, foreign websites.
To give you an idea, a regular page usually loads in under 400 ms; when we pull data with metainspector, it can go beyond 30,000 ms (I measured these load times with rack-mini-profiler).
I did not find much about similar issues online.
Here is what I am trying to figure out:
Does this sound normal, or did I set something up the wrong way?
Is there a way to speed up load time with metainspector? For instance by caching responses?
If there is no way to speed up load time, should I implement a timeout limit and display an error message?
That is perfectly normal; to be exact, it's not metainspector that is slowing your app down, it's the fact that you're requesting external URLs.
You should cache the responses using metainspector's built-in caching mechanism, and, if possible, also move the fetching to an asynchronous job on a background queue and save or cache the result.
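For illustration only, the shape of that pattern, independent of Rails or metainspector, is: answer the user's request from whatever is already cached, and let a background worker do the slow external fetch. A minimal sketch in Java (the class and method names are hypothetical, not metainspector's API):

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch: the request handler never fetches the external URL itself; it returns
    // whatever is cached (possibly nothing yet) and queues a background refresh.
    public class MetadataCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final ExecutorService backgroundQueue = Executors.newSingleThreadExecutor();

        public Optional<String> titleFor(String url) {
            backgroundQueue.submit(() -> cache.put(url, fetchTitle(url))); // slow work runs asynchronously
            return Optional.ofNullable(cache.get(url));                    // the page renders immediately
        }

        private String fetchTitle(String url) {
            // The slow external request lives here, off the user-facing request path.
            return "title for " + url; // placeholder for the real fetch
        }
    }

In the Rails app, the background queue would be a job framework and the cache would be Rails.cache or the database, but the split is the same.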

How can I reduce the waiting (TTFB) time?

I have a query that gets a list of users from a table, sorted by when they were created. I got the following timing diagram from the Chrome developer tools.
You can see that the TTFB (time to first byte) is too high.
I am not sure whether it is because of the SQL sort. If that is the reason, how can I reduce this time?
Or is it just the TTFB itself? I saw blogs which say that TTFB should be low (< 1 sec), but for me it shows > 1 sec. Is it because of my query or something else?
I am not sure how I can reduce this time.
I am using Angular. Should I sort the table in Angular instead of with SQL? (Many posts say that shouldn't be the issue.)
What I want to know is how I can reduce the TTFB. I am actually new to this; it is a task given to me by my team members. I have read many posts but could not understand them properly. What is TTFB? Is it the time taken by the server?
The TTFB is not the time to the first byte of the body of the response (i.e., the useful data, such as JSON, XML, etc.), but rather the time to the first byte of the response received from the server. This byte is the start of the response headers.
For example, if the server sends the headers before doing the hard work (like heavy SQL), you will get a very low TTFB, but it won't reflect the real processing time.
In your case, TTFB represents the time you spend processing data on the server.
To reduce the TTFB, you need to do the server-side work faster.
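If it helps to make TTFB tangible, here is a rough way to measure it yourself; the URL below is a placeholder. The elapsed time also includes DNS lookup and connection setup, but for a slow query the server-side work dominates.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TtfbProbe {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://example.com/users?sort=created_at"); // placeholder endpoint
            long start = System.nanoTime();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // getResponseCode() blocks until the status line and headers arrive, so the
            // elapsed time approximates TTFB: connection setup + time the server spends working.
            int status = conn.getResponseCode();
            long ttfbMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("HTTP " + status + ", approx TTFB: " + ttfbMs + " ms");
            conn.disconnect();
        }
    }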
I have met the same problem. My project runs on a local server, and I checked my PHP code:
$db = mysqli_connect('localhost', 'root', 'root', 'smart');
I was using localhost to connect to my local database, and resolving that hostname may be the cause of the problem you're describing. You can modify your HOSTS file and add the line
127.0.0.1 localhost
TTFB is something that happens behind the scenes; your browser knows nothing about what the server is doing during that time.
You need to look into what queries are being run and how the website connects to the server.
This article might help you understand TTFB, but otherwise you need to dig deeper into your application.
If you are using PHP, try adding <?php flush(); ?> after </head> and before </body>, or after whatever section you want to output quickly (like the header or the content). It flushes the output generated so far without waiting for PHP to finish. Don't use this function everywhere, or the speed increase won't be noticeable.
More info
I would suggest you read this article and focus more on how to optimize the overall response to the user's request (whether a page, a search result, etc.).
A good argument for this is the example they give about using gzip to compress the page. Even though TTFB is faster when you do not compress, the overall experience for the user is worse because it takes longer to download content that is not zipped.

How to manage millions of tiny HTML files

I have taken over a project which is a Ruby on Rails app that basically crawls the internet (in cronjobs). It crawls selected sites to build historical statistics about their subject over time.
Currently, all the raw HTML it finds is kept for future use. The HTML is saved into a Postgres database table and is currently at about 1.2 million records taking up 16 gigabytes and growing.
The application currently reads the HTML once to extract the (currently) relevant data and then flags the row as processed. However, in the future, it may be relevant to reprocess ALL files in one job to produce new information.
Since the tasks in my pipeline include setting up backups, I don't really feel like doing daily backups of this growing amount of data, knowing that 99.9% of the backup is just static records with a practically immutable column of raw HTML.
I want this HTML out of the PostgreSQL database and into some storage. The storage must be easily manageable (inserts, reads, backups) and relatively fast. I (currently) have no requirement to search/query the content, only to resolve it by a key.
The ideal solution should allow me to insert and get data very fast, easy to backup, perhaps incrementally, perhaps support batch reads and writes for performance.
Currently, I am testing a solution where I push all these HTML files (about 15-20 KB each) to Rackspace Cloud Files, but it takes forever and the access time is somewhat slow and relatively inconsistent (between 100 ms and several seconds). I estimate that accessing all files sequentially from Rackspace could take weeks. That's not an ideal foundation for devising new statistics.
I don't believe Rackspace's CDN is the optimal solution and would like to get some ideas on how to tackle this problem in a graceful way.
I know this question has no code and is more a question about architecture and database/storage solutions, so it may be treading on the edges of SO's scope. If there's a more fitting community for the question, I'll gladly move it.

Can I download and parse a large file with RestKit?

I have to download thousands or millions of hotposts from a web service and store them locally in Core Data. The JSON response or file is about 20 or 30 MB, so the download will take time. I guess mapping and storing it in Core Data will also take time.
Can I do this with RestKit, or has it been designed only for reasonably sized responses?
I see I can track progress when downloading a large file, and I can even know when mapping starts or finishes: http://restkit.org/api/latest/Protocols/RKMapperOperationDelegate.html
Probably I can also encapsulate the Core Data operation to avoid blocking the UI.
What do you think? Is this feasible, or should I choose a more manual approach? I would like to know your opinion. Thanks in advance.
Your problem is not encapsulation or threading, it's memory usage.
For a start, thousands or millions of 'hot posts' are likely to cause you issues on a mobile device. You should usually be using a web service that allows you to obtain a filtered set of content. If you don't have that already, consider creating it (possibly by uploading the data to a service like Parse.com).
RestKit isn't designed to use a streaming parser, so the full JSON will need to be deserialised into memory before it can be processed. You can try it, but I suspect the mobile device will be unhappy if the JSON is 20 / 30 MB.
So, create a nice web service, or use a streaming parser and process the results yourself (which could technically be done using RestKit mapping operations).
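To make the streaming-parser suggestion concrete: the idea is to read the 20-30 MB payload one record at a time and save in batches, instead of deserialising the whole document into memory at once. The sketch below uses Java and Jackson purely as an illustration of that idea (an iOS app would use an Objective-C streaming JSON parser instead); the endpoint and field handling are hypothetical.

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;

    import java.io.InputStream;
    import java.net.URL;

    public class StreamingImport {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint returning a very large JSON array of records.
            try (InputStream in = new URL("https://example.com/hotposts.json").openStream();
                 JsonParser parser = new JsonFactory().createParser(in)) {

                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IllegalStateException("expected a JSON array");
                }
                // Only one record is held in memory at a time, however large the payload is.
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    while (parser.nextToken() != JsonToken.END_OBJECT) {
                        String field = parser.getCurrentName();
                        parser.nextToken();    // move to the value
                        parser.skipChildren(); // no-op for scalars, skips any nested structure
                        // persist the field/value here, batching writes for speed
                    }
                }
            }
        }
    }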

MODX Revo: How can I make pages load fast?

I'm working on a site with MODX Revo. I'm really annoyed by the slow loading of pages. There's a 2-second wait for a page load on my localhost, and I have an SSD. I've been looking around to find out how to make page loads faster.
I do have a lot of getResources/Gallery calls (9 total) and two Wayfinder calls. I've read it had to do with those, so I got rid of all the getResources calls and changed them to custom snippets that do only what I need them to do: build a 3-4 item menu. It's still slow; that only changed things by a few hundred ms.
The Galleries (5 of them) contain only 3-4 images each. I also use Babel, which checks every resource ID for its translation counterpart.
I'm wondering if it has anything to do with my WampServer (v2.2) settings...
Now that I've summed it all up, it does look like a heavy page. Will I get long page loads with any CMS this way?
Any help/hints/tips are appreciated!
You might want to cache all snippet tags by not using the exclamation mark: [[snippet]] calls a snippet cached, while [[! ... ]] calls it uncached.
Here is a blog about caching guidelines: http://www.markhamstra.com/modx-blog/2011/10/caching-guidelines-for-modx-revolution/
Here is a current discussion about speed performance: http://forums.modx.com/thread?thread=74902#dis-post-415390
