The fastest programming language for scraping dynamic sites [closed] - parsing

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I am writing a site parser in Python (I pull data from the pages, process it, and perform various arithmetic operations on values that are generated with JS). I use Selenium plus pure lxml where possible, but I am not happy with the performance.
I want to rewrite it in another, faster programming language, but I do not know which one to choose.
Some people write that Scala does everything, some say C++ (not even C), others suggest Assembler, Rust, Perl, PHP... In general, I'm confused. Which language parses a dynamic site fastest?

Assuming the pages being scraped are not in your local network (and maybe even if they are, depending on how they are generated), it's likely that the slowest part of your scrape will be waiting for the page to be sent over the network.
Since you're scraping multiple pages, the simplest way of speeding up the process is to scrape multiple pages in parallel, so that it is not necessary to wait for one page to finish before you start downloading the next one.
Any language which allows parallel processing would work, but even if the language doesn't support it, you could run several scraping processes in parallel using a standard shell.
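As a minimal sketch of the parallel approach in Python (the asker's current language), the snippet below fetches several pages concurrently with a thread pool. The names fetch and scrape_all are hypothetical, and the plain urllib download is only a stand-in: pages that need JavaScript rendering would require a browser-backed fetch function instead.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10.0):
    """Download one page and return its raw bytes (stand-in for a real fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_all(urls, fetch_page=fetch, max_workers=8):
    """Fetch every URL concurrently; results are returned in input order."""
    # Threads suit this workload because each one spends most of its time
    # waiting on the network rather than using the CPU.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, urls))
```

Passing a different fetch_page callable keeps the parallel part independent of how a single page is retrieved, so the parsing code (lxml, for example) can stay unchanged.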

Related

What frameworks / technologies to use in web applications for Real Time features? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
What do you recommend (or not) to use for Real Time features (like chats or auctions) in web applications?
Most important to me are your opinions or benchmarks about the efficiency / performance / speed of specific frameworks, technologies and solutions.
For example:
Ruby on Rails + ActionCable
Phoenix + Elixir
Socket.io
QUESTION'S CONTEXT:
Each framework, programming language and technology has advantages and disadvantages which make it more or less effective for Real Time needs. Sometimes we can use multiple technologies to build an app's backend, for example when the backend is a set of cooperating services (SOA, microservices, etc.). Because of both points, we are able to create some features in Ruby on Rails (because the implementation is fast) and others in Java (because it runs fast).
If I were in your position, I would follow the Elixir & Phoenix path.
Elixir is basically Erlang with better syntax, and it is open to extension via macros, so you can customize it however you want.
Please take a look at these great articles about that:
The road to 2 million websocket connections
Phoenix Channels vs Rails Action Cable
Basically:
Elixir was created to handle such scenarios with grace, efficiency, low latency, great scalability and fun.
P.S. Please remember that compilation time is not as important as the time spent handling a request, returning a response, or handling multiple websocket connections.
Elixir is not the fastest language, but it leverages concurrency and it's unique in terms of responsiveness.

File processing in Erlang [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
Is Erlang a good option for processing files of different types? For example, reading PDF, Word and Excel documents and transforming them? I know we can use Apache POI/Tika, which are Java-based, and then integrate using JErlang.
I am not very familiar with Erlang's bit syntax, but I wanted to check whether Erlang is suited to this kind of application without using Apache POI.
Erlang has great binary support, which makes it a great language for parsing different kinds of binaries.
For example, to decode a TCP segment using the binary syntax in Erlang you can do something like:
decode(Segment) ->
    case Segment of
        << SourcePort:16, DestinationPort:16,
           SequenceNumber:32,
           AckNumber:32,
           DataOffset:4, _Reserved:4, Flags:8, WindowSize:16,
           Checksum:16, UrgentPointer:16,
           Payload/binary>> when DataOffset>4
            ->
            OptSize = (DataOffset - 5)*32,
            << Options:OptSize, Message/binary >> = Payload,
            %% Split the 8-bit Flags field into its individual flag bits.
            <<CWR:1, ECE:1, URG:1, ACK:1, PSH:1, RST:1, SYN:1, FIN:1>> = <<Flags:8>>,
            %% Can now process the Message according to the
            %% Options (if any) and the flags CWR, ..., FIN.
            binary_to_list(Message)
    end.
Compared to other languages, this is a very easy way to use pattern matching and binary support to decode and encode binaries.
Nevertheless, Erlang is more about concurrent processing and message passing between processes, so I wouldn't use it to transform or parse documents. I would instead use Erlang to manage the web server/API and to handle all the concurrent connections, and I would delegate the job of transforming the documents to raw C/C++ for performance. In addition, C/C++ and Java have richer libraries for working with PDF, Excel and Word documents.
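For comparison with "other languages", here is a rough Python sketch of the same TCP header decoding using the standard struct module. The function name decode_tcp_segment and the returned dictionary layout are hypothetical; only the field widths come from the Erlang example above.

```python
import struct

# Flag names in bit order, most significant bit first.
FLAG_NAMES = ("CWR", "ECE", "URG", "ACK", "PSH", "RST", "SYN", "FIN")


def decode_tcp_segment(segment: bytes) -> dict:
    """Split a raw TCP segment into header fields, options and message."""
    # Fixed 20-byte header: two ports, sequence and ack numbers, the
    # offset/reserved byte, the flags byte, window, checksum, urgent pointer.
    (src_port, dst_port, seq, ack, offset_reserved, flags,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])

    data_offset = offset_reserved >> 4  # header length in 32-bit words
    if data_offset < 5:
        raise ValueError("data offset must be at least 5 words")

    header_len = data_offset * 4
    return {
        "source_port": src_port,
        "destination_port": dst_port,
        "sequence_number": seq,
        "ack_number": ack,
        "window_size": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
        "flags": {name: bool(flags & (0x80 >> i))
                  for i, name in enumerate(FLAG_NAMES)},
        "options": segment[20:header_len],
        "message": segment[header_len:],
    }
```

The shifts and masks that Erlang expresses declaratively in the pattern have to be written out by hand here, which is the contrast the answer is drawing.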

Will there be a dynamic code injection for dart? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am currently preparing a talk about Polymer.dart and would like to give a short introduction to Dart. There is one question I would like to be prepared for:
Will there ever be dynamic code injection via <script> for Dart?
This article says that there is currently no support for this, for a good reason.
However, the word "currently" qualifies the statement a bit, and I wonder whether anything is planned to support dynamic code injection in the future.
If, for example, the "eval" command is introduced in Dart, then the answer is YES, Dart would be vulnerable to injection attacks.
JavaScript is in this regard like SQL: it has the same vulnerability as all other dynamically interpreted programming languages (this includes all shell scripts, PHP...), which I call "DATA IS CODE". Such languages have a concrete syntax which is meant for human consumption, and their processing entails a first step called PARSING: the sequence of characters is broken down into an internal structure which describes the meaning of the expression, in a way that lets the computer distinguish the DATA from the INSTRUCTIONS. It is the same problem that led to the introduction of the NX (No-eXecute) bit on modern CPUs. Functions like "eval" open the door for malicious code to be executed with no constraint. Parsing code at runtime should NEVER be allowed in a secure language.
This is why Dart doesn't recommend the use of injection, as explained here:
https://www.dartlang.org/articles/embedding-in-html/#no-script-injection-of-dart-code
"No script injection of Dart code: We do not currently support or recommend dynamically injecting a <script> tag that loads Dart code. Recent browser security trends, like Content Security Policy, actively prevent this practice."
But Google should do more than that and forbid it entirely, together with the "eval" command.
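The "data is code" risk described above is not specific to Dart or JavaScript. As an illustrative sketch in Python (the helper name parse_untrusted and the sample payload are hypothetical), compare eval, which executes whatever the string contains, with ast.literal_eval, which accepts only literal syntax:

```python
import ast


def parse_untrusted(text):
    """Parse a literal (number, string, list, dict, ...) without running code."""
    return ast.literal_eval(text)


payload = "log.append('attacker code ran')"

# eval() treats the incoming string as instructions, so the "data" executes.
log = []
eval(payload, {"log": log})

# literal_eval() rejects the same payload because it is not a plain literal.
try:
    parse_untrusted(payload)
    rejected = False
except ValueError:
    rejected = True
```

After running this, log contains the entry appended by the payload, while the literal-only parser has refused it. That is the distinction between parsing data and parsing instructions that the answer argues for.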
It is better to direct such questions to your crystal ball ;-)
Google is very reluctant to make statements about such things.
There were discussions in the past and they considered it and they might reconsider it eventually.
Currently the only option is to launch new isolates, and even this is still a work in progress and still has limitations that make the feature hard to use (no access to the browser API for client isolates, for example).
I'm not sure this question can really be answered, as it has probably not been decided.
Based on what's written on that page, I think it's very unlikely (especially given the other rules, like one script tag and a single main entry point).
But as with everything, things can change!

Why do some large websites use .html extension? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm just curious... why do some large sites (high traffic, a lot of data) use the .html extension on their pages even though it's clear that they are interpreted by PHP on the server side?
For example metrolyrics.com/top100.html
It's pretty clear that it uses PHP on the back-end, but the page still has a .html suffix.
Is it better for SEO? Or am I wrong about the back-end, and these pages really are static HTML files as their extension says?
Every opinion is welcome. Thanks! :)
Metrolyrics might not necessarily be using PHP for its back-end. It could be using another server-side language such as Ruby or Python.
I'd say one of the main reasons for not having PHP in a website's URL is protection. It is more difficult for people to hack a website if they don't know what language is being used on the back-end.
Secondly, websites tend to look more professional if they don't expose a .php extension, and it raises fewer questions for end users. People are more used to seeing .html at the end of a URL, so users may get confused if they see .php instead.
It was a well-known convention back in the day, when static HTML pages were highly regarded for SEO. Basically, what sites did was keep thousands of these generated HTML pages on the server, making the website look like a content monster and thus raising its PageRank and the number of Google crawls.
It's also a good way to cache pages and decrease server calls.
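To illustrate that a .html suffix says nothing about how the page is produced, here is a minimal sketch of a Python WSGI application that generates /top100.html on every request. The route, the function name app and the page content are all hypothetical; nothing here reflects how Metrolyrics is actually built.

```python
from datetime import datetime, timezone


def app(environ, start_response):
    """Serve a dynamically generated page under a static-looking URL."""
    path = environ.get("PATH_INFO", "/")
    if path == "/top100.html":
        # The body is built at request time, despite the .html suffix.
        generated = datetime.now(timezone.utc).isoformat()
        body = (
            "<html><body><h1>Top 100</h1>"
            f"<p>Generated at {generated}</p></body></html>"
        ).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    body = b"Not found"
    start_response("404 Not Found", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The app can be tried locally with wsgiref.simple_server.make_server from the standard library. Because the URL looks like a static file, a cache or reverse proxy in front of such an app could also store the generated page and serve it without calling the back-end again.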

What approach/methodology are you using for one-man software development [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
You can find thousands of questions out there about how you develop software and which methodology is the best one. But mainly these are targeting medium to large teams, with people having different roles and responsibilities.
What I'm interested in is which methodology you use for your one-man shows. What steps do you follow, and what documents do you create, to make clear what you want to develop and to document it well enough to share with the community?
In particular, I'm interested in the following questions:
_Are you using a structured approach even if you're developing on your own, or none at all?
_What phases are you using?
_Which documents are you writing before and after coding?
And if you have “your” standardized approach, can you share templates which you are using?
Thanks in advance,
cheers
Gerry
Personally, I think the development process for solo work is a matter of making decisions. In my case I wouldn't recommend setting up a massive development process; instead I pick the elements that prevent problems I have run into before. My approach for small applications (in order):
1. Always write down what you are going to make and what you are not going to make (define a scope). Think of functional requirements (Functional Design).
2. (OO only) Make a class diagram that shows the relations between classes. (Technical Design. Sequence diagrams, while useful, take a massive amount of time to make.)
3. Write your program according to what you have just written down (or part of it).
4. Refactor and redesign your application (once every X hours; write this interval down).
5. Repeat steps 3 and 4 until the result is what you wrote in the Functional Design.
6. Walk through every corner of your application to find every single path and write these down in a test document. Identify possible problems in the paths and test them.
When it comes to big applications, however (or assignments for someone else), I prefer using the "medium to large teams" approach, which almost guarantees that you will avoid most problems.
