Is it possible to extract the text, images, and LaTeX equations from a particular website so that I can build my own customized PDF without the content being blurry? Only the images would have a fixed resolution.
I realize that there are a couple of ways of generating a PDF indirectly. Rendering a PDF of the Wolfram MathWorld page on the Riemann Zeta Function, for instance, is possible by printing and saving it as a PDF via Chrome, but as you zoom in, the LaTeX equations and text naturally become blurry. I tried downloading Wolfram's CDF Player, but it contains only the syntax for Mathematica's libraries, not the helpful explanations that Wolfram MathWorld provides. What would be required to extract the text, images, and LaTeX equations into a PDF file without them being blurry?
Unless you have access to the LaTeX source that was used to produce the images in a way that isn't apparent from your question, the answer is "you cannot." Casual inspection of the website linked implies that the LaTeX that is used to produce the equations is not readily available (it's probably on a backend system somewhere that produces the images that get put on the web server).
To a browser, it's just an image. The method by which the image was produced is irrelevant to how it appears on the web page, and to how it would appear in a PDF (i.e. more pixelated than desired).
Note that if a website uses a vector-graphics format like SVG instead of a pixel-based format like PNG or JPEG, those graphics will translate to PDF cleanly and will zoom nicely. That's a choice made by the webmaster of the site in question.
Inspecting the source reveals that the GIFs depicting each equation have alt text that approximates the LaTeX that would render them (it might be Mathematica code; I'm not familiar with Wolfram's tools). Extracting a reasonable source wouldn't be impossible, but it would be hard. The site is laid out with tables, so parsing the HTML could be tricky even with something like Beautiful Soup. Some equations are broken up into several GIFs, which makes parsing trickier still. You'd also have to convert whatever the alt text is into LaTeX.
All in all, if you don't need to do a zillion pages, I'd suggest copy-pasting the text, saving the images, grabbing the alt text of each image, and doing the conversion yourself.
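If you do go the scraping route, pulling the alt text out of the images is the easy part. Here is a minimal stdlib-only sketch (Beautiful Soup would do the same with less code); the HTML snippet is a stand-in for a real fetched page, and real MathWorld pages will be messier:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collects the alt text of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)

# Stand-in for a downloaded page; real pages wrap each
# equation GIF in table markup, so expect messier input.
html = '''
<table><tr>
  <td><img src="eq1.gif" alt="zeta(s)=sum_(n=1)^(infty)1/(n^s)"></td>
  <td><img src="eq2.gif" alt="s = sigma + it"></td>
</tr></table>
'''

parser = AltTextCollector()
parser.feed(html)
print(parser.alts)
```

From there you would still have to do the hard part by hand: converting each recovered alt string into proper LaTeX.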
For the given example, you could download the Mathematica notebook for that page. Maybe it is possible to parse something from that.
I'm in a situation where I need to distinguish whether or not a PDF has a certain layout after scanning the document for text. Is this possible with PDF.js and if so where would I find this information?
Unfortunately, PDFs consist of very low-level drawing commands, and as such it is very difficult to extract any formatting information from them, regardless of the tool or library used.
Is there a command-line tool to remove all spot-color channels from a vector input image (the type can be AI or EPS) and keep only the CMYK or RGB color channels?
What I've been able to come up with so far is using Ghostscript's tiffsep device and then recombining the color-channel images into one image using ImageMagick's -combine option. The drawback of this method is that it is quite complicated, and I end up with a TIFF image instead of the original (vector) format.
'Image' has a defined meaning in PostScript, it means a bitmap, a raster. I think, from the context, that you mean something more general.
The simple answer is no, in general you can't do this, and I don't know of any tool which will.
The reason is that doing so would lose information; the marks defined in Separation or DeviceN space would be lost entirely, and it's generally regarded as a Bad Idea to discard random parts of a document.
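For concreteness, a spot ink in PostScript is selected with a Separation colour space, whose tint transform maps ink coverage into an alternate (process) space. The ink name and transform below are made up for illustration; discarding the marks painted in this space throws away everything the transform encodes:

```postscript
% Select an illustrative spot ink as a Separation space.
[ /Separation (PANTONE 485 C)         % separation (ink) name
  /DeviceCMYK                          % alternate space for proofing
  { dup 0.95 mul exch 0 3 1 roll 0 }   % tint t -> (0, 0.95t, t, 0)
] setcolorspace
0.5 setcolor                           % paint at 50% tint of the spot ink
```

Anything drawn after the `setcolor` marks only the spot plate (plus the CMYK approximation on composite devices), which is exactly the content a "remove spot channels" pass would have to either convert or discard.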
Perhaps you could explain what you are trying to achieve (i.e. why you are doing this), and it might be possible to suggest an alternative method.
If you are a competent C programmer you could produce a Ghostscript subclass device using the existing FILTER device (in gdevflt.c) as a template. That device looks at the type of each operation and either passes it on to the output device or throws it away. It would be reasonably simple to look at the current colour space and discard anything in a Separation or DeviceN space. If you then use the pdfwrite/ps2write/eps2write output device, you'd get a PDF file, PostScript program, or EPS as the output.
Whether you go down this route, continue with what you have, or find an alternative approach, there are a couple of things you need to think about. How do you plan to tackle Separation inks with process colour names, e.g. /Separation /Black? What about DeviceN spaces where some of the inks are process colours, e.g. a duotone of Black and a Pantone ink? Should these be preserved or discarded?
Your current approach will use the parts of the object which mark process plates, but not those which mark spot colours, which could give some very peculiar results.
[EDIT]
PDF, PostScript and EPS don't have 'layers' (PDF has a feature, Optional Content, which uses the term 'layers' as a description in the specification but that's all).
Applications such as Photoshop and Illustrator can have layers, but in general whatever they export to has to have those 'layers' converted into something else. That 'something else' depends on what you are saving the file as.
Part of the problem is that you are apparently trying to deal with three different kinds of input: you say Illustrator (PDF, more or less), Photoshop (a raster image) and EPS (PostScript). There is little common ground between the three; is there a reason to support all of them?
If you are content to stick with just Illustrator you might be able to do something with Optional Content. I'm not terribly familiar with modern versions of Illustrator, but wouldn't it be simpler to save two versions of the file, one with the answer layer and one without ?
Anyway, Ghostscript can honour Optional Content, so if you can save a PDF file (not PostScript or EPS) from Illustrator, it may be that the layers will persist into the PDF as Optional Content; I suspect they will, going by a quick Google. In that case you might be able to run the file through Ghostscript, telling it not to honour the Optional Content portion, and get a PDF file without it present.
Another solution (again limited to PDF) would be to open the PDF file with an editing application such as Acrobat Pro, and simply delete the bits you don't want. Deletion of that kind is relatively reliable.
It still feels like rather a long-winded way to get a PDF file with some of the content removed though. I can't help feeling that just saving two versions from the creating application would be easier.
I am programming a website on the subject of chemistry, and for obvious reasons I also have to include structural and molecular formulas on that site. I want to have as few images as possible on the site and would therefore like to know how I can compile LaTeX code on my website, so I can show everything I could do in LaTeX itself.
Thanks in advance.
As outlined in a previous comment, Chemistry.SE has enabled mhchem in MathJax to allow the rendering of simple formulae and reaction equations. The MathJax documentation actually gives some directions.
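For MathJax v2, enabling mhchem is a one-line extension entry. A minimal sketch of the page setup follows; the CDN path and config bundle are the commonly used ones, but check the MathJax documentation for the version you deploy:

```html
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: { extensions: ["mhchem.js"] }  // enables \ce{...} notation
  });
</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

<!-- then, anywhere in the page body: -->
<p>Combustion: \(\ce{CH4 + 2O2 -> CO2 + 2H2O}\)</p>
```

With this in place, simple molecular formulae and reaction equations render as crisp, scalable text rather than images.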
As far as structures of organic molecules are concerned, I usually draw them using BkChem and export them as PNG images.
If I understand you correctly, you would like to avoid the images themselves and not just the act of drawing. Therefore, the idea of generating the drawings from a linear representation (InChI, SMILES) using openbabel will probably not convince you.
As a matter of fact, it is possible to create structures in LaTeX using chemfig, and there have been requests to support this package in MathJax. However, it seems that so far, chemfig's strong dependence on TikZ has prevented this.
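To illustrate the point, even a minimal chemfig document pulls in TikZ (the structure shown is just an example):

```latex
\documentclass{article}
\usepackage{chemfig}  % loads TikZ under the hood
\begin{document}
% water, with bond angles of +-52 degrees at the oxygen
\chemfig{H-[:52]O-[:-52]H}
\end{document}
```

This compiles fine with a local LaTeX toolchain, but since MathJax has no TikZ support, the same source cannot currently be rendered in the browser.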
I am trying to create a simple tool that uses the functionality of this website, http://cat.prhlt.upv.es/mer/, which parses handwritten strokes into a math formula. I noticed that they mention it converts the input to InkML or MathML.
Now I noticed that according to this link: Tradeoff between LaTex, MathML, and XHTMLMathML in an iOS app? you can use MathJax to convert certain input to MathML.
What I need clarification on is how I can take input (say, from finger strokes) or a picture, convert it into a format I can feed to this website from an iOS device, and then read the result at the top of the page. I have done everything regarding taking a picture or drawing an equation on an iPhone, but I am confused about how to feed that to this site in order to get a result.
Is this possible, and if so how?
I think there's a misunderstanding here. http://cat.prhlt.upv.es/mer/ isn't an API for converting images into formulae; it's just a demonstration of the Seshat software tool.
If you're looking to convert hand-drawn math expressions into LaTeX or MathML (which can then be pretty-printed on your device), you want to compile Seshat and then feed your input to it, not to the website. My answer here explains how to format input for Seshat.
Mathematica 9.0.1.0, Linux.
Create a notebook cell containing only the word "Section" and apply the "Section" style to it. Then create a variable x and evaluate it. Then print the two-cell notebook to a PDF file. (We often have to pass these back and forth via email to mobile users.) The resulting PDF file is just under 1 MB. A few more modest additions, and Mma's print-to-file yields 2-3 MB files from about one page of notebook. For comparison, my 800-page dense LaTeX-generated book with R graphics consumes about 4 MB.
Can Mma be instructed to produce more compact PDF files? I know it can rasterize graphics, but this isn't really a graphics problem.
This comes from the folks at Wolfram support:
The PDF files are large because they contain embedded fonts for faithful reproduction.
One way to reduce the file size is to set certain options to False. This can be done from the Mathematica menu: go to Format -> Option Inspector, select 'Global Preferences' under 'Show option values', then navigate to Notebook Options -> Printing Options and set EmbedExternalFonts to False. Do the same for Notebook Options -> Printing Options -> EmbedStandardPostScriptFonts.
However, the PDF that is generated may not look exactly as you want, especially if you send it to someone else. If you just want to keep the PDF on your own machine, where the fonts exist anyway, this may be a good default option.
Apparently, their developers are working on the problem, too.
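If you'd rather not click through the Option Inspector each time, the same two settings can be applied programmatically. This is an untested sketch; the option names are taken from the support reply above, and setting them on $FrontEnd makes them global defaults:

```mathematica
(* global front-end defaults; option names per Wolfram support's reply *)
SetOptions[$FrontEnd,
  PrintingOptions -> {
    "EmbedExternalFonts" -> False,
    "EmbedStandardPostScriptFonts" -> False
  }]
```

The usual caveat applies: a PDF saved without embedded fonts will only render faithfully on machines where those fonts are installed.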