How can I programmatically get the image on this page?

The URL http://www.fourmilab.ch/cgi-bin/Earth shows a live map of the Earth.
If I open this URL in my browser (Firefox), the image shows up just fine, but when I try to fetch the same page with 'wget', I fail!
Here's what I tried first:
wget -p http://www.fourmilab.ch/cgi-bin/Earth
Thinking that probably all the other form fields are required too, I did a 'View Source' on the above page, noted down the various field values, and then issued the following URL:
wget --post-data "opt=-p&lat=7°27'&lon=50°49'&ns=North&ew=East&alt=150889769&img=learth.evif&date=1&imgsize=320&daynight=-d" http://www.fourmilab.ch/cgi-bin/Earth
Still no image!
Can someone please tell me what is going on here...? Are there any 'gotchas' with CGI and/or form-POST based wgets? Where (book or online resource) would such concepts be explained?

If you inspect the page's source code, there's an <img> tag whose src attribute points at the image of the Earth. For example:
<img
src="/cgi-bin/Earth?di=570C6ABB1F33F13E95631EFF088262D5E20F2A10190A5A599229"
ismap="ismap" usemap="#zoommap" width="320" height="320" border="0" alt="" />
Without the 'di' parameter you are asking for the whole web page, with references to this image, not for the image itself.
Edit: the 'di' parameter encodes which "part" of the Earth you want to receive. Anyway, try for example:
wget http://www.fourmilab.ch/cgi-bin/Earth?di=F5AEC312B69A58973CCAB756A12BCB7C47A9BE99E3DDC5F63DF746B66C122E4E4B28ADC1EFADCC43752B45ABE2585A62E6FB304ACB6354E2796D9D3CEF7A1044FA32907855BA5C8F

Use GET instead of POST. They're completely different for the CGI program in the background.
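For instance, the form fields from the question can be passed as a plain query string (a GET request) instead of --post-data; note that this still returns the HTML page, whose <img> tag then points at the generated image:
wget -O earth.html "http://www.fourmilab.ch/cgi-bin/Earth?opt=-p&imgsize=320&daynight=-d"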

Following on from Ravadre,
wget -p http://www.fourmilab.ch/cgi-bin/Earth
downloads an XHTML file which contains an <img> tag.
I edited the XHTML to remove everything but the img tag and turned it into a bash script containing another wget -p command, escaping the ? and =.
When I executed this I got a 14 kB file, which I renamed earth.jpg.
Not really programmatic, the way I did it, but I think it could be done.
But as @somedeveloper said, the di value changes over time (it depends on the time of the request).
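A rough sketch of doing that extraction non-interactively, assuming GNU grep's -o option (the di value is freshly generated on every request, so it has to be scraped each time):
img_path=$(wget -q -O - http://www.fourmilab.ch/cgi-bin/Earth | grep -o '/cgi-bin/Earth?di=[^"]*' | head -n 1)
wget -O earth.jpg "http://www.fourmilab.ch$img_path"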

Here's what I finally did. I'm not fully happy with this solution, as I was (and still am) hoping for a better way... one that gets the image with the first wget itself, giving me the same experience I get when browsing via Firefox.
#!/bin/bash
# Fetch the page once, extract the generated /cgi-bin/Earth?di=... image URL from its <img> tag,
# then fetch the image itself with a second wget.
tmpf=/tmp/delme.jpeg
base=http://www.fourmilab.ch
liveurl=$(wget -O - $base/cgi-bin/Earth?opt=-p 2>/dev/null | perl -0777 -nle 'if(m#<img \s+ src \s* = \s* "(/cgi-bin/Earth\?di= .*? )" #gsix) { print "$1\n" }' )
# $liveurl already begins with a slash, so concatenate without adding another one
wget -O $tmpf "$base$liveurl" &>/dev/null

What you are downloading is the whole HTML page and not the image. To download the image and other page elements too, you'll need to use the --page-requisites (and possibly --convert-links) parameter(s). Unfortunately, because robots.txt disallows access to URLs under /cgi-bin/, wget will not download the image, which is located under /cgi-bin/. You can, however, tell wget to ignore robots.txt with the -e robots=off option.
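A sketch combining those options; with robots.txt ignored, wget should also pull in the /cgi-bin/ image referenced by the page and rewrite the link to point at the local copy:
wget --page-requisites --convert-links -e robots=off http://www.fourmilab.ch/cgi-bin/Earth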

Related

Is there a script that can extract particular links from a txt file and write them to another txt file?

I'm looking for a script (or if there isn't, I guess I'll have to write my own).
I wanted to ask if anyone here knows a script that can take a txt file with n links (let's say 200). I need to extract only links that have particular characters in them; let's say I only need links that contain "/r/learnprogramming". I need the script to get those links and write them to another txt file.
Edit: Here is what helped me: grep -i "/r/learnprogramming" 1.txt >2.txt
You can use Ajax to read the .txt file using jQuery:
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script>
jQuery(function($) {
  console.log("start");
  $.get("https://ayulayol.imfast.io/ajaxads/ajaxads.txt", function(wholeTextFile) {
    var lines = wholeTextFile.split(/\n/),
        randomIndex = Math.floor(Math.random() * lines.length),
        randomLine = lines[randomIndex];
    console.log(randomIndex, randomLine);
    $("#ajax").html(randomLine.replace(/#/g, "<br>"));
  });
});
</script>
<div id="ajax"></div>
If you are using Linux or macOS, you can use cat and grep to output the links:
cat in.txt | grep /r/programming > out.txt
Solution provided by OP:
grep -i "/r/learnprogramming" 1.txt >2.txt
Since you did not provide the exact format of the document, I assume the links are separated by newline characters. In that case the code is straightforward in Python or awk, since you can iterate over file.readlines() and print only the lines that match your pattern (either with a simple "pattern in line" check or with a regex if the pattern is more complex). To store the links in a new file, simply redirect stdout to a new file like this:
python script.py > links.txt
The same approach works even if the links are separated by some arbitrary symbol s: first read the file into a single string, then split it on s. I hope this helps.
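The awk variant of that idea is a one-liner, assuming one link per line in 1.txt:
awk '/\/r\/learnprogramming/' 1.txt > 2.txt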

How to fix internal link issues when publishing a Docusaurus site on GitLab pages

In my Docusaurus project my internal links work in my local environment, but when I push to GitLab they no longer work. Instead of replacing the original doc name in the URL with the new one, it appends it to the end of the URL ('https://username.io/test-site/docs/overview/add-a-category.html'). I looked over my config file, but I do not understand why this is happening.
I tried updating the id in the front matter for the page, and making sure it matches the id in the sidebars.json file. I have also added customDocsPath and set it to 'docs/' in the config file, though that is supposed to be the default.
---
id: "process-designer-overview"
title: "Process Designer Overview"
sidebar_label: "Overview"
---
# Process Designer
The Process Designer is a collaborative business process modeling and
design workspace for the business processes, scenarios, roles and tasks
that make up governed data processes.
Use the Process Designer to:
- [Add a Category](add-a-category.html)
- [Add a Process or Scenario](Add%20a%20Process%20or%20Scenario.html)
- [Edit a Process or Scenario](Edit%20a%20Process%20or%20Scenario.html)
I updated the 'Add a Category' link in parentheses to an .md extension, but that broke the link locally and it still didn't work on GitLab. I would expect that when a user clicks on the link it would replace the doc title in the URL with the new doc title ('https://username.gitlab.io/docs/add-a-category.html'), but instead it just tacks it on to the end ('https://username.gitlab.io/docs/process-designer-overview/add-a-category.html'), and so the link is broken as that is not where the doc is located.
There were several issues with my links. First, I converted these files from html to markdown using Pandoc and did not add front matter - relying instead on the file name to connect my files to the sidebars. This was fine, except almost all of the file names had spaces in them, which you can see in my code example above. This was causing real issues, so I found a Bash script to replace all of the spaces in my file names with underscores, but now all of my links were broken. I updated all of the links in my files with a search and replace in my code editor, replacing "%20" with "_". I also needed to replace the ".html" extension with ".md" or my project would no longer work locally. Again, I did this with a search and replace in my code editor.
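For anyone who would rather do that search and replace from the command line than in a code editor, a rough sketch, assuming GNU sed and that the Markdown files live under docs/ (adjust the path to your project):
find docs -name '*.md' -type f -exec sed -i -e 's/%20/_/g' -e 's/\.html)/.md)/g' {} +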
Finally, I ended up adding the front matter because otherwise my sidebar titles were all covered in underscores. Since I was working with 90 files, I didn't want to do this manually. I looked for a while and found a great gist by thebearJew and adjusted it so that it would take the file name and add it as the id, and the first heading and add it as the title and sidebar_label, since as it happens that works for our project. Here is the Bash script I found online to convert the spaces in my file names to underscores if interested:
find $1 -name "* *.md" -type f -print0 | \
while read -d $'\0' f; do mv -v "$f" "${f// /_}"; done
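Usage, assuming you save those two lines as a script (the name rename_spaces.sh is just an example) and point it at the directory holding your docs:
bash rename_spaces.sh docs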
Here is the script I ended up with if anyone else has a similar setup and doesn't want to update a huge amount of files with front matter:
# Given a file path as an argument
# 1. get the file name
# 2. prepend template string to the top of the source file
# 3. resave original source file
# command: find . -name "*.md" -print0 | xargs -0 -I file ./prepend.sh file
filepath="$1"
file_name=$("basename" -a "$filepath")
# Getting the file name (title)
md='.md'
title=${file_name%$md}
heading=$(grep -r "^# \b" ~/Documents/docs/$title.md)
heading1=${heading#*\#}
# Prepend front-matter to files
TEMPLATE="---
id: $title
title: $heading1
sidebar_label: $heading1
---
"
echo "$TEMPLATE" | cat - "$filepath" > temp && mv temp "$filepath"

Spanish characters and some special characters not allowed in ImageMagick

Spanish characters and some special characters are not allowed in ImageMagick when creating images from a PDF.
It shows the error message "no decode delegate for this image".
Can I ignore that text while creating images from the PDF?
Any solution for this?
Thanks!
Try running the following:
convert -list configure | grep -i "DELEGATES"
What does that mean?
**convert -list configure** - displays your imagemagick configuration
**grep -i "DELEGATES"** - takes the results from above and does a case insensitive search for what you're looking for, in this case, you're looking for the section marked DELEGATES
This should return all of the packages that you need in order to run your command.
Using your favorite package manager (apt, yum, brew, etc.), install those packages (including their development packages), then try again.
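For example, a missing decode delegate for PDF input usually means the Ghostscript delegate is absent; on a Debian/Ubuntu system the fix might look like this (package names vary by distribution, and input.pdf is a placeholder):
sudo apt-get install ghostscript
convert -density 150 input.pdf page-%03d.png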
Here is a thread you can check out, where someone had a very similar issue:
http://www.imagemagick.org/discourse-server/viewtopic.php?t=22488
Hope this helps!

Why do the search results change after PDF optimization in Ghostscript?

When searching for the word find in the PDF file in this link, before Ghostscript optimization the results give pages 4, 7 and 13, but after the optimization it gives only pages 4 and 13, ignoring page 7. The script I'm using for the optimization:
D:/gswin64c -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dAutoRotatePages=/None -dMaxInlineImageSize=0 -dPDFSETTINGS=/ebook -dColorImageResolution=96 -dDetectDuplicateImages=true -dColorImageDownsampleThreshold=1.1 -dDOPDFMARKS -dUseTrimBox -sOutputFile="D:/temp/search_text.pdf" -dNOPAUSE -dNOGC -dBATCH -dNumRenderingThreads=8 -c 50000000 setvmthreshold -f "D:/temp/iphone_user_guide.pdf"
I've tried adding several font-related parameters to the script, such as -dEmbedAllFonts=true, and pointing to the fonts path. I've also tried to play with the parameters by eliminating some, but with no result.
What could be the cause of this problem?
Ghostscript doesn't do 'optimization'. See my answer here:
GhostScript issues with a CropBox
for some details on what it does do.
Without seeing your file I cannot tell you for certain what the difference is, but most likely the missing text has been drawn as images instead of text for some reason.
By the way, a lot of the options you are sending have absolutely no effect (e.g. NumRenderingThreads, for a device which doesn't do rendering). You should NOT select -dNOGC, that's a really bad idea, and -dDOPDFMARKS is already set for the pdfwrite device.
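Based on that, a trimmed-down version of the command from the question might look like this (same input and output paths; -dNumRenderingThreads, -dNOGC and -dDOPDFMARKS removed):
D:/gswin64c -sDEVICE=pdfwrite -dMaxSubsetPct=100 -dAutoRotatePages=/None -dMaxInlineImageSize=0 -dPDFSETTINGS=/ebook -dColorImageResolution=96 -dDetectDuplicateImages=true -dColorImageDownsampleThreshold=1.1 -dUseTrimBox -sOutputFile="D:/temp/search_text.pdf" -dNOPAUSE -dBATCH -c 50000000 setvmthreshold -f "D:/temp/iphone_user_guide.pdf"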

Find and replace a URL with grep/sed/awk?

Fairly regularly, I need to replace a local URL with a live one in large WordPress databases. I can do it in TextMate, but it often takes 10+ minutes to complete.
Basically, I have a 10MB+ .sql file and I want to:
Find: http://localhost:8888/mywebsite
and
Replace with: http://mywebsite.com
After that, I'll save the file and do a mysql import to the local/live servers. I do this at least 3-4 times a week, and waiting for TextMate has been a pain. Is there an easier/faster way to do this with grep/sed/awk?
Thanks!
Terry
sed 's/http:\/\/localhost:8888\/mywebsite/http:\/\/mywebsite.com/g' FileToReadFrom > FileToWriteTo
This runs the substitute command (s/) globally (/g), replacing the first URL with the second. Forward slashes are escaped with backslashes.
kent$ echo "foobar||http://localhost:8888/mywebsite||fooooobaaaaaaar"|sed 's#http://localhost:8888/mywebsite#http://mywebsite.com#g'
foobar||http://mywebsite.com||fooooobaaaaaaar
If you want to do the replacement in place (changing your original file):
sed -i 's#http://.....#http://mysite#g' input.sql
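Note that the BSD sed that ships with macOS needs an explicit (possibly empty) backup suffix for in-place edits, so the equivalent there would be:
sed -i '' 's#http://localhost:8888/mywebsite#http://mywebsite.com#g' input.sql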
You don't need to replace the http://
sed "s/localhost:8888/www.my-awesome-page.com/g" input.sql > output.sql
