remove del tag with BeautifulSoup - parsing

I have a silly little problem with BeautifulSoup and Python 3. This is my HTML:
<span id="gaixm--1521602--15686128--ADHP.GEO_LONG" Visibility="None">
<del class="cellChanged NO_REVISION_MARK AmdtDeletedAIRAC" title="Date d'entrée en vigueur: 17 SEP 2015. " id="geaip_4b6c6e3f-9841-400c-9359-6ae9b334448d">001°49'57"E</del>
<ins class="cellChanged AmdtInsertedAIRAC" title="Date d'entrée en vigueur: 17 SEP 2015. " id="geaip_311221e8-2de7-4fce-b261-e0e9fb988238">001°49'52"E</ins>
</span>
I want to remove all the del tags. But when I do:
soup = BeautifulSoup(html, 'lxml')
soup.del.decompose()
tbody_tag = soup.table.tbody
print(tbody_tag)
I get an error (and it's expected, since del is a Python keyword):
File "algo.py", line 52
soup.del.decompose()
^
SyntaxError: invalid syntax.
So... how can I do this?
Thanks for your help!

You can use the find_all function and then decompose each result:
for d in soup.find_all('del'):
    d.decompose()
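For completeness, here is a minimal end-to-end sketch against a trimmed version of the question's HTML (using the built-in html.parser so no lxml install is needed; 'lxml' works the same way if installed):

```python
from bs4 import BeautifulSoup

html = """<span id="gaixm--1521602--15686128--ADHP.GEO_LONG" Visibility="None">
<del class="cellChanged">001°49'57"E</del>
<ins class="cellChanged">001°49'52"E</ins>
</span>"""

soup = BeautifulSoup(html, 'html.parser')

# decompose() removes each <del> tag and its contents from the tree
for d in soup.find_all('del'):
    d.decompose()

print(soup.find('del'))      # None: no <del> tags remain
print(soup.ins.get_text())   # the <ins> value is untouched
```

This avoids the attribute-style lookup soup.del entirely, so the reserved-keyword problem never arises.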


BeautifulSoup - All href links don't appear to be extracting

I am trying to extract all href links that are within class ['address']. Each time I run the code, I only get the first 5 and that's it, even though I know there should be 9.
Web-Page:
https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch
I have read through the threads below and altered my code countless times, including switching through all the parsers (html.parser, html5lib, lxml, xml, lxml-xml), but nothing seems to be working. Any idea what's causing it to stop after the 5th iteration? I am still fairly new to Python, so I apologize if this is a rookie mistake I'm overlooking. Any help would be appreciated, even the sarcastic answers :)
Beautiful Soup findAll doesn't find them all
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
BeautifulSoup fails to parse long view state
Beautifulsoup lost nodes
Missing parts on Beautiful Soup results
Python 64 bit not storing as long of string as 32 bit python
I used pretty similar code on the following web-pages below and did not experience any issues scraping the hrefs:
https://www.walgreens.com/storelistings/storesbystate.jsp?requestType=locator
https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=AK
My code below:
import requests
from bs4 import BeautifulSoup

local_rg = requests.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = local_rg.content
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')

for link in local_rg_content_src.find_all('div'):
    local_class = str(link.get('class'))
    if str("['address']") in str(local_class):
        local_a = link.find_all('a')
        for a_link in local_a:
            local_href = str(a_link.get('href'))
            print(local_href)
My results (first 5):
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
But should be 9:
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
Try using selenium instead of requests to get the source code of the page. Here is how you do it:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = driver.page_source
driver.close()
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
The rest of the code is the same. Here is the full code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = driver.page_source
driver.close()
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
for link in local_rg_content_src.find_all('div'):
    local_class = str(link.get('class'))
    if str("['address']") in str(local_class):
        local_a = link.find_all('a')
        for a_link in local_a:
            local_href = str(a_link.get('href'))
            print(local_href)
Output:
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
The page uses Ajax to load the store information from an external URL. You can use the requests and json modules to load it:
import re
import json
import requests
url = 'https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch'
ajax_url = 'https://www.walgreens.com/locator/v1/stores/search?requestor=search'
m = re.search(r'"lat":([\d.-]+),"lng":([\d.-]+)', requests.get(url).text)
params = {
    'lat': m.group(1),
    'lng': m.group(2)
}
data = requests.post(ajax_url, json=params).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for result in data['results']:
    print(result['store']['address']['street'])
    print('https://www.walgreens.com' + result['storeSeoUrl'])
    print('-' * 80)
Prints:
1470 W NORTHERN LIGHTS BLVD
https://www.walgreens.com/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
--------------------------------------------------------------------------------
725 E NORTHERN LIGHTS BLVD
https://www.walgreens.com/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
--------------------------------------------------------------------------------
4353 LAKE OTIS PARKWAY
https://www.walgreens.com/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
--------------------------------------------------------------------------------
7600 DEBARR RD
https://www.walgreens.com/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
--------------------------------------------------------------------------------
2197 W DIMOND BLVD
https://www.walgreens.com/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
--------------------------------------------------------------------------------
2550 E 88TH AVE
https://www.walgreens.com/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
--------------------------------------------------------------------------------
12405 BRANDON ST
https://www.walgreens.com/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
--------------------------------------------------------------------------------
12051 OLD GLENN HWY
https://www.walgreens.com/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
--------------------------------------------------------------------------------
1721 E PARKS HWY
https://www.walgreens.com/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
--------------------------------------------------------------------------------

Thymeleaf does not allow "lt" as query string parameter name

I cannot use "lt" as a query string parameter name in Thymeleaf. How can I achieve that?
This is my example code:
<a th:href="#{/payment/otp-resend(lt=${landingToken.sessionId})}" class="sifretekrar" th:text="#{lp.resendOtp}"></a>
And it gives the following error:
org.thymeleaf.exceptions.TemplateProcessingException: Could not parse as expression: "#{/payment/otp-resend(lt=${landingToken.sessionId})}" (template: "otp-entry-page" - line 70, col 12)
at org.thymeleaf.standard.expression.StandardExpressionParser.parseExpression(StandardExpressionParser.java:131)
at org.thymeleaf.standard.expression.StandardExpressionParser.parseExpression(StandardExpressionParser.java:62)
at org.thymeleaf.standard.expression.StandardExpressionParser.parseExpression(StandardExpressionParser.java:44)
at org.thymeleaf.engine.EngineEventUtils.parseAttributeExpression(EngineEventUtils.java:220)
at org.thymeleaf.engine.EngineEventUtils.computeAttributeExpression(EngineEventUtils.java:207)
at org.thymeleaf.standard.processor.AbstractStandardExpressionAttributeTagProcessor.doProcess(AbstractStandardExpressionAttributeTagProcessor.java:125)
at org.thymeleaf.processor.element.AbstractAttributeTagProcessor.doProcess(AbstractAttributeTagProcessor.java:74)
at org.thymeleaf.processor.element.AbstractElementTagProcessor.process(AbstractElementTagProcessor.java:95)
at org.thymeleaf.util.ProcessorConfigurationUtils$ElementTagProcessorWrapper.process(ProcessorConfigurationUtils.java:633)
at org.thymeleaf.engine.ProcessorTemplateHandler.handleOpenElement(ProcessorTemplateHandler.java:1314)
at org.thymeleaf.engine.OpenElementTag.beHandled(OpenElementTag.java:205)
at org.thymeleaf.engine.TemplateModel.process(TemplateModel.java:136)
at org.thymeleaf.engine.TemplateManager.parseAndProcess(TemplateManager.java:661)
at org.thymeleaf.TemplateEngine.process(TemplateEngine.java:1098)
at org.thymeleaf.TemplateEngine.process(TemplateEngine.java:1072)
at org.thymeleaf.spring5.view.ThymeleafView.renderFragment(ThymeleafView.java:362)
at org.thymeleaf.spring5.view.ThymeleafView.render(ThymeleafView.java:189)
at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1370)
at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1116)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:998)
at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:890)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:875)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
Best regards.
EDIT
The accepted answer is correct. However, IntelliJ IDEA flags it as an error; the screenshot is attached below. The following two lines both work, even though the IDE displays an error message for each of them:
<a th:href="#{/payment/mps-otp-resend} + '?lt=' + ${landingToken.sessionId}" class="sifretekrar" th:text="#{lp.resendOtp}"></a>
<a th:href="#{/payment/mps-otp-resend('lt'=${landingToken.sessionId})}" class="sifretekrar" th:text="#{lp.resendOtp}"></a>
You can quote lt, which should allow you to use it as a parameter name (lt is reserved in Thymeleaf standard expressions as the textual form of the less-than operator, which is why the unquoted name fails to parse):
<a th:href="#{/payment/otp-resend('lt'=${landingToken.sessionId})}" class="sifretekrar" th:text="#{lp.resendOtp}"></a>

How to write a regex for web scraping

I have this text on a website I want to scrape:
event: new SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})
I want to pass this from my controller to my JavaScript file, which is already set up.
I'm having issues parsing the information so that only this is returned:
SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})
Here's what I tried to no avail:
info = text.to_s.scan(/\"(event)/).uniq
Don't you basically want to remove the "event: new " part of the input string? Maybe I misread your question - if not, this is what you could do:
input = 'event: new SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})'
input.gsub('event: new ', '')
=> 'SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})'
or a safer option
input.gsub('event: new SEvent', 'SEvent')
=> 'SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})'
You can try the following regex:
/event: new (\w+\([^)]+\))/
info will be an array containing only the matches.
Currently you're matching only the string "event, but to match the entire string you want, you can use something like:
scan(/SEvent\(\{\"event[^\)]*\)/)
That will match the opening identifier string SEvent({"event and then everything up to the next closing parenthesis, which should capture your entire desired string:
SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})
Also, I wanted to mention that the square brackets with the caret, [^\)]*, mean: match zero or more characters EXCEPT a closing parenthesis.
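The answers above use Ruby, but the patterns themselves are portable. As an illustration (Python chosen here purely for demonstration, using its re module), the capture-group variant from the earlier answer extracts exactly the desired string:

```python
import re

text = 'event: new SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})'

# capture the constructor call: word characters, then a parenthesized
# group containing anything except a closing parenthesis
m = re.search(r'event: new (\w+\([^)]+\))', text)
print(m.group(1))  # SEvent({"event_id":"Id","date":"Sat 27 Aug 2016"})
```

The same pattern works unchanged in Ruby's scan, since neither \w, the capture group, nor the negated character class is engine-specific here.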

Parse Facebook Page Using BeautifulSoup

I'm searching for a name in an HTML page from Facebook.
I load the file html.txt like this:
html = open('html.txt','r').read()
soup = BeautifulSoup(html)
If I search for the name with the string find method it seems to be OK, but if I try searching with BeautifulSoup I can't find anything:
>>>html.find("Joseph Tan")
98939
>>>html[98700:99000]
'<div class="fwn fcg"><span class="fcg"><span class="fwb"><a class="profileLink" href="https://www.facebook.com/ASD.391" data-ft="{"tn":"l"}" data-hovercard="/ajax/hovercard/user.php?id=123456">Alex Tan</a></span> condivided the photo <a class="profileLink" '
>>> soup.findAll('div',{'class':'fwn fcg'})
[]
>>> soup.findAll('span',{'class':'fwb'})
[]
>>> soup.findAll('a',{'class':'profileLink'})
[]
>>>
Can someone help me? Thanks a lot.
EDIT: RE-CREATED HTML PAGE
html page
It works as below:
print soup.find_all('div', class_=['fwn','fcg'])
OUTPUT:
[<div class="uiHeaderActions rfloat _ohf fsm fwn fcg"><a class="_1c1m" href="#" role="button">Segna tutti come già letti</a> · <a accesskey="m" ajaxify="/ajax/messaging/composer.php" href="/messages/new/" id="u_0_8" rel="dialog" role="button">Invia un nuovo messaggio</a></div>, <div class="uiHeaderActions fsm fwn fcg">Segna come già letto · Impostazioni</div>, <div class="fsm fwn fcg"><a ajaxify="/settings/language/language/?uri=https%3A%2F%2Fwww.facebook.com%2Fshares%2Fview%3Fid%3D10152555113196961&source=TOP_LOCALES_DIALOG" href="#" rel="dialog" role="button" title="Usa Facebook in un'altra lingua.">Italiano</a></div>]
According to this link, this is how to search for classes and other HTML elements using BeautifulSoup. Please check.
There were two problems:
1. What you wrote does not match the link I provided above. Maybe you are not using an updated version of BeautifulSoup.
2. There are two classes, 'fwn' and 'fcg', so you have to give their names in a list; this is how I got the output.
The same is applicable for 'span' and 'a', as below:
print soup.find_all('span', class_='jewelCount')
print soup.find_all('a', class_='_awj')
Your 'span' with class 'fwb' and your 'a' with class 'profileLink' were not found because they are not present in the HTML.
You can check by printing all the spans and a's:
write print soup.find_all('a') and print soup.find_all('span') to check on your own.
Hope this will help, if not, write again! :)
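One caveat worth noting (an addition, not part of the original answer): passing a list to class_ matches tags that carry any of the listed classes, not necessarily both. When a tag must have both classes, a CSS selector is stricter. A small sketch:

```python
from bs4 import BeautifulSoup

html = '<div class="fwn fcg">both</div><div class="fwn">one</div>'
soup = BeautifulSoup(html, 'html.parser')

# a list means "any of these classes": both divs match
print(len(soup.find_all('div', class_=['fwn', 'fcg'])))  # 2

# a CSS selector requires both classes on the same tag
print(len(soup.select('div.fwn.fcg')))  # 1
```

So if the list-based search returns extra elements, soup.select('div.fwn.fcg') is the safer way to require the full class combination.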

What options are there for localisation in WebWorks?

I'm building a WebWorks version of an Android app that's localised into 39 languages.
At the moment all the localisations are in XML files of key-value pairs, one file per language.
Each language file has about 400 lines (roughly 40k per file).
Users can change the language used in app.
What options are there in WebWorks to solve this kind of situation?
I'd be more than happy to convert the resource files into any other kind of format to make working with it a better experience on the platform.
You could store each language set in a JavaScript file that you include/load as needed. (I've converted the XML data to a "Map" since it is just key/value pairs)
e.g. (just ignore my translations... I just Googled this, I'm by no means fluent in Spanish)
//Spanish File "lang_spanish.js"
var translations = {
    "lose_a_turn": "pierde un turno",
    "do_not_pass_go": "huele como un camello",
    "take_a_card": "tener una tarjeta de",
    "you_win_the_game": "sin motocicletas en la biblioteca",
    "you_can_not_move": "desbordamiento de la pila puede ser un lugar divertido"
};
In your <head> you can then have a generic script tag, that you just change the source of as needed.
e.g.
<script id="langFile" src="js/lang_english.js"></script>
When you want a different language, just remove this script from the DOM and add your new one. e.g.
function switchLang(langName){
    var headTag = document.getElementsByTagName('head')[0];
    var scriptFile = document.getElementById('langFile');
    headTag.removeChild(scriptFile);

    var newScript = document.createElement('script');
    newScript.id = 'langFile';
    newScript.src = 'js/lang_' + langName + '.js';
    headTag.appendChild(newScript);
}

//call with:
switchLang('spanish');
The alternative would be to load all 39 languages by default... but that seems like overkill considering most will only ever want 1 or 2.
