For a pedagogical project I am trying to count the number of lesson elements on the following page: https://www.edx.org/course/subject/computer-science
I am using Poltergeist as the web driver to access the page, but since the page uses a JavaScript function to add more entries after page load as the user scrolls down, I need to replicate that scrolling with Poltergeist.
I have tried to scroll down using:
evaluate_script("page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };")
or
execute_script("page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };")
It does not seem to work.
Is there any way for Poltergeist to get to the bottom of the page so that the JavaScript loads all the elements in the (in)finite loop?
Once they are loaded, they are easy to count.
execute_script is called to execute JavaScript in the "browser". I'm not sure what the 'page' object you're trying to set values on is, but you probably want something more like
execute_script('window.scroll(0,1000);')
As a more complete example
@session.visit 'https://www.edx.org/course/subject/computer-science'
count = @session.all(:css, '.discovery-card', minimum: 1).length
puts "there are #{count} discovery cards"
@session.execute_script('window.scroll(0,1000);')
new_count = @session.all(:css, '.discovery-card', minimum: count+1, wait: 30).length
puts "there are now #{new_count} discovery cards"
I want to use Playwright to automatically click and expand all the child nodes, but my code only expands some of the nodes. How should I fix the code? Thank you.
Current: (screenshot of the category tree with only some of the nodes expanded)
What I want: (screenshot of the category tree with all nodes expanded)
import json
import time
from playwright.sync_api import sync_playwright

p = sync_playwright().start()
browser = p.chromium.launch(headless=False, slow_mo=2000)
context = browser.new_context()
page = context.new_page()
try:
    # page.add_init_script(js);
    page.goto("https://keepa.com/#!categorytree", timeout=10000)
    # Click text=Log in / register now to subscribe
    page.click("text=Log in / register now to subscribe")
    # Click input[name="username"]
    page.click("input[name=\"username\"]")
    # Fill input[name="username"]
    page.fill("input[name=\"username\"]", "tylrr123@outlook.com")
    # Click input[name="password"]
    page.click("input[name=\"password\"]")
    # Fill input[name="password"]
    page.fill("input[name=\"password\"]", "adnCgL#f$krY9Q9")
    # Click input:has-text("Log in")
    page.click("input:has-text(\"Log in\")")
    page.wait_for_timeout(2000)
    page.goto("https://keepa.com/#!categorytree", timeout=10000)
    while True:
        # loc.first.click()
        loc = page.locator(".ag-icon.ag-icon-expanded")
        print(loc.count())
        loc.first.click(timeout=5000)
        page.wait_for_timeout(2000)
except Exception as err:
    print(err)
finally:
    print("finished")
I write scripts like this from time to time but, to be honest, this was one of the harder ones. It has been a real challenge.
I think it is finished.
# Import needed libs
import time
from playwright.sync_api import sync_playwright
import datetime

# We save the time when the script starts
init = datetime.datetime.now()
print(f"{datetime.datetime.now()} - Script starts")

# We initiate the playwright page
p = sync_playwright().start()
browser = p.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()

# Navigate to Keepa and log in
page.goto("https://keepa.com/#!categorytree")
page.click("text=Log in / register now to subscribe")
page.fill("#username", "tylrr123@outlook.com")
page.fill("#password", "adnCgL#f$krY9Q9")
page.click("#submitLogin", delay=200)

# We wait for the selector of the user profile, which means that we are already logged in
page.wait_for_selector("#panelUsername")

# Navigate to the categorytree url
page.goto("https://keepa.com/#!categorytree")
time.sleep(1)

# This function tries to click on the arrow that expands a subtree
def try_click():
    # We save the number of elements that are closed trees
    count = page.locator("//span[@class='ag-group-contracted']").count()
    # We iterate over the number of elements we had
    for i in range(0, count):
        # If the last element is visible, we go inside the "if" statement. Why the last element
        # instead of the first one? Because, for some reason, the last element is usually the
        # first visible one... Keepa things, don't ask
        if page.locator(f"(//span[@class='ag-group-contracted'])[{count-i}]").is_visible():
            # The element was visible, so we try to click on it (expand it). I wrapped the click
            # inside a try/except block because sometimes Playwright says the click failed when
            # it actually succeeded. I don't know why
            try:
                # Clicking the element
                page.click(f"(//span[@class='ag-group-contracted'])[{count-i}]", timeout=200)
                print(f"Clicking Correct {count-i}. Wheel up")
                # If the element is clicked, we scroll up and return True
                page.mouse.wheel(0, -500)
                return True
            except:
                # As I said, sometimes the click "fails" but the element is actually clicked, so we
                # also return True. The only way of returning False is if no element is visible
                print(f"Error Clicking {count-i} but probably was clicked")
                return True

# This function basically checks that there are still closed trees
def there_is_still_closed_trees():
    try:
        page.wait_for_selector(selector="//span[@class='ag-group-contracted']", state='attached')
        return True
    except:
        print("No more trees closed")
        return False

# When we navigate to the categorytree page a pop-up appears, and you have to move the mouse to
# make it disappear, so I move the mouse and keep it over the list, because later we will need to
# scroll up and down over the list
page.mouse.move(400, 1000)
page.mouse.move(400, 400)

# Var to count how many times we scrolled down
wheel_down_times = 0

# We will run this loop until there are no more closed trees
while there_is_still_closed_trees():
    # If we could not click (the closed trees were not visible on the page) we scroll down to
    # find them
    if not try_click():
        # We scroll down and add one to the scroll-down counter
        print("Wheel down")
        page.mouse.wheel(0, 400)
        wheel_down_times = wheel_down_times + 1
        print(f"Wheel down times = {wheel_down_times}")
        # Sometimes, if we do a lot of scrolls, the page can crash, so we sleep the script for
        # 10 secs every 100 scrolls
        if wheel_down_times % 100 == 0:
            print("Sleeping 10 secs in order to avoid page crashes")
            time.sleep(10)
        # This "if" checks that the last element of the whole tree is visible and that we did more
        # than 5 scroll-downs. That means we are at the end of the list and missed some closed
        # trees, so we scroll up to the top of the list and will scroll down again trying to find
        # the pending closed trees
        if page.locator("//span[text()='Walkthroughs & Tutorials']").is_visible() and wheel_down_times > 5:
            page.mouse.wheel(0, -5000000)
    else:
        print(f"Wheel down times from {wheel_down_times} to 0")
        wheel_down_times = 0

# Script finishes and shows a summary of times
end = datetime.datetime.now()
print(f"{datetime.datetime.now()} - Script finished")
print(f"Script started at: {init}")
print(f"Script ended at: {end}")
print("There should not be any more closed trees")

# This sleeps the script in case you want to look at the screen. You can remove it and the page
# will be closed
time.sleep(10000)
The script takes almost 3 hours. I don't know how Keepa has so many categories. Awesome...
I am trying to optimise LCP for this page. I read an article on LCP optimisation where I also found a script that can help determine which part of the LCP most of the time is spent on. Script:
const LCP_SUB_PARTS = [
  'Time to first byte',
  'Resource load delay',
  'Resource load time',
  'Element render delay',
];

new PerformanceObserver((list) => {
  const lcpEntry = list.getEntries().at(-1);
  const navEntry = performance.getEntriesByType('navigation')[0];
  const lcpResEntry = performance
    .getEntriesByType('resource')
    .filter((e) => e.name === lcpEntry.url)[0];

  // Ignore LCP entries that aren't images to reduce DevTools noise.
  // Comment this line out if you want to include text entries.
  if (!lcpEntry.url) return;

  // Compute the start and end times of each LCP sub-part.
  // WARNING! If your LCP resource is loaded cross-origin, make sure to add
  // the `Timing-Allow-Origin` (TAO) header to get the most accurate results.
  const ttfb = navEntry.responseStart;
  const lcpRequestStart = Math.max(
    ttfb,
    // Prefer `requestStart` (if TAO is set), otherwise use `startTime`.
    lcpResEntry ? lcpResEntry.requestStart || lcpResEntry.startTime : 0
  );
  const lcpResponseEnd = Math.max(
    lcpRequestStart,
    lcpResEntry ? lcpResEntry.responseEnd : 0
  );
  const lcpRenderTime = Math.max(
    lcpResponseEnd,
    // Prefer `renderTime` (if TAO is set), otherwise use `loadTime`.
    lcpEntry ? lcpEntry.renderTime || lcpEntry.loadTime : 0
  );

  // Clear previous measures before making new ones.
  // Note: due to a bug this does not work in Chrome DevTools.
  // LCP_SUB_PARTS.forEach(performance.clearMeasures);

  // Create measures for each LCP sub-part for easier
  // visualization in the Chrome DevTools Performance panel.
  const lcpSubPartMeasures = [
    performance.measure(LCP_SUB_PARTS[0], {
      start: 0,
      end: ttfb,
    }),
    performance.measure(LCP_SUB_PARTS[1], {
      start: ttfb,
      end: lcpRequestStart,
    }),
    performance.measure(LCP_SUB_PARTS[2], {
      start: lcpRequestStart,
      end: lcpResponseEnd,
    }),
    performance.measure(LCP_SUB_PARTS[3], {
      start: lcpResponseEnd,
      end: lcpRenderTime,
    }),
  ];

  // Log helpful debug information to the console.
  console.log('LCP value: ', lcpRenderTime);
  console.log('LCP element: ', lcpEntry.element);
  console.table(
    lcpSubPartMeasures.map((measure) => ({
      'LCP sub-part': measure.name,
      'Time (ms)': measure.duration,
      '% of LCP': `${
        Math.round((1000 * measure.duration) / lcpRenderTime) / 10
      }%`,
    }))
  );
}).observe({type: 'largest-contentful-paint', buffered: true});
For me, this was the result at the start, with 4x CPU slowdown and a Fast 3G connection:
After that, since render delay was the area I should focus on, I moved some of the scripts to the footer and also made the "deferred" scripts "async". This is the result:
We can see there is a clear improvement in LCP after the change but, when I test with Lighthouse, the result is different.
Before:
After:
I am in a dilemma now about what step to take next. Please suggest!
I ran a trace of the URL you linked in your question, and the first thing I noticed is that your LCP resource finishes loading pretty early in the page, but it isn't able to render until a file called mirage2.min.js finishes loading.
This explains why your "Element render delay" portion of LCP is so long; moving your scripts to the bottom of the page or setting defer on them is not going to solve that problem. The solution is to make it so your LCP image can render without having to wait for that JavaScript file to finish loading.
Another thing I noticed is this mirage2.min.js file is loaded from ajax.cloudflare.com, which made me think it's a "feature" offered by Cloudflare and not something you set up yourself.
Based on what I see here, I'm assuming that's true:
https://support.cloudflare.com/hc/en-us/articles/219178057
So my recommendation for you is to turn off this feature, because it's clearly not helping your LCP, as you can see in this trace:
There's one more thing you said that I think is worth clarifying:
After that, since render delay was the area I should focus on, I moved some of the scripts to the footer and also made the "deferred" scripts "async". This is the result:
When I look at your "result" screenshot, I see that the "element render delay" portion is still > 50%, so while you were correct when you said that "render delay was the area I should focus on", the fact that it remained high after you made your changes (e.g. moving the scripts and using defer/async) was an indication that those changes did not fix the problem.
In this case, I believe that if you turn off the "Mirage" feature in your Cloudflare dashboard, you should see a big improvement.
Oh, one more thing, I noticed that you're using importance="high" on your image. This is old syntax that does not work anymore. You should replace that with fetchpriority="high" instead. See this post for details: https://web.dev/priority-hints/
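For example, the change would look roughly like this (illustrative markup only; hero.jpg is a placeholder, not taken from your page):

<!-- old hint syntax, no longer supported -->
<img src="hero.jpg" importance="high" alt="Hero image">

<!-- current syntax -->
<img src="hero.jpg" fetchpriority="high" alt="Hero image">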
I'm scraping [this page][1] to look for details of schools. The details are contained within the CSS selector .box .column, which sits inside a div with the class .schools that is loaded dynamically and takes some time to appear.
I've done this with the watir gem and had no problems. Here's the code as reference.
browser = Watir::Browser.new
browser.goto('https://educationdestinationmalaysia.com/schools/pre-university')
js_doc = browser.element(css: '.schools').wait_until(&:present?)
schools_list = Nokogiri::HTML(js_doc.inner_html)
school_cards = schools_list.css('.box .columns .column:nth-child(2)')
I'm now trying to achieve the same with the kimurai gem but I'm not really familiar with Capybara.
What I've Tried
Changing the default max wait time
def parse(response, url:, data: {})
  Capybara.default_max_wait_time = 20
  puts browser.has_css?('div.schools')
end
using_wait_time
browser.using_wait_time(20) do
  puts browser.has_css?('.schools')
end
Passing in a wait argument to has_css?
browser.has_css?('.schools', wait: 20)
Thanks for reading!
[1]: https://educationdestinationmalaysia.com/schools/pre-university
Your Watir code
js_doc = browser.element(css: '.schools').wait_until(&:present?)
returns the element, but in your Capybara code you're calling predicate methods (has_css?, has_xpath?, has_selector?, etc) that just return true or false. Those predicate methods will only wait if Capybara.predicates_wait is true. Is there a specific reason you're using the predicates though? Instead you can just find the element you're interested in, which will wait up to Capybara.default_max_wait_time or you can specify a custom wait option. The "equivalent" to your Watir example of
js_doc = browser.element(css: '.schools').wait_until(&:present?)
schools_list = Nokogiri::HTML(js_doc.inner_html)
school_cards = schools_list.css('.box .columns .column:nth-child(2)')
assuming you have Capybara.default_max_wait_time set to a number high enough for your app and testing setup, would be
school_cards = browser.find('.schools').all('.box .columns .column:nth-child(2)')
If you do need to extend the wait for one of the finds you could do
school_cards = browser.find('.schools', wait: 10).all('.box .columns .column:nth-child(2)')
to wait up to 10 seconds for the .schools element to appear. This could also just be collapsed into
school_cards = browser.all('.schools .box .columns .column:nth-child(2)')
which will also wait (up to Capybara.default_max_wait_time) for at least one matching element to exist before returning, although depending on your exact HTML
school_cards = browser.all('.schools .column:nth-child(2)')
may be just as good and less fragile.
Note: you do have to be using a Kimurai engine that supports JS - https://github.com/vifreefly/kimuraframework#available-engines - otherwise you won't be able to interact with dynamic websites
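For reference, a minimal Kimurai spider skeleton that selects a JS-capable engine might look roughly like the following (a sketch only; the spider name is made up and the selector is taken from the examples above):

require 'kimurai'

class SchoolsSpider < Kimurai::Base
  @name = "schools_spider"
  @engine = :selenium_chrome  # a JS-capable engine; :mechanize would not run the page's JavaScript
  @start_urls = ["https://educationdestinationmalaysia.com/schools/pre-university"]

  def parse(response, url:, data: {})
    # browser is a Capybara session, so all/find wait as described above
    school_cards = browser.all('.schools .box .columns .column:nth-child(2)')
    puts "found #{school_cards.size} school cards"
  end
end

SchoolsSpider.crawl!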
We retrieve information from Elasticsearch 2.7.0 and we allow the user to go through the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal to:
[10000] but was [10020]. See the scroll api for a more efficient way
to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The thing is we use pagination in our requests so I don't see why we get this error:
@Autowired
private ElasticsearchOperations elasticsearchTemplate;
...
elasticsearchTemplate.queryForPage(buildQuery(query, pageable), Document.class);
...
private NativeSearchQuery buildQuery() {
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.should(QueryBuilders.boolQuery().must(QueryBuilders.termQuery(term, query.toUpperCase())));
    NativeSearchQueryBuilder nativeSearchQueryBuilder = new NativeSearchQueryBuilder().withIndices(DOC_INDICE_NAME)
            .withTypes(indexType)
            .withQuery(boolQueryBuilder)
            .withPageable(pageable);
    return nativeSearchQueryBuilder.build();
}
I don't understand the error because we retrieve pageable.size (20 elements) every time... Do you have any idea why we get this?
Unfortunately, even when paging results, Spring Data Elasticsearch asks Elasticsearch for a much larger result window. So you have two options: the first is to change the value of this parameter (index.max_result_window), for example as shown below.
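A sketch of raising the limit with an index settings update (the index name here is a placeholder for whatever DOC_INDICE_NAME points to, and 50000 is an arbitrary value):

PUT /my_index/_settings
{
  "index.max_result_window": 50000
}

Keep in mind this only moves the limit; deep pagination still gets more expensive the deeper you go, which is why the scroll API exists.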
The second is to use the scan/scroll API. However, as far as I understand, in that case the pagination is done manually, since it is intended for sequential reading of large result sets (like scrolling with your mouse).
A sample:
List<Pessoa> allItens = new ArrayList<>();
String scrollId = elasticsearchTemplate.scan(build, 1000, false, Pessoa.class);
Page<Pessoa> page = elasticsearchTemplate.scroll(scrollId, 5000L, Pessoa.class);
while (true) {
    if (!page.hasContent()) {
        break;
    }
    allItens.addAll(page.getContent());
    page = elasticsearchTemplate.scroll(scrollId, 5000L, Pessoa.class);
}
This code shows you how to read ALL the data matching your query; you then have to extract the requested page yourself while scrolling, for example as sketched below.
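A rough sketch of picking the requested page out of the collected results, reusing the pageable from the question (this is illustrative, not part of the original answer):

int from = pageable.getPageNumber() * pageable.getPageSize();
int to = Math.min(from + pageable.getPageSize(), allItens.size());
List<Pessoa> requestedPage = from < allItens.size()
        ? allItens.subList(from, to)
        : java.util.Collections.emptyList();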
class ScraperController < ApplicationController
  def show
    mechanize = Mechanize.new
    website = mechanize.get('https://website.com/')
    $max = 2
    $counter = 0
    $link_to_click = 2
    @names = []

    while $counter <= $max do
      @names.push(website.css('.memName').text.strip)
      website.link_with(:text => '2').text.strip.click
      $link_to_click += 1
      $counter += 1
    end
  end
end
I am trying to scrape 20 items off of each page and then click on the link at the bottom (1, 2, 3, 4, 5, etc.). However, I get the error as seen in the title, which tells me that I cannot click the string. So it recognizes that the button '2' exists but tells me it cannot click it. Ideally, once this is sorted out, I wanted to use the $link_to_click variable as a way to replace the '2' so that it will increment each time, but it always comes back as nil. I have also changed it to .to_s with the same result.
If I remove the click altogether, it will scrape the same page 3 times instead of moving on to the next page. I have also removed the text.strip part before the .click and it does the same thing. I have tried many variations but have had no luck.
I would really appreciate any advice you could offer.
I ended up reviewing the articles I was referencing to solve this and came to this conclusion.
I changed the link-clicking line to website = website.link_with(:text => $link_to_click.to_s).click (because link_with only worked with a string), and it printed out the first page, the second, and each one thereafter. Roughly, the working loop looks like the sketch below.
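A rough sketch of the corrected loop (illustrative, not the exact original code); the key change is that link_with(:text => ...) needs a String, and .click is called on the link object itself, which returns the next Mechanize::Page:

require 'mechanize'

mechanize = Mechanize.new
website = mechanize.get('https://website.com/')

names = []
link_to_click = 2

3.times do
  names.push(website.css('.memName').text.strip)
  next_link = website.link_with(:text => link_to_click.to_s)
  break if next_link.nil?     # stop when there is no further page link
  website = next_link.click   # clicking the link returns the next page
  link_to_click += 1
end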
These are the articles that I was referencing to learn how to do this.
http://docs.seattlerb.org/mechanize/GUIDE_rdoc.html
and
https://readysteadycode.com/howto-scrape-websites-with-ruby-and-mechanize