BeautifulSoup indexes

BeautifulSoup indexes - parsing

I am trying to parse the links for the genres and subgenres on the IMDb page http://www.imdb.com/genre/?ref_=nv_ch_gr_3
So far I have been able to parse the main genre tags into something usable
with the following code:
table = soup.find_all("table", {"class": "genre-table"})
for item in table:
    for x in range(100):
        try:
            print(item.contents[x].find_all("h3"))
            print(len(item.contents[x].find_all("h3")))
        except:
            pass
My output is 11 lists with two tags in each, like this:
[<h3>Action <span class="normal">»</span></h3>, <h3>Adventure <span class="normal">»</span></h3>]
2
I understand this happens because the containers have a class of "even" or "odd", with two h3 tags in each container, but I didn't tell it to differentiate between even and odd. Actually, I think I am answering my own question here: am I right in thinking that because the tags sit in a container with class "odd" or "even", bs4 returns them as one list per container, and it's up to me to separate them?
Second, more important question:
How would I get each h3 link and title into the DataFrame that I have set up as
df = pd.DataFrame(columns= ['Genre', 'Sub-Genre', 'Link'])
I've tried
for y in range(2):
    df.append({'Genre':'item.contents[x].find_all("h3"))[y].text)}, ignore_index = true)
This is nested inside the for loop over x, of course (not on its own),
but it doesn't seem to work.
Any thoughts? Karma your way!

First off, there's no need to find all tables, since only the first one is necessary:
table = soup.find("table", {'class': 'genre-table'})
Since every other item is redundant (starting with the first), you can iterate over the table like this:
for item in list(table)[1::2]:
After this we can get the 'h3' tags in every row and loop through both of them:
    row = item.find_all("h3")
    for col in row:
Because the text of every 'h3' element returns the genre in the format 'Somegenre \xc2\xbb', I removed the span element before getting the text:
        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()
After this, just add the elements to the DataFrame by index:
        df.loc[len(df)] = [genre, None, link]
Full code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
df = pd.DataFrame(columns=['Genre', 'Sub-Genre', 'Link'])
req = requests.get('http://www.imdb.com/genre/?ref_=nv_ch_gr_3')
soup = BeautifulSoup(req.content, 'html.parser')
# Only the first genre table is needed
table = soup.find("table", {'class': 'genre-table'})
# Skip every other child of the table, starting with the first
for item in list(table)[1::2]:
    row = item.find_all("h3")
    for col in row:
        # Drop the span so the text is just the genre name
        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()
        # Append a new row at the next free index
        df.loc[len(df)] = [genre, None, link]
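The `[1::2]` slice in the answer can be illustrated without any parsing. This is a minimal sketch that assumes, as the answer states, that the table's children alternate between a redundant entry (here a newline string) and a real row; the row names are made-up placeholders, not actual bs4 objects.

```python
# Stand-in for list(table): redundant whitespace strings alternate with rows.
# The row names are hypothetical placeholders, not real bs4 Tag objects.
children = ["\n", "row-odd-1", "\n", "row-even-1", "\n", "row-odd-2", "\n"]

# Start at index 1 and take every second entry, which skips the whitespace.
rows = children[1::2]
print(rows)
```

If the page markup had no whitespace between rows, the same slice would drop real rows, so it is worth checking `list(table)` before relying on it.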

Related

Why is my CNN returning tokens instead of readable labels?

I am currently studying machine learning and have created a CNN using fastai that labels the category of clothing items. I built this model using the Fashion-MNIST data set.
Everything functions fine and it looks like it's predicting correctly, but I don't know how to make it return the labels and categories rather than the weird tokenized text it is returning. Where am I going wrong?
Here is some code.
This is where I create the DataFrame that maps the category to the image path.
from fastcore.all import *
ds = dataFrame.filter(['masterCategory', 'imagePath'], axis=1)
ds
       masterCategory  imagePath
0      Apparel         ../input/fashion-product-images-small/images/1...
1      Apparel         ../input/fashion-product-images-small/images/3...
2      Accessories     ../input/fashion-product-images-small/images/5...
3      Apparel         ../input/fashion-product-images-small/images/2...
4      Apparel         ../input/fashion-product-images-small/images/5...
...    ...             ...
44419  Footwear        ../input/fashion-product-images-small/images/1...
44420  Footwear        ../input/fashion-product-images-small/images/6...
44421  Apparel         ../input/fashion-product-images-small/images/1...
44422  Personal Care   ../input/fashion-product-images-small/images/4...
44423  Accessories     ../input/fashion-product-images-small/images/5...
44424 rows × 2 columns
Then I create a datablock
def getImages(d): return d['imagePath']
def getLabel(d): return d['masterCategory']
from fastai.vision.all import *
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    get_x=getImages,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=getLabel,
    item_tfms=[Resize(192, method='squish')]
)
Then I create the dataloaders, but when I show a batch I get these weird labels instead of the master categories.
dsets = dblock.dataloaders(ds, bs=32)
dsets.show_batch(max_n=20)
thank you.

I found the issue. The block I needed is not MultiCategoryBlock; it is CategoryBlock. I thought that since there were multiple categories to pick from, MultiCategoryBlock was what I needed, but MultiCategoryBlock is used to label one image with several categories at once, not to pick one category out of many.
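One plausible reading of the "tokenized" labels, sketched in plain Python rather than with fastai: a multi-label setup expects each target to be a collection of tags, so a single string such as 'Apparel' may end up being walked character by character. Both helper functions below are hypothetical and only an analogy, not fastai's actual implementation.

```python
def single_label_vocab(labels):
    """Each label is one category: the vocab is the set of whole strings."""
    return sorted(set(labels))


def multi_label_vocab(labels):
    """Each label is treated as a collection of tags.

    Passing a plain string here means its characters become the tags.
    """
    vocab = set()
    for tags in labels:
        vocab.update(tags)  # iterating a str yields single characters
    return sorted(vocab)


labels = ["Apparel", "Footwear", "Apparel"]
print(single_label_vocab(labels))
print(multi_label_vocab(labels))
```

The first call keeps the category names intact, while the second produces a vocabulary of individual letters, which is the kind of output that can look like tokens.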

Dart, compare two lists and return element from first list that does not exist in second list

I have two lists:
List first = [{'name':'ABC','serialNumber':'ABC-124-353'},
  {'name':'XYZ','serialNumber':'XYZ-123-567'},
  {'name':'GRE', 'serialNumber': 'GRE-290-128'}];
List second = [{'name':'PQR','serialNumber':'PQR-D123-SII23'},
  {'name':'GAR','serialNumber':'GAR-G43-432'},
  {'name':'MNOP','serialNumber':'XYZ-123-567'}];
Is there an easy way to compare the first list with the second list by serialNumber,
so that the elements of the first list that don't exist in the second list are returned as the result?
So in this case
[{'name':'ABC','serialNumber':'ABC-124-353'},{'name':'GRE', 'serialNumber': 'GRE-290-128'}]
from the first list is the desired output, because ABC-124-353 and GRE-290-128 don't exist in the second list.

Another solution would be to use where on your first List to check if the serialNumber is contained in the second list:
final secondSerials = second.map((item) => item['serialNumber']).toSet();
print(first.where((item) => !secondSerials.contains(item['serialNumber'])).toList());

I'd make a set of the serial numbers of the second list, so that you can do efficient contains checks.
So:
var secondListSerials = {for (var entry in secondList) entry["serialNumber"]};
var firstListOnly = [
  for (var entry in firstList)
    if (!secondListSerials.contains(entry["serialNumber"])) entry
];
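The same set-based filter can be sketched in Python for comparison (the answers above are Dart; this is only an illustration using the question's sample data). Building the set first keeps each membership check cheap.

```python
first = [
    {"name": "ABC", "serialNumber": "ABC-124-353"},
    {"name": "XYZ", "serialNumber": "XYZ-123-567"},
    {"name": "GRE", "serialNumber": "GRE-290-128"},
]
second = [
    {"name": "PQR", "serialNumber": "PQR-D123-SII23"},
    {"name": "GAR", "serialNumber": "GAR-G43-432"},
    {"name": "MNOP", "serialNumber": "XYZ-123-567"},
]

# Collect the serial numbers of the second list once.
second_serials = {item["serialNumber"] for item in second}

# Keep only entries of the first list whose serial number is not in that set.
first_only = [item for item in first if item["serialNumber"] not in second_serials]
print(first_only)
```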

Google Docs invoice template with dynamic item rows from Google Sheets

I really need your help with this.
I have created an invoice template in Google Docs that is filled with data from Google Sheets.
The problem is:
In the template (Google Docs), I only put a fixed number of item lines (e.g. 3 lines).
When the data changes, for example when the number of item lines changes, how can the extra lines be carried over to Google Docs automatically if there are more than 3 item lines?
Many thanks for your help.
Below is my script to get data from Google Sheets into the Google Docs template.
function Invoice() {
  let copyFile = DriveApp.getFileById('id URL').makeCopy(),
      copyID = copyFile.getId(),
      copyDoc = DocumentApp.openById(copyID),
      copyBody = copyDoc.getBody()
  let activeSheet = SpreadsheetApp.getActiveSheet(),
      numOfCol = activeSheet.getLastColumn(),
      activeRowIndex = activeSheet.getActiveRange().getRowIndex(),
      activeRow = activeSheet.getRange(activeRowIndex, 1, 1, numOfCol).getValues(),
      headerRow = activeSheet.getRange(1, 1, 1, numOfCol).getValues(),
      columnIndex = 0
  // Replace each %Header% placeholder with the value from the active row
  for (; columnIndex < headerRow[0].length; columnIndex++){
    copyBody.replaceText('%' + headerRow[0][columnIndex] + '%', activeRow[0][columnIndex])
  }
  copyDoc.saveAndClose()
}
Here are screenshots of the files:
[Screenshot: data in Google Sheets with the additional item (Item 4)]
[Screenshot: Google Docs template with a fixed 3 rows for 3 item lines]
When I have 4 items, I must manually amend the Google Docs template. Is there any way to do this automatically?

@Duc I don't think it's possible to pass the new header as a placeholder in the Google Doc; it sounds like an endless loop.
Unless you pass it as List_ITEM, but I am pretty sure you will lose formatting.

How to scrape a page for certain strings from an array of substrings with Nokogiri

I want to scrape a restaurant page for certain titles of dishes.
I created an array holding keywords:
myarray = ["Rice", "Soup", "Chicken", "Vegetables"]
Whenever one of those keywords is found in a webpage, my scraper is supposed to give me the entire dish-title. I made this work with the following code:
html_doc = Nokogiri::HTML.parse(browser.html)
word = html_doc.at(':contains("Rice"):not(:has(:contains("Rice")))').text.strip
puts word
For example this returns: "Dish 41 - Vegetables with Chicken and Rice"
The problem is that the above code stops after the first dish is found. It does not loop through all dish-titles containing the word rice.
Secondly, I do not know how to let the code check for an entire array of substrings.

Use css. This finds all the elements that match the given CSS selector and returns them as a collection:
words = html_doc.css(':contains("Rice"):not(:has(:contains("Rice")))').map(&:text)

I solved the second part of my question myself with this:
word = html_doc.css(":contains('#{keyword}'):not(:has(:contains('#{keyword}')))").map(&:text)
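The keyword loop that this self-answer implies can be sketched in Python, with plain strings standing in for the text of the matched nodes. Only the first title comes from the question; the other two are invented sample data, and `matching_titles` is a hypothetical helper.

```python
keywords = ["Rice", "Soup", "Chicken", "Vegetables"]

# Stand-in for the text of each dish-title node on the page.
titles = [
    "Dish 41 - Vegetables with Chicken and Rice",
    "Dish 12 - Tomato Soup",   # invented sample
    "Dish 7 - Beef Noodles",   # invented sample
]


def matching_titles(titles, keywords):
    """Return every title containing at least one keyword, in original order."""
    return [title for title in titles if any(word in title for word in keywords)]


print(matching_titles(titles, keywords))
```

Checking all keywords per title, rather than running one query per keyword, also avoids listing a dish twice when it matches several keywords.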

How can I iterate over list items in Pandoc's lua-filter function?

Pandoc's lua filter makes it really easy to iterate over a document and munge the document as you go. My problem is I can't figure out how to isolate list item elements. I can find lists and the block level things inside each list item, but I can't figure out a way to iterate over list items.
For example let's say I had the following Markdown document:
1. One string

   Two string

2. Three string

   Four string
Let's say I want to make the first line of each list item bold. I can easily change how the paragraphs are handled inside OrderedLists, say using this filter and pandoc --lua-filter=myfilter.lua --to=markdown input.md
local i
OrderedList = function (element)
  i = 0
  return pandoc.walk_block(element, {
    Para = function (element)
      i = i + 1
      if i == 1 then return pandoc.Para { pandoc.Strong(element.c) }
      else return element end
    end
  })
end
This will indeed change the first paragraph element to bold, but it only changes the first paragraph of the first list item because it's iterating across all paragraphs in all list items in the list, not on each list item, then on each paragraph.
1. **One string**

   Two string

2. Three string

   Four string
If I separate the two list items into two separate lists again the first paragraph of the first item is caught, but I want to catch the first paragraph of every list item! I can't find anything in the documentation about iterating over list items. How is one supposed to do that?

The pandoc Lua filter docs have recently been updated with more info on the properties of each type. E.g., for OrderedList elements, the docs should say (it currently says items instead of content, which is a bug):
OrderedList
  An ordered list.
  content: list items (List of Blocks)
  listAttributes: list parameters (ListAttributes)
  start: alias for listAttributes.start (integer)
  style: alias for listAttributes.style (string)
  delimiter: alias for listAttributes.delimiter (string)
  tag, t: the literal OrderedList (string)
So the easiest way is to iterate over the content field and change items therein:
OrderedList = function (element)
  for i, item in ipairs(element.content) do
    local first = item[1]
    if first and first.t == 'Para' then
      element.content[i][1] = pandoc.Para{pandoc.Strong(first.content)}
    end
  end
  return element
end
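The structure this answer relies on, a list whose items are each a list of blocks, can be modeled in plain Python. This is only an analogy, with dictionaries standing in for pandoc elements; `bold_first_para` is a hypothetical helper and not part of pandoc.

```python
def bold_first_para(items):
    """Mark the first block of each list item as bold if it is a paragraph."""
    for item in items:
        if item and item[0]["t"] == "Para":
            item[0] = {"t": "Para", "bold": True, "text": item[0]["text"]}
    return items


# Two list items, each holding two paragraph blocks (mirrors the question's input).
ordered_list = [
    [{"t": "Para", "text": "One string"}, {"t": "Para", "text": "Two string"}],
    [{"t": "Para", "text": "Three string"}, {"t": "Para", "text": "Four string"}],
]

result = bold_first_para(ordered_list)
for item in result:
    print([block.get("bold", False) for block in item])
```

Looping over items and touching only index 0 of each is what distinguishes this from walking every paragraph in the list with a single shared counter.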
