I'm attempting to parse data only from the item and Skill Cap columns in the HTML table here: http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html
When parsing, I run into alignment issues where my script pulls values from other columns.
import scrapy

class parser(scrapy.Spider):
    name = "recipe_table"
    start_urls = ['http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html']

    def parse(self, response):
        for row in response.xpath('//*[@class="datatable sortable"]//tr'):
            data = row.xpath('td//text()').extract()
            if not data:  # skip empty row
                continue
            yield {
                'name': data[0],
                'cap': data[1],
                # 'misc': data[2]
            }
When it reaches the 3rd row, data from an unintended column is parsed, and I'm not sure what's going on with the selection.
Results of scrapy runspider cap.py -t json:
2019-05-09 19:41:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html> (referer: None)
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Banquet Set', 'cap': u'0'}
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Banquet Table', 'cap': u'0'}
2019-05-09 19:41:28 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ffxi.allakhazam.com/dyn/guilds/Alchemy.html>
{'item_name': u'Cermet Kilij', 'cap': u'Cermet Kilij +1'}
What about explicitly setting the source columns with XPath:
for row in response.xpath('//*[@class="datatable sortable"]//tr'):
    yield {
        'name': row.xpath('./td[1]/text()').extract_first(),
        'cap': row.xpath('./td[3]/text()').extract_first(),
        # 'misc': etc.
    }
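If some cells wrap their text in links (so ./td[n]/text() comes back empty), taking the normalized string value of the whole cell is a more forgiving variant; the td positions below are an assumption about the table layout, and tr[td] simply skips header rows:

for row in response.xpath('//table[@class="datatable sortable"]//tr[td]'):
    yield {
        'name': row.xpath('normalize-space(./td[1])').extract_first(),
        'cap': row.xpath('normalize-space(./td[3])').extract_first(),
    }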
So I have this log and I was trying to parse it using Benthos grok. What I need to do is return 5 elements as JSON:
• Timestamp
• Connection direction (inbound/outbound)
• Source IP
• Destination IP
• Source Port
from this log:
<134>1 2023-01-21T17:18:05Z CHKPGWMGMT CheckPoint 16575 - [action:"Accept"; flags:"411908"; ifdir:"outbound"; ifname:"eth0"; logid:"0"; loguid:"{0x80c5f24,0x273f572f,0x1a6c6aae,0x5f835b6e}"; origin:"10.21.10.2"; originsicname:"cn=cp_mgmt,o=CHKPGWMGMT..f6b99b"; sequencenum:"4"; time:"1674314285"; version:"5"; __policy_id_tag:"product=VPN-1 & FireWall-1[db_tag={F7CAC520-C428-484E-8004-06A1FAC151A3};mgmt=CHKPGWMGMT;date=1667399823;policy_name=Standard]"; dst:"10.21.10.2"; inzone:"Local"; layer_name:"Network"; layer_uuid:"8a994dd3-993e-4c0c-92a1-a8630b153f4c"; match_id:"1"; parent_rule:"0"; rule_action:"Accept"; rule_uid:"102f52bf-da21-49cd-b2e2-6affe347215d"; outzone:"Local"; product:"VPN-1 & FireWall-1"; proto:"6"; s_port:"46540"; service:"1433"; service_id:"https"; src:"10.21.9.1"]
input:
  type: file
  file:
    paths: [./intput.txt]
    codec: lines
pipeline:
  processors:
    - grok:
        expressions:
          - '%{NGFWLOGFILE}'
        pattern_definitions:
          NGFWLOGFILE: '%{NOTSPACE:interfaceid} %{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:Letters} %{NOTSPACE:Mhm} %{NOTSPACE:Skaicius} %{NOTSPACE:AA} %{NOTSPACE:Action}'
#    - mapping: |
#        root.timestamp = this.timestamp
#        root.Action = this.Action
output:
  stdout: {}
#output:
#  label: ""
#  file:
#    path: "Output.txt"
#    codec: lines
So I tried using grok to parse the log into JSON and a mapping to filter out the parts I want.
The part where I got stuck is pattern_definitions: how do I extract data from the bracketed key/value list that already carries names in the log file, or should I use some better approach for this task?
Grok translates to a regular expression under the covers, so I don't think it has any notion of lists and such. Try this:
input:
  generate:
    count: 1
    interval: 0s
    mapping: |
      root = """<134>1 2023-01-21T17:18:05Z CHKPGWMGMT CheckPoint 16575 - [action:"Accept"; flags:"411908"; ifdir:"outbound"; ifname:"eth0"; logid:"0"; loguid:"{0x80c5f24,0x273f572f,0x1a6c6aae,0x5f835b6e}"; origin:"10.21.10.2"; originsicname:"cn=cp_mgmt,o=CHKPGWMGMT..f6b99b"; sequencenum:"4"; time:"1674314285"; version:"5"; __policy_id_tag:"product=VPN-1 & FireWall-1[db_tag={F7CAC520-C428-484E-8004-06A1FAC151A3};mgmt=CHKPGWMGMT;date=1667399823;policy_name=Standard]"; dst:"10.21.10.2"; inzone:"Local"; layer_name:"Network"; layer_uuid:"8a994dd3-993e-4c0c-92a1-a8630b153f4c"; match_id:"1"; parent_rule:"0"; rule_action:"Accept"; rule_uid:"102f52bf-da21-49cd-b2e2-6affe347215d"; outzone:"Local"; product:"VPN-1 & FireWall-1"; proto:"6"; s_port:"46540"; service:"1433"; service_id:"https"; src:"10.21.9.1"]"""
pipeline:
  processors:
    - grok:
        expressions:
          - "%{NGFWLOGFILE}"
        pattern_definitions:
          NGFWLOGFILE: |-
            %{NOTSPACE:interfaceid} %{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:Letters} %{NOTSPACE:Mhm} %{NOTSPACE:Skaicius} %{NOTSPACE:AA} \[%{GREEDYDATA}; ifdir:"%{DATA:connectionDirection}"; %{GREEDYDATA}; dst:"%{DATA:destinationIP}"; %{GREEDYDATA}; s_port:"%{DATA:sourcePort}"; %{GREEDYDATA}; src:"%{DATA:sourceIP}"\]
output:
  stdout: {}
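If only the five requested fields should end up in the output, a mapping processor appended to the processors list above can project them out; this is a minimal sketch, and the names on the right-hand side are simply the grok captures defined in the pattern:

    - mapping: |
        root.timestamp = this.timestamp
        root.direction = this.connectionDirection
        root.source_ip = this.sourceIP
        root.destination_ip = this.destinationIP
        root.source_port = this.sourcePort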
I am trying to add a div around a markdown header in a Lua filter, but the # in front of the title disappears in the output.
Header = function(el)
  if el.level == 1 then
    local content = el.content
    local pre = pandoc.RawBlock('markdown', '::: test')
    local post = pandoc.RawBlock('markdown', ':::')
    table.insert(content, 1, pre)
    table.insert(content, post)
    return content
  else
    return el
  end
end
Input:
# Linux
## Support for Linux users
Create a shell script
Expected Output:
::: test
# Linux
:::
## Support for Linux users
Create a shell script
The content field contains only the heading text; the heading itself is the el element, which is never returned. Returning it together with the raw blocks should work though:
return {
  pre, el, post
}
Or use a Div element:
function Header (el)
  if el.level == 1 then
    return pandoc.Div(el, {class = 'test'})
  end
end
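To check the result, run the filter with a markdown writer; pandoc then renders the Div as a fenced ::: block, matching the expected output above (the file names here are only examples):

pandoc input.md --lua-filter=wrap-header.lua -t markdown -o output.md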
My application manages hierarchical classifications based on lists of values (dictionaries). At some point, I need to import the parent-child relationships from an Excel sheet, and create persisted ValuesToValues objects.
Based on Ryan Bates' RailsCast 396, I created the import model in which the main loop is:
(2..spreadsheet.last_row).map do |i|
  # Read column indexes
  parent = header.index("Parent") + 1
  level  = header.index("Level") + 1
  code   = header.index("Code") + 1

  # Skip if parent is blank
  next if spreadsheet.cell(i, parent).blank?

  # Count links
  @links_counter += 1

  parent_values_list_id = values_lists[spreadsheet.cell(i, level).to_i - 1]
  child_values_list_id  = values_lists[spreadsheet.cell(i, level).to_i]
  parent_value_id = Value.find_by(values_list_id: parent_values_list_id, code: spreadsheet.cell(i, parent).to_s).id
  child_value_id  = Value.find_by(values_list_id: child_values_list_id, code: spreadsheet.cell(i, code).to_s).id
  link_code = "#{parent_values_list_id}/#{spreadsheet.cell(i, parent)} - #{child_values_list_id}/#{spreadsheet.cell(i, code)}"
  link_name = "#{spreadsheet.cell(i, parent)} #{spreadsheet.cell(i, code)}"

  link = ValuesToValues.new(playground_id:         playground_id,
                            classification_id:     @classification.id,
                            parent_values_list_id: parent_values_list_id,
                            child_values_list_id:  child_values_list_id,
                            parent_value_id:       parent_value_id,
                            child_value_id:        child_value_id,
                            code:                  link_code,
                            name:                  link_name)
end
The issue is that, when encountering a root value (one without a parent value), the loop yields a nil element, which does not pass the later validation.
How can I build the loop so that it only considers rows where the Parent cell is not empty?
I finally decided to manage my own array of imported values instead of using the array based on the filtered sheet rows.
I added the following code around the main loop:
# Create array of links
linked_values = Array.new
# start loading links for each values list
(2..spreadsheet.last_row).map do |i|
...
and
  ...
  linked_values << link
end
linked_values
The linked_values array is then returned, and it only contains valid link records.
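An alternative sketch, reusing the names from the question's loop: because next inside map leaves a nil in the returned array for every skipped row, calling compact on the result drops those entries without needing a separate accumulator:

links = (2..spreadsheet.last_row).map do |i|
  next if spreadsheet.cell(i, header.index("Parent") + 1).blank?
  # ... build the ValuesToValues record for this row, as in the question ...
  link
end.compact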
I am not sure why I am getting the error below, but I suppose it's something that I am doing wrong.
First, you can grab my dataset by downloading the file dataset.r from this link and loading it into your session with dget("dataset.r").
In my case, I would do dat = dget("dataset.r").
The code below is what I am using to load data into Neo4j.
library(RNeo4j)
graph = startGraph("http://localhost:7474/db/data/")
graph$version
# make sure that the graph is clean -- you should backup first!!!
clear(graph, input = FALSE)
## ensure the constraints
addConstraint(graph, "School", "unitid")
addConstraint(graph, "Topic", "topic_id")
## create the query
## BE CAREFUL OF WHITESPACE between KEY:VALUE pairs for parameters!!!
query = "
MERGE (s:School {unitid:{unitid},
instnm:{instnm},
obereg:{obereg},
carnegie:{carnegie},
applefeeu:{applfeeu},
enrlft:{enrlft},
applcn:{applcn},
admssn:{admssn},
admit_rate:{admit_rate},
ape:{ape},
sat25:{sat25},
sat75:{sat75} })
MERGE (t:Topic {topic_id:{topic_id},
topic:{topic} })
MERGE (s)-[:HAS_TOPIC {score:{score} }]->(t)
"
for (i in 1:nrow(dat)) {
  ## status
  cat("starting row ", i, "\n")
  ## run the query
  cypher(graph,
         query,
         unitid = dat$unitid[i],
         instnm = dat$instnm[i],
         obereg = dat$obereg[i],
         carnegie = dat$carnegie[i],
         applfeeu = dat$applfeeu[i],
         enrlft = dat$enrlt[i],
         applcn = dat$applcn[i],
         admssn = dat$admssn[i],
         admit_rate = dat$admit_rate[i],
         ape = dat$apps_per_enroll[i],
         sat25 = dat$sat25[i],
         sat75 = dat$sat75[i],
         topic_id = dat$topic_id[i],
         topic = dat$topic[i],
         score = dat$score[i])
} #endfor
I can successfully load the first 49 records of my dataframe dat, but it errors out on the 50th row.
This is the error that I receive:
starting row 50
Show Traceback
Rerun with Debug
Error: 400 Bad Request
{"message":"Node 1477 already exists with label School and property \"unitid\"=[110680]","exception":"CypherExecutionException","fullname":"org.neo4j.cypher.CypherExecutionException","stacktrace":["org.neo4j.cypher.internal.compiler.v2_1.spi.ExceptionTranslatingQueryContext.org$neo4j$cypher$internal$compiler$v2_1$spi$ExceptionTranslatingQueryContext$$translateException(ExceptionTranslatingQueryContext.scala:154)","org.neo4j.cypher.internal.compiler.v2_1.spi.ExceptionTranslatingQueryContext$ExceptionTranslatingOperations.setProperty(ExceptionTranslatingQueryContext.scala:121)","org.neo4j.cypher.internal.compiler.v2_1.spi.UpdateCountingQueryContext$CountingOps.setProperty(UpdateCountingQueryContext.scala:130)","org.neo4j.cypher.internal.compiler.v2_1.mutation.PropertySetAction.exec(PropertySetAction.scala:51)","org.neo4j.cypher.internal.compiler.v2_1.mutation.MergeNodeAction$$anonfun$exec$1.apply(MergeNodeAction.scala:80)","org.neo4j.cypher.internal.compiler.v2_1
Here is my session info:
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RNeo4j_1.2.0
loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 RJSONIO_1.2-0.2 tools_3.1.0
And it's worth noting that I am using Neo4j 2.1.3.
Thanks for any help in advance.
This is an issue with how MERGE works. By setting the score property within the MERGE clause itself here...
MERGE (s)-[:HAS_TOPIC {score:{score} }]->(t)
...MERGE tries to create the entire pattern, and thus your uniqueness constraint is violated. Instead, do this:
MERGE (s)-[r:HAS_TOPIC]->(t)
SET r.score = {score}
I was able to import all of your data after making this change.
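The same MERGE-on-the-key, SET-the-rest idea can also be applied to the node clauses, so that each MERGE matches only on the constrained property and never tries to recreate an existing School node when some other property differs between rows. A sketch using the parameter names from the query above:

MERGE (s:School {unitid:{unitid}})
SET s.instnm = {instnm}, s.obereg = {obereg}  // ...and the remaining School properties
MERGE (t:Topic {topic_id:{topic_id}})
SET t.topic = {topic}
MERGE (s)-[r:HAS_TOPIC]->(t)
SET r.score = {score}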
I am looking for a way to get the holdings list of an ETF via a web service such as Yahoo Finance. So far, YQL has not yielded the desired results.
As an example, ZUB.TO is an ETF that has holdings (here is a list of the holdings), yet querying the yahoo.finance.quotes table does not return the proper information.
Is there another table somewhere that would contain the holdings?
Perhaps downloading from Yahoo Finance is not working and/or may never work.
Instead, how about using the endpoints the ETF providers already offer for downloading Excel or CSV files of their holdings?
Save the append_df_to_excel helper (see the link in the code below) as a file to import, and then use the code below to build an Excel file covering all 11 Sector SPDRs provided by SSGA (State Street Global Advisors).
Personally I use this for doing breadth analysis.
import pandas as pd
# helper from the linked answer, saved locally as append_to_excel.py
from append_to_excel import append_df_to_excel
# https://stackoverflow.com/questions/20219254/how-to-write-to-an-existing-excel-file-without-overwriting-data-using-pandas

##############################################################################
# Author: Salil Gangal
# Posted on: 08-JUL-2018
# Forum: Stack Overflow
##############################################################################

output_file = r'C:\my_python\SPDR_Holdings.xlsx'
base_url = "http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportExcel?symbol="

data = {
    'Ticker': ['XLC', 'XLY', 'XLP', 'XLE', 'XLF', 'XLV', 'XLI', 'XLB', 'XLRE', 'XLK', 'XLU'],
    'Name': ['Communication Services', 'Consumer Discretionary', 'Consumer Staples', 'Energy', 'Financials', 'Health Care', 'Industrials', 'Materials', 'Real Estate', 'Technology', 'Utilities']
}
spdr_df = pd.DataFrame(data)
print(spdr_df)

for i, row in spdr_df.iterrows():
    url = base_url + row['Ticker']
    df_url = pd.read_excel(url)
    header = df_url.iloc[0]  # first data row holds the real column names
    holdings_df = df_url[1:]
    holdings_df.set_axis(header, axis='columns', inplace=True)
    print("\n\n", row['Ticker'], "\n")
    print(holdings_df)
    append_df_to_excel(output_file, holdings_df, sheet_name=row['Ticker'], index=False)
(Image of the Excel file generated for the SPDRs.)
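If the append_df_to_excel helper from the linked answer isn't at hand, here is a minimal sketch using pandas' own ExcelWriter instead, reusing output_file, base_url and spdr_df from the code above; it writes one sheet per ticker in a single pass:

with pd.ExcelWriter(output_file) as writer:
    for i, row in spdr_df.iterrows():
        df_url = pd.read_excel(base_url + row['Ticker'])
        # first data row holds the real column names
        holdings_df = df_url[1:].set_axis(df_url.iloc[0], axis='columns')
        holdings_df.to_excel(writer, sheet_name=row['Ticker'], index=False)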