How can I include empty strings in HTML text() extracted with XPath? - parsing

I have a page which consists of a table with two columns.
header | value
----------------
field1 | 1
field2 |
field3 | 1
field4 |
field5 | 1
When I select the values I need to get the same number as there are fields. I get the right number with:
>s = scrapy.Selector(response)
>values = s.xpath('//tr/td[@class="tdMainBottom"][2]').extract() # get the second column
>len(values)
5
But:
>s = scrapy.Selector(response)
>values = s.xpath('//tr/td[@class="tdMainBottom"][2]/text()').extract() # get the values
>len(values)
3
I can clean the first list up afterwards, but is there a one-shot way of doing this in XPath/Scrapy?

This works but is kind of ugly:
values = [v.xpath('text()').extract()
          for v in s.xpath('//tr/td[@class="tdMainBottom"][2]')]
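The per-cell idea can be sketched without Scrapy. This is only an illustration: it uses the standard library's `xml.etree.ElementTree` in place of a Scrapy selector so it runs without dependencies, and the sample markup and the helper name `second_column_values` are made up to mirror the table in the question. The point is that asking each cell for its text yields exactly one string per cell, including an empty one.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the table in the question.
SAMPLE = """
<table>
  <tr><td class="tdMainBottom">field1</td><td class="tdMainBottom">1</td></tr>
  <tr><td class="tdMainBottom">field2</td><td class="tdMainBottom"></td></tr>
  <tr><td class="tdMainBottom">field3</td><td class="tdMainBottom">1</td></tr>
  <tr><td class="tdMainBottom">field4</td><td class="tdMainBottom"></td></tr>
  <tr><td class="tdMainBottom">field5</td><td class="tdMainBottom">1</td></tr>
</table>
"""


def second_column_values(markup):
    """Return one string per row for the second matching cell ('' if empty)."""
    root = ET.fromstring(markup.strip())
    values = []
    for row in root.iter("tr"):
        cells = [td for td in row.findall("td")
                 if td.get("class") == "tdMainBottom"]
        if len(cells) >= 2:
            # itertext() yields nothing for an empty cell, so join gives ''.
            values.append("".join(cells[1].itertext()).strip())
    return values


values = second_column_values(SAMPLE)
print(len(values), values)
```

With Scrapy itself, asking each cell for `string(.)` instead of `text()` should behave similarly, since `string()` returns an empty string for an empty node, but that variant is untested here.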

Related

Copy value from a cell to another cell if it exists in another sheet's column

I have two sheets below. Links also added to each sheet for reference
Posts sheet:
id | title | tags
1 | title 1 | article, sports, football, england
2 | title 2 | news, sports, spain, france
3 | title 3 | opinion, political, france
4 | title 4 | news, political, russia
5 | title 5 | article, market, Germany
Tags sheet:
location | type | category
england | article | sports
spain | news | political
germany | opinion | market
russia | | football
france |
About each sheets:
Posts sheet consists of list of posts with title and tags associated with it.
Tags sheet consists of list of tags categorized to understandable heads.
What I am trying to do:
I need to extract the value from the tags column in Posts sheet and add the tag to individual columns based on what head its coming in tags sheet.
Desired Output:
id | title | type | category | location
1 | title 1 | article | sports, football | england
2 | title 2 | news | sports | spain, france
3 | title 3 | opinion | political | france
4 | title 4 | news | political | russia
5 | title 5 | article | market | Germany
I wrote this sample code for Google Apps Script that can help you sort the information. I added some comments in case you want to change the columns or cells it works on. Here is the code:
function Split_by_tags() {
// Get the sheets you will work with by the name of the tab
const ss_posts = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Posts Sheet");
const ss_tags = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Tags Sheet");
const ss_output = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Expected Output");
// Get the columns of "Posts Sheet" we will work with
// range_1 is the ID, range_2 is the title, range_3 is the tags
// If you change the columns in the future, you only need to update this part
let range_1 = ss_posts.getRange("A2:A").getValues().flat();
let range_2 = ss_posts.getRange("B2:B").getValues().flat();
let range_3 = ss_posts.getRange("C2:C").getValues().flat();
// filter the arrays information to only the cells with values for "Posts Sheet"
// This way, you can add new information to the tags rows and they will be added
range_1 = range_1.filter((element) => {return (element !== '')});
range_2 = range_2.filter((element) => {return (element !== '')});
range_3 = range_3.filter((element) => {return (element !== '')});
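// The values we will compare the tags with, as arrays
// Note: the column letters must match your "Tags Sheet" layout
// (as written here: A = type, B = location, C = category)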
let range_type = ss_tags.getRange("A2:A").getValues().flat();
let range_location = ss_tags.getRange("B2:B").getValues().flat();
let range_category = ss_tags.getRange("C2:C").getValues().flat();
// filter the arrays information to only the cells with values for "Tags Sheet"
// This way, you can add new information to the tags rows and they will be added
range_type = range_type.filter((element) => {return (element !== '')});
range_location = range_location.filter((element) => {return (element !== '')});
range_category = range_category.filter((element) => {return (element !== '')});
// New arrays where the information will be sorted. I added a new tag option called "Other"
// in case a tag in "Posts Sheet" has a value which is not listed in "Tags Sheet"
let type_tag = [];
let location_tag = [];
let category_tag = [];
let other_tag = [];
// Loop to copy the ID from "Posts Sheet" to "Expected Output"
for (let i=0; i< range_1.length ; i++){
ss_output.getRange(i+2,1).setValue(range_1[i]);
};
// Loop to copy the title from "Posts Sheet" to "Expected Output"
for (let j=0; j< range_2.length ; j++){
ss_output.getRange(j+2,2).setValue(range_2[j]);
};
// Function to sort the tags from "Posts Sheet" based on "Tags Sheet"
function Separate_value(value_array){
for (let k=0; k < value_array.length; k++){
if(range_type.includes(value_array[k])){
type_tag.push(value_array[k]);
}
else if(range_location.includes(value_array[k])){
location_tag.push(value_array[k]);
}
else if(range_category.includes(value_array[k])){
category_tag.push(value_array[k]);
}
else{
other_tag.push(value_array[k]);
}
};
}
// Function to empty the arrays for the next loop
function Empty_value(){
type_tag = [];
location_tag = [];
category_tag = [];
other_tag = [];
}
// Loop to add the values we sorted to "Expected Output"
for (let e=0; e < range_3.length; e++ ){
let value_array = range_3[e].split(', ');
Separate_value(value_array)
ss_output.getRange(e+2,3).setValue(type_tag.join(", "));
ss_output.getRange(e+2,4).setValue(category_tag.join(", "));
ss_output.getRange(e+2,5).setValue(location_tag.join(", "));
ss_output.getRange(e+2,6).setValue(other_tag.join(", "));
Empty_value();
};
}
You can bind the script to your spreadsheet by opening Extensions > Apps Script in your Google Sheet.
Copy and paste the sample code, and run it. The first time you run the Apps Script, it will ask you for permissions, accept those, and the information will get sorted.
You can also add a trigger to the Apps Script so it can sort the information automatically when new data is added.
Reference:
Create a bound Apps Script.
Create trigger.
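The sorting logic itself can be sketched outside of Apps Script. The following plain Python is only an illustration of the classification step, not a replacement for the script: the `TAGS` and `POSTS` data are copied from the tables in the question, while the function name `split_tags` and the case-insensitive matching (so that "Germany" matches "germany") are assumptions of this sketch.

```python
# Tag heads and their values, copied from the "Tags sheet" table.
TAGS = {
    "location": ["england", "spain", "germany", "russia", "france"],
    "type": ["article", "news", "opinion"],
    "category": ["sports", "political", "market", "football"],
}

# Rows copied from the "Posts sheet" table: (id, title, tags).
POSTS = [
    (1, "title 1", "article, sports, football, england"),
    (2, "title 2", "news, sports, spain, france"),
    (3, "title 3", "opinion, political, france"),
    (4, "title 4", "news, political, russia"),
    (5, "title 5", "article, market, Germany"),
]


def split_tags(tag_string, tags=TAGS):
    """Group a comma-separated tag string by head; unknown tags go to 'other'."""
    # Assumption: matching ignores case, the original spelling is kept.
    lookup = {name.lower(): head for head, names in tags.items() for name in names}
    result = {head: [] for head in tags}
    result["other"] = []
    for tag in tag_string.split(","):
        tag = tag.strip()
        if tag:
            result[lookup.get(tag.lower(), "other")].append(tag)
    return result


for post_id, title, tag_string in POSTS:
    grouped = split_tags(tag_string)
    print(post_id, title,
          ", ".join(grouped["type"]),
          ", ".join(grouped["category"]),
          ", ".join(grouped["location"]),
          sep=" | ")
```

Building one lookup dictionary up front avoids scanning three separate arrays for every tag, which is the main difference from the if/else chain in the script.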

query doesn't work with multiple ranges

I have 3 cells that I run a query on.
| name | val | name |
|------+-----+------|
| Test | 1 | Test |
I want to return True if the value is greater than 1. The problem occurs when I try to have the cells separated. I made a demo to show what I mean (sorry for my bad explanation):
https://docs.google.com/spreadsheets/d/1Nh_YZPtswmTxbvNktTdJtSNchDMkWJt6nVfnV8sKhP8/edit?usp=sharing
This works fine:
=if(QUERY(A2:F2;"select B where A like F";-1) > 1; True; False)
These don't work:
=if(QUERY(A2:B2,F2;"select B where A like F";-1) > 1; True; False)
=if(QUERY({A2:B2;F2};"select B where A like F";-1) > 1; True; False)
This works:
=if(QUERY({A2:B2\F2};"select Col2 where Col1 like Col3";-1) > 1; True; False)
Your first error is in the {}: to place columns side by side, use a backslash, as in {Col\Col}.
The second error is using A, B, C notation. Inside {} the data is converted into an array, so use Col1, Col2, ... to refer to columns.

How to count the occurrences of records in field2 of a table based on field1 in Progress OpenEdge 4GL?

I have a table with two fields.
I need to count the occurrences of each field2 value within each field1 value.
For example: a (field1) - x (field2) - 1 (occurrences of x for a), a - y - 2, b - z - 1, and so on for all of a, b, c, d.
I solved this problem:
DEFINE VARIABLE i AS INTEGER NO-UNDO.
FOR EACH ttPTdetails NO-LOCK BREAK BY ttPTdetails.sentdate BY stat :
i = i + 1.
IF LAST-OF(stat) THEN DO:
DISPLAY ttPTdetails.sentdate (ttPTdetails.stat ) i.
i = 0.
END.
END.
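The same grouping can be sketched in Python for comparison. This is only an illustration of what the BREAK BY loop computes, not ABL: the `records` list is made-up sample data shaped like the a/x/y/z example in the question, with each tuple standing for (field1, field2).

```python
from collections import Counter

# Hypothetical sample rows: (field1, field2).
records = [("a", "x"), ("a", "y"), ("a", "y"), ("b", "z")]

# Count how often each (field1, field2) pair occurs.
counts = Counter(records)

# Sorted output plays the role of BREAK BY field1 BY field2.
for (field1, field2), occurrences in sorted(counts.items()):
    print(field1, field2, occurrences)
```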

grails database migration plugin - how to conditionally insert rows

How would I write a changelog.groovy for the grails database migration plugin that would insert rows into a table if a row doesn't already exist for a range of ids? For example.
cool_stuff table has
id | some_other_id |
The cool_stuff table is populated with data. Given a range of cool_stuff ids, 1 - 2000, I would like to:
Iterate through the ids, querying the cool_stuff table to see if the combination of cool_stuff id and some_other_id = 2 exists
If it doesn't exist, insert a row with the cool_stuff id and some_other_id = 2
There are records in the "cool_stuff" table already.
You need a record for each combination of "cool_stuff.id" and "some_other_id == 2".
So, do you want something like the following?
table of "cool_stuff"
FROM:
id | some_other_id
----|---------------
1 | 2
2 | 1
3 | 2
4 | 1
TO:
id | some_other_id
----|---------------
1 | 2
2 | 1
3 | 2
4 | 1
2 | 2
4 | 2
Is this right?
If so, I would do it like this:
databaseChangeLog = {
changeSet(author: "koji", id: "123456789") {
grailsChange {
change {
CoolStuff.list().findAll {
it.someOtherId != 2
}.each{
// save new instance
new CoolStuff(id: it.id, someOtherId:2).save(flush:true)
}
}
}
}
}
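The "insert only if missing" step from the question can be sketched independently of Grails. This is a hedged illustration using Python's built-in `sqlite3` with an in-memory database, not a changelog: the table and column names come from the question, the starting rows are the FROM table above, and the helper name `ensure_rows` is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cool_stuff (id INTEGER, some_other_id INTEGER)")
# Starting data: the FROM table shown above.
conn.executemany("INSERT INTO cool_stuff VALUES (?, ?)",
                 [(1, 2), (2, 1), (3, 2), (4, 1)])


def ensure_rows(conn, ids, other_id=2):
    """Insert (id, other_id) for every id that lacks that combination."""
    inserted = 0
    for cool_id in ids:
        exists = conn.execute(
            "SELECT 1 FROM cool_stuff WHERE id = ? AND some_other_id = ?",
            (cool_id, other_id),
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO cool_stuff (id, some_other_id) VALUES (?, ?)",
                (cool_id, other_id),
            )
            inserted += 1
    conn.commit()
    return inserted


inserted = ensure_rows(conn, range(1, 5))
rows = sorted(conn.execute("SELECT id, some_other_id FROM cool_stuff"))
print(inserted, rows)
```

Checking for the exact (id, some_other_id) pair before inserting keeps the operation safe to run more than once, which matters for a migration that might be re-applied.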

grails: converting SQL into domain classes

I am developing a GRAILS application (I'm new to GRAILS and inherited the project from a previous developer). I'm slowly getting a grasp of how GRAILS operates and the use of DOMAIN classes, hibernate etc. The MySQL db is hosted on Amazon and we're using ElastiCache.
Do any of you more knowledgeable folks know how I can go about converting the following SQL statement into domain classes and query criteria?
if(params?.searchterm) {
def searchTerms = params.searchterm.trim().split( ',' )
def resultLimit = params.resultlimit?: 1000
def addDomain = ''
if (params?.domainname){
addDomain = " and url like '%${params.domainname}%' "
}
def theSearchTermsSQL = ""
/*
* create c.name rlike condition for each search term
*
*/
searchTerms.each{
aSearchTerm ->
if( theSearchTermsSQL != '' ){
theSearchTermsSQL += ' or '
}
theSearchTermsSQL += "cname rlike '[[:<:]]" + aSearchTerm.trim() + "[[:>:]]'"
}
/*
* build query
*
*/
def getUrlsQuery = "select
u.url as url,
c.name as cname,
t.weight as tweight
from
(category c, target t, url_meta_data u )
where
(" + theSearchTermsSQL + ")
and
t.category_id = c.id
and t.url_meta_data_id = u.id
and u.ugc_flag != 1 " + addDomain + "
order by tweight desc
limit " + resultLimit.toLong()
/*
* run query
*
*/
Sql sqlInstance = new Sql( dataSource )
def resultsList = sqlInstance.rows( getUrlsQuery )
}
The tables are as follows (dummy data):
[Category]
id | name
-----------
1 | small car
2 | bike
3 | truck
4 | train
5 | plane
6 | large car
7 | caravan
[Target]
id | cid | weight | url_meta_data_id
----------------------------------------
1 | 1 | 56 | 1
2 | 1 | 76 | 2
3 | 3 | 34 | 3
4 | 2 | 98 | 4
5 | 1 | 11 | 5
6 | 3 | 31 | 7
7 | 5 | 12 | 8
8 | 4 | 82 | 6
[url_meta_data]
id | url | ugc_flag
---------------------------------------------
1 | http://www.example.com/foo/1 | 0
2 | http://www.example.com/foo/2 | 0
3 | http://www.example.com/foo/3 | 1
4 | http://www.example.com/foo/4 | 0
5 | http://www.example.com/foo/5 | 1
6 | http://www.example.com/foo/6 | 1
7 | http://www.example.com/foo/7 | 1
8 | http://www.example.com/foo/8 | 0
domain classes
class Category {
static hasMany = [targets: Target]
static mapping = {
cache true
cache usage: 'read-only'
targetConditions cache : true
}
String name
String source
}
class Target {
static belongsTo = [urlMetaData: UrlMetaData, category: Category]
static mapping = {
cache true
cache usage: 'read-only'
}
int weight
}
class UrlMetaData {
String url
String ugcFlag
static hasMany = [targets: Target ]
static mapping = {
cache true
cache usage: 'read-only'
}
static transients = ['domainName']
String getDomainName() {
return HostnameHelper.getBaseDomain(url)
}
}
Basically, a url from url_meta_data can be associated to many categories. So in essence what I'm trying to achieve should be a relatively basic operation...to return all the urls for the search-term 'car', their weight(i.e importance) and where the ugc_flag is not 1(i.e the url is not user-generated content). There are 100K + of records in the db and these are imported from a third-party provider. Note that all the URLs do belong to my client - not doing anything dodgy here.
Note the rlike I've used in the query. I was originally using ilike %searchterm%, but that would find categories where searchterm is part of a larger word (for example 'caravan'). Unfortunately, though, the rlike is not going to return anything if the user requests 'cars'.
I edited the code - as Igor pointed out the strange inclusion originally of 'domainName'. This is an optional parameter passed that allows the user to filter for urls of only a certain domain (e.g. 'example.com')
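The difference between the substring match and the word-boundary match can be sketched in a few lines of Python. This is only an illustration: it uses Python's `re` module with `\b` as a stand-in for MySQL's `[[:<:]]` and `[[:>:]]` markers, the category names are the dummy data from the question, and the function names are made up.

```python
import re

# Category names from the dummy data in the question.
CATEGORIES = ["small car", "bike", "truck", "train", "plane",
              "large car", "caravan"]


def substring_matches(term, names):
    """Behaves like ilike '%term%': matches anywhere inside the name."""
    return [n for n in names if term.lower() in n.lower()]


def whole_word_matches(term, names):
    """Behaves like the rlike with word boundaries: whole words only."""
    pattern = re.compile(r"\b" + re.escape(term.strip()) + r"\b", re.IGNORECASE)
    return [n for n in names if pattern.search(n)]


print(substring_matches("car", CATEGORIES))   # includes 'caravan'
print(whole_word_matches("car", CATEGORIES))  # excludes 'caravan'
print(whole_word_matches("cars", CATEGORIES)) # empty: no name contains 'cars'
```

Note the `re.escape` call: escaping the user-supplied term keeps it from being interpreted as a pattern, which the string-concatenated SQL above does not do.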
I'd create an empty list of given domain objects,
loop over the resultsList, construct a domain object from each row and add it to a list of those objects. Then return that list from controller to view. Is that what you're looking for?
1) If it's a Grails application developed from scratch (rather than based on a legacy database structure) then you probably should already have domain classes Category, Target, UrlMetaData (otherwise you'll have to create them manually or with the db-reverse-engineer plugin)
2) I assume Target has a field Category category and Category has a field UrlMetaData urlMetaData
3) The way to go is probably http://grails.org/doc/2.1.0/ref/Domain%20Classes/createCriteria.html and I'll try to outline the basics for your particular case
4) Not sure what theDomain means - might be a code smell, as well as accepting rlike arguments from the client side
5) The following code hasn't been tested at all; in particular, I'm not sure whether a disjunction inside a nested criteria works. But it might be a suitable starting point; logging SQL queries should help with making it work (see "How to log SQL statements in Grails")
def c = Target.createCriteria() //create criteria on Target
def resultsList = c.list(max: resultLimit.toLong()) { //list all matched entities up to resultLimit results
category { //nested criteria for category
//the following 'if' statement and its body is plain Groovy code rather than part of DSL that translates to Hibernate Criteria
if (searchTerms) { //do the following only if searchTerms list is not empty
or { // one of several conditions
for (st in searchTerms) { // not a part of DSL - plain Groovy loop
rlike('name', st.trim()) //add a disjunction element
}
}
}
urlMetaData { //nested criteria for metadata
ne('ugcFlag', 1) //ugcFlag not equal 1
}
}
order('weight', 'desc') //order by weight
}
Possibly the or restriction works better when written explicitly
if (searchTerms) {
def r = Restrictions.disjunction()
for (st in searchTerms) {
r.add(new LikeExpression('name', st.trim()))
}
instance.add(r) //'instance' is an injected property
}
Cheers,
Igor Sinev
