How to limit the number of results at the tag level using pup? - parsing

In brief:
Is there a way using pup to limit the number of results not overall, but at the tag level?
Backstory/use-case:
Ever since I learned about pup I've been obsessed. I'm constantly thinking of new use cases. This morning I wanted to use it to grab the latest headlines from ESPN.
ESPN seems to have an unordered list like this: <ul class="headlines"> and then a bunch of list items.
A simple solution would be:
$ curl -s -S http://espn.go.com/ | pup .headlines a text{}
right? But, as you can see, there are sometimes multiple links per headline pointing to alternate authors, so you end up with results like "Low", "Anande", "Stark", and "Dinich" (last names of ESPN authors).
Ideally I'd like to do something like this:
$ curl -s -S http://espn.go.com/ | pup .headlines li a slice{:1} text{}
but that only returns the first result. :\
There are multiple <a> tags per <li>, so I'd like to retrieve all of the <li> items, but limit the number of <a> tags to 1 per <li>. Is this possible?

$ curl -s -S http://espn.go.com/ | pup '.headlines li a:first-of-type text{}'
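The :first-of-type pseudo-class matches only the first <a> inside each <li>, so you still get one result per headline instead of the extra author links.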

Related

Result from 'findstr /G:' not complete, comparing to 'grep -f'

I have to look up a list of thousands of gene names (genelist.txt, one column) in a database file (database.txt, multiple columns). Any line containing at least one gene name from genelist.txt should be extracted to output.txt.
I used to do it like this:
findstr /G:genelist.txt database.txt >output.txt
It works well and fast. However, I just found out today that the final output is affected by the gene order in the original genelist.txt: an unsorted gene list gives one result, while sorting the list and searching again gives another result with more lines. But even with the sorted gene list, output.txt still does not contain all the matching lines; I'm missing some records. I only noticed this after comparing with the result from
grep -f "genelist.txt" database.txt > output.txt
The results from grep are identical whether or not the gene list is sorted, but grep is a bit slower than findstr.
I was wondering why this happens. Can I add any arguments to findstr to make it return a complete list of results?
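One way to pin down which records findstr misses is to diff the two result sets. A minimal sketch, assuming the same environment where the grep comparison was run (sort and comm available; the intermediate filenames are illustrative):
$ grep -f genelist.txt database.txt | sort > grep_out.txt
$ findstr /G:genelist.txt database.txt | sort > findstr_out.txt
$ comm -23 grep_out.txt findstr_out.txt    # lines grep found but findstr missed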

Splunk get combined result from 2 events

I am a Splunk noob; I've been trying to write a query for a couple of hours without success so far.
I want to count the number of times the command 'install' was triggered and the exit code was '0'
Each install command writes its log to a new file named with the format 'install_timestamp', so I am searching for source="install*".
Using 2 source files as example
source1:
event1:command=install
... //a couple of other events
event100:exit_code=0
source2:
event1:command=install -f
... //a couple of other events
event100:exit_code=0
In this case I want the result to be 1: only one occurrence of exit_code=0 where the command was 'install' (not 'install -f').
The thing that's confusing me is that the information for command and exit_code is in different events. I can get each of the two events separately, but I am unable to figure out how to get the combined result.
Any tips on how I can achieve the result I want? Thanks!
It's a little crude but you could do something like this...
("One String" NOT "Bad String") OR "Another String" | stats count by source | where count > 1
It will basically give you a list of files that contain events matching both strings. For your example this would be something like...
("command=install" NOT "-f") OR "exit_code=0" | stats count by source | where count > 1

Create a list of all matched strings without duplicates

I have a list of urls that look like:
http://example.com/page1
http://example.com/page1
http://example.com/page1
http://example.com/page2
http://example.com/page2
http://example.com/page3
From this, I want to create a list that's like:
http://example.com/page1
http://example.com/page2
http://example.com/page3
So if there is more than one match, I want to return only one of the matches. What would the grep pattern be for that? Thanks.
You can do it very easily using awk:
$ awk '!url[$0]++' input
http://example.com/page1
http://example.com/page2
http://example.com/page3
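The !url[$0]++ idiom prints a line only the first time it is seen: url[$0] starts at 0, so the negation is true on the first occurrence, and the post-increment makes it false for every later duplicate. If you don't need to preserve the order of first appearance, sort -u gives the same unique set:
$ sort -u input
http://example.com/page1
http://example.com/page2
http://example.com/page3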

Count how many metrics matches a condition in graphite

I have a set of classes that extract info from the web. Every time one of them saves something, it sends a different counter to Graphite, so each of them is a different metric.
How do I know how many of them satisfy a certain condition?
For example, let:
movingAverage(summarize(groupByNode(counters.crawlers.*.saved, 2, "sumSeries"), "1hour"), 24)
be the average content downloaded over the past 24 hours. How can I know, at a moment t, how many of my metrics have this value above 0?
On the render endpoint, add format=json. This returns the data points with their corresponding epoch timestamps in JSON, which is a breeze to parse. The timestamps for which your script sent nothing will be null.
[{
  "target": "carbon.agents.ip-10-0-0-228-a.metricsReceived",
  "datapoints": [
    [912, 1383888170],
    [789, 1383888180],
    [800, 1383888190],
    [null, 1383888200],
    [503, 1383888210],
    [899, 1383888220]
  ]
}]
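For instance, assuming jq is installed (the Graphite host below is a placeholder), you can pull the non-null values out of a series like this:
$ curl -s 'http://graphite.example.com/render?target=carbon.agents.ip-10-0-0-228-a.metricsReceived&format=json' \
    | jq '.[0].datapoints | map(select(.[0] != null) | .[0])'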
You can use the currentAbove function. Check it out. For example:
currentAbove(stats.route.*.servertime.*, 5)
The above example gets all of the metrics (in the series) that are above 5.
You could then count the number of targets returned; while Graphite doesn't provide a way to count the "buckets" directly, you should be able to capture it fairly easily.
For example, here is a way to get a quick count, using curl piped into grep to count occurrences of the word "target":
$ curl -s 'http://test.net/render?target=currentAbove(stats.cronica.dragnet.messages.*,5)&format=json' \
    | grep -o "target" | grep -c "target"
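If jq is available, a less fragile way to get the same count is to take the length of the JSON array that currentAbove returns (same placeholder host and target as above):
$ curl -s 'http://test.net/render?target=currentAbove(stats.cronica.dragnet.messages.*,5)&format=json' \
    | jq 'length'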

GREP: create a list of words that contain a string

I have a folder with a lot of text files and would like to get a list of all words in that folder that contain a certain string. So, e.g., there are words of the form 'IND:abc', 'IND:cde', ..., and I am looking for a way to get a list of all words starting with IND:, so something like:
[IND:abc, IND:cde, IND:...]
Can grep do that?
Cheers,
Chris
grep -ho 'IND:\w\+' * | sort | uniq
-h suppresses the filenames so that you only get the text. -o prints only the matching part of the text. If you want to see the duplicates, just remove the sort and uniq.
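If you want the bracketed, comma-separated form shown in the question, you can pipe the unique matches through a small awk formatter (a sketch using the same grep pattern):
$ grep -ho 'IND:\w\+' * | sort -u | awk 'BEGIN { printf "[" } { printf "%s%s", sep, $0; sep = ", " } END { print "]" }'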
