Duplicates varying slightly in string values with additional temporal aspect - twitter

I use emergency tweets from the Netherlands for a project. Sometimes there is more than one tweet about the same event, varying slightly in timestamp and in the text of the tweet itself. I want to delete those "duplicates".
So in my database I have rows that are quite alike but not exactly the same, like
"2014-01-11 10:01:17";"HV 1 METINGEN (+Inc,net: 1+) (KLEIN OGS) (slachtoffers: ) , Van Ostadestraat 332 AMSTERDAM [ ] "
"2014-01-11 09:59:06";"HV 1 METINGEN (+Inc,net: 1+) (KLEIN OGS) (slachtoffers:1) , Van Ostadestraat 332 AMSTERDAM ] "
The problem is that I have to take the temporal aspect into account and can't rely on the string alone, because the same text can occur multiple times.
Ideally, I would delete all rows within a temporal buffer of 10 minutes after the first tweet when the text similarity is over a threshold of 0.75.
For the string comparison I tried similarity(text, text), see
http://www.postgresql.org/docs/9.1/static/pgtrgm.html
For the time aggregation I used:
(extract(minute FROM timestamp_column)::int / 10)
in addition to the regular YYYY-MM-DD-HH24 time aggregation
Any help is appreciated.
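The rule described above can also be sketched outside the database. The Python below is a minimal sketch under assumptions, not a tested solution for the real data: `trigrams` and `similarity` only approximate what pg_trgm's similarity() computes (lowercased words padded with blanks, shared trigrams divided by all distinct trigrams), and `drop_near_duplicates` is a hypothetical helper name. It compares each tweet against earlier kept tweets inside a sliding 10-minute window instead of fixed buckets, because the two sample rows (09:59:06 and 10:01:17) would land in different hour and 10-minute buckets.

```python
import re
from datetime import datetime, timedelta


def trigrams(text):
    """Approximation of pg_trgm: lowercase, split into alphanumeric words,
    pad each word with two leading blanks and one trailing blank."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(rows, window=timedelta(minutes=10), threshold=0.75):
    """rows: iterable of (datetime, text) pairs.
    Keeps the earliest tweet of each event and drops later tweets that are
    within `window` of a kept tweet and more similar than `threshold`."""
    kept = []
    for ts, text in sorted(rows):
        is_duplicate = any(
            ts - kept_ts <= window and similarity(text, kept_text) > threshold
            for kept_ts, kept_text in kept
        )
        if not is_duplicate:
            kept.append((ts, text))
    return kept
```

With the two sample rows, only the 09:59:06 tweet is kept; the same text posted again more than 10 minutes later would be kept as a new event.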

Related

Google Sheets: Filter cells with text and numeric value greater than x

I built a scraper for the Twitter account of a local transportation network to find out how many bus and tram rides are cancelled each day. All my results are listed in a Google Sheets table. I now want to build one table sheet with the information on all trams and another one with the information on all busses. They differ in their numeration: tram lines are numbered 1 - 20 (in German "Linie 1", "Linie 2", and so on), whereas the bus numbers are greater than 100.
How can I write a filter command that combines text and a number range? That's what I've got so far, but I don't know how to insert the 1-20 or >100 range behind the word "Linie"...
Additional note: As there are other numbers in each tweet, the number has to follow directly after the word "Linie".
=FILTER(Tabellenblatt1!C:D; REGEXMATCH(Tabellenblatt1!C:C; "Linie"))
You can use REGEXEXTRACT to get the numbers after Linie like this:
REGEXEXTRACT(Tabellenblatt1!C:C, "Linie (\d+)")
With INDEX you'll be able to check the whole range, and since REGEXEXTRACT returns a string you can multiply by one to get its value and compare to your desired numbers:
INDEX(REGEXEXTRACT(Tabellenblatt1!C:C, "Linie (\d+)")*1)<=20
Then you can add the two conditions together (the sum acts as a logical OR) to keep the rows where the line number is at most 20 or at least 100:
=FILTER(Tabellenblatt1!C:D, (INDEX(REGEXEXTRACT(Tabellenblatt1!C:C, "Linie (\d+)")*1)<=20)+(INDEX(REGEXEXTRACT(Tabellenblatt1!C:C, "Linie (\d+)")*1)>=100))
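For comparison, the same extract-then-compare logic can be sketched in Python, for example inside the scraper itself. This is only an illustration: `line_number` and `split_by_vehicle` are hypothetical names, and the ranges (1 to 20 for trams, greater than 100 for buses) are taken from the question rather than from the formula above, which uses >=100.

```python
import re

# The number must follow directly after the word "Linie".
LINE_RE = re.compile(r"Linie (\d+)")


def line_number(tweet):
    """Return the line number following 'Linie', or None if there is none."""
    match = LINE_RE.search(tweet)
    return int(match.group(1)) if match else None


def split_by_vehicle(tweets):
    """Split tweets into (trams, buses) by the number after 'Linie'."""
    trams, buses = [], []
    for tweet in tweets:
        number = line_number(tweet)
        if number is None:
            continue  # no "Linie <number>" in this tweet
        if 1 <= number <= 20:
            trams.append(tweet)
        elif number > 100:
            buses.append(tweet)
    return trams, buses
```

Other numbers in a tweet (times, dates) are ignored because only the digits right after "Linie " are captured.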

Repeat N1:Nx rows Y1:Yx times?

I'm trying to create a Google sheet for an address label mail merge to direct people to their nearest outlet.
For 105 people, it might be Store 2 at 300 Block St; for another 60, it might be Store 8 at 55 Front Ave.
The goal is to have Google Sheets output a table with 105 rows of "Store 2; 300 Block Street", 60 rows of "Store 8; 55 Front Ave", etc.
I've tried using
transpose(split(rept("<cell with address>"&",", "<number of rows>"), ","))
but that's super laborious and error-prone to type out if I have 30 locations to repeat the process for.
Any ideas?
EDIT:
I managed to solve the problem soon after posting this but have left it up to see if there was a better way. The key to getting it working was using JOIN. Here is what I ended up using:
=arrayformula(transpose(split(join(",",rept(F2:F&",",H2:H)),",")))
You could create a master key table that feeds this formula:
=TRANSPOSE(SPLIT(JOIN(",", ARRAYFORMULA(REPT(SPLIT(
INDIRECT("A1:A"&COUNTA(A1:A)), ",")&",",
INDIRECT("B1:B"&COUNTA(B1:B))))), ","))
or standalone like:
=TRANSPOSE(SPLIT(JOIN(",", ARRAYFORMULA(REPT(SPLIT(
{"300 Block Street"; "55 Front Ave"; "102 King Street"}, ",")&",",
{10; 6; 2}))), ","))
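What these formulas do (repeat each value as many times as its paired count, then flatten the result into one column) can be sketched in Python as follows. `repeat_rows` is a hypothetical helper name, and the sketch assumes the counts are non-negative integers.

```python
def repeat_rows(values, counts):
    """Repeat each value `count` times, preserving the original order."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    return [value for value, count in zip(values, counts) for _ in range(count)]
```

Called with the three addresses and the counts 10, 6 and 2 from the standalone formula, it returns a list of 18 rows.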

Converting formula to ARRAYFORMULA issues with SUM and INDEX

I have a scoring spreadsheet for a competition I'm working on. Competitors' place/rank are converted into points towards the overall series based on a chart of corresponding values. For ties, the sum of the points covered by all of the tied places are split evenly among the tied competitors (i.e. 2-way tie for 3rd; if 3rd usually gets 10 points and 4th usually gets 8, these competitors will receive (10+8)/2 (2 being the # of tied competitors), so they each receive 9 points).
I have a formula which does this exact calculation:
=IFERROR(IF(ISBLANK($A4:$A),,SUM(INDEX(SeriesPoints, E4:E):INDEX(SeriesPoints, MIN(E4:E + COUNTIF(E$4:E, E4:E) - 1, ROWS(SeriesPoints)))) / COUNTIF(E$4:E, E4:E), 0))
Where 'SeriesPoints' is a 2 column array; column 1 is the places/ranks (1:125) and column 2 is their corresponding point values. Column 'E' is the competitors' rank from the competition.
I have been unable to convert this formula to an ARRAYFORMULA() so I can avoid dragging it down the entire sheet (possibly up to 1000+ competitors over the series).
I'm mildly proficient with MMULT(), so I understood that would be a good approach for switching out SUM(), however, I haven't been able to create a matrix of the values to be summed.
INDEX():INDEX() doesn't work with ARRAYFORMULA() so I've tried switching to VLOOKUP(). With VLOOKUP() I've been able to produce the start and end values of the range of values for a tie, but not the full list. For example, if there is a 3-way tie for 4th, I can produce the respective points for 4th and 6th (the bounds of the tie).
In an attempt to list out even just the numbers from 4:6, I've hit a wall converting what would be a simple ROW() or SEQUENCE() formula to a matrix/array.
The following formula produces an array of the upper and lower bounds of ties or the single place should there be no tie, although the single place gets repeated.
=ARRAYFORMULA(IF(COUNTIF(E$4:E,E4:E)=1,E4:E,{E4:E,E4:E+COUNTIF(E$4:E,E4:E)-1}))
I'm assuming if I can get VLOOKUP({#:#}) to fill properly, I'll be where I need to be.
From here, I feel confident in my abilities to wrap a VLOOKUP() for the actual point values, an MMULT() to sum across these rows for the total, then a simple division to produce the correct point value.
Spreadsheet: https://docs.google.com/spreadsheets/d/1lpNewR3p4i7ZHmlFGLlG1tLuxgO-6onSeH8mWTeclBw/edit?usp=sharing
Currently, my workspace is off to the right. The original formula is in F4 and my test codes are working on column G instead of E.
So for sample placements of 1,1,3,3,3,6,7,8 and sample points values of 1000, 850,738,663,633,603,573,550 I expect the output to be 925 for the two 1st place tied competitors, 678 for the tied 3rd places, 603 for 6th, 573 for 7th, and 550 for 8th.
I'd appreciate any and all help!
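To check any array formula against the expected values, the tie rule described above can be written as a small reference sketch in Python. It assumes `series_points` is a plain list of point values ordered by place (1st place first), which simplifies the two-column SeriesPoints range; `tie_points` is a hypothetical name. As in the original formula, the end of a tie is clamped to the last row of the points chart.

```python
from collections import Counter


def tie_points(ranks, series_points):
    """Average the points of all places covered by a tie.

    ranks: competitors' places, e.g. [1, 1, 3, 3, 3, 6]
    series_points: points per place, index 0 holding 1st place
    """
    counts = Counter(ranks)
    result = []
    for rank in ranks:
        tied = counts[rank]
        # Places rank .. rank + tied - 1, clamped to the end of the chart.
        last = min(rank + tied - 1, len(series_points))
        covered = series_points[rank - 1:last]
        result.append(sum(covered) / tied)
    return result
```

With the sample placements 1,1,3,3,3,6,7,8 and the sample points 1000, 850, 738, 663, 633, 603, 573, 550, this gives 925 for both first places, 678 for the three third places, then 603, 573 and 550.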
=ARRAYFORMULA(IFERROR(IFERROR(VLOOKUP(G4:G, QUERY({INDIRECT("G4:G"&counta(A4:A)+3),
VLOOKUP(ROW(INDIRECT("A1:A"&COUNTA(A4:A))), SeriesPoints, 2, 0)},
"select Col1,sum(Col2) group by Col1 label sum(Col2)''", 0), 2, 0))/
IFERROR(VLOOKUP(G4:G, QUERY(G4:G,
"select G,count(G) where G is not NULL group by G label count(G)''", 0), 2, 0))))

How to group data by age range?

Given a data list with two columns: 'Division' and 'Age.'
username year_of_birth
Albert Albo 1977
Bob Bilo 1974
Conan Cornic 1989
Don Duan 1954
Etan Etin 1967
Fabio Forio 1976
I want to put this data into a Pivot Table and group the ages into specified ranges; however, I'm having issues figuring out how to get around grouping them into set increments that don't vary. My first range would need to be 18-24, my next would be 25-29, then 30-34, 35-39, and so on until I hit 64. Then, I would have 65+ all grouped into one, like so:
How could I make it work?
A simpler approach (also a single formula) might be:
=ArrayFormula(vlookup(year(now())-B2:B+1,Larry,2))
where year of birth is in ColumnB. This though does require a named range (Larry) of:
This repeats the assumption that, lacking month, day and time, everyone is treated as having been born at the very start of their year_of_birth.
A contingency is included for under 18s where 0-17 in the array might be replaced by invalid or such like.
Just for fun, let's see if we can make it in a single formula
Creating a pivot from here is trivial.
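The bracket lookup behind the VLOOKUP approach can be mirrored in Python. VLOOKUP on a sorted range without an exact-match request returns the last row whose key is not greater than the search key, and `bisect_right` gives the same behaviour. The bracket table below is an assumption rebuilt from the ranges listed in the question (18-24, then 5-year bands up to 60-64, then 65+, plus a 0-17 contingency); it is not the original content of the named range. `age_bracket` and `count_by_bracket` are hypothetical names.

```python
from bisect import bisect_right
from collections import Counter

# Lower bound of each bracket, sorted ascending (assumed table).
LOWER_BOUNDS = [0, 18] + list(range(25, 65, 5)) + [65]
LABELS = ["0-17", "18-24"] + [f"{lo}-{lo + 4}" for lo in range(25, 65, 5)] + ["65+"]


def age_bracket(age):
    """Return the label of the bracket whose lower bound is the largest one <= age."""
    if age < 0:
        raise ValueError("age must not be negative")
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]


def count_by_bracket(ages):
    """Count how many ages fall into each bracket (a tiny pivot table)."""
    return Counter(age_bracket(age) for age in ages)
```

How the age itself is derived from year_of_birth (which reference year, and whether to add one as the formula above does) is left to the caller.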

How do you sort an alpha-numeric list in excel?

I'm trying to sort a list of documents, but I'm having an issue with the documents that have a letter as a suffix.
Whenever we amend a document we add a letter to the end of the number, but when I sort by number in Excel it sorts like this:
1
2
3
10
11
1606
1603D
1605B
1606A
1606C
1610A
1623A
20A
220B
390A
399A
402A
415A
450A
488A
557B
How can I make it sort in order of document number and amendment?
Like so:
1
2
3
10
11
1603D
1605B
1606
1606A
1606C
1610A
1623A
20A
220B
390A
399A
402A
415A
450A
488A
557B
As long as you have a mix of text and number, you won't be able to use Excel's built-in sort to achieve the result you describe.
If you append a letter to a number you effectively change the data type from number to text. Text will always be sorted after any number, hence the number 1606 comes before the text 1606A.
You could try to make all values real text, maybe indicate levels by appending digits with dots, like this:
1.
1.0.
1.1.
1.6.0.3.D
1.6.0.5.B
1.6.0.6.
1.6.0.6.A
1.6.0.6.C
1.6.1.0.A
1.6.2.3.A
2.
2.0.A.
2.2.0.B.
3.
3.9.0.A.
3.9.9.A.
4.0.2.A.
4.1.5.A.
4.5.0.A.
4.8.8.A.
5.5.7.B.
But even that does not give you the sort order you describe as the desired result.
Your desired sort order will be hard to achieve even if all values are text, or if you replace the A, B, C with a decimal .1, .2, .3. -- It's really hard to understand why 20 would come after 1623.
The solution I found was to add a column, and copy this formula into each cell:
=IF(ISNUMBER(--RIGHT(A2)),A2,LEFT(A2,LEN(A2)-1))
The formula removes the trailing letter from each document number; you can then sort your sheet using the new column of clean numbers.
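The same idea, splitting each entry into its number and its amendment letter and sorting on both, can be sketched in Python. This is an illustration only: `doc_key` is a hypothetical name, and it assumes every entry is a run of digits optionally followed by letters. It orders entries by number first and amendment second, so 20A lands between 11 and 1603D, which differs from the order listed in the question.

```python
import re


def doc_key(doc_id):
    """Sort key: (document number as int, amendment letters)."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", doc_id.strip())
    if match is None:
        raise ValueError(f"unexpected document id: {doc_id!r}")
    return int(match.group(1)), match.group(2).upper()


def sort_documents(doc_ids):
    """Return the ids ordered by number, then by amendment letter."""
    return sorted(doc_ids, key=doc_key)
```

An unamended document such as 1606 sorts before 1606A because its amendment part is the empty string.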

Resources