I'm trying to read a file that contains one tweet per line and convert each character of each tweet to an integer. The file can be found here.
However, there is something wrong with line 28 of that file. When I look at the file, I see that the line is as follows:
Wish she could have told me herself. #NicoleScherzy #nicolescherzinger
#OneLove #myfav #MyQueen :heavy_black_heart:️:heavy_black_heart:️
Furthermore, while reading the file I print out each line as I read it; in that case, the line is printed as (ignore the first two segments, for simplicity):
Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart:️:heavy_black_heart:️
Now, if I try to print it character by character, I get an error. Here is the code for that and the error I got:
x=" Wish she could have told me herself. #NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart:️:heavy_black_heart:️"
for i = 1:length(x)
    println(x[i])
end
.
.
.
INFO: #
INFO: m
INFO: y
INFO: f
INFO: a
INFO: v
INFO:
INFO: #
INFO: M
INFO: y
INFO: Q
INFO: u
INFO: e
INFO: e
INFO: n
INFO:
INFO: :
INFO: h
INFO: e
INFO: a
INFO: v
INFO: y
INFO: _
INFO: b
INFO: l
INFO: a
INFO: c
INFO: k
INFO: _
INFO: h
INFO: e
INFO: a
INFO: r
INFO: t
INFO: :
INFO: ️
ERROR: UnicodeError: invalid character index
in slow_utf8_next(::Array{UInt8,1}, ::UInt8, ::Int64) at ./strings/string.jl:67
in next at ./strings/string.jl:96 [inlined]
in getindex(::String, ::Int64) at ./strings/basic.jl:70
in macro expansion; at ./REPL[2]:1 [inlined]
in anonymous at ./<missing>:?
What the heck is that? Why is h represented as h with a bar on top, and why is there a space just before the error message? Shouldn't there be another :heavy_black_heart:?
Strings and Unicode are complicated everywhere (because human language is complicated), and Julia is no exception. In addition, the implementation will probably (and should) change in the future. As of v0.5 / v0.6, the way to write the loop in the question is:
for c in x
    println(c)
end
And to use indexing, something like:
i = 1
while i <= endof(x)
    println(x[i])
    i = nextind(x, i)
end
In general, you should be familiar with endof and nextind to write proper string-manipulation code in Julia as of v0.5 / v0.6. The REPL help and the documentation cover both.
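Since the original goal was to convert each character of the tweet to an integer, here is a minimal sketch under the same v0.5 / v0.6 assumptions (Int(c) is simply the character's Unicode code point):
# iterate characters directly; safe for any valid UTF-8 string
codes = [Int(c) for c in x]

# or, with explicit byte indices:
i = 1
while i <= endof(x)
    println(Int(x[i]))
    i = nextind(x, i)
end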
I'm having trouble figuring out how to edit a function in GNU APL.
I have tried ∇func (DEFN ERROR), )edit (BAD COMMAND), )editor (BAD COMMAND)
and all give me errors.
Any suggestion on how to edit a simple function would be appreciated.
GNU APL 1.8 on Arch Linux
Build Date: 2020-07-07 19:33:16 UTC
You should be able to use the line editor (a.k.a. the ∇ editor).
Dyalog's documentation should apply to GNU APL too.
An example session:
⍝ define Foo:
∇r←Foo y
r←2×y
∇
⍝ apply Foo (result to stdout):
Foo 10
⍝ change Foo's line 1:
∇Foo[1]
r←3×y
∇
⍝ apply new Foo (result to stdout):
Foo 10
⍝ display Foo's definition (to stderr)
∇Foo[⎕]∇
This should print the following to stdout:
20
30
And report Foo's new definition to stderr:
[0] r←Foo y
[1] r←3×y
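As an aside (not part of the session above, and offered as standard APL that GNU APL should also support): for scripted edits it can be easier to work with a function's canonical representation via ⎕CR and ⎕FX than with the ∇ editor:
⍝ canonical (character-matrix) representation of Foo:
⎕CR 'Foo'
⍝ fix (define) a function from such a representation:
⎕FX 'r←Bar y' 'r←4×y'
⍝ apply it:
Bar 10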
I am trying to experiment with live data from the Coronavirus pandemic (unfortunately, and good luck to all of us).
I have developed a small script and I am transitioning it into a console application; it uses CSV type providers.
I have the following issue. Suppose we want to filter the Italian data by province; we can use this code in a .fsx file:
open FSharp.Data
let provinceData = CsvProvider<"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv", IgnoreErrors = true>.GetSample()
let filterDataByProvince province =
    provinceData.Rows
    |> Seq.filter (fun x -> x.Sigla_provincia = province)
Since sequences are lazy, suppose I then force evaluation to load the data for the province of Rome into memory; I can add:
let romeProvince = filterDataByProvince "RM" |> Seq.toArray
This works fine, run by FSI, locally.
Now, if I transition this code into a console application using a .fs file, I declare exactly the same functions and use exactly the same type-provider loader; but instead of using the last line to gather the data, I put it into a main function:
[<EntryPoint>]
let main _ =
    let romeProvince = filterDataByProvince "RM" |> Seq.toArray
    Console.Read() |> ignore
    0
This results in the following runtime exception:
System.Exception
HResult=0x80131500
Message=totale_casi is missing
Source=FSharp.Data
StackTrace:
at <StartupCode$FSharp-Data>.$TextRuntime.GetNonOptionalValue#139-4.Invoke(String message)
at CoronaSchiatta.Evoluzione.provinceData#10.Invoke(Object parent, String[] row) in C:\Users\glddm\source\repos\CoronaSchiatta\CoronaSchiatta\CoronaEvolution.fs:line 10
at FSharp.Data.Runtime.CsvHelpers.parseIntoTypedRows#174.GenerateNext(IEnumerable`1& next)
Can you explain that?
Possibly some rows have an odd format, but the FSI session is robust to those, whilst the console version is fragile; why? How can I fix that?
I am using VS2019 Community Edition, targeting .NET Framework 4.7.2, F# runtime: 4.7.0.0;
as FSI, I am using the following: FSI Microsoft (R) F# Interactive version 10.7.0.0 for F# 4.7
PS: Please also be aware that if I use CsvFile instead of type providers, as in:
let test =
    @"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv"
    |> CsvFile.Load
    |> (fun x -> x.Rows)
    |> Seq.filter (fun x -> x.[6] = "RM")
    |> Seq.iter (fun x -> x.[9] |> Console.WriteLine)
Then it works like a charm in the console application too. Of course, I would like to use type providers; otherwise I have to add a type definition mapping the schema to the columns (and that would be more fragile). The last line was just a quick test.
Fragility
CSV type providers can be fragile if you don't have a good schema or sample.
Getting a runtime error here is almost certainly because your data doesn't match the inferred schema.
How do you figure it out? One way is to run through your data first:
provinceData.Rows |> Seq.iteri (fun i x -> printfn "Row %d: %A" (i + 1) x)
This runs up to Row 2150. And sure enough, the next row:
2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0,
You can see the last value (totale_casi) is missing.
One of CsvProvider's options is InferRows. This is the number of rows the provider scans in order to build up a schema - and its default value happens to be 1000.
So:
type COVID = CsvProvider<uri, InferRows = 0>
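Concretely, with the URL from the question (a sketch; inferring from every row lets the provider see the missing totale_casi values, at the cost of a slower build):
type COVID = CsvProvider<"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv", InferRows = 0>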
A better way to prevent this from happening in the future is to manually define a sample from a subset of the data:
type COVID = CsvProvider<"sample-dpc-covid19-ita-province.csv">
and sample-dpc-covid19-ita-province.csv is:
data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
2020-02-24 18:00:00,ITA,13,Abruzzo,066,L'Aquila,AQ,42.35122196,13.39843823,
2020-02-24 18:00:00,ITA,13,Abruzzo,068,Pescara,PE,42.46458398,14.21364822,0
2020-02-24 18:00:00,ITA,13,Abruzzo,067,Teramo,TE,42.6589177,13.70439971,0
With this the type of totale_casi is now Nullable<int>.
If you don't mind NaN values, you can also use:
CsvProvider<..., AssumeMissingValues = true>
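For example, a minimal sketch of consuming the now-nullable column with the sample-based definition above (property names are assumed to follow the question's Sigla_provincia pattern):
type COVID = CsvProvider<"sample-dpc-covid19-ita-province.csv">

let data = COVID.Load("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv")

// Totale_casi is Nullable<int> here, so guard before reading .Value
data.Rows
|> Seq.filter (fun r -> r.Sigla_provincia = "RM" && r.Totale_casi.HasValue)
|> Seq.sumBy (fun r -> r.Totale_casi.Value)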
Why does FSI seem more robust?
FSI isn't more robust. This is my best guess:
Your schema source is being regularly updated.
Type providers cache the schema so that they don't regenerate it every time you compile your code, which could be impractical. When you restart an FSI session, you end up regenerating your type provider, but not so with the console application. So FSI can sometimes appear less error-prone simply because it has worked from a newer copy of the source.
Extract the NUMBER from logs and convert storage units.
Sample Logs -
2020-02-04 16:18:56,783 INFO Log4jFactory$Log4jLogger [10.xxx.xxx.xxx]:5701 [Dry-PROD-XC6] [3.7.6] Received auth from Connection[id=26876, /10.xxx.xxx.xxx:5701->/10.xxx.xxx.xxx:56584, endpoint=null, alive=true, type=CSHARP_CLIENT], successfully authenticated, principal : ClientPrincipal{uuid='d7d8b718-ed75-4cc3-b51a-c620bb082255', ownerUuid='058720ad-7b35-40f6-8978-bd9cf7e286ec'}, owner connection : true, client version : null
2020-02-04 16:15:27,519 INFO Log4jFactory$Log4jLogger [10.xxx.xxx.xxx]:5701 [Dry-PROD-XC6] [3.7.6] processors=8, physical.memory.total=31.4G, physical.memory.free=18.8G, swap.space.total=7.8G, swap.space.free=7.8G, heap.memory.used=4.5G, heap.memory.free=536.5M, heap.memory.total=5.0G, heap.memory.max=5.0G, heap.memory.used/total=89.51%, heap.memory.used/max=89.51%, native.memory.used=8.2M, native.memory.free=3.5G, native.memory.total=64.0M, native.memory.max=3.5G, native.meta.memory.used=80.0M, native.meta.memory.free=432.0M, native.meta.memory.percentage=90.75%, minor.gc.count=18605, minor.gc.time=121136ms, major.gc.count=0, major.gc.time=0ms, load.process=0.00%, load.system=0.01%, load.systemAverage=1.00%, thread.count=69, thread.peakCount=229, cluster.timeDiff=4083, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=0, executor.q.priorityOperation.size=0, operations.completed.count=351751533, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=0, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=1, proxy.count=0, clientEndpoint.count=231, connection.active.count=232, client.connection.count=231, connection.count=1
I'm trying to extract the NUMBER from the memory fields (like native.memory.used=8.2M, native.memory.free=3.5G, native.memory.total=64.0M, native.memory.max=3.5G) and convert the units using a Ruby filter.
I used the Logstash KV filter first to obtain the key-value pairs, then tried the following code (I'm new to Ruby):
My Ruby code -
ruby {
code => '
event.to_hash.keys.each { |k,v|
matches = v.scan(/(\d*\.\d*?i)([KMG])$/)
if matches[2] == nil
event.set(k,event.get(v))
elsif matches[2] == "K"
multiplyBy = 1024
event.set(k, matches[1].to_f * multiplyBy)
elsif matches[2] == "M"
multiplyBy = 1024 * 1024
event.set(k, matches[1].to_f * multiplyBy)
elsif matches[2] == "G"
multiplyBy = 1024 * 1024 * 1024
event.set(k, matches[1].to_f * multiplyBy)
else
event.set(k, event.get(v))
end
}
'
}
I'm seeing errors in the logs -
[2020-02-11T17:01:09,603][ERROR][logstash.filters.ruby ][main] Ruby exception occurred: undefined method `scan' for nil:NilClass
I'd appreciate any help or guidance. Thank you.
Regex would do the trick:
If you want to match the unit:
\d*?\.\d*?(M|G)
If you just want to match the number (integer or decimal):
\d*?(\.)\d*
\d stands for a digit
* means zero or more repetitions
? after * makes the match non-greedy (it minimizes the parser's effort, which is good practice)
(M|G) is an alternation: it matches either of the two characters M or G (the exact syntax may vary with the language running the parser)
If you have any doubt, you can test the regular expression here.
You can tighten your regex by constraining the number of decimals with {}, and by making the . optional if you are not sure the values will always have decimals.
As far as I remember, Logstash also has options to parse with a regex using grok or another built-in filter and output the fields directly in the form you want.
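For the Ruby filter itself, here is a minimal sketch of the conversion loop (not the original code): event.to_hash.keys.each yields only keys, so v was always nil, which is exactly what raised the scan-on-nil error; iterating the key/value pairs and skipping non-string values avoids that. The "8.2M"-style value format is an assumption:
ruby {
  code => '
    event.to_hash.each { |k, v|
      next unless v.is_a?(String)               # guard: calling scan/match on nil raised the original error
      m = v.match(/\A(\d+(?:\.\d+)?)([KMG])\z/) # e.g. "8.2M" captures ["8.2", "M"]
      next if m.nil?
      factor = { "K" => 1024, "M" => 1024 ** 2, "G" => 1024 ** 3 }[m[2]]
      event.set(k, m[1].to_f * factor)          # replace the field with a byte count
    }
  '
}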
I'm using luacheck (within the Atom editor), but I'm open to other static analysis tools.
Is there a way to check whether I'm using an uninitialized table field? I read the docs (http://luacheck.readthedocs.io/en/stable/index.html), but maybe I missed how to do this.
In all three cases in the code below I'm trying to detect the (erroneous) use of field 'y1'. None of them is flagged. (At run time it is detected, but I'm trying to catch it before run time.)
local a = {}
a.x = 10
a.y = 20
print(a.x + a.y1) -- no warning about uninitialized field y1 !?
-- luacheck: globals b
b = {}
b.x = 10
b.y = 20
print(b.x + b.y1) -- no warning about uninitialized field y1 !?
-- No inline option for luacheck re: 'c', so plenty of complaints
-- about "non-standard global variable 'c'."
c = {} -- warning about setting
c.x = 10 -- warning about mutating
c.y = 20 -- " " "
print(c.x + c.y1) -- more warnings (but NOT about field y1)
The point is this: as projects grow (files grow, and the number & size of modules grow), it would be nice to prevent simple errors like this from creeping in.
Thanks.
lua-inspect should be able to detect and report these instances. I have it integrated into the ZeroBrane Studio IDE, and when running with deep analysis it reports the following on this fragment:
unknown-field.lua:4: first use of unknown field 'y1' in 'a'
unknown-field.lua:7: first assignment to global variable 'b'
unknown-field.lua:10: first use of unknown field 'y1' in 'b'
unknown-field.lua:14: first assignment to global variable 'c'
unknown-field.lua:17: first use of unknown field 'y1' in 'c'
(Note that the integration code only reports the first instance of each of these errors, to minimize the number of reports; I also fixed an issue where only the first unknown instance of a field was reported, so you may want to use the latest code from the repository.)
People who look into questions related to "Lua static analysis" may also be interested in the various dialects of typed Lua, for example:
Typed Lua
Titan
Pallene
Ravi
But you may not have heard of "Teal" (early in its life it was called "tl").
I'm taking the liberty of answering my original question using Teal, since I find it intriguing.
-- 'record' (like a 'struct')
local Point = record
    x : number
    y : number
end
local a : Point = {}
a.x = 10
a.y = 20
print(a.x + a.y1) -- will trigger an error
-- (in VS Code using teal extension & at command line)
From command line:
> tl check myfile.tl
========================================
1 error:
myfile.tl:44:13: invalid key 'y1' in record 'a'
By the way...
> tl gen myfile.tl
creates a pure Lua file, myfile.lua, that has no type information in it. Note that running this Lua file will still trigger the nil error: lua: myfile.lua:42: attempt to index a nil value (local 'a').
So, Teal gives you a chance to catch 'type' errors, but it doesn't require you to fix them before generating Lua files.
According to the docs, the ValuesAll member of a Deedle series:
Returns a collection of values, including possibly missing values. Note that the length of this sequence matches the `Keys` sequence.
However, the following code raises the error OptionalValue.Value: Value is not available. Is this expected behaviour? I was expecting ValuesAll to return Double.NaN for the missing values.
#I "..\..\packages\Deedle.1.2.4"
#load "Deedle.fsx"
open System
open System.Globalization
open System.Collections.Generic
open Deedle
let ts = [DateTime.Now.Date => Double.NaN; DateTime.Now.Date.AddDays(1.0) => 1.0] |> series
ts.Print()
ts.ValuesAll
>
27/01/16 12:00:00 AM -> <missing>
28/01/16 12:00:00 AM -> 1
val ts : Series<DateTime,float> =
27/01/16 12:00:00 AM -> <missing>
28/01/16 12:00:00 AM -> 1
val it : seq<float>
> ts.ValuesAll
;;
val it : seq<float> = Error: OptionalValue.Value: Value is not available
>
There are two different implementations: an F#-friendly valuesAll here and a C#-friendly ValuesAll here. The latter just accesses the .Value property of each optional value, and its signature is seq<float>, not seq<float opt>. Either the implementation or the docs are inconsistent here.
When I used Deedle I filtered series with |> Series.dropMissing as a quick workaround when I needed only present values.
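A minimal sketch of that workaround, plus a fill-based variant that produces the NaN behaviour the question expected (written against Deedle 1.2.x; both functions live in the Series module):
// keep only the present values, then enumerate safely
ts |> Series.dropMissing |> Series.values |> Seq.iter (printfn "%g")

// or keep every key and replace missing values with NaN before enumerating
ts |> Series.fillMissingWith nan |> Series.values |> Seq.iter (printfn "%g")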