How to parse using Grok from Java - is there any example available? - parsing

I have seen that Grok is very powerful for parsing log data. I want to use Grok for log parsing in our application, which is written in Java. How can I connect to / work with Grok from Java?

Try downloading java-grok from GitHub: https://github.com/NFLabs/java-grok
You can test patterns using the Grok Debugger: http://grokdebug.herokuapp.com/
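Here is a rough sketch of what using java-grok looks like. The project has moved and its API has changed across versions, so treat the package, class, and method names below (which follow the newer io.krakens.grok.api style) as approximate and check the README of the version you actually pick up:

import java.util.Map;

// Package and class names vary across java-grok versions; newer releases use io.krakens.grok.api.
import io.krakens.grok.api.Grok;
import io.krakens.grok.api.GrokCompiler;
import io.krakens.grok.api.Match;

public class GrokFromJava {

    public static void main(String[] args) {
        // Register the pattern definitions bundled with the library (COMBINEDAPACHELOG, IP, etc.).
        GrokCompiler compiler = GrokCompiler.newInstance();
        compiler.registerDefaultPatterns();

        // Compile a grok expression built from the named patterns.
        Grok grok = compiler.compile("%{COMBINEDAPACHELOG}");

        String logLine = "112.169.19.192 - - [06/Mar/2013:01:36:30 +0900] "
            + "\"GET / HTTP/1.1\" 200 44346 \"-\" \"Mozilla/5.0\"";

        // Match the line; the named groups come back as a map keyed by pattern name.
        Match match = grok.match(logLine);
        Map<String, Object> captures = match.capture();
        captures.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}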

Check out this Java library:
https://github.com/aicer/grok
You can include it in your project as a Maven dependency:
<dependency>
  <groupId>org.aicer.grok</groupId>
  <artifactId>grok</artifactId>
  <version>0.9.0</version>
</dependency>
It comes with pre-defined patterns and you can also add your own.
The named patterns are extracted and the results are available in a map, with the group names as the keys and the retrieved values mapped to those keys.
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Add custom pattern
dictionary.addDictionary(new File(patternDirectoryOrFilePath));
// Resolve all expressions loaded
dictionary.bind();
This next example adds string patterns directly to the dictionary without using a file:
final GrokDictionary dictionary = new GrokDictionary();
// Load the built-in dictionaries
dictionary.addBuiltInDictionaries();
// Add custom pattern directly
dictionary.addDictionary(new StringReader("DOMAINTLD [a-zA-Z]+"));
dictionary.addDictionary(new StringReader("EMAIL %{NOTSPACE}#%{WORD}\.%{DOMAINTLD}"));
// Resolve all expressions loaded
dictionary.bind();
Here is a complete example of how to use the library
public final class GrokStage {

    private static final void displayResults(final Map<String, String> results) {
        if (results != null) {
            for (Map.Entry<String, String> entry : results.entrySet()) {
                System.out.println(entry.getKey() + "=" + entry.getValue());
            }
        }
    }

    public static void main(String[] args) {

        final String rawDataLine1 = "1234567 - israel.ekpo@massivelogdata.net cc55ZZ35 1789 Hello Grok";
        final String rawDataLine2 = "98AA541 - israel-ekpo@israelekpo.com mmddgg22 8800 Hello Grok";
        final String rawDataLine3 = "55BB778 - ekpo.israel@example.net secret123 4439 Valid Data Stream";

        final String expression = "%{EMAIL:username} %{USERNAME:password} %{INT:yearOfBirth}";

        final GrokDictionary dictionary = new GrokDictionary();

        // Load the built-in dictionaries
        dictionary.addBuiltInDictionaries();

        // Resolve all expressions loaded
        dictionary.bind();

        // Take a look at how many expressions have been loaded
        System.out.println("Dictionary Size: " + dictionary.getDictionarySize());

        Grok compiledPattern = dictionary.compileExpression(expression);

        displayResults(compiledPattern.extractNamedGroups(rawDataLine1));
        displayResults(compiledPattern.extractNamedGroups(rawDataLine2));
        displayResults(compiledPattern.extractNamedGroups(rawDataLine3));
    }
}

Related

Converting complex JSON string to Map

When I say complex, I mean: a lot of nested objects, arrays, etc.
I am actually stuck on this simple thing:
// Get the result from the endpoint, store it in a complex model object
// and then write it to secure storage.
const storage = FlutterSecureStorage();
String keyData = "my_data";

ServerResponse rd = ServerResponse.fromMap(response.data);
appUser = AppUser.fromMap(rd.data); // appData is a complex object
String? userData = appUser?.toJson(); // Convert the data to JSON. This produces a JSON string with escapes on the nested JSON elements - bear this in mind.
await storage.write(key: keyData, value: userData);

// Now that I stored my data successfully, here comes the challenge: read it back.
String? dataStored = await storage.read(key: keyData);
// Now what?
If I decide to go with appUser = AppUser.fromJson(dataStored), it will be very complicated, because for each level of my JSON there are too many fromJson, fromMap, toJson, toMap calls... It's nuts.
However, I have a fromMap that actually works well, since Dio always receives the data as Map<String, dynamic>. So my question is: is there some way to convert a huge, complex stringified JSON into a full Map<String, dynamic>? json.decode(dataStored) can only convert the properties at the root - nested properties will remain as JSON strings inside the map.
Any clue? Thanks!
This is the main problem, since Dart lacks data classes, so you need to define fromJson/toJson methods yourself. However, there are several useful packages that use code generation to cope with this problem:
json_serializable - basically a direct solution to your problem.
freezed - uses json_serializable under the hood, but also implements some extra useful methods, focusing on the immutability of your data classes.
Other than that, unfortunately, you would need to implement fromJson/toJson methods by yourself.
This website might help you.
Simply paste your JSON there and it will automatically generate fromJson/toJson methods for you, which you can then customize yourself.
I've used it a lot for my company's projects and it's very helpful.
Here's the link to the website: link
import 'package:collection/collection.dart';
import 'package:flutter/foundation.dart';
import 'package:flutter_test/flutter_test.dart';

const userJson = {
  'firstName': 'John',
  'lastName': 'Smith',
};

class User {
  final String firstName;
  final String lastName;

  const User({required this.firstName, required this.lastName});

  Map<String, dynamic> toJson() => {
        'firstName': firstName,
        'lastName': lastName,
      };

  factory User.fromJson(dynamic json) {
    return User(
      firstName: json['firstName'] as String? ?? '',
      lastName: json['lastName'] as String? ?? '',
    );
  }
}

void main() {
  test('from and to json methods', () async {
    final user = User.fromJson(userJson);
    final _userJson = user.toJson();

    debugPrint(userJson.toString());
    debugPrint(_userJson.toString());

    expect(const MapEquality().equals(userJson, _userJson), true);
  });
}

How do I write to multiple files in Apache Beam?

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.
For example, let's say the result consists of
(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)
Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.
And in my case:
Key set is determined when the pipeline is running, not when constructing the pipeline.
Key set may be quite small, but the number of values corresponding to each key may be very very large.
Any ideas?
Handily, I wrote a sample of this case just the other day.
This example is Dataflow 1.x style.
Basically you group by each key, and then you can do this with a custom transform that connects to Cloud Storage. The caveat is that your list of lines per file shouldn't be massive (it has to fit into memory on a single instance, but considering you can run high-mem instances, that limit is pretty high).
...
PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
    .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));

readyToWrite.apply(
    new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
...
And then the transform doing most of the work is:
public class PTransformWriteToGCS
        extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

    private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);

    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    private final String bucketName;
    private final SerializableFunction<String, String> pathCreator;

    public PTransformWriteToGCS(final String bucketName,
            final SerializableFunction<String, String> pathCreator) {
        this.bucketName = bucketName;
        this.pathCreator = pathCreator;
    }

    @Override
    public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {

        return input
            .apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

                @Override
                public void processElement(
                        final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
                        throws Exception {
                    final String key = arg0.element().getKey();
                    final List<String> values = arg0.element().getValue();
                    final String toWrite = values.stream().collect(Collectors.joining("\n"));

                    final String path = pathCreator.apply(key);

                    BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                        .setContentType(MimeTypes.TEXT)
                        .build();

                    LOG.info("blob writing to: {}", blobInfo);

                    Blob result = STORAGE.create(blobInfo,
                        toWrite.getBytes(StandardCharsets.UTF_8));
                }
            }));
    }
}
Just write a loop in a ParDo function!
More details -
I had the same scenario today; the only difference is that in my case key = image_label and value = image_tf_record. So, like what you asked, I am trying to create separate TFRecord files, one per class, each record file containing a number of images. HOWEVER, I'm not sure if there might be memory issues when the number of values per key is very high, as in your scenario:
(Also my code is in Python)
class WriteToSeparateTFRecordFiles(beam.DoFn):

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()
And then in your pipeline, just after the stage where you get key-value pairs, add these two lines:

(p
 | 'GroupByLabelId' >> beam.GroupByKey()
 | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(output_dir))  # output_dir is your output directory
)
You can use FileIO.writeDynamic() for that:

PCollection<KV<String, String>> readfile = (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("somefolder")
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();
In the Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO, via TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations) respectively. See e.g. this method.
Update (2018): Prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.
Just write the lines below in your ParDo class:

from apache_beam.io import filesystems

eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
for record in list(Records):
    eventCSVFileWriter.write(record)
If you want the full code I can help you with that too.
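For the Java SDK, a rough equivalent of the Python snippet above is a DoFn applied after a GroupByKey that opens one channel per key with Beam's FileSystems API (which appeared in later Beam Java releases). This is a sketch under those assumptions; the bucket name and the DoFn class name are placeholders, and it only makes sense when each key's values fit in memory:

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.beam.sdk.values.KV;

// Writes each key's grouped values to gs://<bucket>/<key>.txt.
// Apply it after a GroupByKey, e.g. ParDo.of(new WriteValuesPerKeyFn("my-bucket")).
public class WriteValuesPerKeyFn extends DoFn<KV<String, Iterable<String>>, Void> {

    private final String bucket; // placeholder bucket name

    public WriteValuesPerKeyFn(String bucket) {
        this.bucket = bucket;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        String key = c.element().getKey();
        ResourceId file = FileSystems.matchNewResource(
            "gs://" + bucket + "/" + key + ".txt", false /* not a directory */);

        // FileSystems.create returns a WritableByteChannel; write each value as one line.
        try (WritableByteChannel channel = FileSystems.create(file, MimeTypes.TEXT)) {
            for (String value : c.element().getValue()) {
                channel.write(ByteBuffer.wrap((value + "\n").getBytes(StandardCharsets.UTF_8)));
            }
        }
    }
}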

Google Dataflow Out Of Heap When Creating Multiple Tagged Outputs

I have many large unpartitioned BigQuery tables and files that I would like to partition in various ways. So I decided to try and write a Dataflow job to achieve this. The job I think is simple enough. I tried to write with generics so that I easily apply it both TextIO and BigQueryIO sources. It works fine with small tables, but I keep getting java.lang.OutOfMemoryError: Java heap space when I run it on large tables.
In my main class I either read a file with target keys (made with another DF job) or run a query against a BigQuery table to get a list of keys to shard by. My main class looks like this:
Pipeline sharder = Pipeline.create(opts);

// a functional interface that shows the tag map how to get a tuple tag
KeySelector<String, TableRow> bqSelector = (TableRow row) ->
    (String) row.get("COLUMN") != null ? (String) row.get("COLUMN") : "null";

// a utility class to store a tuple tag list and hash map of String TupleTag
TupleTagMap<String, TableRow> bqTags = new TupleTagMap<>(new ArrayList<>(inputKeys), bqSelector);

// custom transform
ShardedTransform<String, TableRow> bqShard = new ShardedTransform<String, TableRow>(bqTags, TableRowJsonCoder.of());

String source = "PROJECTID:ADATASET.A_BIG_TABLE";
String destBase = "projectid:dataset.a_big_table_sharded_";

TableSchema schema = bq.tables().get("PROJECTID", "ADATASET", "A_BIG_TABLE").execute().getSchema();

PCollectionList<TableRow> shards = sharder.apply(BigQueryIO.Read.from(source)).apply(bqShard);

for (PCollection<TableRow> shard : shards.getAll()) {
    String shardName = StringUtils.isNotEmpty(shard.getName()) ? shard.getName() : "NULL";
    shard.apply(BigQueryIO.Write.to(destBase + shardName)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withSchema(schema));
    System.out.println(destBase + shardName);
}

sharder.run();
I generate a set of TupleTags to use in a custom transform. I created a utility class that stores a TupleTagList and HashMap so that I can reference the tuple tags by key:
public class TupleTagMap<Key, Type> implements Serializable {

    private static final long serialVersionUID = -8762959703864266959L;
    final private TupleTagList tList;
    final private Map<Key, TupleTag<Type>> map;
    final private KeySelector<Key, Type> selector;

    public TupleTagMap(List<Key> t, KeySelector<Key, Type> selector) {
        map = new HashMap<>();
        for (Key key : t)
            map.put(key, new TupleTag<Type>());
        this.tList = TupleTagList.of(new ArrayList<>(map.values()));
        this.selector = selector;
    }

    public Map<Key, TupleTag<Type>> getMap() {
        return map;
    }

    public TupleTagList getTagList() {
        return tList;
    }

    public TupleTag<Type> getTag(Type t) {
        return map.get(selector.getKey(t));
    }
}
Then I have this custom transform that basically has a function that uses the tuple map to output a PCollectionTuple and then moves it into a PCollectionList to return to the main class:
public class ShardedTransform<Key, Type> extends
        PTransform<PCollection<Type>, PCollectionList<Type>> {

    private static final long serialVersionUID = 3320626732803297323L;
    private final TupleTagMap<Key, Type> tags;
    private final Coder<Type> coder;

    public ShardedTransform(TupleTagMap<Key, Type> tags, Coder<Type> coder) {
        this.tags = tags;
        this.coder = coder;
    }

    @Override
    public PCollectionList<Type> apply(PCollection<Type> in) {

        PCollectionTuple shards = in.apply(ParDo.of(
            new ShardFn<Key, Type>(tags)).withOutputTags(
                new TupleTag<Type>(), tags.getTagList()));

        List<PCollection<Type>> shardList = new ArrayList<>(tags.getMap().size());

        for (Entry<Key, TupleTag<Type>> e : tags.getMap().entrySet()) {
            PCollection<Type> shard = shards.get(e.getValue()).setName(e.getKey().toString()).setCoder(coder);
            shardList.add(shard);
        }
        return PCollectionList.of(shardList);
    }
}
The actual DoFn is dead simple; it just uses the lambda provided in the main class to find the matching tuple tag in the hash map for side output:
public class ShardFn<Key, Type> extends DoFn<Type, Type> {

    private static final long serialVersionUID = 961325260858465105L;
    private final TupleTagMap<Key, Type> tags;

    ShardFn(TupleTagMap<Key, Type> tags) {
        this.tags = tags;
    }

    @Override
    public void processElement(DoFn<Type, Type>.ProcessContext c)
            throws Exception {
        Type element = c.element();
        TupleTag<Type> tag = tags.getTag(element);

        if (tag != null)
            c.sideOutput(tags.getTag(element), element);
    }
}
The Beam model doesn't have good support for dynamic partitioning / large numbers of partitions right now. Your approach chooses the number of shards at graph construction time, and then the resulting ParDos likely all fuse together, so you've got each worker trying to write to 80 different BQ tables at the same time. Each write requires some local buffering, so it's probably just too much.
There's an alternate approach which will do the parallelization across tables (but not across elements). This would work well if you have a large number of relatively small output tables. Use a ParDo to tag each element with the table it should go to and then do a GroupByKey. This gives you a PCollection<KV<Table, Iterable<ElementsForThatTable>>>. Then process each KV<Table, Iterable<ElementsForThatTable>> by writing the elements to the table.
Unfortunately, for now you'll have to do the BQ write by hand to use this option. We're looking at extending the Sink APIs with built-in support for this. And since the Dataflow SDK is being further developed as part of Apache Beam, we're tracking that request here: https://issues.apache.org/jira/browse/BEAM-92
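A minimal sketch of that alternative, in the same old-SDK style as the question's code and reusing its "COLUMN" key, source, and destBase names; the per-table BigQuery write itself is left as a comment because, as noted above, it currently has to be done by hand:

PCollection<TableRow> rows = sharder.apply(BigQueryIO.Read.from(source));

rows.apply(ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
        @Override
        public void processElement(ProcessContext c) {
            TableRow row = c.element();
            Object key = row.get("COLUMN");
            // Tag each row with the table it should go to.
            c.output(KV.of(key != null ? key.toString() : "null", row));
        }
    }))
    .setCoder(KvCoder.of(StringUtf8Coder.of(), TableRowJsonCoder.of()))
    .apply(GroupByKey.<String, TableRow>create())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<TableRow>>, Void>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
            String shardName = c.element().getKey();
            Iterable<TableRow> shardRows = c.element().getValue();
            // Write shardRows to destBase + shardName by hand here,
            // e.g. via the BigQuery API client (tabledata.insertAll or a load job).
        }
    }));

This trades element-level parallelism for table-level parallelism, which works well when there are many relatively small output tables.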

JsonProvider fails to parse JSON from an API I wrote myself

I wrote an API using Spring MVC. It's a simple API; I just use it to test JsonProvider.
@ResponseBody
@RequestMapping(value = "/api/test", method = RequestMethod.GET)
public TestClass test(final HttpServletRequest request,
        final HttpServletResponse response) {
    return new TestClass("cc");
}

class TestClass {

    public TestClass() {
    }

    public TestClass(final String name) {
        super();
        this.name = name;
    }

    private String name;

    public String getName() {
        return name;
    }

    public void setName(final String name) {
        this.name = name;
    }
}
The API simply returns {"name":"cc"}.
But JsonProvider throws a compile error:
Severity Code Description Project File Line
Error The type provider 'ProviderImplementation.JsonProvider' reported an error: Cannot read sample JSON from 'http://localhost/api/test': Invalid JSON starting at character 0, snippet =
----
"{\"name\":
-----
json =
------
"{\"name\":\"cc\"}"
------- JsonProcess c:\users\xx\documents\visual studio 2015\Projects\JsonProcess\JsonProcess\Program.fs 8
The F# code:
open FSharp.Data

[<Literal>]
let jsonValue = """
{"name":"cc"}
"""

type JsonData = JsonProvider<"http://localhost/api/test">

[<EntryPoint>]
let main argv =
    0 // return an integer exit code
Using the string literal jsonValue as the sample works fine: type JsonData = JsonProvider<jsonValue>.
I checked the FSharp.Data source code (it's the function asyncRead) to see how they download the JSON. It basically boils down to this:
let readString =
    async {
        let contentTypes = [ HttpContentTypes.Json ]
        let headers = [
            HttpRequestHeaders.UserAgent ("F# Data JSON Type Provider")
            HttpRequestHeaders.Accept (String.concat ", " contentTypes)
        ]
        let! text = Http.AsyncRequestString("http://www.kujiale.com/api/askinvitesearch?query=cc", headers = headers)
        return text
    }
If one runs this code against the url http://www.kujiale.com/api/askinvitesearch?query=cc we see something interesting about what's returned:
"[{\"linkToIdeaBook\":\"/u/3FO4K4UR89F1/huabao\",\"linkToDesi
Note that the content starts with " and that the "strings" are "escaped" with \. So it seems the JSON document is returned as an escaped string. According to json.org the root object must be either an object or array so the parser fails at character 0.
If one switches to contentType HttpContentTypes.Text it starts like this:
[{"linkToIdeaBook":"/u/3FO4K4UR89F1/huabao","linkToDesignCollect":"/u
Which actually turns out to be a valid JSON object.
To me it seems somewhat odd that if you ask for content with content type JSON you get an escaped string but that seems to be the root cause of the failure.
How to resolve it is more difficult to say. One way forward would be a PR to FSharp.Data to allow users to specify the content type used to download content.

Test that either one thing holds or another in AssertJ

I am in the process of converting some tests from Hamcrest to AssertJ. In Hamcrest I use the following snippet:
assertThat(list, either(contains(Tags.SWEETS, Tags.HIGH))
.or(contains(Tags.SOUPS, Tags.RED)));
That is, the list may match either the one or the other. How can I express this in AssertJ? The anyOf function (of course, any is something other than either, but that would be a second question) takes a Condition; I have implemented that myself, but it feels as if this should be a common case.
Edited:
Since 3.12.0, AssertJ provides satisfiesAnyOf, which succeeds if one of the given assertions succeeds:
assertThat(list).satisfiesAnyOf(
    listParam -> assertThat(listParam).contains(Tags.SWEETS, Tags.HIGH),
    listParam -> assertThat(listParam).contains(Tags.SOUPS, Tags.RED)
);
Original answer:
No, this is an area where Hamcrest is better than AssertJ.
To write the following assertion:
Set<String> goodTags = newLinkedHashSet("Fine", "Good");
Set<String> badTags = newLinkedHashSet("Bad!", "Awful");
Set<String> tags = newLinkedHashSet("Fine", "Good", "Ok", "?");
// contains is statically imported from ContainsCondition
// anyOf succeeds if one of the conditions is met (logical 'or')
assertThat(tags).has(anyOf(contains(goodTags), contains(badTags)));
you need to create this Condition:
import static org.assertj.core.util.Lists.newArrayList;
import java.util.Collection;
import org.assertj.core.api.Condition;
public class ContainsCondition extends Condition<Iterable<String>> {
private Collection<String> collection;
public ContainsCondition(Iterable<String> values) {
super("contains " + values);
this.collection = newArrayList(values);
}
static ContainsCondition contains(Collection<String> set) {
return new ContainsCondition(set);
}
#Override
public boolean matches(Iterable<String> actual) {
Collection<String> values = newArrayList(actual);
for (String string : collection) {
if (!values.contains(string)) return false;
}
return true;
};
}
It might not be what you want if you expect that the presence of your tags in one collection implies they are not in the other one.
Inspired by this thread, you might want to use this little repo I put together, which adapts the Hamcrest Matcher API to AssertJ's Condition API. It also includes a handy-dandy conversion shell script.
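If you would rather roll the adapter yourself, a minimal version (not the linked repo's actual code, just a sketch of the idea) is a Condition that delegates to a Hamcrest Matcher:

import org.assertj.core.api.Condition;
import org.hamcrest.Matcher;

// Wraps any Hamcrest Matcher in an AssertJ Condition so existing matchers
// can be reused with assertThat(...).is(...) / has(...).
public final class HamcrestCondition<T> extends Condition<T> {

    private final Matcher<? super T> matcher;

    public HamcrestCondition(Matcher<? super T> matcher) {
        super(matcher.toString()); // reuse the matcher's description
        this.matcher = matcher;
    }

    @Override
    public boolean matches(T value) {
        return matcher.matches(value);
    }
}

With that in place, the original Hamcrest either(...).or(...) matcher from the question can be wrapped in a HamcrestCondition and passed to assertThat(list).is(...).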
