Are Snappy-compressed Avro files queryable in Athena? - avro

According to the Athena documentation, we can use the Avro SerDe to query Avro files, and that works for me (CREATE EXTERNAL TABLE).
But the same documentation states that Athena only supports the Parquet and ORC formats when dealing with Snappy compression.
I don't understand why there should be a dependency between the file format and the compression codec in Athena.
Are Snappy-compressed Avro files not queryable by Athena, or is this just something their documentation doesn't cover?

I tested it and it works fine. You can query Snappy-compressed Avro files in S3 with Athena.
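As a sketch of the kind of DDL the question refers to, the statement below generates a CREATE EXTERNAL TABLE using the Avro SerDe. The table name, columns, and S3 bucket are hypothetical; the SerDe and container input/output format class names are the standard Hive Avro ones. Note that Avro container files record their compression codec internally, which is presumably why no Snappy-specific table property is needed.

```python
def athena_avro_ddl(table, location, columns):
    """Build a CREATE EXTERNAL TABLE statement using the Hive Avro SerDe.

    Snappy compression is recorded inside the Avro container file itself,
    so the DDL does not mention the codec at all.
    """
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'\n"
        f"LOCATION '{location}';"
    )

# Hypothetical table pointing at a bucket of Snappy-compressed Avro files.
ddl = athena_avro_ddl("events", "s3://my-bucket/avro/",
                      [("id", "int"), ("name", "string")])
print(ddl)
```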

Related

Is it possible to read a FireDAC MemTable in Python?

I have some binary files (.adb) created by a FireDAC project, and I want to read the data in Python. Is there any documentation on how to read this format? I know it is proprietary, but is Delphi the only way to consume the data?

Best way to process a GCS file within Dataflow?

I have a PCollection of matched GCS filenames, each of which contains a single compressed JSON blob. What's the best way to read the entire file, decompress it (Gzip format), and JSON decode it?
TextIO is really close, but reads data per-line.
The GCS API offers an example of how to read an entire file, but it doesn't handle decompression, and using it would lead me to reimplement a lot of core functionality.
Are there any existing APIs and/or examples that can give me a head start? Seems like this would be a pretty common use case.
This isn't natively supported in Dataflow. To accomplish reading a JSON blob out of a file, you could implement FileBasedSource:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/FileBasedSource
If that's enough to get started, we can continue to update this answer with more information.
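The answer above targets the Dataflow Java SDK, but the per-file work itself is small. As an illustrative sketch (in Python rather than Java), this is the decompress-and-decode step you would run inside a custom FileBasedSource or a DoFn once you have the file's raw bytes; actually reading the object from GCS is left to the Dataflow/GCS APIs.

```python
import gzip
import io
import json

def decode_gzipped_json(blob: bytes):
    """Decompress a gzip-compressed byte string and JSON-decode the result."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as f:
        return json.loads(f.read().decode("utf-8"))

# Round-trip check with an in-memory blob standing in for a GCS object.
payload = {"user": "alice", "events": [1, 2, 3]}
blob = gzip.compress(json.dumps(payload).encode("utf-8"))
decoded = decode_gzipped_json(blob)
```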

detect encoding of xlsx content in ruby

I have an app which allows uploading spreadsheets in xls, xlsx and csv format. The data is later used at various client facing places. The people managing the data use various tools to create the spreadsheets, including mac/excel, win/excel, win/openoffice, linux/libreoffice...
The real problem is the mac/excel encoding, which creates some nasty looking strings. Is there any way to make sure the file content's encoding is valid utf-8?
My approach of simply checking File.read(file.path).valid_encoding? works only for csv...
I would look into charlock_holmes, a gem which lets you easily detect and even attempt to transcode files based on their encoding.
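charlock_holmes is a Ruby gem, but the validity check underneath the asker's valid_encoding? approach is easy to illustrate; here is a minimal Python sketch of the same idea (the gem's added value is *detecting* an unknown encoding and transcoding, which this does not do). Note that an .xlsx file is a zip archive, so a raw byte-level check like this only makes sense for plain-text formats such as csv, or for strings already extracted by a spreadsheet parser.

```python
def valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ok = valid_utf8("héllo".encode("utf-8"))   # well-formed UTF-8
bad = valid_utf8(b"\xff\xfehello")          # e.g. a UTF-16 BOM: not valid UTF-8
```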

What file types does MarkLogic support?

I'm a student and I want to find a search engine for big data. I found MarkLogic Server but I don't know what file types it supports. Does it support doc, docx, pdf, xml, ppt, pptx, etc.? What other types are supported?
At a low level, MarkLogic supports storage of XML, plain text, and binary. XML is fully searchable, including range indexes for faceted search. Text is full-text searchable only. Binary is not searchable as is, but there are facilities to extract metadata and text from many binary formats. You can find more details about the latter in the online documentation:
http://docs.marklogic.com/guide/search-dev/binary-document-metadata#chapter
There is a sample application that shows this functionality:
http://developer.marklogic.com/code/document-discovery
HTH!

reading file metadata using lua

I wonder if there is a better library that would allow reading file metadata.
So far I have tried LuaFileSystem and LuaCOM (Scripting.FileSystemObject), but neither was able to extract all the data. By "all the data" I mean, beyond the usual standard fields such as date accessed, date created, and date modified, format-specific data: for a PDF, things like author and title; for an image, things like bit depth and resolution.
You seem to be missing the difference between filesystem metadata and document metadata. Filesystem metadata is the metadata the filesystem stores about a file. Every file has this stuff, because every file is stored on the filesystem. This metadata is not actually stored within the file; if you loaded the file, that wouldn't give you access to the filesystem metadata. You have to talk to the filesystem to get it.
Document metadata is some bit of information within the file that serves as metadata. To get this data you have to read the file, know what the file's format is, and parse that metadata out.
I don't know of any library, Lua or otherwise, that is designed to extract arbitrary metadata from arbitrary file types.
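The distinction the answer draws can be made concrete. The sketch below (in Python rather than Lua, purely for illustration) gets filesystem metadata from the filesystem via os.stat without interpreting the file's contents, whereas document metadata requires reading and parsing the bytes in a format-specific way; a real PDF author field would need a PDF parser, so the "parse" step here is reduced to checking the magic bytes.

```python
import os
import tempfile

# Filesystem metadata: supplied by the filesystem, not the file's contents.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"%PDF-1.4 ...")   # stand-in file body
    path = f.name

st = os.stat(path)
fs_meta = {"size": st.st_size, "modified": st.st_mtime}

# Document metadata: lives *inside* the file, so the bytes must be read and
# interpreted according to the file format. Here we only sniff the PDF magic.
with open(path, "rb") as f:
    is_pdf = f.read(5) == b"%PDF-"

os.unlink(path)
```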
