Why does the Tika 2.1 app ignore text in a .txt file?

I am processing a file using Tika 2.1 from the command line under Ubuntu 20.04 using the following command:
java -jar tika-app-2.1.0.jar -t test.txt
The file is a pure ANSI text file (every byte is in the range 0x00 through 0x7F). As hard as this is to believe, the Tika 2.1 app is dropping text whenever a specific string is present in the file.
Here is the text file:
From:
Sent:
text
this is a test
testing
next
last
And here is the output:
this is a test
testing
next
last
To show that this is a pure ANSI text file with no formatting, no multi-byte Unicode sequences, etc., here is the output of the 'od' command:
0000000 7246 6d6f 0d3a 530a 6e65 3a74 0a0d 6574
0000020 7478 0a0d 0a0d 6874 7369 6920 2073 2061
0000040 6574 7473 0a0d 6574 7473 6e69 2067 0a0d
0000060 0a0d 656e 7478 0a0d 616c 7473
However, if I simply change "Sent:" to "sent:", the output is:
From:
sent:
text
this is a test
testing
next
last
I've been troubleshooting this issue and cannot see a pattern. If I append "Sent:" to the first line, making the file:
From: Sent:
Sent:
text
this is a test
testing
next
last
The results are:
this is a test
testing
next
last
But if I alter "Sent:" to "\Sent:" on the second line, I get this output:
From: Sent:
\Sent:
text
this is a test
testing
next
last
And this file:
From: Sent:
Sent:
Sent:
text
this is a test
testing
next
last
Results in this output:
this is a test
testing
next
last
But if I place "Sent:" in the first line only, or put a simple CRLF (0d 0a) as the first two bytes, the output is fine. Why does the start of the second line seem to matter? Why do uppercase or lowercase variants work, but not "Sent:"? Why does preceding "Sent" with a "\" make it work? I've also tried this on different machines - one running Ubuntu 18.04 and one running the jar on a Windows 10 system - both with the same results.
What is going on with Tika's handling of a very simple text file? I have not altered the jar in any way; this is the jar file as downloaded from the Apache Tika site. What am I missing?
Any information is very much appreciated.

Tika is interpreting the text as an email. This example was itself a text extraction of an email, and it contains certain keywords ("From:" and "Sent:") in exactly the positions where mail headers appear. That is why, when other characters are added at the start of the file, Tika falls back to treating it as a plain text file.
I had thought that the order of interpretation was based first on the ".txt" extension and then on analysis of the content (which in this case has no accompanying metadata). That does not appear to be the case: content analysis comes first, before the ".txt" extension is considered.
The example was being run through Tika running as a server. Going forward I will use the Tika API and follow the suggestions provided by the commenter (@Gagravarr): skip the call to AutoDetectParser, set the content-type property on the metadata, and call DefaultParser, all via the API.
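For reference, the same idea works when talking to tika-server over HTTP: a caller-supplied Content-Type is used instead of running detection. A minimal sketch in Python (assuming a local tika-server on the default port 9998 and the test file from the question):
import requests

# Supplying an explicit Content-Type asks tika-server to skip type
# detection and route the bytes straight to the plain-text parser.
with open('test.txt', 'rb') as f:
    resp = requests.put('http://localhost:9998/tika',
                        data=f,
                        headers={'Content-Type': 'text/plain',
                                 'Accept': 'text/plain'})
print(resp.text)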
Thanks to @Gagravarr for finding a solution.

Related

Phred qual error after fastq trimming with Cutadapt

I would like to trim the beginning of all the reads in a fastq file by a given length before mapping to the genome with bowtie2. I have used Cutadapt:
cutadapt -u 48 -o output.fastq.gz input.fastq.gz
My fastq file after trimming looks like this:
gunzip -c output.fastq.gz | head
@NB502143:99:HFF7TAFX2:1:11101:4133:1019 1:N:0:ATCACG
CATGAAAAAGAGCTCATTTTCAGATGCAGGAATTCCTATCCG
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502143:99:HFF7TAFX2:1:11101:19790:1020 1:N:0:ATCACG
CATGATCCACTTTTCCACGCGCTTTGACGACCATTTTATAA
+
EEEEE<EEEEEEEEEEEEEEEEE<EE/EEAEEEEEEEEEEE
@NB502143:99:HFF7TAFX2:1:11101:6327:1020 1:N:0:ATCACG
CATGATCTCAGTAAAGGCATTTGTGGTTGTTAAGTAGCCATT
When I try to map it with bowtie2, I get the following error message:
Saw ASCII character 10 but expected 33-based Phred qual.
I don't get this error if I map input.fastq.gz, so I suspect something wrong is happening during the trimming but I can't figure out what!
I checked both files with FastQC and they're both Sanger / Illumina 1.9 encoded.
Thanks for your help.
I have been having a similar issue: the error occurs when I trim with cutadapt, but not when I trim with another tool, fastp.
Checking the integrity of the trimmed fastq files showed that some reads had no bases left. That is the likely culprit here: cutadapt -u 48 leaves any read of 48 bp or shorter with an empty sequence and quality line, and bowtie2 chokes on the empty record. A tool like fastq_info from the fastq_utils package can verify this.
If that is the issue, add the -m <minimum-length> flag when running cutadapt; it removes reads that fall below the designated length after trimming, and alignment should then work. A quick check for empty reads is also sketched below.
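For a quick look without extra tools, a minimal check in Python (assuming the trimmed file is named output.fastq.gz, as in the question):
import gzip

# Count zero-length reads in a trimmed, gzipped FASTQ file.
empty = total = 0
with gzip.open('output.fastq.gz', 'rt') as fq:
    while True:
        header = fq.readline()
        if not header:
            break                      # end of file
        seq = fq.readline().rstrip('\n')
        fq.readline()                  # '+' separator line
        fq.readline()                  # quality line
        total += 1
        if not seq:
            empty += 1
print(f'{empty} empty reads out of {total}')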

Is there any way to give an input file to Stanza (Stanford CoreNLP client) rather than one piece of text while calling the server?

I have a .csv file containing the IMDb sentiment-analysis dataset. Each instance is a paragraph. I am using Stanza (https://stanfordnlp.github.io/stanza/client_usage.html) to get a parse tree for each instance.
text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
with CoreNLPClient(
annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse','coref'],
timeout=30000,
memory='16G') as client:
ann = client.annotate(text)
Right now, I have to re-run the server for every instance, and it is taking a lot of time since I have 50k instances.
Starting server with command: java -Xmx16G -cp /home/wahab/treeattention/stanford-corenlp-4.0.0/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 1200000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-a74576b3341f4cac.props -preload parse
Starting server with command: java -Xmx16G -cp /home/wahab/treeattention/stanford-corenlp-4.0.0/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 1200000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-d09e0e04e2534ae6.props -preload parse
Is there any way to pass a file or do batching?
You should only start the server once. It is easiest to load the file in Python, extract each paragraph, and submit the paragraphs one at a time: pass each paragraph from your IMDb set to the annotate() method, and the server will handle sentence splitting; a sketch follows.
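Assuming the CSV has a column named review holding one paragraph per row (both names are placeholders; adjust to the actual layout):
import csv
from stanza.server import CoreNLPClient

# Start the server once; it stays up for all 50k instances.
with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
        timeout=30000,
        memory='16G') as client:
    with open('imdb.csv', newline='') as f:
        for row in csv.DictReader(f):
            ann = client.annotate(row['review'])  # one paragraph per call
            for sentence in ann.sentence:         # the server splits sentences
                print(sentence.parseTree)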

GNUCobol compiled program counts one more record than expected

I'm learning COBOL and using GNUCobol (on Linux) to compile and test some simple programs. In one of those programs I have found an unexpected behavior that I don't understand: when reading a sequential file of records, I always get one extra record, and when writing the records to a report, the last record is duplicated.
I have made a very simple program to reproduce this behavior. In this case, I have a text file with a single line of text, "0123456789". The program should count the characters in the file (read as one-character records), and I expect it to display "10" as a result, but instead I get "11".
Also, when displaying the records, as they are read, I get the following output:
0
1
2
3
4
5
6
7
8
9
11
(There are two blank spaces between 9 and 11).
This is the relevant part of this program:
FD  SIMPLE.
01  SIMPLE-RECORD.
    05 SMP-NUMBER PIC 9(1).
[...]
PROCEDURE DIVISION.
000-COUNT-RECORDS.
    OPEN INPUT SIMPLE.
    PERFORM UNTIL SIMPLE-EOF
        READ SIMPLE
            AT END
                SET SIMPLE-EOF TO TRUE
            NOT AT END
                DISPLAY SMP-NUMBER
                ADD 1 TO RECORD-COUNT
        END-READ
    END-PERFORM
    DISPLAY RECORD-COUNT.
    CLOSE SIMPLE.
    STOP RUN.
I'm using the default compiler options, and I have tried 'WITH TEST {BEFORE|AFTER}', but the result is the same. What can be the cause of this behavior, and how can I get the expected result?
Edit: I tried using an "empty" file as the data source, expecting a record count of 0, and emptied the file in two different ways:
$ echo "" > SIMPLE
This way the record count is 1 (ls -l reports a size of 1 byte for the file: the newline that echo emits).
$ rm SIMPLE
$ touch SIMPLE
This way the record count is 0 (ls -l reports a size of 0 bytes for the file). So I guess that somehow the compiled program is detecting an extra character, but I don't know how to avoid this.
I found out that the cause of this behavior is the newline character that vim automatically appends when saving the data file.
After disabling this in vim with
:set binary
:set noeol
the program works as expected.
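To see the culprit directly, one can dump the raw bytes of the data file. A minimal check in Python (assuming the data file is named SIMPLE, as above):
# Show the raw bytes so an editor-appended newline becomes visible.
with open('SIMPLE', 'rb') as f:
    data = f.read()
print(data)       # b'0123456789\n' - vim's appended newline is the extra byte
print(len(data))  # 11 rather than the expected 10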
Edit: A more elegant way to prevent this problem when working with data files created in a text editor is to use ORGANIZATION IS LINE SEQUENTIAL in the SELECT clause.
Since the problem was caused by the data format, should I delete this question?

Send data over TCP/IP with Netcat or Rails

I have an IP of the server and a port on which I'm able to connect via nc on Ubuntu 14.04.
> nc x.x.x.x PORT
In order to communicate with the server, the first step is to send a WAKEUP call and get an acknowledgment. The server expects a 3-byte ID in the wakeup call. An example provided in the documentation shows the success scenario of sending the ID and receiving the ack with a client tool, i.e.:
The client sends:
<sy><sy><eq>111<et>
And the server responds with:
<sy><ak>A<et><cr>
Here is some detail on <sy>:
Within <> brackets is a non-printable ASCII character (<sy> = ASCII 22, or hex 0x16).
I tried to replicate the exact same scenario but failed. The server doesn't respond to the data I send, although the data is received there. I'm not sure about these tags (<sy><sy><eq>, etc.). How do I send the ID (111) along with these tags correctly?
I also tried to send this data using the Rails framework and the BinData Ruby gem, but I don't know how to represent the above format.
netcat is probably the wrong tool for this. Or at least you will want to use some other program to feed it input.
If I were doing this, I would code up something in Python or C that would both connect to the server and feed it whatever data I needed to send (and receive and interpret the responses), leaving out nc altogether. There are many examples on the web.
You can encode the control characters in a byte string in Python with the syntax b'\x16' for your <sy> character. Most other languages have an equivalent capability.
I can't be sure exactly what those characters are. It seems likely they are standard ASCII control characters, but they aren't using the standard abbreviations (see http://www.theasciicode.com.ar/ for example), so presumably the documentation you are looking at has a list of the corresponding values. Assuming for the sake of example that <eq> corresponds to the ASCII ENQ character and <et> to the ASCII EOT (and given you already know that <sy> is equivalent to ASCII SYN), your desired string <sy><sy><eq>111<et> can be encoded in the Python byte string b'\x16\x16\x05111\x04'
(or, equivalently, b'\x16\x16\x05\x31\x31\x31\x04' if you like regularity: the 1 characters are simply ASCII digits, so each 1 can be replaced with its hex escape \x31).
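A minimal sketch of that approach (the host and port are placeholders, and the ENQ/EOT values are the guesses above; check them against the protocol documentation):
import socket

HOST = 'x.x.x.x'  # placeholder: the server's IP
PORT = 9999       # placeholder: the server's port

# <sy><sy><eq>111<et>, assuming <eq> = ENQ (0x05) and <et> = EOT (0x04)
wakeup = b'\x16\x16\x05111\x04'

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    sock.sendall(wakeup)
    reply = sock.recv(1024)  # expect <sy><ak>A<et><cr>, e.g. b'\x16\x06A\x04\r' if <ak> is ACK
    print(reply)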
To return to nc: trying to type the control characters into nc's input from a terminal window is, while possible in most cases, very difficult and error-prone. You will need to know the equivalent control-character mapping (for example, 0x16 is Ctrl-V) and how to get the terminal to accept the literal character (coincidentally, in Linux you have to precede most control characters with a Ctrl-V in order to enter them as input and avoid having them interpreted in the usual way: Ctrl-D == EOF, Ctrl-C == Interrupt, Ctrl-W == Delete-Previous-Word, etc.).
So if you wanted to enter the data above into nc's input from the command line, you would need to type these characters:
Ctrl-V Ctrl-V <sy> / SYN
Ctrl-V Ctrl-V <sy> / SYN
Ctrl-V Ctrl-E <eq> / ENQ
1
1
1
Ctrl-V Ctrl-D <et> / EOT
But it is also important to note that ordinarily nc will not send anything until you enter a newline (i.e., press the Return key). That newline character will then also be sent to the server, which might not be what you want.

ed - quoting control characters?

How can I search for control characters in unix ed(1)?
For example
ed somefile.log <<EOF
1,$s/.*\015//
w
q
EOF
doesn't work. Neither does \r. Obviously sed(1), awk(1), and other editors can do this; however, ed has the very useful line-move (m) command, which is all I need within the bash script I am using.
I am able to accomplish what I want within the script by entering the control character directly (escaping it with C-v in vi or C-q in emacs, for example), but this means that binary characters must be present in my otherwise printable text script.
ed Transport2SVN-W0177.log <<EOF
g/^M/s/.*^M//p
w
q
EOF
The ^M is actually character 0x0d.
ed doesn't provide any escape notation for entering control characters.
The way you have found of inserting control-characters directly into the script (using Ctrl-V at the keyboard) is portable and it works.
It's possible that particular implementations of ed might support this, but it would not be portable.
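If the surrounding script does not have to stay pure shell, a driver in another language keeps the source printable, because the escape sequence is resolved before ed ever sees it. A sketch in Python (the file name is taken from the question):
import subprocess

# Python turns \r into a literal 0x0d before the script reaches ed,
# so this source file contains no raw control characters.
ed_script = 'g/\r/s/.*\r//\nw\nq\n'
subprocess.run(['ed', '-s', 'somefile.log'],
               input=ed_script, text=True, check=True)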
