TDB2Factory connecting dataset - jena

I am trying to use TDB2Factory to connect to a dataset (in Turtle syntax) at a URL. But this command:
Dataset ds = TDB2Factory.connectDataset(loc1);
doesn't load any data. Is there a problem with the syntax, or is there another way to connect to a dataset with Apache Jena?

TDB2Factory.connectDataset connects to a database on the local filesystem. It does not read Turtle files, but you can load data from a file using RDFDataMgr; that data is then written to the database on the local filesystem.
Alternatively, you can load the database first with the tdb2.tdbloader command-line tool.
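For the RDFDataMgr route, here is a minimal sketch; the database directory "tdb2-db" and the file name "data.ttl" are placeholders:

import org.apache.jena.query.Dataset;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class LoadTurtle {
    public static void main(String[] args) {
        // connect to (or create) a TDB2 database on the local filesystem
        Dataset ds = TDB2Factory.connectDataset("tdb2-db");
        // read the Turtle file into the default graph inside a write transaction
        Txn.executeWrite(ds, () -> RDFDataMgr.read(ds.getDefaultModel(), "data.ttl"));
        // verify the data is there inside a read transaction
        Txn.executeRead(ds, () -> System.out.println("Triples loaded: " + ds.getDefaultModel().size()));
    }
}

Once the data has been loaded this way (or with tdb2.tdbloader), later runs can simply call TDB2Factory.connectDataset on the same directory and the data will already be there.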

Related

Is there any way to load dump files in Neo4j Desktop 5.1.0 in Windows 11?

I have tried all possible ways to load the pole.dump file into Neo4j.
I have been doing the following for the past 3 days now:
Opened Neo4j Desktop and, using the Add drop-down menu, added pole.dump to Neo4j Desktop.
Then I selected Import dump into existing DBMS -> which is my Graph3 database.
Then I go to Neo4j Desktop and, from the Database Information panel, select the pole database, but I get this error:
Database "pole" is unavailable, its status is "offline".
I also tried this: https://community.neo4j.com/t5/graphacademy-discussions/cannot-create-new-database-from-dump-file/td-p/39914
i. Database-->Open Folder-->DBMS. Here you will see the data/dumps folder.
ii. Copy the pole.dump file to the data/dumps folder. (Although there is no folder called dumps in the data folder.)
iii. Close the browser. Click on ... and select Terminal.
iv. Terminal window shows up. Enter this command:
bin/neo4j-admin load --from=data/dumps/pole.dump --database=pole --force
v. If successful, close the Terminal window and open the db in browser.
vi. Click on the database icon on top left to see the databases from the dropdown box.
Here you will not see pole db.
vii. Select 'system' database. On the right pane run this Cypher:
CREATE DATABASE pole and execute the Cypher.
viii. Run SHOW DATABASES and you should see pole and check the status. Status should be 'online'.
ix. Select pole from the dropdown list. Once selected you should see all the nodes,
relationships on the left. Now you can start playing with it!!
But I could not get past point iv: when I open the terminal from Neo4j Desktop, it says it could not load; in fact it reports a parsing error.
I did check with the following:
C:\Users\Chirantan\.Neo4jDesktop\relate-data\dbmss\dbms-11aabb23-daca-4d35-9043-6c039d133a34\bin>neo4j-import Graph3 load --from=data/dumps/pole.dump
'neo4j-import' is not recognized as an internal or external command,
operable program or batch file.
I am coming to this platform because I have tried everything available:
https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/
'neo4j-admin' is not recognized as an internal or external command, operable program or batch file
https://www.youtube.com/watch?v=HPwPh5FUvAk
But I could not get it to work.
After step 3, which is:
Then I go to Neo4j Desktop and, from the Database Information panel, select the pole database, but I get this error:
Did you try to start the database with the following command?
START DATABASE pole;
I have already solved the problem. The issue persisted even after I followed the steps provided in the OP. What I did was: I created random text for all records for the names of criminals/victims, friends of victims, and friends of criminals, generated random phone numbers, generated random NHS numbers, and also generated random addresses using:
https://fossbytes.com/tools/random-name-generator
https://www.randomlists.com/london-addresses?qty=699
Using this code I generated random NHS IDs:
import string
import random

# initializing size of string
N = 7
list_str = []
for i in range(699):
    # generating random strings using random.choices()
    res = ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))
    list_str.append(res)
Random phone numbers were generated using:
https://fakenumber.in/united-kingdom
There is a better answer.
Go to this URL: https://neo4j.com/sandbox/
Then select one of the pre-installed databases that come with the sandbox; Crime Investigation is one of them, with the POLE database pre-installed.
You will be prompted to open it from there with the POLE database pre-installed.
Finally, open the Neo4j Browser from there using the drop-down menu next to the Open button, and voilà! You can access the POLE database using Neo4j.

CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir:

I am running CatBoost on a Databricks cluster. The Databricks production cluster is very secure and we cannot create new directories on the fly as users, but we can have pre-created directories. I am passing the below parameter to my CatBoostClassifier.
CatBoostClassifier(train_dir='dbfs/FileStore/files/')
It does not work and throws the below error.
CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir
You're missing the / character at the beginning; it should be '/dbfs/FileStore/files/' instead.
Also, writing to DBFS could be slow, and may fail if CatBoost uses random writes (see the DBFS limitations). You may instead point to a local directory on the node, like /tmp/...., and then use dbutils.fs.cp("file:///tmp/....", "/FileStore/files/catboost/...", True) to copy the files from the local directory to DBFS.
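For example, a minimal sketch of that approach; the /tmp/catboost_train directory and the DBFS target path are placeholders, and dbutils is the utility object available inside Databricks notebooks:

from catboost import CatBoostClassifier

# tiny toy dataset just to make the example runnable
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [1, 0, 1, 0]

# point the training working directory at the node's local disk instead of DBFS
model = CatBoostClassifier(iterations=10, verbose=False, train_dir='/tmp/catboost_train')
model.fit(X, y)

# afterwards, copy the training artifacts from local disk to DBFS (Databricks notebooks only)
dbutils.fs.cp("file:///tmp/catboost_train", "dbfs:/FileStore/files/catboost_train", True)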

How to predownload a transformers model

I want to perform a text generation task in a Flask app and host it on a web server. However, when downloading the GPT model, the Elastic Beanstalk-managed EC2 instance crashes because the download takes too much time and memory.
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
These are the lines in question causing the issue. GPT is approximately 445 MB. I am using the transformers library. Instead of downloading the model at this line, I was wondering if I could pickle the model and then bundle it as part of the repository. Is that possible with this library? Otherwise, how can I preload this model to avoid the issues I am having?
Approach 1:
Search for the model here: https://huggingface.co/models
Download the model from one of these links:
pytorch-model: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-pytorch_model.bin
tensorflow-model: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5
The config file: https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-config.json
Source: https://huggingface.co/transformers/_modules/transformers/configuration_openai.html#OpenAIGPTConfig
You can manually download the model (in your case the TensorFlow model .h5 and the config.json file) and put it in a folder (let's say model) in the repository. (You can try compressing the model and then decompressing it once it's on the EC2 instance if needed.)
Then, you can load the model in your web server directly from that path instead of downloading it (the model folder contains the .h5 and config.json):
model = TFOpenAIGPTLMHeadModel.from_pretrained("model")
# model folder contains .h5 and config.json
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
# this is a light download
Approach 2:
Instead of using the links to download, you can download the model on your local machine using the conventional method.
from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
This downloads the model. Now you can save the weights to a folder using the save_pretrained function.
model.save_pretrained('/content/') # saving inside content folder
Now, the content folder should contain a .h5 file and a config.json.
Just upload them to the repository and load the model from that folder.
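For what it's worth, the tokenizer files can be saved the same way so that nothing needs to be downloaded at serving time. A sketch using the same (older) import paths as in the question, with "model" as a placeholder folder name:

from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.modeling_tf_openai import TFOpenAIGPTLMHeadModel

# one-time, on a machine with internet access: download and save locally
model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model.save_pretrained("model")      # writes tf_model.h5 and config.json
tokenizer.save_pretrained("model")  # writes the vocabulary and merges files

# on the web server: load everything from the bundled folder, no download
model = TFOpenAIGPTLMHeadModel.from_pretrained("model")
tokenizer = OpenAIGPTTokenizer.from_pretrained("model")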
Open https://huggingface.co/models and search for the model you want. Click on the model name and finally click on "List all files in model". You will get a list of the files you can download.

How to compute CSV files stored on each worker in parallel, without using HDFS?

The concept is the same as data locality in Hadoop, but I don't want to use HDFS.
I have 3 dask workers.
I want to compute a big CSV file, for example mydata.csv.
I split mydata.csv into small files (mydata_part_001.csv ... mydata_part_100.csv) and store them in the local folder /data on each worker,
e.g.
worker-01 stores mydata_part_001.csv - mydata_part_030.csv in the local folder /data
worker-02 stores mydata_part_031.csv - mydata_part_060.csv in the local folder /data
worker-03 stores mydata_part_061.csv - mydata_part_100.csv in the local folder /data
How can I use Dask to compute over mydata?
Thanks.
It is more common to use some sort of globally accessible file system. HDFS is one example of this, but several other network file systems exist. I recommend looking into these instead of managing your data yourself in this way.
However, if you want to do things this way then you are probably looking for Dask's worker resources, which allow you to target specific tasks to specific machines.
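As an illustration, here is a minimal sketch that pins each read to the machine holding the files, using the workers= argument of Client.submit rather than named worker resources; the scheduler and worker addresses and the file paths are placeholders:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# which worker holds which part files (placeholder worker addresses)
parts_by_worker = {
    "tcp://worker-01:40000": range(1, 31),
    "tcp://worker-02:40000": range(31, 61),
    "tcp://worker-03:40000": range(61, 101),
}

futures = []
for worker, part_numbers in parts_by_worker.items():
    for i in part_numbers:
        path = f"/data/mydata_part_{i:03d}.csv"
        # run the read task only on the worker that has this file on local disk
        futures.append(client.submit(pd.read_csv, path, workers=[worker]))

# assemble a single dask dataframe from the per-part futures
df = dd.from_delayed(futures)
print(len(df))  # triggers a distributed computation over all parts

Note that the pinning only applies to the initial reads; subsequent operations on df may still move data between workers as needed.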

SSIS - Use file name parameter in SQL Lookup Command (JET OLEDB)

Can I parameterise the SqlCommand in a Lookup Transformation when using the Jet engine against a CSV file? Is there another way to work with CSVs and Lookups?
I have a JET OLEDB connection that uses an expression to get the folder location from a variable as follows:
"Data Source=" + #[User::SourceRoot] + ";Provider=Microsoft.Jet.OLEDB.4.0;Extended Properties=\"text;HDR=Yes;FMT=Delimited(,)\";"
Then in my SSIS Lookup Transformation I have the following SqlCommand:
SELECT * FROM Users.csv
This works fine; however, I don't want to hard-code "Users.csv". Is there a way to configure this? I've tried setting partial cache, but haven't had any luck with the Advanced screen "Custom query" or with a '?' parameter in the query. (I'm using SQL Server 2012.)
I would create a Data Flow Task that uses a Flat File Connection Manager to read the CSV and load it into a Cache Transform. Then you can use the cache file in the Lookup Transformation.
