I am looking for a database on Quandl that contains 13F/13G filings but can't find any.
Maybe I am not using the right keywords?
Any suggestions on where to find a curated dataset? I don't want to end up scraping EDGAR again.
Cheers!
I've been downloading these free curated data sets of Form 13F available as CSV, XLS, and JSON. They're formatted to be immediately analyzable.
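For example, once you have one of those CSVs, a few lines of pandas get you to analysis. This is only a sketch: the file name and the column names ("filer_name", "value_usd") are hypothetical and vary by provider.

```python
# Sketch: load a quarterly 13F holdings CSV and rank filers by reported value.
# "13f_2023q4.csv" and its column names are hypothetical; adjust to your source.
import pandas as pd

holdings = pd.read_csv("13f_2023q4.csv")

# Total reported value per filer, top ten.
top_filers = (
    holdings.groupby("filer_name")["value_usd"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top_filers)
```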
Related
I am dealing with tons of PDF documents (petitions data) filled with text, numbers, tabular data, etc. The client's objective is to summarize any such document, to reduce the manual effort of reading the entire thing. I have tried conventional methods like LSA, the Gensim summarizer, the BERT extractive summarizer, and PySummarizer.
The results are not at all good. Please suggest a way to find an industry-grade summarizer (extractive/abstractive) that would give me a good start on solving this issue.
First, you will need to know exactly what data the company wants abstracted from the documents. After that, you may be able to convert the documents to raw text using OCR or some other PDF text-extraction tool, and then extract the data you need. If the company isn't being clear about how they want the data summarized, that would be something to talk to them about. It might be as simple as setting a title for the document, or classifying it. If it's document classification, I can help you with that; I made a repo for that purpose a little while ago.
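If classification does turn out to be the goal, here is a rough Python sketch of that pipeline: extract text with pypdf (this assumes the PDFs contain selectable text; scanned documents would need OCR first, e.g. Tesseract), then train a simple TF-IDF classifier with scikit-learn. The file paths and category labels are placeholders.

```python
# Sketch: PDF -> raw text -> simple document classifier.
# Assumes text-based PDFs; scanned images need OCR (e.g. pytesseract) instead.
from pypdf import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pdf_to_text(path):
    """Concatenate the extracted text of every page."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical labelled examples: (file path, category).
labelled = [
    ("petitions/doc1.pdf", "eviction"),
    ("petitions/doc2.pdf", "custody"),
    # ... more labelled documents ...
]

texts = [pdf_to_text(path) for path, _ in labelled]
labels = [label for _, label in labelled]

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Classify a new, unseen document.
print(clf.predict([pdf_to_text("petitions/new_doc.pdf")]))
```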
I'm still getting into Google Sheets. I recently figured out how to format a .txt file so that =ImportData can read it properly, thanks to Tanaike's assistance, and am now tackling a (slightly) more challenging task.
Goal:
Automatically extracting specific data from .pdf files hosted in a Google Drive folder and arranging the information into specific cells
Challenges:
Being able to decode the blobs of information, since the raw data obtained with =ImportData is useless on its own
Truly learning how to use google-apps-script for something useful (that part is on me)
Triggering a single, one-off extraction of the information rather than the constant live refresh of =ImportData
[Second Priority] No longer depending on an add-on (Drive Direct Links) to get the URLs of the files
To my understanding, I'll need to do some parsing. I know .pdf is not always straightforward, but all the files will come from the same place and have the exact same format, so understanding how to do it once should be enough.
I already know how to get the real/permanent link to the files automatically, and how to arrange information into separate cells using =Index, =Extract and others.
Hope I'm being clear enough. Thanks a lot in advance.
Best regards,
Lucas.-
So, I have the option of sending a document from a database to print as either PDF or XPS. I need to be able to extract specific data, such as name, date, etc., from one of those formats and insert that data into a Word template. The Word template is not editable; you can only type within fields, and each field has a heading before it, such as name, DOB, etc.
Basically, I need to automate transferring that information from the PDF or XPS file into the Word template.
I'm familiar enough with C++, Python and Java, so I have no language preference -- whatever gets the job done.
Could you suggest a way I can accomplish this? I'm having a bit of difficulty figuring out how to parse/extract data from one of those file types, and which file type would be the better candidate. And I definitely have no idea how to automate populating the fields in the Word template.
Oh, and I forgot to mention: this is on Windows 7 (and maybe 8, but mostly 7) machines.
Thanks a lot for your help in advance!
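For what it's worth, here is a rough Python sketch of one possible pipeline: pull the text out of the PDF with pypdf, grab the labelled values with regular expressions, and write them into the Word document with python-docx. It assumes the PDF has selectable text, that labels like "Name:" precede each value, and that the template marks its fields with plain-text placeholders such as {NAME}; true form fields, or XPS input, would need different tooling.

```python
# Sketch: extract labelled values from a text-based PDF and fill a Word template.
# "Name:" / "DOB:" labels and the {NAME}-style placeholders are hypothetical.
import re
from pypdf import PdfReader
from docx import Document

reader = PdfReader("source.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Pull "Label: value" pairs out of the raw text.
# Assumes the labels are always present; add error handling for real use.
fields = {
    "NAME": re.search(r"Name:\s*(.+)", text).group(1).strip(),
    "DOB": re.search(r"DOB:\s*(.+)", text).group(1).strip(),
}

doc = Document("template.docx")
for paragraph in doc.paragraphs:
    for key, value in fields.items():
        placeholder = "{%s}" % key
        if placeholder in paragraph.text:
            # Naive replacement; loses per-run formatting within the paragraph.
            paragraph.text = paragraph.text.replace(placeholder, value)
doc.save("filled.docx")
```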
This is for anyone who has the same sort of question; here is how I did it:
I used PDFBox (http://pdfbox.apache.org/) to parse the document and extract the needed data, and then docx4j (http://www.docx4java.org/trac/docx4j) to insert the data into the Word template. Both are incredible tools and have excellent communities that help out almost immediately.
I need to import data into my app. Right now I do it via XLS spreadsheets, but in my case that means about 80,000 rows and it is slow, so maybe it would be better to choose another format? For example, would XML data import faster?
XML is unlikely to be any faster - it still needs to be parsed as strings and converted.
80,000 rows is quite a lot. How long does it take you?
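The format comparison is easy to sanity-check on its own. Here is a small benchmark sketch, in Python for brevity (your stack appears to be Ruby, per ruby-prof below, but the same experiment translates directly), that builds the same 80,000 rows as CSV and as XML in memory and times parsing each; the field names and row contents are made up.

```python
# Quick sanity check: serialize the same 80,000 rows as CSV and XML,
# then time parsing each representation.
import csv
import io
import time
import xml.etree.ElementTree as ET

rows = [{"id": str(i), "name": "item%d" % i, "qty": str(i % 100)}
        for i in range(80_000)]

# Serialize to CSV.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "name", "qty"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buf.getvalue()

# Serialize the same rows to XML (one element per row, values as attributes).
root = ET.Element("rows")
for r in rows:
    ET.SubElement(root, "row", r)
xml_text = ET.tostring(root, encoding="unicode")

t0 = time.perf_counter()
parsed_csv = list(csv.DictReader(io.StringIO(csv_text)))
t1 = time.perf_counter()
parsed_xml = [el.attrib for el in ET.fromstring(xml_text)]
t2 = time.perf_counter()

print("CSV parse: %.3fs, XML parse: %.3fs" % (t1 - t0, t2 - t1))
```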
Edit:
You can make what's happening more visible by dropping puts statements into your code, with timestamps. It's crude, but you can then time between various parts of your code to see which part takes the longest.
Or better yet, have a go at using ruby-prof to profile your code and see where it is spending the most time.
Either way, getting a more detailed picture of the slow points is a Good Idea.
You may find there's just one or two bottlenecks that can be easily fixed.
Hello Oracles of StackOverflow,
This is the first time I've managed to ask a question on Stack Overflow, so feel free to throw your cabbages at me (or correct the way I should be asking my question).
I have this problem: I'm using HDF5 to store massive quantities of cookie information.
My data is structured in the following way:
CookieID -> Event -> key-value pair
There are multiple events for each CookieID, but only one key-value pair per event.
I'd like to know the best way to store this in HDF5.
Currently, I'm storing each cookie as a separate table within a group in the HDF5 file, using the CookieID as the name of the table. Unfortunately for me, with 10,000,000 cookies, HDF5 (or specifically PyTables) doesn't approve of this type of storage.
Specifically, it throws this error:
/CookieData is exceeding the recommended maximum number of children (16384)
I'm wondering if you could recommend the best way of storing this information.
Should I create a flat table? Should I keep this method? Is there something else I can do?
Help is appreciated. Thanks for reading.
Several hours of research later, I've discovered that what I was attempting to do was categorically impossible.
The following link gives details on why HDF5 cannot handle variable-length nested children.
I've decided to go with a flat file for the time being and hope that it is more efficient than a database store. The downside of a flat layout is that I have to replicate values (such as the CookieID on every row) that otherwise would not need to exist.
If anyone else has any better ideas, they would be appreciated.
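For anyone weighing the flat-table option raised in the question: here is a minimal PyTables sketch (column names and widths are guesses; size them to the real data). One table holds one row per event, which stays far under the children limit at the cost of repeating the CookieID on every row, and an index on that column keeps per-cookie lookups fast.

```python
# Sketch: one flat PyTables table, one row per (cookie, event) pair.
# Column widths below are hypothetical.
import tables as tb

class CookieEvent(tb.IsDescription):
    cookie_id = tb.StringCol(32)   # repeated for every event of the cookie
    event     = tb.StringCol(32)
    key       = tb.StringCol(64)
    value     = tb.StringCol(128)

with tb.open_file("cookies.h5", mode="w") as h5:
    table = h5.create_table("/", "cookie_events", CookieEvent,
                            "one row per event")
    row = table.row
    # Stand-in data; in practice, stream rows in from the real source.
    for cid, event, k, v in [(b"abc123", b"click", b"page", b"/home")]:
        row["cookie_id"] = cid
        row["event"] = event
        row["key"] = k
        row["value"] = v
        row.append()
    table.flush()

    # Index the cookie column so per-cookie queries don't scan every row.
    table.cols.cookie_id.create_csindex()

    # All events for one cookie:
    events = table.read_where('cookie_id == b"abc123"')
    print(events)
```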