I convert a ton of text documents like PDFs to spreadsheets. So every time a new iteration of AI technology arrives, I wonder if it’s capable of doing what so many people ask for: to hand off a PDF, ask for a spreadsheet, and get one back. After throwing a couple of programming problems at OpenAI’s ChatGPT and getting a viable result, I wondered if we were finally there.

Back when OpenAI’s GPT-3 was the hot new thing, I saw Montreal journalist Roberto Rocha attempt a similar test. The results were lackluster, but ChatGPT, OpenAI’s newest model, has several improvements that make it better suited to extraction: It’s 10x larger than GPT-3 and is generally more coherent as a result, it’s been trained to explicitly follow instructions, and it understands programming languages.

To test how well ChatGPT could extract structured data from PDFs, I wrote a Python script (which I’ll share at the end!) to convert two document sets to spreadsheets:

- A 7,000-page PDF of New York data breach notification forms. There were five different forms, bad OCR, and some freeform letters mixed in.
- 1,400 memos from internal police investigations. These were completely unstructured and contained emails and document scans.

Here is the process I followed:

1. Redo the OCR, using the highest quality tools possible. This was critically important because ChatGPT refused to work with poorly OCR’d text.
2. Clean the data as well as I could, maintaining physical layout and removing garbage characters and boilerplate text.
3. Break the documents into individual records.
4. Ask ChatGPT to turn each record into JSON.

I spent about a week familiarizing myself with both datasets and doing all this preprocessing. Once it’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it. But doing this for multiple records is a bad idea because ChatGPT will invent its own schema, using randomly chosen field names from the text. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up.

Prompt design is the most important factor in getting consistent results, and your language choices make a huge difference.
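One way to work around the schema problem described above is to state the field names in the prompt and then coerce whatever comes back into that shape. The sketch below is only an illustration under stated assumptions, not the script used for this project: the field names in `SCHEMA`, the prompt wording, and the helpers `build_prompt` and `normalize_row` are all hypothetical, and the call to the model itself is left out.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the ones used in the original project.
SCHEMA = {
    "company_name": "string",
    "breach_date": "YYYY-MM-DD string or null",
    "address": "single-line string",
}


def build_prompt(record_text: str) -> str:
    """Build a prompt that pins the output to a fixed set of keys."""
    schema_json = json.dumps(SCHEMA, indent=2)
    return (
        "Extract the following fields from the record below and return "
        "only a JSON object with exactly these keys:\n"
        f"{schema_json}\n\n"
        "Use null for any field that is missing.\n\n"
        f"Record:\n{record_text}"
    )


def normalize_row(raw_response: str) -> dict:
    """Parse a model reply and coerce it to the fixed schema.

    Unknown keys are dropped, missing keys become None, and values that
    come back as an object or array (addresses, for example) are
    flattened into a single comma-separated string.
    """
    data = json.loads(raw_response)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    row = {}
    for key in SCHEMA:
        value = data.get(key)
        if isinstance(value, dict):
            value = ", ".join(str(v) for v in value.values() if v)
        elif isinstance(value, list):
            value = ", ".join(str(v) for v in value if v)
        row[key] = value
    return row
```

Calling `build_prompt` once per record keeps every request tied to the same keys, and running each reply through `normalize_row` means every row has the same columns by the time it reaches the spreadsheet, even when the model returns an address as a string in one record and as an object in the next.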