LLMs Fail at Parsing Unstructured Data
LLMs will NEVER - no matter how much compute - be able to automate messy, manual tasks that swim in unstructured data and are riddled with "edge" cases.
Not going to happen - ever. Infrastructure needs to change for true automation to occur. Take SEC.Gov. It is damn near impossible to accurately parse the unstructured data coming off that useless website. Every company's SEC filings look different. We force companies to file, but there is no standardization - and this is where LLMs fall down. When no two tasks look the same, good luck training an LLM to accomplish the task with a high degree of accuracy, if at all.
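To make the standardization problem concrete, here is a minimal, hypothetical sketch (the labels and figures are invented, not drawn from real filings): the same line item shows up under a different label in each company's filing, so any fixed extraction rule works only until the next variant appears.

```python
# Hypothetical illustration: the same line item appears under different
# labels in different companies' filings, so a fixed parser breaks on
# the next variant it sees. All labels and numbers here are invented.
RAW_FILINGS = [
    {"Net revenues": "394,328"},
    {"Total revenue": "211,915"},
    {"Revenues, net of interest expense": "78,464"},
]

# A hand-maintained alias table -- the kind of brittle rule list this
# approach forces on you, and which is never complete.
REVENUE_ALIASES = {
    "net revenues",
    "total revenue",
    "revenues, net of interest expense",
}

def extract_revenue(filing):
    """Return the revenue figure as a float, or None if no alias matches."""
    for label, value in filing.items():
        if label.strip().lower() in REVENUE_ALIASES:
            return float(value.replace(",", ""))
    return None  # every unanticipated label falls through here

revenues = [extract_revenue(f) for f in RAW_FILINGS]
```

A filing that uses a label nobody anticipated - say, "Consolidated net sales" - returns None and silently drops out of the dataset, which is exactly the failure mode that makes this kind of parsing a beast.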
I could fix SEC.Gov, however. I could automate it. Not with LLMs, but by replacing the current infrastructure with a Blockchain platform. I would force public companies to tokenize their reporting data for submission to SEC_blockchain.gov. In a Blockchain environment, investors could write limitless scripts to analyze every token on the chain.
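A rough sketch of what that could look like, under loud assumptions: the `FilingToken` schema, the field names, and every number below are invented for illustration, not any real chain or real company data. The point is that once every filing conforms to one schema, an "investor script" collapses to a few lines.

```python
from dataclasses import dataclass

# Hypothetical schema: if every filing were a uniformly structured
# "token" on a chain, analysis becomes a trivial script instead of a
# parsing project. All field names and values below are invented.
@dataclass(frozen=True)
class FilingToken:
    cik: str        # company identifier
    period: str     # e.g. "2023-Q4"
    revenue: int    # USD, whole dollars -- one schema for everyone
    net_income: int # USD, whole dollars

# A stand-in for the chain: every token, same shape, no edge cases.
CHAIN = [
    FilingToken("0001111111", "2023-Q4", 100_000_000_000, 25_000_000_000),
    FilingToken("0002222222", "2023-Q4", 40_000_000_000, 8_000_000_000),
]

# An "investor script": one comprehension replaces a custom parser.
margins = {t.cik: t.net_income / t.revenue for t in CHAIN}
```

Because the schema is enforced at submission time, the alias tables and fallback rules that unstructured filings demand simply never exist.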
Yet for now we are saddled with partial automation and lots of manual work. This is why there aren't more capital markets data tools. This is why scaling CEORater is a beast - because SEC source data is an unbridled mess, and LLMs can't make heads or tails of it.