So imagine that you have all kinds of PDF docs, and some of them may have been scanned. For those, you first need to make sense of the scan itself, and that takes time as well.
There is a technology called Optical Character Recognition, or OCR, and that’s often used to get that data in and turn the scanned pages into machine-readable text, basically documents a computer can understand.
You now have text, which is a big, important piece, but you still need to understand the content. Your NLP and AI programs are going to have to do some kind of whizzbang magic to basically understand what’s going on in all of that content.
Most good programs will generate internal metadata about the document or the content itself. It might even go down to noting that the bolded words are there for emphasis and the italicized words are there for another reason.
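To make that metadata idea a bit more concrete, here is a minimal sketch in Python. It assumes the earlier steps have already produced HTML-like markup in which bold text is wrapped in `<b>` or `<strong>` and italic text in `<i>` or `<em>`; that input format, the `EmphasisCollector` class, and the `build_metadata` function are all hypothetical names made up for illustration, not the way any particular product does it.

```python
from html.parser import HTMLParser


class EmphasisCollector(HTMLParser):
    """Collects plain text plus the phrases that appear in bold or italics."""

    BOLD_TAGS = {"b", "strong"}
    ITALIC_TAGS = {"i", "em"}

    def __init__(self):
        super().__init__()
        self._open = []        # emphasis tags that are currently open
        self.text_parts = []   # every piece of text, in document order
        self.bold = []         # phrases seen inside a bold tag
        self.italic = []       # phrases seen inside an italic tag

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS or tag in self.ITALIC_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the most recently opened tag with this name.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        self.text_parts.append(data)
        phrase = data.strip()
        if not phrase:
            return
        if any(t in self.BOLD_TAGS for t in self._open):
            self.bold.append(phrase)
        if any(t in self.ITALIC_TAGS for t in self._open):
            self.italic.append(phrase)


def build_metadata(markup):
    """Return a small metadata record for one document (hypothetical schema)."""
    collector = EmphasisCollector()
    collector.feed(markup)
    collector.close()
    # Collapse runs of whitespace so the word count is stable.
    text = " ".join("".join(collector.text_parts).split())
    return {
        "text": text,
        "word_count": len(text.split()),
        "bold": collector.bold,
        "italic": collector.italic,
    }


# Made-up sample input, standing in for one converted page.
sample = (
    "<p>Payment is due in <b>30 days</b>. "
    "See the <i>Terms of Service</i> for details.</p>"
)
metadata = build_metadata(sample)
print(metadata)
```

The record keeps bold and italic phrases in separate lists, so a later step can treat them differently, for example weighting bolded phrases as emphasis while handling italicized ones as titles or defined terms.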