Searching the text within a PDF?

  • # April 22, 2013 at 12:43 am

    Does anyone know of a way for a user to search from a web form and get results based on what is IN a PDF document? I can’t think of any way to make it show an excerpt, but it would be enough if it could just indicate that the file it returns contains the information the user was looking for.

    # April 22, 2013 at 3:50 am

    It’s hard to offer a solution because the question is a bit vague. Do these PDFs follow any structural guidelines? Where are the PDFs coming from?

    This is a bit over my head, but have you researched how Google does it with the preview in its search results? When you search for a keyword and hover over the arrow, it puts a red border around what I believe is a summary or excerpt. Unfortunately, that excerpt doesn’t always contain the keyword, so I’m thinking it must be fairly complex and require some sort of algorithm. Then again, this isn’t my area.

    Either way, with PHP you can extract content and/or post an excerpt from a PDF.
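
To illustrate the excerpt idea mentioned above, here is a minimal sketch in Python rather than PHP. It assumes the text has already been extracted from each PDF (for example with a command-line tool such as `pdftotext`) and only shows the search-and-excerpt step. The function name `find_excerpts`, the filenames, and the sample text are all made up for illustration.

```python
import re


def find_excerpts(documents, query, width=60):
    """Return (filename, excerpt) pairs for documents whose text contains query.

    documents maps a PDF filename to its already-extracted plain text.
    width is how many characters of context to keep on each side of the match.
    """
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    results = []
    for filename, text in documents.items():
        match = pattern.search(text)
        if match is None:
            continue  # this document does not mention the query
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks and repeated spaces left over from extraction.
        excerpt = " ".join(text[start:end].split())
        results.append((filename, excerpt))
    return results


# Hypothetical extracted text, keyed by PDF filename.
documents = {
    "issue-01.pdf": "Spring planting guide. Tomatoes need full sun and regular watering.",
    "issue-02.pdf": "Winter edition. Notes on pruning apple trees before the first frost.",
}

results = find_excerpts(documents, "tomatoes", width=20)
for filename, excerpt in results:
    print(filename, "...", excerpt)
```

The match is case-insensitive and only the first hit per document is used, which is probably enough to show the user that a given issue contains what they searched for.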

    # April 22, 2013 at 1:47 pm

    Hmmm that last line is interesting.

    Basically, the situation is that these guys have a library of about 75 archived “issues” as PDFs, all sitting in one massive list. Most of them are just text with a couple of images. They were hoping to make the whole thing searchable within a CMS, like a normal site search where each result gives the user an idea of whether it is what they wanted.

    I wasn’t sure that was possible, but it sounds like it’s somewhat possible. I’m not a PHP developer, so that’s WAY over my head :)

    My initial recommendation was to keep all the old issues as an “archive” and, going forward, to write each issue as articles so the content is fully searchable, with the PDF version still available to view and download.

    # April 22, 2013 at 11:32 pm

    This may not be ideal, but 75 PDFs doesn’t sound like that many. Of course I don’t know the extent of the content, but why don’t they just copy/paste and create a digital web archive?

    # April 23, 2013 at 3:01 am

    That’s what I eventually ended up recommending. It’s probably roughly 900 to 1100 pages of text to copy, but I think it would be worth it in the end to have that as real content instead of PDFs only.

    # April 23, 2013 at 3:21 am

    There might be a simpler way to extract that content instead of doing it by hand. I’d suggest asking on Stack Overflow.

    # April 23, 2013 at 12:40 pm

    I *think* you can do this with Google’s Custom Search: https://developers.google.com/custom-search/v1/overview
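
As a rough idea of what a request to that API could look like, here is a small Python sketch that only builds the request URL and does not send it. The key and engine ID are placeholders, the helper name `build_search_url` is made up, and it assumes a search engine has already been configured for the site and that Google has indexed the PDFs.

```python
from urllib.parse import urlencode


def build_search_url(api_key, engine_id, query):
    """Build a Custom Search JSON API request URL restricted to PDF results."""
    params = {
        "key": api_key,     # API key from the Google developer console
        "cx": engine_id,    # ID of the search engine set up for the site
        "q": query,         # what the user typed into the search form
        "fileType": "pdf",  # only return results that are PDF files
    }
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)


# Placeholder credentials; replace with real values before sending a request.
url = build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "apple trees")
print(url)
```

The JSON response lists matches under `items`, each with fields such as `title`, `link`, and `snippet`, so the snippet could serve as the excerpt shown to the user.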
