
Using SimpleHtmlDom to extract HTML, loop through a local directory, and save files

  • # January 24, 2013 at 12:20 pm

    Hi all, I’m new, so hello to everyone in the forums! :)

    Does anyone have any experience with the Simple HTML DOM library on SourceForge?
    http://simplehtmldom.sourceforge.net/
    I’ve downloaded the package and looked at many of the examples.

    What I want to do is clean up hundreds or even thousands of files in a directory and its sub-directories.
    The files have the .asp extension.
    I would like to extract just certain pieces of text, along with their line breaks, and strip out all the other ASP-coded bits/junk.

    I can provide more specifics; I’m just trying to get my head around saving the files and how the paths work. The examples are not entirely making sense to me.
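For the paths question, here is a minimal sketch of the walk-and-save part only. It is written in Python with the standard library rather than PHP and Simple HTML DOM, so treat it as an illustration of the idea, not as that library's API. The function names, the `.txt` output extension, and the UTF-8 encoding are all assumptions; `clean` stands for whatever extraction step you end up with.

```python
from pathlib import Path


def mirrored_output_path(src_file: Path, src_root: Path, out_root: Path) -> Path:
    """Map src_root/a/b/story.asp to out_root/a/b/story.txt (hypothetical naming)."""
    relative = src_file.relative_to(src_root)
    return (out_root / relative).with_suffix(".txt")


def clean_tree(src_root: Path, out_root: Path, clean) -> int:
    """Run clean(text) on every .asp file under src_root and return the file count."""
    count = 0
    # rglob descends into every sub-directory below src_root.
    for src_file in sorted(src_root.rglob("*.asp")):
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        text = src_file.read_text(encoding="utf-8", errors="replace")
        out_file = mirrored_output_path(src_file, src_root, out_root)
        # Create the matching sub-directory on the output side before writing.
        out_file.parent.mkdir(parents=True, exist_ok=True)
        out_file.write_text(clean(text), encoding="utf-8")
        count += 1
    return count
```

Writing into a separate output root that mirrors the source layout leaves the originals untouched, so a bad cleanup pass can simply be re-run.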

    The code structure is pretty simple: everything is contained in two span tags, and everything below that just needs to be stripped or left behind.

    <span class="headline">Dude you rock!

    <div class="adsBox"></div>

    <span class="bodytype">HOMETOWN — Local man rocks the DOM.

    New line or paragraph goes here

    Yet Another.

    End of story, sometimes has an author and maybe an email address

    </span>

    Everything else goes away, so I’m after two innertext calls for the spans, me thinks…

    Also, where’s the forum search? I must be missing the search feature for just the forums.

    # January 24, 2013 at 12:25 pm

    Oops, I tried to paste a code block and got a FAIL.

    OK, I tried pasting the code above; I suck at this code pasting in the forums. Oops, forgot I could use markdown.

    [FIXED BY MOD]
    Thanks bro, appreciated ;)

    # January 25, 2013 at 11:07 am

    Has anyone ever used SimpleHTMLDom, then? Any suggestions would be great.
    There are a few tutorials online, but none quite apply to what I want to do.
    Another thought would be to just use a good text editor in my local web folders and clean up the code that way.
    The thought is to clean up the pages and import all 176,000 articles into a database, so a good database importer will be the next item on the list, to get the stories into WordPress for consumption.
    Does the DigWP book cover anything like that?
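One way to bridge the gap between cleaned files and a database importer is an intermediate CSV file. This is only a sketch in Python under stated assumptions: the function name and the column headers (`source_path`, `post_title`, `post_content`) are hypothetical and would need to match whatever importer is eventually chosen.

```python
import csv
from pathlib import Path


def write_articles_csv(articles, csv_path: Path) -> int:
    """Write (source_path, headline, body) tuples to csv_path and return the row count."""
    count = 0
    # newline="" lets the csv module handle line endings inside quoted fields.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Hypothetical header names; adjust them to the importer's expectations.
        writer.writerow(["source_path", "post_title", "post_content"])
        for source_path, headline, body in articles:
            writer.writerow([source_path, headline, body])
            count += 1
    return count
```

Multi-line story bodies survive the round trip because the `csv` module quotes any field that contains a line break.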

    # January 25, 2013 at 11:45 pm

    A DOM tool is not what you need. If I understand you correctly, the files in question are not proper HTML (or XML): they include ASP code and possibly random bits of “other stuff” as well. That would prevent any DOM parser from parsing them.
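If the files really are too messy for a DOM parser, plain text matching is one alternative. The sketch below is Python with regular expressions, not Simple HTML DOM, and it rests on several assumptions: ASP blocks are delimited by `<% ... %>` and never contain `%>` inside a string, the class attribute uses double quotes, the spans are not nested, and a span's text ends at its closing tag or at the next `span` or `div`. The function name is hypothetical.

```python
import html
import re

# Server-side ASP blocks, assumed to look like <% ... %>.
ASP_BLOCK = re.compile(r"<%.*?%>", re.DOTALL)
# Tags treated as line breaks: <br>, <br />, <p>, </p>.
BREAK_TAG = re.compile(r"<\s*(?:br|/?p)\b[^>]*>", re.IGNORECASE)
# Any other tag is dropped without adding a line break.
ANY_TAG = re.compile(r"<[^>]+>")


def span_text(source: str, css_class: str) -> str:
    """Return the text of the first <span class="css_class">, with tags removed."""
    pattern = re.compile(
        r'<span[^>]*class="' + re.escape(css_class) + r'"[^>]*>'
        r"(.*?)(?=</span|<span\b|<div\b)",
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(ASP_BLOCK.sub("", source))
    if match is None:
        return ""
    text = BREAK_TAG.sub("\n", match.group(1))
    text = html.unescape(ANY_TAG.sub("", text))
    # Trim each line and drop the empty ones left behind by removed markup.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Calling `span_text(page, "headline")` and `span_text(page, "bodytype")` on each file would yield the two pieces of text described in the first post; a file where either call returns an empty string is worth checking by hand.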
