Using SimpleHtmlDom to extract Html, loop through local directory, save files - CSS-Tricks


Viewing 4 posts - 1 through 4 (of 4 total)

Author

Posts
January 24, 2013 at 12:20 pm #42209

Crssp
Participant

Hi all, I’m new here, so hello to everyone in the forums! :)

Does anyone have any experience with the Simple HTML DOM library on SourceForge?
http://simplehtmldom.sourceforge.net/
I’ve downloaded the package and looked at many of the examples.

What I want to do is clean up hundreds or even thousands of files in a directory and its sub-directories.
The files have the .asp extension.
I would like to extract just certain blocks of text, along with their line breaks, and strip out all the other ASP-coded bits and junk.

I can provide more specifics; for now I’m just trying to get my head around saving the files and how the paths work. The examples don’t entirely make sense to me.

The code structure is pretty simple: everything I want is contained in two span tags, and everything below that just needs to be stripped or left behind.

<span class="headline">Dude you rock!

<div class="adsBox"></div>

<span class="bodytype">HOMETOWN — Local man rocks the DOM.

New line or paragraph goes here

Yet Another.

End of story, sometimes has an author and maybe an email address

</span>

Everything else goes away, so I’m after two innertext calls for the spans, methinks…

Also, where’s the forum search? I must be missing the search feature for just the forums.

January 24, 2013 at 12:25 pm #122156

Crssp
Participant

Oops, trying to paste a code block got me a FAIL.

OK, I tried pasting code above, but I suck at pasting code in the forums. Oops, I forgot I could use Markdown.

[FIXED BY MOD]
Thanks bro, appreciated ;)

January 25, 2013 at 11:07 am #122290

Crssp
Participant

Has anyone ever used SimpleHTMLDom, then? Any suggestions would be great.
There are a few tutorials online, but none quite apply to what I want to do.
Another thought would be to just use a good text editor in my local web folders and clean up the code that way.
The thought is to clean up the pages and import all 176,000 articles into a database, so a good database importer will be the next item on the list, to get the stories into WordPress for consumption.
Does the DigWP book cover anything like that?

January 25, 2013 at 11:45 pm #122367

__
Participant

A DOM tool is not what you need. If I understand you correctly, the files in question are not proper HTML (or XML): they include ASP code and possibly random bits of “other stuff” as well. That would prevent any DOM parser from parsing them.
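
One way to act on that without a DOM parser is plain pattern matching on the two spans. What follows is only a rough sketch, written in Python rather than PHP purely for illustration; the same idea should carry over to PHP's `preg_match` and directory iterators. The class names `headline` and `bodytype` come from the sample markup above. Everything else is an assumption: the function names `extract_story` and `clean_tree`, the `<% ... %>` form of the ASP blocks, the `.txt` output extension, and UTF-8 as the file encoding. Because the sample shows no closing tag for the headline span, the sketch reads the headline up to the next tag.

```python
import re
from pathlib import Path

# Assumption: inline ASP appears as <% ... %> blocks.
ASP_RE = re.compile(r"<%.*?%>", re.S)
TAG_RE = re.compile(r"<[^>]+>")
BR_RE = re.compile(r"<br\s*/?>", re.I)
# Headline: text from the opening span up to the next tag of any kind.
HEADLINE_RE = re.compile(r'<span\s+class="headline"\s*>([^<]*)', re.I)
# Body: everything up to the first closing </span> after the opening tag.
BODY_RE = re.compile(r'<span\s+class="bodytype"\s*>(.*?)</span>', re.I | re.S)


def extract_story(raw):
    """Return (headline, body) as plain text, or None if a span is missing."""
    cleaned = ASP_RE.sub("", raw)          # drop ASP code first
    head = HEADLINE_RE.search(cleaned)
    body = BODY_RE.search(cleaned)
    if not head or not body:
        return None
    body_text = BR_RE.sub("\n", body.group(1))  # keep line breaks
    body_text = TAG_RE.sub("", body_text)       # strip any remaining tags
    return head.group(1).strip(), body_text.strip()


def clean_tree(src_dir, dst_dir):
    """Walk src_dir recursively and write one cleaned .txt per .asp file.

    The sub-directory layout is mirrored under dst_dir. Returns the number
    of files written; files without both spans are skipped.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    written = 0
    for path in src.rglob("*.asp"):
        # Assumption: UTF-8; undecodable bytes are replaced, not fatal.
        raw = path.read_text(encoding="utf-8", errors="replace")
        story = extract_story(raw)
        if story is None:
            continue
        out = dst / path.relative_to(src).with_suffix(".txt")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(story[0] + "\n\n" + story[1] + "\n", encoding="utf-8")
        written += 1
    return written
```

Calling `clean_tree("site", "cleaned")` (both folder names are hypothetical) would leave the originals untouched and put the stripped copies in a parallel tree, which makes it easy to spot-check a few files before importing anything into a database. If a body contains nested spans, the non-greedy match would stop at the first `</span>`, so it would be worth testing on a sample of the real files first.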

The forum ‘Back End’ is closed to new topics and replies.