What I want to do is clean up hundreds or even thousands of files in a directory and its sub-directories.
The files have the .asp extension.
I would like to extract just certain pieces of text, along with their line breaks, and strip out all the other ASP code and junk.
I can provide more specifics; I'm just trying to get my head around saving the files and how the paths work with that. The examples are not entirely making sense.
The code structure is pretty simple: everything I want is contained in two span tags, and everything below that just needs to be stripped or left behind.
< span class="headline">Dude you rock!
< div class="adsBox">< /div>
< span class="bodytype">HOMETOWN — Local man rocks the DOM.
New line or paragraph goes here
End of story, sometimes has an author and maybe an email address
Everything else goes away, so I’m after two innertext calls for the spans, methinks…
Also, where’s the forum search? I must be missing the search feature for just the forums.
Has anyone ever used SimpleHTMLDom? Any suggestions would be great.
There are a few tutorials online, but none quite apply to what I want to do.
Another thought would be just using a good text editor in my local web folders and cleaning up the code that way.
The thought is to clean up the pages and import all 176,000 articles into a database, so a good database importer will be the next item on the list, to get the stories into WordPress for consumption.
Does the DigWP book cover anything like that?
A DOM tool is not what you need. If I understand you correctly, the files in question are not proper HTML (or XML): they include ASP code and possibly random bits of “other stuff” as well. That would prevent any DOM parser from parsing them.
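One alternative is plain pattern matching over the raw text. Here is a rough sketch in Python, purely for illustration. It rests on several assumptions that are not confirmed by your post: that each of the two spans is closed by the first </span> that follows it (no nested spans), that the ASP code sits in <% ... %> blocks, and that the files can be read as UTF-8 (an older site may well be Windows-1252). The function names (clean, extract_story, convert_tree) are made up for this example. It walks the source folder recursively and writes one .txt file per .asp file into a separate output folder, mirroring the sub-directory layout, which is the "paths" part of your question.

```python
import re
from pathlib import Path

# Assumption: each span ends at the first </span> after it (no nested spans).
HEADLINE_RE = re.compile(r'<span\s+class="headline">(.*?)</span>', re.S | re.I)
BODY_RE = re.compile(r'<span\s+class="bodytype">(.*?)</span>', re.S | re.I)
ASP_RE = re.compile(r'<%.*?%>', re.S)   # server-side ASP blocks
TAG_RE = re.compile(r'<[^>]+>')         # any remaining HTML tags


def clean(fragment):
    """Drop ASP blocks and tags; keep the text, one non-empty line per row."""
    fragment = ASP_RE.sub('', fragment)
    # Turn <br> and </p> into real line breaks before removing other tags.
    fragment = re.sub(r'<br\s*/?>|</p>', '\n', fragment, flags=re.I)
    fragment = TAG_RE.sub('', fragment)
    lines = [line.strip() for line in fragment.splitlines()]
    return '\n'.join(line for line in lines if line)


def extract_story(source):
    """Return (headline, body), or None if either span is missing."""
    headline = HEADLINE_RE.search(source)
    body = BODY_RE.search(source)
    if not headline or not body:
        return None
    return clean(headline.group(1)), clean(body.group(1))


def convert_tree(src_root, dst_root):
    """Write one .txt per .asp under dst_root, mirroring src_root's folders."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = 0
    for asp_file in src_root.rglob('*.asp'):
        text = asp_file.read_text(encoding='utf-8', errors='replace')
        story = extract_story(text)
        if story is None:
            continue  # spans not found; leave this file for manual review
        # e.g. src/news/2009/a.asp -> dst/news/2009/a.txt
        target = dst_root / asp_file.relative_to(src_root).with_suffix('.txt')
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(story[0] + '\n\n' + story[1] + '\n', encoding='utf-8')
        written += 1
    return written
```

Calling convert_tree('old_site', 'cleaned') (both folder names hypothetical) leaves the originals untouched and returns the number of files written, so you can compare that count against the number of .asp files and see how many did not match the expected structure. Run it on a small copy of the site first and spot-check the output before trusting it on everything.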