January 3rd, 2007 at 2:19 pm
That’s a lot of code for something that is really simple to do.
10.4’s NSXML* classes can do it all for you. Load your HTML into an NSXMLDocument, using the tidy options to convert the HTML to XHTML.
Then use a really simple XSLT transform to turn the XHTML into text, stripping out the tags you don’t want. (I decided to strip out everything in HEAD and SCRIPT tags; you will probably want to add some more tags.)
NSURL *theURL = [NSURL fileURLWithPath:@"/Users/schwa/Desktop/Test.html"];
NSError *theError = NULL;
NSXMLDocument *theDocument = [[[NSXMLDocument alloc] initWithContentsOfURL:theURL options:NSXMLDocumentTidyHTML error:&theError] autorelease];
NSString *theXSLTString = @"\
\
\
\
\
";
NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:NULL error:&theError];
NSString *theString = [[[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding] autorelease];
NSLog(@"%@", theString);
January 3rd, 2007 at 2:21 pm
Looks like Wordpress ate the style sheet. (Obvious in retrospect). Here it is (hopefully unmangled):
<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>
<xsl:output method='text'/>
<xsl:template match='xhtml:head'></xsl:template>
<xsl:template match='xhtml:script'></xsl:template>
</xsl:stylesheet>";
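Putting the two comments above together, a minimal sketch of the whole approach might look like the following. The function name PlainTextFromHTMLFile and the constant kStripXSLT are illustrative assumptions, not names from the thread. The stylesheet is the one quoted above, written as adjacent string literals instead of backslash continuations. The code assumes manual reference counting, as in the original snippet.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: loads an HTML file, tidies it to XHTML, and applies
// the text-output stylesheet quoted in the thread.
// Assumes manual reference counting (remove the autorelease calls under ARC).
static NSString *PlainTextFromHTMLFile(NSString *path, NSError **outError)
{
    // Adjacent literals are concatenated at compile time.
    static NSString *const kStripXSLT =
        @"<?xml version='1.0' encoding='utf-8'?>\n"
        @"<xsl:stylesheet version='1.0'\n"
        @"    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'\n"
        @"    xmlns:xhtml='http://www.w3.org/1999/xhtml'>\n"
        @"<xsl:output method='text'/>\n"
        @"<xsl:template match='xhtml:head'></xsl:template>\n"
        @"<xsl:template match='xhtml:script'></xsl:template>\n"
        @"</xsl:stylesheet>";

    NSURL *url = [NSURL fileURLWithPath:path];
    NSXMLDocument *doc =
        [[[NSXMLDocument alloc] initWithContentsOfURL:url
                                              options:NSXMLDocumentTidyHTML
                                                error:outError] autorelease];
    if (doc == nil) {
        return nil;
    }

    // The method is declared to return id, so check the class before
    // treating the result as data.
    id result = [doc objectByApplyingXSLTString:kStripXSLT
                                      arguments:nil
                                          error:outError];
    if (![result isKindOfClass:[NSData class]]) {
        return nil;
    }
    return [[[NSString alloc] initWithData:result
                                  encoding:NSUTF8StringEncoding] autorelease];
}
```

Elements without a matching template fall through to XSLT's built-in rules, which copy text nodes to the output, so only the HEAD and SCRIPT contents are dropped.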
January 3rd, 2007 at 11:54 pm
Thanks Jonathan! I replaced my code with yours, which is much cleaner. It is still plenty fast. As an added bonus, I’m able to remove my dependency on libxml2.
January 4th, 2007 at 10:57 am
This is great, guys! I’ve updated my “flatten” implementation (kept for posterity) to point to this entry.
Jon, I’m a bit confused about where the style sheet goes … is that the XSLT string that looks like a bunch of escaped newlines in the previous post?
January 4th, 2007 at 11:56 am
Dan, yes, in Jon’s code the style sheet is assigned to theXSLTString. Wordpress mangled it in his first post.
January 4th, 2007 at 12:11 pm
It works great!
BTW, I noticed that NSXMLDocument has a method, XMLDataWithOptions:, to which you can pass NSXMLDocumentTextKind, which “outputs the string value of the document by extracting the string values from all text nodes.” Except that it DOESN’T. If that method worked properly (it actually outputs XHTML), it would be an even simpler approach! I’ve just reported this to Apple; I hope others do too.
January 4th, 2007 at 10:42 pm
For those of you following along at home, my bug report was a duplicate of 4296059. I hate it when that happens.
January 5th, 2007 at 12:42 pm
One caveat: I found some problems (maybe a bug in the XML parser) if I passed in a very short string with no markup. So what I’m doing is scanning to see if the string has any HTML-looking markup. Only if it does do I pass the string to the parser. Otherwise I leave it unchanged.
static NSCharacterSet *sHTMLSet = nil;
if (nil == sHTMLSet)
{
sHTMLSet = [[NSCharacterSet characterSetWithCharactersInString:@"&"] retain];
}
if (NSNotFound != [result rangeOfCharacterFromSet:sHTMLSet].location)
{ …. }
January 8th, 2007 at 12:46 am
Here’s another problem, maybe some minor aspect of the XSLT string. If you pass in “” as your string, then instead of getting a string back, you get a mostly-empty XML document back! It turns out that objectByApplyingXSLTString: will return NSData or an NSXMLDocument depending on what’s passed into it.
January 8th, 2007 at 4:21 am
Thanks for the NSXML method, guys. I had been using something that was a bit slower and a lot more messy looking.
I ran into an issue tonight that is another gotcha with Jon’s method. If the string being processed only contains html, something like “”, an exception is thrown.
In this case, objectByApplyingXSLTString… returns an NSXMLDocument containing “” instead of an NSData object, which naturally causes initWithData to choke. So you need to check if theData is actually data and not an NSXMLDocument before creating the string.
Looking at the docs, I don’t think it’s a bug, but they don’t spell out clearly what’s returned if the XSLT transform results in an empty string. It would be nice if it returned an empty string instead of an empty xml document. I’ll file a bug on it tomorrow when I’m a little bit more awake.
January 8th, 2007 at 10:01 am
A couple of clarifications about the above three comments. First, in my testing, if the specified string contains no markup, it needs to contain at least 12 characters or else an error is returned from objectByApplyingXSLTString that specifies that the document is empty. Newlines count, so you could append a bunch of newlines to the end of the string. Also, I believe that the NSCharacterSet created in Dan’s example isn’t initialized with an ampersand. Wordpress probably messed up the code. I expect that he’s creating a character set with angle brackets. If this is the case, there is still the possibility that the string could contain a single angle bracket and no markup which would still result in an error.
I’m not sure what should be between the quotes in Dan’s latest post. It may be similar to the issue reported by Brad. Wordpress also stripped HTML from Brad’s post. I’m not sure what was in his example, but if the string only contains an img tag, for example, it will exhibit the behavior described by Dan and Brad. To check if the object returned from objectByApplyingXSLTString is really an NSData object, you can invoke isKindOfClass.
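The guards discussed in the last few comments could be combined roughly as below. This is only a sketch: FlattenHTMLString is a hypothetical name, and the stylesheet is passed in as a parameter and assumed to use the text output method. Requiring both angle brackets before parsing, returning the input unchanged on errors, and returning an empty string when the transform yields a document are design choices made for illustration, not behavior confirmed by anyone in the thread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper. `xslt` is assumed to be a text-output stylesheet.
// Assumes manual reference counting (remove the autorelease calls under ARC).
static NSString *FlattenHTMLString(NSString *html, NSString *xslt)
{
    // Guard 1: leave strings without any tag-like characters alone, which
    // also sidesteps the short-string problem described above.
    if ([html rangeOfString:@"<"].location == NSNotFound ||
        [html rangeOfString:@">"].location == NSNotFound) {
        return html;
    }

    NSError *error = nil;
    NSXMLDocument *doc =
        [[[NSXMLDocument alloc] initWithXMLString:html
                                          options:NSXMLDocumentTidyHTML
                                            error:&error] autorelease];
    if (doc == nil) {
        return html; // could not parse; fall back to the input (assumption)
    }

    id result = [doc objectByApplyingXSLTString:xslt
                                      arguments:nil
                                          error:&error];

    // Guard 2: only decode the result when it really is NSData.
    if ([result isKindOfClass:[NSData class]]) {
        return [[[NSString alloc] initWithData:result
                                      encoding:NSUTF8StringEncoding] autorelease];
    }
    if ([result isKindOfClass:[NSXMLDocument class]]) {
        return @""; // markup with no text, e.g. a lone img tag
    }
    return html; // nil result: the transform reported an error
}
```

A string such as "a < b > c" still passes the first guard, so the fallbacks after parsing and transforming are what keep that case from raising an exception.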
January 8th, 2007 at 6:47 pm
Sorry, it was 2 am here when I posted and I didn’t catch that the html was stripped. Good guess at my string: it was a single image tag, and my workaround was exactly what you suggested.
Thanks for the note about the string needing at least 12 characters. I’ll add a check for length too. I had already added a check similar to Dan’s that looks for brackets or an ampersand.
April 14th, 2007 at 6:42 pm
If it’s something simple, you can always do this:
- (NSString *)stripTags:(NSString *) html {
NSMutableString *result = [[NSMutableString alloc] initWithCapacity:[html length]];
BOOL iguenore = YES;
int index;
unichar c;
for (index = 0; index ‘) {
iguenore = NO;
continue;
}
if (!iguenore) {
[result appendFormat:@"%C",[html characterAtIndex:index]];
}
}
return result;
}
This is just something I wrote into some of my applications.
April 14th, 2007 at 6:48 pm
Because of the use of greater-than and less-than signs, I guess some of the code failed to display. I am sorry.
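Since part of the loop was lost in posting, here is one guess at what a character-by-character stripper along those lines could look like. It is a reconstruction under assumptions, not the commenter's actual code: the name StripTagsByScanning is made up, and the flag starts out as "not inside a tag" so that text before the first tag is kept. It does not decode entities, and it treats every less-than sign as the start of a tag.

```objc
#import <Foundation/Foundation.h>

// Hypothetical reconstruction: copy every character that is not between
// a '<' and the next '>'.
static NSString *StripTagsByScanning(NSString *html)
{
    NSMutableString *result = [NSMutableString stringWithCapacity:[html length]];
    BOOL insideTag = NO;
    NSUInteger length = [html length];

    for (NSUInteger index = 0; index < length; index++) {
        unichar c = [html characterAtIndex:index];
        if (c == '<') {
            insideTag = YES;   // start skipping
            continue;
        }
        if (c == '>') {
            insideTag = NO;    // resume copying after the tag closes
            continue;
        }
        if (!insideTag) {
            [result appendFormat:@"%C", c];
        }
    }
    return result;
}
```

For example, StripTagsByScanning(@"<b>Hi</b> there") yields "Hi there".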
August 1st, 2008 at 11:45 am
Thank you…works brilliantly…
I assume this is still the easiest way to do this with Xcode 3.x and the Leopard SDK?
August 1st, 2008 at 10:41 pm
I haven’t checked to see if there is a new feature available in Leopard that makes this easier. If someone finds one, by all means, please leave a comment.
August 9th, 2008 at 4:32 am
This is better and faster and works with the iPhone SDK:
- (NSString *)flattenHTML:(NSString *)html
{
NSScanner *theScanner;
NSString *text;
theScanner = [NSScanner scannerWithString:html];
while ([theScanner isAtEnd] == NO) {
//remove html tag
[theScanner scanUpToString:@"<" intoString:NULL];
[theScanner scanString:@"" intoString:&text];
html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"",text] withString:@""];
}
return html;
}
Regards from Switzerland
August 9th, 2008 at 4:36 am
Wordpress has deleted the phrase in stringWithFormat:@""
Put in the @"" following without the +
August 9th, 2008 at 4:37 am
StringWithFormat:@""
August 9th, 2008 at 4:38 am
Shit… You will find it out..
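The string literals in the method above never made it through intact, so what follows is only a sketch of the same NSScanner idea, on the assumption that the missing characters were the angle brackets. The name FlattenHTMLWithScanner is hypothetical. Instead of calling stringByReplacingOccurrencesOfString: for each tag, this variant copies the text between tags into a new string, which is a swapped-in technique rather than the posted one.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch: walk the string with NSScanner, keeping the text
// between tags and discarding everything from '<' through '>'.
static NSString *FlattenHTMLWithScanner(NSString *html)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    // Keep whitespace; by default NSScanner skips it.
    [scanner setCharactersToBeSkipped:nil];

    NSMutableString *result = [NSMutableString stringWithCapacity:[html length]];
    NSString *chunk = nil;

    while (![scanner isAtEnd]) {
        // Text up to the next tag (returns NO if we are already at '<').
        if ([scanner scanUpToString:@"<" intoString:&chunk]) {
            [result appendString:chunk];
        }
        // Consume the tag itself, if there is one.
        if ([scanner scanString:@"<" intoString:NULL]) {
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];
        }
    }
    return result;
}
```

For example, FlattenHTMLWithScanner(@"<p>Hello <b>world</b></p>") yields "Hello world". An unterminated tag at the end of the input is simply dropped.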
November 4th, 2009 at 1:49 am
How many times did you try before the first fast program crashed? I used the same one, but it didn’t result in any crashes.
November 18th, 2009 at 8:31 am
@Daniel
You cannot add less-than or greater-than signs in the comments here.