Strip HTML Tags
I needed to convert HTML to plain text for the application that I’m developing. I figured someone else had done this before; so, I searched for a solution. I found one that leveraged WebKit and worked well but was too slow for my usage. I found another that leveraged libxml2 and was fast but crashed. In the end, and only because I could browse the source to the version of libxml2 included in Mac OS X, I was able to work around the crash and keep the faster solution.
The first solution I found was on Karelia Software’s Cocoa Open Source site. FlattenHTML.m appeared to solve the problem nicely. However, when I later profiled the application with Shark, it showed that the vast majority of the time was spent in this method and its children, because a bunch of WebKit stuff was being initialized. This performance hit was noticeable; so, I searched for a faster solution.
I next found this code that leverages libxml2 to strip HTML tags. After switching to this solution, the time to strip HTML was insignificant. There was much rejoicing, for a bit. All was fine until the application started crashing while stripping some HTML text:
*** -[NSCFString delegate]: selector not recognized
[self = 0x39d530]
After some debugging, I determined that the htmlSAXParseDoc function would cause this error when passed malformed HTML. I don’t really care how well-formed the HTML is; I just want the plain text. I tried specifying an error function callback in the xmlSAXHandler structure without any success. I then took advantage of the fact that libxml2 is open source and included on Apple’s Open Source web site. For me, this is the primary advantage of Darwin being open source: I can download the source, browse it, and build it to debug problems either in Darwin or in the way my code leverages Darwin. In this case, by browsing the source, I was able to determine that the default structured error function was being called, resulting in the above exception. While creating a test case for a bug report to Apple (4905905), I learned that this problem only occurs after initializing an NSAppleScript object with the initWithContentsOfURL:error: method. Go figure. To work around this problem, I installed my own structured error function that ignores all errors. While debugging this issue I found next to nothing useful on the web; hopefully, this post will save someone some time. Here’s the final code, originally from the Objectpark group, with my minor modifications (tagged with “GCS”):
//
//  NSString+OPHTMLTools.m
//  GinkoVoyager
//
//  Created by Dirk Theisen on 30.06.06.
//  Copyright 2006 Objectpark Group.
//

#import "FlattenHTML.h"
#include <libxml2/libxml/xmlmemory.h>
#include <libxml2/libxml/HTMLparser.h>

@implementation NSString (FlattenHTML)

static void charactersParsed(void* context,
                             const xmlChar* ch, int len)
/*" Callback function for stringByStrippingHTML. "*/
{
    NSMutableString* result = context;
    NSString* parsedString;
    parsedString = [[NSString alloc] initWithBytesNoCopy:
        (xmlChar*) ch length: len encoding:
        NSUTF8StringEncoding freeWhenDone: NO];
    /* parsedString is nil if the bytes are not valid UTF-8;
       appendString: raises on nil, so skip that chunk */
    if (parsedString != nil) {
        [result appendString: parsedString];
    }
    [parsedString release];
}

/* GCS: custom error function to ignore errors */
static void structuredError(void * userData,
                            xmlErrorPtr error)
{
    /* ignore all errors */
    (void)userData;
    (void)error;
}

- (NSString*) flattenHTML
/*" Interprets the receiver as HTML, removes all tags
    and returns the plain text. "*/
{
    int mem_base = xmlMemBlocks();
    NSMutableString* result = [NSMutableString string];
    xmlSAXHandler handler;
    bzero(&handler, sizeof(xmlSAXHandler));
    handler.characters = &charactersParsed;

    /* GCS: override structuredErrorFunc to mine so
       I can ignore errors */
    xmlSetStructuredErrorFunc(xmlGenericErrorContext,
                              &structuredError);

    htmlSAXParseDoc((xmlChar*)[self UTF8String], "utf-8",
                    &handler, result);

    if (mem_base != xmlMemBlocks()) {
        NSLog( @"Leak of %d blocks found in htmlSAXParseDoc",
               xmlMemBlocks() - mem_base);
    }
    return result;
}

@end
Update 05jan2007: Be sure to read this comment for a better solution.

January 3rd, 2007 at 2:19 pm
That’s a lot of code for something that is really simple to do.
10.4’s NSXML* classes can do it all for you. Load in your HTML into an NSXMLDocument, using the tidy options to convert the HTML to XHTML.
Then use a really simple XSLT transform to transform the XHTML into text, stripping out the tags you don’t want (I decided to strip out everything in HEAD and SCRIPT tags, you will probably want to add some more tags)
NSURL *theURL = [NSURL fileURLWithPath:@"/Users/schwa/Desktop/Test.html"];
NSError *theError = NULL;
NSXMLDocument *theDocument = [[[NSXMLDocument alloc] initWithContentsOfURL:theURL options:NSXMLDocumentTidyHTML error:&theError] autorelease];
NSString *theXSLTString = @"\
\
\
\
\
";
NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:NULL error:&theError];
NSString *theString = [[[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding] autorelease];
NSLog(@"%@", theString);
January 3rd, 2007 at 2:21 pm
Looks like Wordpress ate the style sheet. (Obvious in retrospect). Here it is (hopefully unmangled):
<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>
<xsl:output method='text'/>
<xsl:template match='xhtml:head'></xsl:template>
<xsl:template match='xhtml:script'></xsl:template>
</xsl:stylesheet>";
January 3rd, 2007 at 11:54 pm
Thanks Jonathan! I replaced my code with yours, which is much cleaner. It is still plenty fast. As an added bonus, I’m able to remove my dependency on libxml2.
January 4th, 2007 at 10:57 am
This is great, guys! I’ve updated my “flatten” implementation (kept for posterity) to point to this entry.
Jon, I’m a bit confused about where the style sheet goes … is that the XSLT string that looks like a bunch of escaped newlines in the previous post?
January 4th, 2007 at 11:56 am
Dan, yes, in Jon’s code the style sheet is assigned to theXSLTString. Wordpress mangled it in his first post.
January 4th, 2007 at 12:11 pm
It works great!
BTW, I noticed in NSXMLDocument that there’s a method XMLDataWithOptions: which you can pass in a parameter of NSXMLDocumentTextKind which “outputs the string value of the document by extracting the string values from all text nodes.” Except that it DOESN’T. If that method worked properly (it actually outputs XHTML), it would be an even simpler approach! I’ve just reported this to Apple; I hope others do too.
January 4th, 2007 at 10:42 pm
For those of you following along at home, my bug report was a duplicate of 4296059. I hate it when that happens.
January 5th, 2007 at 12:42 pm
One caveat: I found some problems (maybe a bug in the XML parser) if I passed in a very short string with no markup. So what I’m doing is scanning to see if the string has any HTML-looking markup. Only if it has that do I pass the string to the parser. Otherwise I leave it unchanged.
static NSCharacterSet *sHTMLSet = nil;
if (nil == sHTMLSet)
{
    sHTMLSet = [[NSCharacterSet characterSetWithCharactersInString:@"&"] retain];
}
if (NSNotFound != [result rangeOfCharacterFromSet:sHTMLSet].location)
{ …. }
January 8th, 2007 at 12:46 am
Here’s another problem, maybe some minor aspect of the XSLT string. If you pass in “” as your string, then instead of getting a string back, you get a mostly-empty XML document back! It turns out that objectByApplyingXSLTString: will return NSData or an NSXMLDocument depending on what’s passed into it.
January 8th, 2007 at 4:21 am
Thanks for the NSXML method guys. I had been using something that was a bit slower and a lot more messy looking.
I ran into an issue tonight that is another gotcha with Jon’s method. If the string being processed only contains html, something like “”, an exception is thrown.
In this case, objectByApplyingXSLTString… returns an NSXMLDocument containing “” instead of an NSData object which naturally causes initWithData to choke. So you need to check if theData is actually data and not an NSXMLDocument before creating the string.
Looking at the docs, I don’t think it’s a bug, but they don’t spell out clearly what’s returned if the XSLT transform results in an empty string. It would be nice if it returned an empty string instead of an empty xml document. I’ll file a bug on it tomorrow when I’m a little bit more awake.
January 8th, 2007 at 10:01 am
A couple of clarifications about the above three comments. First, in my testing, if the specified string contains no markup, it needs to contain at least 12 characters or else an error is returned from objectByApplyingXSLTString that specifies that the document is empty. Newlines count, so you could append a bunch of newlines to the end of the string. Also, I believe that the NSCharacterSet created in Dan’s example isn’t initialized with an ampersand. Wordpress probably messed up the code. I expect that he’s creating a character set with angle brackets. If this is the case, there is still the possibility that the string could contain a single angle bracket and no markup which would still result in an error.
I’m not sure what should be between the quotes in Dan’s latest post. It may be similar to the issue reported by Brad. Wordpress also stripped HTML from Brad’s post. I’m not sure what was in his example, but if the string only contains an img tag, for example, it will exhibit the behavior described by Dan and Brad. To check if the object returned from objectByApplyingXSLTString is really an NSData object, you can invoke isKindOfClass.
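Pulling the thread so far together, here is a minimal sketch of Jon's NSXML approach with the guards discussed above: Dan's check for markup characters, and an isKindOfClass: test on whatever objectByApplyingXSLTString:arguments:error: returns. The function name PlainTextFromHTML is hypothetical, the "<>&" character set is only a guess at what Wordpress stripped from Dan's post, and the code assumes ARC on Mac OS X (under manual reference counting, add autorelease calls as in Jon's code).

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, not from the original post. Assumes ARC.
static NSString *PlainTextFromHTML(NSString *html)
{
    // Dan's guard: leave strings with no markup-looking characters unchanged.
    // The "<>&" set is an assumption; Wordpress mangled the original.
    NSCharacterSet *markup =
        [NSCharacterSet characterSetWithCharactersInString:@"<>&"];
    if ([html rangeOfCharacterFromSet:markup].location == NSNotFound) {
        return html;
    }

    // Jon's approach: let tidy turn the HTML into XHTML.
    NSError *error = nil;
    NSXMLDocument *document =
        [[NSXMLDocument alloc] initWithXMLString:html
                                         options:NSXMLDocumentTidyHTML
                                           error:&error];
    if (document == nil) {
        return nil; // parse error; details are in 'error'
    }

    // Jon's style sheet: text output, dropping head and script elements.
    NSString *xslt =
        @"<?xml version='1.0' encoding='utf-8'?>"
        @"<xsl:stylesheet version='1.0'"
        @" xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        @" xmlns:xhtml='http://www.w3.org/1999/xhtml'>"
        @"<xsl:output method='text'/>"
        @"<xsl:template match='xhtml:head'></xsl:template>"
        @"<xsl:template match='xhtml:script'></xsl:template>"
        @"</xsl:stylesheet>";

    id transformed = [document objectByApplyingXSLTString:xslt
                                                arguments:nil
                                                    error:&error];

    // The transform can hand back NSData or an NSXMLDocument.
    if ([transformed isKindOfClass:[NSData class]]) {
        return [[NSString alloc] initWithData:transformed
                                     encoding:NSUTF8StringEncoding];
    }
    // A non-nil, non-data result (e.g. for a lone img tag) is treated
    // as "no text"; nil means the transform itself failed.
    return (transformed != nil) ? @"" : nil;
}
```

This sketch returns nil whenever the parser or the transform reports an error, so a caller can fall back to the original string in that case.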
January 8th, 2007 at 6:47 pm
Sorry, it was 2 am here when I posted and I didn’t catch that the html was stripped. Good guess at my string, it was a single image tag and my work around was exactly what you suggested.
Thanks for the note about the string needing at least 12 characters. I’ll add a check for length too. I had already added a check similar to Dan’s that looks for brackets or an ampersand.
April 14th, 2007 at 6:42 pm
if it’s something simple you can always do this :
- (NSString *)stripTags:(NSString *) html {
NSMutableString *result = [[NSMutableString alloc] initWithCapacity:[html length]];
BOOL iguenore = YES;
int index;
unichar c;
for (index = 0; index ‘) {
iguenore = NO;
continue;
}
if (!iguenore) {
[result appendFormat:@"%C",[html characterAtIndex:index]];
}
}
return result;
}
this is just something i wrote into some of my applications
April 14th, 2007 at 6:48 pm
because of the use of greater-than and less-than signs i guess that some of the code failed to display, i am sorry
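For readers trying to fill in what got eaten, the loop was presumably toggling a flag on each angle bracket. A minimal sketch of that idea follows, as a guess and not the commenter's exact code: the function name StripTags is hypothetical, the misspelled flag is renamed to ignore, and it starts as NO here (the posted fragment starts it as YES) so that text before the first tag is kept.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical reconstruction; the comparisons against '<' and '>'
// are assumed, since the blog software removed them from the post.
static NSString *StripTags(NSString *html)
{
    NSUInteger length = [html length];
    NSMutableString *result = [NSMutableString stringWithCapacity:length];
    BOOL ignore = NO; // YES while we are between '<' and '>'

    for (NSUInteger index = 0; index < length; index++) {
        unichar c = [html characterAtIndex:index];
        if (c == '<') {
            ignore = YES;
            continue;
        }
        if (c == '>') {
            ignore = NO;
            continue;
        }
        if (!ignore) {
            [result appendFormat:@"%C", c];
        }
    }
    return result;
}
```

Unlike the NSXML approach, this does not decode entities such as &amp;, and a stray less-than sign in the text swallows everything up to the next greater-than sign.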
August 1st, 2008 at 11:45 am
Thank you…works brilliantly…
I assume this is still the easiest way to do this with Xcode 3.x and the Leopard SDK?
August 1st, 2008 at 10:41 pm
I haven’t checked to see if there is a new feature available in Leopard that makes this easier. If someone finds one, by all means, please leave a comment.
August 9th, 2008 at 4:32 am
This is better and faster and works with the iPhone SDK:
- (NSString *)flattenHTML:(NSString *)html
{
    NSScanner *theScanner;
    NSString *text;
    theScanner = [NSScanner scannerWithString:html];
    while ([theScanner isAtEnd] == NO) {
        //remove html tag
        [theScanner scanUpToString:@"<" intoString:NULL];
        [theScanner scanString:@"" intoString:&text];
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"",text] withString:@""];
    }
    return html;
}
Regards from Switzerland
August 9th, 2008 at 4:36 am
Wordpress has deleted the phrase in stringWithFormat:@”"
Put in the @”" following without the +
August 9th, 2008 at 4:37 am
StringWithFormat:@”"
August 9th, 2008 at 4:38 am
Shit… You will find it out..
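Since the missing pieces never made it through, here is a minimal sketch of what the scanner version was most likely doing: scan up to a less-than sign, scan up to the matching greater-than sign, and delete that tag from the string. The scanUpToString:@">" call and the @"%@>" format are assumptions about what was stripped; the function name FlattenHTML and the nil check are additions for illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical reconstruction of the mangled NSScanner snippet.
static NSString *FlattenHTML(NSString *html)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];

    while (![scanner isAtEnd]) {
        NSString *text = nil;

        // Skip ahead to the start of the next tag, if any.
        [scanner scanUpToString:@"<" intoString:NULL];

        // Capture the tag up to, but not including, the closing '>'.
        [scanner scanUpToString:@">" intoString:&text];

        // Remove every occurrence of that tag. text stays nil when no
        // tag was found, which avoids formatting a garbage pointer.
        if (text != nil) {
            NSString *tag = [NSString stringWithFormat:@"%@>", text];
            html = [html stringByReplacingOccurrencesOfString:tag
                                                   withString:@""];
        }
    }
    return html;
}
```

The scanner keeps walking the original string while the replacements are applied to a separate copy, so positions never get out of step. As with any tag-toggling approach, entities are left undecoded.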