Strip HTML Tags

I needed to convert HTML to plain text for the application that I’m developing. I figured someone else had done this before, so I searched for a solution. I found one that leveraged WebKit and worked well but was too slow for my usage. I found another that leveraged libxml2 and was fast but crashed. In the end, and only because I could browse the source of the version of libxml2 included in Mac OS X, I was able to work around the crash and keep the faster solution.

The first solution I found was on Karelia Software’s Cocoa Open Source site. FlattenHTML.m appeared to solve the problem nicely. However, when I later profiled the application with Shark, it showed that the vast majority of the time was spent in this method and its children, because a lot of WebKit machinery was being initialized. This performance hit was noticeable, so I searched for a faster solution.

I next found this code that leverages libxml2 to strip HTML tags. After switching to this solution, the time to strip HTML was insignificant. There was much rejoicing, for a bit. All was fine until the application started crashing while stripping some HTML text:


*** -[NSCFString delegate]: selector not recognized
      [self = 0x39d530]

After some debugging, I determined that the htmlSAXParseDoc function would cause this error when passed malformed HTML. I don’t really care how well-formed the HTML is; I just want the plain text. I tried specifying an error function callback in the xmlSAXHandler structure, without any success. I then took advantage of the fact that libxml2 is open source and included on Apple’s Open Source web site. For me, this is the primary advantage of Darwin being open source: I can download the source, browse it, and build it to debug problems either in Darwin or in the way my code uses Darwin. In this case, by browsing the source, I was able to determine that the default structured error function was being called and was raising the above exception. While creating a test case for a bug report to Apple (4905905), I learned that this problem only occurs after initializing an NSAppleScript object with the initWithContentsOfURL:error: method. Go figure. To work around this problem, I installed my own structured error function that ignores all errors. While debugging this issue I found next to nothing useful on the web; hopefully, this post will save someone some time. Here’s the final code, originally from the Objectpark group, with my minor modifications (tagged with “GCS”):


//
//  NSString+OPHTMLTools.m
//  GinkoVoyager
//
//  Created by Dirk Theisen on 30.06.06.
//  Copyright 2006 Objectpark Group.
//

#import "FlattenHTML.h"

#include <libxml2/libxml/xmlmemory.h>
#include <libxml2/libxml/HTMLparser.h>

@implementation NSString (FlattenHTML)

static void charactersParsed(void* context,
      const xmlChar* ch, int len)
/*" Callback function for stringByStrippingHTML. "*/
{
  NSMutableString* result = context;
  NSString* parsedString;
  parsedString = [[NSString alloc] initWithBytesNoCopy:
      (xmlChar*) ch length: len encoding:
      NSUTF8StringEncoding freeWhenDone: NO];
  /* nil if the bytes are not valid UTF-8 */
  if (parsedString != nil)
    [result appendString: parsedString];
  [parsedString release];
}

/* GCS: custom error function to ignore errors */
static void structuredError(void * userData,
      xmlErrorPtr error)
{
   /* ignore all errors */
   (void)userData;
   (void)error;
}

- (NSString*) flattenHTML
/*" Interprets the receiver as HTML, removes all tags
    and returns the plain text. "*/
{
  int mem_base = xmlMemBlocks();
  NSMutableString* result = [NSMutableString string];
  xmlSAXHandler handler; bzero(&handler,
      sizeof(xmlSAXHandler));
  handler.characters = &charactersParsed;
  
  /* GCS: override structuredErrorFunc to mine so
      I can ignore errors */
  xmlSetStructuredErrorFunc(xmlGenericErrorContext,
      &structuredError);
  
  htmlSAXParseDoc((xmlChar*)[self UTF8String], "utf-8",
      &handler, result);
    
  if (mem_base != xmlMemBlocks()) {
    NSLog( @"Leak of %d blocks found in htmlSAXParseDoc",
      xmlMemBlocks() - mem_base);
  }
  return result;
}

@end

Update 05jan2007: Be sure to read this comment for a better solution.

22 Responses to “Strip HTML Tags”

  1. Jonathan Wight Says:

    That’s a lot of code for something that is really simple to do.

    10.4’s NSXML* classes can do it all for you. Load in your HTML into an NSXMLDocument, using the tidy options to convert the HTML to XHTML.
    Then use a really simple XSLT transform to transform the XHTML into text, stripping out the tags you don’t want (I decided to strip out everything in HEAD and SCRIPT tags, you will probably want to add some more tags)

    NSURL *theURL = [NSURL fileURLWithPath:@"/Users/schwa/Desktop/Test.html"];
    NSError *theError = NULL;
    NSXMLDocument *theDocument = [[[NSXMLDocument alloc] initWithContentsOfURL:theURL options:NSXMLDocumentTidyHTML error:&theError] autorelease];

    NSString *theXSLTString = @"\
    \
    \
    \
    \
    ";

    NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:NULL error:&theError];
    NSString *theString = [[[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding] autorelease];
    NSLog(@"%@", theString);

  2. Jonathan Wight Says:

    Looks like Wordpress ate the style sheet. (Obvious in retrospect). Here it is (hopefully unmangled):

    <?xml version='1.0' encoding='utf-8'?>
    <xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>
    <xsl:output method='text'/>
    <xsl:template match='xhtml:head'></xsl:template>
    <xsl:template match='xhtml:script'></xsl:template>
    </xsl:stylesheet>";

  3. geoff Says:

    Thanks Jonathan! I replaced my code with yours, which is much cleaner. It is still plenty fast. As an added bonus, I’m able to remove my dependency on libxml2.

  4. Dan Wood Says:

    This is great, guys! I’ve updated my “flatten” implementation (kept for posterity) to point to this entry.

    Jon, I’m a bit confused about where the style sheet goes … is that the XSLT string that looks like a bunch of escaped newlines in the previous post?

  5. geoff Says:

    Dan, yes, in Jon’s code the style sheet is assigned to theXSLTString. Wordpress mangled it in his first post.

  6. Dan Wood Says:

    It works great!

    BTW, I noticed in NSXMLDocument that there’s a method XMLDataWithOptions: which you can pass in a parameter of NSXMLDocumentTextKind which “outputs the string value of the document by extracting the string values from all text nodes.” Except that it DOESN’T. If that method worked properly (it actually outputs XHTML), it would be an even simpler approach! I’ve just reported this to Apple; I hope others do too.

  7. geoff Says:

    For those of you following along at home, my bug report was a duplicate of 4296059. I hate it when that happens.

  8. Dan Wood Says:

    One caveat: I found some problems (maybe a bug in the XML parser) if I passed in a very short string with no markup. So what I’m doing is scanning to see if the string has any HTML-looking markup. Only if it has that do I pass the string to the parser. Otherwise I leave it unchanged.
    static NSCharacterSet *sHTMLSet = nil;
    if (nil == sHTMLSet)
    {
    sHTMLSet = [[NSCharacterSet characterSetWithCharactersInString:@"&"] retain];
    }
    if (NSNotFound != [result rangeOfCharacterFromSet:sHTMLSet].location)
    { …. }

  9. Dan Wood Says:

    Here’s another problem, maybe some minor aspect of the XSLT string. If you pass in “” as your string, then instead of getting a string back, you get a mostly-empty XML document back! It turns out that objectByApplyingXSLTString: will return NSData or an NSXMLDocument depending on what’s passed into it.

  10. Brad Miller Says:

    Thanks for the NSXML method guys. I had been using something that was a bit slower and a lot more messy looking.

    I ran into an issue tonight that is another gotcha with Jon’s method. If the string being processed only contains html, something like “”, an exception is thrown.

    In this case, objectByApplyingXSLTString… returns an NSXMLDocument containing “” instead of an NSData object which naturally causes initWithData to choke. So you need to check if theData is actually data and not an NSXMLDocument before creating the string.

    Looking at the docs, I don’t think it’s a bug, but they don’t spell out clearly what’s returned if the XSLT transform results in an empty string. It would be nice if it returned an empty string instead of an empty xml document. I’ll file a bug on it tomorrow when I’m a little bit more awake.

  11. geoff Says:

    A couple of clarifications about the above three comments. First, in my testing, if the specified string contains no markup, it needs to contain at least 12 characters or else an error is returned from objectByApplyingXSLTString that specifies that the document is empty. Newlines count, so you could append a bunch of newlines to the end of the string. Also, I believe that the NSCharacterSet created in Dan’s example isn’t initialized with an ampersand. Wordpress probably messed up the code. I expect that he’s creating a character set with angle brackets. If this is the case, there is still the possibility that the string could contain a single angle bracket and no markup which would still result in an error.

    I’m not sure what should be between the quotes in Dan’s latest post. It may be similar to the issue reported by Brad. Wordpress also stripped HTML from Brad’s post. I’m not sure what was in his example, but if the string only contains an img tag, for example, it will exhibit the behavior described by Dan and Brad. To check if the object returned from objectByApplyingXSLTString is really an NSData object, you can invoke isKindOfClass.

  12. Brad Miller Says:

    Sorry, it was 2 am here when I posted and I didn’t catch that the html was stripped. Good guess at my string, it was a single image tag and my work around was exactly what you suggested.

    Thanks for the note about the string needing at least 12 characters. I’ll add a check for length too. I had already added a check similar to Dan’s that looks for brackets or an ampersand.
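
The two pre-checks discussed in the comments above can be sketched in plain C, which also compiles unchanged inside an Objective-C file. This is only an illustration, not anyone's posted code: the function names and the MIN_PARSE_LENGTH constant are hypothetical, the threshold of 12 is geoff's empirical observation, and strlen counts bytes rather than characters, so the length check is only exact for ASCII text.

```c
#include <stdbool.h>
#include <string.h>

/* Threshold from geoff's testing above; an observation, not a documented limit. */
#define MIN_PARSE_LENGTH 12

/* True if the text contains an angle bracket or an ampersand,
   i.e. something that might be a tag or an entity. */
static bool has_markup_chars(const char *text)
{
    return strpbrk(text, "<>&") != NULL;
}

/* True if the text is at least MIN_PARSE_LENGTH bytes long. */
static bool meets_min_length(const char *text)
{
    return strlen(text) >= MIN_PARSE_LENGTH;
}
```

How the two checks are combined is left open here, as it is in the comments. As geoff notes, a string with a single stray angle bracket passes has_markup_chars without containing any real markup, so the result of the transform still needs to be checked.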

  13. joao sampaio Says:

    if it’s something simple you can always do this :

    - (NSString *)stripTags:(NSString *) html {
    NSMutableString *result = [[NSMutableString alloc] initWithCapacity:[html length]];
    BOOL iguenore = YES;
    int index;
    unichar c;

    for (index = 0; index ‘) {
    iguenore = NO;
    continue;
    }
    if (!iguenore) {
    [result appendFormat:@"%C",[html characterAtIndex:index]];
    }
    }
    return result;
    }

    this is just something i wrote into some of my applications

  14. joao sampaio Says:

    because of the use of greater and less i guess that some code failed to display, i am sorry
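
The loop in comment 13 lost its angle-bracket comparisons, so what follows is a guess at the general idea rather than a reconstruction of joao's code: a minimal flag-based sketch in plain C. The name strip_tags and the caller-supplied output buffer are assumptions made for illustration.

```c
#include <stddef.h>

/* Copies src into dst, skipping every byte from a '<' up to and
   including the next '>'. A '>' outside a tag is dropped as well.
   dst must have room for strlen(src) + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
static size_t strip_tags(const char *src, char *dst)
{
    int inside_tag = 0;
    size_t out = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        if (c == '<') {          /* a tag starts: stop copying */
            inside_tag = 1;
            continue;
        }
        if (c == '>') {          /* a tag ends: resume copying */
            inside_tag = 0;
            continue;
        }
        if (!inside_tag) {
            dst[out++] = c;
        }
    }
    dst[out] = '\0';
    return out;
}
```

This does not decode entities such as &amp;amp; and it keeps the text inside script and style elements, which the parser-based approaches above can handle. Because '<' and '>' are single ASCII bytes, the byte-wise loop does not split multi-byte UTF-8 sequences.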

  15. Michael Kaye Says:

    Thank you…works brilliantly…

    I assume this is still the easiest way to do this with Xcode 3.x and the Leopard SDK?

  16. geoff Says:

    I haven’t checked to see if there is a new feature available in Leopard that makes this easier. If someone finds one, by all means, please leave a comment.

  17. Daniel Says:

    This is better and faster and works with the iPhone SDK:

    - (NSString *)flattenHTML:(NSString *)html
    {

    NSScanner *theScanner;
    NSString *text;

    theScanner = [NSScanner scannerWithString:html];

    while ([theScanner isAtEnd] == NO) {

    //remove html tag
    [theScanner scanUpToString:@"<" intoString:NULL];
    [theScanner scanString:@"" intoString:&text];

    html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"",text] withString:@""];

    }

    return html;
    }

    Regards from Switzerland

  18. Daniel Says:

    Wordpress has deleted the phrase in stringWithFormat:@”"

    Put in the @”" following without the +

  19. Daniel Says:

    StringWithFormat:@”"

  20. Daniel Says:

    Shit… You will find it out..

  21. Ashley Says:

    How many times did you try before the first fast program crashed? I used the same one, but it didn’t result in any crashes.

  22. Hasnat Says:

    @Daniel
    you cannot add less than greater than signs in the comment here
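
Since the string literals in comment 17 were eaten, here is a rough C sketch of the same scan-and-delete idea: find a '<', find the next '>', and remove that span. It is an approximation under stated assumptions, not Daniel's code: the name remove_tag_spans is hypothetical, and it edits a writable buffer in place instead of using NSScanner.

```c
#include <string.h>

/* Removes every "<...>" span from text in place.
   If a '<' has no closing '>', it and the rest of the text are kept. */
static void remove_tag_spans(char *text)
{
    char *open = strchr(text, '<');

    while (open != NULL) {
        char *close = strchr(open, '>');
        if (close == NULL) {
            break;               /* unterminated tag: leave the tail alone */
        }
        /* Shift the remainder, including its NUL, over the tag. */
        memmove(open, close + 1, strlen(close + 1) + 1);
        open = strchr(open, '<');
    }
}
```

Unlike a flag-based loop, this version leaves a lone '<' or '>' in the text, so a string such as "1 < 2" comes back unchanged.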
