A C# Function for Extracting Text from HTML Code

2016-01-29 13:34

【 tulaoshi.com - ASP.NET 】

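// Requires: using System.Text.RegularExpressions;
/// <summary>
/// Removes HTML markup.
/// </summary>
/// <param name="strHtml">Source text that contains HTML.</param>
/// <returns>The text with the markup removed.</returns>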
public static string StripHTML(string strHtml)
{
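string[] aryReg = {
          // script blocks that open and close on a single line
          @"<script[^>]*?>.*?</script>",
          // any opening or closing tag, with optional attributes
          @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
          // a line break followed by whitespace
          @"([\r\n])[\s]+",
          @"&(quot|#34);",
          @"&(amp|#38);",
          @"&(lt|#60);",
          @"&(gt|#62);",
          @"&(nbsp|#160);",
          @"&(iexcl|#161);",
          @"&(cent|#162);",
          @"&(pound|#163);",
          @"&(copy|#169);",
          @"&#(\d+);",
          @"-->",
          @"<!--.*\n"
         };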

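   // Replacements, index-aligned with aryReg.
   string[] aryRep = {
           "",        // script blocks
           "",        // tags
           "",        // line break plus whitespace
           "\"",      // &quot;
           "&",       // &amp;
           "<",       // &lt;
           ">",       // &gt;
           " ",       // &nbsp;
           "\xa1",    // &iexcl; (chr(161))
           "\xa2",    // &cent; (chr(162))
           "\xa3",    // &pound; (chr(163))
           "\xa9",    // &copy; (chr(169))
           "",        // any other numeric entity is dropped
           "\r\n",    // comment terminator "-->"
           ""         // comment opener up to the end of its line
          };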

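   string strOutput = strHtml;
   // Apply each pattern in order; matches of aryReg[i] become aryRep[i].
   for (int i = 0; i < aryReg.Length; i++)
   {
    Regex regex = new Regex(aryReg[i], RegexOptions.IgnoreCase);
    strOutput = regex.Replace(strOutput, aryRep[i]);
   }

   // Strings are immutable, so the result of Replace must be assigned back.
   // Note that this also removes any < or > decoded from &lt; and &gt; above.
   strOutput = strOutput.Replace("<", "");
   strOutput = strOutput.Replace(">", "");
   strOutput = strOutput.Replace("\r\n", "");

return strOutput;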
}
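
For comparison, here is a shorter sketch of the same idea. It is not the article's method: it swaps the hand-written entity table for `System.Net.WebUtility.HtmlDecode` and assumes a runtime where that class is available (.NET Framework 4.0 or later, or .NET Core). The class name `HtmlTextSketch` and the method name `ToPlainText` are hypothetical, chosen only for illustration. Like any regex-based approach, it is meant for simple markup and may mishandle malformed HTML.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original article.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Drop script blocks together with their content.
        // Singleline lets '.' match line breaks, so multi-line scripts are covered.
        string text = Regex.Replace(html, @"<script[^>]*>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments, including multi-line ones.
        text = Regex.Replace(text, @"<!--.*?-->", "", RegexOptions.Singleline);

        // Drop every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Decode named and numeric entities (&amp;, &#169;, ...) in one call.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Calling `HtmlTextSketch.ToPlainText("<p>Tom &amp; Jerry</p><script>alert(1);</script>")` returns `Tom & Jerry`. Tags are removed before entities are decoded, so a literal `&lt;` in the text survives as `<` instead of being mistaken for markup.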

Source: http://www.tulaoshi.com/n/20160129/1490670.html
