Implementing Sensitive-Word Filtering for Websites (with Word Lists)

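Nearly every website now needs sensitive-word filtering; it has become standard equipment. If your site has none, or you have not handled it properly, you may well be "invited for tea" by the relevant authorities.
I have recently been looking into how to implement sensitive-word filtering for Java web sites. I collected material from around the web, verified it myself, and am writing up the results here for reference.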

Sensitive-word list (for system filtering only; any other use is prohibited): https://t00y.com/file/1764647-442914556

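1. Sensitive-Word Filter Utility Class

The word list is loaded into an ArrayList. A nested loop then searches the input for every substring that matches an entry in the list; each match is replaced with * characters, and the result is the masked string.

In my tests this approach matched accurately and ran reasonably fast.

Initializing the word list: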

// Initialize the sensitive-word list
public void InitializationWork()
{
    // Pre-build a run of replacement characters; masks are sliced from it later
    replaceAll = new StringBuilder(replceSize);
    for (int x = 0; x < replceSize; x++)
    {
        replaceAll.append(replceStr);
    }
    // Load the word list, one word per line
    arrayList = new ArrayList<String>();
    InputStreamReader read = null;
    BufferedReader bufferedReader = null;
    try {
        read = new InputStreamReader(SensitiveWord.class.getClassLoader().getResourceAsStream(fileName), encoding);
        bufferedReader = new BufferedReader(read);
        for (String txt = null; (txt = bufferedReader.readLine()) != null;) {
            // Skip empty lines (an empty word would never advance the search in filterInfo) and duplicates
            if (!txt.isEmpty() && !arrayList.contains(txt))
                arrayList.add(txt);
        }
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (null != bufferedReader)
                bufferedReader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            if (null != read)
                read.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Filtering the input:

public String filterInfo(String str)
{
    sensitiveWordSet = new HashSet<String>();
    sensitiveWordList = new ArrayList<>();
    StringBuilder buffer = new StringBuilder(str);
    // Maps the start index of a match to its end index (exclusive)
    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(arrayList.size());
    String temp;
    for (int x = 0; x < arrayList.size(); x++)
    {
        temp = arrayList.get(x);
        int findIndexSize = 0;
        for (int start = -1; (start = buffer.indexOf(temp, findIndexSize)) > -1;)
        {
            findIndexSize = start + temp.length(); // continue searching after this match
            Integer mapStart = hash.get(start);    // end index already recorded for this start, if any
            if (mapStart == null || findIndexSize > mapStart) // keep the longest match per start index
            {
                hash.put(start, findIndexSize);
            }
        }
    }
    Collection<Integer> values = hash.keySet();
    for (Integer startIndex : values)
    {
        Integer endIndex = hash.get(startIndex);
        // Record the matched word so matches can be counted later
        String sensitive = buffer.substring(startIndex, endIndex);
        if (!sensitive.contains("*")) { // skip spans that an earlier replacement already masked
            sensitiveWordSet.add(sensitive);
            sensitiveWordList.add(sensitive);
        }
        buffer.replace(startIndex, endIndex, replaceAll.substring(0, endIndex - startIndex));
    }
    hash.clear();
    return buffer.toString();
}
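
The method above depends on fields that are only in the downloadable source. As a self-contained illustration of the same idea (try every word against the text and mask each hit), here is a minimal sketch. The class name SimpleWordFilter and its methods are my own, not part of the SensitiveWord download, and it marks hits in a boolean array instead of the start/end map, which is one simple way to cope with overlapping matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the list-scan approach; not the code from the download. */
final class SimpleWordFilter {
    private final List<String> words = new ArrayList<>();

    SimpleWordFilter(List<String> source) {
        for (String w : source) {
            // Skip empty entries and duplicates
            if (w != null && !w.isEmpty() && !words.contains(w)) {
                words.add(w);
            }
        }
    }

    /** Returns the text with every character covered by a listed word replaced by '*'. */
    String filter(String text) {
        boolean[] masked = new boolean[text.length()];
        for (String w : words) {                      // outer loop: every word in the list
            int at = text.indexOf(w);
            while (at >= 0) {                         // inner loop: every occurrence of that word
                Arrays.fill(masked, at, at + w.length(), true);
                at = text.indexOf(w, at + 1);
            }
        }
        StringBuilder out = new StringBuilder(text);
        for (int i = 0; i < masked.length; i++) {
            if (masked[i]) {
                out.setCharAt(i, '*');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SimpleWordFilter f = new SimpleWordFilter(Arrays.asList("bad", "badword", "evil"));
        System.out.println(f.filter("a badword and an evil plan"));
    }
}
```

The cost grows with the number of words times the length of the text, so this style suits small word lists; the tree-based approaches later in the article avoid rescanning the text once per word.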

Download: SensitiveWord
Link: https://pan.baidu.com/s/12RcZ8-jNHMAR__VscRUDfQ Password: qmcw

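2. Java Keyword Filter

This approach matches with a regular expression. In my tests it was slightly slower than the first approach, and its match quality was good.

Main code: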

// Build the regular expression from keywords.properties
private static void initPattern() {
    StringBuffer patternBuffer = new StringBuffer();
    try {
        InputStream in = KeyWordFilter.class.getClassLoader().getResourceAsStream("keywords.properties");
        Properties property = new Properties();
        property.load(in);
        Enumeration<?> enu = property.propertyNames();
        // Join all keywords into one alternation: (word1|word2|...)
        patternBuffer.append("(");
        while (enu.hasMoreElements()) {
            String scontent = (String) enu.nextElement();
            patternBuffer.append(scontent + "|");
            keywordsCount++;
        }
        // Drop the trailing "|" before closing the group
        patternBuffer.deleteCharAt(patternBuffer.length() - 1);
        patternBuffer.append(")");
        // On Unix, convert to UTF-8:
        // pattern = Pattern.compile(new String(patternBuf.toString().getBytes("ISO-8859-1"), "UTF-8"));
        // On Windows, convert to gb2312:
        // pattern = Pattern.compile(new String(patternBuf.toString().getBytes("ISO-8859-1"), "gb2312"));
        // No encoding conversion is applied here
        pattern = Pattern.compile(patternBuffer.toString());
    } catch (IOException ioEx) {
        ioEx.printStackTrace();
    }
}

private static String doFilter(String str) {
    Matcher m = pattern.matcher(str);
//      while (m.find()) { // find each substring that matches the pattern
//          System.out.println("The result is here :" + m.group());
//      }
    // Choose the replacement; here each match becomes a single *
    str = m.replaceAll("*");
    return str;
}
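
One thing to watch with this approach: the keywords are concatenated into the pattern as-is, so a keyword containing a regex metacharacter such as "." or "(" changes what the pattern matches. The sketch below is a possible variant, not the KeyWordFilter code: the class name RegexWordFilter is hypothetical, it takes the keywords as a list instead of reading keywords.properties, it escapes each keyword with Pattern.quote, and it masks one * per matched character rather than a single * per match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of a regex-based filter with escaped keywords. */
final class RegexWordFilter {
    private final Pattern pattern;

    RegexWordFilter(List<String> keywords) {
        String alternation = keywords.stream()
                .filter(k -> k != null && !k.isEmpty())
                // Longest first, because alternation takes the first alternative that matches
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                // Pattern.quote makes every character of the keyword literal
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // "(?!)" never matches, which is a safe pattern for an empty keyword list
        this.pattern = Pattern.compile(alternation.isEmpty() ? "(?!)" : alternation);
    }

    /** Replaces each match with as many '*' characters as the match is long. */
    String filter(String text) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char[] stars = new char[m.end() - m.start()];
            Arrays.fill(stars, '*');
            m.appendReplacement(out, new String(stars));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        RegexWordFilter f = new RegexWordFilter(Arrays.asList("bad", "badword", "a.b"));
        System.out.println(f.filter("badword a.b axb"));
    }
}
```

Without the quoting, the keyword "a.b" would also match "axb", because "." matches any character.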

Download: KeyWordFilter
Link: http://pan.baidu.com/s/1kVBl803 Password: xi24

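3. Filtering with a DFA

This one sounds impressive: it uses a DFA (deterministic finite automaton). I do not understand the algorithm well myself. In my tests its match quality was poor, though it was fast. It can probably be improved, and I would welcome improvements from anyone more expert.

There are two main files: SensitivewordFilter.java and SensitiveWordInit.java

Main code: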

public int CheckSensitiveWord(String txt, int beginIndex, int matchType) {
    boolean flag = false;   // set once the walk has reached the end of a complete word
    int matchFlag = 0;      // number of characters matched so far
    char word = 0;
    Map nowMap = sensitiveWordMap;
    for (int i = beginIndex; i < txt.length(); i++) {
        word = txt.charAt(i);
        nowMap = (Map) nowMap.get(word);     // follow the transition for this character
        if (nowMap != null) {                // a transition exists; check whether a word ends here
            matchFlag++;                     // one more character matched
            if ("1".equals(nowMap.get("isEnd"))) {   // a complete word ends at this character
                flag = true;
                if (SensitivewordFilter.minMatchTYpe == matchType) {   // minimum-match rule: stop at the first word; maximum-match keeps going
                    break;
                }
            }
        }
        else {     // no transition: stop
            break;
        }
    }
    if (matchFlag < 2 || !flag) {   // discard matches shorter than 2 characters or that never reached a word end
        matchFlag = 0;
    }
    return matchFlag;
}
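
SensitiveWordInit.java, which builds sensitiveWordMap, is not shown here. Judging only from how the method above reads the map (a character key leads to the next map, and "isEnd" equal to "1" marks a complete word), a construction could look like the sketch below. The names DfaSketch, build and matchLength are my own assumptions, not the code from the download.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a nested-map DFA in the shape the method above expects. */
@SuppressWarnings("unchecked")
final class DfaSketch {

    /** Builds nested maps: Character key -> next state; "isEnd" = "1" marks a complete word. */
    static Map<Object, Object> build(Iterable<String> words) {
        Map<Object, Object> root = new HashMap<>();
        for (String word : words) {
            Map<Object, Object> node = root;
            for (int i = 0; i < word.length(); i++) {
                Character c = word.charAt(i);
                Map<Object, Object> next = (Map<Object, Object>) node.get(c);
                if (next == null) {
                    next = new HashMap<>();
                    next.put("isEnd", "0");
                    node.put(c, next);
                }
                node = next;
            }
            if (node != root) {
                node.put("isEnd", "1");   // the last character of the word closes it
            }
        }
        return root;
    }

    /** Length of the longest complete word starting at beginIndex, or 0 if there is none. */
    static int matchLength(Map<Object, Object> root, String txt, int beginIndex) {
        Map<Object, Object> node = root;
        int longest = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            node = (Map<Object, Object>) node.get(txt.charAt(i));
            if (node == null) {
                break;
            }
            if ("1".equals(node.get("isEnd"))) {
                longest = i - beginIndex + 1;   // remember only positions where a word ends
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        Map<Object, Object> root = build(Arrays.asList("ab", "abcd"));
        System.out.println(matchLength(root, "abcx", 0));
    }
}
```

This may also explain part of the poor match quality I saw. Under the maximum-match rule, the excerpt keeps incrementing matchFlag after the last complete word, so with the words "ab" and "abcd" the text "abcx" yields a length of 3 even though only "ab" is a word; the sketch records the length only where a word ends. The excerpt also discards every match shorter than 2 characters, so single-character words are never reported.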

Download: SensitivewordFilter
Link: http://pan.baidu.com/s/1ccsa66 Password: mc1x

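4. Multi-way Tree Search

This approach uses a multi-way tree search; for how the algorithm works, see any data-structures reference. It ships as a jar that you call directly to filter text.

In my tests its match quality was good and it was slightly slow.

Usage: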

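// Filter sensitive words, masking them with '*'
FilteredResult result = WordFilterUtil.filterText(str, '*');
// The filtered content
System.out.println("Filtered string:\n" + result.getFilteredContent());
// The original string
System.out.println("Original string:\n" + result.getOriginalContent());
// The sensitive words that were replaced
System.out.println("Replaced sensitive words:\n" + result.getBadWords());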
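
The jar's internals are not shown in this article. Purely as an illustration of what a multi-way tree lookup can look like, here is a small sketch in which each node holds one child per character. TrieFilter, Node, add and filter are hypothetical names and have nothing to do with the actual WordFilterUtil implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-way tree (trie) filter; not the code inside the jar. */
final class TrieFilter {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd;   // true if a word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        if (node != root) {
            node.wordEnd = true;
        }
    }

    /** Masks the longest word found at each position and appends each hit to 'found'. */
    String filter(String text, char mask, List<String> found) {
        StringBuilder out = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int end = -1;                            // exclusive end of the longest word starting at i
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.wordEnd) {
                    end = j + 1;
                }
            }
            if (end < 0) {                           // no word starts here; move on one character
                i++;
                continue;
            }
            found.add(text.substring(i, end));
            for (int k = i; k < end; k++) {
                out.setCharAt(k, mask);
            }
            i = end;                                 // resume after the masked word
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieFilter t = new TrieFilter();
        t.add("bad");
        t.add("badword");
        t.add("evil");
        List<String> found = new ArrayList<>();
        System.out.println(t.filter("badword is evil, bad", '*', found));
        System.out.println(found);
    }
}
```

Each position in the text is checked by walking down the tree once, so the work per position depends on the length of the longest word rather than on how many words are in the list.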

Download: WordFilterUtil
Link: http://pan.baidu.com/s/1nvftzeD Password: 5t2h

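Those are the results of my investigation; I hope they are useful.

Finally, here is a download link for a large collection of sensitive-word lists:
Link: https://pan.baidu.com/s/1n-GH-OO6nQ5oJk5h5qHVkA Password: qsv9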

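This write-up draws on the following articles:
– "Efficient and Precise" Sensitive Character and Word Filtering (《高效精准》敏感字&词过滤)
– Java Keyword Filtering (Java关键字过滤)
– Implementing Sensitive-Word Filtering in Java (Java实现敏感词过滤)
– An Efficient Java Sensitive-Word and Keyword Filtering Toolkit: Filtering Illegal Words and Sentences (高效Java敏感词、关键词过滤工具包_过滤非法词句)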

Other
– Personal blog: http://www.sendtion.cn
– CSDN: http://blog.csdn.net/shuyou612
– GitHub: https://github.com/sendtion

Original article: https://blog.csdn.net/shuyou612/article/details/74931955

Published by

风君子
