
Setting Up the Bitextor Environment

All of the steps below were carried out on Ubuntu 16.04, running on an Azure F1S machine.

  • First, set up the required environment (GCC, Python, and Java):
apt update && apt upgrade -y
apt-get install automake cmake build-essential autoconf gawk python-pip python-dev python-magic httrack libtidy-dev libxml2-dev libenca-dev screen git sudo wget curl software-properties-common openjdk-8-jdk -y
pip install pip -U
pip install scipy numpy python-Levenshtein keras tensorflow h5py langid nltk regex

wget https://apertium.projectjj.com/apt/install-release.sh -O - | sudo bash
apt -f install apertium-all-dev
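
Before building, it can help to confirm that the toolchain actually ended up on the PATH. The following is a minimal sketch, assuming the packages above expose the usual command names (gcc, make, cmake, python, pip, java, httrack, git); the helper name missing_tools is made up for illustration and is not part of Bitextor.

```shell
# missing_tools NAME...: print every command name that is not found on PATH
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names assumed to come from the packages installed above
missing=$(missing_tools gcc make cmake python pip java httrack git)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all build prerequisites found"
fi
```

If anything is reported as missing, re-running the corresponding install command is usually quicker than debugging a failed ./configure later.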

Building and Installing Bitextor

wget https://sourceforge.net/projects/bitextor/files/bitextor/bitextor-4.1/bitextor-4.1.3.tar.gz
tar zxf bitextor-4.1.3.tar.gz && rm bitextor-4.1.3.tar.gz && cd bitextor-4.1.3
./configure
screen bash -c 'make -j2 && make install'

Using Bitextor

  • Running bitextor with no arguments prints the following usage information:
USAGE: bitextor [OPTIONS] -u URL -d DIRECTORY    -v VOCABULARY LANG1 LANG2
USAGE: bitextor [OPTIONS] -d DIRECTORY           -v VOCABULARY LANG1 LANG2
USAGE: bitextor [OPTIONS] -U FILE                -v VOCABULARY LANG1 LANG2
USAGE: bitextor [OPTIONS] -D FILE                -v VOCABULARY LANG1 LANG2
USAGE: bitextor [OPTIONS] -e FILE                -v VOCABULARY LANG1 LANG2

WHERE:
  -u URL          URL of a website to crawl (one per line); if option -d is also
                  enabled, the website is downloaded to the directory specified,
                  if not, it is downloaded in a temporary directory in '/tmp' (or
                  a different directory if -t option is used).
  -U FILE         tab-separated file containing a list of URLs to crawl and their
                  corresponding destination path (one per line).
  -d DIRECTORY    folder containing a crawled website (if option -u is enabled,
                  the website is first downloaded and then processed).
  -D FILE         file containing a list of URIs to folders containing crawled
                  websites.
  -e FILE         uses as an input the output of the module bitextor-webdir2ett
  LANG1           selected language with two letters code (ISO 639-1): en, es, fr, de ...
  LANG2           selected language with two letters code (ISO 639-1): en, es, fr, de ...
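
As a small illustration of the -U input described above, the sketch below writes a tab-separated list of URLs and destination paths and checks its shape. The URLs, the ./crawl paths, and the file name urls.tsv are placeholders chosen for this example, not values from Bitextor itself.

```shell
# Write one "URL<TAB>destination path" pair per line, as described for -U
# (example.com / example.org and the ./crawl paths are placeholders)
printf '%s\t%s\n' \
    "https://www.example.com/" "./crawl/example-com" \
    "https://www.example.org/" "./crawl/example-org" > urls.tsv

# Sanity check: every line must have exactly two tab-separated fields
awk -F '\t' '
    NF != 2 { bad = 1; print "line " NR ": expected 2 fields, got " NF }
    END { exit bad }
' urls.tsv && echo "urls.tsv looks well-formed"
```

Following the usage line for -U, the file would then be passed as bitextor -U urls.tsv -v VOCABULARY LANG1 LANG2.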

OPTIONS:
  -L PATH           custom path where the directory containing the logs of the
                    different modules of bitextor will be stored
  -l LETTR          custom path where the file with extension .lettr (language
                    encoded and typed data with 'raspa') will be created
                    (/lettr.XXXXXX by default).
  -I PATH           custom path where the output of the intermediate files produced
                    by the modules of bitextor will be stored.
  -b NUM            if this option is enabled, the document alignment process is
                    run in both directions and only the first NUM candidates in
                    every direction will be taken into account. With this, the list
                    of final candidates will be obtained computing the average of
                    every pair of candidates in both directions. This option should
                    improve precision and drop recall, since those candidates in a
                    position lower than NUM will be discarded for the alignment
                    (NUM must be in [1,10])
  -v VOCABULARY     option for using a custom multilingual vocabulary for preliminary
                    document alignment. The vocabulary must be a tab-separated file,
                    in which the first line contains the names of the languages
                    corresponding to each column, and the rest of the lines must
                    contain the same word translated to all these languages.
  -m MAX_LINES      maximum number of wrong segment alignments tolerated to accept a
                    pair of documents as a valid document alignment. If this number
                    is reached, the whole document pair is discarded (5 by default).
  -q MIN_QUALITY    threshold for Hunalign confidence score; those pairs of segments
                    with a score lower than MIN_QUALITY will be considered wrong and
                    they will be removed (0 by default).
  -t TMP-DIR        alternative tmp directory (/tmp by default).
  -x                if this option is enabled, the output of the tool will be
                    formatted in the standard XML-based format TMX.
  -a                if this option is enabled, Bitextor will perform the alignment only
                    at the level of documents. The output will be tab-separated, with
                    three fields: two with the name of the documents aligned and one with
                    the score provided by hunalign for the pair of documents.
  -M                morphological analyser in the Apertium platform for source language
                    that will allow to apply word matching directly on lemmas; this is an
                    important tool for agglutinative languages in order to obtain a good
                    coverage with the bilingual lexicon approach.
  -N                morphological analyser in the Apertium platform for target language
                    that will allow to apply word matching directly on lemmas; this is an
                    important tool for agglutinative languages in order to obtain a good
                    coverage with the bilingual lexicon approach.
  -O FILE           if this option is enabled, the output of bitextor will be redirected
                    to file FILE, if not it is redirected to the standard output.
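
The -v description above is easier to follow with a concrete file in hand. This is a minimal sketch of a two-language vocabulary, assuming the header line uses the same two-letter codes as LANG1 and LANG2 (the help text only says "names of the languages"); the file name sample-en-zh.dic and the three entries are illustrative only.

```shell
# A tiny en/zh vocabulary in the layout the -v description asks for:
# header line with language names, then the same word in each language.
# The entries are illustrative; a real dictionary needs far more coverage.
printf 'en\tzh\n'         >  sample-en-zh.dic
printf 'hello\t你好\n'    >> sample-en-zh.dic
printf 'world\t世界\n'    >> sample-en-zh.dic
printf 'computer\t电脑\n' >> sample-en-zh.dic

# Check that every line has as many columns as the header
awk -F '\t' '
    NR == 1 { cols = NF; next }
    NF != cols { bad = 1; print "line " NR ": " NF " columns, header has " cols }
    END { exit bad }
' sample-en-zh.dic && echo "sample-en-zh.dic: consistent column count"
```

A column-count check like this catches the most common formatting slip (spaces instead of tabs) before a long crawl is started.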

Running the Bilingual Crawl

bitextor [OPTIONS] -d DIRECTORY LANG1 LANG2
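# Runs a series of operations on the files; the individual steps are:

bitextor-webdir2ett webpath > ett.txt
# Convert the HTML files to base64 and store them all in a single text file

bitextor-ett2lett ett.txt > lett.txt
# Identify the language of each page with langid.py

bitextor-lett2lettr lett.txt > lettr.txt
# Generate structural information for each document, kept as a string

bitextor-lett2idx --lang1 en --lang2 fr lettr.txt > idx.txt
# Build a word index from the lettr file: language | word | IDs of the documents it occurs in

bitextor-idx2ridx -d en-fr.dic --lang1 en --lang2 fr idx.txt > ridx.txt
# Using the index and the dictionary's word correspondences, pick the most likely parallel documents for each document

bitextor-distancefilter -l lettr.txt ridx.txt > dis.txt
# Refine the likelihood estimates using the HTML structure information generated earlier

bitextor-align-documents -l lettr.txt dis.txt > aligndocs.txt
# Produce the aligned document pairs based on the distance

bitextor-align-segments --lang1 en --lang2 fr aligndocs.txt > alignsegs.txt
# Align the documents at the sentence level

bitextor-cleantextalign -q 0.3 -m 5 alignsegs.txt > cleanedsegs.txt
# -m: give up on a document once it has more than m wrong alignments; -q: drop segments scoring below q, a number in [-1,1]

bitextor-buildTMX --lang1 en --lang2 fr cleanedsegs.txt > res.tmx
# The final result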
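
The individual stages above can be chained so that a failure stops the run early instead of feeding an empty file to the next stage. The following is a rough sketch rather than Bitextor's own driver: the helpers run_step and bitextor_pipeline are hypothetical names, and treating an empty output file as a failure is an assumption made here for convenience. The commands and flags are the ones listed above.

```shell
# run_step OUTFILE CMD [ARGS...]: run one stage, save its stdout to OUTFILE,
# and fail if the stage errors out or writes nothing (the latter is an
# assumption of this sketch, not a Bitextor rule).
run_step() {
    out="$1"; shift
    "$@" > "$out" || { echo "failed: $*" >&2; return 1; }
    [ -s "$out" ] || { echo "no output from: $*" >&2; return 1; }
}

# bitextor_pipeline WEBDIR LANG1 LANG2 DICT: the stages listed above, in order
bitextor_pipeline() {
    webdir="$1"; l1="$2"; l2="$3"; dic="$4"
    run_step ett.txt         bitextor-webdir2ett "$webdir" &&
    run_step lett.txt        bitextor-ett2lett ett.txt &&
    run_step lettr.txt       bitextor-lett2lettr lett.txt &&
    run_step idx.txt         bitextor-lett2idx --lang1 "$l1" --lang2 "$l2" lettr.txt &&
    run_step ridx.txt        bitextor-idx2ridx -d "$dic" --lang1 "$l1" --lang2 "$l2" idx.txt &&
    run_step dis.txt         bitextor-distancefilter -l lettr.txt ridx.txt &&
    run_step aligndocs.txt   bitextor-align-documents -l lettr.txt dis.txt &&
    run_step alignsegs.txt   bitextor-align-segments --lang1 "$l1" --lang2 "$l2" aligndocs.txt &&
    run_step cleanedsegs.txt bitextor-cleantextalign -q 0.3 -m 5 alignsegs.txt &&
    run_step res.tmx         bitextor-buildTMX --lang1 "$l1" --lang2 "$l2" cleanedsegs.txt
}
```

With a crawled site in ./apple, a call would look like bitextor_pipeline ./apple en fr en-fr.dic. Keeping every intermediate file around also makes it easy to see which stage lost the data when res.tmx comes out empty.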

One-Step Approach

  • Crawl pages only, without processing:
httrack --depth 3 "https://www.apple.com/" -O ./apple -s0 -p1
  • Crawl and process:
bitextor -u "https://www.apple.com/" -b 1 -v en-zh.dic -x en zh
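
When several candidate sites need to be tried, the one-step command can be wrapped in a loop. This is only a sketch: crawl_sites, the BITEXTOR override variable, and the site-N.tmx naming are inventions for this example, while the flags (-u, -b, -v, -x, -O) are taken from the usage text above.

```shell
# Allow the command to be overridden (e.g. for a dry run); defaults to bitextor
BITEXTOR="${BITEXTOR:-bitextor}"

# crawl_sites LIST DICT LANG1 LANG2: run the one-step command for every URL in
# LIST (one per line), writing each result to its own TMX file via -O.
crawl_sites() {
    list="$1"; dic="$2"; l1="$3"; l2="$4"
    n=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        n=$((n + 1))
        # </dev/null keeps the command from swallowing the rest of LIST
        "$BITEXTOR" -u "$url" -b 1 -v "$dic" -x -O "site-$n.tmx" "$l1" "$l2" </dev/null \
            || echo "failed: $url" >&2
    done < "$list"
}
```

Setting BITEXTOR=echo before calling crawl_sites sites.txt en-zh.dic en zh prints the commands without running them, which is a cheap way to check the list before committing to a long crawl.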

Personal Impressions

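In practice this tool is not very useful. First, it copes poorly with dynamic pages, and nearly every site is dynamic these days. Second, it crawls with a bitextor header, so just about any firewall blocks it. Third, it parallelizes badly: CPU utilization is low and the single-threaded crawler is slow, so unless you park it in the background with something like screen you end up babysitting the machine. All in all, most of my time went into finding sites that could be crawled at all.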