LeoYu's Tech Blog

Machine Learning, Natural Language Processing, Computer Vision, GPU, Spark, Python

Pyspider


Introduction

PySpider is a powerful web crawler system written by a Chinese developer, with a powerful WebUI. It is implemented in Python with a distributed architecture and supports multiple database backends; the WebUI provides a script editor, task monitor, project manager, and result viewer. Online demo: http://demo.pyspider.org/

  • Write scripts in Python with a powerful API
  • Powerful WebUI with script editor, task monitor, project manager, and result viewer
  • MySQL, MongoDB, or SQLite as the backend database
  • JavaScript page support
  • Task priority, retry, periodic tasks, and recrawl by age or by marks in the index page (such as update time)
  • Distributed architecture

Features

webui

  • Web-based visual task monitoring
  • Web-based script editing with single-step debugging
  • Capture of exceptions, logs, print output, etc.

scheduler

  • Task priorities
  • Periodic scheduled tasks
  • Rate limiting / flow control
  • Recrawl scheduling based on a time period or on marks from the referring page (e.g., an update time)

fetcher

  • dataurl support, for simulating fetches with injected content
  • Control over fetch parameters: method, header, cookie, proxy, etag, last_modified, timeout, etc.
  • Page rendering support via an adapted WebKit engine such as PhantomJS

processor

  • Built-in pyquery for jQuery-style page parsing
  • Full in-script control over every scheduling and fetch parameter
  • Passing information forward to follow-up tasks
  • Exception capture
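To make the processor model concrete, here is a minimal handler sketch in the style of the stock template that pyspider's WebUI generates; the seed URL, selectors, and the age/every values are illustrative, not part of this post.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)          # re-run on_start once a day
    def on_start(self):
        # Seed URL (illustrative)
        self.crawl('http://demo.pyspider.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # treat fetched pages as fresh for 10 days
    def index_page(self, response):
        # response.doc is a pyquery object: jQuery-style selectors
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is written to the result database
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }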

SimApp: A Framework for Detecting Similar Mobile Applications by Online Kernel Learning


Abstract: With the popularity of smart phones and mobile devices, the number of mobile applications (a.k.a. “apps”) has been growing rapidly. Detecting semantically similar apps from a large pool of apps is a basic and important problem, as it is beneficial for various applications, such as app recommendation, app search, etc. However, there is no systematic and comprehensive work so far that focuses on addressing this problem. In order to fill this gap, in this paper, we explore multi-modal heterogeneous data in app markets (e.g., description text, images, user reviews, etc.), and present “SimApp” – a novel framework for detecting similar apps using machine learning. Specifically, it consists of two stages: (i) a variety of kernel functions are constructed to measure app similarity for each modality of data; and (ii) an online kernel learning algorithm is proposed to learn the optimal combination of similarity functions of multiple modalities. We conduct an extensive set of experiments on a real-world dataset crawled from Google Play to evaluate SimApp, from which the encouraging results demonstrate that SimApp is effective and promising.
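The abstract does not give the update rule, so purely as a loose illustration, here is a generic multiplicative-weights sketch for learning a convex combination of per-modality kernel similarities online. All names, the input format, and the squared loss are assumptions; this is a stand-in for the flavor of online kernel combination, not the paper's actual algorithm.

import numpy as np

def combine_kernels_online(kernel_scores, labels, eta=0.1):
    """Hedge-style online learning of convex kernel-combination weights.

    kernel_scores: (T, K) array; row t holds the K per-modality similarity
                   scores (in [0, 1]) for app pair t -- hypothetical format.
    labels:        (T,) array, 1 if the pair is truly similar, else 0.
    Returns the learned weight vector on the K-simplex.
    """
    T, K = kernel_scores.shape
    w = np.ones(K) / K                          # start from uniform weights
    for t in range(T):
        # Per-kernel squared loss against the ground-truth label.
        loss = (kernel_scores[t] - labels[t]) ** 2
        w *= np.exp(-eta * loss)                # downweight badly performing kernels
        w /= w.sum()                            # renormalize to the simplex
    return w

The combined similarity of a new app pair is then the weighted sum of its per-modality scores under the learned weights.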

Word Embedding


Word embeddings are currently the hottest new technique in NLP; reportedly more than 40 papers at NAACL 2015 this year relate to Word2vec.

@王威廉: Steve Renals counted the ICASSP accepted-paper titles containing "deep learning" and found 44, while NAACL had 0. One view is that language (words, sentences, discourse, etc.) is a high-level cognitive abstraction produced by human cognition, whereas speech and images are lower-level raw input signals, so the latter two are better suited to learning features with deep learning.
March 4, 2013, 14:46
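As a quick hands-on illustration (not from the original post), training word vectors with gensim's Word2Vec looks roughly like this; the toy corpus and parameter values are assumptions, and the parameter names follow gensim 4.x.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative data).
sentences = [
    ["deep", "learning", "for", "speech"],
    ["word", "embeddings", "for", "language"],
    ["speech", "and", "images", "are", "raw", "signals"],
]

# Train a skip-gram word2vec model (sg=1); parameters are illustrative.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["speech"].shape)          # (50,) embedding vector
print(model.wv.most_similar("speech"))   # nearest neighbors by cosine similarity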

First Blog Post


This is my first post.

Understand the ways of the world without becoming worldly; understand human nature and embrace all things.

This blog covers the following technical topics:

Big Data & Parallel Computing

  • Hadoop
  • Spark
  • Storm

Research

  • Machine Learning
  • Natural Language Processing
  • Computer Vision

Java Code

// CallbackInterface.java (interfaces are declared without parentheses)
public interface CallbackInterface {
    void dothings();
    void exeMethods();
}

// Server1.java (each public type must live in its own file)
public class Server1 {

    public static void main(String[] args) {
        System.out.println("hello world!!!");
    }

}
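A brief usage note (not in the original post): such a callback interface is typically supplied to a caller via an anonymous class; the messages below are illustrative.

CallbackInterface cb = new CallbackInterface() {
    @Override public void dothings() { System.out.println("doing things"); }
    @Override public void exeMethods() { System.out.println("executing methods"); }
};
cb.dothings();
cb.exeMethods();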

Python Code

import numpy as np
import scipy.sparse
import scipy.optimize

def softmax_cost(theta, num_classes, input_size, lambda_, data, labels):
    """Compute the softmax regression cost and gradient.

    :param theta: flattened (num_classes * input_size) parameter vector
    :param num_classes: the number of classes
    :param input_size: the size N of an input vector
    :param lambda_: weight decay (L2 regularization) parameter
    :param data: the N x M input matrix, where each column corresponds
                 to a single training example
    :param labels: an M x 1 matrix containing the labels for the input data
    """
    m = data.shape[1]
    theta = theta.reshape(num_classes, input_size)
    theta_data = theta.dot(data)
    # Subtract the max before exponentiating for numerical stability;
    # softmax is invariant to shifting all logits by a constant.
    theta_data = theta_data - np.max(theta_data)
    prob_data = np.exp(theta_data) / np.sum(np.exp(theta_data), axis=0)
    # Indicator matrix: indicator[k, i] = 1 iff example i has label k.
    # Passing shape= guards against classes that never occur in `labels`.
    indicator = scipy.sparse.csr_matrix(
        (np.ones(m), (labels, np.arange(m))), shape=(num_classes, m))
    indicator = np.array(indicator.todense())
    # Use -1.0 so the expression also works under Python 2 integer division.
    cost = (-1.0 / m) * np.sum(indicator * np.log(prob_data)) \
        + (lambda_ / 2) * np.sum(theta * theta)
    grad = (-1.0 / m) * (indicator - prob_data).dot(data.transpose()) + lambda_ * theta
    return cost, grad.flatten()
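As a quick smoke test (hypothetical, not part of the original post), the function plugs directly into scipy.optimize.minimize, since it returns the gradient alongside the cost; the random data and hyperparameters below are illustrative.

num_classes, input_size, m = 3, 4, 10
rng = np.random.RandomState(0)
data = rng.randn(input_size, m)                 # N x M toy inputs
labels = rng.randint(num_classes, size=m)       # M labels in {0, 1, 2}
theta0 = 0.005 * rng.randn(num_classes * input_size)

result = scipy.optimize.minimize(
    softmax_cost, theta0,
    args=(num_classes, input_size, 1e-4, data, labels),
    method='L-BFGS-B', jac=True, options={'maxiter': 100})
print(result.fun)  # final cost after optimization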

C/C++ Code

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <pthread.h>

#define MAX_STRING 100
#define EXP_TABLE_SIZE 1000
#define MAX_EXP 6
#define MAX_SENTENCE_LENGTH 1000
#define MAX_CODE_LENGTH 40

const int vocab_hash_size = 30000000;  // Maximum 30 * 0.7 = 21M words in the vocabulary

typedef float real;                    // Precision of float numbers

struct vocab_word {
  long long cn;                        // word count
  int *point;                          // Huffman tree path (node indices)
  char *word, *code, codelen;          // word string, Huffman code, code length
};

// Declarations used below (allocated/filled elsewhere in word2vec.c):
int *table;                            // unigram table for negative sampling
const int table_size = 1e8;            // number of slots in the table
struct vocab_word *vocab;              // the vocabulary array
long long vocab_size;                  // number of words in the vocabulary

// Build the unigram table: each word receives a share of slots proportional
// to count^0.75, so uniform indexing into `table` draws negative samples
// from the smoothed unigram distribution.
void InitUnigramTable() {
  int a, i;
  double train_words_pow = 0;
  double d1, power = 0.75;
  table = (int *)malloc(table_size * sizeof(int));
  for (a = 0; a < vocab_size; a++)
    train_words_pow += pow(vocab[a].cn, power);
  i = 0;
  d1 = pow(vocab[i].cn, power) / train_words_pow;
  for (a = 0; a < table_size; a++) {
    table[a] = i;
    if (a / (double)table_size > d1) {
      i++;
      d1 += pow(vocab[i].cn, power) / train_words_pow;
    }
    if (i >= vocab_size) i = vocab_size - 1;
  }
}
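For context, a hypothetical fragment mirroring how word2vec.c draws a negative sample from this table elsewhere in its training loop: the table is indexed with a fast linear-congruential random number.

// Hypothetical fragment: draw one negative sample from the table.
unsigned long long next_random = 1;
next_random = next_random * (unsigned long long)25214903917 + 11;
long long target = table[(next_random >> 16) % table_size];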

Shell

#!/bin/bash
# Program:
#       This program shows "Hello World!" in your screen.
# History:
# 2005/08/23    VBird   First release
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:~/bin
export PATH
echo -e "Hello World! \a \n"
exit 0
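To run it (standard shell usage, not part of the original post): save the script as hello.sh, mark it executable, then execute it.

chmod +x hello.sh
./hello.sh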

My Interests

[Photos: Alizée]