192 Word Frequency

Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity's sake, you may assume:

words.txt contains only lowercase characters and space ' ' characters.
Each word must consist of lowercase characters only.
Words are separated by one or more whitespace characters.

Example:

Assume that words.txt has the following content:

the day is sunny the the
the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1

Note:

Don't worry about handling ties, it is guaranteed that each word's frequency count is unique.
Could you write it in one-line using Unix pipes?

Link: https://leetcode.com/problems/word-frequency/description/

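Approach

Replace spaces with newlines so that each word sits on its own line
Sort the words so that identical words end up on adjacent lines
Count the occurrences of each word
Sort by count in descending order
Print the result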
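The steps above can be traced one stage at a time. The following is a minimal sketch, assuming the sample input from the problem statement is written to a temporary file (the temporary file and the variable names are illustrative only, not part of the original solution):

```shell
# Sample input from the problem statement, written to a temporary file
# (an assumption for illustration; the task itself reads words.txt).
input=$(mktemp)
printf 'the day is sunny the the\nthe sunny is is\n' > "$input"

# Step 1: one word per line (-s squeezes runs of separators into one newline).
words=$(tr -s ' ' '\n' < "$input")

# Step 2: sort so that identical words become adjacent.
sorted=$(printf '%s\n' "$words" | sort)

# Step 3: count each run of identical lines ("count word").
counted=$(printf '%s\n' "$sorted" | uniq -c)

# Step 4: sort numerically by count, largest first.
ranked=$(printf '%s\n' "$counted" | sort -nr)

# Step 5: swap the columns to print "word count".
result=$(printf '%s\n' "$ranked" | awk '{print $2, $1}')

printf '%s\n' "$result"
rm -f "$input"
```

Keeping each stage in its own variable makes it easy to print any intermediate result while debugging; the one-line versions chain the same commands with pipes.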

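Code

Method 1

cat words.txt | tr -s " " "\n" | sort | uniq -c | sort -nr | awk '{print $2, $1}'

Method 2

awk '{for (i = 1; i <= NF; i++) print $i}' words.txt | sort | uniq -c | sort -nr | awk '{print $2, $1}'

In both methods, sort -nr compares the counts as numbers, so a count of 10 ranks above a count of 9.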

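Key concepts review

1. The tr command

tr translates or deletes characters. It reads from standard input, transforms the characters, and writes the result to standard output.

Syntax:

tr [-cdst] [--help] [--version] [first character set] [second character set]  
tr [OPTION]... SET1 [SET2]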

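Options:

-c, --complement: use the complement of SET1. Characters in SET1 are left alone and all other characters are processed; when translating, characters not in SET1 are replaced using SET2.
-d, --delete: delete every character in SET1 without translating.
-s, --squeeze-repeats: replace each run of a repeated character with a single occurrence. The characters squeezed are those in the last set given (SET2 when translating, otherwise SET1).
-t, --truncate-set1: truncate SET1 to the length of SET2 before translating.

Parameters:

Character set 1: the source characters to translate or delete. When translating, character set 2 must also be given as the target set; when deleting, it is not needed.
Character set 2: the target characters to translate into.

With no options and both sets given, tr replaces each character in SET1 with the character at the same position in SET2. Note that -t is not the default in GNU tr: when SET2 is shorter than SET1, GNU tr extends SET2 by repeating its last character, whereas -t truncates SET1 instead.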

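For example, to convert lowercase letters to uppercase:

cat testfile | tr a-z A-Z

or (quote the character classes so the shell does not treat the brackets as a glob pattern):

cat testfile | tr '[:lower:]' '[:upper:]'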
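The options described above can be sketched on small inline strings. This is only an illustration; the sample strings and variable names are made up for the example:

```shell
# -d: delete every character listed in SET1.
no_vowels=$(printf 'hello world\n' | tr -d 'aeiou')
printf '%s\n' "$no_vowels"      # hll wrld

# -s: squeeze each run of spaces into a single space.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')
printf '%s\n' "$squeezed"       # a b c

# -c with -d: delete everything that is NOT a lowercase letter or a newline.
letters_only=$(printf 'abc123def!\n' | tr -cd 'a-z\n')
printf '%s\n' "$letters_only"   # abcdef

# Plain translation with quoted character classes.
upper=$(printf 'sunny day\n' | tr '[:lower:]' '[:upper:]')
printf '%s\n' "$upper"          # SUNNY DAY
```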

See also: http://man.linuxde.net/tr

Some usage examples: https://blog.csdn.net/zhuying_linux/article/details/6825568

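2. uniq

The uniq command reports or omits repeated lines in a file. It only compares adjacent lines, which is why it is usually combined with sort.

uniq [OPTION]... [INPUT [OUTPUT]]

Options

-c or --count: prefix each line with the number of times it occurs;
-d or --repeated: print only lines that are repeated, once per group;
-f<fields> or --skip-fields=<fields>: skip the given number of leading fields when comparing;
-s<chars> or --skip-chars=<chars>: skip the given number of leading characters when comparing;
-u or --unique: print only lines that occur exactly once;
-w<chars> or --check-chars=<chars>: compare no more than the given number of characters per line.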

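Parameters

Input file: the file whose repeated lines are to be collapsed. If omitted, uniq reads from standard input;
Output file: the file to write the de-duplicated content to. If omitted, the result goes to standard output (the terminal).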
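A small sketch of the most common options, assuming a made-up list of words as input (the data and variable names are illustrative only). It also shows why the input is sorted first:

```shell
# uniq only compares neighbouring lines, so sort first.
sorted=$(printf 'is\nthe\nis\nday\nthe\nis\n' | sort)

# -c: count each group; awk normalizes the leading padding of the count column.
counts=$(printf '%s\n' "$sorted" | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$counts"
# 1 day
# 3 is
# 2 the

# -d: only the lines that occur more than once.
dups=$(printf '%s\n' "$sorted" | uniq -d)
printf '%s\n' "$dups"
# is
# the

# -u: only the lines that occur exactly once.
singles=$(printf '%s\n' "$sorted" | uniq -u)
printf '%s\n' "$singles"
# day

# Without sorting, non-adjacent duplicates are NOT merged.
not_merged=$(printf 'a\nb\na\n' | uniq)
printf '%s\n' "$not_merged"
# a
# b
# a
```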

Reference: http://man.linuxde.net/uniq

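3. NF

In awk, NF is the number of fields in the current line (record). NR is the number of records read so far, so inside an END block it equals the total number of lines (records) in the input.

$NF refers to the last field of the current record.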
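To make the difference between NF, NR, and $NF concrete, here is a minimal sketch on three made-up input lines (the sample text and variable names are assumptions for illustration):

```shell
# For every line, print the record number, the field count, and the last field.
report=$(printf 'the day is sunny\nthe sunny is is\nhello\n' |
  awk '{print NR, NF, $NF}')
printf '%s\n' "$report"
# 1 4 sunny
# 2 4 is
# 3 1 hello

# In an END block, NR holds the total number of records read.
total=$(printf 'the day is sunny\nthe sunny is is\nhello\n' |
  awk 'END {print NR}')
printf '%s\n' "$total"
# 3
```

Method 2 relies on the same idea: looping i from 1 to NF visits every word on the current line.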
