q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).
q treats ordinary files as database tables and supports all SQL constructs, such as WHERE, GROUP BY, and JOINs. It detects column names and column types automatically, and provides full support for multiple encodings.
q's web site is http://harelba.github.io/q/. It contains everything you need to download and use q in no time.
Download links for all OSs are available on the web site.
A beginner's tutorial can be found there as well.
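The examples below query a clicks.csv file that is not included with this text. As a rough way to try the same kind of query, the sketch below builds a tiny tab-separated file and runs a query in the style of Example 1 against it. The file name sample_clicks.tsv and its contents are made up for illustration; -H tells q that the first line holds the column names, and -t tells it that fields are tab-separated.

```shell
# Hypothetical sample data (not from the q project): tab-separated, with a header row.
# The uuid column holds three distinct values: a1, b2 and c3.
workdir=$(mktemp -d)
sample_file="$workdir/sample_clicks.tsv"
printf 'uuid\tscore\n' > "$sample_file"
printf 'a1\t0.9\nb2\t0.4\na1\t0.7\nc3\t0.8\n' >> "$sample_file"

# -H: the first line holds column names; -t: fields are tab-separated.
# The query runs in a subshell so the current directory is left unchanged.
if command -v q >/dev/null 2>&1; then
    (cd "$workdir" && q -H -t "select count(distinct(uuid)) from ./sample_clicks.tsv")
else
    echo "q is not installed; see the web site for download links" >&2
fi
```

Because of -H, the query can refer to the uuid column by name, exactly as the examples below do.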
Example 1:
q -H -t "select count(distinct(uuid)) from ./clicks.csv"
Output 1:
229
Example 2:
q -H -t "select request_id,score from ./clicks.csv where score > 0.7 order by score desc limit 5"
Output 2:
2cfab5ceca922a1a2179dc4687a3b26e 1.0
f6de737b5aa2c46a3db3208413a54d64 0.986665809568
766025d25479b95a224bd614141feee5 0.977105183282
2c09058a1b82c6dbcf9dc463e73eddd2 0.703255121794
Example 3:
q -t -H "select strftime('%H:%M',date_time) hour_and_minute,count(*) from ./clicks.csv group by hour_and_minute"
Output 3:
07:00 138148
07:01 140026
07:02 121826
Example 4:
q -t -H "select hashed_source_machine,count(*) from ./clicks.csv group by hashed_source_machine"
Output 4:
47d9087db433b9ba.domain.com 400000
Example 5 (total size per user/group in the /tmp subtree):
sudo find /tmp -ls | q "select c5,c6,sum(c7)/1024.0/1024 as total from - group by c5,c6 order by total desc"
Output 5:
mapred hadoop 304.00390625
root root 8.0431451797485
smith smith 4.34389972687
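Example 5 is run without -H, so the columns are referenced by position as c5, c6 and c7 (the owner, group and size fields of find -ls), and the table name - stands for standard input. A minimal sketch of the same pattern on a few made-up rows, so that it can be tried without scanning /tmp; the names and sizes are hypothetical:

```shell
# Hypothetical rows: owner, group, size in bytes (space-separated, no header line).
sample_rows='alice staff 1048576
bob staff 1048576
alice staff 2097152'

# Without -H, columns are addressed by position (c1, c2, c3, ...);
# "-" as the table name makes q read from standard input.
if command -v q >/dev/null 2>&1; then
    printf '%s\n' "$sample_rows" |
        q "select c1,c2,sum(c3)/1024.0/1024 as total from - group by c1,c2 order by total desc"
else
    echo "q is not installed; skipping the query" >&2
fi
```

Dividing by 1024.0 rather than 1024 keeps the division in floating point, so the totals are reported as fractional megabytes, as in Output 5.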
Example 6 (top 3 user ids with the largest number of owned processes, sorted in descending order):
Note the usage of the autodetected column name UID in the query.
ps -ef | q -H "select UID,count(*) cnt from - group by UID order by cnt desc limit 3"
Output 6:
root 152
harel 119
avahi 2
Any feedback/suggestions/complaints regarding this tool would be much appreciated. Contributions are most welcome as well, of course.
Harel Ben-Attia, harelba@gmail.com, @harelba on Twitter
q on Twitter: #qtextasdata