Using the Unix Shell

Using the Unix Shell: There is No ‘Undelete’

The Unix Shell
“A Unix shell is a command-line interpreter or shell that provides a traditional user interface for the Unix operating system and for Unix-like systems. Users direct the operation of the computer by entering commands as text for a command line interpreter to execute or by creating text scripts of one or more such commands.” - Wikipedia

Things to Keep in Mind
• There is no ‘undelete’
• Shell commands are case-sensitive (CaPitaLizaTIoN mAttErs)
• Do NOT use space, ?, *, \, / or $ in file names because these have special meanings to the shell
• Filenames that begin with . are ‘hidden’
• There is no ‘undelete’
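A minimal sketch of the ‘hidden’ file rule, run in a scratch directory so nothing real is touched. The file names are made up for illustration.

```shell
# Work in a throwaway directory created by mktemp
workdir=$(mktemp -d)
cd "$workdir"

# One ordinary file and one whose name starts with a dot
touch visible.txt .hidden.txt

visible_count=$(ls | wc -l)    # plain ls skips names that begin with .
all_count=$(ls -A | wc -l)     # -A includes them (but not . and ..)

echo "ls shows $visible_count file(s); ls -A shows $all_count file(s)"
```

The same directory looks different depending on the option, which is why a ‘missing’ configuration file is often just hidden.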

The Importance of Being ‘Root’
• ‘Root’ or ‘Superuser’ is the administrator account, which has phenomenal cosmic power.
• The ‘sudo’ command allows you to “do as superuser” from an account with ‘sudo privileges’.
• As root in the shell, you can literally ‘delete’ the operating system or operating system files (like choosing to delete Microsoft Windows while using Windows)… and then watch the stars go out…
• Moral of the story: If you don’t know what a file is… it’s better to ask or leave it alone.
• Installing software can require the use of ‘sudo’

Unix Tutorial
• http://www.ee.surrey.ac.uk/Teaching/Unix/
• Science.txt file location for tutorial:
• http://www.ee.surrey.ac.uk/Teaching/Unix/science.txt
• Unix command:
• wget http://www.ee.surrey.ac.uk/Teaching/Unix/science.txt
Additional help/tutorial/walkthrough
• http://software-carpentry.org/4_0/shell/

Grep
• grep science science.txt
• grep science science.txt > newfile1.txt
• grep -B 1 -A 2 science science.txt > newfile1.txt
• Use man grep to learn more about grep
• -B 1 and -A 2 are command line ‘options’ that change the behavior of the ‘grep’ program, with numerical parameters that specify the new behavior (here, also print 1 line before and 2 lines after each match).
• > is a ‘redirect’ symbol that sends output which would normally go to the screen to a text file instead.
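To see the options and the redirect at work without downloading anything, here is a small sketch that uses a made-up six-line file in place of science.txt. The names sample.txt and context.txt are placeholders. -B and -A are supported by GNU and BSD grep, though they are not part of POSIX.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny stand-in for science.txt (contents are invented)
printf '%s\n' 'line one' 'line two' 'all about science' \
              'line four' 'line five' 'line six' > sample.txt

# Matching line only, printed to the screen
grep science sample.txt

# Match plus 1 line before and 2 lines after, redirected to a file
grep -B 1 -A 2 science sample.txt > context.txt
cat context.txt
```

Here context.txt ends up with four lines: ‘line two’ through ‘line five’.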

Permissions
• Type ls -l (note: those are both lower-case L characters)
• -rw-r--r-- 1 krmerrill staff 358400 Feb 2 13:00 AJB_Merrill-d1100085_au.doc
• drwxr-xr-x 47 krmerrill staff 1598 Jul 17 2011 My Pictures
• First character: - means regular file, d means directory, l (lower-case L) means link
• The first triplet is the user read, write, and execute permissions
• The second triplet is the group permissions
• The last triplet is permissions for everyone else, or ‘other’
• ls -al shows the above information for all files, including hidden files
• chmod = change permissions
• u = user; g = group; o = other; a = all (user, group, and other)
• r = read; w = write; x = execute
• chmod u+x filename adds user execute permission on filename
• chmod g-wx filename removes group write and execute permissions from filename
• Permissions that are not mentioned in a chmod command of this form are not affected
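A short sketch of the two chmod forms above on a scratch file. The file name script.sh and the starting permissions are assumptions chosen so the result is predictable. The first chmod uses the = form, which sets a triplet exactly rather than adding or removing.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch script.sh

# Start from a known state: rw-rw-r--
chmod u=rw,g=rw,o=r script.sh

chmod u+x script.sh     # add execute for the user only
chmod g-wx script.sh    # remove write and execute from the group

# The first 10 characters of ls -l are the type and the three triplets
perms=$(ls -l script.sh | cut -c1-10)
echo "$perms"
```

The ‘other’ triplet was never mentioned in the u+x or g-wx commands, so it stays r-- throughout.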

Useful Shell Commands
• See the Linux Command Line Reference document on the course website
• Directory commands
• Change to sub-directory within the current directory: cd xyz
• Change to a directory in another part of the directory tree: cd /path/to/directory
• Create directory: mkdir newdir
• Remove empty directory: rmdir xyz
• Wildcard characters: ? matches any single character, * matches zero or more characters
• Example: rm *.txt will remove all files with a name ending in .txt
• rm file?.fastq will remove file1.fastq, file2.fastq, … , filex.fastq
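Because there is no ‘undelete’, it can help to preview what a wildcard matches before giving it to rm. The sketch below does that in a scratch directory with invented file names.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch file1.fastq file2.fastq file10.fastq notes.txt readme.txt

# echo shows what the shell expands each pattern to, without deleting anything
echo file?.fastq    # ? is exactly one character, so file10.fastq is not matched
echo *.txt

# Only now remove the files matched by the first pattern
rm file?.fastq

remaining=$(ls)
echo "$remaining"
```

After the rm, file10.fastq and the two .txt files are still there.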

Regular Expressions
• See the RegularExpressions.pdf document on the course website for an overview of literal characters and metacharacters
• Regular expressions are useful within grep, awk, sed and other command-line tools as well as in Java, Perl, Python, and other scripting languages.
• Some text editor programs in Linux also use regular expressions (also called regexps or regex). We will use nedit as an example.
• Replacing a space character with a new-line character in a file of barcodes
– find ‘(OWB\d+) ’ and replace with ‘\1\n’
– note the trailing space in the first expression.
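The same find-and-replace can be sketched on the command line with sed instead of nedit. This is an illustration under assumptions: the barcode names are invented, sed -E has no \d so [0-9] is used in its place, and \n in the replacement relies on GNU sed.

```shell
# Hypothetical barcodes, each followed by a space, all on one line
barcodes='OWB001 OWB002 OWB003 '

# (OWB[0-9]+) captures the barcode; the trailing space is matched but not captured;
# \1 puts the barcode back and \n replaces the space with a new-line
result=$(printf '%s' "$barcodes" | sed -E 's/(OWB[0-9]+) /\1\n/g')

echo "$result"
```

Each barcode now sits on its own line.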

Command-line example
• Testing analyses on a small random sample of a sequence dataset is a good idea: you find and fix problems quickly
• How to randomly sample the same reads from a set of paired-end files?
• A one-line command is saved on the course website to do this.
• time paste Bigfile1.fastq Bigfile2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -2000000 | sed 's/\t\t/\n/g' | awk '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
• Let’s look at this step by step

Command-line example
• time
– this tells the system to display the time required to execute the command
• paste Bigfile1.fastq Bigfile2.fastq |
– this joins two files of paired-end sequence reads as tab-delimited columns, line by line. The files should have the same number of lines, with reads in the same order in both files
• awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |
– this uses the ‘awk’ program to convert the four lines of FASTQ format to tab-separated fields on a single line per sequence record
• shuf |
– this utility shuffles the lines of its input into a random order
• head -2000000 |
– this utility takes the first 2 million lines of the re-ordered file (one read pair per line at this point)
• sed 's/\t\t/\n/g' |
– this uses the ‘sed’ stream editor to convert the double-tab delimiters back into new-line characters to restore the 4-line FASTQ format
• awk '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
– this uses ‘awk’ to split the two tab-delimited columns back into two separate files
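The steps can be tried on a toy dataset before running them on millions of reads. The sketch below is not the course command verbatim: the reads are invented, head -n 2 stands in for head -2000000, and the last awk adds -F'\t' so that only tabs separate the two columns (with the default splitting, a header line that contains a space would be cut in the wrong place). It assumes GNU sed and shuf.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny paired FASTQ files, three made-up records each
for i in 1 2 3; do
    printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> Bigfile1.fastq
    printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> Bigfile2.fastq
done

# Same pipeline shape as above, sampling 2 read pairs instead of 2 million
paste Bigfile1.fastq Bigfile2.fastq |
  awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |
  shuf |
  head -n 2 |
  sed 's/\t\t/\n/g' |
  awk -F'\t' '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'

# Header lines of the two subsamples should name the same reads, in the same order
awk 'NR % 4 == 1' Subfile1.fastq
awk 'NR % 4 == 1' Subfile2.fastq
```

Which two reads are chosen changes from run to run, but both output files always contain the same pairs.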

How do you come up with this stuff?

Someone else has probably had this problem

Search for help on SeqAnswers or StackExchange
• http://biostar.stackexchange.com/
• The Bioinformatics Forum on SeqAnswers: http://seqanswers.com/forums/forumdisplay.php?f=18

SolexaQA.pl
• This Perl script assumes that header lines of sequence files are written in one of several formats
• The code uses regular expressions to sort out formats:
    if( $line =~ /\S+\s\S+/ ){
        # Cassava 1.8 variant
        if( $line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w]+:[\d\w]+:(\d+)/ ){
            $number_of_tiles = $1 + 1;
        # Sequence Read Archive variant
        }elsif( $line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ ){
            $number_of_tiles = $1 + 1;
        }
    # All other variants
    }elsif( $line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ ){
        $number_of_tiles = $1 + 1;
    }

Alternate Formats
• This Perl script assumes that header lines of sequence files are written in one of several formats
• The code uses regular expressions to sort out formats:
    if( $line =~ /\S+\s\S+/ ){
    # Cassava 1.8 variant: does the header line contain a space surrounded by non-space characters?
    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
    $line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w]+:[\d\w]+:(\d+)/ )
    # NCBI SRA variant: does the header line contain a string with -, _, or . before the first colon?
    @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
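The first test, /\S+\s\S+/, can be tried from the shell with grep. This is a sketch, not SolexaQA code: has_space_field is a hypothetical helper, the POSIX classes [[:space:]] and [^[:space:]] stand in for Perl’s \s and \S, and the Casava header is written with the space that the format uses between its two parts.

```shell
# Succeeds if the line has a whitespace character with
# non-whitespace on both sides (the /\S+\s\S+/ test)
has_space_field() {
    printf '%s\n' "$1" | grep -Eq '[^[:space:]]+[[:space:]][^[:space:]]+'
}

casava='@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'
sra='@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36'
gsl='@3:1:1006:20321:Y'

for header in "$casava" "$sra" "$gsl"; do
    if has_space_field "$header"; then
        echo "space-separated parts: $header"
    else
        echo "no space:              $header"
    fi
done
```

The Casava and SRA headers both pass this first test, which is why the script needs a second, more specific expression to tell them apart.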

SolexaQA.pl
    $line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ )
    # Two other variants
• does first field contain -, ., or _ followed by two more colon-delimited fields?
    $line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ )
• does first field contain -, ., :, or _ followed by four colon-delimited fields, followed by ., /, or # at the end of the line?
Example header line from GSL sequence file:
    @3:1:1006:20321:Y
This would be described by $line =~ /^@\d+:\d+:\d+:\d+:[YN]/
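The GSL expression can be checked against the example header with grep -E. A minimal sketch: matches_gsl is a hypothetical helper, and [0-9] replaces \d because grep -E does not support the Perl shorthand.

```shell
# Succeeds if the header starts with @ followed by four colon-delimited
# numeric fields and then a Y or N
matches_gsl() {
    printf '%s\n' "$1" | grep -Eq '^@[0-9]+:[0-9]+:[0-9]+:[0-9]+:[YN]'
}

gsl='@3:1:1006:20321:Y'
sra='@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36'

if matches_gsl "$gsl"; then gsl_match=yes; else gsl_match=no; fi
if matches_gsl "$sra"; then sra_match=yes; else sra_match=no; fi

echo "GSL header matches: $gsl_match"
echo "SRA header matches: $sra_match"
```

The SRA header fails at the very first field, because SRR001666.1 is not purely numeric.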
